Project Description

NTextCat is a text classification utility (tool and API).
Its primary purpose is language identification: it helps you recognize the language of a given text snippet.

 
Sponsor features you wish to have. ONLINE DEMO

Languages available out of the box:

  • Wikipedia-based:
    • Default: 280+ languages (and flavors) encoded in UTF-8, UTF-16(LE), UTF-32(LE). Additionally, the 83 most popular languages encoded in their respective "legacy" encodings (e.g. 1252, Big5, etc.). Directory: Wikipedia-MostCommon-Legacy__All-Utf8
    • 280+ languages (and flavors) encoded in UTF-8 only. Directory: Wikipedia-All-Utf8
    • 83 most popular languages encoded in UTF-8, UTF-16(LE), UTF-32(LE) and their respective "legacy" encodings (e.g. 1252, Big5, etc.). Directory: Wikipedia-MostCommon-LegacyAndUtf8
    • 83 most popular languages encoded in UTF-8 only. Directory: Wikipedia-MostCommon-Utf8
  • 74 languages from the original TextCat tool (full list). Directory: TextCat\LM

Recommended input: a snippet of text with at least 5 words (though it works reasonably well with just a couple of words). Try it out yourself.

For technical help and suggestions please create a discussion.
To file a bug please create an issue.
For support, consultancy and custom feature implementation, please contact me directly through Codeplex
or by email: ivan.akcheurov (the name is copy-pasteable)

Features:

  1. .Net Framework 4.0 support
  2. Compatible with Mono 2.10.x. Mono 2.10.5 ships with Ubuntu 11.10, and Mono 2.10.8.1 with Ubuntu 12.04 and 12.10 (reviews from Linux experts are highly appreciated). More info on Mono support.
  3. Pure .Net application (C#)
  4. A command line interface is available (built with Mono.Options).
  5. SQL Server 2012 integration via user-defined functions.

Roadmap

(if you need particular items, please vote for them or discuss them):
  1. Driver Mode (stdout-CLI for other applications)? -- command line interface for communication with other programming environments via stdout (for Java, C/C++, Python, etc.)
  2. Web Service API? (JSON/XML based, HTTP based).
Low Priority:
  1. Silverlight Support?
  2. Driver mode client (description above) for Java and C/C++ (help needed).
  3. Web service client for Java (help needed).

Interfaces available:

  • A console application that can train (create language models) and identify the language of a text snippet.
    How to identify language from command line
  • A library that you can reference from your application to give it language identification capabilities.
    How to identify language using managed API
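
To illustrate what "training (creating language models)" and "recognizing" a language can mean in practice, here is a minimal Python sketch of the rank-order character n-gram approach popularized by the original TextCat tool. It is an illustration only, not NTextCat's code or API: the function names (ngram_profile, out_of_place, identify), the parameters (max_n, top_k) and the two tiny training samples are all assumptions made for this example.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Build a 'language model': the most frequent character n-grams, by rank."""
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # most_common() orders n-grams by descending frequency
    return [gram for gram, _ in counts.most_common(top_k)]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams unknown to the model get a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else penalty
        for i, gram in enumerate(doc_profile)
    )


def identify(text, models):
    """Return language codes ordered from best to worst match."""
    doc = ngram_profile(text)
    return sorted(models, key=lambda lang: out_of_place(doc, models[lang]))


# Hypothetical toy training data; real models need far more text.
samples = {
    "eng": "the quick brown fox jumps over the lazy dog and then "
           "the dog sleeps in the sun while the fox runs away",
    "deu": "der schnelle braune fuchs springt über den faulen hund und dann "
           "schläft der hund in der sonne während der fuchs wegläuft",
}
models = {lang: ngram_profile(text) for lang, text in samples.items()}

ranking = identify("the dog and the fox", models)
print(ranking[0])
```

Real language models are trained on far larger corpora (such as the Wikipedia-based ones listed above); with toy samples like these the ranking is only indicative, which is also why longer input snippets tend to be identified more reliably.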

Last edited Jun 7, 2013 at 9:00 PM by IvanAkcheurov, version 76