NTextCat is a text classification utility (tool and API).
Its primary target is language identification: it helps you recognize (identify) the language of a text (or binary) snippet.
NTextCat is inspired by TextCat, the famous Perl utility for language identification.
Languages available out of the box:
- Default: 280+ languages (and flavors) encoded in UTF-8, UTF-16(LE), UTF-32(LE). Additionally, the 83 most popular languages encoded in their respective "legacy" encodings (e.g. 1252, Big5, etc.). Directory: Wikipedia-MostCommon-Legacy__All-Utf8
- 280+ languages (and flavors) encoded in UTF-8 only. Directory: Wikipedia-All-Utf8
- 83 most popular languages encoded in UTF-8, UTF-16(LE), UTF-32(LE) and their respective "legacy" encodings (e.g. 1252, Big5, etc.). Directory: Wikipedia-MostCommon-LegacyAndUtf8
- 83 most popular languages encoded in UTF-8 only. Directory: Wikipedia-MostCommon-Utf8
- 74 languages from the original TextCat tool. Directory: TextCat\LM
Recommended input: a snippet of text with more than 50 words.
For technical help and suggestions, please create a discussion.
To file a bug, please create an issue.
For support, consultancy, and custom feature implementation, please contact me directly through Codeplex or through email: ivan.akcheurov
Features:
- .Net Framework 4.0 support
- .Net Framework 3.5 Client Profile support (compatible with Mono 2.6.7). Mono 2.6.7 is also shipped with Ubuntu 10.10 and 11.04 (a review by a Linux expert would be highly appreciated).
- Pure .Net application (C#)
- The tool can use language models produced by the original TextCat tool.
- Command line application with the same interface as the original TextCat tool (same switches, same default values). Mono.Options is used for option parsing.
- SQL Server 2008/2012 integration via user-defined functions.
- "Driver Mode" -- command line interface for communication with other programming environments via stdout (for Java, C/C++, Python, etc.)
- Web Service (SOAP, REST)
Planned:
- Increase precision on short snippets (currently recommended input length is more than 50 words).
- Silverlight support.
- Driver mode client (description above) for Java and C/C++ (help needed).
- Web service client for Java (help needed)
Current implementation status:
How to identify a language using the command line interface
- Console application capable of training (creating language models) and of classifying a new snippet of text into one or more classes of known languages.
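To make "training" and "classifying" more concrete, here is a minimal Python sketch of the n-gram ranking idea (Cavnar and Trenkle's "out-of-place" measure) on which the original TextCat is based. It is not NTextCat's code or API: the function names (`ngram_profile`, `out_of_place`, `train`, `classify`), the parameter values (`max_n=5`, `top_k=300`), and the tiny training samples are all assumptions made for illustration. Real language models are built from much larger corpora.

```python
from collections import Counter


def ngram_profile(text, max_n=5, top_k=300):
    """Return the top_k most frequent character n-grams (n = 1..max_n),
    ordered from most to least frequent. Hypothetical helper."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # underscores mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count; break ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences between two profiles. An n-gram missing
    from the language profile costs the maximum penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


def train(samples):
    """'Training' here means building one ranked profile per language."""
    return {lang: ngram_profile(text) for lang, text in samples.items()}


def classify(text, models):
    """Return (distance, language) pairs, best match first."""
    doc = ngram_profile(text)
    return sorted((out_of_place(doc, model), lang) for lang, model in models.items())


# Illustrative samples only; far too small for real use.
samples = {
    "english": "the quick brown fox jumps over the lazy dog and then the dog sleeps",
    "german": "der schnelle braune fuchs springt über den faulen hund und der hund schläft",
}
models = train(samples)
print(classify("the dog and the fox", models))
```

The lowest distance wins. Short inputs produce sparse profiles whose ranks are noisy, which is one way to understand the recommendation of more than 50 words of input.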
How to identify a language using the managed API
- Library that you can reference from your application to give it language identification capabilities.