NTextCat is a text classification utility (tool and API).
Its primary purpose is language identification: it helps you recognize the language of a given text snippet.
Languages available out of the box:
- Default: 280+ languages (and flavors) encoded in UTF-8, UTF-16(LE), UTF-32(LE). Additionally, the 83 most popular languages encoded in their respective "legacy" encodings (e.g. 1252, Big5, etc.). Directory: Wikipedia-MostCommon-Legacy__All-Utf8
- 280+ languages (and flavors) encoded in UTF-8 only. Directory: Wikipedia-All-Utf8
- 83 most popular languages encoded in UTF-8, UTF-16(LE), UTF-32(LE) and their respective "legacy" encodings (e.g. 1252, Big5, etc.). Directory: Wikipedia-MostCommon-LegacyAndUtf8
- 83 most popular languages encoded in UTF-8 only. Directory: Wikipedia-MostCommon-Utf8
- 74 languages from original TextCat tool (full list). Directory: TextCat\LM
Recommended input: a snippet of text with at least 5 words (though it works quite well with just a couple of words). Try it out yourself.
For technical help and suggestions, please create a discussion.
To file a bug, please create an issue.
For support, consultancy, and custom feature implementation, please contact me directly through Codeplex or through email: ivan.akcheurov
Features:
- .NET Framework 4.0 support
- Compatible with Mono 2.10.x. Mono 2.10.5 is shipped with Ubuntu 11.10 and Mono 2.10.8.1 with Ubuntu 12.04 and 12.10 (reviews from Linux experts are highly appreciated). More info on Mono support.
- Pure .NET application (C#)
- Command line interface is available. Mono.Options was used.
- SQL Server 2012 integration via user-defined functions.
Planned features (if you need particular items, please vote for them or discuss them):
- Driver Mode (stdout-CLI for other applications)? -- command line interface for communication with other programming environments via stdout (for Java, C/C++, Python, etc.)
- Web Service API? (JSON/XML based, HTTP based).
- Silverlight Support?
- Driver mode client (description above) for Java and C/C++ (help needed).
- Web service client for Java (help needed).
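The "driver mode" item above describes another program talking to a command-line tool through its standard streams. A minimal sketch of what the calling side could look like is shown here in Python. Everything specific in it is an assumption: the list marks this mode as a possible feature, so no real command, flag, or output format is known. The `FAKE_TOOL` stand-in and the tab-separated "code, score" line format are hypothetical and exist only to make the example runnable.

```python
import subprocess
import sys


def run_driver(command, text, timeout=10.0):
    """Send `text` to an external tool on stdin and return its non-empty stdout lines."""
    completed = subprocess.run(
        command,
        input=text,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return [line.strip() for line in completed.stdout.splitlines() if line.strip()]


def parse_result(line):
    """Parse one line in the hypothetical '<language code><TAB><score>' format."""
    code, score = line.split("\t")
    return code, float(score)


# Stand-in for a real language identification executable (hypothetical):
# a Python one-liner that consumes stdin and prints a fixed, made-up answer.
FAKE_TOOL = [
    sys.executable,
    "-c",
    "import sys; sys.stdin.read(); print('eng\\t0.9')",
]

if __name__ == "__main__":
    for line in run_driver(FAKE_TOOL, "hello world"):
        print(parse_result(line))
```

Keeping the exchange to plain text over stdin and stdout is what would let Java, C/C++, or Python callers share one tool without a language-specific binding.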
How to identify language from command line
- A console application which is capable of training (creating language models) and recognizing which language a snippet of text belongs to.
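The two tasks named above, training (creating language models) and recognizing the language of a snippet, can be illustrated with a small sketch. It is written in Python rather than C# and it is not NTextCat's code: it assumes the character n-gram rank-order approach (Cavnar and Trenkle's "out-of-place" measure) that the original TextCat tool is based on. The function names, the `max_n` and `top_k` parameters, and the toy training sentences are made up for illustration.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Return the character n-grams of `text`, most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # underscores mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams unknown to the model get the maximum penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def train(samples):
    """Build one n-gram profile (a "language model") per language label."""
    return {lang: ngram_profile(text) for lang, text in samples.items()}


def identify(text, models):
    """Return language labels ordered from best match to worst."""
    doc = ngram_profile(text)
    return sorted(models, key=lambda lang: out_of_place(doc, models[lang]))


# Toy training data (hypothetical); real models are trained on far larger corpora.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat "
          "while the children were playing in the garden with their friends",
    "de": "der kleine braune fuchs springt auf den faulen hund und die katze sitzt "
          "auf der matte und die kinder spielen im garten mit ihren freunden",
}

if __name__ == "__main__":
    models = train(SAMPLES)
    print(identify("the children and the dog", models))
```

Very short inputs produce few n-grams, so their rank order is noisy; this is in line with the recommendation of at least 5 words given earlier.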
How to identify language using managed API
- A library that you can reference from your application to empower it with language identification capabilities.