Project Description
NTextCat is a text classification utility (tool and API).
Its primary purpose is language identification: it helps you recognize the language of a text (or binary) snippet.
NTextCat is inspired by TextCat, a well-known Perl utility for language identification.

ONLINE DEMO

Languages available out of the box:
  • Wikipedia-based:
    • Default: 280+ languages (and flavors) encoded in UTF-8, UTF-16(LE), UTF-32(LE). Additionally, the 83 most popular languages encoded in their respective "legacy" encodings (e.g. 1252, Big5, etc.). Directory: Wikipedia-MostCommon-Legacy__All-Utf8
    • 280+ languages (and flavors) encoded in UTF-8 only. Directory: Wikipedia-All-Utf8
    • 83 most popular languages encoded in UTF-8, UTF-16(LE), UTF-32(LE) and their respective "legacy" encodings (e.g. 1252, Big5, etc.). Directory: Wikipedia-MostCommon-LegacyAndUtf8
    • 83 most popular languages encoded in UTF-8 only. Directory: Wikipedia-MostCommon-Utf8
  • 74 languages from the original TextCat tool (full list). Directory: TextCat\LM
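
The per-encoding model directories above make more sense once you see that the same text becomes a different byte sequence in each encoding. The short Python illustration below is not part of NTextCat; the sample word and the choice of cp1251 as the "legacy" encoding are arbitrary assumptions made for the example.

```python
# Illustration only (not NTextCat code): one string, four encodings, four byte sequences.
# A model built from UTF-8 bytes would therefore not match the same text in a legacy encoding.
text = "Привет"  # arbitrary sample word (Russian "hello")

encodings = ["utf-8", "utf-16-le", "utf-32-le", "cp1251"]
encoded = {name: text.encode(name) for name in encodings}

for name, data in encoded.items():
    # Show the byte length and a hex dump of each encoded form.
    print(f"{name:10} {len(data):2d} bytes  {data.hex(' ')}")
```

If the encoding of your input is known in advance, a UTF-8-only directory is presumably the simpler choice after converting the input to UTF-8.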

Recommended input: a snippet of text with more than 50 words.

For technical help and suggestions please create a discussion.
To file a bug please create an issue.
For support, consultancy and custom feature implementation, please contact me directly through Codeplex or by email: ivan.akcheurov (the name is copy-pasteable)

 
Sponsor features you wish to have.

Features:
  1. .NET Framework 4.0 support
  2. .NET Framework 3.5 Client Profile support (compatible with Mono 2.6.7). Mono 2.6.7 also ships with Ubuntu 10.10 and 11.04 (a review by a Linux expert would be highly appreciated).
  3. Pure .NET application (C#)
  4. The tool can use language models produced by the original TextCat tool.
  5. Command line application that mirrors the original TextCat tool (same switches, same default values). Mono.Options was used.
  6. SQL Server 2008/2012 integration via user-defined functions.
Roadmap:
  1. "Driver Mode" -- command line interface for communication with other programming environments via stdout (for Java, C/C++, Python, etc.)
  2. Web Service (SOAP, REST)
  3. Increase precision on short snippets (currently recommended input length is more than 50 words).
Low Priority:
  1. Silverlight support.
  2. Driver mode client (description above) for Java and C/C++ (help needed).
  3. Web service client for Java (help needed)
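
To make the "Driver Mode" idea concrete, here is a minimal Python sketch of how another environment might talk to a command line tool over stdout. Driver mode is a roadmap item, so no protocol is defined yet: passing the text on stdin, the one-answer-per-line output, and the name `classify_via_stdout` are all assumptions of this sketch. A stand-in child process replaces the real tool so the example runs on its own.

```python
import subprocess
import sys

def classify_via_stdout(command, text, timeout=30):
    """Run `command`, send `text` on stdin as UTF-8, and return its stdout lines.

    Hypothetical helper: the real driver-mode protocol is not specified.
    """
    result = subprocess.run(
        command,
        input=text.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.decode("utf-8").splitlines()

# Stand-in for the real tool: consumes stdin and prints a fixed, made-up answer.
fake_driver = [sys.executable, "-c", "import sys; sys.stdin.buffer.read(); print('eng')"]

print(classify_via_stdout(fake_driver, "The quick brown fox"))
```

A real client would replace `fake_driver` with the actual executable and its switches once driver mode exists.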

Current implementation status:
  • Console application capable of training (creating language models) and classifying a new snippet of text into one or more classes of known languages.
    How to identify language using command line interface
  • Library that you can reference from your application to give it language identification capabilities.
    How to identify language using managed API
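
For readers new to the approach, the training and classification steps mentioned above can be sketched with the n-gram rank-order method of Cavnar and Trenkle, on which the original TextCat is based. This is an illustrative Python sketch, not NTextCat's API, and NTextCat's own implementation may differ. The function names, the profile size of 300, the maximum n-gram length of 5, and the toy training sentences are all assumptions of this example.

```python
from collections import Counter

def ngram_profile(text, max_n=5, top=300):
    """Return the `top` most frequent character n-grams (lengths 1..max_n), most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]

def out_of_place(doc_profile, model_profile):
    """Sum of rank differences; n-grams missing from the model get a fixed maximum penalty."""
    model_rank = {gram: rank for rank, gram in enumerate(model_profile)}
    max_penalty = len(model_profile)
    return sum(
        abs(rank - model_rank[gram]) if gram in model_rank else max_penalty
        for rank, gram in enumerate(doc_profile)
    )

def classify(text, models):
    """Return language names ordered from best match to worst."""
    doc = ngram_profile(text)
    return sorted(models, key=lambda lang: out_of_place(doc, models[lang]))

# "Training": build one profile (language model) per language from toy samples.
training = {
    "english": "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun",
    "german": "der schnelle braune fuchs springt über den faulen hund und dann schläft der hund in der sonne",
}
models = {lang: ngram_profile(sample) for lang, sample in training.items()}

# Classification: rank the known languages for a new snippet.
print(classify("the dog and the fox", models))
```

With training samples this small the ranking is fragile, which is consistent with the recommendation to classify snippets of more than 50 words.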

Last edited Oct 13, 2012 at 5:17 PM by IvanAkcheurov, version 66