Wrong results in ClassifyText? (newbie question)

Oct 18, 2012 at 3:34 PM

Hi,

I'm probably doing something wrong because my scenario is very simple.

I'm trying to detect the language of "i love my life", and instead of getting "en" I get "rn", "rw", "az", "kj", "ch".

I've created a new project (console, .NET 4) and added the following references:

IvanAkcheurov.Commons
IvanAkcheurov.NClassify
IvanAkcheurov.NTextCat.Lib
IvanAkcheurov.NTextCat.Lib.Legacy

My code is:

// Load the Wikipedia language models and rank candidate languages for the snippet.
LanguageIdentifier languageIdentifier = new LanguageIdentifier(@"LanguageModels\Wikipedia-All-Utf8", 50);
LanguageIdentifier.LanguageIdentifierSettings settings = new LanguageIdentifier.LanguageIdentifierSettings(50, 0, long.MaxValue, 1.05, 5);
IEnumerable<Tuple<string, double>> languages1 = languageIdentifier.ClassifyText("i love my life", settings);

So, what am I doing wrong? :)

Thanks,

Omri

Coordinator
Oct 20, 2012 at 10:22 AM

Hi Omri,

Your code is OK!

There are two things:

  1. The recommended length of a text snippet is at least 50 words (and you have only 4); see the sketch below for an example of a longer snippet.
  2. I'm going to make a release this weekend (or next week at the latest) that recognizes the language of your text correctly, even with only 4 words. However, for reliable identification it is generally suggested to supply at least 10 words.
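For example, here is a sketch reusing the variables from your snippet (the longer text is just arbitrary English prose I made up):

// Same call as in your code, but with enough words for a reliable result.
string longerText =
    "I love my life and I spend most of my days reading, writing and " +
    "walking around the city. In the evenings I usually cook dinner, " +
    "call my friends and plan the next trip we are going to take together.";
IEnumerable<Tuple<string, double>> languages2 =
    languageIdentifier.ClassifyText(longerText, settings);
// The top-ranked tuple should now be English ("en").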

The main problem with the current routines is that they operate at the level of bytes (i.e., on the encoded text). The new ones operate at the level of characters (so there will be only ClassifyText methods and no ClassifyBytes), and they are much more accurate.
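To make the byte-versus-character distinction concrete, here is a small standalone sketch (plain .NET, not the NTextCat API): as soon as the text contains a non-ASCII character, UTF-8 byte trigrams diverge from character trigrams, because one character can span several bytes.

using System;
using System.Linq;
using System.Text;

class NGramDemo
{
    static void Main()
    {
        string text = "héllo";
        byte[] bytes = Encoding.UTF8.GetBytes(text); // 'é' encodes as two bytes in UTF-8

        // Character trigrams: "hél", "éll", "llo"
        var charTrigrams = Enumerable.Range(0, text.Length - 2)
                                     .Select(i => text.Substring(i, 3));

        // Byte trigrams: four of them, and one even starts in the middle of 'é'
        var byteTrigrams = Enumerable.Range(0, bytes.Length - 2)
                                     .Select(i => BitConverter.ToString(bytes, i, 3));

        Console.WriteLine(string.Join(", ", charTrigrams));
        Console.WriteLine(string.Join(", ", byteTrigrams));
    }
}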

Another issue is that 4 words is still very little. Mostly it is an issue of the purity of the training data, which I'm going to tackle quite soon as well.

If you use NTextCat in a commercial environment, I can see what I can do specifically for your situation (e.g., introduce dictionary-based identification algorithms, retrain the classifiers on your data and evaluate them, etc.). If so, please contact me: http://www.codeplex.com/site/users/contact/IvanAkcheurov?OriginalUrl=http%3a%2f%2fwww.codeplex.com%2fsite%2fusers%2fview%2fIvanAkcheurov

If not, then for now my advice is to supply longer text snippets and to use the upcoming release. Reducing the number of languages to choose from is also an option.

Best Regards,
Ivan

Marked as answer by IvanAkcheurov on 1/7/2014 at 3:46 PM
Coordinator
Oct 24, 2012 at 4:32 PM

Hi Omri,

Fortunately you've revealed your list of languages, and it happens to be quite short: English, Hebrew, German and French.
I have tested your snippet "i love my life" and one of my own, "you got me", and the identifier reliably says English when choosing among these 4 languages.
The main problem with Wikipedia is noise: many non-English Wikipedia articles contain English text and phrases. That doesn't matter when identifying whole documents, but it becomes a problem for very short snippets when identifying among 100+ languages. Therefore I would advise removing unnecessary languages.
Solution:
  1. Please wait for the new release (I think it is going to be v0.2.0, out this weekend).
  2. Keep only 5 profiles: "fr", "en", "he", "de", "simple"
  3. Use NaiveBayesLanguageIdentifier.
  4. Treat "simple" and "en" as English (colloquial and formal respectively).
You may want to try this solution with the existing release v0.1.7 (just delete all language models but the 5 mentioned above); a sketch of that interim workaround follows. However, identifiers from the future release generally show substantially better results.
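A minimal sketch of the v0.1.7 workaround, assuming a models folder that contains only the five profiles (the folder name is illustrative; the constructor and settings arguments are copied from your snippet):

// v0.1.7 workaround: point at a models folder containing ONLY the profiles
// "fr", "en", "he", "de" and "simple". The folder name below is illustrative.
LanguageIdentifier identifier = new LanguageIdentifier(@"LanguageModels\Wikipedia-5-Utf8", 50);
var settings = new LanguageIdentifier.LanguageIdentifierSettings(50, 0, long.MaxValue, 1.05, 5);
foreach (Tuple<string, double> candidate in identifier.ClassifyText("i love my life", settings))
{
    // Treat the Simple English Wikipedia profile ("simple") as English.
    string language = candidate.Item1 == "simple" ? "en" : candidate.Item1;
    Console.WriteLine(language + ": " + candidate.Item2);
}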
Please post your feedback when you have a chance to try out the solution with the future release v0.2.0.
If you have further questions, please post them here.
Thanks for using NTextCat and good luck with your project!
Best Regards,
Ivan
Marked as answer by IvanAkcheurov on 1/7/2014 at 3:46 PM
Oct 30, 2012 at 9:55 AM

Thanks!

I'll try it.

Omri