Language Detection Time Performance

Sep 27, 2011 at 3:45 PM

Hi,

How can I improve language detection time performance?

Currently the language detection process is too slow.

I'm using the Wikipedia-Experimental-UTF8Only method, and I removed all of the languages I do not need from the full list of language profile files. This improved performance, and detection is now about 7 times faster.

But I need it to go faster!


Are there any configuration settings that might affect detection time?

Are there any performance differences between the LM and Wikipedia-Experimental-UTF8Only methods?

What else can be done to improve performance? My goal is to cut detection time in half.


Thanks

Coordinator
Sep 29, 2011 at 6:22 AM

Hi Ofry,

In addition to the optimization techniques described in http://ntextcat.codeplex.com/workitem/567,
you can also truncate your input string. Usually 500-1000 characters of input are enough to identify its language. However, you should verify that the chosen truncation length causes only a negligible degradation of precision.
The Russian articles you've sent me are ~12k characters long and the Chinese ones are ~2k characters long. The latter are recognized 2.4 times faster.
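
As an illustration, here is a minimal sketch of the truncation idea (plain Python, since NTextCat itself is .NET; the 1000-character default and the function name are my own choices for the example):

    # Sketch: truncate input before handing it to a detector.
    # The 1000-character default follows the 500-1000 range suggested
    # above; tune it against precision measurements on your own data.

    def truncate_for_detection(text: str, max_chars: int = 1000) -> str:
        """Keep only the first max_chars characters of the input."""
        return text[:max_chars]

    article = "слово " * 2000          # stand-in for a ~12k-character article
    snippet = truncate_for_detection(article)
    print(len(snippet))                # 1000 -- only this much gets analyzed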

In addition, I've managed to make recognition itself even faster -- 1.5 times -- by reimplementing the generic Bag from PowerCollections. A simple Dictionary-based bag is 3 times faster than the one from PowerCollections (at least in my case).
I will experiment with a specialized bag implementation and expect to make recognition 2.5 times faster than it is in NTextCat 0.1.5.
I will release these fixes soon (issue 587: http://ntextcat.codeplex.com/workitem/587).
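
For illustration, a dictionary-backed bag (multiset) of the kind described above can be sketched as follows; this is a generic Python rendering of the idea, not NTextCat's actual C# implementation:

    # Sketch of a dictionary-backed bag (multiset), i.e. the kind of
    # structure replacing PowerCollections' generic Bag. A generic
    # Python rendering of the idea, not NTextCat's actual C# code.

    class DictBag:
        def __init__(self):
            self._counts = {}              # item -> number of occurrences

        def add(self, item, count=1):
            # dict.get avoids a separate membership check per insertion
            self._counts[item] = self._counts.get(item, 0) + count

        def count_of(self, item):
            return self._counts.get(item, 0)

        def total(self):
            return sum(self._counts.values())

    bag = DictBag()
    for ngram in ["th", "he", "th"]:       # e.g. character n-grams of the input
        bag.add(ngram)
    print(bag.count_of("th"))              # 2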

Best Regards,
Ivan


Sep 30, 2011 at 5:17 PM

When do you estimate a version with the fixes will be released?


Thanks,

Ofry

Coordinator
Oct 4, 2011 at 8:56 AM

This or next week.

Currently the fix makes things 1.6-1.7 times faster for snippets longer than 10k characters (hence faster recognition among a small number of languages, e.g. 20, and much faster learning).
Intel i5 2500, 20 languages, 10k characters: recognition took ~11 ms (vs ~18 ms in v0.1.5).

But it is only 1.1 times faster when recognizing a 1k-character snippet among 20 languages.
Intel i5 2500, 20 languages, 1k characters: recognition took ~1.9 ms (vs ~2.1 ms in v0.1.5).
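
If you want to reproduce such measurements on your own data, a minimal timing harness could look like the sketch below; detect() is a hypothetical placeholder, not an NTextCat call:

    # Minimal timing-harness sketch for per-call recognition time.
    # detect() is a hypothetical placeholder for the recognizer under test.

    import time

    def detect(text: str) -> str:
        return "en"                        # placeholder; plug in a real detector

    def mean_ms_per_call(text: str, runs: int = 100) -> float:
        start = time.perf_counter()
        for _ in range(runs):
            detect(text)
        return (time.perf_counter() - start) / runs * 1000.0

    snippet_10k = "a" * 10_000
    print(f"~{mean_ms_per_call(snippet_10k):.2f} ms per recognition")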

Best Regards,
Ivan

Oct 9, 2011 at 12:23 PM

What is the minimum number of words you recommend using to detect the correct language?

If I have a very long text and, for performance reasons, do not want to send all of it, which would be better: sending the first X words, or sending the X most frequent words (computed from a histogram)?

Thanks

Coordinator
Jan 7, 2014 at 10:49 PM
Hi,
Sending the first X words would be the better option in my opinion, but you should verify this assumption on your own data.
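For comparison, both sampling strategies can be sketched as follows (plain Python, detector-agnostic; the function names are mine):

    # Sketch of the two sampling strategies discussed above:
    # (a) the first X words, (b) the X most frequent words.
    # Taking the first X words preserves word order and natural
    # n-gram statistics, which is why it is the safer default.

    from collections import Counter

    def first_x_words(text: str, x: int) -> str:
        return " ".join(text.split()[:x])

    def top_x_words(text: str, x: int) -> str:
        counts = Counter(text.split())
        return " ".join(word for word, _ in counts.most_common(x))

    text = "the cat sat on the mat and the dog sat too"
    print(first_x_words(text, 5))   # the cat sat on the
    print(top_x_words(text, 3))     # the sat cat (ties broken by first occurrence)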
Best,
Ivan