Language Detection time performance (Closed)
gatoramo wrote Sep 27, 2011 at 2:53 PM

Can performance be improved by removing unused language symbol files?

IvanAkcheurov wrote Sep 29, 2011 at 7:40 AM

Hi,

> Can performance be improved by removing unused language symbol files?

Yes, it can. Here is more information about it: http://ntextcat.codeplex.com/workitem/567
Brief tests show that if you keep only 20 of the 280 language models available by default, recognition gets roughly 1.3 times faster.
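To see why pruning models helps: NTextCat uses n-gram language profiles, and detection scores the input against every loaded profile, so the cost grows linearly with the number of models. The following is a minimal Python sketch of that out-of-place n-gram ranking idea, not the actual NTextCat API; all names and the toy training strings are illustrative:

```python
from collections import Counter

def ngram_profile(text, max_n=3, top=300):
    """Build a ranked n-gram profile from text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    # Rank n-grams by frequency and keep only the top entries.
    ranked = [g for g, _ in counts.most_common(top)]
    return {g: rank for rank, g in enumerate(ranked)}

def out_of_place(doc_profile, lang_profile, penalty=1000):
    """Sum of rank differences; lower means more similar."""
    return sum(abs(rank - lang_profile.get(g, penalty))
               for g, rank in doc_profile.items())

def detect(text, models):
    """Score the text against every loaded model. The cost is
    linear in len(models), so fewer models means faster detection."""
    doc = ngram_profile(text)
    return min(models, key=lambda lang: out_of_place(doc, models[lang]))

# Tiny toy models; real profiles are trained on large corpora.
models = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog"),
    "de": ngram_profile("der schnelle braune fuchs springt ueber den hund"),
}
print(detect("the dog jumps", models))  # -> en
```

Dropping entries from `models` (the analogue of removing unused language model files) shrinks the work done per call to `detect`, which matches the ~1.3x speedup reported above.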

> Are there time performance differences between the LM and Wikipedia-Experimental-UTF8Only methods? If so, which method has the better time performance?

Not much. The methods are the same; the difference is the set of language models used to recognize the input string. LM should be faster because it has fewer language models. I would still recommend using Wikipedia-Experimental-UTF8Only, and I'll make it the default choice once I'm finished with the language-encoding mappings.
I will also reimplement PowerCollections' bag. Preliminary tests show language detection of a 12k-character string running 1.5 times faster; with additional tricks I'll try to get it 2.5 times faster.
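The "bag" mentioned here is a multiset used to accumulate n-gram counts; a general-purpose collection class like PowerCollections' Bag<T> adds per-item overhead that a flat hash map of counts avoids. A hedged Python sketch of the dict-backed approach (illustrative only, not the actual NTextCat implementation):

```python
from collections import Counter

def count_ngrams(text, max_n=5):
    """Count all n-grams up to length max_n using a dict-backed bag.

    A plain hash map from n-gram to count does one lookup and one
    increment per occurrence, which is the kind of lean inner loop
    a specialized bag replacement aims for."""
    bag = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            bag[text[i:i + n]] += 1
    return bag

bag = count_ngrams("banana", max_n=2)
print(bag["an"])  # -> 2
```

Since this counting loop runs over every character of the input for every n-gram length, shaving constant overhead off each bag update compounds quickly on long inputs like the 12k-character test string mentioned above.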

For other readers: there is a discussion thread related to performance optimizations: http://ntextcat.codeplex.com/discussions/273996