Custom "Language" Models

Mar 19, 2012 at 6:24 AM

Hi Ivan,
I want to use the NTextCat infrastructure to recognize other features in addition to languages.
I'm trying to understand the structure of the language model files so that I can create custom "language" models of my own.

Could you please explain the structure of the language model files and how I can create some of my own?

Thanks,

Ofer

Coordinator
Mar 29, 2012 at 6:45 PM
Edited Mar 29, 2012 at 6:51 PM

Hi Ofer,

I have written an article, "How to recognize to which domain the document belongs".
It covers text categorization in general.

The structure of language models (LMs) is very simple. Each LM is a text file encoded in cp1250. Each line corresponds to one ngram.

Structure of the line:
<ngram><delimiter><count><newline>

Each part in more detail:

1)       ngram – a sequence of one or more bytes that constitute the ngram (LMs are byte-based). Those bytes can be anything except the following: ‘0’..‘9’, ‘\r’, ‘\n’, ‘\t’, ‘ ’ (space).

2)       delimiter – a ‘\t’ followed by a space (the space is there for compatibility with the original TextCat tool);

3)       count – an integer written in text form. It represents the number of times the ngram occurred.

4)       newline – pair of ‘\r’ and ‘\n’
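
As a rough illustration of the line format above, here is a minimal Python sketch that writes and reads LM lines. It is not NTextCat code: the names format_line and parse_model are made up for this example, and it assumes the delimiter is a tab followed by a space, in that order.

```python
from typing import Dict

DELIMITER = b"\t "   # assumption: tab first, then space
NEWLINE = b"\r\n"    # '\r' followed by '\n'


def format_line(ngram: bytes, count: int) -> bytes:
    """Serialize one ngram and its count as a single LM line."""
    return ngram + DELIMITER + str(count).encode("ascii") + NEWLINE


def parse_model(data: bytes) -> Dict[bytes, int]:
    """Parse the raw bytes of an LM file into an {ngram: count} mapping."""
    model: Dict[bytes, int] = {}
    for line in data.split(NEWLINE):
        if not line:
            continue  # skip the empty piece after the final newline
        # An ngram never contains '\t', so the first tab ends it.
        ngram, _, count = line.partition(b"\t")
        model[ngram] = int(count.decode("ascii").strip())
    return model
```

Because the ngrams are raw bytes, the sketch keeps them as bytes and only treats the count as text.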

Sometimes you will find ‘_’ (underscore) in an ngram. It denotes either the beginning or the end of a word (more precisely, of a sequence of bytes that can be part of an ngram), or an underscore character found in the original text. This is also done for compatibility with the original TextCat tool. BTW, that’s why the main exe file is now named NTextCatLegacy.exe: it should match the functionality of the original TextCat as closely as possible.
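
To make the underscore convention concrete, here is a hedged Python sketch of counting byte ngrams for a custom model. It is only an illustration, not NTextCat's actual code: the name count_ngrams is hypothetical, the maximum ngram length of 5 is an assumption borrowed from the original TextCat, and padding each word with ‘_’ on both sides is my reading of the description above.

```python
import re
from collections import Counter

# Bytes that cannot be part of an ngram: digits, '\r', '\n', '\t', space.
SEPARATORS = re.compile(rb"[0-9\r\n\t ]+")


def count_ngrams(text: bytes, max_n: int = 5) -> Counter:
    """Count byte ngrams of length 1..max_n (max_n=5 is an assumption)."""
    counts: Counter = Counter()
    for word in SEPARATORS.split(text):
        if not word:
            continue  # empty pieces appear at the edges of the input
        padded = b"_" + word + b"_"  # mark the word's beginning and end
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return counts
```

For a custom "language", the input would be sample text from that category encoded to bytes, for example with sample.encode("cp1250").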

And that’s it.

In the future, I want to create character-based language models (so that an ngram will contain characters and the whole LM file will be a UTF-8 text file).

Additionally, I want to put more metadata about an LM in its file, so it might contain an XML header in the future.

Hope this answers your question, though I'm not quite sure what you mean by "other features" :)

Best Regards,
Ivan

Marked as answer by IvanAkcheurov on 1/7/2014 at 3:47 PM