Mar 29, 2012 at 6:45 PM
Edited Mar 29, 2012 at 6:51 PM
I have written an article "How to recognize to which domain the document
It gives answers about categorization of text in general.
The structure of language models (LMs) is very simple. Each LM is a text file encoded in cp1250. Each line corresponds to one ngram.
Structure of the line: ngram, delimiter, count, newline.
Each part in more detail:
ngram – a sequence of 1 or more bytes which constitute the ngram (LMs are byte-based). Those bytes can be anything except the following: ‘\n’, ‘\t’, ‘ ’ (space).
delimiter – a pair of ‘\t’ and space (the space is for compatibility with the original TextCat tool).
count – an integer written in text form. It represents the number of times the ngram has occurred.
newline – a pair of ‘\r’ and ‘\n’.
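To make the line layout concrete, here is a minimal Python sketch of a reader for this format. It is not code from NTextCat: the function name `parse_lm` is made up for illustration, and it assumes that the delimiter is a tab followed by a space and that every line ends with ‘\r’ ‘\n’. The file is handled as raw bytes, since the ngrams are byte sequences.

```python
def parse_lm(data: bytes) -> dict:
    """Parse the raw bytes of an LM file into {ngram_bytes: count}.

    Assumptions (not confirmed by the tool itself): lines end with
    b"\r\n" and the delimiter is a tab followed by a space.
    """
    counts = {}
    for line in data.split(b"\r\n"):
        if not line:
            continue  # skip the empty piece after the final line terminator
        # An ngram never contains '\t' or ' ', so the first b"\t " is the delimiter.
        ngram, sep, count = line.partition(b"\t ")
        if not sep or not ngram:
            raise ValueError(f"malformed LM line: {line!r}")
        counts[ngram] = int(count)  # count is an integer in text form
    return counts


if __name__ == "__main__":
    sample = b"_\t 100\r\ne\t 42\r\n_t\t 7\r\n"
    for ngram, count in parse_lm(sample).items():
        print(ngram, count)
```

For a real file, the bytes would come from something like `open(path, "rb").read()`. Reading in binary mode avoids any decoding step, so the ngram bytes stay exactly as stored.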
Sometimes you can find ‘_’ (underscore) in an ngram. It denotes either the beginning or the end of a word (more precisely, of a sequence of those bytes that can be part of an ngram), or an underscore character found in the original text. This is also done for compatibility with the original TextCat tool. BTW, that’s why the main exe file is now named NTextCatLegacy.exe. It should match the functionality of the original TextCat as closely as possible.
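A small Python sketch of how such underscore markers can end up inside ngrams. This is only an illustration, not the actual NTextCat or TextCat code: the helper name `byte_ngrams`, the single underscore on each side of a word, and the default maximum length of 5 are all assumptions.

```python
def byte_ngrams(word: bytes, max_n: int = 5) -> list:
    """Return all byte ngrams of length 1..max_n for one word.

    Assumption: the word is wrapped in one b"_" on each side to mark
    its beginning and end before the ngrams are cut out.
    """
    padded = b"_" + word + b"_"
    return [
        padded[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(padded) - n + 1)
    ]


if __name__ == "__main__":
    # b"_a" marks "a at the start of a word", b"b_" marks "b at the end".
    print(byte_ngrams(b"ab", max_n=2))
```

Note that once an ngram is written to the LM file, a marker underscore and an underscore from the text look the same, which is the ambiguity described above.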
And that’s it.
In the future, I want to create character-based language models (so that an ngram will contain characters and the whole LM file will be a UTF-8 text file).
Additionally, I want to put more metadata about the LM in its file, so it might contain an XML header in the future.
Hope it answers your question. Though I'm not quite sure what your "other features" mean :)