How to recognize which domain a document belongs to
This task is known as automatic document categorization into one or multiple predefined domains.
NTextCat can classify (categorize) text. The categories can be natural languages or domains such as Sports, Finance, or Politics.
This article explains how NTextCat can help you automatically recognize which category a document belongs to. A document can be anything: an email, a news article, a blog post, a Twitter message, etc.
We need the following items:
1) Feature extractor
2) Model creator
3) Distance measure between models
Feature extractor
We need to transform each document into its features. A feature can be anything that describes text numerically. Here I describe the bag-of-words approach: take a document and count how many times each distinct word occurs in it. We can omit words that carry little meaning, such as a, the, to, and which. Each word-count pair is a feature of the document. Please take a look at the IBag<T> interface in the NTextCat/NClassify project.
So we need to implement the following interface:
interface IFeatureExtractor<TSource, TFeature>
{
    IEnumerable<TFeature> GetFeatures(TSource obj);
}
TSource can be a string, and TFeature can be a word-count pair (e.g. Tuple<string, int>).
The simplest naive implementation could look something like this:
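class BagOfWordsFeatureExtractor : IFeatureExtractor<string, Tuple<string, int>>
{
    // Built from RawStopwords; the initializer is not shown here.
    private static readonly HashSet<string> _stopWords = /* ... */;

    public IEnumerable<Tuple<string, int>> GetFeatures(string document)
    {
        var words = GetFeatureStream(document);
        // Drop stop words, then count each remaining word (WhereNot is a helper extension, not standard LINQ).
        var features = words.WhereNot(_stopWords.Contains).GroupBy(w => w, (key, group) => Tuple.Create(key, group.Count()));
        return features;
    }
    //Tuple<string, int>[] bagOfWords = this.GetFeatures("some interesting document").ToArray();
    //int totalNumberOfWords = bagOfWords.Sum(t => t.Item2);
    //IEnumerable<Tuple<string, double>> distribution = bagOfWords.Select(t => Tuple.Create(t.Item1, (double) t.Item2/totalNumberOfWords));

    // Split on non-letters, drop one-letter tokens, and lower-case the rest.
    private IEnumerable<string> GetFeatureStream(string documents)
    {
        return Regex.Split(documents, "[^a-zA-Z]").Where(w => w.Length >= 2).Select(w => w.ToLowerInvariant());
    }
    // The stop-word list itself is not shown here.
    private const string RawStopwords = /* ... */;
}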
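To try the bag-of-words idea in isolation, here is a minimal, self-contained sketch that depends only on the .NET base class library, not on NTextCat. The names BagOfWordsSketch and CountWords and the four-word stop list are assumptions made for illustration; the tokenization mirrors the extractor above (split on non-letters, drop one-letter tokens, lower-case).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class BagOfWordsSketch
{
    // Illustrative stop-word set only; a real list would be much longer.
    private static readonly HashSet<string> StopWords =
        new HashSet<string> { "a", "the", "to", "which" };

    // Returns each distinct non-stop word with the number of times it occurs.
    public static Dictionary<string, int> CountWords(string document)
    {
        return Regex.Split(document, "[^a-zA-Z]")
            .Where(w => w.Length >= 2)
            .Select(w => w.ToLowerInvariant())
            .Where(w => !StopWords.Contains(w))
            .GroupBy(w => w)
            .ToDictionary(g => g.Key, g => g.Count());
    }
}
```

For example, CountWords("The cat chased the other cat to the door") yields cat = 2, chased = 1, other = 1, door = 1.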
Model creator
I’d suggest using the probability distribution of words in a document as the document model. A distribution is almost the same thing as a bag; the only difference is that the number associated with a word is not the raw count of its occurrences but that count divided by the total number of words in the document. The commented-out lines in the extractor above show this calculation in C#.
Fortunately, the NTextCat/NClassify project already provides an IDistribution<T> interface and a Distribution<T> implementation. The Distribution<T> constructor takes an IBag<T> as a parameter.
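The step from a bag to a distribution can be sketched without NTextCat as a plain normalization of the counts. The names DistributionSketch and ToDistribution are hypothetical, and a dictionary stands in for IBag<T> and IDistribution<T>; this is not the library's implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DistributionSketch
{
    // Converts word counts into relative frequencies that sum to 1.
    public static Dictionary<string, double> ToDistribution(IDictionary<string, int> bag)
    {
        double total = bag.Values.Sum();
        if (total == 0)
            return new Dictionary<string, double>(); // empty document: nothing to normalize
        return bag.ToDictionary(p => p.Key, p => p.Value / total);
    }
}
```

For example, a bag with cat = 2, dog = 1, bird = 1 becomes cat = 0.5, dog = 0.25, bird = 0.25.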
Distance measure between models
The assumption is that documents belonging to the same category usually have similar features. How similar they are is determined by a distance measure. A word distribution model can be treated as a vector, so we can measure the distance between two distributions with the Sine Distance measure. The more similar two distributions are, the smaller the distance between them.
You can also use the already available VectorDistanceCalculator for this purpose.
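As an illustration of a vector distance between two word distributions, the sketch below uses cosine distance (1 minus cosine similarity). This is a common stand-in chosen for the example; it is not necessarily what VectorDistanceCalculator computes. The names DistanceSketch and CosineDistance are hypothetical, and dictionaries stand in for the library's distribution type.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DistanceSketch
{
    // Cosine distance: about 0 for identical distributions, 1 when no word is shared.
    public static double CosineDistance(IDictionary<string, double> p, IDictionary<string, double> q)
    {
        double dot = 0;
        foreach (var pair in p)
        {
            double other;
            if (q.TryGetValue(pair.Key, out other))
                dot += pair.Value * other; // only shared words contribute
        }
        double normP = Math.Sqrt(p.Values.Sum(v => v * v));
        double normQ = Math.Sqrt(q.Values.Sum(v => v * v));
        if (normP == 0 || normQ == 0)
            return 1.0; // treat an empty model as maximally distant
        return 1.0 - dot / (normP * normQ);
    }
}
```

Words that appear in only one of the two distributions add nothing to the dot product, so documents with little shared vocabulary end up far apart.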
The idea of a categorizer is very simple: take the model of an unseen document and tell which category it most likely belongs to.
The interface looks like this:
interface ICategorizedClassifier<TItem, TCategory>
{
    IEnumerable<Tuple<TCategory, double>> Classify(TItem item);
}
So, given a document, it returns a mapping from each category to a measure of how likely the document is to belong to that category. I avoid using the word “probability” here because a measure is a broader concept. The only thing it guarantees is that if the measure for category A is higher than the measure for category B, then A is more likely than B. However, you cannot say that if measure(A)/measure(B) = 2, then A is twice as likely as B. So a measure is not guaranteed to be a probability, though it might be one.
You can use KnnMonoCategorizedClassifier<string, string> which implements ICategorizedClassifier<string, string>.
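To make the categorizer idea concrete, here is a minimal nearest-model sketch: it compares the item with one model per category and ranks the categories. It is not the implementation of KnnMonoCategorizedClassifier; the names NearestModelSketch and Classify, and the choice of the negated distance as the measure, are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NearestModelSketch
{
    // Ranks categories by how close their model is to the item.
    // The measure is the negated distance: higher means more likely, but it is not a probability.
    public static List<Tuple<TCategory, double>> Classify<TModel, TCategory>(
        TModel item,
        IDictionary<TCategory, TModel> categoryModels,
        Func<TModel, TModel, double> distance)
    {
        return categoryModels
            .Select(m => Tuple.Create(m.Key, -distance(item, m.Value)))
            .OrderByDescending(t => t.Item2)
            .ToList();
    }
}
```

The negated distance only orders the categories, which matches the guarantee described above: a higher measure means a more likely category, and nothing more.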
We carefully select documents that represent each domain. For our purposes, we can merge all documents belonging to the same domain into a single compound document and train a model on it (feature extractor + model creator). Then we create a categorizer, supplying it with a mapping from each category (domain) to its domain model. Now we are ready to recognize which domain a previously unseen document belongs to.
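// Mapping of each category (domain) to its merged training text; the initializer is not shown here.
var trainingDocuments = /* ... */;
var featureExtractor = new BagOfWordsFeatureExtractor();
// Collection of trained domain models; its concrete type is not shown here.
var trainedModels = new /* ... */;
foreach (var trainingItem in trainingDocuments)
{
    var distribution = CreateModel(featureExtractor, trainingItem.Value);
    // ... store the distribution in trainedModels for this item's category
}
// Build the categorizer from trainedModels; the constructor call is not shown here.
var classifier = /* ... */;
// The first text is about finance, the second about sports.
var resultFinance = classifier.Classify(CreateModel(featureExtractor,
    "Fitch Ratings on Wednesday said Britain's latest budget proposals show commitment to its existing deficit " +
    "reduction strategy and do not impact its AAA credit rating.")).ToArray();
var resultSports = classifier.Classify(CreateModel(featureExtractor,
    "Ryan Flannigan strikes a four off the last ball to help Scotland claim a four-wicket win over Canada in the fifth-place play-off at the qualifying tournament for the ICC World Twenty20 in Dubai.")).ToArray();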
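The whole pipeline (merge the documents of each domain, train a model per domain, pick the closest model) can also be sketched end to end without NTextCat. Everything here is an assumption made for illustration: the names DomainRecognitionSketch, CreateModel, Distance and Recognize, the use of cosine distance, and the absence of stop-word filtering. It is not how the library's classes are wired together.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class DomainRecognitionSketch
{
    // Word distribution of a text: relative frequency of each lower-cased word of 2+ letters.
    public static Dictionary<string, double> CreateModel(string text)
    {
        var words = Regex.Split(text, "[^a-zA-Z]")
            .Where(w => w.Length >= 2)
            .Select(w => w.ToLowerInvariant())
            .ToList();
        return words.GroupBy(w => w)
            .ToDictionary(g => g.Key, g => (double)g.Count() / words.Count);
    }

    // Cosine distance between two sparse word distributions (a stand-in distance measure).
    public static double Distance(Dictionary<string, double> p, Dictionary<string, double> q)
    {
        double dot = p.Where(x => q.ContainsKey(x.Key)).Sum(x => x.Value * q[x.Key]);
        double norms = Math.Sqrt(p.Values.Sum(v => v * v)) * Math.Sqrt(q.Values.Sum(v => v * v));
        return norms == 0 ? 1.0 : 1.0 - dot / norms;
    }

    // Merges each domain's documents into one compound document, trains a model per domain,
    // and returns the domain whose model is closest to the unseen document.
    public static string Recognize(IDictionary<string, string[]> trainingDocuments, string unseen)
    {
        var trainedModels = trainingDocuments.ToDictionary(
            d => d.Key,
            d => CreateModel(string.Join(" ", d.Value)));
        var unseenModel = CreateModel(unseen);
        return trainedModels.OrderBy(m => Distance(unseenModel, m.Value)).First().Key;
    }
}
```

With a toy training set such as Sports = { "The team won the match", "A late goal decided the match" } and Finance = { "The bank raised its interest rate", "Credit rating and deficit outlook" }, the text "Fitch says the credit rating and deficit outlook are stable" is recognized as Finance. Real use needs far more training text per domain than this.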