How to recognize to which domain the document belongs

This task is known as automatic document categorization into one or multiple predefined domains.

NTextCat is able to classify/categorize text. Possible categories could be world’s known languages or domains e.g. Sports, Finance, Politics, etc.

This article explains how NTextCat can help you to automatically recognize which category document belongs to. Document can be anything: email, news article, blog post, Twitter message, etc.

We need the following items:

1)      Feature extractor

2)      Model creator

3)      Distance measure between models

4)      Categorizer

Feature extractor

We need to transform each document into its features. The features could be anything that can describe text in numbers. Here I describe the bag of words approach. It suggests that we take a document and count all distinct words in it. We can omit non-meaningful words like: a, the, to, which, etc. Each pair “word-count” is a feature of document. Please take a look at interface IBag<T> in NTextCat/NClassify project.

So we need to implement the following interface:

    public interface IFeatureExtractor<TSource, TFeature>

    {

        IEnumerable<TFeature> GetFeatures(TSource obj);

    }

 

TSource can be a string and TFeature can be a pair “word-count” (e.g. Tuple<string, int>).

Easiest naïve implementation could be something like this:

    public class BagOfWordsFeatureExtractor : IFeatureExtractor<string, Tuple<string, int>>

    {

        private static readonly HashSet<string> _stopWords = new HashSet<string>(GetFeatureStream(RawStopwords));

        public IEnumerable<Tuple<string, int>> GetFeatures(string document)

        {

            var words = GetFeatureStream(document);

            var features = words.WhereNot(_stopWords.Contains).GroupBy(w => w, (key, group) => Tuple.Create(key, group.Count()));

            return features;

            //Tuple<string, int>[] bagOfWords = this.GetFeatures("some interesting document").ToArray();

            //int totalNumberOfWords = bagOfWords.Sum(t => t.Item2);

            //IEnumerable<Tuple<string, double>> distribution = bagOfWords.Select(t => Tuple.Create(t.Item1, (double) t.Item2/totalNumberOfWords));

        }

 

        private static IEnumerable<string> GetFeatureStream(string documents)

        {

            return Regex.Split(documents, "[^a-zA-Z]").Where(w => w.Length >= 2).Select(w => w.ToLowerInvariant());

        }

 

        private const string RawStopwords =

@"a,the,from,to,etc”;

    }

 

 

Model creator

I’d suggest using a probability distribution of words in a document as a document model. The distribution is almost the same thing as the bag with only difference that a number associated with a word is not just a count how many times the word occurred in the document but the count divided by a total number of words in the document. In C#

Fortunately there is IDistribution<T> interface and Distribution<T> implementation in NTextCat/NClassify project. Distribution<T> constructor takes IBag<T> as a parameter.

Distance measure between models

Assumption is that documents belonging to the same category usually have similar features. How similar they are is determined with distance measure. We can say that a word distribution model is a vector so that we can measure distance between two distributions with Sine Distance measure. More similar two distributions are less distance between them is.

You can also use custom already available VectorDistanceCalculator for these purposes.

Categorizer

A categorizer’s idea is very simple: take a model of unseen document and tell which category it most likely belongs too.

Interface looks like this:

    public interface ICategorizedClassifier<TItem, TCategory>

    {

        IEnumerable<Tuple<TCategory, double>> Classify(TItem item);

    }

 

So given a document it returns mapping of a category to a measure of likelihood how the document belongs to the category. I avoid using the word “probability” here as the measure is a broader concept. The only thing it guarantees is that if you the measure for category A is higher than the measure for category B, then A is more likely than B. However you cannot say that if measure(A)/measure (B) = 2, then A is twice as likely than B. So measure is not guaranteed to be a probability. Though it might.

You can use KnnMonoCategorizedClassifier<string, string> which implements ICategorizedClassifier<string, string>.

Full cycle

We carefully select documents representing each domain. For our purposes we can merge all documents belonging to the same domain in one single compound document and train a model (feature extractor + model creator). Then we create categorizer supplying it with a mapping of a category (domain) to a domain model. Now we are ready to recognize which domain a previously unseen document belongs to.

            var trainingDocuments =

                new Dictionary<string, string>

                    {

                        { "sports", File.ReadAllText("..\\..\\TestData\\Sports.txt") },

                        { "economy", File.ReadAllText("..\\..\\TestData\\Economy.txt") },

                    };

            var featureExtractor = new BagOfWordsFeatureExtractor();

            var trainedModels = new Dictionary<IDistribution<string>, string>();

            foreach (var trainingItem in trainingDocuments)

            {

                var distribution = CreateModel(featureExtractor, trainingItem.Value);

                trainedModels.Add(distribution, trainingItem.Key);

            }

            var classifier =

                new KnnMonoCategorizedClassifier<IDistribution<string>, string>(new VectorDistanceCalculator<string>(), trainedModels);

            var resultSports = classifier.Classify(CreateModel(featureExtractor,

                "Fitch Ratings on Wednesday said Britain's latest budget proposals show commitment to its existing deficit reduction strategy and do not impact its AAA credit rating.")).ToArray();

            Assert.GreaterOrEqual(resultSports.Length, 1);

            Assert.AreEqual(resultSports[0].Item1, "economy");

            var resultFinance = classifier.Classify(CreateModel(featureExtractor,

                "Ryan Flannigan strikes a four off the last ball to help Scotland claim a four-wicket win over Canada in the fifth-place play-off at the qualifying tournament for the ICC World Twenty20 in Dubai.")).ToArray();

            Assert.GreaterOrEqual(resultFinance.Length, 1);

            Assert.Equals(resultFinance[0].Item1, "sports");

 

 

Last edited Mar 29, 2012 at 6:10 PM by IvanAkcheurov, version 1

Comments

No comments yet.