Text Classification Practical Session

Introduction

In this practical session, our goals are to:

  1. build a document classifier based on clustering, and
  2. build a Naive Bayes spam classifier.

1. Document Classification using Clustering

1.1 Setup

For this part of the practical session, we will need the scikit-learn Python module, which provides a machine learning toolkit. In the following sections, we will make use of the k-Means and Hierarchical Clustering algorithms for document classification. Please install scikit-learn, as instructed here.

1.2 Problem description

We are given a collection of documents, each of which belongs to one of two categories. For instance, part of the documents could be news articles and the rest could be sci-fi literature. Our task is to write a cluster-based classifier that separates these documents into two classes (clusters). So, in our previous example, we would ideally like to have all the news articles in one cluster and all the sci-fi texts in the other.

This problem can be split into four subtasks:

  1. Preprocessing: extract numerical features from the documents, to give them as input to the clustering algorithms. For this problem, one main assumption is that the type of vocabulary used in the two categories of documents is different. So, one approach is to build a vocabulary of the n most frequent words in the collection and describe each document by how often those words occur, as detailed in the next subtask.
  2. Document representation: for each document, count how many times each of the n most frequent words appears and store these values in an array. In the end, assemble the arrays to obtain a matrix X of dimension m x n, where m is the number of documents and n is the number of most frequent words retained (here, n = 100).

  3. Clustering: apply the K-means and the Agglomerative Clustering algorithms to the dataset obtained in the previous point. For the Agglomerative Clustering, try all three possible linkage types (‘ward’, ‘complete’ and ‘average’). A sketch covering subtasks 1 to 3 follows this list.

  4. Plotting: after having obtained the clusters, we need an intuitive way of visualizing them. In order to meaningfully visualize 100-dimensional data, we should reduce/project it into a 2-dimensional space (a plane). But how should we choose the ‘best’ (most significant) plane onto which to project the data? This is exactly what Principal Component Analysis (PCA) does. Run the data (X) through a scikit-learn PCA with 2 components to obtain a 2-dimensional reduced version which can then easily be plotted, as in the second sketch below.
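The following minimal sketch covers subtasks 1 to 3, assuming NLTK's Brown corpus is available (run nltk.download('brown') once) and using scikit-learn's CountVectorizer to select the n = 100 most frequent words; all variable names are illustrative, not part of the provided skeleton.

```python
from nltk.corpus import brown
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.feature_extraction.text import CountVectorizer

# Preprocessing: one raw-text string per Brown document in the two categories.
categories = ["romance", "government"]
fileids = [fid for cat in categories for fid in brown.fileids(categories=cat)]
texts = [" ".join(brown.words(fileids=fid)) for fid in fileids]

# Document representation: counts of the n = 100 most frequent words,
# giving an m x n matrix X (m documents, n words).
vectorizer = CountVectorizer(max_features=100)
X = vectorizer.fit_transform(texts).toarray()  # dense, for clustering and PCA

# Clustering: K-means, plus Agglomerative Clustering with each linkage type.
algorithms = {"k-means": KMeans(n_clusters=2, n_init=10, random_state=0)}
for linkage in ("ward", "complete", "average"):
    algorithms[linkage] = AgglomerativeClustering(n_clusters=2, linkage=linkage)
labels = {name: algo.fit_predict(X) for name, algo in algorithms.items()}
```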
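Continuing the sketch, subtask 4 projects the 100-dimensional count vectors onto their two principal components and plots each clustering (matplotlib assumed to be installed):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Plotting: project X onto the two most significant axes found by PCA.
X_2d = PCA(n_components=2).fit_transform(X)
for name, y in labels.items():
    plt.figure()
    plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y)  # color points by cluster label
    plt.title(name)
plt.show()
```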

1.3 Your Turn

Your task is to fill in the missing code in the provided skeleton script. Please download the script and have a look at it. We suggest working with the documents in the “romance” and “government” categories from the Brown corpus. However, you are welcome to experiment with other categories from the Brown corpus as well.
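If you have not used the Brown corpus before, a quick way to see what is available (assuming NLTK is installed):

```python
import nltk
nltk.download("brown")  # one-time download of the corpus

from nltk.corpus import brown
print(brown.categories())                        # all available categories
print(len(brown.fileids(categories="romance")))  # number of documents in one category
```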

Before beginning to implement your solution, try to answer the following:

1.4 Results Discussion

We are now ready to evaluate the performance of the four clustering algorithms (K-means and Agglomerative Clustering with ‘ward’, ‘complete’ and ‘average’ linkage types).
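One possible way to quantify the comparison, reusing the labels dictionary from the sketch above: since the true category of each Brown document is known, we can score each clustering against the gold labels, for example with scikit-learn's adjusted Rand index (1.0 means a perfect match, values near 0 mean chance-level agreement).

```python
from nltk.corpus import brown
from sklearn.metrics import adjusted_rand_score

# Gold labels: 0 for "romance" documents, 1 for "government" documents,
# in the same order in which the documents were vectorized above.
gold = [0] * len(brown.fileids(categories="romance")) + \
       [1] * len(brown.fileids(categories="government"))
for name, y in labels.items():
    print(f"{name}: adjusted Rand index = {adjusted_rand_score(gold, y):.3f}")
```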

2. Naive Bayes Spam Classifier

In this exercise, we will build an email spam classifier, using the Naive Bayes classification approach. Our task is to decide whether an email is legitimate (“ham”) or not (“spam”). The exercise is inspired by Chapter 3 of Machine Learning for Hackers, which you are encouraged to have a look at.

2.1 Dataset Download

Before diving in, we need a dataset to work with. You are provided with a dataset [16 MB] containing a few thousand emails in three categories: “spam”, “easy-ham” and “hard-ham”. Hard-ham is more difficult to distinguish from spam than easy-ham; for example, hard-ham emails may include HTML tags, which are typically present in spam messages. The messages from each category are already split into two folders, to facilitate creating training and test sets.

2.2 Approach

Our approach is similar to the one in the first exercise, with one crucial difference: this time we are using a supervised learning method, so part of the data, together with its labels, will be used to train the classifier. The NLTK book contains a similar example concerning movie review classification. Feel free to read through its explanations before continuing.

This classification problem can be split into three subtasks:

  1. Preprocessing
  2. Training

In this case, the feature set associated with a message is the set of words occurring in the message body. Even though the headers contain potentially useful information for the task at hand, our simplified spam filter will ignore them and focus only on the email body (see the sketch after this list).

  3. Evaluation
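A minimal end-to-end sketch of the three subtasks, assuming the dataset was unpacked into per-category directories such as data/spam and data/easy_ham, with one message per file; the directory names (including the held-out data/spam_2 and data/easy_ham_2) are placeholders to adapt to the provided archive.

```python
import os
import nltk

nltk.download("punkt")  # tokenizer models, one-time download

def get_body(path):
    """Preprocessing: in a raw email file, the body starts after the
    first blank line, which separates it from the headers."""
    with open(path, encoding="latin-1") as f:
        return f.read().split("\n\n", 1)[-1]

def features(body):
    """The feature set of a message: the set of words in its body."""
    return {word: True for word in nltk.word_tokenize(body.lower())}

def load(directory, label):
    """Turn every file in a directory into a (featureset, label) pair."""
    return [(features(get_body(os.path.join(directory, name))), label)
            for name in os.listdir(directory)]

# Training: fit a Naive Bayes classifier on the training folders ...
train_set = load("data/spam", "spam") + load("data/easy_ham", "ham")
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Evaluation: ... and measure accuracy on the held-out folders.
test_set = load("data/spam_2", "spam") + load("data/easy_ham_2", "ham")
print("accuracy:", nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(10)
```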

2.3 Your Turn

As in the previous section, a skeleton script where you need to fill in the blanks is provided.