Text Classification Practical Session

Introduction

In this practical session, our goals are to:

  1. build a document classifier based on clustering, and
  2. build a Naive Bayes spam classifier.

1. Document Classification using Clustering

1.1 Setup

For this part of the practical session, we will need the scikit-learn Python module, which provides a machine learning toolkit. In the following sections, we will make use of the k-Means and Hierarchical Clustering algorithms for document classification. Please install scikit-learn, as instructed here.

1.2 Problem description

We are given a collection of documents, each of which belongs to one of two categories. For instance, part of the documents could be news articles and the rest could be sci-fi literature. Our task is to write a cluster-based classifier that separates these documents into two classes (clusters). So, in our previous example, we would ideally like to have all the news articles in one cluster and all the sci-fi texts in the other.

This problem can be split into four subtasks:

  1. Preprocessing: extract numerical features from the documents, to give them as input to the clustering algorithms. For this problem, one main assumption is that the type of vocabulary used in the two categories of documents is different. So, one approach is to build a vocabulary of the n most frequent words in the collection and describe each document by how often those words occur, as detailed in the next subtask.
  2. Document representation: for each document, count how many times each of the n most frequent words appears and store these values in an array. In the end, assemble the arrays to obtain a matrix X of dimension m x n, where m is the number of documents and n is the number of most frequent words retained (here, n = 100).

  3. Clustering: apply the K-means and the Agglomerative Clustering algorithms to the dataset obtained in the previous point. For the Agglomerative Clustering, try all three possible linkage types (‘ward’, ‘complete’ and ‘average’). A sketch covering subtasks 1 to 3 follows this list.

  4. Plotting: after having obtained the clusters, we need an intuitive way of visualizing them. In order to meaningfully visualize 100-dimensional data, we should reduce/project it into a 2-dimensional space (a plane). But how should we choose the ‘best’ (most significant) plane onto which to project the data? This is exactly what Principal Component Analysis (PCA) does. Run the data (X) through a scikit-learn PCA with 2 components to obtain a 2-dimensional reduced version which can then easily be plotted, as in the second sketch below.
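The following minimal sketch covers subtasks 1 to 3, assuming NLTK's Brown corpus is available (run nltk.download('brown') once) and using scikit-learn's CountVectorizer to select the n = 100 most frequent words; all variable names are illustrative, not part of the provided skeleton.

```python
from nltk.corpus import brown
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.feature_extraction.text import CountVectorizer

# Preprocessing: one raw-text string per Brown document in the two categories.
categories = ["romance", "government"]
fileids = [fid for cat in categories for fid in brown.fileids(categories=cat)]
texts = [" ".join(brown.words(fileids=fid)) for fid in fileids]

# Document representation: counts of the n = 100 most frequent words,
# giving an m x n matrix X (m documents, n words).
vectorizer = CountVectorizer(max_features=100)
X = vectorizer.fit_transform(texts).toarray()  # dense, for clustering and PCA

# Clustering: K-means, plus Agglomerative Clustering with each linkage type.
algorithms = {"k-means": KMeans(n_clusters=2, n_init=10, random_state=0)}
for linkage in ("ward", "complete", "average"):
    algorithms[linkage] = AgglomerativeClustering(n_clusters=2, linkage=linkage)
labels = {name: algo.fit_predict(X) for name, algo in algorithms.items()}
```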
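Continuing the sketch, subtask 4 projects the 100-dimensional count vectors onto their two principal components and plots each clustering (matplotlib assumed to be installed):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Plotting: project X onto the two most significant axes found by PCA.
X_2d = PCA(n_components=2).fit_transform(X)
for name, y in labels.items():
    plt.figure()
    plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y)  # color points by cluster label
    plt.title(name)
plt.show()
```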

1.3 Your Turn

Your task is to fill in the missing code in the provided skeleton script. Please download the script and have a look at it. We suggest working with the documents in the “romance” and “government” categories from the Brown corpus. However, you are welcome to experiment with other categories from the Brown corpus as well.
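If you have not used the Brown corpus before, a quick way to see what is available (assuming NLTK is installed):

```python
import nltk
nltk.download("brown")  # one-time download of the corpus

from nltk.corpus import brown
print(brown.categories())                        # all available categories
print(len(brown.fileids(categories="romance")))  # number of documents in one category
```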

Before beginning to implement your solution, try to answer the following:

1.4 Results Discussion

We are now ready to evaluate the performance of the four clustering algorithms (K-means and Agglomerative Clustering with ‘ward’, ‘complete’ and ‘average’ linkage types).
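One possible way to quantify the comparison, reusing the labels dictionary from the sketch above: since the true category of each Brown document is known, we can score each clustering against the gold labels, for example with scikit-learn's adjusted Rand index (1.0 means a perfect match, values near 0 mean chance-level agreement).

```python
from nltk.corpus import brown
from sklearn.metrics import adjusted_rand_score

# Gold labels: 0 for "romance" documents, 1 for "government" documents,
# in the same order in which the documents were vectorized above.
gold = [0] * len(brown.fileids(categories="romance")) + \
       [1] * len(brown.fileids(categories="government"))
for name, y in labels.items():
    print(f"{name}: adjusted Rand index = {adjusted_rand_score(gold, y):.3f}")
```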

2. Naive Bayes Spam Classifier

In this exercise, we will build an email spam classifier, using the Naive Bayes classification approach. Our task is to decide whether an email is legitimate (“ham”) or not (“spam”). The exercise is inspired by Chapter 3 of Machine Learning for Hackers, which you are encouraged to have a look at.

2.1 Dataset Download

Before diving in, we need a dataset to work with. You are provided with a dataset [16 MB] containing a few thousand emails in three categories: “spam”, “easy-ham” and “hard-ham”. Hard-ham is more difficult to distinguish from spam than easy-ham; for example, hard-ham emails may include HTML tags, which are typically present in spam messages. The messages from each category are already split into two folders, to facilitate creating training and test sets.

2.2 Approach

Our approach is similar to the one in the first exercise, with one crucial difference: this time we are using a supervised learning method, so part of the data, together with its labels, will be used to train the classifier. The NLTK book contains a similar example concerning movie review classification. Feel free to read through its explanations before continuing.

This classification problem can be split into three subtasks:

  1. Preprocessing
  2. Training

In this case, the feature set associated with a message is the set of words occurring in the message body. Even though the headers contain potentially useful information for the task at hand, our simplified spam filter will ignore them and focus only on the email body (see the sketch after this list).

  3. Evaluation
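A minimal end-to-end sketch of the three subtasks, assuming the dataset was unpacked into per-category directories such as data/spam and data/easy_ham, with one message per file; the directory names (including the held-out data/spam_2 and data/easy_ham_2) are placeholders to adapt to the provided archive.

```python
import os
import nltk

nltk.download("punkt")  # tokenizer models, one-time download

def get_body(path):
    """Preprocessing: in a raw email file, the body starts after the
    first blank line, which separates it from the headers."""
    with open(path, encoding="latin-1") as f:
        return f.read().split("\n\n", 1)[-1]

def features(body):
    """The feature set of a message: the set of words in its body."""
    return {word: True for word in nltk.word_tokenize(body.lower())}

def load(directory, label):
    """Turn every file in a directory into a (featureset, label) pair."""
    return [(features(get_body(os.path.join(directory, name))), label)
            for name in os.listdir(directory)]

# Training: fit a Naive Bayes classifier on the training folders ...
train_set = load("data/spam", "spam") + load("data/easy_ham", "ham")
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Evaluation: ... and measure accuracy on the held-out folders.
test_set = load("data/spam_2", "spam") + load("data/easy_ham_2", "ham")
print("accuracy:", nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(10)
```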

2.3 Your Turn

As in the previous section, a skeleton script where you need to fill in the blanks is provided.