In this practical session, our goals are to:
For this part of the practical session, we will need the scikit-learn Python module, which provides a machine learning toolkit. In the following sections, we will use the K-means and Hierarchical Clustering algorithms for document classification. Please install scikit-learn as instructed here.
We are given a collection of documents, each of which belongs to one of two categories. For instance, some of the documents could be news articles and the rest could be sci-fi literature. Our task is to write a cluster-based classifier that separates these documents into two classes (clusters). In the example above, we would ideally like to have all the news articles in one cluster and all the sci-fi texts in the other.
This problem can be split into four subtasks:
Define what "words" are: to obtain meaningful results, punctuation and some "stop-words" should be ignored (import "stopwords" from Corpora using nltk.download()).
Extract the n most frequent words in the whole corpus (i.e. across all of the documents, in both categories). These n words will represent the n features used by the clustering algorithm. For this exercise, we use n = 100.
Clustering: apply the K-means and the Agglomerative Clustering algorithms to the dataset obtained in the previous point. For the Agglomerative Clustering, try all three possible linkage types (‘ward’, ‘complete’ and ‘average’).
Plotting: once we have the clusters, we need an intuitive way to visualize them. To visualize 100-dimensional data meaningfully, we should project it onto a 2-dimensional space (a plane). But how should we choose the ‘best’ (most significant) plane onto which to project the data? This is exactly what Principal Component Analysis does. Run the data (X) through a scikit-learn PCA with 2 components to obtain a 2-dimensional reduced version, which can then easily be plotted.
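The four subtasks above can be sketched end to end as follows. This is only a minimal illustration under stated assumptions, not the skeleton script: the six toy sentences, the tiny STOP_WORDS set, the helper names tokenize and build_features, and n = 10 are all hypothetical stand-ins (the exercise uses Brown corpus documents, the NLTK stop-word list and n = 100). It also assumes that each feature value is the raw count of the corresponding word in the document.

```python
from collections import Counter
import string

import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.decomposition import PCA

# Hypothetical toy corpus standing in for the Brown corpus documents.
docs = [
    "The senate passed the budget and the tax law.",
    "A new tax law was debated by the senate committee.",
    "The budget committee approved the federal tax plan.",
    "She kissed him and her heart filled with love.",
    "His love for her grew with every kiss and every letter.",
    "Her heart ached when the love letter arrived.",
]

# Tiny illustrative stop-word set (assumption); the exercise uses NLTK's list.
STOP_WORDS = {"the", "a", "and", "was", "by", "with", "for", "when",
              "his", "her", "him", "she"}


def tokenize(text):
    """Lower-case the text, strip punctuation and drop stop-words."""
    words = (w.strip(string.punctuation).lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]


def build_features(documents, n):
    """Return a (documents x n) count matrix over the n most frequent words."""
    tokenized = [tokenize(d) for d in documents]
    corpus_counts = Counter(w for doc in tokenized for w in doc)
    vocab = [w for w, _ in corpus_counts.most_common(n)]
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append([counts[w] for w in vocab])
    return np.array(rows, dtype=float), vocab


# n = 10 only because the toy corpus is tiny; the exercise uses n = 100.
X, vocab = build_features(docs, n=10)

# K-means plus Agglomerative Clustering with the three linkage types.
clusterings = {
    "kmeans": KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
}
for linkage in ("ward", "complete", "average"):
    model = AgglomerativeClustering(n_clusters=2, linkage=linkage)
    clusterings[linkage] = model.fit_predict(X)

# Project the features onto the two principal components for plotting.
X_2d = PCA(n_components=2).fit_transform(X)
```

Each row of X_2d can then be drawn as a point in the plane, coloured by the cluster label of the corresponding document, for instance with a scatter plot.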
Your task is to fill in the missing code in the provided skeleton script. Please download the script and have a look at it. We suggest working with the documents in the “romance” and “government” categories from the Brown corpus. However, you are welcome to experiment with other categories from the Brown corpus as well.
Before beginning to implement your solution, try to answer the following:
We are now ready to evaluate the performance of the four clustering algorithms (K-means and Agglomerative Clustering with ‘ward’, ‘complete’ and ‘average’ linkage types).
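The text does not prescribe an evaluation metric, so the following is just one possible sketch. It assumes that both the true categories and the cluster ids are encoded as 0 and 1. Because cluster ids are arbitrary (cluster 0 may correspond to either category), the hypothetical helper two_cluster_accuracy scores the better of the two possible cluster-to-category mappings.

```python
def two_cluster_accuracy(true_labels, cluster_ids):
    """Fraction of documents placed in the right cluster.

    Both sequences are assumed to contain only 0 and 1. Cluster ids carry
    no meaning of their own, so the direct and the swapped mappings are
    both tried and the better score is returned.
    """
    if len(true_labels) != len(cluster_ids):
        raise ValueError("label sequences must have the same length")
    if not true_labels:
        raise ValueError("label sequences must not be empty")
    agreement = sum(t == c for t, c in zip(true_labels, cluster_ids))
    fraction = agreement / len(true_labels)
    return max(fraction, 1.0 - fraction)


# Hypothetical example: three documents per category, one misclustered.
true_labels = [0, 0, 0, 1, 1, 1]
cluster_ids = [1, 1, 0, 0, 0, 0]
score = two_cluster_accuracy(true_labels, cluster_ids)
```

Applying the same function to the output of each of the four algorithms gives directly comparable scores.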
In this exercise, we will build a mail spam classifier, using the Naive Bayes classification approach. Our task is to decide whether an email is legitimate (“ham”) or not (“spam”). The exercise is inspired by Chapter 3 of Machine Learning for Hackers, which you are encouraged to have a look at.
Before diving in, we need a dataset to work with. You are provided with a dataset [16 MB] containing a few thousand emails in three categories: “spam”, “easy-ham” and “hard-ham”. Hard-ham is more difficult to distinguish from spam than easy-ham. For example, hard-ham emails may include HTML tags, which are typically present in spam messages. The messages from each category are already split into two folders to facilitate creating training and test sets.
Our approach is similar to the one in the first exercise, with one crucial difference. This time, we are using supervised learning methods, so part of the data, together with its labels, will be used for training the classifier. The NLTK book contains a similar example concerning movie review classification. Feel free to read through its explanations before continuing.
This classification problem can be split into three subtasks:
In this case, the feature associated with a message is the set of words occurring in the message body. Even though headers contain potentially useful information for the task at hand, our simplified spam filter ignores the headers of the messages and focuses only on the email body.
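A minimal sketch of this feature extraction is shown below. It assumes that headers and body are separated by the first blank line of the raw message, and it represents the set of words as a dictionary mapping each word to True, which is the featureset format NLTK classifiers accept. The helper names extract_body and word_features and the sample message are hypothetical.

```python
import string


def extract_body(raw_message):
    """Return the text after the first blank line (headers are dropped).

    If the message contains no blank line, it is treated as having no body.
    """
    normalized = raw_message.replace("\r\n", "\n")
    _, separator, body = normalized.partition("\n\n")
    return body if separator else ""


def word_features(raw_message):
    """Map each distinct lower-cased word of the message body to True."""
    body = extract_body(raw_message)
    words = (w.strip(string.punctuation).lower() for w in body.split())
    return {w: True for w in words if w}


# Hypothetical sample message: two header lines, a blank line, then the body.
raw = (
    "From: someone@example.com\n"
    "Subject: Free money\n"
    "\n"
    "Claim your FREE prize now! Free prize."
)
features = word_features(raw)
```

Note that the word "money" appears only in the Subject header, so it does not end up in the features.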
train function, passing it the appropriate set of features.
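As an illustration of the training step, here is a small sketch. It assumes that the train function in question is NLTK's NaiveBayesClassifier.train, as used in the NLTK book's movie review example; the skeleton script may be organised differently. The hand-written featuresets below are hypothetical stand-ins for the features extracted from the real emails.

```python
import nltk

# Hypothetical labelled featuresets: (feature dict, label) pairs.
train_set = [
    ({"free": True, "prize": True, "claim": True}, "spam"),
    ({"free": True, "offer": True, "winner": True}, "spam"),
    ({"meeting": True, "agenda": True, "tomorrow": True}, "ham"),
    ({"project": True, "meeting": True, "notes": True}, "ham"),
]
test_set = [
    ({"free": True, "winner": True}, "spam"),
    ({"meeting": True, "notes": True}, "ham"),
]

# Train on the labelled training featuresets.
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Classify a single unseen featureset, then score the whole test set.
predicted = classifier.classify({"free": True, "winner": True})
accuracy = nltk.classify.accuracy(classifier, test_set)
```

With the real dataset, train_set and test_set would be built from the two folders provided for each category.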
As in the previous section, a skeleton script where you need to fill in the blanks is provided.