Text Classification: an exploration of different representations and learning algorithms

This practical session was produced using Jupyter. If you are used to it, you can download the corresponding notebook code from here. If not, no problem at all; this is not mandatory: simply proceed as usual in your favorite Python environment.

Introduction

The aim of this practical session is to get yourself acquainted with the different models that can be used for NLP classification tasks and to get some exposure to the different statistical machine learning and deep learning packages.

As we have seen in the second practical session, a classic example of a task with a heavily imbalanced label distribution is spam classification. The goal of the task is to classify whether a given piece of text (e.g., an email or an SMS message) is spam or not. Since there are only 2 classes (spam or not), we call such tasks binary classification. Text classification also includes multi-class tasks such as predicting the topic of a news article. Built around these two tasks, this practical session has 3 sections:

  1. Brief Data Analysis & Processing:
    • What does the label distribution look like?
    • Train/test splitting
  2. Classical Algorithms:
    • Bag-of-Words vs. TF-IDF
    • Filtering: removing stopwords
    • Logistic Regression vs. Naive Bayes
  3. Beyond Feature Engineering:
    • LSTM for spam filtering

Acknowledgements

Huge thanks to Reza and Mehmet for the inspiration; a big part of the code is recycled from their notebooks!

Content Warning: this exercise's data may contain explicit words.


Setting up your environment

While you can download the following packages with pip to your computer directly, we recommend (but do not require) that you use a virtual environment, so as not to mess up package versions across different projects. If you'd like to, here is a quick tutorial on virtual environments that you can check out with an EPFL email.

Alternatively, you can use Noto, the EPFL Jupyter notebook service; however, you will have to pip install some specific packages such as torchtext.

  1. First, make sure you have a virtual environment (e.g., venv, virtualenv, conda) and that it uses a Python version >= 3.6, per the scikit-learn and torch requirements. If you are using a Jupyter notebook, make sure the interpreter points to the correct Python executable (a quick way to check this is sketched just before the imports below).
  2. Then install the following packages into your venv:
pip install -U ipykernel
pip install -U ipywidgets
pip install -U pip setuptools wheel
pip install -U pandas
pip install -U matplotlib
pip install -U scikit-learn
pip install -U seaborn
pip install -U nltk
pip install -U torch
pip install -U torchtext==0.10.0
pip install -U torchdata
  3. Next, import the necessary packages:

Note: If this part of the code hangs, simply restart your kernel and rerun; importing packages multiple times can sometimes create a problem!
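
If you want to double-check which interpreter and Python version your notebook is actually using (step 1 above), here is a quick, optional sanity check (sys is part of the standard library):

import sys
print(sys.executable)    # path of the Python interpreter running this notebook
print(sys.version_info)  # should report a version >= 3.6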

In [ ]:
# 1) Importing necessary packages:
# general
import os
import string
import random
from collections import Counter
#
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk

print("1")
# dataset + processing
from sklearn.datasets import fetch_20newsgroups
from sklearn.preprocessing import normalize
from sklearn.model_selection import train_test_split #, KFold
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
#
# from nltk import word_tokenize
# from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

print("2")
# classification models
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.linear_model import LogisticRegression
import warnings
from sklearn.exceptions import ConvergenceWarning

print("3")
# metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# LSTM part's packages
import torch
import torch.nn as nn
import torch.optim as optim
from torchtext.legacy import data # nlp library of Pytorch
import warnings as wrn
wrn.filterwarnings('ignore')

print("4")
# 2) Setting the seed for reproducibility:
seed = 42
os.environ['PYTHONHASHSEED'] = str(seed)
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)  # also seed torch, for the LSTM section below

print("5")
In [ ]:
# 3) Downloading stopwords from NLTK
# Set this flag to False if you have never downloaded them before:
nltk_stopwords_downloaded = True
if not nltk_stopwords_downloaded:
    nltk.download('stopwords')

Now you are ready to start the exercises!

1) Brief Data Analysis & Processing

a) Spam Dataset

To solve and evaluate the spam task, we will use the same annotated English SMS corpus from Kaggle as in the first practical session. You can download the data here. Simply put it in the same folder as the notebook you are running. As we already did the analysis and pre-processing in the first practical session, we will quickly run the same steps here. Remember that the labels in the spam dataset are imbalanced, heavily skewed towards the ham label.

In [ ]:
df = pd.read_csv("spam.csv", header=0, names=['label','sms'], usecols=[0,1])
# df.tail()
df.head()

As you can see, there are 2 classes: "ham" & "spam". Let's take a look at their distribution.

In [ ]:
df['label'] = df['label'].replace(to_replace={'ham': 0, 'spam': 1})
print("Label proportions are:")
print(df.label.value_counts(normalize=True))

Then, to make sure that we don't overfit our models to the data, we split the data into train and test sets. We use the very convenient train_test_split function from scikit-learn. The test_size parameter lets us choose what fraction of the data goes into the test set. Here $x$ is the SMS message, while $y$ is its corresponding label.

In [ ]:
train_split_random_state = 11
X_train_spam, X_test_spam, y_train_spam, y_test_spam = train_test_split(
    df['sms'].values,
    df['label'].values,
    test_size=0.2, 
    random_state=train_split_random_state
)

print("Spam amount in train set: {} out of {} datapoints".format((y_train_spam == 1).sum(), len(y_train_spam)))
print("Spam percentage in train set: {}%".format(round((y_train_spam == 1).sum() / float(len(y_train_spam)) * 100, 4)))

print("Size of train set is: ", len(y_train_spam))
print("Size of test set is: ", len(y_test_spam))

b) 20 Newsgroups Dataset

To solve and evaluate the news topic classification task, we will use the 20 Newsgroups dataset, which contains roughly 19K articles across 20 different newsgroups. We download the data through scikit-learn, so you don't have to fetch it manually. Let's take a look at the label distribution in this dataset.

In [ ]:
news = fetch_20newsgroups(subset='all')
print("Number of articles: ", len(news.data))
print("Number of different categories: ", len(news.target_names))

To look at the label distribution of the 20 classes, we plot a pie chart.

In [ ]:
news_labels = news.target_names
sizes = [Counter(news.target)[i] for i in range(len(news_labels))]
plt.figure(figsize=(10,8))
plt.pie(sizes, labels=news_labels, autopct='%1.1f%%')
plt.show()

Q: What do you notice about the label distribution of the news dataset compared to the spam one? Does this change your plan on which metric to use to evaluate the classifiers we will test in the next section?

A: TODO - your answer here!

We also divide this dataset into train and test sets.

In [ ]:
X_train_news, X_test_news, y_train_news, y_test_news = train_test_split(
    news.data,
    news.target, 
    test_size=0.2, 
    random_state=train_split_random_state
)

2) Choosing Features & a Classifier

Feature engineering is where NLP task-specific knowledge can come in handy and make it more likely that a simple classifier learns the task. It requires us to index tokens and create meaningful representations out of them.

First we have to create a vocabulary. Some of the indexing themes you have seen in class include:

  • tokenization: splitting the text into units called tokens, which is required before indexing
  • stopwords: common words that can be filtered out (a small filtering sketch follows this list)
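
As a tiny illustration of the second point, here is a sketch on a made-up sentence, using the NLTK stopword list imported in the setup (it requires the stopwords to have been downloaded in the cell above):

example_sms = "you have won a free ticket to the show"
tokens = example_sms.split()  # naive whitespace tokenization
english_stopwords = set(stopwords.words('english'))
print([tok for tok in tokens if tok not in english_stopwords])  # only the more content-bearing words should survive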

To represent a document as a vector, however, we need more than indexing: we need a vector space in which to represent the words:

  • Bag-of-Words model: a single document can be considered as a bag of its words together with how many times each word occurred, without caring about the order of the words. The word occurrence count is also called the term frequency. You can think of this as a vector over the whole vocabulary whose entries are the number of times each term occurred in the document.
  • TF-IDF: term frequency-inverse document frequency diminishes the weight of terms that occur very frequently across the document set and increases the weight of terms that occur rarely. (A small comparison of the two vectorizers is sketched right after this list.)
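
To make the difference between the two concrete, here is a minimal sketch on a made-up three-document corpus (not one of the datasets above), using the two scikit-learn vectorizers imported in the setup:

toy_corpus = [
    "free prize call now",
    "are you free for lunch",
    "call me when you are free",
]
bow_vectorizer = CountVectorizer()
print(bow_vectorizer.fit_transform(toy_corpus).toarray())    # raw term counts per document
print(bow_vectorizer.get_feature_names_out())                # the learned vocabulary
tfidf_vectorizer = TfidfVectorizer()
print(tfidf_vectorizer.fit_transform(toy_corpus).toarray())  # reweighted: terms frequent in many documents (e.g., "free") get lower weight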

Luckily, scikit-learn provides a Pipeline class that lets us chain the vectorizer and the classifier in the correct order. Take as an example the TF-IDF vectorizer and the first classifier you might think of, such as the Naive Bayes classifier (BernoulliNB). We can build the model we will train and predict with on a binary task as follows:

Pipeline([('vectorizer', TfidfVectorizer()), ('classifier', BernoulliNB(alpha=0.2))])

Note that for a multi-class problem you can use MultinomialNB. In LogisticRegression, you can switch the multi_class parameter value from ovr (one-vs-rest) to auto, which picks a multinomial formulation whenever the solver supports it.
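
As a quick illustration of how such a pipeline is used (a sketch reusing X_train_spam and y_train_spam from Section 1; the predicted label for the made-up message is what we would expect, not a guarantee):

demo_pipeline = Pipeline([('vectorizer', TfidfVectorizer()), ('classifier', BernoulliNB(alpha=0.2))])
demo_pipeline.fit(X_train_spam, y_train_spam)  # vectorization and training in a single call
print(demo_pipeline.predict(["Congratulations! You have won a free prize, call now"]))  # ideally 1 (spam)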

In [ ]:
spam_classifier_dict = {
    "bagofwords+binaryNB": Pipeline([('vectorizer', CountVectorizer()), ('classifier', BernoulliNB(alpha=0.2))]),
    "bagofwords+binaryLogistic": Pipeline([('vectorizer', CountVectorizer()), ('classifier', LogisticRegression(solver="saga" , multi_class="ovr", max_iter=200))])
}

news_classifier_dict = {
    "bagofwords+multiNB": Pipeline([('vectorizer', CountVectorizer()), ('classifier', MultinomialNB(alpha=0.2))]),
    "bagofwords+multiLogistic": Pipeline([('vectorizer', CountVectorizer()), ('classifier', LogisticRegression(solver="saga" , multi_class="auto", max_iter=200))])
}

Now that we have our classifiers, we can train and validate them to check whether each vectorizer and classifier combination does well on the task. Here we make sure to further split the train dataset into several train and validation splits (a simple form of cross-validation). This way the original test set stays untouched, which prevents overfitting to it while we explore features and classification algorithms.
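
For reference, the repeated random splitting implemented in the train() function below is essentially what scikit-learn's ShuffleSplit does; a sketch of the equivalent built-in approach (shown only for comparison, not used in the rest of the notebook) would be:

from sklearn.model_selection import ShuffleSplit, cross_val_score

cv = ShuffleSplit(n_splits=5, test_size=0.1, random_state=0)  # 5 random 90/10 train/validation splits
scores = cross_val_score(spam_classifier_dict["bagofwords+binaryNB"], X_train_spam, y_train_spam, cv=cv)
print(scores.mean(), scores.std())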

In [ ]:
train_validation_random_state = [1,5,10,15,20]

def train(
    classifier, 
    X_train, 
    y_train, 
    rnd_state_input , 
    test_split_size=0.1, 
):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=ConvergenceWarning)
        X_train, X_val, y_train, y_val = train_test_split(
            X_train,
            y_train,
            test_size=test_split_size,
            random_state=rnd_state_input
        )
        classifier.fit(X_train, y_train)
        y_pred = classifier.predict(X_val)
        if rnd_state_input == 5:
            # Report detailed metrics on one of the validation splits (the one with random_state == 5)
            print("\t|| Accuracy: {:.4f}".format(accuracy_score(y_val, y_pred)))
            print("\t|| Precision (macro): {:.4f}".format(precision_score(y_val, y_pred, average='macro')))
            print("\t|| Recall (macro): {:.4f}".format(recall_score(y_val, y_pred, average='macro')))
            print("\t|| F1 (macro): {:.4f}".format(f1_score(y_val, y_pred, average='macro')))
        return classifier, classifier.score(X_val, y_val)

def plot_confusion_matrix(classifier, X_test, y_test, labels):
    y_pred = classifier.predict(X_test)

    confusion_mat = confusion_matrix(y_test, y_pred)
    confusion_mat = normalize(confusion_mat , axis=1 , norm='l1' )
    # Plot confusion_matrix
    fig, ax = plt.subplots(figsize=(10,8))
    sns.heatmap(confusion_mat, annot=True, cmap = "flare", fmt ="0.2f", xticklabels=labels, yticklabels=labels)

    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()

Now we are going to train each model on each task, repeating the train/validation split over 5 different random seeds and averaging the validation accuracy.

In [ ]:
spam_classifier_dict = {
    "bagofwords+binaryNB": Pipeline([('vectorizer', CountVectorizer()), ('classifier', BernoulliNB(alpha=0.2))]),
    "bagofwords+binaryLogistic": Pipeline([('vectorizer', CountVectorizer()), ('classifier', LogisticRegression(solver="saga" , multi_class="ovr", max_iter=200))])
}

news_classifier_dict = {
    "bagofwords+multiNB": Pipeline([('vectorizer', CountVectorizer()), ('classifier', MultinomialNB(alpha=0.2))]),
    "bagofwords+multiLogistic": Pipeline([('vectorizer', CountVectorizer()), ('classifier', LogisticRegression(solver="saga" , multi_class="auto", max_iter=200))])
}

print("// Spam Binary Task Evaluation //")
for model_name, model in spam_classifier_dict.items():
    print("~~~~~~~~~~~~~~~~~~~~")
    print(model_name + " : ")
    all_cross_val_scores = []
    for k in train_validation_random_state:
        classifier, score = train(
            classifier=model, 
            X_train=X_train_spam, 
            y_train=y_train_spam, 
            rnd_state_input=k
        )
        all_cross_val_scores.append(score)
    all_cross_val_scores_np = np.array(all_cross_val_scores)
    mean_score = all_cross_val_scores_np.mean()
    print("Mean accuracy score on spam: ", mean_score)
    plot_confusion_matrix(classifier, X_test_spam, y_test_spam, ["ham (0)", "spam (1)"])  # rows/columns are ordered 0 (ham), 1 (spam)

print("_______________________________________________________________")
print("// News Multi-label Task Evaluation //")
for model_name, model in news_classifier_dict.items():
    print("~~~~~~~~~~~~~~~~~~~~")
    print(model_name + " : ") 
    all_cross_val_scores = []
    for k in train_validation_random_state:
        classifier, score = train(
            classifier=model, 
            X_train=X_train_news, 
            y_train=y_train_news, 
            rnd_state_input=k
        )
        all_cross_val_scores.append(score)
    all_cross_val_scores_np = np.array(all_cross_val_scores)
    mean_score = all_cross_val_scores_np.mean()
    print("Mean accuracy score on news: ", mean_score)
    plot_confusion_matrix(classifier, X_test_news, y_test_news, news_labels)

Q: An open-ended question - Given that these findings are limited to the Bag-of-Words vectorization, what other vectorization methods could you use? What are some additional indexing themes that could help or hurt each task, given the preprocessing and analysis we did in the first section?

A: TODO - your answer here!

Q: An open-ended question - Which model seems to perform extremely poorly? Why do you think this might be the case?

A: TODO - your answer here!

In [ ]:
spam_classifier_dict = {
    "stopwords+tfidf+binaryNB": Pipeline([('vectorizer', TfidfVectorizer(stop_words=stopwords.words('english'))), ('classifier', BernoulliNB(alpha=0.005))]),
    "stopwords+tfidf+binaryLogistic": Pipeline([('vectorizer', TfidfVectorizer(stop_words=stopwords.words('english'))), ('classifier', LogisticRegression(solver="saga" , multi_class="ovr", max_iter=200))])
}

news_classifier_dict = {
    "stopwords+tfidf+multiNB": Pipeline([('vectorizer', TfidfVectorizer(stop_words=stopwords.words('english'))), ('classifier', MultinomialNB(alpha=0.005))]),
    "stopwords+tfidf+multiLogistic": Pipeline([('vectorizer', TfidfVectorizer(stop_words=stopwords.words('english'))), ('classifier', LogisticRegression(solver="saga" , multi_class="auto", max_iter=200))])
}

print("// Spam Binary Task Evaluation //")
for model_name, model in spam_classifier_dict.items():
    print("~~~~~~~~~~~~~~~~~~~~")
    print(model_name + " : ")
    all_cross_val_scores = []
    for k in train_validation_random_state:
        classifier, score = train(
            classifier=model, 
            X_train=X_train_spam, 
            y_train=y_train_spam, 
            rnd_state_input=k
        )
        all_cross_val_scores.append(score)
    all_cross_val_scores_np = np.array(all_cross_val_scores)
    mean_score = all_cross_val_scores_np.mean()
    print("Mean accuracy score on spam: ", mean_score)
    plot_confusion_matrix(classifier, X_test_spam, y_test_spam, ["ham (0)", "spam (1)"])  # rows/columns are ordered 0 (ham), 1 (spam)

print("_______________________________________________________________")
print("// News Multi-label Task Evaluation //")
for model_name, model in news_classifier_dict.items():
    print("~~~~~~~~~~~~~~~~~~~~")
    print(model_name + " : ") 
    all_cross_val_scores = []
    for k in train_validation_random_state:
        classifier, score = train(
            classifier=model, 
            X_train=X_train_news, 
            y_train=y_train_news, 
            rnd_state_input=k
        )
        all_cross_val_scores.append(score)
    all_cross_val_scores_np = np.array(all_cross_val_scores)
    mean_score = all_cross_val_scores_np.mean()
    print("Mean accuracy score on news: ", mean_score)
    plot_confusion_matrix(classifier, X_test_news, y_test_news, news_labels)

Q: What do you notice in the change of results? Do you find that the TF-IDF feature weighting has helped the task you expected it to help? Do you find that it has hurt the task you expected it to hurt?

A: TODO - your answer here!

3) Beyond feature engineering - LSTMs

A bag-of-words style of representation combined with a classifier ignores the order of the words in a text. Given the following sentences, can you see how this may be problematic?

I went to the bank to take a swim.

I went to the bank to withdraw money.

The meaning of the token bank is modulated by its context. To overcome this problem, you have seen in class that you can learn a vector space representation of the vocabulary, in which word representations are trained to be closer (through a cosine distance objective) when the words appear in similar context windows. Even in this situation, a word's distributional semantics are limited by the window size.
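
If you would like to experiment with this idea yourself, one option is the word2vec implementation in gensim (an extra package, not part of the setup above; this is only a sketch of the idea and is not used in the rest of the notebook):

# pip install gensim  (extra dependency, not in the setup list above)
from gensim.models import Word2Vec

tokenized_sms = [sms.lower().split() for sms in df['sms'].values]  # naive tokenization of the spam corpus
w2v = Word2Vec(sentences=tokenized_sms, vector_size=50, window=5, min_count=2, seed=seed)
print(w2v.wv.most_similar('free', topn=5))  # nearest neighbours by cosine similarity ('free' should be frequent enough to be in the vocabulary)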

Instead, we can make the classifier take the input text as a sequence. This family of machine learning algorithms is called Recurrent Neural Networks (RNNs). One popular variant of such networks, which you will see next week, is the LSTM (long short-term memory).

Let's implement one in the popular deep learning framework PyTorch! PyTorch has a text processing subpackage called torchtext that allows for easy indexing of tokens.

In [ ]:
# Field defines how each CSV column is tokenized and numericalized
TEXT = data.Field(tokenize='spacy', batch_first=True, include_lengths=True)
LABEL = data.LabelField(dtype=torch.float, batch_first=True)
# The first CSV column is the label ("type"), the second is the message text
fields = [("type", LABEL), ('text', TEXT)]

training_data = data.TabularDataset(
    path="spam.csv",
    format="csv",
    fields=fields,
    skip_header=True
)
print(vars(training_data.examples[0]))

train_data,valid_data = training_data.split(
    split_ratio=0.75,
    random_state=random.seed(42)
)
TEXT.build_vocab(
    train_data,
    min_freq=5
)

LABEL.build_vocab(train_data)
print("Size of text vocab:",len(TEXT.vocab))
print("Size of label vocab:",len(LABEL.vocab))
TEXT.vocab.freqs.most_common(10)
In [ ]:
# Use a GPU if one is available, otherwise fall back to the CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

BATCH_SIZE = 64
# NOTE: BucketIterator groups samples of similar length into the same batch, reducing the amount of padding needed.
train_iterator,validation_iterator = data.BucketIterator.splits(
    (train_data,valid_data),
    batch_size = BATCH_SIZE,
    # Sort key is how to sort the samples
    sort_key = lambda x:len(x.text),
    sort_within_batch = True,
    device = device
)
In [ ]:
class LSTMClassifier(nn.Module):
    
    def __init__(
        self,
        vocab_size=len(TEXT.vocab),
        embedding_dim=100,
        hidden_dim=64,
        output_dim=1,
        n_layers=2,
        bidirectional=True,
        dropout=0.2
    ):
        
        super(LSTMClassifier,self).__init__()
        
        # Embedding layer converts integer sequences to vector sequences
        self.embedding = nn.Embedding(vocab_size,embedding_dim)
        
        # LSTM layer process the vector sequences 
        self.lstm = nn.LSTM(embedding_dim,
                            hidden_dim,
                            num_layers = n_layers,
                            bidirectional = bidirectional,
                            dropout = dropout,
                            batch_first = True
                           )
        
        # Dense layer to predict 
        self.fc = nn.Linear(hidden_dim * 2,output_dim)
        # Prediction activation function
        self.sigmoid = nn.Sigmoid()
        
    
    def forward(self,text,text_lengths):
        embedded = self.embedding(text)
        
        # Thanks to packing, the LSTM doesn't see the padding tokens,
        # which avoids learning from padding and speeds up computation
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.cpu(),batch_first=True)
        
        packed_output,(hidden_state,cell_state) = self.lstm(packed_embedded)
        
        # Concatenating the final forward and backward hidden states
        hidden = torch.cat((hidden_state[-2,:,:], hidden_state[-1,:,:]), dim = 1)
        
        dense_outputs=self.fc(hidden)

        #Final activation function
        outputs=self.sigmoid(dense_outputs)
        
        return outputs

LSTM_model = LSTMClassifier()
print(LSTM_model)

LSTM_model = LSTM_model.to(device)
optimizer = optim.Adam(LSTM_model.parameters(),lr=1e-4)
criterion = nn.BCELoss() # Binary Cross Entropy Loss
criterion = criterion.to(device)
In [ ]:
def binary_accuracy(preds, y):
    #round predictions to the closest integer
    rounded_preds = torch.round(preds)
    
    correct = (rounded_preds == y).float() 
    acc = correct.sum() / len(correct)
    return acc

def evaluate(model,iterator,criterion):
    
    epoch_loss = 0.0
    epoch_acc = 0.0
    
    # deactivate the dropouts
    model.eval()
    
    # Disables gradient tracking during evaluation
    with torch.no_grad():
        for batch in iterator:
            text,text_lengths = batch.text
            
            predictions = model(text,text_lengths).squeeze()
              
            #compute loss and accuracy
            loss = criterion(predictions, batch.type)
            acc = binary_accuracy(predictions, batch.type)
            
            #keep track of loss and accuracy
            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

def train(model,iterator,optimizer,criterion):
    
    epoch_loss = 0.0
    epoch_acc = 0.0
    
    model.train()
    
    for batch in iterator:
        
        # reset the gradients accumulated from the previous batch
        optimizer.zero_grad()
        text,text_lengths = batch.text
        
        # forward propagation and squeezing
        predictions = model(text,text_lengths).squeeze()
        
        # computing loss / backward propagation
        loss = criterion(predictions,batch.type)
        loss.backward()
        
        # accuracy
        acc = binary_accuracy(predictions, batch.type)
        
        # updating params
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    # It'll return the means of loss and accuracy
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
In [ ]:
EPOCH_NUMBER = 25
for epoch in range(1,EPOCH_NUMBER+1):
    train_loss,train_acc = train(LSTM_model,train_iterator,optimizer,criterion)
    valid_loss,valid_acc = evaluate(LSTM_model,validation_iterator,criterion)
    
    # Showing statistics for this epoch
    print(f'Epoch {epoch:02}')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')
    print()

As you can see, the LSTM can reach similar performance on this simple spam classification task. We invite you to further investigate how LSTMs do on multi-class classification tasks, and on tasks where the input text length varies more, such as the news dataset.
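
If you want to poke at the trained model with a single new message, here is one possible sketch. It assumes spaCy's English tokenizer is available (the TEXT field above already uses tokenize='spacy') and that LABEL.vocab maps spam to 1, which you should verify with LABEL.vocab.stoi:

import spacy
spacy_en = spacy.blank("en")  # we only need its tokenizer

def predict_spam_probability(model, sentence):
    model.eval()
    tokens = [tok.text for tok in spacy_en.tokenizer(sentence)]
    indices = [TEXT.vocab.stoi[tok] for tok in tokens]          # unknown tokens fall back to <unk>
    lengths = torch.LongTensor([len(indices)])                  # lengths stay on the CPU for pack_padded_sequence
    tensor = torch.LongTensor(indices).unsqueeze(0).to(device)  # shape: [1, sequence_length]
    with torch.no_grad():
        return model(tensor, lengths).item()

print(predict_spam_probability(LSTM_model, "Congratulations! You have won a free prize, call now"))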