Text Classification: an exploration of different representations and learning algorithms

This practical session was produced using Jupyter. If you are used to it, you can download the corresponding notebook code from here. If not, no problem at all, this is not mandatory: simply proceed as usual in your favorite Python environment.

Introduction

The aim of this practical session is to get yourself acquainted with the different models that can be used for NLP classification tasks and to get some exposure to the different statistical machine learning and deep learning packages.

As we saw in the second practical session, a classic example of a highly imbalanced task is spam classification. The goal of the task is to classify whether a given piece of text (e.g., an email or SMS message) is spam or not. Since there are only 2 classes (spam or not), we call such tasks binary classification. However, text classification can also involve multi-class tasks, such as predicting the topic of a news article. Following these two tasks, this practical session has 3 sections:

  1. Brief Data Analysis & Processing:
    • What does the label distribution look like?
    • Train/test splitting
  2. Classical Algorithms:
    • Bag-of-Words vs. TF-IDF
    • Filtering: removing stopwords
    • Logistic Regression vs. Naive Bayes
  3. Beyond Feature Engineering:
    • LSTM for spam filtering

Acknowledgements

Huge thanks to Reza and Mehmet for the inspiration; a big part of the code is recycled from their notebooks!

Content Warning: this exercise's data may contain explicit words.


Setting up your environment

If you have not set up an environment in the previous PSs, follow these instructions. Otherwise, you can skip to installing the required packages for this PS.

While you can download the following packages with pip to your computer directly, we recommend (but do not require) that you use a virtual environment so as not to mess up the package versions across different projects.

First, make sure you have a virtual environment (e.g., venv, virtualenv, conda) and that it uses a Python version >= 3.6 (we recommend the latest 3.12). If you are using a Jupyter Notebook, make sure the interpreter points to the correct Python executable.

For example, we use conda to manage our environments, and we do the following to create a new one and activate it:

conda create --name inlp-venv
conda activate inlp-venv

Then install the following packages (this might take around 2 minutes):

pip install -U ipykernel
pip install -U pandas
pip install -U matplotlib
pip install -U scikit-learn
pip install -U seaborn
pip install -U nltk
pip install -U torch
pip install -U torchdata
pip install -U spacy
pip install -U torchtext==0.17.2

python -m spacy download en_core_web_sm

If you want to download them directly within the notebook, you can uncomment the following cell and run it (not advised, since you might not have the right kernel selected for this notebook). The idea is that anything preceded with ! is run as a shell command:

In [ ]:
# !pip install -U ipykernel
# !pip install -U pandas
# !pip install -U matplotlib
# !pip install -U scikit-learn
# !pip install -U seaborn
# !pip install -U nltk
# !pip install -U torch
# !pip install -U torchdata
# !pip install -U spacy
# !pip install -U torchtext==0.17.2

# !python -m spacy download en_core_web_sm

Next, import the necessary packages:

Note: If this part of the code hangs for more than a minute, restart your kernel and rerun; importing packages multiple times can sometimes cause problems!

In [ ]:
# 1) Importing necessary packages:
# general
import os
import string
import random
from collections import Counter
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk

print("1")
# dataset + processing
from sklearn.datasets import fetch_20newsgroups
from sklearn.preprocessing import normalize
from sklearn.model_selection import train_test_split  # , KFold
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.corpus import stopwords

print("2")
# classification models
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.linear_model import LogisticRegression
import warnings
from sklearn.exceptions import ConvergenceWarning

print("3")
# metrics
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix,
)

print("4")
# LSTM part's torch packages
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torch.nn.utils.rnn import pad_sequence
import spacy
warnings.filterwarnings("ignore")  # warnings was already imported above

print("5")
# 2) Setting the seed:
seed = 42
os.environ["PYTHONHASHSEED"] = str(seed)
random.seed(seed)
np.random.seed(seed)

print("All good!")
In [ ]:
# 3) Downloading stopwords from NLTK
# (set this flag to False if you haven't downloaded them before!)
nltk_stopwords_downloaded = True
if not nltk_stopwords_downloaded:
    nltk.download("stopwords")

Now you are ready to start the exercises!

1) Brief Data Analysis & Processing

a) Spam Dataset

To solve and evaluate the spam task, we will use the same annotated English SMS corpus from Kaggle as in the first practical session. You can download the data here. Simply put it in the same folder as the notebook you are running. As we did the analysis and pre-processing in the first practical session, we will quickly run the same steps. Remember that the labels in the spam dataset are imbalanced, heavily skewed towards the ham label.

In [ ]:
df = pd.read_csv("spam.csv", header=0, names=["label", "sms"], usecols=[0, 1])
# df.tail()
df.head()

As you can see there are 2 classes: "ham" & "spam". Let's take a look at their distribution.

In [ ]:
df["label"] = df["label"].replace(to_replace={"ham": 0.0, "spam": 1.0})
print("Label percentages are:")
print(df.label.value_counts(normalize=True))

Then, to make sure that we don't overfit our models to the data, we split it into train and test sets using the very convenient train_test_split function from scikit-learn. The test_size parameter lets us choose what fraction of the data goes into the test set. $x$ is the SMS message, while $y$ is its corresponding label.

In [ ]:
train_split_random_state = 11
X_train_spam, X_test_spam, y_train_spam, y_test_spam = train_test_split(
    df["sms"].values,
    df["label"].values,
    test_size=0.2,
    random_state=train_split_random_state,
)

print(
    "Spam amount in train set: {} out of {} datapoints".format(
        (y_train_spam == 1).sum(), len(y_train_spam)
    )
)
print(
    "Spam percentage in train set: {}%".format(
        round((y_train_spam == 1).sum() / float(len(y_train_spam)) * 100, 4)
    )
)

print("Size of train set is: ", len(y_train_spam))
print("Size of test set is: ", len(y_test_spam))

b) 20 Newsgroups Dataset

To solve and evaluate the news topic classification task, we will use the 20 Newsgroups dataset, which has roughly 19K articles across 20 different newsgroups. We download the data through scikit-learn, so you don't have to fetch it manually. Let's take a look at the label distribution in this dataset.

In [ ]:
news = fetch_20newsgroups(subset="all")
print("Number of articles: ", len(news.data))
print("Number of different categories: ", len(news.target_names))

To look at the label distribution of the 20 classes, we plot a pie chart.

In [ ]:
news_labels = news.target_names
sizes = [Counter(news.target)[i] for i in range(len(news_labels))]
plt.figure(figsize=(10, 8))
plt.pie(sizes, labels=news_labels, autopct="%1.1f%%")
plt.show()

Q: What do you notice about the label distribution of the news dataset compared to the spam one? Does this change your plan on which metric to use to evaluate the classifiers we will test in the next section?

A: TODO - your answer here!

Q: What is the average text length in each task, and how large is the length variance? Which part of the classification pipeline do you think would be most affected by this property?

A: TODO - your answer here!
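
As a starting point for this question, here is a minimal sketch for comparing text lengths across the two tasks (assuming df and news are loaded as above; whitespace splitting is only a rough proxy for tokenization):

spam_lengths = df["sms"].str.split().str.len()
news_lengths = pd.Series([len(text.split()) for text in news.data])
print("Spam: mean {:.1f} tokens, std {:.1f}".format(spam_lengths.mean(), spam_lengths.std()))
print("News: mean {:.1f} tokens, std {:.1f}".format(news_lengths.mean(), news_lengths.std()))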

We also divide this dataset into train and test sets.

In [ ]:
X_train_news, X_test_news, y_train_news, y_test_news = train_test_split(
    news.data, news.target, test_size=0.2, random_state=train_split_random_state
)

2) Choosing Features & a Classifier

Feature engineering is where NLP task-specific knowledge can come in handy, making it more likely for a simple classifier to learn the task. This requires us to index tokens and create meaningful representations out of them.

First we have to create a vocabulary. Some of the indexing themes you have seen in class include:

  • tokenization: splitting the text into units called tokens, which is required before indexing
  • stopwords: common words that can be filtered out (see the short sketch after this list)
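
To see both of these steps in action, here is a minimal sketch using NLTK's English stopword list (assuming the stopwords corpus has been downloaded as in the setup section):

tokens = "I went to the bank to withdraw money".lower().split()  # naive whitespace tokenization
english_stopwords = set(stopwords.words("english"))
print([t for t in tokens if t not in english_stopwords])
# -> ['went', 'bank', 'withdraw', 'money']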

To represent a document as a vector, however, we need more than just indexing; we also need a scheme for turning tokens into feature values:

  • Bag-of-Words model: a single document is treated as a bag of words together with how many times each word occurred, without caring about word order. The occurrence count is also called the term frequency. You can think of this as a vector over the whole vocabulary whose entries are how many times each term occurred (see the sketch after this list).
  • TF-IDF: term frequency–inverse document frequency diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.
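
To make the difference concrete, here is a minimal sketch contrasting the two vectorizers on a made-up toy corpus:

toy_corpus = [
    "free prize click now",
    "meeting at noon",
    "free free entry win a prize",
]

# Bag-of-Words: raw term counts per document
bow = CountVectorizer()
print(bow.fit_transform(toy_corpus).toarray())
print(bow.get_feature_names_out())

# TF-IDF: the same counts, reweighted by inverse document frequency
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(toy_corpus).toarray().round(2))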

Luckily, scikit-learn provides a Pipeline class in which we can chain the vectorizer and the classifier in the correct order. Take as an example the TF-IDF vectorizer and the first classifier you can think of, such as the Naive Bayes classifier (BernoulliNB). We can build such a model as follows, and then train and predict with it (see the sketch below):

Pipeline([('vectorizer', TfidfVectorizer()), ('classifier', BernoulliNB(alpha=0.2))])
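
Training and predicting then uses the standard scikit-learn interface (a minimal sketch, assuming the spam splits from Section 1):

pipe = Pipeline(
    [("vectorizer", TfidfVectorizer()), ("classifier", BernoulliNB(alpha=0.2))]
)
pipe.fit(X_train_spam, y_train_spam)
y_pred = pipe.predict(X_test_spam)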

Note that for a multi-class problem you can use MultinomialNB. In LogisticRegression, to handle the multi-class case, you can switch the multi_class parameter value from ovr (one-vs-rest) to auto.

In [ ]:
spam_classifier_dict = {
    "bagofwords+binaryNB": Pipeline(
        [("vectorizer", CountVectorizer()), ("classifier", BernoulliNB(alpha=0.2))]
    ),
    "bagofwords+binaryLogistic": Pipeline(
        [
            ("vectorizer", CountVectorizer()),
            (
                "classifier",
                LogisticRegression(solver="saga", multi_class="ovr", max_iter=200),
            ),
        ]
    ),
}

news_classifier_dict = {
    "bagofwords+multiNB": Pipeline(
        [("vectorizer", CountVectorizer()), ("classifier", MultinomialNB(alpha=0.2))]
    ),
    "bagofwords+multiLogistic": Pipeline(
        [
            ("vectorizer", CountVectorizer()),
            (
                "classifier",
                LogisticRegression(solver="saga", multi_class="auto", max_iter=200),
            ),
        ]
    ),
}

Now that we have our classifiers, we can train and validate them with repeated train/validation splits to see if the vectorizer and classifier combination does well on the task. Here we make sure to further separate the train dataset into several train and validation splits. This way the original test set is not used for model selection during feature engineering and classifier exploration (we only touch it to plot the final confusion matrices).
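
As an aside, scikit-learn can also perform proper k-fold cross-validation in a single call; a minimal sketch of that alternative (the cells below instead use repeated random splits):

from sklearn.model_selection import cross_val_score

scores = cross_val_score(
    spam_classifier_dict["bagofwords+binaryNB"], X_train_spam, y_train_spam, cv=5
)
print("5-fold accuracies:", scores, "| mean:", scores.mean())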

In [ ]:
train_validation_random_state = [1, 5, 10, 15, 20]


def train(
    classifier,
    X_train,
    y_train,
    rnd_state_input,
    test_split_size=0.1,
):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=ConvergenceWarning)
        X_train, X_val, y_train, y_val = train_test_split(
            X_train, y_train, test_size=test_split_size, random_state=rnd_state_input
        )
        classifier.fit(X_train, y_train)
        y_pred = classifier.predict(X_val)
        if rnd_state_input == 5:
            # report full metrics on one representative split (random_state=5)
            print("\t|| Accuracy: {:.4f}".format(accuracy_score(y_val, y_pred)))
            print(
                "\t|| Precision: {:.4f}".format(
                    precision_score(y_val, y_pred, average="macro")
                )
            )
            print(
                "\t|| Recall: {:.4f}".format(
                    recall_score(y_val, y_pred, average="macro")
                )
            )
            print("\t|| F1: {:.4f}".format(f1_score(y_val, y_pred, average="macro")))
        return classifier, classifier.score(X_val, y_val)


def plot_confusion_matrix(classifier, X_test, y_test, labels, is_small_plot=False):
    y_pred = classifier.predict(X_test)

    confusion_mat = confusion_matrix(y_test, y_pred)
    confusion_mat = normalize(confusion_mat, axis=1, norm="l1")
    # Plot the confusion matrix
    figsize = (5, 4) if is_small_plot else (10, 8)
    fig, ax = plt.subplots(figsize=figsize)
    sns.heatmap(
        confusion_mat,
        annot=True,
        cmap="flare",
        fmt="0.2f",
        xticklabels=labels,
        yticklabels=labels,
    )

    plt.ylabel("Actual")
    plt.xlabel("Predicted")
    plt.show()

Now we are going to train each model on the task and evaluate it over 5 different random train/validation splits (a simple form of cross-validation).

In [ ]:
print("// Spam Binary Task Evaluation //")
for model_name, model in spam_classifier_dict.items():
    print("~~~~~~~~~~~~~~~~~~~~")
    print(model_name + " : ")
    all_cross_val_scores = []
    for k in train_validation_random_state:
        classifier, score = train(
            classifier=model,
            X_train=X_train_spam,
            y_train=y_train_spam,
            rnd_state_input=k,
        )
        all_cross_val_scores.append(score)
    all_cross_val_scores_np = np.array(all_cross_val_scores)
    mean_score = all_cross_val_scores_np.mean()
    print("Mean accuracy score on spam: ", mean_score)
    plot_confusion_matrix(
        classifier, X_test_spam, y_test_spam, ["ham", "spam"], is_small_plot=True
    )

print("_______________________________________________________________")
print("// News Multi-label Task Evaluation //")
for model_name, model in news_classifier_dict.items():
    print("~~~~~~~~~~~~~~~~~~~~")
    print(model_name + " : ")
    all_cross_val_scores = []
    for k in train_validation_random_state:
        classifier, score = train(
            classifier=model,
            X_train=X_train_news,
            y_train=y_train_news,
            rnd_state_input=k,
        )
        all_cross_val_scores.append(score)
    all_cross_val_scores_np = np.array(all_cross_val_scores)
    mean_score = all_cross_val_scores_np.mean()
    print("Mean accuracy score on news: ", mean_score)
    plot_confusion_matrix(classifier, X_test_news, y_test_news, news_labels)

Q: An open-ended question - Given that these findings are limited to Bag-of-Words vectorization, what other vectorization methods could you use? What additional indexing themes could help or hurt each task, given the preprocessing and analysis we did in the first section?

A: TODO - your answer here!

Q: An open-ended question - Which model seems to perform extremely poorly? Why do you think this might be the case?

A: TODO - your answer here!

In [ ]:
spam_classifier_dict = {
    "stopwords+tfidf+binaryNB": Pipeline(
        [
            ("vectorizer", TfidfVectorizer(stop_words=stopwords.words("english"))),
            ("classifier", BernoulliNB(alpha=0.005)),
        ]
    ),
    "stopwords+tfidf+binaryLogistic": Pipeline(
        [
            ("vectorizer", TfidfVectorizer(stop_words=stopwords.words("english"))),
            (
                "classifier",
                LogisticRegression(solver="saga", multi_class="ovr", max_iter=200),
            ),
        ]
    ),
}

news_classifier_dict = {
    "stopwords+tfidf+multiNB": Pipeline(
        [
            ("vectorizer", TfidfVectorizer(stop_words=stopwords.words("english"))),
            ("classifier", MultinomialNB(alpha=0.005)),
        ]
    ),
    "stopwords+tfidf+multiLogistic": Pipeline(
        [
            ("vectorizer", TfidfVectorizer(stop_words=stopwords.words("english"))),
            (
                "classifier",
                LogisticRegression(solver="saga", multi_class="auto", max_iter=200),
            ),
        ]
    ),
}

print("// Spam Binary Task Evaluation //")
for model_name, model in spam_classifier_dict.items():
    print("~~~~~~~~~~~~~~~~~~~~")
    print(model_name + " : ")
    all_cross_val_scores = []
    for k in train_validation_random_state:
        classifier, score = train(
            classifier=model,
            X_train=X_train_spam,
            y_train=y_train_spam,
            rnd_state_input=k,
        )
        all_cross_val_scores.append(score)
    all_cross_val_scores_np = np.array(all_cross_val_scores)
    mean_score = all_cross_val_scores_np.mean()
    print("Mean accuracy score on spam: ", mean_score)
    plot_confusion_matrix(
        classifier, X_test_spam, y_test_spam, ["ham", "spam"], is_small_plot=True
    )

print("_______________________________________________________________")
print("// News Multi-label Task Evaluation //")
for model_name, model in news_classifier_dict.items():
    print("~~~~~~~~~~~~~~~~~~~~")
    print(model_name + " : ")
    all_cross_val_scores = []
    for k in train_validation_random_state:
        classifier, score = train(
            classifier=model,
            X_train=X_train_news,
            y_train=y_train_news,
            rnd_state_input=k,
        )
        all_cross_val_scores.append(score)
    all_cross_val_scores_np = np.array(all_cross_val_scores)
    mean_score = all_cross_val_scores_np.mean()
    print("Mean accuracy score on news: ", mean_score)
    plot_confusion_matrix(classifier, X_test_news, y_test_news, news_labels)

Q: What do you notice in the change of results? Did the feature augmentation of TF-IDF help the task you expected it to help? Did it hurt the task you expected it to hurt?

A: TODO - your answer here!

3) Beyond feature engineering - LSTMs

A bag-of-words style of representation combined with a classifier ignores the order of the words in a sentence. Given the following sentences, can you see how this may be problematic?

I went to the bank to take a swim.

I went to the bank to withdraw money.

The meaning of the token bank is modulated by its context. To overcome this problem, you have seen in class that you can learn a vector space representation of the vocabulary, in which word representations are trained to be closer (e.g., through a similarity objective) when the words appear in similar context windows. Even in this setting, a word's distributional semantics are limited by the window size.
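
To make the idea of closeness in a vector space concrete, here is a toy sketch with a randomly initialized embedding table (untrained, so the similarity value is meaningless; real methods such as word2vec learn these vectors from context):

emb = nn.Embedding(num_embeddings=100, embedding_dim=8)  # toy vocabulary of 100 word ids
v_a, v_b = emb(torch.tensor(3)), emb(torch.tensor(7))  # two arbitrary word ids
print(torch.nn.functional.cosine_similarity(v_a, v_b, dim=0))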

Instead, we can make the classifier consume the input text as a sequence. This family of machine learning models is called Recurrent Neural Networks (RNNs). One popular variant, which you will see next week, is the LSTM (long short-term memory) network.

Let's implement one in the popular deep learning framework PyTorch! PyTorch has a text processing subpackage called torchtext that allows for easy indexing of tokens.

NOTE: Feel free to ignore the numpy warning.

Dataset processing

In [ ]:
# Define a tokenizer
nlp = spacy.load("en_core_web_sm")
tokenizer = get_tokenizer("spacy")


# Custom Dataset class
class SpamDataset(Dataset):
    def __init__(self, dataframe, text_transform, label_transform):
        self.data = dataframe
        # function to process the text
        self.text_transform = text_transform
        # function to process the label
        # self.label_transform = label_transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        text = self.data.iloc[idx]["sms"]
        label = self.data.iloc[idx]["label"]
        return (
            self.text_transform(text),
            label,
            # self.label_transform(label)
        )


# Define transforms
def text_transform(text):
    return torch.tensor([vocab[token] for token in tokenizer(text)])


# NOTE: already transformed in the beginning of the notebook
# def label_transform(label):
#     return 1.0 if label == "spam" else 0.0


# Build vocabulary
def yield_tokens(data_iter):
    for _, row in data_iter.iterrows():
        yield tokenizer(row["sms"])


# include <pad> as a special token so the collate function below can pad with a dedicated index
vocab = build_vocab_from_iterator(
    yield_tokens(df), min_freq=5, specials=["<unk>", "<pad>"]
)
vocab.set_default_index(vocab["<unk>"])

# Split data
train_df, valid_df = train_test_split(df, test_size=0.25, random_state=42)

# Create datasets
train_dataset = SpamDataset(train_df, text_transform, None)
valid_dataset = SpamDataset(valid_df, text_transform, None)

# Check vocab size
print("Size of text vocab:", len(vocab))

# Example of accessing the first example
print(train_dataset[0])
print(type(train_dataset[0][1]))
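
Since we set a default index, any out-of-vocabulary token maps to <unk>; a quick sanity check (the token below is a made-up string):

print(vocab["qwertyasdf"] == vocab["<unk>"])  # True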

Dataloader creation

In [ ]:
# Creating GPU variable
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
BATCH_SIZE = 64

# Custom collate function to handle padding within each batch
def collate_fn(batch):
    # Separate text and label from the batch
    texts, labels = zip(*batch)

    # Pad the sequences to the length of the longest sequence in the batch
    text_batch = pad_sequence(texts, batch_first=True, padding_value=vocab["<pad>"])
    label_batch = torch.tensor(labels, dtype=torch.float32)

    return text_batch.to(device), label_batch.to(device)

# Creating DataLoaders for training and validation
train_loader = DataLoader(
    train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn
)

valid_loader = DataLoader(
    valid_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn
)

# Example of accessing the first batch
for text_batch, label_batch in train_loader:
    print("Text batch shape:", text_batch.shape)
    print("Label batch shape:", label_batch.shape)
    break

Classifier creation

In [ ]:
class LSTMClassifier(nn.Module):
    def __init__(
        self,
        vocab_size=len(vocab),
        embedding_dim=100,
        hidden_dim=64,
        output_dim=1,
        n_layers=2,
        bidirectional=True,
        dropout=0.2,
    ):
        super(LSTMClassifier, self).__init__()

        # Embedding layer converts integer sequences to vector sequences
        self.embedding = nn.Embedding(vocab_size, embedding_dim)

        # LSTM layer processes the vector sequences
        self.lstm = nn.LSTM(
            embedding_dim,
            hidden_dim,
            num_layers=n_layers,
            bidirectional=bidirectional,
            dropout=dropout,
            batch_first=True,
        )

        # Dense layer to predict (hidden_dim * 2 because the LSTM is bidirectional,
        # so the forward and backward final hidden states are concatenated)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        # Prediction activation function
        self.sigmoid = nn.Sigmoid()

    def forward(self, text):
        embedded = self.embedding(text)

        # Run the padded sequence directly through LSTM
        lstm_out, (hidden_state, cell_state) = self.lstm(embedded)

        # Concatenate the final forward and backward hidden states
        hidden = torch.cat((hidden_state[-2, :, :], hidden_state[-1, :, :]), dim=1)

        dense_outputs = self.fc(hidden)

        # Final activation function
        outputs = self.sigmoid(dense_outputs)

        return outputs


# Instantiate the model, optimizer, and loss function
LSTM_model = LSTMClassifier()
print(LSTM_model)

LSTM_model = LSTM_model.to(device)
optimizer = optim.Adam(LSTM_model.parameters(), lr=1e-4)
criterion = nn.BCELoss()  # Binary Cross Entropy Loss
criterion = criterion.to(device)

Train & evaluation utilities

In [ ]:
def binary_accuracy(preds, y):
    # round predictions to the closest integer
    rounded_preds = torch.round(preds)

    correct = (rounded_preds == y).float()
    acc = correct.sum() / len(correct)
    return acc


def evaluate(model, eval_loader, criterion):

    epoch_loss = 0.0
    epoch_acc = 0.0

    # deactivate the dropouts
    model.eval()

    # Disable gradient tracking during evaluation
    with torch.no_grad():
        for text_batch, label_batch in eval_loader:
            predictions = model(text_batch).squeeze(1)

            # compute loss and accuracy
            loss = criterion(predictions, label_batch)
            acc = binary_accuracy(predictions, label_batch)

            # keep track of loss and accuracy
            epoch_loss += loss.item()
            epoch_acc += acc.item()

    return epoch_loss / len(eval_loader), epoch_acc / len(eval_loader)


def train_epoch(model, train_loader, optimizer, criterion):

    epoch_loss = 0.0
    epoch_acc = 0.0

    model.train()

    for text_batch, label_batch in train_loader:

        # reset gradients accumulated from the previous step
        optimizer.zero_grad()

        # forward propagation and squeezing
        predictions = model(text_batch).squeeze(1)

        # computing loss / backward propagation
        loss = criterion(predictions, label_batch)
        loss.backward()

        # accuracy
        acc = binary_accuracy(predictions, label_batch)

        # updating params
        optimizer.step()

        epoch_loss += loss.item()
        epoch_acc += acc.item()

    # Return the mean loss and accuracy over all batches
    return epoch_loss / len(train_loader), epoch_acc / len(train_loader)

Training

This part may take ~4-5 min

In [ ]:
EPOCH_NUMBER = 25
for epoch in range(1, EPOCH_NUMBER + 1):
    train_loss, train_acc = train_epoch(
        LSTM_model, train_loader, optimizer, criterion
    )
    valid_loss, valid_acc = evaluate(LSTM_model, valid_loader, criterion)

    # Showing statistics
    print(f"\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%")
    print(f"\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%")
    print()

As you can see, the LSTM reaches performance similar to the classical baselines on this simple spam classification task.

Now implement the same model for the news dataset task, where you need to handle more than 2 labels and widely varying input text lengths, and see how it compares to the baselines.
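
As a hint, here is a minimal sketch of the pieces that change in the multi-class setup (assumptions: 20 classes, integer class labels, and a news-specific vocabulary and DataLoaders built analogously to the spam ones):

NUM_CLASSES = 20

# Reuse LSTMClassifier with one output unit per class; also remove the final
# sigmoid inside forward(), since nn.CrossEntropyLoss expects raw logits.
news_model = LSTMClassifier(output_dim=NUM_CLASSES).to(device)
criterion = nn.CrossEntropyLoss()

# Labels become integer class ids (dtype=torch.long) in the collate function,
# and accuracy uses argmax instead of rounding:
# preds = news_model(text_batch).argmax(dim=1)
# acc = (preds == label_batch).float().mean()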