This practical session was produced using Jupyter. If you are used to it, you can download the corresponding notebook code from here. If not, no problem at all, this is not mandatory: simply proceed as usual in your favorite Python environment.
The aim of this practical session is to get yourself acquainted with the different models that can be used for NLP classification tasks and to get some exposure to the different statistical machine learning and deep learning packages.
As we have seen in the second practical session, a great example of a highly imbalanced task is spam classification. The goal of the task is to classify whether a given piece of text (e.g., an email or SMS message) is spam or not. Since there are only 2 classes (spam or not), we call such tasks binary classification. However, text classification tasks can also involve more than two classes (multi-class classification), such as news article topic prediction. Following these two tasks, this practical session has 3 sections:
Huge thanks to Reza and Mehmet for the inspiration; a big part of the code is adapted from their notebooks!
If you have not set up an environment in the previous PSs, follow these instructions. If you already have one, you can skip to installing the required packages for this PS.
While you can install the following packages with pip directly on your computer, we recommend (but do not require) using a virtual environment so that you don't mess up package versions across different projects.
First make sure you have a virtual environment (e.g., venv, virtualenv, conda) and that it has a Python version >= 3.6 (we recommend the latest 3.12). If you are using a Jupyter notebook, make sure the kernel points to the correct Python executable.
For example, we use conda to manage our environments, and we do the following to create a new one and activate it:
conda create --name inlp-venv
conda activate inlp-venv
Then install the following packages (this might take around 2 minutes):
pip install -U ipykernel
pip install -U pandas
pip install -U matplotlib
pip install -U scikit-learn
pip install -U seaborn
pip install -U nltk
pip install -U torch
pip install -U torchdata
pip install -U spacy
pip install -U torchtext==0.17.2
python -m spacy download en_core_web_sm
If you want to install them directly within the notebook, you can uncomment the following cell and run it (not advised, since you might not have the right kernel selected for this notebook). The idea is that anything preceded by ! is run as a shell command:
# !pip install -U ipykernel
# !pip install -U pandas
# !pip install -U matplotlib
# !pip install -U scikit-learn
# !pip install -U seaborn
# !pip install -U nltk
# !pip install -U torch
# !pip install -U torchdata
# !pip install -U spacy
# !pip install -U torchtext==0.17.2
# !python -m spacy download en_core_web_sm
Next, import the necessary packages:
Note: If this part of the code hangs for more than a minute, restart your kernel and rerun; importing packages multiple times can sometimes cause problems!
# 1) Importing necessary packages:
# general
import os
import string
import random
from collections import Counter
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
print("1")
# dataset + processing
from sklearn.datasets import fetch_20newsgroups
from sklearn.preprocessing import normalize
from sklearn.model_selection import train_test_split # , KFold
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.corpus import stopwords
print("2")
# classification models
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.linear_model import LogisticRegression
import warnings
from sklearn.exceptions import ConvergenceWarning
print("3")
# metrics
from sklearn.metrics import (
accuracy_score,
precision_score,
recall_score,
f1_score,
confusion_matrix,
)
print("4")
# LSTM part's torch packages
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torch.nn.utils.rnn import pad_sequence
import spacy
import warnings as wrn
wrn.filterwarnings("ignore")
print("5")
# 2) Setting the seed:
seed = 42
os.environ["PYTHONHASHSEED"] = str(seed)
random.seed(seed)
np.random.seed(seed)
print("All good!")
# 3) Downloading stopwords from NLTK (set the flag below to False if you haven't downloaded them before!)
nltk_stopwords_downloaded = True
if not nltk_stopwords_downloaded:
nltk.download("stopwords")
Now you are ready to start the exercises!
To solve and evaluate the spam task, we will use the same annotated English SMS corpus from Kaggle as in the first practical session. You can download the data here. Simply put it in the same folder as the notebook you are running. As we already did the analysis and pre-processing in the first practical session, we will quickly run the same steps here. Remember that the labels in the spam dataset are imbalanced, heavily skewed towards the ham label.
df = pd.read_csv("spam.csv", header=0, names=["label", "sms"], usecols=[0, 1])
# df.tail()
df.head()
As you can see there are 2 classes: "ham" & "spam". Let's take a look at their distribution.
df["label"] = df["label"].replace(to_replace={"ham": 0.0, "spam": 1.0})
print("Label percentages are:")
print(df.label.value_counts(normalize=True))
Then, to make sure that we don't overfit our models to the data, we split the data into train and test sets. We use the very convenient train_test_split function from scikit-learn. The test_size parameter lets us choose what fraction of the data goes into the test set. Here, $x$ is the SMS message, while $y$ is its corresponding label.
train_split_random_state = 11
X_train_spam, X_test_spam, y_train_spam, y_test_spam = train_test_split(
df["sms"].values,
df["label"].values,
test_size=0.2,
random_state=train_split_random_state,
)
print(
"Spam amount in train set: {} out of {} datapoints".format(
(y_train_spam == 1).sum(), len(y_train_spam)
)
)
print(
"Spam percentage in train set: {}%".format(
round((y_train_spam == 1).sum() / float(len(y_train_spam)) * 100, 4)
)
)
print("Size of train set is: ", len(y_train_spam))
print("Size of test set is: ", len(y_test_spam))
To solve and evaluate the news topic classification task, we will use the 20 Newsgroups dataset, which has about 19K articles across 20 different newsgroups. We download the data through scikit-learn, so you don't have to download it manually. Let's take a look at the label distribution in this dataset.
news = fetch_20newsgroups(subset="all")
print("Number of articles: ", len(news.data))
print("Number of different categories: ", len(news.target_names))
To look at the label distribution of the 20 classes, we plot a pie chart.
news_labels = news.target_names
sizes = [Counter(news.target)[i] for i in range(len(news_labels))]
plt.figure(figsize=(10, 8))
plt.pie(sizes, labels=news_labels, autopct="%1.1f%%")
plt.show()
Q: What do you notice about the label distribution of the news dataset compared to the spam one? Does this change your plan on which metric to use to evaluate the classifiers we will test in the next section?
A: TODO - your answer here!
Q: What is the average length of the texts in each task, and how much does the length vary? What part of this classification pipeline do you think would be most affected by this?
A: TODO - your answer here!
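If you want a quick way to inspect this, here is a small sketch (using the df and news objects loaded above) that computes rough token-count statistics for both corpora; interpreting them is left to you:
# Rough text lengths in whitespace-separated tokens for both tasks
spam_lengths = df["sms"].str.split().str.len()
news_lengths = pd.Series([len(text.split()) for text in news.data])
print("Spam: mean={:.1f}, std={:.1f}".format(spam_lengths.mean(), spam_lengths.std()))
print("News: mean={:.1f}, std={:.1f}".format(news_lengths.mean(), news_lengths.std()))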
We also divide this dataset into train and test sets.
X_train_news, X_test_news, y_train_news, y_test_news = train_test_split(
news.data, news.target, test_size=0.2, random_state=train_split_random_state
)
Feature engineering is where NLP task-specific knowledge comes in handy and makes it more likely for a simple classifier to learn the task. It requires us to index tokens and create meaningful representations out of them.
First we have to create a vocabulary. Some of the indexing themes you have seen in class include:
To represent a document as a vector, however, we need more than just indexing; we also need a vector space that represents the words (e.g., bag-of-words counts or TF-IDF weights).
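To make the indexing idea concrete, here is a tiny sketch (on two made-up sentences, separate from the real pipeline below) of how a CountVectorizer builds a vocabulary and turns documents into count vectors:
toy_corpus = ["free entry to win a prize", "see you at the meeting tomorrow"]
toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_corpus)
print(toy_vectorizer.vocabulary_)  # token -> column index
print(toy_counts.toarray())        # one bag-of-words count vector per sentence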
Luckily scikit-learn provides a Pipeline class in which we can chain the vectorizer and the classifier in the correct order. Take as an example the TF-IDF vectorizer and the first classifier you can think of, such as the Bernoulli Naive Bayes classifier (BernoulliNB). We can build such a model for the binary spam task as follows:
Pipeline([('vectorizer', TfidfVectorizer()), ('classifier', BernoulliNB(alpha=0.2))])
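Training and prediction then use the usual scikit-learn interface; for example (a small sketch using the spam splits created above; spam_pipeline is just an illustrative name):
spam_pipeline = Pipeline([("vectorizer", TfidfVectorizer()), ("classifier", BernoulliNB(alpha=0.2))])
spam_pipeline.fit(X_train_spam, y_train_spam)  # vectorizes the texts, then fits the classifier
print(spam_pipeline.predict(X_test_spam[:5]))  # predicted labels (0.0 = ham, 1.0 = spam) for 5 test messages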
Note that for a multi-class problem you can use MultinomialNB. In LogisticRegression, the multi_class parameter controls how multiple classes are handled: ovr fits one-vs-rest binary problems, while auto selects a multinomial (softmax) formulation when the solver supports it.
spam_classifier_dict = {
"bagofwords+binaryNB": Pipeline(
[("vectorizer", CountVectorizer()), ("classifier", BernoulliNB(alpha=0.2))]
),
"bagofwords+binaryLogistic": Pipeline(
[
("vectorizer", CountVectorizer()),
(
"classifier",
LogisticRegression(solver="saga", multi_class="ovr", max_iter=200),
),
]
),
}
news_classifier_dict = {
"bagofwords+multiNB": Pipeline(
[("vectorizer", CountVectorizer()), ("classifier", MultinomialNB(alpha=0.2))]
),
"bagofwords+multiLogistic": Pipeline(
[
("vectorizer", CountVectorizer()),
(
"classifier",
LogisticRegression(solver="saga", multi_class="auto", max_iter=200),
),
]
),
}
Now that we have our classifiers, we can train and validate them with cross-validation to see whether the vectorizer and classifier combination does well on the task. Here we make sure to further split the train dataset into several train and validation splits. This way the original test set stays unused, which prevents overfitting to it while we explore features and classification algorithms.
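Note that scikit-learn also provides helpers that do this kind of validation for you. The manual splitting below is kept for transparency, but a roughly equivalent one-liner (not identical, since it uses k-fold splits rather than repeated random splits) would look like this sketch:
from sklearn.model_selection import cross_val_score
# 5-fold cross-validation on the training portion only; the held-out test set stays untouched
scores = cross_val_score(spam_classifier_dict["bagofwords+binaryNB"], X_train_spam, y_train_spam, cv=5)
print(scores.mean())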
train_validation_random_state = [1, 5, 10, 15, 20]
def train(
classifier,
X_train,
y_train,
rnd_state_input,
test_split_size=0.1,
):
with warnings.catch_warnings():
warnings.simplefilter("ignore", category=ConvergenceWarning)
X_train, X_val, y_train, y_val = train_test_split(
X_train, y_train, test_size=test_split_size, random_state=rnd_state_input
)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_val)
        if rnd_state_input == 5:
            print("\t|| k=5 Accuracy: {:.4f}".format(accuracy_score(y_val, y_pred)))
            print(
                "\t|| k=5 Precision: {:.4f}".format(
                    precision_score(y_val, y_pred, average="macro")
                )
            )
            print(
                "\t|| k=5 Recall: {:.4f}".format(
                    recall_score(y_val, y_pred, average="macro")
                )
            )
            print("\t|| k=5 F1: {:.4f}".format(f1_score(y_val, y_pred, average="macro")))
return classifier, classifier.score(X_val, y_val)
def plot_confusion_matrix(classifier, X_test, y_test, labels, is_small_plot=False):
y_pred = classifier.predict(X_test)
confusion_mat = confusion_matrix(y_test, y_pred)
confusion_mat = normalize(confusion_mat, axis=1, norm="l1")
# Plot confusion_matrix
fig, ax = None, None
if is_small_plot:
fig, ax = plt.subplots(figsize=(5, 4))
else:
fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(
confusion_mat,
annot=True,
cmap="flare",
fmt="0.2f",
xticklabels=labels,
yticklabels=labels,
)
plt.ylabel("Actual")
plt.xlabel("Predicted")
plt.show()
Now we are going to train each model on the task and evaluate it across 5 random train/validation splits (our simple form of cross-validation).
print("// Spam Binary Task Evaluation //")
for model_name, model in spam_classifier_dict.items():
print("~~~~~~~~~~~~~~~~~~~~")
print(model_name + " : ")
all_cross_val_scores = []
for k in train_validation_random_state:
classifier, score = train(
classifier=model,
X_train=X_train_spam,
y_train=y_train_spam,
rnd_state_input=k,
)
all_cross_val_scores.append(score)
all_cross_val_scores_np = np.array(all_cross_val_scores)
mean_score = all_cross_val_scores_np.mean()
print("Mean accuracy score on spam: ", mean_score)
plot_confusion_matrix(
        classifier, X_test_spam, y_test_spam, ["ham", "spam"], is_small_plot=True
)
print("_______________________________________________________________")
print("// News Multi-label Task Evaluation //")
for model_name, model in news_classifier_dict.items():
print("~~~~~~~~~~~~~~~~~~~~")
print(model_name + " : ")
all_cross_val_scores = []
for k in train_validation_random_state:
classifier, score = train(
classifier=model,
X_train=X_train_news,
y_train=y_train_news,
rnd_state_input=k,
)
all_cross_val_scores.append(score)
all_cross_val_scores_np = np.array(all_cross_val_scores)
mean_score = all_cross_val_scores_np.mean()
print("Mean accuracy score on news: ", mean_score)
plot_confusion_matrix(classifier, X_test_news, y_test_news, news_labels)
Q: An open-ended question - Given that these findings are limited to the bag-of-words vectorization, what other vectorization methods could you use? What are some additional indexing themes that could help or hurt each task, given the preprocessing and analysis we did in the first section?
A: TODO - your answer here!
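As one concrete direction (by no means the only one), the vectorizer in the pipelines above can simply be swapped for one that uses n-grams, e.g.:
# Character n-grams (3-5 characters, within word boundaries) -- often robust to typos and obfuscation in spam
char_tfidf = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
# Word unigrams + bigrams
word_bigrams = CountVectorizer(ngram_range=(1, 2))
These could then be dropped into the same Pipeline setup used above.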
Q: An open-ended question - Which model seems to perform extremely poorly? Why do you think this might be the case?
A: TODO - your answer here!
spam_classifier_dict = {
"stopwords+tfidf+binaryNB": Pipeline(
[
("vectorizer", TfidfVectorizer(stop_words=stopwords.words("english"))),
("classifier", BernoulliNB(alpha=0.005)),
]
),
"stopwords+tfidf+binaryLogistic": Pipeline(
[
("vectorizer", TfidfVectorizer(stop_words=stopwords.words("english"))),
(
"classifier",
LogisticRegression(solver="saga", multi_class="ovr", max_iter=200),
),
]
),
}
news_classifier_dict = {
"stopwords+tfidf+multiNB": Pipeline(
[
("vectorizer", TfidfVectorizer(stop_words=stopwords.words("english"))),
("classifier", MultinomialNB(alpha=0.005)),
]
),
"stopwords+tfidf+multiLogistic": Pipeline(
[
("vectorizer", TfidfVectorizer(stop_words=stopwords.words("english"))),
(
"classifier",
LogisticRegression(solver="saga", multi_class="auto", max_iter=200),
),
]
),
}
print("// Spam Binary Task Evaluation //")
for model_name, model in spam_classifier_dict.items():
print("~~~~~~~~~~~~~~~~~~~~")
print(model_name + " : ")
all_cross_val_scores = []
for k in train_validation_random_state:
classifier, score = train(
classifier=model,
X_train=X_train_spam,
y_train=y_train_spam,
rnd_state_input=k,
)
all_cross_val_scores.append(score)
all_cross_val_scores_np = np.array(all_cross_val_scores)
mean_score = all_cross_val_scores_np.mean()
print("Mean accuracy score on spam: ", mean_score)
plot_confusion_matrix(
        classifier, X_test_spam, y_test_spam, ["ham", "spam"], is_small_plot=True
)
print("_______________________________________________________________")
print("// News Multi-label Task Evaluation //")
for model_name, model in news_classifier_dict.items():
print("~~~~~~~~~~~~~~~~~~~~")
print(model_name + " : ")
all_cross_val_scores = []
for k in train_validation_random_state:
classifier, score = train(
classifier=model,
X_train=X_train_news,
y_train=y_train_news,
rnd_state_input=k,
)
all_cross_val_scores.append(score)
all_cross_val_scores_np = np.array(all_cross_val_scores)
mean_score = all_cross_val_scores_np.mean()
print("Mean accuracy score on news: ", mean_score)
plot_confusion_matrix(classifier, X_test_news, y_test_news, news_labels)
Q: What do you notice in the change of results? Do you find that the TF-IDF feature augmentation has helped the task you expected it to help? Do you find that it has hurt the task you expected it to hurt?
A: TODO - your answer here!
A bag-of-words style of representation combined with a classifier misses the order of the words in a sentence. Given the following sentences, can you see how this may be problematic?
I went to the bank to take a swim.
I went to the bank to withdraw money.
The meaning of the token bank is modulated by its context. To overcome this problem, you have seen in class that you can learn a vector space representation of the vocabulary, in which word representations are trained to be closer (e.g., through a cosine-similarity objective) when the words appear in similar context windows. Even then, a word's distributional semantics are limited by the window size.
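To see the limitation concretely: a bag-of-words vectorizer maps the token bank to exactly the same feature in both sentences, so the two documents end up close in that vector space despite meaning very different things (a small sketch):
from sklearn.metrics.pairwise import cosine_similarity
bank_sentences = [
    "I went to the bank to take a swim.",
    "I went to the bank to withdraw money.",
]
bank_vecs = CountVectorizer().fit_transform(bank_sentences)
print(cosine_similarity(bank_vecs[0], bank_vecs[1]))  # high similarity, even though the meanings differ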
Instead, we can make the classifier take the input text as a sequence. This family of machine learning models is called Recurrent Neural Networks (RNNs). One popular variant, which you will see next week, is the LSTM (long short-term memory) network.
Let's implement one in the popular deep learning framework PyTorch! The PyTorch ecosystem has a text-processing package called torchtext that allows for easy indexing of tokens.
NOTE: Feel free to ignore the numpy warning.
# Define a tokenizer
nlp = spacy.load("en_core_web_sm")
tokenizer = get_tokenizer("spacy")
# Custom Dataset class
class SpamDataset(Dataset):
def __init__(self, dataframe, text_transform, label_transform):
self.data = dataframe
# function to process the text
self.text_transform = text_transform
# function to process the label
# self.label_transform = label_transform
def __len__(self):
return len(self.data)
def __getitem__(self, idx):
text = self.data.iloc[idx]["sms"]
label = self.data.iloc[idx]["label"]
return (
self.text_transform(text),
label,
# self.label_transform(label)
)
# Define transforms
def text_transform(text):
return torch.tensor([vocab[token] for token in tokenizer(text)])
# NOTE: already transformed in the beginning of the notebook
# def label_transform(label):
# return 1.0 if label == "spam" else 0.0
# Build vocabulary
def yield_tokens(data_iter):
for _, row in data_iter.iterrows():
yield tokenizer(row["sms"])
vocab = build_vocab_from_iterator(
    yield_tokens(df), min_freq=5, specials=["<unk>", "<pad>"]
)  # "<pad>" is included so the padding in collate_fn gets its own index (not the <unk> index)
vocab.set_default_index(vocab["<unk>"])
# Split data
train_df, valid_df = train_test_split(df, test_size=0.25, random_state=42)
# Create datasets
train_dataset = SpamDataset(train_df, text_transform, None)
valid_dataset = SpamDataset(valid_df, text_transform, None)
# Check vocab size
print("Size of text vocab:", len(vocab))
# Example of accessing the first example
print(train_dataset[0])
print(type(train_dataset[0][1]))
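# Quick optional check of how the vocabulary behaves: tokens that survived the
# min_freq=5 cutoff map to their own index, while anything unseen falls back to
# the <unk> index we set as the default.
print(vocab["<unk>"])                      # index reserved for unknown tokens
print(vocab["qwertyzxcv"])                 # an unseen token falls back to the <unk> index
print(text_transform("Free entry now!!"))  # tokenize + index an example message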
# Creating GPU variable
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
BATCH_SIZE = 64
# Custom collate function to handle padding within each batch
def collate_fn(batch):
# Separate text and label from the batch
texts, labels = zip(*batch)
# Pad the sequences to the length of the longest sequence in the batch
text_batch = pad_sequence(texts, batch_first=True, padding_value=vocab["<pad>"])
label_batch = torch.tensor(labels, dtype=torch.float32)
return text_batch.to(device), label_batch.to(device)
# Creating DataLoaders for training and validation
train_loader = DataLoader(
train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn
)
valid_loader = DataLoader(
valid_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn
)
# Example of accessing the first batch
for text_batch, label_batch in train_loader:
print("Text batch shape:", text_batch.shape)
print("Label batch shape:", label_batch.shape)
break
class LSTMClassifier(nn.Module):
def __init__(
self,
vocab_size=len(vocab),
embedding_dim=100,
hidden_dim=64,
output_dim=1,
n_layers=2,
bidirectional=True,
dropout=0.2,
):
super(LSTMClassifier, self).__init__()
# Embedding layer converts integer sequences to vector sequences
self.embedding = nn.Embedding(vocab_size, embedding_dim)
# LSTM layer processes the vector sequences
self.lstm = nn.LSTM(
embedding_dim,
hidden_dim,
num_layers=n_layers,
bidirectional=bidirectional,
dropout=dropout,
batch_first=True,
)
# Dense layer to predict
self.fc = nn.Linear(hidden_dim * 2, output_dim)
# Prediction activation function
self.sigmoid = nn.Sigmoid()
def forward(self, text):
embedded = self.embedding(text)
# Run the padded sequence directly through LSTM
lstm_out, (hidden_state, cell_state) = self.lstm(embedded)
# Concatenate the final forward and backward hidden states
hidden = torch.cat((hidden_state[-2, :, :], hidden_state[-1, :, :]), dim=1)
dense_outputs = self.fc(hidden)
# Final activation function
outputs = self.sigmoid(dense_outputs)
return outputs
# Instantiate the model, optimizer, and loss function
LSTM_model = LSTMClassifier()
print(LSTM_model)
LSTM_model = LSTM_model.to(device)
optimizer = optim.Adam(LSTM_model.parameters(), lr=1e-4)
criterion = nn.BCELoss() # Binary Cross Entropy Loss
criterion = criterion.to(device)
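# Optional sanity check: push one batch through the untrained model and confirm the
# output has shape (batch_size, 1), with values in (0, 1) coming from the sigmoid.
sample_text_batch, sample_label_batch = next(iter(train_loader))
with torch.no_grad():
    sample_out = LSTM_model(sample_text_batch)
print(sample_out.shape)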
def binary_accuracy(preds, y):
# round predictions to the closest integer
rounded_preds = torch.round(preds)
correct = (rounded_preds == y).float()
acc = correct.sum() / len(correct)
return acc
def evaluate(model, eval_loader, criterion):
epoch_loss = 0.0
epoch_acc = 0.0
# deactivate the dropouts
model.eval()
    # Disable gradient computation during evaluation
with torch.no_grad():
for text_batch, label_batch in eval_loader:
predictions = model(text_batch).squeeze(1)
# compute loss and accuracy
loss = criterion(predictions, label_batch)
acc = binary_accuracy(predictions, label_batch)
# keep track of loss and accuracy
epoch_loss += loss.item()
epoch_acc += acc.item()
return epoch_loss / len(eval_loader), epoch_acc / len(eval_loader)
def train(model, train_loader, optimizer, criterion):
epoch_loss = 0.0
epoch_acc = 0.0
model.train()
for text_batch, label_batch in train_loader:
# cleaning the cache of optimizer
optimizer.zero_grad()
# forward propagation and squeezing
predictions = model(text_batch).squeeze(1)
# computing loss / backward propagation
loss = criterion(predictions, label_batch)
loss.backward()
# accuracy
acc = binary_accuracy(predictions, label_batch)
# updating params
optimizer.step()
epoch_loss += loss.item()
epoch_acc += acc.item()
# It'll return the means of loss and accuracy
return epoch_loss / len(train_loader), epoch_acc / len(train_loader)
This part may take ~4-5 min
EPOCH_NUMBER = 25
for epoch in range(1, EPOCH_NUMBER + 1):
train_loss, train_acc = train(LSTM_model, train_loader, optimizer, criterion)
valid_loss, valid_acc = evaluate(LSTM_model, valid_loader, criterion)
# Showing statistics
print(f"\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%")
print(f"\t Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:.2f}%")
print()
As you can see, the LSTM can reach a performance similar to the simpler baselines on this spam classification task.
Now implement the same model for the news dataset task, where you need to deal with more than 2 labels and with varying input text lengths, and see how it compares to the baselines.