This practical session was produced using Jupyter. If you are used to it, you can download the corresponding notebook code from here. If not, no problem at all; this is not mandatory: simply proceed as usual in your favorite Python environment.
The aim of this practical session is to get yourself acquainted with the different models that can be used for NLP classification tasks and to get some exposure to the different statistical machine learning and deep learning packages.
As we have seen in the second practical session, a great example of a task with a heavily imbalanced label distribution is spam classification. The goal of the task is to classify whether a given piece of text (e.g., email, SMS message) is spam or not. Since there are only 2 classes (spam or not), we call such tasks binary classification. However, text classification also includes multi-class tasks such as news article topic prediction. Following these two tasks, this practical session has 3 sections:
Huge thanks to Reza and Mehmet for the inspiration; a big part of the code is adapted from their notebooks!
While you can install the following packages with pip directly on your computer, we recommend (but do not require) using a virtual environment so that you don't mess up package versions across different projects. If you'd like to, here is a quick tutorial on virtual environments that you can check out with an EPFL email. Alternatively, you can use the EPFL Jupyter notebook service noto; however, you will then have to pip install some specific packages such as torchtext yourself.
pip install -U ipykernel
pip install -U ipywidgets
pip install -U pip setuptools wheel
pip install -U pandas
pip install -U matplotlib
pip install -U scikit-learn
pip install -U seaborn
pip install -U nltk
pip install -U torch
pip install -U torchtext==0.10.0
pip install -U torchdata
Note: If this part of the code hangs, simply restart your kernel and rerun, sometimes importing packages multiple times can create a problem!
# 1) Importing necessary packages:
# general
import os
import string
import random
from collections import Counter
#
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
print("1")
# dataset + processing
from sklearn.datasets import fetch_20newsgroups
from sklearn.preprocessing import normalize
from sklearn.model_selection import train_test_split #, KFold
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
#
# from nltk import word_tokenize
# from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
print("2")
# classification models
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.linear_model import LogisticRegression
import warnings
from sklearn.exceptions import ConvergenceWarning
print("3")
# metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
# LSTM part's packages
import torch
import torch.nn as nn
import torch.optim as optim
from torchtext.legacy import data  # NLP data utilities for PyTorch (requires torchtext==0.10.0)
warnings.filterwarnings('ignore')  # warnings was already imported above; silence the remaining warnings
print("4")
# 2) Setting the seed:
seed = 42
os.environ['PYTHONHASHSEED']=str(seed)
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)  # also seed PyTorch for the LSTM part later on
print("5")
# 3) Downloading stopwords from NLTK
nltk_stopwords_downloaded = True  # set this to False if you haven't downloaded them before!
if not nltk_stopwords_downloaded:
nltk.download('stopwords')
Now you are ready to start the exercises!
To solve and evaluate the spam task, we will use the same annotated English SMS corpus from Kaggle as in the first practical session. You can download the data here. Simply put it in the same folder as the notebook you are running. As we already did the analysis and pre-processing in the first practical session, we will quickly rerun the same steps. Remember that the labels in the spam dataset are imbalanced, heavily skewed towards the ham label.
df = pd.read_csv("spam.csv", header=0, names=['label','sms'], usecols=[0,1])
# df.tail()
df.head()
As you can see there are 2 classes: "ham" & "spam". Let's take a look at their distribution.
df['label'] = df['label'].replace(to_replace={'ham': 0, 'spam': 1})
print("Label percentages are:")
print(df.label.value_counts(normalize=True))
Then, to make sure that we don't overfit our models to the data, we split the data into train and test sets. We use the very convenient train_test_split function from scikit-learn. The test_size parameter lets us choose what percentage of the data goes into the test set. $x$ is the SMS message, while $y$ is its corresponding label.
train_split_random_state = 11
X_train_spam, X_test_spam, y_train_spam, y_test_spam = train_test_split(
df['sms'].values,
df['label'].values,
test_size=0.2,
random_state=train_split_random_state
)
print("Spam amount in train set: {} out of {} datapoints".format((y_train_spam == 1).sum(), len(y_train_spam)))
print("Spam percentage in train set: {}%".format(round((y_train_spam == 1).sum() / float(len(y_train_spam)) * 100, 4)))
print("Size of train set is: ", len(y_train_spam))
print("Size of test set is: ", len(y_test_spam))
To solve and evaluate the news topic classification task, we will use the 20 Newsgroups dataset, which has roughly 19k articles spread over 20 different newsgroups. We download the data through scikit-learn, so you don't have to fetch it manually. Let's take a look at the label distribution in this dataset.
news = fetch_20newsgroups(subset='all')
print("Number of articles: ", len(news.data))
print("Number of different categories: ", len(news.target_names))
To look at the label distribution of the 20 classes, we plot a pie chart.
news_labels = news.target_names
sizes = [Counter(news.target)[i] for i in range(len(news_labels))]
plt.figure(figsize=(10,8))
plt.pie(sizes, labels=news_labels, autopct='%1.1f%%')
plt.show()
Q: What do you notice about the label distribution of the news dataset compared to the spam one? Does this change your plan on which metric to use to evaluate the classifiers we will test in the next section?
A: TODO - your answer here!
SOLUTION: The labels are not as imbalanced as in the spam classification dataset; all classes are roughly uniformly distributed.
We also divide this dataset into train and test sets.
X_train_news, X_test_news, y_train_news, y_test_news = train_test_split(
news.data,
news.target,
test_size=0.2,
random_state=train_split_random_state
)
Feature engineering is where NLP task-specific knowledge comes in handy and makes it more likely that a simple classifier can learn the task. It requires us to index tokens and create meaningful representations out of them.
First we have to create a vocabulary. Some of the indexing themes you have seen in class include tokenization, lowercasing, stemming, stopword removal, and n-grams.
To represent a document as a vector, however, we need more than just indexing: we also need a vector space in which to represent the words, for example bag-of-words counts or TF-IDF weights.
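As a small illustration (a minimal sketch on two toy sentences, not part of the exercise data), scikit-learn's CountVectorizer builds exactly such a vocabulary index and turns each document into a vector of token counts:
toy_docs = ["the cat sat on the mat", "the dog ate my homework"]
toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)  # learns the vocabulary, then vectorizes the documents
print(toy_vectorizer.vocabulary_)  # token -> column index
print(toy_counts.toarray())        # one row of counts per document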
Luckily, scikit-learn provides a Pipeline class where we can chain the vectorizer and the classifier in the correct order. Take as an example the TF-IDF vectorizer and the first classifier you can think of, such as the Naive Bayes classifier (BernoulliNB). We can do the following to build a model that can be trained and used for prediction on a binary task.
Pipeline([('vectorizer', TfidfVectorizer()), ('classifier', BernoulliNB(alpha=0.2))])
Note that for a multi-class problem you can use MultinomialNB. In LogisticRegression, to handle the multi-class case, you can switch the multi_class parameter value from ovr (one-vs-rest) to auto.
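As a quick sanity check (a minimal sketch reusing the spam split created above; the variable names are ours), such a pipeline can be fit and used for prediction as follows:
spam_pipeline = Pipeline([('vectorizer', TfidfVectorizer()), ('classifier', BernoulliNB(alpha=0.2))])
spam_pipeline.fit(X_train_spam, y_train_spam)        # fits the vectorizer, then the classifier
y_pred_sanity = spam_pipeline.predict(X_test_spam)   # vectorizes and classifies in one call
print("Accuracy:", accuracy_score(y_test_spam, y_pred_sanity))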
spam_classifier_dict = {
"bagofwords+binaryNB": Pipeline([('vectorizer', CountVectorizer()), ('classifier', BernoulliNB(alpha=0.2))]),
"bagofwords+binaryLogistic": Pipeline([('vectorizer', CountVectorizer()), ('classifier', LogisticRegression(solver="saga" , multi_class="ovr", max_iter=200))])
}
news_classifier_dict = {
"bagofwords+multiNB": Pipeline([('vectorizer', CountVectorizer()), ('classifier', MultinomialNB(alpha=0.2))]),
"bagofwords+multiLogistic": Pipeline([('vectorizer', CountVectorizer()), ('classifier', LogisticRegression(solver="saga" , multi_class="auto", max_iter=200))])
}
Now that we have our classifiers, we can train and validate them with cross-validation to see whether the vectorizer and classifier combination does well on the task. Here we make sure to further split the training data into several train and validation splits. This way the original test set remains unused, which prevents overfitting during feature engineering and classification algorithm exploration.
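As an aside, scikit-learn also ships a cross_val_score helper that runs k-fold cross-validation on a pipeline directly. We use our own helper below so that we can also print per-split metrics, but a minimal equivalent sketch would look like this:
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(
    Pipeline([('vectorizer', CountVectorizer()), ('classifier', BernoulliNB(alpha=0.2))]),
    X_train_spam,
    y_train_spam,
    cv=5  # 5-fold cross-validation, using the training set only
)
print("Mean CV accuracy:", cv_scores.mean())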
train_validation_random_state = [1,5,10,15,20]
def train(
classifier,
X_train,
y_train,
rnd_state_input ,
test_split_size=0.1,
):
with warnings.catch_warnings():
warnings.simplefilter("ignore", category=ConvergenceWarning)
X_train, X_val, y_train, y_val = train_test_split(
X_train,
y_train,
test_size=test_split_size,
random_state=rnd_state_input
)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_val)
if rnd_state_input == 5:
print("\t|| k=5 Accuracy: {}% ".format(accuracy_score(y_val, y_pred)))
print("\t|| k=5 Precision: {}% ".format(precision_score(y_val, y_pred, average='macro')))
print("\t|| k=5 Recall: {}% ".format(recall_score(y_val, y_pred, average='macro')))
print("\t|| k=5 F1: {}% ".format(f1_score(y_val, y_pred, average='macro')))
return classifier, classifier.score(X_val, y_val)
def plot_confusion_matrix(classifier, X_test, y_test, labels):
y_pred = classifier.predict(X_test)
confusion_mat = confusion_matrix(y_test, y_pred)
confusion_mat = normalize(confusion_mat , axis=1 , norm='l1' )
# Plot confusion_matrix
fig, ax = plt.subplots(figsize=(10,8))
sns.heatmap(confusion_mat, annot=True, cmap = "flare", fmt ="0.2f", xticklabels=labels, yticklabels=labels)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
Now we are going to train each model on its task, validating on 5 different random train/validation splits (our simple version of k=5 cross-validation).
spam_classifier_dict = {
"bagofwords+binaryNB": Pipeline([('vectorizer', CountVectorizer()), ('classifier', BernoulliNB(alpha=0.2))]),
"bagofwords+binaryLogistic": Pipeline([('vectorizer', CountVectorizer()), ('classifier', LogisticRegression(solver="saga" , multi_class="ovr", max_iter=200))])
}
news_classifier_dict = {
"bagofwords+multiNB": Pipeline([('vectorizer', CountVectorizer()), ('classifier', MultinomialNB(alpha=0.2))]),
"bagofwords+multiLogistic": Pipeline([('vectorizer', CountVectorizer()), ('classifier', LogisticRegression(solver="saga" , multi_class="auto", max_iter=200))])
}
print("// Spam Binary Task Evaluation //")
for model_name, model in spam_classifier_dict.items():
print("~~~~~~~~~~~~~~~~~~~~")
print(model_name + " : ")
all_cross_val_scores = []
for k in train_validation_random_state:
classifier, score = train(
classifier=model,
X_train=X_train_spam,
y_train=y_train_spam,
rnd_state_input=k
)
all_cross_val_scores.append(score)
all_cross_val_scores_np = np.array(all_cross_val_scores)
mean_score = all_cross_val_scores_np.mean()
print("Mean accuracy score on spam: ", mean_score)
    plot_confusion_matrix(classifier, X_test_spam, y_test_spam, ["ham", "spam"])  # rows/columns are ordered as [ham (0), spam (1)]
print("_______________________________________________________________")
print("// News Multi-label Task Evaluation //")
for model_name, model in news_classifier_dict.items():
print("~~~~~~~~~~~~~~~~~~~~")
print(model_name + " : ")
all_cross_val_scores = []
for k in train_validation_random_state:
classifier, score = train(
classifier=model,
X_train=X_train_news,
y_train=y_train_news,
rnd_state_input=k
)
all_cross_val_scores.append(score)
all_cross_val_scores_np = np.array(all_cross_val_scores)
mean_score = all_cross_val_scores_np.mean()
print("Mean accuracy score on news: ", mean_score)
plot_confusion_matrix(classifier, X_test_news, y_test_news, news_labels)
Q: An open ended question - Given that these findings are limited to the Bag-of-Words vectorization, what other vectorization methods could you use? What are some additional indexing themes that could help or hurt each task given the preprocessing and analysis we have done in the first section?
A: TODO - your answer here!
SOLUTION: We can use stopword filtering to improve the vectorizer, especially on the news task. We would also expect TF-IDF to work well for the news task, as it has longer documents that relate to each other.
Q: An open-ended question - Which model seems to perform extremely poorly? Why do you think this might be the case?
A: TODO - your answer here!
spam_classifier_dict = {
"stopwords+tfidf+binaryNB": Pipeline([('vectorizer', TfidfVectorizer(stop_words=stopwords.words('english'))), ('classifier', BernoulliNB(alpha=0.005))]),
"stopwords+tfidf+binaryLogistic": Pipeline([('vectorizer', TfidfVectorizer(stop_words=stopwords.words('english'))), ('classifier', LogisticRegression(solver="saga" , multi_class="ovr", max_iter=200))])
}
news_classifier_dict = {
"stopwords+tfidf+multiNB": Pipeline([('vectorizer', TfidfVectorizer(stop_words=stopwords.words('english'))), ('classifier', MultinomialNB(alpha=0.005))]),
"stopwords+tfidf+multiLogistic": Pipeline([('vectorizer', TfidfVectorizer(stop_words=stopwords.words('english'))), ('classifier', LogisticRegression(solver="saga" , multi_class="auto", max_iter=200))])
}
print("// Spam Binary Task Evaluation //")
for model_name, model in spam_classifier_dict.items():
print("~~~~~~~~~~~~~~~~~~~~")
print(model_name + " : ")
all_cross_val_scores = []
for k in train_validation_random_state:
classifier, score = train(
classifier=model,
X_train=X_train_spam,
y_train=y_train_spam,
rnd_state_input=k
)
all_cross_val_scores.append(score)
all_cross_val_scores_np = np.array(all_cross_val_scores)
mean_score = all_cross_val_scores_np.mean()
print("Mean accuracy score on spam: ", mean_score)
    plot_confusion_matrix(classifier, X_test_spam, y_test_spam, ["ham", "spam"])  # rows/columns are ordered as [ham (0), spam (1)]
print("_______________________________________________________________")
print("// News Multi-label Task Evaluation //")
for model_name, model in news_classifier_dict.items():
print("~~~~~~~~~~~~~~~~~~~~")
print(model_name + " : ")
all_cross_val_scores = []
for k in train_validation_random_state:
classifier, score = train(
classifier=model,
X_train=X_train_news,
y_train=y_train_news,
rnd_state_input=k
)
all_cross_val_scores.append(score)
all_cross_val_scores_np = np.array(all_cross_val_scores)
mean_score = all_cross_val_scores_np.mean()
print("Mean accuracy score on news: ", mean_score)
plot_confusion_matrix(classifier, X_test_news, y_test_news, news_labels)
Q: What do you notice in the change of results? Do you find that the feature augmentation of TF-IDF has helped the task you expected it to help? Do you find that the feature augmentation of TF-IDF has hurt the task you expected it to hurt?
A: TODO - your answer here!
SOLUTION:
A bag-of-words style of representation combined with a classifier ignores the order of the words in a sentence and the context in which they appear. Given the following two sentences, can you see how this may be problematic?
I went to the bank to take a swim.
I went to the bank to withdraw money.
The meaning of the token bank is modulated by its context. To overcome this problem, you have seen in class that you can learn a vector space representation of the vocabulary, in which word representations are trained to be closer (through a cosine distance objective) according to the context windows in which they are used. Even in this situation, a word's distributional semantics are limited by the window size.
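Concretely, a bag-of-words vectorizer gives the token bank exactly the same dimension and count in both sentences, so the two meanings are indistinguishable to the classifier. A minimal sketch, reusing CountVectorizer from the first section:
bank_sentences = [
    "I went to the bank to take a swim.",
    "I went to the bank to withdraw money."
]
bank_vectorizer = CountVectorizer()
bank_vectors = bank_vectorizer.fit_transform(bank_sentences).toarray()
bank_idx = bank_vectorizer.vocabulary_['bank']
# 'bank' gets the same count in both sentences: the representation cannot tell the two meanings apart
print(bank_vectors[0][bank_idx], bank_vectors[1][bank_idx])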
Instead, we can make the classifier take the input text as a sequence. This family of machine learning algorithms is called Recurrent Neural Networks (RNNs). One popular variant of such models that you will see next week is the LSTM (long short-term memory) network.
Let's implement one in the popular deep learning framework PyTorch! PyTorch has a companion text-processing package called torchtext that allows for easy indexing of tokens.
# NOTE: tokenize='spacy' requires the spaCy package and an English model (e.g., en_core_web_sm) to be installed
TEXT = data.Field(tokenize='spacy',batch_first=True,include_lengths=True)
LABEL = data.LabelField(dtype = torch.float,batch_first=True)
fields = [("type",LABEL),('text',TEXT)]
training_data = data.TabularDataset(
path="spam.csv",
format="csv",
fields=fields,
skip_header=True
)
print(vars(training_data.examples[0]))
train_data,valid_data = training_data.split(
split_ratio=0.75,
random_state=random.seed(42)
)
TEXT.build_vocab(
train_data,
min_freq=5
)
LABEL.build_vocab(train_data)
print("Size of text vocab:",len(TEXT.vocab))
print("Size of label vocab:",len(LABEL.vocab))
TEXT.vocab.freqs.most_common(10)
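You can also inspect how individual tokens are mapped to integer ids and back: in the legacy torchtext vocabulary, stoi maps a token string to its index and itos is the reverse list, with indices 0 and 1 reserved for the <unk> and <pad> tokens by default.
print(TEXT.vocab.stoi['the'])  # integer id assigned to the token 'the'
print(TEXT.vocab.itos[:5])     # first few entries of the index-to-token list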
# Creating GPU variable
if torch.cuda.is_available():
device = torch.device("cuda")
else:
device = torch.device("cpu")
BATCH_SIZE = 64
# NOTE: BucketIterator groups samples of similar length into the same batch, reducing the amount of padding needed.
train_iterator,validation_iterator = data.BucketIterator.splits(
(train_data,valid_data),
batch_size = BATCH_SIZE,
# Sort key is how to sort the samples
sort_key = lambda x:len(x.text),
sort_within_batch = True,
device = device
)
class LSTMClassifier(nn.Module):
def __init__(
self,
vocab_size=len(TEXT.vocab),
embedding_dim=100,
hidden_dim=64,
output_dim=1,
n_layers=2,
bidirectional=True,
dropout=0.2
):
super(LSTMClassifier,self).__init__()
# Embedding layer converts integer sequences to vector sequences
self.embedding = nn.Embedding(vocab_size,embedding_dim)
# LSTM layer process the vector sequences
self.lstm = nn.LSTM(embedding_dim,
hidden_dim,
num_layers = n_layers,
bidirectional = bidirectional,
dropout = dropout,
batch_first = True
)
# Dense layer to predict
self.fc = nn.Linear(hidden_dim * 2,output_dim)
# Prediction activation function
self.sigmoid = nn.Sigmoid()
def forward(self,text,text_lengths):
embedded = self.embedding(text)
# Thanks to packing, LSTM don't see padding tokens
# and this makes our model better
packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.cpu(),batch_first=True)
packed_output,(hidden_state,cell_state) = self.lstm(packed_embedded)
# Concatenating the final forward and backward hidden states
hidden = torch.cat((hidden_state[-2,:,:], hidden_state[-1,:,:]), dim = 1)
dense_outputs=self.fc(hidden)
#Final activation function
outputs=self.sigmoid(dense_outputs)
return outputs
LSTM_model = LSTMClassifier()
print(LSTM_model)
LSTM_model = LSTM_model.to(device)
optimizer = optim.Adam(LSTM_model.parameters(),lr=1e-4)
criterion = nn.BCELoss() # Binary Cross Entropy Loss
criterion = criterion.to(device)
def binary_accuracy(preds, y):
#round predictions to the closest integer
rounded_preds = torch.round(preds)
correct = (rounded_preds == y).float()
acc = correct.sum() / len(correct)
return acc
def evaluate(model,iterator,criterion):
epoch_loss = 0.0
epoch_acc = 0.0
# deactivate the dropouts
model.eval()
    # Disable gradient tracking; no backpropagation is needed during evaluation
with torch.no_grad():
for batch in iterator:
text,text_lengths = batch.text
predictions = model(text,text_lengths).squeeze()
#compute loss and accuracy
loss = criterion(predictions, batch.type)
acc = binary_accuracy(predictions, batch.type)
#keep track of loss and accuracy
epoch_loss += loss.item()
epoch_acc += acc.item()
return epoch_loss / len(iterator), epoch_acc / len(iterator)
def train(model,iterator,optimizer,criterion):
epoch_loss = 0.0
epoch_acc = 0.0
model.train()
for batch in iterator:
# cleaning the cache of optimizer
optimizer.zero_grad()
text,text_lengths = batch.text
# forward propagation and squeezing
predictions = model(text,text_lengths).squeeze()
# computing loss / backward propagation
loss = criterion(predictions,batch.type)
loss.backward()
# accuracy
acc = binary_accuracy(predictions, batch.type)
# updating params
optimizer.step()
epoch_loss += loss.item()
epoch_acc += acc.item()
# It'll return the means of loss and accuracy
return epoch_loss / len(iterator), epoch_acc / len(iterator)
EPOCH_NUMBER = 25
for epoch in range(1,EPOCH_NUMBER+1):
train_loss,train_acc = train(LSTM_model,train_iterator,optimizer,criterion)
valid_loss,valid_acc = evaluate(LSTM_model,validation_iterator,criterion)
# Showing statistics
print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
print(f'\t Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:.2f}%')
print()
As you can see, the LSTM can reach a similar performance on this simple spam classification task. We invite you to further investigate how LSTMs do on multi-class classification tasks, and on tasks where the input text length varies more, such as the news dataset.
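If you want to try the multi-class setting yourself, here is a minimal sketch of the required changes (not part of this notebook; it assumes C integer-encoded classes): give the model C output units, drop the sigmoid, and swap BCELoss for CrossEntropyLoss, which expects raw logits and integer class ids.
NUM_CLASSES = 20  # e.g., the 20 newsgroups topics
multi_model = LSTMClassifier(output_dim=NUM_CLASSES).to(device)
multi_model.sigmoid = nn.Identity()       # CrossEntropyLoss applies the (log-)softmax internally
multi_criterion = nn.CrossEntropyLoss()   # labels must be LongTensors of class ids, not floats
You would also need to adapt binary_accuracy to take an argmax over the class dimension, and to build new fields and iterators for the news data.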