This practical session was produced using Jupyter. If you are used to it, you can download the corresponding notebook code from here. If not, no problem at all; this is not mandatory: simply proceed as usual in your favorite Python environment.
The aim of this practical session is to get yourself acquainted with the ill-balanced nature of NLP classification tasks and to get some exposure to the scikit-learn and spaCy packages.
A great example of a highly imbalanced task is spam classification. The goal is to classify whether a given piece of text (e.g., an email or an SMS message) is spam or not. Since there are only two classes (spam or not), we call such tasks binary classification.
In reality, the number of non-spam examples we have is much larger than the number of spam examples. Therefore, we need proper evaluation techniques to deal with such class imbalance.
Fun fact: if a piece of text is not spam, it's popularly called ham!
While you can install the following packages with pip directly on your computer, we recommend (but do not require) using a virtual environment so that you don't mess up package versions across different projects.
First make sure you have a virtual environment (e.g., venv, virtualenv, conda) and that it has a Python version >= 3.6 (we recommend the latest 3.12). If you are using a Jupyter notebook, make sure the interpreter points to the correct Python executable.
For example, we use conda to manage our environments; to create a new one and activate it, we do the following:
conda create --name inlp-venv
conda activate inlp-venv
Then install the following packages (this might take around 5 minutes):
pip install -U ipykernel
pip install -U pandas
pip install -U matplotlib
pip install -U scikit-learn
pip install -U seaborn
pip install -U spacy
pip install -U spacy-transformers
pip install -U spacy-lookups-data
python -m spacy download en_core_web_trf
If you want to install them directly within the notebook, you can uncomment the following cell and run it (not advised, since you might not have the right kernel selected for this notebook). The idea is that anything preceded by ! is run as a shell command:
# !pip install -U ipykernel
# !pip install -U pandas
# !pip install -U matplotlib
# !pip install -U scikit-learn
# !pip install -U seaborn
# !pip install -U spacy
# !pip install -U spacy-transformers
# !pip install -U spacy-lookups-data
# !python -m spacy download en_core_web_trf
Next, import the necessary packages:
Note: if this part of the code hangs, simply restart your kernel and rerun; importing packages multiple times can sometimes cause problems.
# Importing necessary packages:
import spacy
import pandas as pd
import numpy as np
import os
import random
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import (
accuracy_score,
precision_score,
recall_score,
f1_score,
confusion_matrix,
precision_recall_curve,
)
from sklearn.model_selection import train_test_split, KFold
from sklearn import metrics
# Setting the seed:
seed = 42
os.environ["PYTHONHASHSEED"] = str(seed)
random.seed(seed)
np.random.seed(seed)
# Setting print options for readability:
pd.set_option("display.max_columns", None)
pd.set_option("display.expand_frame_repr", False)
pd.set_option("max_colwidth", None)
Now you are ready to start the exercises!
df = pd.read_csv("spam.csv", header=0, names=["label", "sms"], usecols=[0, 1])
df.head()
df.tail()
As you can see, there are two classes: "ham" and "spam". Let's take a look at their distribution:
print("Label counts are:")
print(df.label.value_counts())
print("______________________")
print("Label percentages are:")
print(df.label.value_counts(normalize=True))
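If you prefer a visual check, here is a small optional sketch (using the seaborn and matplotlib imports from above) that plots the label counts; the imbalance should be immediately visible:
# Optional: visualize the class imbalance with a simple bar plot.
plt.figure(figsize=(3, 2.5))
sns.countplot(x="label", data=df)
plt.title("Label distribution")
plt.show()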
Q: How is this dataset's label distribution imbalanced? Which datapoints are more common?
A: TODO - your answer here!
Let's do some minor data processing. We map the labels {"ham" $\implies$ 0} and {"spam" $\implies$ 1} as our goal is to identify spam messages.
df["label"] = df["label"].replace(to_replace={"ham": 0, "spam": 1})
df.head()
Then, to make sure that we don't overfit our models to the data, we split it into train and test sets using the very convenient train_test_split function from scikit-learn. The test_size parameter lets us choose what fraction of the data goes into the test set. Here $x$ is the SMS message, while $y$ is its corresponding label.
Remember, when you are prototyping a model, never use the test data. The goal of the test data is to simulate an independently sampled dataset that cannot be seen during training and model design. Any statistics you extract should come from the training set only.
x_train, x_test, y_train, y_test = train_test_split(
df["sms"].values, df["label"].values, test_size=0.2, random_state=seed
)
print(
"Spam amount in train set: {} out of {} datapoints".format(
(y_train == 1).sum(), len(y_train)
)
)
print(
"Spam percentage in train set: {}%".format(
round((y_train == 1).sum() / float(len(y_train)) * 100, 4)
)
)
print("Size of train set is: ", len(y_train))
print("Size of test set is: ", len(y_test))
Our first task is to evaluate some simple classifiers to see how well they do on the spam classification task. Note that while they take a text input, none of these classifiers actually looks at the content of the string. Remember that in probability, a prior is one's belief about a quantity before seeing new data, based on past evidence.
rand_binom_uniform_classifier: The first model uniformly randomly classifies the label for a given piece of text, meaning the prior is $p=0.5$.
rand_binom_biased_classifier: The second model samples a label from a binomial distribution; you should choose which prior $p$ you think makes the most sense!
always_false_classifier: The third model always returns false, basically assuming that no text can ever be spam.
Don't forget to change the value prior_p $=0.5$! Choose the value informatively! Think about which statistics can help you with this.
##############################
# TODO: write a prior you think that makes the most sense in the next line for the rand_binom_biased_classifier:
prior_p = 0.5 # <-------- CHANGE THIS VALUE!
##############################
def rand_binom_uniform_classifier(text):
"""
    Uniformly randomly picks a binary label.
Returns ints 0 or 1.
"""
return np.random.binomial(n=1, p=0.5)
def rand_binom_biased_classifier(text):
"""
Randomly picks a binary label according to a prior_p.
Returns ints 0 or 1.
"""
return np.random.binomial(n=1, p=prior_p)
def always_false_classifier(text):
"""
Always returns int 0.
"""
return 0
def predict(model, dataset):
"""
Applies model prediction on every sentence.
"""
return [model(text) for text in dataset]
Now we are going to use the accuracy_score metric provided by scikit-learn:
rand_binom_uniform_preds = predict(rand_binom_uniform_classifier, x_test)
rand_binom_biased_preds = predict(rand_binom_biased_classifier, x_test)
always_false_preds = predict(always_false_classifier, x_test)
for model_name, model in [
("Rand Uniform", rand_binom_uniform_classifier),
("Rand Binom p={}".format(prior_p), rand_binom_biased_classifier),
("Always False", always_false_classifier),
]:
test_preds = predict(model, x_test)
test_accuracy = accuracy_score(
y_test, test_preds
) # NOTE: here we pass the gold labels and the predicted labels to calculate how well our model is doing
print(
"Model: {} || Accuracy: {}%".format(model_name, round(test_accuracy * 100, 4))
)
Wow! Looks like the "Always False" classifier is amazing at this task, even better than randomly choosing. Over 85% sounds like a pretty good result. Can we call it quits, and go home now? Not so fast... Answer the following questions.
Q: Which prior did you choose and why do you think it fits the problem? If you chose your prior informatively, the Rand Binom classifier should be better than the Rand Uniform classifier. Why is the biased random classifier doing better than the uniform one?
A: TODO - your answer here!
Q: How would using testing data to design the model be "bad practice"? Give an example.
A: TODO - your answer here!
Q: Why is the accuracy score not the correct metric to use in this task? Which classifier is a good example of this? What other metrics can be used and why would they be a better fit for this task?
A: TODO - your answer here!
Now that you have convinced yourself that accuracy is not the sole metric we should be using, let's explore the other possible scores with the following metrics conveniently provided by scikit-learn:
Note that some of these functions also have a zero_division parameter, since the denominator of these metrics can legitimately be 0. In that case the default is to output 0, but a warning may be raised. Feel free to ignore the warning.
TIP: Can't remember what precision and recall are? Remember them by the start of each word:
Precision: PREcision is TP divided by PREdicted positive:
$ \frac{TP}{TP+FP}$
Recall: REcAll is TP divided by REAl positive:
$ \frac{TP}{TP+FN}$
source: stats.stackexchange
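To make these formulas concrete, here is a small optional sketch (the labels below are made up purely for illustration) that computes precision and recall by hand from the confusion matrix and checks them against scikit-learn; it also shows the zero_division parameter mentioned above:
# Toy example (made-up labels, for illustration only): compute precision and
# recall by hand from TP/FP/FN and compare against scikit-learn.
y_true_toy = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_pred_toy = np.array([0, 1, 0, 1, 0, 0, 1, 0])
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()
print("Precision by hand:", tp / (tp + fp))  # TP / PREdicted positive
print("Recall by hand:   ", tp / (tp + fn))  # TP / REAl positive
print("Precision sklearn:", precision_score(y_true_toy, y_pred_toy))
print("Recall sklearn:   ", recall_score(y_true_toy, y_pred_toy))
# If a model never predicts the positive class, TP + FP = 0 and precision is
# undefined; zero_division controls what is returned in that case.
all_zero_preds = np.zeros_like(y_true_toy)
print("Precision with zero_division=0:", precision_score(y_true_toy, all_zero_preds, zero_division=0))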
for model_name, model in [
("Rand Uniform", rand_binom_uniform_classifier),
("Rand Binom p={}".format(prior_p), rand_binom_biased_classifier),
("Always False", always_false_classifier),
]:
test_preds = predict(model, x_test)
test_accuracy = accuracy_score(y_test, test_preds)
test_precision = precision_score(y_test, test_preds)
test_recall = recall_score(y_test, test_preds)
test_f1 = f1_score(y_test, test_preds)
cf_train_matrix = confusion_matrix(y_test, test_preds)
# print("________________________________________________________________")
print("Model: {}".format(model_name))
print("\t|| Accuracy: {}% ".format(test_accuracy))
print("\t|| Precision: {}% ".format(test_precision))
print("\t|| Recall: {}% ".format(test_recall))
print("\t|| F1: {}% ".format(test_f1))
plt.figure(figsize=(2.5, 2))
sns.heatmap(cf_train_matrix, annot=True, fmt="d")
plt.xlabel("Predicted Labels")
plt.ylabel("Real Labels")
plt.show()
Q: What do precision, recall, and the confusion matrix tell us about the "Always False" model? Why don't we see this phenomenon with the random classifiers?
A: TODO - your answer here!
Q: What would be a more ideal confusion matrix result?
A: TODO - your answer here!
So far, our models haven't taken the actual text into account. In this next task we build the simplest possible AI algorithm: a big IF statement that classifies a text as spam if a "spam word" is used. To do so, we first have to break the sentence down into words. Assuming we only have English sentences, we can separate them by spaces or punctuation. This act of breaking a sentence into tokens is called tokenization. For now, instead of using the spaCy implementation, let's create our own simple tokenizer that turns a sentence into a list of tokens by splitting on spaces.
Consider the sentence: "I am not a spam message, don't confuse me with it!" There are many ways we could break it into tokens:
You could even tokenize the sentence into subword units; for example, it won't be very interesting, but you could break it into individual characters, as illustrated in the sketch below.
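For instance, here is a quick optional sketch (using Python's standard re module; the regex is just one possible choice) of three granularities for that sentence:
import re

sentence = "I am not a spam message, don't confuse me with it!"
# Word level, splitting on whitespace only (punctuation stays attached to words):
print(sentence.split(" "))
# Word level, but separating punctuation into its own tokens:
print(re.findall(r"\w+|[^\w\s]", sentence))
# Character level (not very interesting, as noted above):
print(list(sentence))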
def simple_tokenizer(text):
return text.split(" ")
def custom_tokenizer(text):
pass # TODO: OPTIONAL, have fun! Implement your own tokenizer that's more elegant than the above one
print(simple_tokenizer("I am not a spam message, don't confuse me with it!"))
print(custom_tokenizer("I am not a spam message, don't confuse me with it!"))
Let's take a look at what our tokenizer does on our dataset:
x_tokenized_test = [simple_tokenizer(sentence) for sentence in x_test]
for i in [5, 6, 2]:
print("__________________________________________________________________")
print("Label: ", y_test[i])
print("Sentence: ", x_test[i])
print("Tokenized version: ", x_tokenized_test[i])
# TODO: Change or fill more with words you think indicate a spam message!
spam_words = ["money", "drugs", "winner"]
Now let's implement our keyword matching algorithm:
def big_if_classifier(text):
tokenized_text = simple_tokenizer(text)
for token in tokenized_text:
if token in spam_words:
return 1
return 0
Next, we evaluate the big if classifier:
for model_name, model in [("Big If", big_if_classifier)]:
test_preds = predict(model, x_test)
test_accuracy = accuracy_score(y_test, test_preds)
test_precision = precision_score(y_test, test_preds)
test_recall = recall_score(y_test, test_preds)
test_f1 = f1_score(y_test, test_preds)
print("Model: {}".format(model_name))
print("\t|| Accuracy: {}% ".format(test_accuracy))
print("\t|| Precision: {}% ".format(test_precision))
print("\t|| Recall: {}% ".format(test_recall))
print("\t|| F1: {}% ".format(test_f1))
cf_train_matrix = confusion_matrix(y_test, test_preds)
plt.figure(figsize=(2.5, 2))
sns.heatmap(cf_train_matrix, annot=True, fmt="d")
plt.xlabel("Predicted Labels")
plt.ylabel("Real Labels")
plt.show()
Q: What are some of the shortcomings of this big-if classifier? How could you mitigate such problems?
A: TODO - your answer here!
The following classifier will be one trained with spaCy, a package widely used in the NLP industry. For now, focus on the evaluation and on some cool aspects of this package.
You may have noticed that our tokenization is too simple: "money" could be written right before a comma, and apostrophes might create variant tokens that actually refer to the same word. spaCy provides a tokenizer for each supported language that takes care of such edge cases.
And that's not all! The framework is organized as a pipeline that takes a plain text string and turns it into a featurized Doc. Document here doesn't mean multiple sentences: you can think of a Doc as an augmentation of the plain text with meaningful linguistic features that can serve as better signals for classification tasks. Here is a possible set of components making up the spaCy nlp pipeline:
Notice that the tokenizer is separate. This is because for a given language, spaCy has only one tokenizer. As said on the website:
[...] while all other pipeline components take a Doc and return it, the tokenizer takes a string of text and turns it into a Doc
So let's create an English pipeline with only the tokenizer by disabling the rest of the components.
First, we need to load the language-specific nlp pipeline with spacy.load(). Then there are two ways to pass text into spaCy, as sketched below:
nlp(text)
nlp.pipe(texts)
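Here is a minimal sketch of such a tokenizer-only pipeline and of the two ways of passing text (the variable name nlp_tok is ours; spacy.blank gives an English pipeline with nothing but the tokenizer, and alternatively you could load a full model and disable its components with nlp.select_pipes):
# A blank English pipeline has no components -- only the tokenizer, which sits
# outside the component list and always runs.
nlp_tok = spacy.blank("en")
print("Pipeline components:", nlp_tok.pipe_names)  # expected: []
# Way 1: nlp(text) turns a single string into a Doc.
doc = nlp_tok("I am not a spam message, don't confuse me with it!")
print([token.text for token in doc])
# Way 2: nlp.pipe(texts) processes an iterable of texts as a stream,
# which is more efficient for many documents.
for doc in nlp_tok.pipe(x_test[:3]):
    print([token.text for token in doc])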
# Reformatting data for spacy text categorizer
train_data = []
for i in range(200):
label = y_train[i]
input = x_train[i]
if i < 5:
print("Input message: ", input)
print("Label: ", label)
print("----")
if label == 1:
train_data.append((input, {"cats": {"spam": 1, "ham": 0}}))
else:
train_data.append((input, {"cats": {"spam": 0, "ham": 1}}))
from spacy.training.example import Example
from spacy.util import minibatch
import random
from thinc.api import Adam
# Train the model
losses = {}
dropout = 0.1 # NOTE: to prevent overfitting
learning_rate = 0.002 # e.g. 0.001 2e-5
# Load a transformer-based model
nlp = spacy.load("en_core_web_trf")
# Add a text categorizer to the pipeline if not already present
textcat = None
if "textcat" not in nlp.pipe_names:
textcat = nlp.add_pipe("textcat", last=True)
else:
textcat = nlp.get_pipe("textcat")
# Add labels to the text categorizer
textcat.add_label("spam")
textcat.add_label("ham")
# Remove components from the pipeline if you wish to
print("The default English nlp pipe components are: {}.\n".format(nlp.pipe_names))
nlp.remove_pipe("lemmatizer")
nlp.remove_pipe("tagger")
nlp.remove_pipe("parser")
nlp.remove_pipe("ner")
nlp.initialize()
optimizer = Adam(
learn_rate=learning_rate,
beta1=0.9,
beta2=0.999,
eps=1e-08,
L2=1e-6,
# grad_clip=1.0,
use_averages=True,
L2_is_weight_decay=True,
)
for i in range(4): # NOTE: change the number of iterations as needed
random.shuffle(train_data)
batches = minibatch(train_data, size=8)
for batch in batches:
texts, annotations = zip(*batch)
examples = [
Example.from_dict(nlp.make_doc(text), ann)
for text, ann in zip(texts, annotations)
]
nlp.update(examples, sgd=optimizer, drop=dropout, losses=losses)
print(f"Losses at iteration {i}: {losses}")
# Save the trained model if you wish!
# nlp.to_disk("inlp_textcat_model")
doc = nlp("I am looking for kitchen appliances.")
print(doc.cats)
doc = nlp("Wanna grab pizza tonight? I feel like going to Dieci.")
print(doc.cats)
doc = nlp("For only 1000$, you can also be a winner!")
print(doc.cats)
doc = nlp("Send 30000$ to this email!")
print(doc.cats)
Wow! The predicted probabilities for each label seem quite reasonable. Now let's test this on the whole test dataset:
def predict_spacy(model, dataset):
preds = []
for doc in model.pipe(dataset):
res_dict = doc.cats
pred = max(res_dict, key=res_dict.get) # argmax
if pred == "spam":
pred = 1
else:
pred = 0
preds.append(pred)
return preds
Evaluating with the spaCy classifier takes longer than with our random baselines, so note that the following may take around a minute:
for model_name, model in [("Spacy", nlp)]:
test_preds = predict_spacy(model, x_test)
test_accuracy = accuracy_score(y_test, test_preds)
test_precision = precision_score(y_test, test_preds)
test_recall = recall_score(y_test, test_preds)
test_f1 = f1_score(y_test, test_preds)
cf_train_matrix = confusion_matrix(y_test, test_preds)
print("________________________________________________________________")
print("Model: {}".format(model_name))
print("\t|| Accuracy: {}% ".format(test_accuracy))
print("\t|| Precision: {}% ".format(test_precision))
print("\t|| Recall: {}% ".format(test_recall))
print("\t|| F1: {}% ".format(test_f1))
plt.figure(figsize=(2.5, 2))
sns.heatmap(cf_train_matrix, annot=True, fmt="d")
plt.xlabel("Predicted Labels")
plt.ylabel("Real Labels")
plt.show()
Q: What can you notice about the precision/recall/F1/confusion matrix results? What is particularly improved compared to the "Big-If" classifier?
A: TODO - your answer here!
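As a side note, we imported precision_recall_curve at the top but haven't used it yet. Since doc.cats gives a probability for the "spam" class, you can sweep the decision threshold instead of always taking the argmax. The following optional sketch (assuming the trained nlp pipeline from above is still in memory; spam_scores is our own name) plots the resulting precision-recall trade-off on the test set. It re-runs the model over the test set, so it takes about as long as the evaluation above:
# Optional: precision-recall curve for the spaCy classifier, using the
# predicted "spam" probability as the decision score.
spam_scores = [doc.cats["spam"] for doc in nlp.pipe(x_test)]
precision, recall, thresholds = precision_recall_curve(y_test, spam_scores)
plt.figure(figsize=(4, 3))
plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-recall curve (spaCy classifier)")
plt.show()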
Now that we have several models under our belt, let's compare our five classifiers and see how statistically robust the differences between them are:
As you learned in the lecture, one way to measure statistical significance is to do k-fold cross-validation. Once again, scikit-learn provides a convenient KFold class that splits a given dataset's inputs and outputs into k folds. Here we will use k $=5$, which gives a validation set ratio of 0.2. We hold out the test set and do this procedure only on the train set.
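To see what KFold actually produces, here is a tiny illustrative sketch on ten dummy datapoints: each iteration yields disjoint train/validation index arrays, and every datapoint ends up in the validation set exactly once across the k folds.
# Illustration only: KFold on ten dummy datapoints.
toy_data = np.arange(10)
for fold, (tr_idx, val_idx) in enumerate(KFold(n_splits=5).split(toy_data)):
    print("Fold {}: train indices {}, validation indices {}".format(fold, tr_idx, val_idx))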
Note that the full cross-validation below may take ~10 minutes to run, as the spaCy model gets retrained on 200 examples for every new fold.
k = 5 # NOTE: You can change the value of k via the argument here!
k_fold = KFold(k)
kfold_iteration = enumerate(k_fold.split(x_train, y_train))
metrics = {
"Rand Uniform": {
"acc": [],
"pre": [],
"rec": [],
},
"Rand Binom p={}".format(prior_p): {
"acc": [],
"pre": [],
"rec": [],
},
"Always False": {
"acc": [],
"pre": [],
"rec": [],
},
"Big If": {
"acc": [],
"pre": [],
"rec": [],
},
"Spacy": {
"acc": [],
"pre": [],
"rec": [],
},
}
for k, (train_idx, val_idx) in kfold_iteration:
print("K-cross validation iteration {}".format(k))
x_tr, x_val = x_train[train_idx], x_train[val_idx]
y_tr, y_val = y_train[train_idx], y_train[val_idx]
for model_name, model in [
("Rand Uniform", rand_binom_uniform_classifier),
("Rand Binom p={}".format(prior_p), rand_binom_biased_classifier),
("Always False", always_false_classifier),
("Big If", big_if_classifier),
]:
val_preds = predict(model, x_val)
metrics[model_name]["acc"].append(accuracy_score(y_val, val_preds))
metrics[model_name]["pre"].append(precision_score(y_val, val_preds))
metrics[model_name]["rec"].append(recall_score(y_val, val_preds))
for model_name, model in [("Spacy", nlp)]:
# Reformatting data for spacy text categorizer
train_data = []
for i in range(200):
label = y_tr[i]
input = x_tr[i]
# if i < 5:
# print("Input message: ", input)
# print("Label: ", label)
# print("----")
if label == 1:
train_data.append((input, {"cats": {"spam": 1, "ham": 0}}))
else:
train_data.append((input, {"cats": {"spam": 0, "ham": 1}}))
# Train the model
losses = {}
dropout = 0.1 # NOTE: to prevent overfitting
learning_rate = 0.002 # e.g. 0.001 2e-5
# Load a transformer-based model
nlp = spacy.load("en_core_web_trf")
# Add a text categorizer to the pipeline if not already present
textcat = None
if "textcat" not in nlp.pipe_names:
textcat = nlp.add_pipe("textcat", last=True)
else:
textcat = nlp.get_pipe("textcat")
# Add labels to the text categorizer
textcat.add_label("spam")
textcat.add_label("ham")
# Remove components from the pipeline if you wish to
print(
"The default English nlp pipe components are: {}.\n".format(nlp.pipe_names)
)
nlp.remove_pipe("lemmatizer")
nlp.remove_pipe("tagger")
nlp.remove_pipe("parser")
nlp.remove_pipe("ner")
nlp.initialize()
optimizer = Adam(
learn_rate=learning_rate,
beta1=0.9,
beta2=0.999,
eps=1e-08,
L2=1e-6,
# grad_clip=1.0,
use_averages=True,
L2_is_weight_decay=True,
)
for i in range(4): # NOTE: change the number of iterations as needed
random.shuffle(train_data)
batches = minibatch(train_data, size=8)
for batch in batches:
texts, annotations = zip(*batch)
examples = [
Example.from_dict(nlp.make_doc(text), ann)
for text, ann in zip(texts, annotations)
]
nlp.update(examples, sgd=optimizer, drop=dropout, losses=losses)
print(f"Losses at iteration {i}: {losses}")
val_preds = predict_spacy(nlp, x_val)
metrics[model_name]["acc"].append(accuracy_score(y_val, val_preds))
metrics[model_name]["pre"].append(precision_score(y_val, val_preds))
metrics[model_name]["rec"].append(recall_score(y_val, val_preds))
acc_means = []
acc_stds = []
pre_means = []
pre_stds = []
rec_means = []
rec_stds = []
model_names = list(metrics.keys())
for model in model_names:
acc_means.append(np.mean(metrics[model]["acc"]))
acc_stds.append(np.std(metrics[model]["acc"]))
#
pre_means.append(np.mean(metrics[model]["pre"]))
pre_stds.append(np.std(metrics[model]["pre"]))
#
rec_means.append(np.mean(metrics[model]["rec"]))
rec_stds.append(np.std(metrics[model]["rec"]))
x = np.arange(len(model_names))
width = 0.25
fig, ax = plt.subplots()
rects1 = ax.bar(x, acc_means, width, yerr=acc_stds, label="Accuracy", edgecolor="black")
rects2 = ax.bar(
    x + width, pre_means, width, yerr=pre_stds, label="Precision", edgecolor="black"
)
rects3 = ax.bar(
    x + width * 2, rec_means, width, yerr=rec_stds, label="Recall", edgecolor="black"
)
ax.set_ylabel("Metric value")
ax.set_title("k=5-cross validation metric variance")
ax.set_xticks(x + width, model_names, rotation=30)
ax.legend()
fig.tight_layout()
plt.show()
Q: While you may not have enough time to run these experiments during today's practical session, what difference do you think you would notice between k = 2, k = 5, and k = 10?
A: TODO - answer here...