This part was produced using Jupyter.
If you are used to it, you can download the corresponding notebook code from here. If not, no problem (it is not mandatory): simply proceed as usual in your favorite Python environment.
In this practical session, our goals are to:
- play with the built-in PoS tagger of NLTK;
- use PoS tagging to help lemmatization and reduce the vocabulary of a corpus;
- see how adding context (n-gram taggers with fallbacks) improves tagging performance;
- implement the Viterbi algorithm on a toy HMM example.
This practical session makes use of NLTK. Please refer to this part of the first practical session for setup.
If you have not set up an environment in the previous PSs, then follow these instructions. Otherwise, you can skip to installing the required packages for this PS.
While you can install the following packages with pip directly on your computer, we recommend (but do not require) using a virtual environment, so as not to mess up package versions across different projects.
First make sure you have a virtual environment (e.g., venv, virtualenv, conda) and that it runs a Python version >= 3.6 (we recommend the latest, 3.12). If you are using a Jupyter notebook, make sure the interpreter points to the correct Python executable.
For example, we use conda to manage our environments, and we do the following to create a new one and activate it:
conda create --name inlp-venv
conda activate inlp-venv
Then install the following packages (this might take around 2 minutes):
pip install -U ipykernel
pip install -U nltk
If you want to install them directly within the notebook, you can uncomment the following cell and run it (not advised, since you might not have the right kernel open for this notebook). The idea is that anything preceded by ! is run as a shell command:
# !pip install -U ipykernel
# !pip install -U nltk
And then download the tagsets in NLTK:
import nltk
nltk.download("tagsets")
nltk.download("tagsets_json")
nltk.download("punkt_tab")
nltk.download("averaged_perceptron_tagger_eng")
nltk.download("brown")
nltk.download("universal_tagset")
nltk.download("wordnet")
from nltk.corpus import brown
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer
For this exercise, we will be using the basic functionality of the built-in PoS tagger from NLTK. All you need to know for this part can be found in section 1 of chapter 5 of the NLTK book.
Note: NLTK provides documentation for each tag, which can be queried using the tag, e.g.
nltk.help.upenn_tagset('RB')
or some regular expression, e.g.
nltk.help.upenn_tagset('NN.*')
import nltk
# some headline
headline = "British Left Waffles on Falkland Islands"
nltk.help.upenn_tagset()  # with no argument, prints the documentation for every tag
# Tokenize and then tag the headline
tagged = nltk.pos_tag(nltk.word_tokenize(headline))
print(tagged)
# some help, maybe...
for word, tag in tagged:
    nltk.help.upenn_tagset(tag)
In this exercise, we will be exploring the interplay between PoS tagging, lemmatization, and vocabulary reduction.
Imagine you have a large corpus and a large number of queries of the form "Give me all documents in which all of the words x, y and z appear". One way to answer such queries efficiently is to index the documents in the corpus: for instance, implement a function index() for which index(document, word) returns non-zero if and only if word occurs in document at least once.
One way to implement this kind of index is with a matrix in which each row corresponds to a document and each column corresponds to a word. When the query above is executed, only the rows for which the entries in the x, y and z columns are simultaneously non-zero are returned.
The size of this matrix is number of documents × number of distinct words that appear in the corpus. Clearly, the smaller the matrix, the faster the queries are executed. Ideally, we would like to reduce the number of words without losing the ability to answer queries. One way to do this is to eliminate morphological variability, for instance mapping both "goes" and "went" to the same column "go". This is called "lemmatization". Part-of-speech tagging can help lemmatization so as not to confuse similar surface forms with different roles (e.g. "can" in "a can of beer" versus "can" in "I can do it").
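As a quick illustration (a minimal sketch with toy documents of our own; the exercise itself does not ask you to implement this), such an index can be kept as a mapping from each word to the set of documents containing it:

# a minimal term-document index sketch (toy documents for illustration only)
documents = {
    "doc1": "the cat sat on the mat",
    "doc2": "the dog goes to the park",
}

# map each word to the set of documents in which it occurs
inverted = {}
for doc_id, text in documents.items():
    for word in text.split():
        inverted.setdefault(word, set()).add(doc_id)

def index(document, word):
    # non-zero (True) if and only if word occurs in document at least once
    return document in inverted.get(word, set())

# a query "all documents containing x, y and z" is a set intersection
def query(*words):
    return set.intersection(*(inverted.get(w, set()) for w in words))

print(index("doc1", "cat"))  # True
print(query("the", "cat"))   # {'doc1'}

Lemmatization would then merge entries such as "goes" and "went" into a single "go" entry, shrinking this mapping.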
Start by loading the tagged words of the Brown corpus:
brown_tagged = brown.tagged_words(tagset="universal")
and see what you got; e.g.:
brown_tagged[0:10]
Note that we are using the “universal” tagset because the original Brown tagset is somewhat unwieldy.
To count the distinct words before lemmatization, build a frequency distribution of the lowercased words using NLTK's FreqDist class:
fdist_before = nltk.FreqDist(word.lower() for (word, tag) in brown_tagged)
This builds a frequency distribution from the words in the corpus (have a look!). To see how many distinct words are in the distribution, simply take the length:
len(fdist_before) ## should be 49815
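To have a look at the distribution itself, you can, for instance, list the most frequent words (a quick check, using the standard FreqDist.most_common method):
fdist_before.most_common(10)  # the ten most frequent (word, count) pairs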
🎯 [to do: Go through the tagged words (the brown_tagged variable above), lemmatize each token, and add the lemma to a separate list.]
For the part-of-speech format needed for lemmatization, you might find the following function useful (do not forget to import wordnet from nltk.corpus):
# converts a pos tag from the 'universal' format to wordnet format
def get_wordnet_pos(universal_tag):
    if universal_tag == "VERB":
        return wordnet.VERB
    elif universal_tag == "ADJ":
        return wordnet.ADJ
    elif universal_tag == "ADV":
        return wordnet.ADV
    else:
        return wordnet.NOUN
lem = WordNetLemmatizer()
# build a frequency distribution from the lowercase form of the lemmas
fdist_after = nltk.FreqDist(
    lem.lemmatize(word.lower(), get_wordnet_pos(tag)) for (word, tag) in brown_tagged
)
# how many words exist after lemmatization (we obtained ~21% reduction)
print(len(fdist_after)) ## should be 39358
print(
    100 * (len(fdist_before) - len(fdist_after)) / len(fdist_before)
)  ## should be 21%
In this exercise, we will see how adding context can improve the performance of automatic part-of-speech tagging. From the NLTK point of view, everything you need to know can be found in section 5 of chapter 5 of the book.
We will begin with a simple unigram tagger and build it up to a slightly more complex tagger. Unigram taggers are based on a simple statistical algorithm: for each token, assign the tag that is most likely for that particular token.
# import the tagged and untagged sentences
brown_tagged_sents = brown.tagged_sents(categories=["news", "fiction"])
print("Size of corpus: {0} sentences".format(len(brown_tagged_sents)))
# split the sentences into training and test sets
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
print("Size of training set: {0} sentencces".format(size))
# Tagged test sentences (used as reference)
test_sents = brown_tagged_sents[size:]
# Untagged test sentences
raw_sents = brown.sents(categories=["news", "fiction"])[size:]
🎯 [to do: Train a unigram tagger on the training set and use it to tag the untagged sentences in raw_sents. Observe that some words are not assigned a tag. Why not? What accuracy does this tagger reach on the test set (the evaluate() function may come in handy here)?]
# train the tagger and test it on unseen sentences
unigram_tagger = nltk.UnigramTagger(train_sents)
print("Unigram tagger unseen sentence example:")
print(unigram_tagger.tag(raw_sents[10]))
print(
    "\nPerf. of unigram tagger alone: {0}".format(unigram_tagger.evaluate(test_sents))
)
An easy way to mitigate the problem of some words not being assigned tags is to have the unigram tagger fall back on a guesser when it does not know what tag to assign to a word.
As its name suggests, a guesser is a PoS tagger that assigns a tag to any token (be it a correct word or not). Using a guesser as a fallback can improve the robustness of the PoS tagging system (i.e. the system will associate a PoS tag with every word, even if it was not seen beforehand or is not in the lexicon). Another scenario in which a guesser could be helpful is the initialization phase of the Brill tagging algorithm. In addition, the guesser's standalone performance can serve as a baseline when evaluating more complex taggers.
For instance, perhaps the simplest guesser is the one that just assigns the same tag to every word it encounters. This is called a “default tagger” in NLTK.
🎯 [to do: Build a default tagger that assigns the tag "NN" to every token, and evaluate it alone on the test set (test_sents) defined in the previous section. Do not combine it yet with other taggers at this point. Report the accuracy. (Feel free to look at the examples in sections 4 and 5 from chapter 5 of the NLTK book.)]
t0 = nltk.DefaultTagger("NN")
print("Perf. default tagger alone: {0}".format(t0.evaluate(test_sents)))
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
print(
    "Perf. unigram tagger, fallback on default tagger: {0}".format(
        t1.evaluate(test_sents)
    )
)
🎯 [to do: How would you build a better guesser, while still not relying on any further training data?]
Answer: We could make use of morphology, e.g. using regular expressions to decide on the tag to guess, as done next.
One step further than the default tagger could be to assign tags to tokens on the basis of matching patterns. In other words, the form of the token is considered when guessing its part-of-speech. For example, it might guess that any word ending in “ed” is the past participle of a verb, and any word ending with “'s” is a possessive noun. This is called a “regular expression tagger” and is already available in NLTK.
🎯 [to do: Evaluate the regular expression tagger below alone on the test set, then evaluate a unigram tagger that falls back on it.]
regexp_tagger = nltk.RegexpTagger(
    [
        (r"^-?[0-9]+(\.[0-9]+)?$", "CD"),  # cardinal numbers
        (r"(The|the|A|a|An|an)$", "AT"),  # articles
        (r".*able$", "JJ"),  # adjectives
        (r".*ness$", "NN"),  # nouns formed from adjectives
        (r".*ly$", "RB"),  # adverbs
        (r".*s$", "NNS"),  # plural nouns
        (r".*ing$", "VBG"),  # gerunds
        (r".*ed$", "VBD"),  # past tense verbs
        (r".*", "NN"),  # nouns (default)
    ]
)
print("Perf. RegExp tagger alone: {0}".format(regexp_tagger.evaluate(test_sents)))
# evaluate a unigram tagger with a backoff to a regexp tagger
uni_regexp = nltk.UnigramTagger(train_sents, backoff=regexp_tagger)
print(
    "Perf. unigram tagger, fallback on regexp_tagger: {0}".format(
        uni_regexp.evaluate(test_sents)
    )
)
When performing PoS tagging with a unigram-based tagger, the larger context in which the words appear is ignored. Unigram taggers consider only one item of context, in isolation. This imposes a significant limitation on any tagger in this class: the most it can do is assign the a priori most likely tag to a given token. In a sense, the tag of each word is decided before it is placed in a sentence. Therefore, the word "object" would be assigned the same part of speech every time, regardless of whether it appears as "the object" (where it should be a noun) or as "to object" (where it should be a verb).
An n-gram tagger picks the tag that is most likely in the given context of size n. An instance of an n-gram tagger is the bigram tagger, which considers groups of two tokens when deciding on the parts of speech. For example, we would expect a bigram tagger (as opposed to a unigram tagger) to find the correct tags for the two occurrences of the word "object" in the following sentence:
“I did not object to the object”
Indeed, an NLTK unigram tagger trained on the Brown corpus yields:
sentence = "I did not object to the object"
tokens = nltk.word_tokenize(sentence)
print(unigram_tagger.tag(tokens))
print(t1.tag(tokens))
print(uni_regexp.tag(tokens))
There is no way to obtain a different tag for the two occurrences of "object".
🎯 [to do: Train a bigram tagger alone (on train_sents) and check it on the above sentence (weird result expected, see below). Then evaluate it on the test set (test_sents).]
bigram = nltk.BigramTagger(train_sents)
print(bigram.tag(tokens))
print("\nPerf. bigram tagger alone: {0}".format(bigram.accuracy(test_sents)))
You have probably found a surprising result on our sample sentence and a very low accuracy for the bigram tagger when run alone. If you run the trained bigram tagger on a sentence it has not seen during training (e.g. bigram.tag(raw_sents[1])), you will notice that a majority of words are tagged as None. When the bigram tagger encounters a new word, it is unable to assign a tag. It will also be unable to tag the following word, even if that word was seen during training, because it was never seen preceded by a None tag. From this point on, all the words up to the end of the sentence will be tagged as None for the same reason. This explains the low accuracy of the bigram tagger when run alone.
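To see this for yourself (a quick check, reusing the bigram tagger and raw_sents defined above):
print(bigram.tag(raw_sents[1]))  # most words come out tagged as None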
The problem that arises when using n-grams is that once we pass to larger values of n, more data is needed for training, as the specificity of the context increases. In other words, there are many more possible bigrams than unigrams and thus the word/tag combinations that we need to consider when tagging are less likely to have appeared in the training set, as n increases. This is known as the sparse data problem, and is quite pervasive in NLP.
This means that, when using n-gram taggers, we face the following tradeoff: a larger n yields more accurate predictions, at the cost of requiring more training data; a smaller n yields less accurate predictions, but requires less training data.
In order to overcome this limitation, we can train a bigram tagger that falls back on our “unigram+default” or “unigram+regexp” taggers defined in the previous section.
🎯 [to do: Evaluate a bigram tagger and a trigram tagger that fall back on the combined taggers from the previous sections, then try them on our sample sentence.]
# evaluate a bigram tagger with a backoff to the unigram+default tagger
t2 = nltk.BigramTagger(train_sents, backoff=t1)
print(
    "Perf. bigram tagger, fallback on unigram+default tagger: {0}".format(
        t2.accuracy(test_sents)
    )
)
# evaluate a trigram tagger with a backoff to the bigram+unigram+default tagger
t3 = nltk.TrigramTagger(train_sents, backoff=t2)
print(
    "Perf. trigram tagger, fallback on bigram+unigram+default tagger: {0}".format(
        t3.accuracy(test_sents)
    )
)
print("")
# evaluate a bigram tagger with a backoff to the unigram+regexp tagger
t2_regexp = nltk.BigramTagger(train_sents, backoff=uni_regexp)
print(
    "Perf. bigram tagger, fallback on unigram+regexp tagger: {0}".format(
        t2_regexp.accuracy(test_sents)
    )
)
# evaluate a trigram tagger with a backoff to the bigram+unigram+regexp tagger
t3_regexp = nltk.TrigramTagger(train_sents, backoff=t2_regexp)
print(
    "Perf. trigram tagger, fallback on bigram+unigram+regexp tagger: {0}".format(
        t3_regexp.accuracy(test_sents)
    )
)
## and on our sample sentence:
print(t2.tag(tokens))
🎯 [to do: Code the Viterbi algorithm (in your favorite programming language) and apply it to the toy example provided on slide 8 of the lecture on HMM: two symbols (H, T), two states (the two coins)]
A = [[0.4, 0.6], [0.9, 0.1]],
B = [[0.49, 0.51], [0.85, 0.15]],
I = [0.5, 0.5].
🎯 [to do: What is (one of) the most probable sequence(s) of states corresponding to the observation HTTHTTHHTTHTTTHHTHHTTHTTTTHTHHTHTHHTTTH?]
Here (link) are some slides illustrating the Viterbi algorithm at work on a concrete, simple example.
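For reference, here is a minimal Python sketch of the Viterbi algorithm applied to this toy example (our own illustration, not the official solution; it assumes A[i][j] is the transition probability from state i to state j and B[i][k] the probability of symbol k in state i, works in log-space, and breaks ties by taking the first maximizer, so it prints just one of the equally probable sequences):

import math

# toy HMM from the lecture: two states (coins 1 and 2), two symbols (H, T)
A = [[0.4, 0.6], [0.9, 0.1]]      # A[i][j] = P(next state j | current state i)
B = [[0.49, 0.51], [0.85, 0.15]]  # B[i][k] = P(symbol k | state i), k = 0 for H, 1 for T
I = [0.5, 0.5]                    # initial state probabilities

def viterbi(obs):
    o = [{"H": 0, "T": 1}[c] for c in obs]  # observation string -> symbol indices
    n = len(I)
    # delta[i]: log-probability of the best path ending in state i
    delta = [math.log(I[i] * B[i][o[0]]) for i in range(n)]
    backptr = []
    for t in range(1, len(o)):
        new_delta, ptr = [], []
        for j in range(n):
            best = max(range(n), key=lambda i: delta[i] + math.log(A[i][j]))
            new_delta.append(delta[best] + math.log(A[best][j]) + math.log(B[j][o[t]]))
            ptr.append(best)
        delta = new_delta
        backptr.append(ptr)
    # backtrack from the best final state
    state = max(range(n), key=lambda i: delta[i])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return "".join(str(s + 1) for s in path)  # states numbered 1 and 2

print(viterbi("HTTHTTHHTTHTTTHHTHHTTHTTTTHTHHTHTHHTTTH"))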
Regarding our question, the solution should be one of 32 equally probable solutions:
211211 12 112111 12 1 12 112111121 12 121 12 1112
211211 12 112111 12 1 12 112111121 12 121 21 1112
211211 12 112111 12 1 12 112111121 21 121 12 1112
211211 12 112111 12 1 12 112111121 21 121 21 1112
211211 12 112111 12 1 21 112111121 12 121 12 1112
211211 12 112111 12 1 21 112111121 12 121 21 1112
211211 12 112111 12 1 21 112111121 21 121 12 1112
211211 12 112111 12 1 21 112111121 21 121 21 1112
211211 12 112111 21 1 12 112111121 12 121 12 1112
211211 12 112111 21 1 12 112111121 12 121 21 1112
211211 12 112111 21 1 12 112111121 21 121 12 1112
211211 12 112111 21 1 12 112111121 21 121 21 1112
211211 12 112111 21 1 21 112111121 12 121 12 1112
211211 12 112111 21 1 21 112111121 12 121 21 1112
211211 12 112111 21 1 21 112111121 21 121 12 1112
211211 12 112111 21 1 21 112111121 21 121 21 1112
211211 21 112111 12 1 12 112111121 12 121 12 1112
211211 21 112111 12 1 12 112111121 12 121 21 1112
211211 21 112111 12 1 12 112111121 21 121 12 1112
211211 21 112111 12 1 12 112111121 21 121 21 1112
211211 21 112111 12 1 21 112111121 12 121 12 1112
211211 21 112111 12 1 21 112111121 12 121 21 1112
211211 21 112111 12 1 21 112111121 21 121 12 1112
211211 21 112111 12 1 21 112111121 21 121 21 1112
211211 21 112111 21 1 12 112111121 12 121 12 1112
211211 21 112111 21 1 12 112111121 12 121 21 1112
211211 21 112111 21 1 12 112111121 21 121 12 1112
211211 21 112111 21 1 12 112111121 21 121 21 1112
211211 21 112111 21 1 21 112111121 12 121 12 1112
211211 21 112111 21 1 21 112111121 12 121 21 1112
211211 21 112111 21 1 21 112111121 21 121 12 1112
211211 21 112111 21 1 21 112111121 21 121 21 1112
Here are a C++ code, a Perl code and a Haskell code (from 2014 student Romain Edelman).