This part was produced using Jupyter.
If you are used to it, you can download the corresponding notebook code from here. If not, no problem: this is not mandatory at all; simply proceed as usual in your favorite Python environment.
In this practical session, our goals are to:
This practical session makes use of NLTK. Please refer to this part of the first practical session for the setup.
If you have not set up an environment in the previous PSs, follow these instructions; otherwise, you can skip to installing the required packages for this PS.
While you can install the following packages with pip directly on your computer, we recommend (but do not require) using a virtual environment, so as not to mess up the package versions of your other projects.
First make sure you have a virtual environment (e.g., venv, virtualenv, conda) and that it provides a Python version >= 3.6 (we recommend the latest, 3.12). If you are using a Jupyter Notebook, make sure the interpreter points to the correct python executable.
For example we use conda to manage our environments, and we do the following to create a new one and activate it:
conda create --name inlp-venv
conda activate inlp-venv
Then install the following packages (this might take around 2 minutes):
pip install -U ipykernel
pip install -U nltk
If you want to install them directly within the notebook, you can uncomment the following cell and run it (not advised, since you might not have the right kernel open for this notebook). The idea is that anything preceded with ! is run as a shell command:
# !pip install -U ipykernel
# !pip install -U nltk
And then download the tagsets in NLTK:
import nltk
nltk.download("tagsets")
nltk.download("tagsets_json")
nltk.download("punkt_tab")
nltk.download("averaged_perceptron_tagger_eng")
nltk.download("brown")
nltk.download("universal_tagset")
nltk.download("wordnet")
from nltk.corpus import brown
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer
For this exercise, we will be using the basic functionality of the built-in PoS tagger from NLTK. All you need to know for this part can be found in section 1 of chapter 5 of the NLTK book.
Note: NLTK provides documentation for each tag, which can be queried using the tag, e.g.
nltk.help.upenn_tagset('RB')
or some regular expression, e.g.
nltk.help.upenn_tagset('NN.*')
import nltk
# some headline
headline = "British Left Waffles on Falkland Islands"
nltk.help.upenn_tagset()
🎯 [to do: complete the above code to tag your headline phrase]
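One possible completion, as a minimal sketch (nltk.pos_tag uses the averaged perceptron tagger downloaded above; other solutions are possible):
# tokenize the headline, then run the default PoS tagger on the tokens
tokens = nltk.word_tokenize(headline)
print(nltk.pos_tag(tokens))
Look at the tags assigned to “Left” and “Waffles”: the headline is ambiguous, and the tagger has to commit to one reading.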
In this exercise, we will be exploring the interplay between PoS tagging, lemmatization, and vocabulary reduction.
Imagine you have a large corpus and a large number of queries of the form “Give me all documents in which all of the words x, y and z appear”. One way to answer such queries efficiently is to index the documents in the corpus: for instance, implement a function index() for which index(document, word) returns non-zero if and only if word occurs in document at least once.
One way to implement this kind of index is with a matrix in which each line corresponds to a document and each column corresponds to a word. When the query above is executed, only the lines for which the entries on the x, y and z columns are simultaneously not zero are returned.
The size of this matrix is number of documents × number of distinct words in the corpus. Clearly, the smaller the matrix, the faster the queries are executed. Ideally, we would like to reduce the number of words while not losing the ability to answer queries. One way to do this is to eliminate morphological variability: for instance, map both “goes” and “went” to the same column, “go”. This is called lemmatization. Part-of-speech tagging can help lemmatization avoid confusing similar surface forms that play different roles (e.g. “can” as in “a can of beer” versus “can” as in “I can do it”).
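As an illustration, here is a minimal, hypothetical sketch of such an index in plain Python (a sparse encoding of the matrix described above; the names documents, index and matching_docs are ours, not part of the exercise):
# toy corpus: one string per document
documents = ["he goes home", "she went home", "they go out"]

# index[word] = set of document ids in which the word occurs
# (each set is one column of the document x word matrix)
index = {}
for doc_id, text in enumerate(documents):
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

def matching_docs(*words):
    # documents in which all the query words appear
    sets = [index.get(w, set()) for w in words]
    return set.intersection(*sets) if sets else set()

print(matching_docs("goes", "home"))  # {0}: "went" is a different column
After lemmatization, “goes” and “went” would share the single column “go”, and the query (“go”, “home”) would match documents 0 and 1.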
Run
brown_tagged = brown.tagged_words(tagset="universal")
and see what you got; e.g.
brown_tagged[0:10]
Note that we are using the “universal” tagset because the original Brown tagset is somewhat unwieldy.
Now count the distinct word forms using NLTK's FreqDist class:
fdist_before = nltk.FreqDist(word.lower() for (word, tag) in brown_tagged)
This builds a frequency distribution from the words in the corpus (have a look!). To see how many distinct words are in the distribution, simply take the length:
len(fdist_before) ## should be 49815
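To have that look, you can for instance print the most frequent word forms:
print(fdist_before.most_common(10))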
Now, going through the tagged corpus (the brown_tagged variable above), lemmatize each token and add the lemma to a separate list. How many distinct lemmas does the corpus contain after lemmatization?
For the part-of-speech format needed for lemmatization, you might find the following function useful (do not forget to import wordnet from nltk.corpus):
# converts a pos tag from the 'universal' format to wordnet format
def get_wordnet_pos(universal_tag):
    if universal_tag == "VERB":
        return wordnet.VERB
    elif universal_tag == "ADJ":
        return wordnet.ADJ
    elif universal_tag == "ADV":
        return wordnet.ADV
    else:
        return wordnet.NOUN
🎯 [to do: write Python code to perform lemmatization of the Brown corpus and answer the above question.]
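A minimal sketch of one possible solution (the names lemmatizer, lemmas and fdist_after are ours; lemmatizing the full corpus can take a minute or two):
lemmatizer = WordNetLemmatizer()

# lemmatize every token, converting its universal tag to the wordnet format
lemmas = [
    lemmatizer.lemmatize(word.lower(), get_wordnet_pos(tag))
    for (word, tag) in brown_tagged
]

fdist_after = nltk.FreqDist(lemmas)
print(len(fdist_before), "->", len(fdist_after))  # vocabulary reduction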
In this exercise, we will see how adding context can improve the performance of automatic part-of-speech tagging. From the NLTK point of view, everything you need to know can be found in section 5 of chapter 5 of the book.
We will begin with a simple unigram tagger and build it up to a slightly more complex tagger. Unigram taggers are based on a simple statistical algorithm: for each token, assign the tag that is most likely for that particular token.
# import the tagged and untagged sentences
brown_tagged_sents = brown.tagged_sents(categories=["news", "fiction"])
print("Size of corpus: {0} sentences".format(len(brown_tagged_sents)))
# split the sentences into training and test sets
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
print("Size of training set: {0} sentencces".format(size))
# Tagged test sentences (used as reference)
test_sents = brown_tagged_sents[size:]
# Untagged test sentences
raw_sents = brown.sents(categories=["news", "fiction"])[size:]
🎯 [to do: train a unigram tagger on the training sentences and use it to tag the untagged test sentences (raw_sents). Observe that some words are not assigned a tag. Why not? What accuracy does this tagger reach on the test set (the evaluate() function may come in handy here)?]
🎯 [to do: write Python code to perform unigram tagger learning on the Brown corpus and answer the above questions.]
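A minimal sketch of the first steps (in recent NLTK versions, accuracy() replaces the deprecated evaluate()):
# train a unigram tagger on the training sentences
unigram_tagger = nltk.UnigramTagger(train_sents)

# tag an untagged test sentence; words never seen in training get None
print(unigram_tagger.tag(raw_sents[0]))

# accuracy on the held-out test set
print(unigram_tagger.accuracy(test_sents))  # use .evaluate() in older NLTK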
An easy way to mitigate the problem of some words not being assigned tags is to have the unigram tagger fall back on a guesser when it does not know what tag to assign to a word.
As its name suggests, a guesser is a PoS Tagger that assigns a tag to any token (be it a correct word or not). The use of a guesser as a fallback can improve the robustness of the PoS tagging system (i.e. the system will associate a PoS tag to every word, even if it was not seen beforehand, or if it is not in the lexicon). Another scenario in which a guesser could be helpful is in the initialization phase of the Brill tagging algorithm. In addition, the guesser’s standalone performance could be used as a baseline, when evaluating more complex taggers.
For instance, perhaps the simplest guesser is the one that just assigns the same tag to every word it encounters. This is called a “default tagger” in NLTK.
🎯 [to do: build a default tagger and evaluate it on the tagged test set (test_sents) defined in the previous section. Do not combine it yet with other taggers at this point. Report the accuracy. (Feel free to look at the examples in sections 4 and 5 from chapter 5 of the NLTK book.)]
🎯 [to do: How would you build a better guesser, while still not relying on any further training data?]
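As a sketch, the crudest guesser assigns the overall most frequent Brown tag ('NN') to every token:
# guess 'NN' (the most common tag) for everything
default_tagger = nltk.DefaultTagger("NN")
print(default_tagger.accuracy(test_sents))  # use .evaluate() in older NLTK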
One step further than the default tagger could be to assign tags to tokens on the basis of matching patterns. In other words, the form of the token is considered when guessing its part-of-speech. For example, it might guess that any word ending in “ed” is the past participle of a verb, and any word ending with “'s” is a possessive noun. This is called a “regular expression tagger” and is already available in NLTK.
🎯 [to do: build a regular expression tagger and report its accuracy on the test set. Then combine the unigram tagger with each of your guessers as fallback (cf. the t1 and uni_regexp taggers used below) and report the accuracies.]
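For reference, a sketch along the lines of section 4 of chapter 5 of the NLTK book (the patterns are illustrative, not exhaustive; the catch-all last pattern also makes it a default guesser):
patterns = [
    (r".*ing$", "VBG"),                # gerunds
    (r".*ed$", "VBD"),                 # simple past
    (r".*es$", "VBZ"),                 # 3rd singular present
    (r".*'s$", "NN$"),                 # possessive nouns
    (r".*s$", "NNS"),                  # plural nouns
    (r"^-?[0-9]+(\.[0-9]+)?$", "CD"),  # cardinal numbers
    (r".*", "NN"),                     # nouns (default)
]
regexp_tagger = nltk.RegexpTagger(patterns)
print(regexp_tagger.accuracy(test_sents))  # use .evaluate() in older NLTK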
When performing PoS tagging using a unigram-based tagger, the larger context in which the words appear is ignored. Unigram taggers consider only one item of context, in isolation. This poses a significant limitation on any tagger in this class: the most it can do is assign the a priori most likely tag to a given token. In a sense, the tag of each word is decided before the word is placed in a sentence. Therefore, the word “object” would be assigned the same part of speech every time, regardless of whether it appears as “the object” (where it should be a noun) or as “to object” (where it should be a verb).
An n-gram tagger picks the tag that is most likely in the given context of size n. An instance of an n-gram tagger is the bigram tagger, which considers groups of two tokens when deciding on the parts of speech. For example, we would expect a bigram tagger (as opposed to a unigram tagger) to find the correct tags for the two occurrences of the word “object” in the following sentence:
“I did not object to the object”
Indeed, an NLTK unigram tagger trained on the Brown corpus yields:
sentence = "I did not object to the object"
tokens = nltk.word_tokenize(sentence)
print(unigram_tagger.tag(tokens))
print(t1.tag(tokens))
print(uni_regexp.tag(tokens))
There is no way to get different tags for the two occurrences of “object”.
🎯 [to do: train a bigram tagger alone on the training set (train_sents) and check it on the above sentence (weird result expected, see below). Then evaluate its accuracy on the test set (test_sents).]
You have probably found a surprising result on our sample sentence and a very low accuracy for the bigram tagger when run alone. If you run the trained bigram tagger on a sentence it has not seen during training (e.g. bigram.tag(raw_sents[1])), you will notice that a majority of words will be tagged as “None”. When the bigram tagger encounters a new word, it is unable to assign a tag. It will also be unable to tag the following word, even if it was seen during training, because it was never seen preceded by a “None” tag. From this point on, all the words up to the end of the sentence will be tagged as “None” for the same reason. This explains the low accuracy of the bigram tagger when run alone.
The problem that arises when using n-grams is that once we pass to larger values of n, more data is needed for training, as the specificity of the context increases. In other words, there are many more possible bigrams than unigrams and thus the word/tag combinations that we need to consider when tagging are less likely to have appeared in the training set, as n increases. This is known as the sparse data problem, and is quite pervasive in NLP.
This means that, when using n-gram taggers, we face the following tradeoff: a larger n means more accurate predictions, at the cost of needing more training data; a smaller n leads to less accurate predictions, but requires less training data.
In order to overcome this limitation, we can train a bigram tagger that falls back on our “unigram+default” or “unigram+regexp” taggers defined in the previous section.
🎯 [to do: build a bigram tagger that falls back on your “unigram+default” (or “unigram+regexp”) tagger, evaluate it on the test set, and check the sample sentence again.]
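A minimal sketch of such a backoff chain, following the t0/t1/t2 naming convention of the NLTK book:
# each tagger falls back on the next one when it cannot decide
t0 = nltk.DefaultTagger("NN")
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)

print(t2.accuracy(test_sents))  # use .evaluate() in older NLTK
print(t2.tag(nltk.word_tokenize("I did not object to the object")))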
🎯 [to do: Code the Viterbi algorithm (in your favorite programming language) and apply it to the toy example provided on slide 8 of the lecture on HMM: two symbols (H, T), two states (the two coins)]
A = [[0.4, 0.6], [0.9, 0.1]]
B = [[0.49, 0.51], [0.85, 0.15]]
I = [0.5, 0.5]
🎯 [to do: What is (one of) the most probable sequence of states corresponding to the observation HTTHTTHHTTHTTTHHTHHTTHTTTTHTHHTHTHHTTTH?]
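For reference, here is a minimal Viterbi sketch in plain Python, assuming the rows of A are the source states, the rows of B correspond to the states with columns in the order (H, T), and I is the initial distribution (the state names coin1/coin2 are our own labels):
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][s]: log-probability of the best path ending in state s at time t
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # best predecessor of state s at time t
            prev, logp = max(
                ((r, V[t - 1][r] + math.log(trans_p[r][s])) for r in states),
                key=lambda x: x[1],
            )
            V[t][s] = logp + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # backtrack from the best final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return V[-1][last], path

states = ("coin1", "coin2")
start_p = {"coin1": 0.5, "coin2": 0.5}   # I
trans_p = {"coin1": {"coin1": 0.4, "coin2": 0.6},   # A
           "coin2": {"coin1": 0.9, "coin2": 0.1}}
emit_p = {"coin1": {"H": 0.49, "T": 0.51},          # B
          "coin2": {"H": 0.85, "T": 0.15}}

obs = "HTTHTTHHTTHTTTHHTHHTTHTTTTHTHHTHTHHTTTH"
logp, path = viterbi(obs, states, start_p, trans_p, emit_p)
print("".join("1" if s == "coin1" else "2" for s in path))
Log-probabilities are used instead of raw products to avoid numerical underflow on long observation sequences.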