Part-of-Speech Tagging Practical Session

This part was produced using Jupyter.
If you are used to it, you can download the corresponding notebook code from here. If not, no problem: it is not mandatory; simply proceed as usual in your favorite Python environment.

Introduction

In this practical session, our goals are to:

  • better understand the role and mechanisms behind PoS tagging;
  • explore applications of PoS tagging such as dealing with ambiguity or vocabulary reduction;
  • get accustomed to the Viterbi algorithm through a concrete example.

This practical session makes use of NLTK. Please refer to this part of the first practical session for the setup.

Setting up your environment

If you have not set up an environment in the previous PSs, follow these instructions. Otherwise, you can skip to installing the required packages for this PS.

While you can install the following packages directly with pip, we recommend (but do not require) using a virtual environment, so as not to mix up package versions across different projects.

First make sure you have a virtual environment (e.g., venv, virtualenv, conda) and that it runs Python >= 3.6 (we recommend the latest, 3.12). If you are using a Jupyter notebook, make sure the interpreter points to the correct Python executable.

For example, we use conda to manage our environments, and we do the following to create a new one and activate it:

conda create --name inlp-venv python=3.12
conda activate inlp-venv

Then install the following packages (this might take around 2 minutes):

pip install -U ipykernel
pip install -U nltk

If you want to install them directly within the notebook, you can uncomment the following cell and run it (not advised: you might not have the right kernel open for this notebook). Note that anything preceded with ! is run as a shell command:

In [ ]:
# !pip install -U ipykernel
# !pip install -U nltk

Then download the required tagsets and corpora in NLTK:

In [ ]:
import nltk

nltk.download("tagsets")
nltk.download("tagsets_json")
nltk.download("punkt_tab")
nltk.download("averaged_perceptron_tagger_eng")

nltk.download("brown")
nltk.download("universal_tagset")
nltk.download("wordnet")

from nltk.corpus import brown
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer

1. Parts of Speech and Ambiguity

For this exercise, we will be using the basic functionality of the built-in PoS tagger from NLTK. All you need to know for this part can be found in section 1 of chapter 5 of the NLTK book.

  1. Search the web for "ambiguous headlines", to find such gems as: "British Left Waffles on Falkland Islands", and "Juvenile Court to Try Shooting Defendant". That website (link) has some good ones.
  2. Manually tag (by yourself) these headlines. Do you think the information about part-of-speech tags would help a first reading?
  3. Use the built-in NLTK PoS tagger on the headlines. Does it find the correct tags in these cases?

Note: NLTK provides documentation for each tag, which can be queried using the tag, e.g.

nltk.help.upenn_tagset('RB')

or some regular expression, e.g.

nltk.help.upenn_tagset('NN.*')

In [ ]:
import nltk

# some headline
headline = "British Left Waffles on Falkland Islands"

nltk.help.upenn_tagset()

🎯 [to do: complete the above code to tag your headline phrase]
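
For instance, a minimal sketch of one possible completion, using nltk.word_tokenize and the default nltk.pos_tag tagger:

In [ ]:
# one possible completion: tokenize the headline, then tag it
tokens = nltk.word_tokenize(headline)
print(nltk.pos_tag(tokens))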

2. Vocabulary size and variability reduction

In this exercise, we will be exploring the interplay between PoS tagging, lemmatization, and vocabulary reduction.

Imagine you have a large corpus and a large number of queries of the form “Give me all documents in which all of the words x, y and z appear”. One way to answer such queries efficiently is to index the documents in the corpus: for instance, implement a function index() such that index(document, word) returns non-zero if and only if word occurs in document at least once.

One way to implement this kind of index is with a matrix in which each row corresponds to a document and each column corresponds to a word. When the query above is executed, only the rows for which the entries in the x, y and z columns are all non-zero are returned.

The size of this matrix is number of documents × number of distinct words that appear in the corpus. Clearly, the smaller the matrix, the faster the queries are executed. Ideally, we would like to reduce the number of words without losing the ability to answer queries. One way to do this is to eliminate morphological variability, for instance by mapping both “goes” and “went” to the same column, “go”. This is called "lemmatization". Part-of-speech tagging can help lemmatization avoid confusing similar surface forms with different roles (e.g. "can" as in "a can of beer" versus "can" as in "I can do it").
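
To make the idea concrete, here is a minimal sketch of such an index as a dictionary mapping each word to the set of documents containing it (the names build_index and query, and the documents argument, are illustrative only):

In [ ]:
# minimal sketch of an inverted index; `documents` is assumed to be
# a list of token lists (illustrative only)
def build_index(documents):
    index = {}
    for doc_id, tokens in enumerate(documents):
        for token in tokens:
            index.setdefault(token.lower(), set()).add(doc_id)
    return index

# ids of the documents in which all of the given words appear
def query(index, words):
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()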

  • To begin, we will need a corpus to work with. Let's make use of the Brown Corpus. Load the list of all word-tag tuples from the corpus:
In [ ]:
brown_tagged = brown.tagged_words(tagset="universal")

and see what you got; e.g.

In [ ]:
brown_tagged[0:10]

Note that we are using the “universal” tagset because the original Brown tagset is somewhat unwieldy.

  • Next, we want to see how many distinct words appear in the corpus. At this stage, morphological variants of the same word are counted separately (e.g. "go" and "went" count as two distinct words), but case should be ignored. One way to do this is to use the NLTK-supplied FreqDist class:
In [ ]:
fdist_before = nltk.FreqDist(word.lower() for (word, tag) in brown_tagged)

This builds a frequency distribution from the words in the corpus (have a look!). To see how many distinct words are in the distribution, simply take the length:

In [ ]:
len(fdist_before)  ## should be 49815
  • Now run the WordNet Lemmatizer on the tagged corpus. Note that this lemmatizer needs information about the part of speech of a token in order to lemmatize it correctly. One way of lemmatizing the entire corpus is to iterate through the list of (token, tag) pairs (the brown_tagged variable above), lemmatize each token, and add the lemma to a separate list. For the part-of-speech format needed for lemmatization, you might find the following function useful (do not forget to import wordnet from nltk.corpus):
In [ ]:
# converts a pos tag from the 'universal' format to wordnet format
def get_wordnet_pos(universal_tag):
    if universal_tag == "VERB":
        return wordnet.VERB
    elif universal_tag == "ADJ":
        return wordnet.ADJ
    elif universal_tag == "ADV":
        return wordnet.ADV
    else:
        return wordnet.NOUN
  • Count the distinct lemmatized words, for example using FreqDist. Do not forget to convert everything to lowercase. What percentage reduction did you obtain?

🎯 [to do: write Python code to perform lemmatization of the Brown corpus and answer the above question.]
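
For reference, a minimal sketch of one possible solution (variable names are ours), using the helper function above:

In [ ]:
lemmatizer = WordNetLemmatizer()

# lemmatize every (token, tag) pair of the corpus
lemmas = [
    lemmatizer.lemmatize(word.lower(), get_wordnet_pos(tag))
    for (word, tag) in brown_tagged
]

fdist_after = nltk.FreqDist(lemmas)
print("Distinct words before:", len(fdist_before))
print("Distinct lemmas after:", len(fdist_after))
print("Reduction: {0:.1f}%".format(100 * (1 - len(fdist_after) / len(fdist_before))))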

3. N-gram tagging

3.1 Unigram tagger

In this exercise, we will see how adding context can improve the performance of automatic part-of-speech tagging. From the NLTK point of view, everything you need to know can be found in section 5 of chapter 5 of the book.

We will begin with a simple unigram tagger and build it up to a slightly more complex tagger. Unigram taggers are based on a simple statistical algorithm: for each token, assign the tag that is most likely for that particular token.

  • The first thing we want to do is to consider a corpus to work with and separate it into a training set and a test set. We recommend the Brown corpus again, but this time it is a good idea to restrict it to one or two of its categories, otherwise the experiments would be too slow (we choose news and fiction categories below):
In [ ]:
# import the tagged and untagged sentences
brown_tagged_sents = brown.tagged_sents(categories=["news", "fiction"])
print("Size of corpus: {0} sentences".format(len(brown_tagged_sents)))

# split the sentences into training and test sets
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
print("Size of training set: {0} sentencces".format(size))

# Tagged test sentences (used as reference)
test_sents = brown_tagged_sents[size:]

# Untagged test sentences
raw_sents = brown.sents(categories=["news", "fiction"])[size:]
  • Train a unigram tagger (on the training set of sentences) and run it on some of the sentences from raw_sents. Observe that some words are not assigned a tag. Why not?
  • What is the accuracy of the tagger on the test set? (The accuracy() function, called evaluate() in older NLTK versions, may come in handy here.)

🎯 [to do: write Python code to perform unigram tagger learning on the Brown corpus and answer the above questions.]
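
For reference, a minimal sketch:

In [ ]:
# train a unigram tagger on the training sentences
unigram_tagger = nltk.UnigramTagger(train_sents)

# tag an unseen sentence: words never seen in training get the tag None
print(unigram_tagger.tag(raw_sents[1]))

# accuracy on the test set (evaluate() in older NLTK versions)
print(unigram_tagger.accuracy(test_sents))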

3.2 Guesser

An easy way to mitigate the problem of some words not being assigned tags is to have the unigram tagger fall back on a guesser when it does not know what tag to assign to a word.

As its name suggests, a guesser is a PoS tagger that assigns a tag to any token (be it a correct word or not). Using a guesser as a fallback improves the robustness of the PoS tagging system (i.e. the system will assign a PoS tag to every word, even one never seen beforehand or missing from the lexicon). Another scenario in which a guesser can be helpful is the initialization phase of the Brill tagging algorithm. In addition, the guesser’s standalone performance can serve as a baseline when evaluating more complex taggers.

For instance, perhaps the simplest guesser is the one that just assigns the same tag to every word it encounters. This is called a default tagger in NLTK.

  • Instantiate a default tagger and use it to tag the test sentences (test_sents) defined in the previous section. Do not combine it yet with other taggers at this point. Report the accuracy. (Feel free to look at the examples in sections 4 and 5 from chapter 5 of the NLTK book.)
  • Then use the default tagger as a fallback for a Unigram tagger. Report the change in accuracy.
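
A minimal sketch of these two steps ('NN', the Brown tag for singular nouns, is a common choice of default tag):

In [ ]:
# a default tagger that guesses "NN" for every token
t0 = nltk.DefaultTagger("NN")
print(t0.accuracy(test_sents))

# the same guesser used as fallback for a unigram tagger
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
print(t1.accuracy(test_sents))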

🎯 [to do: How would you build a better guesser, while still not relying on any further training data?]

One step further than the default tagger could be to assign tags to tokens on the basis of matching patterns. In other words, the form of the token is considered when guessing its part-of-speech. For example, it might guess that any word ending in “ed” is the past participle of a verb, and any word ending with “'s” is a possessive noun. This is called a regular expression tagger and is already available in NLTK.

🎯 [to do:]

  • 🎯 Instantiate a regular expression tagger and use it to tag the test sentences. You could use the same rules provided in the link above, or define some other rules of your own. Do not combine it yet with other taggers at this point. Report the accuracy.
  • 🎯 Now use the regular expression tagger as a fallback for a Unigram tagger. Report the change in accuracy.
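
A minimal sketch, reusing the rules from the NLTK book (feel free to modify them):

In [ ]:
# a few illustrative rules, taken from the NLTK book
patterns = [
    (r".*ing$", "VBG"),   # gerunds
    (r".*ed$", "VBD"),    # simple past
    (r".*es$", "VBZ"),    # 3rd singular present
    (r".*ould$", "MD"),   # modals
    (r".*'s$", "NN$"),    # possessive nouns
    (r".*s$", "NNS"),     # plural nouns
    (r"^-?[0-9]+(\.[0-9]+)?$", "CD"),  # cardinal numbers
    (r".*", "NN"),        # nouns (default)
]
regexp_tagger = nltk.RegexpTagger(patterns)
print(regexp_tagger.accuracy(test_sents))

# the regexp tagger as fallback for a unigram tagger
uni_regexp = nltk.UnigramTagger(train_sents, backoff=regexp_tagger)
print(uni_regexp.accuracy(test_sents))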

3.3 Bigrams and More

When performing PoS tagging with a unigram-based tagger, the larger context in which the words appear is ignored. Unigram taggers consider only one item of context, in isolation. This poses a significant limitation on any tagger in this class: the most it can do is assign the a priori most likely tag to a given token. In a sense, the tag of each word is decided before it is placed in a sentence. Therefore, the word “object” would be assigned the same part of speech every time, regardless of whether it appears as “the object” (should be a noun) or as “to object” (should be a verb).

An n-gram tagger picks the tag that is most likely in the given context of size n. One instance of an n-gram tagger is the bigram tagger, which considers groups of two tokens when deciding on the parts of speech. For example, we would expect a bigram tagger (as opposed to a unigram tagger) to find the correct tags for the two occurrences of the word “object” in the following sentence:

I did not object to the object

Indeed, the unigram-based taggers we trained above on the Brown corpus yield:

In [ ]:
sentence = "I did not object to the object"
tokens = nltk.word_tokenize(sentence)

# the taggers built in the previous sections
print(unigram_tagger.tag(tokens))  # unigram alone
print(t1.tag(tokens))              # unigram + default
print(uni_regexp.tag(tokens))      # unigram + regexp

None of them can assign different tags to the two occurrences of “object”.

🎯 [to do:]

  • 🎯 Train a bigram tagger alone on the training set (train_sents) and try it on the above sentence (a weird result is expected; see below).
  • 🎯 Then, find its accuracy on the test set (test_sents).
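
A minimal sketch (we call the tagger bigram, as in the discussion below):

In [ ]:
# a bigram tagger trained alone, with no fallback
bigram = nltk.BigramTagger(train_sents)

print(bigram.tag(tokens))           # the sample sentence from above
print(bigram.accuracy(test_sents))  # expect a surprisingly low score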

You have probably found a surprising result on our sample sentence and a very low accuracy for the bigram tagger when run alone. If you run the trained bigram tagger on a sentence it has not seen during training (e.g. bigram.tag(raw_sents[1])), you will notice that a majority of words are tagged as “None”. When the bigram tagger encounters a new word, it is unable to assign a tag. It is then also unable to tag the following word, even one seen during training, because it never saw that word preceded by a “None” tag. From this point on, all the words up to the end of the sentence will be tagged as “None” for the same reason. This explains the low accuracy of the bigram tagger when run alone.

The problem that arises when using n-grams is that, as we move to larger values of n, more training data is needed, because the specificity of the context increases. In other words, there are many more possible bigrams than unigrams, and thus the word/tag combinations that we need to consider when tagging become less and less likely to have appeared in the training set as n increases. This is known as the sparse data problem, and it is quite pervasive in NLP.

This means that, when using n-gram taggers, we face the following tradeoff: a larger n gives more accurate predictions, at the cost of requiring more training data; a smaller n gives less accurate predictions, but requires less training data.

In order to overcome this limitation, we can train a bigram tagger that falls back on our “unigram+default” or “unigram+regexp” taggers defined in the previous section.

🎯 [to do:]

  • 🎯 Try this out now (an example of a bigram tagger with fallback on “unigram+default” can be found in the book, section 5.5.4 “Combining Taggers”). Does this improve performance?
  • 🎯 What about adding a trigram tagger that falls back on the “bigram+unigram+default” or on “bigram+unigram+regexp”?
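
A minimal sketch for the “unigram+regexp” variant (the “unigram+default” one is analogous):

In [ ]:
# bigram tagger falling back on unigram+regexp
t2 = nltk.BigramTagger(train_sents, backoff=uni_regexp)
print(t2.accuracy(test_sents))

# trigram tagger falling back on bigram+unigram+regexp
t3 = nltk.TrigramTagger(train_sents, backoff=t2)
print(t3.accuracy(test_sents))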

4. Viterbi Algorithm

🎯 [to do: Code the Viterbi algorithm (in your favorite programming language) and apply it to the toy example provided on slide 8 of the lecture on HMMs: two symbols (H, T), two states (the two coins)]

A = [[0.4 0.6]; [0.9 0.1]],
B = [[0.49 0.51]; [0.85 0.15]],
I = [0.5 0.5].

🎯 [to do: What is (one of) the most probable sequence of states corresponding to the observation HTTHTTHHTTHTTTHHTHHTTHTTTTHTHHTHTHHTTTH?]
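
If you choose Python, here is one possible sketch, coding H as index 0 and T as index 1, and assuming (from the slide) that A[i][j] = P(next state j | current state i) and B[i][o] = P(symbol o | state i):

In [ ]:
# toy HMM from the slide
A = [[0.4, 0.6], [0.9, 0.1]]      # transition probabilities
B = [[0.49, 0.51], [0.85, 0.15]]  # emission probabilities
I = [0.5, 0.5]                    # initial state probabilities

def viterbi(obs, A, B, I):
    n = len(I)
    # delta[t][i]: probability of the best path ending in state i at time t
    delta = [[I[i] * B[i][obs[0]] for i in range(n)]]
    backptr = [[0] * n]
    for t in range(1, len(obs)):
        delta.append([0.0] * n)
        backptr.append([0] * n)
        for j in range(n):
            best = max(range(n), key=lambda i: delta[t - 1][i] * A[i][j])
            delta[t][j] = delta[t - 1][best] * A[best][j] * B[j][obs[t]]
            backptr[t][j] = best
    # backtrack from the most probable final state
    path = [max(range(n), key=lambda i: delta[-1][i])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(backptr[t][path[-1]])
    return list(reversed(path))

observation = "HTTHTTHHTTHTTTHHTHHTTHTTTTHTHHTHTHHTTTH"
obs = [0 if c == "H" else 1 for c in observation]
print(viterbi(obs, A, B, I))  # 0 = first coin, 1 = second coin

For much longer observation sequences, you would want to work with log probabilities to avoid numerical underflow; for this toy example, plain products are fine.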