Part-of-Speech Tagging Practical Session

This part was produced using Jupyter.
If you are used to it, you can download the corresponding notebook code from here. If not, no problem: this is not mandatory; simply proceed as usual in your favorite Python environment.

Introduction

In this practical session, our goals are to:

  • better understand the role and mechanisms behind PoS tagging;
  • explore applications of PoS tagging such as dealing with ambiguity or vocabulary reduction;
  • get accustomed to the Viterbi algorithm through a concrete example.

This practical session makes use of NLTK. Please refer to this part of the first practical session for setup instructions.

Setting up your environment

If you have not set up an environment in the previous PSs, follow these instructions. Otherwise, you can skip to installing the required packages for this PS.

While you can install the following packages with pip directly on your computer, we recommend (but do not require) using a virtual environment, so as not to mix up package versions across different projects.

First, make sure you have a Python version >= 3.6 (we recommend the latest, 3.12), inside a virtual environment (e.g., venv, virtualenv, conda) if you use one. If you are using a Jupyter Notebook, make sure the kernel points to the correct Python executable.

For example, we use conda to manage our environments and do the following to create a new one and activate it:

conda create --name inlp-venv
conda activate inlp-venv

Then install the following packages (this might take around 2 minutes):

pip install -U ipykernel
pip install -U nltk

If you want to install them directly from within the notebook, you can uncomment the following cell and run it (not advised, as you might not have the right kernel open for this notebook). The idea is that anything preceded by ! is run as a shell command:

In [1]:
# !pip install -U ipykernel
# !pip install -U nltk

And then download the tagsets in NLTK:

In [2]:
import nltk

nltk.download("tagsets")
nltk.download("tagsets_json")
nltk.download("punkt_tab")
nltk.download("averaged_perceptron_tagger_eng")

nltk.download("brown")
nltk.download("universal_tagset")
nltk.download("wordnet")

from nltk.corpus import brown
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer
[nltk_data] Downloading package tagsets to /home/chaps/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!
[nltk_data] Downloading package tagsets_json to
[nltk_data]     /home/chaps/nltk_data...
[nltk_data]   Unzipping help/tagsets_json.zip.
[nltk_data] Downloading package punkt_tab to /home/chaps/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /home/chaps/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package brown to /home/chaps/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package universal_tagset to
[nltk_data]     /home/chaps/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!
[nltk_data] Downloading package wordnet to /home/chaps/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!

1. Parts of Speech and Ambiguity

For this exercise, we will be using the basic functionality of the built-in PoS tagger from NLTK. All you need to know for this part can be found in section 1 of chapter 5 of the NLTK book.

  1. Search the web for "ambiguous headlines", to find such gems as: "British Left Waffles on Falkland Islands", and "Juvenile Court to Try Shooting Defendant". That website (link) has some good ones.
  2. Manually tag these headlines yourself. Do you think information about the part-of-speech tags would help on a first reading?
  3. Use the built-in NLTK PoS tagger on the headlines. Does it find the correct tags in these cases?

Note: NLTK provides documentation for each tag, which can be queried using the tag, e.g.

nltk.help.upenn_tagset('RB')

or some regular expression, e.g.

nltk.help.upenn_tagset('NN.*')
In [3]:
import nltk

# some headline
headline = "British Left Waffles on Falkland Islands"

nltk.help.upenn_tagset()
$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or conjunction, subordinating
    astride among uppon whether out inside pro despite on by throughout
    below within for towards near behind atop around if like until below
    next into if beside ...
JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...
JJR: adjective, comparative
    bleaker braver breezier briefer brighter brisker broader bumper busier
    calmer cheaper choosier cleaner clearer closer colder commoner costlier
    cozier creamier crunchier cuter ...
JJS: adjective, superlative
    calmest cheapest choicest classiest cleanest clearest closest commonest
    corniest costliest crassest creepiest crudest cutest darkest deadliest
    dearest deepest densest dinkiest ...
LS: list item marker
    A A. B B. C C. D E F First G H I J K One SP-44001 SP-44002 SP-44005
    SP-44007 Second Third Three Two * a b c d first five four one six three
    two
MD: modal auxiliary
    can cannot could couldn't dare may might must need ought shall should
    shouldn't will would
NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...
NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...
NNPS: noun, proper, plural
    Americans Americas Amharas Amityvilles Amusements Anarcho-Syndicalists
    Andalusians Andes Andruses Angels Animals Anthony Antilles Antiques
    Apache Apaches Apocrypha ...
NNS: noun, common, plural
    undergraduates scotches bric-a-brac products bodyguards facets coasts
    divestitures storehouses designs clubs fragrances averages
    subjectivists apprehensions muses factory-jobs ...
PDT: pre-determiner
    all both half many quite such sure this
POS: genitive marker
    ' 's
PRP: pronoun, personal
    hers herself him himself hisself it itself me myself one oneself ours
    ourselves ownself self she thee theirs them themselves they thou thy us
PRP$: pronoun, possessive
    her his mine my our ours their thy your
RB: adverb
    occasionally unabatingly maddeningly adventurously professedly
    stirringly prominently technologically magisterially predominately
    swiftly fiscally pitilessly ...
RBR: adverb, comparative
    further gloomier grander graver greater grimmer harder harsher
    healthier heavier higher however larger later leaner lengthier less-
    perfectly lesser lonelier longer louder lower more ...
RBS: adverb, superlative
    best biggest bluntest earliest farthest first furthest hardest
    heartiest highest largest least less most nearest second tightest worst
RP: particle
    aboard about across along apart around aside at away back before behind
    by crop down ever fast for forth from go high i.e. in into just later
    low more off on open out over per pie raising start teeth that through
    under unto up up-pp upon whole with you
SYM: symbol
    % & ' '' ''. ) ). * + ,. < = > @ A[fj] U.S U.S.S.R * ** ***
TO: "to" as preposition or infinitive marker
    to
UH: interjection
    Goodbye Goody Gosh Wow Jeepers Jee-sus Hubba Hey Kee-reist Oops amen
    huh howdy uh dammit whammo shucks heck anyways whodunnit honey golly
    man baby diddle hush sonuvabitch ...
VB: verb, base form
    ask assemble assess assign assume atone attention avoid bake balkanize
    bank begin behold believe bend benefit bevel beware bless boil bomb
    boost brace break bring broil brush build ...
VBD: verb, past tense
    dipped pleaded swiped regummed soaked tidied convened halted registered
    cushioned exacted snubbed strode aimed adopted belied figgered
    speculated wore appreciated contemplated ...
VBG: verb, present participle or gerund
    telegraphing stirring focusing angering judging stalling lactating
    hankerin' alleging veering capping approaching traveling besieging
    encrypting interrupting erasing wincing ...
VBN: verb, past participle
    multihulled dilapidated aerosolized chaired languished panelized used
    experimented flourished imitated reunifed factored condensed sheared
    unsettled primed dubbed desired ...
VBP: verb, present tense, not 3rd person singular
    predominate wrap resort sue twist spill cure lengthen brush terminate
    appear tend stray glisten obtain comprise detest tease attract
    emphasize mold postpone sever return wag ...
VBZ: verb, present tense, 3rd person singular
    bases reconstructs marks mixes displeases seals carps weaves snatches
    slumps stretches authorizes smolders pictures emerges stockpiles
    seduces fizzes uses bolsters slaps speaks pleads ...
WDT: WH-determiner
    that what whatever which whichever
WP: WH-pronoun
    that what whatever whatsoever which who whom whosoever
WP$: WH-pronoun, possessive
    whose
WRB: Wh-adverb
    how however whence whenever where whereby whereever wherein whereof why
``: opening quotation mark
    ` ``
In [4]:
# Tokenize and then tag the headline
tagged = nltk.pos_tag(nltk.word_tokenize(headline))
print(tagged)

# some help, maybe...
for word, tag in tagged:
    nltk.help.upenn_tagset(tag)
[('British', 'JJ'), ('Left', 'NNP'), ('Waffles', 'NNP'), ('on', 'IN'), ('Falkland', 'NNP'), ('Islands', 'NNP')]
JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...
NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...
NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...
IN: preposition or conjunction, subordinating
    astride among uppon whether out inside pro despite on by throughout
    below within for towards near behind atop around if like until below
    next into if beside ...
NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...
NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...
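
The same pipeline can be applied to the other headline mentioned above. A quick sketch (output not shown here; the variable names are our own, and the tags you obtain may vary with your NLTK version):

# tag the second example headline in the same way
headline2 = "Juvenile Court to Try Shooting Defendant"
tagged2 = nltk.pos_tag(nltk.word_tokenize(headline2))
print(tagged2)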

2. Vocabulary size and variability reduction

In this exercise, we will be exploring the interplay between PoS tagging, lemmatization, and vocabulary reduction.

Imagine you have a large corpus and a large number of queries of the form “Give me all documents in which all of the words x, y and z appear”. One way to answer such queries efficiently is to index the documents in the corpus: for instance implement a function called index() for which index(document, word) returns non-zero if and only if word occurs in document at least once.

One way to implement this kind of index is with a matrix in which each row corresponds to a document and each column corresponds to a word. When the query above is executed, only the rows whose entries in the x, y and z columns are all non-zero are returned.

The size of this matrix is number of documents × number of distinct words that appear in the corpus. Clearly, the smaller the matrix, the faster the queries are executed. Ideally, we would like to reduce the number of words while not losing the ability to answer queries. One way to do this is to eliminate morphological variability — for instance, map both “goes” and “went” to the same column “go”. This is called "lemmatization". Part-of-speech tagging can help lemmatization avoid confusing identical surface forms that play different roles (e.g. "can" as in "a can of beer" versus "I can do it").
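
To make the idea concrete, here is a minimal sketch, not part of the exercise, of such a boolean word-document index; it stores, for each word, the set of documents containing it (an inverted-index variant of the matrix described above). The names documents, build_index and query are our own:

# toy corpus: each string stands for one "document"
documents = [
    "the cat sat on the mat",
    "the dog went to the park",
    "the cat goes to the park",
]

# map each lowercased word to the set of document ids containing it
def build_index(docs):
    index = {}
    for doc_id, doc in enumerate(docs):
        for word in doc.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

# return the ids of documents in which all the query words appear
def query(index, *words):
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()

idx = build_index(documents)
print(query(idx, "the", "cat", "park"))  # -> {2}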

  • To begin, we will need a corpus to work with. Let's make use of the Brown Corpus. Load the list of all word-tag tuples from the corpus:
In [5]:
brown_tagged = brown.tagged_words(tagset="universal")

and see what you got; e.g.

In [6]:
brown_tagged[0:10]
Out[6]:
[('The', 'DET'),
 ('Fulton', 'NOUN'),
 ('County', 'NOUN'),
 ('Grand', 'ADJ'),
 ('Jury', 'NOUN'),
 ('said', 'VERB'),
 ('Friday', 'NOUN'),
 ('an', 'DET'),
 ('investigation', 'NOUN'),
 ('of', 'ADP')]

Note that we are using the “universal” tagset because the original Brown tagset is somewhat unwieldy.

  • Next, we want to see how many distinct words appear in the corpus. Variations of the same word should be counted separately here (e.g. "go" and "went" count as two distinct words), but case should be ignored. One way to do this is to use the NLTK-supplied FreqDist class:
In [7]:
fdist_before = nltk.FreqDist(word.lower() for (word, tag) in brown_tagged)

This builds a frequency distribution from the words in the corpus (have a look!). To see how many distinct words are in the distribution, simply take the length:
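
If you want to have a quick look at the distribution, you could for instance print its most frequent entries (just a sanity check, not required by the exercise):

# the ten most frequent lowercased words and their counts
print(fdist_before.most_common(10))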

In [8]:
len(fdist_before)  ## should be 49815
Out[8]:
49815
  • Now run the WordNet Lemmatizer on the tagged corpus. Note that this lemmatizer needs information about the part of speech of a token in order to correctly lemmatize it. One way of lemmatizing the entire corpus is to iterate through the list of (token, tag) pairs — brown_tagged variable above —, lemmatize each token and add the lemma to a separate list. For the part-of-speech format needed for lemmatization, you might find the following function useful (do not forget to import wordnet from nltk.corpus):
In [9]:
# converts a pos tag from the 'universal' format to wordnet format
def get_wordnet_pos(universal_tag):
    if universal_tag == "VERB":
        return wordnet.VERB
    elif universal_tag == "ADJ":
        return wordnet.ADJ
    elif universal_tag == "ADV":
        return wordnet.ADV
    else:
        return wordnet.NOUN
  • Count the distinct lemmatized words, for example using FreqDist. Do not forget to convert everything to lowercase. What percentage reduction did you obtain?
In [10]:
lem = WordNetLemmatizer()

# build a frequency distribution from the lowercase form of the lemmas
fdist_after = nltk.FreqDist(
    lem.lemmatize(word.lower(), get_wordnet_pos(tag)) for (word, tag) in brown_tagged
)

# how many words exist after lemmatization (we obtained ~21% reduction)
print(len(fdist_after))  ## should be 39358

print(
    100 * (len(fdist_before) - len(fdist_after)) / len(fdist_before)
)  ## should be 21%
39358
20.991669175951017

3. N-gram tagging

3.1 Unigram tagger

In this exercise, we will see how adding context can improve the performance of automatic part-of-speech tagging. From the NLTK point of view, everything you need to know can be found in section 5 of chapter 5 of the book.

We will begin with a simple unigram tagger and build it up to a slightly more complex tagger. Unigram taggers are based on a simple statistical algorithm: for each token, assign the tag that is most likely for that particular token.

  • The first thing we want to do is to choose a corpus to work with and separate it into a training set and a test set. We recommend the Brown corpus again, but this time it is a good idea to restrict it to one or two of its categories, otherwise the experiments would be too slow (we choose the news and fiction categories below):
In [11]:
# import the tagged and untagged sentences
brown_tagged_sents = brown.tagged_sents(categories=["news", "fiction"])
print("Size of corpus: {0} sentences".format(len(brown_tagged_sents)))

# split the sentences into training and test sets
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
print("Size of training set: {0} sentencces".format(size))

# Tagged test sentences (used as reference)
test_sents = brown_tagged_sents[size:]

# Untagged test sentences
raw_sents = brown.sents(categories=["news", "fiction"])[size:]
Size of corpus: 8872 sentences
Size of training set: 7984 sentences
  • Train a unigram tagger (on the training set of sentences) and run it on some of the sentences from raw_sents. Observe that some words are not assigned a tag. Why not?
  • What is the accuracy of the tagger on the test set (the accuracy() method may come in handy here; older NLTK versions call it evaluate())?
In [12]:
# train the tagger and test it on unseen sentences
unigram_tagger = nltk.UnigramTagger(train_sents)
print("Unigram tagger unseen sentence example:")
print(unigram_tagger.tag(raw_sents[10]))

print(
    "\nPerf. of unigram tagger alone: {0}".format(unigram_tagger.accuracy(test_sents))
)
Unigram tagger unseen sentence example:
[('``', '``'), ('Um', None), ("''", "''"), (',', ','), ('said', 'VBD'), ('the', 'AT'), ('old', 'JJ'), ('lady', 'NN'), (',', ','), ('and', 'CC'), ('brought', 'VBD'), ('her', 'PP$'), ('eyes', 'NNS'), ('down', 'RP'), ('to', 'TO'), ('the', 'AT'), ('tray', None), ('.', '.')]

Perf. of unigram tagger alone: 0.8502823106037104

3.2 Guesser

An easy way to mitigate the problem of some words not being assigned tags is to have the unigram tagger fall back on a guesser when it does not know what tag to assign to a word.

As its name suggests, a guesser is a PoS tagger that assigns a tag to any token (be it a correct word or not). The use of a guesser as a fallback can improve the robustness of the PoS tagging system (i.e. the system will associate a PoS tag with every word, even if it was not seen beforehand or is not in the lexicon). Another scenario in which a guesser could be helpful is the initialization phase of the Brill tagging algorithm. In addition, the guesser’s standalone performance can serve as a baseline when evaluating more complex taggers.

For instance, perhaps the simplest guesser is the one that just assigns the same tag to every word it encounters. This is called a default tagger in NLTK.

  • Instantiate a default tagger and use it to tag the test sentences (test_sents) defined in the previous section. Do not combine it with other taggers at this point. Report the accuracy. (Feel free to look at the examples in sections 4 and 5 from chapter 5 of the NLTK book.)
In [13]:
t0 = nltk.DefaultTagger("NN")
print("Perf. default tagger alone: {0}".format(t0.evaluate(test_sents)))
Perf. default tagger alone: 0.11230377861884966
  • Then use the default tagger as a fallback for a Unigram tagger. Report the change in accuracy.
In [14]:
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
print(
    "Perf. unigram tagger, fallback on default tagger: {0}".format(
        t1.accuracy(test_sents)
    )
)
Perf. unigram tagger, fallback on default tagger: 0.868834150276106

🎯 [to do: How would you build a better guesser, while still not relying on any further training data?]

Answer: We could make use of morphology, e.g. use regular expressions over word endings to decide which tag to guess, as done next.

One step further than the default tagger could be to assign tags to tokens on the basis of matching patterns. In other words, the form of the token is considered when guessing its part-of-speech. For example, it might guess that any word ending in “ed” is the past participle of a verb, and any word ending with “'s” is a possessive noun. This is called a regular expression tagger and is already available in NLTK.

🎯 [to do:]

  • 🎯 Instantiate a regular expression tagger and use it to tag the test sentences. You could use the same rules provided in the link above, or define some rules of your own. Do not combine it with other taggers at this point. Report the accuracy.
  • 🎯 Now use the regular expression tagger as a fallback for a Unigram tagger. Report the change in accuracy.
In [15]:
regexp_tagger = nltk.RegexpTagger(
    [
        (r"^-?[0-9]+(.[0-9]+)?$", "CD"),  # cardinal numbers
        (r"(The|the|A|a|An|an)$", "AT"),  # articles
        (r".*able$", "JJ"),  # adjectives
        (r".*ness$", "NN"),  # nouns formed from adjectives
        (r".*ly$", "RB"),  # adverbs
        (r".*s$", "NNS"),  # plural nouns
        (r".*ing$", "VBG"),  # gerunds
        (r".*ed$", "VBD"),  # past tense verbs
        (r".*", "NN"),  # nouns (default)
    ]
)
print("Perf. RegExp tagger alone: {0}".format(regexp_tagger.evaluate(test_sents)))

# evaluate a unigram tagger with a backoff to a regexp tagger
uni_regexp = nltk.UnigramTagger(train_sents, backoff=regexp_tagger)
print(
    "Perf. unigram tagger, fallback on regexp_tagger: {0}".format(
        uni_regexp.accuracy(test_sents)
    )
)
Perf. RegExp tagger alone: 0.2701495315505367
Perf. unigram tagger, fallback on regexp_tagger: 0.8877582676676801

3.3 Bigrams and More

When performing PoS tagging with a unigram-based tagger, the larger context in which the words appear is ignored. Unigram taggers consider only one item of context, in isolation. This poses a significant limitation on any tagger in this class: the most it can do is assign the a priori most likely tag to a given token. In a sense, the tag of each word is decided before it is placed in a sentence. Therefore, the word “object” would be assigned the same part of speech every time, regardless of whether it appears as “the object” (where it should be a noun) or as “to object” (where it should be a verb).

An n-gram tagger picks the tag that is most likely in the given context of size n. An instance of an n-gram tagger is the bigram tagger, which considers groups of two tokens when deciding on the parts of speech. For example, we would expect a bigram tagger (as opposed to a unigram tagger) to find the correct tags for the two occurrences of the word “object” in the following sentence:

I did not object to the object

Indeed, the NLTK unigram-based taggers trained above on the Brown corpus yield:

In [16]:
sentence = "I did not object to the object"
tokens = nltk.word_tokenize(sentence)

print(unigram_tagger.tag(tokens))
print(t1.tag(tokens))
print(uni_regexp.tag(tokens))
[('I', 'PPSS'), ('did', 'DOD'), ('not', '*'), ('object', 'VB'), ('to', 'TO'), ('the', 'AT'), ('object', 'VB')]
[('I', 'PPSS'), ('did', 'DOD'), ('not', '*'), ('object', 'VB'), ('to', 'TO'), ('the', 'AT'), ('object', 'VB')]
[('I', 'PPSS'), ('did', 'DOD'), ('not', '*'), ('object', 'VB'), ('to', 'TO'), ('the', 'AT'), ('object', 'VB')]

There is no way for these taggers to assign different tags to the two occurrences of "object".

🎯 [to do:]

  • 🎯 Train a bigram tagger alone on the training set (train_sents) and check it on the above sentence (a weird result is expected, see below).
  • 🎯 Then, find its accuracy on the test set (test_sents).
In [17]:
bigram = nltk.BigramTagger(train_sents)

print(bigram.tag(tokens))

print("\nPerf. bigram tagger alone: {0}".format(bigram.accuracy(test_sents)))
[('I', 'PPSS'), ('did', 'DOD'), ('not', '*'), ('object', 'VB'), ('to', 'TO'), ('the', None), ('object', None)]

Perf. bigram tagger alone: 0.19631445058013278

You have probably found a surprising result on our sample sentence and a very low accuracy for the bigram tagger when run alone. If you run the trained bigram tagger on a sentence it has not seen during training (e.g. bigram.tag(raw_sents[1])), you will notice that a majority of the words are tagged as None. When the bigram tagger encounters a new word, it is unable to assign a tag. It is then also unable to tag the following word — even if that word was seen during training — because it was never seen preceded by a None tag. From this point on, all the words up to the end of the sentence will be tagged as None for the same reason. This explains the low accuracy of the bigram tagger when run alone.
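
You can check this behaviour directly; a quick sketch (output not shown):

# most tokens of a previously unseen sentence come back tagged as None
print(bigram.tag(raw_sents[1]))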

The problem that arises when using n-grams is that once we pass to larger values of n, more data is needed for training, as the specificity of the context increases. In other words, there are many more possible bigrams than unigrams and thus the word/tag combinations that we need to consider when tagging are less likely to have appeared in the training set, as n increases. This is known as the sparse data problem, and is quite pervasive in NLP.

This means that, when using n-gram taggers, we face the following tradeoff: a larger n allows more accurate predictions, at the cost of requiring more training data; a smaller n leads to less accurate predictions but requires less training data.

In order to overcome this limitation, we can train a bigram tagger that falls back on our “unigram+default” or “unigram+regexp” taggers defined in the previous section.

🎯 [to do:]

  • 🎯 Try this out now (an example of a bigram tagger with fallback on “unigram+default” can be found in the book, section 5.5.4 “Combining Taggers”). Does this improve performance?
  • 🎯 What about adding a trigram tagger that falls back on the “bigram+unigram+default” or on “bigram+unigram+regexp”?
In [18]:
# evaluate a bigram tagger with a backoff to the unigram+default tagger
t2 = nltk.BigramTagger(train_sents, backoff=t1)
print(
    "Perf. bigram tagger,  fallback on        unigram+default tagger: {0}".format(
        t2.accuracy(test_sents)
    )
)

# evaluate a trigram tagger with a backoff to the bigram+unigram+default tagger
t3 = nltk.TrigramTagger(train_sents, backoff=t2)
print(
    "Perf. trigram tagger, fallback on bigram+unigram+default tagger: {0}".format(
        t3.accuracy(test_sents)
    )
)

print("")
# evaluate a bigram tagger with a backoff to the unigram+regexp tagger
t2_regexp = nltk.BigramTagger(train_sents, backoff=uni_regexp)
print(
    "Perf. bigram tagger,  fallback on        unigram+regexp tagger: {0}".format(
        t2_regexp.accuracy(test_sents)
    )
)

# evaluate a trigram tagger with a backoff to the bigram+unigram+regexp tagger
t3_regexp = nltk.TrigramTagger(train_sents, backoff=t2_regexp)
print(
    "Perf. trigram tagger, fallback on bigram+unigram+regexp tagger: {0}".format(
        t3_regexp.accuracy(test_sents)
    )
)

## and on our sample sentence:
print(t2.tag(tokens))
Perf. bigram tagger,  fallback on        unigram+default tagger: 0.8848420922007818
Perf. trigram tagger, fallback on bigram+unigram+default tagger: 0.8845939070546628

Perf. bigram tagger,  fallback on        unigram+regexp tagger: 0.9036421170192964
Perf. trigram tagger, fallback on bigram+unigram+regexp tagger: 0.9036421170192964
[('I', 'PPSS'), ('did', 'DOD'), ('not', '*'), ('object', 'VB'), ('to', 'TO'), ('the', 'AT'), ('object', 'NN')]

4. Viterbi Algorithm

🎯 [to do: Code the Viterbi algorithm (in your favorite programming language) and apply it to the toy example provided on slide 8 of the lecture on HMMs: two symbols (H, T), two states (the two coins)]

A = [[0.4 0.6]; [0.9 0.1]],
B = [[0.49 0.51]; [0.85 0.15]],
I = [0.5 0.5].

🎯 [to do: What is (one of) the most probable sequence of states corresponding to the observation HTTHTTHHTTHTTTHHTHHTTHTTTTHTHHTHTHHTTTH?]

Here (link) are some slides illustrating the Viterbi algorithm at work on a concrete, simple example.

Regarding our question, the solution should be one of 32 equally probable solutions:

211211 12 112111 12 1 12 112111121 12 121 12 1112
211211 12 112111 12 1 12 112111121 12 121 21 1112
211211 12 112111 12 1 12 112111121 21 121 12 1112
211211 12 112111 12 1 12 112111121 21 121 21 1112
211211 12 112111 12 1 21 112111121 12 121 12 1112
211211 12 112111 12 1 21 112111121 12 121 21 1112
211211 12 112111 12 1 21 112111121 21 121 12 1112
211211 12 112111 12 1 21 112111121 21 121 21 1112
                                             
211211 12 112111 21 1 12 112111121 12 121 12 1112
211211 12 112111 21 1 12 112111121 12 121 21 1112
211211 12 112111 21 1 12 112111121 21 121 12 1112
211211 12 112111 21 1 12 112111121 21 121 21 1112
211211 12 112111 21 1 21 112111121 12 121 12 1112
211211 12 112111 21 1 21 112111121 12 121 21 1112
211211 12 112111 21 1 21 112111121 21 121 12 1112
211211 12 112111 21 1 21 112111121 21 121 21 1112
                                             
211211 21 112111 12 1 12 112111121 12 121 12 1112
211211 21 112111 12 1 12 112111121 12 121 21 1112
211211 21 112111 12 1 12 112111121 21 121 12 1112
211211 21 112111 12 1 12 112111121 21 121 21 1112
211211 21 112111 12 1 21 112111121 12 121 12 1112
211211 21 112111 12 1 21 112111121 12 121 21 1112
211211 21 112111 12 1 21 112111121 21 121 12 1112
211211 21 112111 12 1 21 112111121 21 121 21 1112
                                             
211211 21 112111 21 1 12 112111121 12 121 12 1112
211211 21 112111 21 1 12 112111121 12 121 21 1112
211211 21 112111 21 1 12 112111121 21 121 12 1112
211211 21 112111 21 1 12 112111121 21 121 21 1112
211211 21 112111 21 1 21 112111121 12 121 12 1112
211211 21 112111 21 1 21 112111121 12 121 21 1112
211211 21 112111 21 1 21 112111121 21 121 12 1112
211211 21 112111 21 1 21 112111121 21 121 21 1112

Here are C++, Perl and Haskell implementations (the Haskell one from 2014 student Romain Edelman).
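
If you prefer Python, below is a minimal NumPy-based sketch of the Viterbi algorithm applied to the toy example above (NumPy is an extra dependency not installed earlier; the viterbi function name and output format are our own choices). On ties it keeps the first maximizing state, so it should print one of the equally probable state sequences listed above.

import numpy as np

# HMM parameters from the exercise: two states (the two coins), two symbols (H, T)
A = np.array([[0.4, 0.6],      # transition probabilities P(next state | current state)
              [0.9, 0.1]])
B = np.array([[0.49, 0.51],    # emission probabilities P(symbol | state), columns = (H, T)
              [0.85, 0.15]])
I = np.array([0.5, 0.5])       # initial state distribution

def viterbi(obs, A, B, I):
    """Return (best_path, best_log_prob) for a sequence of symbol indices."""
    n_states, T = len(I), len(obs)
    logA, logB, logI = np.log(A), np.log(B), np.log(I)   # work in log space to avoid underflow
    delta = np.zeros((T, n_states))                      # best log-prob of a path ending in each state
    psi = np.zeros((T, n_states), dtype=int)             # back-pointers (best previous state)
    delta[0] = logI + logB[:, obs[0]]
    for t in range(1, T):
        for j in range(n_states):
            scores = delta[t - 1] + logA[:, j]
            psi[t, j] = np.argmax(scores)
            delta[t, j] = scores[psi[t, j]] + logB[j, obs[t]]
    # backtrack from the best final state
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    path.reverse()
    return path, float(delta[-1].max())

observation = "HTTHTTHHTTHTTTHHTHHTTHTTTTHTHHTHTHHTTTH"
obs = [0 if c == "H" else 1 for c in observation]   # H -> column 0, T -> column 1

best_path, best_logp = viterbi(obs, A, B, I)
print("".join(str(s + 1) for s in best_path))       # states printed as 1/2, as in the solutions above
print("log-probability of the best path:", best_logp)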