Introduction¶

The purpose of this practical session is to gain more insight into using parsers and grammars and to get familiarized with the tools provided to this end by NLTK. Our goals are to:

Understand the process of parsing better by doing some examples by hand.
Gain insight into grammars by writing a grammar of our own.
Experiment with training a grammar from a corpus and evaluating it.

The practical session is organized in four sections.

The warm-up, where you are going to use a given toy grammar to generate parse trees and get a chance to compare your intuition with the results provided by the NLTK parsers.
Then, you will need to design and test your own grammar, again, for a small example.
Next, you will automatically extract a SCFG from a corpus of annotated sentences.
Finally, you are going to explore ways to improve automatically extracted grammars, through error analysis.

Before diving in, it would be useful to have a look at how NLTK handles parsing. At the very end of this page, you will find a summary of the most useful commands. For more helpful explanations and examples, feel free to consult the NLTK Book , Chapter 8 (along with the chapter extras ) , Sections 1, 2, 3, 4.1, 4.2, 6.

At the end of this document, you can find a summary of useful NLTK commands .

Section 1: Warm-up¶

In general, given a raw text, would it be possible to directly use a parser, in order to generate a parse tree?
- If not, explain what pre-processing steps are necessary to perform on raw texts prior to using a parser.
- If yes, explain why it is possible.

Have a look at the texts and at the grammar below.

Input texts:

I saw an elephant
I saw an elephant in my pajamas

Grammar:

S   -> NP VP     [1.0]
PP  -> P NP      [1.0]
NP  -> Det N     [0.4]
NP  -> Det N PP  [0.2]
VP  -> V NP      [0.5]
VP  -> VP PP     [0.5]

NP  -> 'I'        [0.4]
Det -> 'an'       [0.5]
Det -> 'my'       [0.5]
N   -> 'elephant' [0.5]
N   -> 'pajamas'  [0.5]
V   -> 'saw'      [1.0]
P   -> 'in'       [1.0]

Using the grammar, manually construct the parse tree(s) for the two sentences above; apply whichever method you find suitable.
Now, it is time to verify your intuition and compare your solutions to the solutions obtained using existing NLTK parsers. Try parsing with the NLTK Chart Parser and then with the NLTK Viterbi parser.

Examine the output of the parsers. Compare the output to your intuitive parse trees. Is there any difference?

You can have a look at the general parser API here . More details about the implementations of the two parsers can be found here and here.
Using the NLTK GUI, draw the most probable parse trees found by the NLTK parsers and compare them to the trees you have found manually. How are they different? Do they differ only in the names of the nonterminals use, or is a structural difference as well? Why do you think these differences occur?

Implementation hints:¶

First, create the grammar in NLTK, starting from the provided rules. Next, you need to instantiate the two parsers with this particular grammar. Then, you will be ready to generate the parse trees for the two example sentences and to display them graphically through the draw() function.
If you try parsing other sentences, please do not add punctuation, because punctuation is not included in the grammar.

Section 2: Hand-written grammars¶

You are given the following sentences:

Alice wondered with Bob in the empty city streets
Bob offered Alice an iguana for her birthday
Alice gave an inspiring speech at the conference on education

Design a non-probabilistic context-free grammar for parsing these sentences and then manually provide the parse tree(s) for these sentences. Here is a list of lexical rules and syntactic categories that you could use to devise the grammar:
Lexical rules:
```
V    -> 'wondered' | 'offered' | 'gave'
NP   -> 'Alice' | 'Bob'
N    -> 'city' | 'streets' | 'iguana' | 'birthday' | 'speech' | 'conference' | 'education'
Adj  -> 'empty' | 'inspiring'
Det  -> 'the' | 'an' | 'her'
Prep -> 'with' | 'in' | 'for' | 'at' | 'on'
```
Suggested syntactic categories (non-terminals):
```
 S - Sentence 
 NP - Noun Phrase 
 VP - Verb Phrase 
 PNP - Prepositional Noun Phrase
```
However, if you have other ideas, feel free to come up with your own lexical and syntactical rules.
With the grammar you designed in point (1.), parse the above sentences using NLTK’s Chart parser, described in Section 1.
Display and compare the parse trees obtained in points (1.) and (2.).

Section 3: Extracting a Grammar from a Corpus¶

In the previous section, we saw that even for a very small example designing grammars manually can be time consuming and does not necessarily have the best results. In this section, we explore a different approach to creating grammars: automatic extraction from annotated texts.

For this, we first need to load an annotated text to the workspace. We will work with an existing treebank from the NLTK framework. This is a ~5% fragment of the Penn Treebank . It contains data from Wall Street Journal for 1650 sentences. Let us have a look at an examples of the type of data in the Penn Treebank, shown below:

[ The files in this hierarchy were automatically created by inserting the part of speech tags from a tagged text file (.pos file) into a parsed text file (.prd file). ]

( (S
    (NP-SBJ
      (NP (NNP Pierre) (NNP Vinken) )
      (, ,)
      (ADJP
        (NP (CD 61) (NNS years) )
        (JJ old) )
      (, ,) )
    (VP (MD will)
      (VP (VB join)
        (NP (DT the) (NN board) )
        (PP-CLR (IN as)
          (NP (DT a) (JJ nonexecutive) (NN director) ))
        (NP-TMP (NNP Nov.) (CD 29) )))
    (. .) ))

Your goal is to automatically extract the CFG out of the treebank and then to evaluate and improve the extracted grammar (Section 4).

First, split the corpus into two parts : one for learning the grammar (learning set) and one for evaluating it (test set). For the sake of saving time during this practical session, use 3% of the total number of sentences for the test set. Here is how you could make the split:

In [ ]:

## If not done yet, download the treebank:
# nltk.download('treebank')

treebank = nltk.corpus.treebank
dataset_size = len(treebank.parsed_sents())

## here, we define the split percentage for the training set and the
## learning set, in our case ~3% and ~97%

split_size = int(dataset_size * 0.97)
learning_set = treebank.parsed_sents()[:split_size]
test_set = treebank.parsed_sents()[split_size:]

Note that test_set already contains the parses of the sentences . This is what we are going to compare against, in order to assess the quality of the extracted grammar. In addition, we need the raw (not parsed) format of the sentences in the test set. These raw sentences will be parsed with the extracted grammar and then compared against the reference, stored in test_set.

In [ ]:

# create a set containing the raw sentences
sents = treebank.sents()
raw_test_set = [ [ w for w in sents[i] ] for i in range(split_size, dataset_size) ]

Extract the grammar out of the learning corpus . At this point, it might be useful to have another look at the commands provided in the end of the document. Here is a suggestion on how to do it. Note that we still need the entire lexicon (i.e. rules of the type PoS -> ) even though the syntactic rules are extracted from only a portion of the treebank. Assume that you have already separated the learning set from the test set and that the learning set is stored in the learning_set variable.

In [ ]:

from nltk.tree import Tree
from nltk.grammar import Nonterminal

# This is where we will store all of the productions necessary to
# construct the PCFG.
tbank_productions = []

# For all of the (parsed) sentences in the learning set, extract the
# productions (i.e. extract the rules).
for sent in learning_set:
	for production in sent.productions():
		tbank_productions.append(production)

# Now, we will add the lexical rules for the ENTIRE lexicon.
for word, tag in treebank.tagged_words():

	# for each tagged word, we create a tree containing that
	# lexical rule, in order to be able to add it to our
	# production set tbank_productions.

	t = Tree.fromstring("("+ tag + " " + word  +")")
	for production in t.productions():
		tbank_productions.append(production)

# At this point, we have the syntactic rules extracted from the
# learning set and all of the lexical rules. We are ready to extract
# the PCFG.
tbank_grammar = nltk.grammar.induce_pcfg(Nonterminal('S'), tbank_productions)

Try to have a look at the grammar. How many rules are there? Imagine you would have to correct it, for instance. You certainly will come to the conclusion that this is not feasible. Indeed, in such a framework (automated extraction) the grammars should not be modified by hand (maybe not even read), but only the corpus has to be changed/corrected/extended.

Evaluate the performance of the grammars on the test set (the 3% of the sentences from the treebank that you isolated in the beginning of the exercise). What proportion of sentences from the test set have been parsed as in the reference (i.e. are correctly parsed) by the extracted grammar?
We now want to study the size of the grammar with respect to the size of the learning set. Split the former learning set (i.e. the 95% of the treebank) into two parts: one kept for actual learning and another part which is not used. Do this for several percentages of the learning set (10%, 50%, 70%, 80%, 90%, 100%), several times for each ratio. H ave a look at the resulting grammars and try to understand their differences.

You will probably come to the conclusion that this is not an appropriate way to work. Indeed, in such a framework (automated extraction) the grammars should not be modified by hand. Only the corpus should be changed, corrected or extended. This is the goal of the next (and last) part.

Section 4 (Optional): Error Analysis, Improving the Grammar¶

In this section, the goal is to achieve a deeper understanding of results obtained in the previous section. Take the grammar that is learned based on 100% of the learning set and a subset of sentences from the test set (~10 sentences) .

Is the reference (i.e. the parse in the corpus, for the chosen sentence in the test set) correct according to you?
If the reference is correct:
- if the result of the parsing is incorrect, this either means that the correct parse cannot be constructed or that it is not the best one.
- In both cases, we can try to fix it by adding the corresponding parse to the learning set.
- Add the corresponding correct pars e to the learning set and learn again as long as you do not get the correct pars e as a result of the learning.
- As you are doing this, is the performance improving on the rest of the test set?
If the reference is incorrect, correct it in the test set and start the evaluation again. Iterate the process and study the performance on the test set.

Useful NLTK commands¶

Grammar creation from a given string:

g = nltk.CFG.fromstring("""<grammar rules here>""")

Use PCFG instead of CFG if you write a probabilistic context-free grammar.

Tokenize a sentence:

tokens = nltk.word_tokenize("<sentence>")

Create different types of parsers starting from a grammar:

chart_parser   = nltk.ChartParser(g)
viterbi_parser = nltk.ViterbiParser(g)

Parse a tokenized text:

trees = viterbi_parser.parse(tokens) #for example, for the Viterbi parser

Display the parse trees: [ If not done yet, you first need to install this library: http://tkinter.unpythonic.net/wiki/How_to_install_Tkinter ]

The command below will generate a popup window, containing the graphical representation of the tree; while the parse tree drawing window is open, your script will be paused. The script will be resumed when you close the window.

from nltk.tree import *
from nltk.draw import tree
tree.draw()

Parsing Practical Session¶

Introduction¶

Section 1: Warm-up¶

Implementation hints:¶

Section 2: Hand-written grammars¶

Section 3: Extracting a Grammar from a Corpus¶

SOLUTION:¶

Section 4 (Optional): Error Analysis, Improving the Grammar¶

Useful NLTK commands¶