This part was produced using Jupyter.
If you are used to it, you can download the corresponding notebook code from here. If not, no problem at all, this is not mandatory at all: simply proceed as usual in your favorite Python environment.
The purpose of this practical session is to gain more insight into using parsers and grammars and to get familiarized with the tools provided to this end by NLTK. Our goals are to:
The practical session is organized in four sections.
Before diving in, it would be useful to have a look at how NLTK handles parsing. At the very end of this page, you will find a summary of the most useful commands. For more helpful explanations and examples, feel free to consult the NLTK Book , Chapter 8 (along with the chapter extras ) , Sections 1, 2, 3, 4.1, 4.2, 6.
At the end of this document, you can find a summary of useful NLTK commands .
In general, given a raw text, would it be possible to directly use a parser, in order to generate a parse tree?
ANSWER: Strictly speaking, it's not possible to parse raw text directly. Several steps need to be taken in order to arrive to this stage. First, the text must be split into tokens; here, one needs to be careful with potentially ambiguous separators. Then, if the available grammar terminals are Part-of-Speech tags only (as opposed to the actual words), one needs to first perform PoS tagging on the text to be parsed.
In this practical session, the PoS tagging part can be omitted, since the example grammars already come with rules for the words (i.e. grammar terminals indeed correspond to actual words).
Have a look at the texts and at the grammar below.
Input texts:
I saw an elephant
I saw an elephant in my pajamas
Grammar:
S -> NP VP [1.0]
PP -> P NP [1.0]
NP -> Det N [0.4]
NP -> Det N PP [0.2]
VP -> V NP [0.5]
VP -> VP PP [0.5]
NP -> 'I' [0.4]
Det -> 'an' [0.5]
Det -> 'my' [0.5]
N -> 'elephant' [0.5]
N -> 'pajamas' [0.5]
V -> 'saw' [1.0]
P -> 'in' [1.0]
Using the grammar, manually construct the parse tree(s) for the two sentences above; apply whichever method you find suitable.
Now, it is time to verify your intuition and compare your solutions to the solutions obtained using existing NLTK parsers. Try parsing with the NLTK Chart Parser and then with the NLTK Viterbi parser.
Examine the output of the parsers. Compare the output to your intuitive parse trees. Is there any difference?
You can have a look at the general parser API here . More details about the implementations of the two parsers can be found here and here.
Using the NLTK GUI, draw the most probable parse trees found by the NLTK parsers and compare them to the trees you have found manually. How are they different? Do they differ only in the names of the nonterminals use, or is a structural difference as well? Why do you think these differences occur?
First, create the grammar in NLTK, starting from the provided rules. Next, you need to instantiate the two parsers with this particular grammar. Then, you will be ready to generate the parse trees for the two example sentences and to display them graphically through the draw() function.
If you try parsing other sentences, please do not add punctuation, because punctuation is not included in the grammar.
import nltk
from nltk.tree import *
from nltk.draw import tree
# Grammar creation
grammar = nltk.PCFG.fromstring("""
S -> NP VP [1.0]
PP -> P NP [1.0]
NP -> Det N [0.4]
NP -> Det N PP [0.2]
NP -> 'I' [0.4]
VP -> V NP [0.5]
VP -> VP PP [0.5]
Det -> 'an' [0.5]
Det -> 'my' [0.5]
N -> 'elephant' [0.5]
N -> 'pajamas' [0.5]
V -> 'saw' [1.0]
P -> 'in' [1.0]
""")
# Import example sentences to NLTK and tokenize them
str_sentence1 = "I saw an elephant"
str_sentence2 = "I saw an elephant in my pajamas"
print("Example sentences")
print(str_sentence1)
print(str_sentence2)
tokens1 = nltk.word_tokenize(str_sentence1)
tokens2 = nltk.word_tokenize(str_sentence2)
# Create the Chart and Viterbi parsers, with the input grammar
chart_parser = nltk.ChartParser(grammar)
viterbi_parser = nltk.ViterbiParser(grammar)
# Results for the Chart Parser
print("Parse trees obtained with the Chart parser")
print("Sentence 1")
for tree in chart_parser.parse(tokens1):
print(tree)
tree.draw()
print("Sentence 2")
for tree in chart_parser.parse(tokens2):
print(tree)
tree.draw()
#Results for the Viterbi Parser
print("Parse trees obtained with the Viterbi parser")
print("Sentence 1")
for tree in viterbi_parser.parse(tokens1):
print(tree)
tree.draw()
print("Sentence 2")
for tree in viterbi_parser.parse(tokens2):
print(tree)
tree.draw()
Here are the trees obtained:
Chart Parser


Viterbi Parser



Before proceeding with the parse analysis, it is important to have a clear idea of what grammars and parse trees represent. Grammars contain sets of rules, which are used to infer the structure of sentences in general. One or more parse trees can be associated to a certain sentence. The parse trees show a possible structure of that particular sentence, which was obtained by applying a sequence of the grammar rules. The root of the tree is always the sentence (S) symbol, because a valid parse needs to yield a valid sentence structure. The leafs of the trees are terminal symbols (i.e words), which are immediately linked to pre-terminal symbols (i.e. PoS tags). The nodes in between the root and the leaves represent the different grammar rules which were applied to obtain that particular parse.
We can notice that for the first sentence, the two the parsers agree on the result. This is due to the more simple structure of the sentence. For the second sentence, the Chart parser finds two parse trees, while the Viterbi parser outputs only one result. This occurs because the Viterbi parser only computes the most probable parse of the sentences, while the Chart parser is not concerned with the probabilities and it searches for all parse possibilities. In this case, we can notice that the most probable parse for the second sentence is the second parse produced by the Chart parser.
You are given the following sentences:
Alice wondered with Bob in the empty city streets
Bob offered Alice an iguana for her birthday
Alice gave an inspiring speech at the conference on education
Design a non-probabilistic context-free grammar for parsing these sentences and then manually provide the parse tree(s) for these sentences. Here is a list of lexical rules and syntactic categories that you could use to devise the grammar:
Lexical rules:
V -> 'wondered' | 'offered' | 'gave'
NP -> 'Alice' | 'Bob'
N -> 'city' | 'streets' | 'iguana' | 'birthday' | 'speech' | 'conference' | 'education'
Adj -> 'empty' | 'inspiring'
Det -> 'the' | 'an' | 'her'
Prep -> 'with' | 'in' | 'for' | 'at' | 'on'
Suggested syntactic categories (non-terminals):
S - Sentence
NP - Noun Phrase
VP - Verb Phrase
PNP - Prepositional Noun Phrase
However, if you have other ideas, feel free to come up with your own lexical and syntactical rules.
With the grammar you designed in point (1.), parse the above sentences using NLTK’s Chart parser, described in Section 1.
Display and compare the parse trees obtained in points (1.) and (2.).
# This is an example of a grammar that can be used to parse the three sentences:
grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP | VP NP PNP | VP PNP | V
PNP -> Prep NP | Prep N
NP -> Det N | Det N PNP | Det NP | Adj N | Adj NP | N N
V -> 'wondered' | 'offered' | 'gave'
NP -> 'Alice' | 'Bob'
N -> 'city' | 'streets' | 'iguana' | 'birthday' | 'speech' | 'conference' | 'education'
Adj -> 'empty' | 'inspiring'
Det -> 'the' | 'an' | 'her'
Prep -> 'with' | 'in' | 'for' | 'at' | 'on'
""")
# sentences
sentences=(
"Alice wondered with Bob in the empty city streets",
"Bob offered Alice an iguana for her birthday",
"Alice gave an inspiring speech at the conference on education"
)
# parser
chart_parser = nltk.ChartParser(grammar)
# results
for sentence in sentences:
print(sentence+':')
tokens = nltk.word_tokenize(sentence)
for tree in chart_parser.parse(tokens):
print(tree)
print('\n')
In the previous section, we saw that even for a very small example designing grammars manually can be time consuming and does not necessarily have the best results. In this section, we explore a different approach to creating grammars: automatic extraction from annotated texts.
For this, we first need to load an annotated text to the workspace. We will work with an existing treebank from the NLTK framework. This is a ~5% fragment of the Penn Treebank . It contains data from Wall Street Journal for 1650 sentences. Let us have a look at an examples of the type of data in the Penn Treebank, shown below:
[ The files in this hierarchy were automatically created by inserting the part of speech tags from a tagged text file (.pos file) into a parsed text file (.prd file). ]
( (S
(NP-SBJ
(NP (NNP Pierre) (NNP Vinken) )
(, ,)
(ADJP
(NP (CD 61) (NNS years) )
(JJ old) )
(, ,) )
(VP (MD will)
(VP (VB join)
(NP (DT the) (NN board) )
(PP-CLR (IN as)
(NP (DT a) (JJ nonexecutive) (NN director) ))
(NP-TMP (NNP Nov.) (CD 29) )))
(. .) ))
Your goal is to automatically extract the CFG out of the treebank and then to evaluate and improve the extracted grammar (Section 4).
## If not done yet, download the treebank:
# nltk.download('treebank')
treebank = nltk.corpus.treebank
dataset_size = len(treebank.parsed_sents())
## here, we define the split percentage for the training set and the
## learning set, in our case ~3% and ~97%
split_size = int(dataset_size * 0.97)
learning_set = treebank.parsed_sents()[:split_size]
test_set = treebank.parsed_sents()[split_size:]
Note that test_set already contains the parses of the sentences . This is what we are going to compare against, in order to assess the quality of the extracted grammar. In addition, we need the raw (not parsed) format of the sentences in the test set. These raw sentences will be parsed with the extracted grammar and then compared against the reference, stored in test_set.
# create a set containing the raw sentences
sents = treebank.sents()
raw_test_set = [ [ w for w in sents[i] ] for i in range(split_size, dataset_size) ]
Extract the grammar out of the learning corpus . At this point, it might be useful to have another look at the commands provided in the end of the document. Here is a suggestion on how to do it. Note that we still need the entire lexicon (i.e. rules of the type PoS ->
from nltk.tree import Tree
from nltk.grammar import Nonterminal
# This is where we will store all of the productions necessary to
# construct the PCFG.
tbank_productions = []
# For all of the (parsed) sentences in the learning set, extract the
# productions (i.e. extract the rules).
for sent in learning_set:
for production in sent.productions():
tbank_productions.append(production)
# Now, we will add the lexical rules for the ENTIRE lexicon.
for word, tag in treebank.tagged_words():
# for each tagged word, we create a tree containing that
# lexical rule, in order to be able to add it to our
# production set tbank_productions.
t = Tree.fromstring("("+ tag + " " + word +")")
for production in t.productions():
tbank_productions.append(production)
# At this point, we have the syntactic rules extracted from the
# learning set and all of the lexical rules. We are ready to extract
# the PCFG.
tbank_grammar = nltk.grammar.induce_pcfg(Nonterminal('S'), tbank_productions)
Try to have a look at the grammar. How many rules are there? Imagine you would have to correct it, for instance. You certainly will come to the conclusion that this is not feasible. Indeed, in such a framework (automated extraction) the grammars should not be modified by hand (maybe not even read), but only the corpus has to be changed/corrected/extended.
Evaluate the performance of the grammars on the test set (the 3% of the sentences from the treebank that you isolated in the beginning of the exercise). What proportion of sentences from the test set have been parsed as in the reference (i.e. are correctly parsed) by the extracted grammar?
We now want to study the size of the grammar with respect to the size of the learning set. Split the former learning set (i.e. the 95% of the treebank) into two parts: one kept for actual learning and another part which is not used. Do this for several percentages of the learning set (10%, 50%, 70%, 80%, 90%, 100%), several times for each ratio. H ave a look at the resulting grammars and try to understand their differences.
You will probably come to the conclusion that this is not an appropriate way to work. Indeed, in such a framework (automated extraction) the grammars should not be modified by hand. Only the corpus should be changed, corrected or extended. This is the goal of the next (and last) part.
print("Splitting the Data Set")
# The size of the learning set has to be chosen so that we avoid
# running out of memory when parsing (caused by a too large learned
# grammar). However, a too small size of the learning set, could lead
# to a grammar that is not useful (i.e. it cannot parse sentences
# other than in the learning set).
dataset_size = len(treebank.parsed_sents())
split_size = int(dataset_size*0.97)
print("Separating Learning Set")
print("Learning set size: " + str(split_size))
learning_set = treebank.parsed_sents()[:split_size]
print("Separating Test Set")
print("Test set size: " + str(dataset_size - split_size))
# This set already contains the parses of the sentences in the test
# set. It is what we are going to compare against, in order to assess
# the quality of the extracted grammar.
test_set = treebank.parsed_sents()[split_size:]
print("Set containing the raw sentences, to be parsed with the extracted grammar")
sents = treebank.sents()
raw_test_set = [ [ w for w in sents[i] ] for i in range(split_size, dataset_size) ]
print("Extract the syntactic (and part of the lexical) rules from the learning set")
tbank_productions = []
for sent in learning_set:
for production in sent.productions():
tbank_productions.append(production)
print(len(tbank_productions))
print("Extract the rest of the lexical rules from the lexicon")
for word, tag in treebank.tagged_words():
t = Tree.fromstring("("+ tag + " " + word +")")
for production in t.productions():
tbank_productions.append(production)
print(len(tbank_productions))
print("Creating grammar")
tbank_grammar = nltk.grammar.induce_pcfg(Nonterminal('S'), tbank_productions)
# Uncomment to have a look at the grammar:
# print(tbank_grammar)
print("Initializing parser")
# http://www.nltk.org/_modules/nltk/parse/viterbi.html
parser = nltk.ViterbiParser(tbank_grammar)
# Test the extracted grammar with the Viterbi parser on one sentence.
# The Viterbi parser gives the most probable parse tree
# Test all sentences in the test-set and compare them to the reference parsing
parse_success = 0;
# for i in range(0, len(raw_test_set)):
for i in range(0, 3):
print("==== Parsing sentence " + str(i), flush=True) ## Note: flush argument is Python 3
test_sent = raw_test_set[i]
# This will raise an exception if the tokens in the test_sentence
# are not covered by the grammar; should not happen.
tbank_grammar.check_coverage(test_sent)
print(test_sent)
print("[" + str(i) + "] Reference parse:")
print(test_set[i])
print("[" + str(i) + "] Parse trees:")
for tree in parser.parse(test_sent):
print(tree)
print(test_sent[i] == tree)
if test_sent[i] == tree:
++parse_success
print(parse_success)
print("Successfully parsed sentences: " + str(parse_success) + " out of " + str(3))
#print("Successfully parsed sentences: " + str(parse_success) + " out of " + str(len(test_set)))
In this section, the goal is to achieve a deeper understanding of results obtained in the previous section. Take the grammar that is learned based on 100% of the learning set and a subset of sentences from the test set (~10 sentences) .
Is the reference (i.e. the parse in the corpus, for the chosen sentence in the test set) correct according to you?
If the reference is correct:
if the result of the parsing is incorrect, this either means that the correct parse cannot be constructed or that it is not the best one.
In both cases, we can try to fix it by adding the corresponding parse to the learning set.
Add the corresponding correct pars e to the learning set and learn again as long as you do not get the correct pars e as a result of the learning.
As you are doing this, is the performance improving on the rest of the test set?
If the reference is incorrect, correct it in the test set and start the evaluation again. Iterate the process and study the performance on the test set.
Grammar creation from a given string:
g = nltk.CFG.fromstring("""<grammar rules here>""")
Use PCFG instead of CFG if you write a probabilistic context-free grammar.
Tokenize a sentence:
tokens = nltk.word_tokenize("<sentence>")
Create different types of parsers starting from a grammar:
chart_parser = nltk.ChartParser(g)
viterbi_parser = nltk.ViterbiParser(g)
Parse a tokenized text:
trees = viterbi_parser.parse(tokens) #for example, for the Viterbi parser
Display the parse trees: [ If not done yet, you first need to install this library: http://tkinter.unpythonic.net/wiki/How_to_install_Tkinter ]
The command below will generate a popup window, containing the graphical representation of the tree; while the parse tree drawing window is open, your script will be paused. The script will be resumed when you close the window.
from nltk.tree import *
from nltk.draw import tree
tree.draw()