First practical session: Demos

Description

The goal of this practical session is to introduce Natural Language Processing through a few demonstrators:

  1. Automated translation
  2. Language identification
  3. and a bit of "food for thought", illustrating the positioning of the course (w.r.t. ML).

This practical session is meant to be standalone: you can simply follow it step by step, on your own. Some hints (green) and then key points (orange) are provided along the way. Don't miss them! (The later part, about part-of-speech tagging, will be revisited in depth later in the semester anyway; its purpose here today is simply to provide a first, not detailed, illustration of one typical NLP task.)
We also recommend that you take personal notes during this practical session, for instance in your favorite editor.


Automated translation

Task specification

Machine translation aims at automatically translating a text in a source language into its equivalent in a target language.
It is certainly one of – if not “the” – foundational tasks of NLP, dating back to the early 1950s.
It is also one of the most complicated NLP tasks and still remains the “holy grail” of NLP.

Inputs: a text in the source language.

Output: the corresponding text in the target language.

As with any input-to-output association task, machine translation can be modeled as an optimization task:


o* = argmax_o  ℒ_{s,t}(o, i)

where here: i is the input text (in the source language s), o ranges over candidate output texts (in the target language t), and ℒ_{s,t} is a scoring function measuring how good a translation o is of i.
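
To make this concrete, here is a deliberately naive Python sketch of translation as a search problem. Everything in it (the candidate list, the toy lexicon, the scoring function) is a made-up placeholder standing in for ℒ_{s,t}; real systems learn such scores from large corpora.

# Toy illustration of o* = argmax_o L_{s,t}(o, i).
# The lexicon and the candidate list are hypothetical placeholders.
def score(o, i):
    """Stand-in for L_{s,t}(o, i): count word-for-word pairs found in a
    tiny bilingual lexicon (real systems learn richer scores from data)."""
    lexicon = {("this", "ce"), ("course", "cours"),
               ("is", "est"), ("cool", "cool")}
    return sum(pair in lexicon
               for pair in zip(i.lower().split(), o.lower().split()))

i = "This course is cool"
candidates = ["Ce cours est cool", "Ce cours est frais", "Cours cool ce est"]
print(max(candidates, key=lambda o: score(o, i)))  # -> Ce cours est cool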

Evaluation metrics: (for information only; more details next week)
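
As a first concrete taste (our choice of metric and toolkit, not necessarily those presented next week), here is how one very common automatic metric, BLEU, can be computed with NLTK on a toy reference/hypothesis pair:

# Sentence-level BLEU with NLTK (one widespread automatic MT metric).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference  = "ce cours est cool".split()    # human reference translation
hypothesis = "ce cours est frais".split()   # system output

score = sentence_bleu([reference], hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")  # well below 1.0: one word out of four differs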

Hands on

As natural language translation demonstrators, we propose several systems (see the links on the course web page).

The goal of this exercise is to take an engineer's look at the automated translation task: evaluate several tools with respect to some objectives. So think about some objective (anything between "provide a rough idea of the content of a web page" and "automatic translation of the whole literature") and how this objective could be evaluated. Then relate it to evaluation at the different NLP levels. We recommend using at least two different machine translation systems.

  1. Start first with a few personal tests in some of the languages you know.

    Try to find examples where the system performs well, and then some more problematic ones. Try more specifically to find examples with difficulties at each of the language levels: lexical, syntactic, semantic and pragmatic.

    Suggestions:

    This course is cool.

    (from English to another language you know well enough),

    Il mange un avocat.

    (French to English. Ask for an explanation about French if you do not understand).

    Could you find a way to obtain the correct translation of "avocat" in this case, by adding more context? What if you use its plural form?

    All systems have biases, coming either from the corpus or from the model. Is this bias relevant for our task? For example, in Wikipedia, the word "pilot" appears more often in the context of "men" than of "women", and inversely for the word "dancer".

    See also examples below.

  2. (A wrongly claimed historical example; see here for the explanation [pdf] by John Hutchins (1995).) Translate the sentence [Mark 14:38; Matthew 26:41]

    The spirit is willing but the flesh is weak.

    from English to another language and then translate the result back into English (a small script automating such round trips is sketched after this exercise list).

    What do you get?

    By the way, Google recently used this kind of round-trip translation as a data augmentation technique for question-answering systems (ICLR 2018).

  3. Consider now the sentence

    Time flies like an arrow

    1. Why/how could it be a problem?
    2. Try with the same expression in another language you know (and let the system translate it into English).
    3. Translate the obtained result back into English.
  4. Do the same with some idiomatic expressions:

    it's raining cats and dogs

    1. Are the translations correct?
    2. Look at the differences (idiomatic aspect).
    3. Translate back into the original language to study the symmetry.
    4. Other examples (from French):

      il offre un cadeau à son fils.
      il offre un cadeau à la mode.
      il offre un cadeau à la dernière minute.

      la mère a élevé ses fils.
      l'araignée a tissé ses fils.
      on a vu ses fils.

  5. Try to evaluate, for a given task, the performance of this software: what would be the different aspects to evaluate (with respect to the task)? How could the quality of the translation be evaluated automatically?
    Evaluate the robustness/performance at the word level (lexical), the sentence level (syntactic) and the meaning level (semantic/pragmatic).
    For example, you can look at, and find examples for, each of these levels.
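
If you want to automate the round-trip experiments from this list, here is a minimal sketch. It assumes the Hugging Face transformers package and the Helsinki-NLP OPUS-MT English/French models; any pair of translation systems would do, and the online demonstrators remain the primary tool for this session.

# Round-trip (back-and-forth) translation; assumes `pip install transformers`.
from transformers import pipeline

en_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
fr_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

for src in ["The spirit is willing but the flesh is weak.",
            "Time flies like an arrow.",
            "It's raining cats and dogs."]:
    fr = en_fr(src)[0]["translation_text"]
    back = fr_en(fr)[0]["translation_text"]
    print(f"{src}\n -> {fr}\n -> {back}\n")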

Hints

Regarding the first example (here translated into French):

This course is cool
-> Ce cours est cool
   (Note: some systems, especially older ones, propose: "Ce cours est frais")
  

Have all the words been recognized and translated into French?

->  yes

Is the structure of the French sentence correct (according to French syntax)?

->  yes

Is the original English sentence correct?

->  yes

Thus, where does the problem (poor/incorrect translation) come from?

->  the word cool has several meanings.

The translation software has to make a choice between two meanings: "cold" or "trendy". The difficult part is thus at the lexical level (which French word to choose). Its solution, however, is at the pragmatic level.

The problem is not related to the syntactic level since the Part-of-Speech role is the same in both ("cold" or "trendy") situations: both (French) translations are adjectives.

However, with the sentence "This can is empty", "can" could in principle be either a noun or a verb. In this case, a syntactic choice has to be made (i.e. at the level of the grammar), which here clearly indicates it's a noun. Such a case is easier to cope with because of the grammatical context ("This", for instance, in this case).

Going back to the pragmatic level: why isn't it only semantic (i.e. at the level of the meaning of the sentence, without context)?
Without context it could be possible, although not very common, to imagine a situation where the course is "cold", for instance in winter in a cold room.
Thus the meaning of "cool" is really related to the context of the sentence, which deals with pragmatics.

Staying at the semantic level only, the most obvious local choice for "cool" may here be "cold", since the word it relates to ("course") is not animate and "cool" is (maybe) most often used in the meaning "trendy" in relation to persons...
...unless the "trendy" meaning of "cool" has been introduced with broader coverage, as "a nice thing" in general.
(This illustrates the general difficulty of semantics: the "knowledge representation bottleneck": where to stop the knowledge representation?!)

Note: "Rules" expressed in this course are never absolute rules! In the NLP domaine, general rules very often need to be adapted to both the application domain and the targeted objectives. It is thus very crucial to clearly identify them.

In the above example, a context can be created such that the above "semantic rule" becomes wrong (due to pragmatics): Ex: "This winter is cold and this course is cool..."

Let us now consider the second example (which is in French):

 Il mange un avocat 
->  He eats a lawyer
or  He's eating a lawyer
or  It eats a lawyer

...the translation of which seems wrong (expecting "He is eating an avocado")...
...unless we were talking about some animal ("It") which can eat men (e.g. a lion).
Here again, pragmatics seems to play a role. In most cases, however, semantics would be enough (for a well-identified application context).

The problem comes here again from the lexical level, since it's about the meaning of a single word ("avocat"). The solution to this problem is (most often; see the former remark) at the semantic level: generally speaking, fruits ("avocado") are eaten, not men ("lawyer").

This semantic solution could be found using the word, in this case the verb, "mange" (eats), to which "avocat" is related.

For this example:

The spirit is willing but the flesh is weak
-> L'esprit est prêt mais la chair est faible
   (Note: the word "willing" is often translated in "prêt", "disposé", "prompt" or "fort".)
-> The spirit is willing but the flesh is weak

With the proposed translation systems, even if the French translation is not entirely correct, they are able to translate it back correctly. However, with some systems you might get "The spirit is strong but the flesh is weak".

There is only a minor difference between the first and the last sentence. The translation is really good here.

This is not always the case: it depends on the investment spent on developing exceptions (in the broader sense, here: idiomatic expressions).

For the next example:

Time flies like an arrow
-> Le temps file comme une flèche
or Le temps vole comme une flèche
-> Time slips by like an arrow

The problem (as presented in the lectures) comes from the fact that this sentence can have several interpretations in English:

  1. the usual one: time passes as quickly as an arrow flies ("flies" is the verb);
  2. a syntactically valid but odd one: "time flies" (insects of the "time" kind) are fond of an arrow (compare "fruit flies like a banana");
  3. an imperative one: measure the speed of flies the way you would time an arrow.

The results obtained here show that the automated translator knows some of the idiomatic expressions, and some of them only in one direction (from one language to another, but not the way back).
This illustrates one of the limits of the translator's resources (and maybe also a development strategy: separate teams for different languages (N teams) rather than teams for pairs of languages (N² teams)).

Regarding:

it is raining cats and dogs
-> il pleut des chats et des chiens

This is clearly a wrong translation, done word by word instead of as the idiomatic expression ("il pleut des cordes" in French). This idiomatic expression seems to be missing.

Actually, an automated translator can produce the translation at different levels, depending on the ambiguities:

[figure: pyramid of the translation levels]

For idiomatic expressions, it should not "go too high", i.e. it should not try to extract some meaning but translate them directly at the lexical level (i.e. the idiomatic expression is one lexical entity as such), or maybe, in some specific cases where a bit of syntax is required (verb agreement), at the syntactic level, but definitely not "higher".
For most usual sentences, the syntactic level is also sufficient. However, in some difficult semantic/pragmatic cases -- such as the ones we are deliberately focusing on in this practical session -- higher-level knowledge is required.

Regarding the evaluation of the translator:

The different parts to be considered are the key steps the system must go through: words, sentences, meaning; i.e. each of these aspects must be considered (in proportion to their importance for the application).

At the lexical level, the first example provides a good illustration as the word "cool" has several meanings. The system must thus choose among them (in different contexts).

At the syntactic level, we can consider a sentence like "This can is empty" as an illustrative example. After "This", there cannot be a verb; this should enable the system to choose between the noun and the verb translations of "can".

At the semantic level, the first example sentence "This course is cool" is a good illustration of what could be targeted.

At the pragmatic level, it is really a challenge: it is in fact only possible to target this level in very specific and limited domains.

Finally, when you evaluate the performance of translation software, it doesn't make sense to simply evaluate "how well you can understand the result". You should evaluate with a very specific goal in mind.

Conclusions / Key ideas:

Translation is a very hard NLP problem. The main difficulty is to capture the context (semantics), but also the relations among words, e.g. a fruit is edible whereas a person is not. Even state-of-the-art systems for translation (and other tasks) are far from producing broadly acceptable results.

For any NLP application:



Language identification

The aim of the application considered here is to automatically detect the language (or "type of language") of a text.

Some applications of such a module (in a broader sense) include:

The performance of such systems is usually around 99% accuracy.

Here are a few language identifiers we propose you compare:

Pick at least two of them and compare them.

First try with a few personal examples.

Then try the following sentence (in French but with three borrowed English words):

Le bug dans le soft a engendré un crash du système

Not so bad...

Try now with the same sentence but replacing the "full words" (i.e. meaningful words) with rubbish, e.g.:

Le xwt dans le xwt a xwtyxwy un xwt du wxt

None of the words "xwt..." are French. Surprising, isn't it?

Try now without the "grammatical words":

xwt xwt xwtyxwy xwt wxt

and with the original "full words":

bug soft engendré crash système

In most cases, such an identification system combines three techniques:

  1. presence of very specific characters, like œ (for French), č (in some Eastern European languages), Chinese characters, etc.
  2. recognition of the most frequent (and discriminative) words, such as the (short) "grammatical words" (remember Zipf's law?). This explains the behaviour observed above.
  3. identification of frequent n-grams of characters (i.e. frequent sequences of n characters; typically 1 <= n <= 6).

To show this latter aspect, type some text without any "grammatical word" and only with words looking like English, for instance:

computerish Rajmanism wisoriperaught

Try other examples (both correct words and look-alikes), in different languages.

Try your family name.
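
If you prefer to run all of the above tests programmatically, here is a minimal sketch; it assumes the langdetect package (our choice here; any of the proposed identifiers would do).

# Running the tests above with `langdetect` (pip install langdetect).
# Seeding the detector factory makes its results deterministic.
from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0
tests = [
    "Le bug dans le soft a engendré un crash du système",
    "Le xwt dans le xwt a xwtyxwy un xwt du wxt",
    "xwt xwt xwtyxwy xwt wxt",
    "bug soft engendré crash système",
    "computerish Rajmanism wisoriperaught",
]
for t in tests:
    print(detect(t), "<-", t)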

Hints

Example of n-grams of characters, for "chat":

bi-grams : ch ha at

tri-grams : cha hat

which are used to compute the probability that a sequence of letters corresponds to a word in language L:

P(chat|L) = P(c|L) * P(h|c,L) * P(a|ch,L) * P(t|cha,L)

which is approximated by (Markov chain):

P(chat|L) ~ P(c|L) * P(h|c,L) * P(a|h,L) * P(t|a,L)

i.e.:

P(chat|L) ~ P(ch|L) * P(ha|L)/P(h|L) * P(at|L)/P(a|L)

where you see the use of bi-gram and mono-gram statistics in a given language L.
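
To make this computation concrete, here is a toy character-bigram identifier implementing the Markov-chain scoring above. The two one-sentence "corpora" are placeholders (real systems train on far more text) and the smoothing is the simplest possible.

# Toy character-bigram language identifier (add-alpha smoothed).
import math
from collections import Counter

def bigram_model(corpus):
    """Character unigram and bigram counts from a training text."""
    text = " ".join(corpus.lower().split())
    return Counter(text), Counter(text[k:k+2] for k in range(len(text) - 1))

def log_prob(word, model, alpha=0.1):
    """log P(word|L) ~ sum over k of log P(c_k | c_{k-1}, L)."""
    unigrams, bigrams = model
    vocab = len(unigrams)
    return sum(math.log((bigrams[a + b] + alpha) /
                        (unigrams[a] + alpha * vocab))
               for a, b in zip(word, word[1:]))

models = {
    "fr": bigram_model("le chat mange la souris dans le jardin"),
    "en": bigram_model("the cat eats the mouse in the garden"),
}
print(max(models, key=lambda lang: log_prob("chat", models[lang])))  # -> fr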

If there are enough occurrences of a letter sequence that is characteristic of a language, the n-gram model is considered before the others; e.g. "bip" or "pib" is typical of Romanian.

Other examples of applications of language identification, where you look at domain-specific words rather than grammatical words:

Conclusions / Key ideas:

Language identification is heavily based on Zipf's law.

As with any statistical technique, each of the above techniques works only when there is enough material:

The fundamental rules of corpus-based approaches:


Readings, illustrating the positioning of the course (w.r.t. ML)

Finally, we propose that you read the three following blog articles which, we find, are a good complementary illustration of the positioning of this course w.r.t. ML courses:


Follow-up

Students able to read French can proceed here, if they wish, for a funny example of how ambiguous language can be. (Sorry! I'm unable to create such an example in English.)


If you want to "play" with other demos, you can follow the WWW links related to this course.


F.A.Q.

For those of you who can read French, there are also a few questions with answers on the French version of the page (reproduced from email questions of former French speaking students).


Last modified: Tue Sep 13, 2022