First practical session: Demos

Description

The goal of this practical session is to introduce Natural Language Processing through a few demonstrators:

  1. Automated translation
  2. Language identification
  3. and a bit of "food for thought", illustrating the positioning of the course (w.r.t. ML).

This practical session is meant to be standalone: you can simply follow it step by step, on your own. Some hints (green) and then key points (orange) are provided along the way. Don't miss them! (The later part, about part-of-speech tagging, will be revisited in depth later in the semester anyway; its purpose here today is simply to provide a first, not detailed, illustration of one typical NLP task.)
We also recommend that you take personal notes during this practical session, for instance in your favorite editor.


Automated translation

Task specification

Machine translation aims at automatically translating a text in a source language into its equivalent in a target language.
It is certainly one of – if not “the” – foundational tasks of NLP, dating back to the early 1950s.
It is also one of the most complicated NLP tasks and still remains the “holy grail” of NLP.

Inputs: a text in the source language.

Output: the corresponding text in the target language.

As with any input-to-output association task, machine translation can be modeled as an optimization task:


o* = argmax_o  ℒ_{s,t}(o, i)

where here: i is the input text (in the source language s), o ranges over candidate output texts (in the target language t), and ℒ_{s,t} is a scoring function measuring how good a translation o is of i.
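
To make this concrete, here is a deliberately naive Python sketch of translation as a search problem. Everything in it (the candidate list, the toy lexicon, the scoring function) is a made-up placeholder standing in for ℒ_{s,t}; real systems learn such scores from large corpora.

# Toy illustration of o* = argmax_o L_{s,t}(o, i).
# The lexicon and the candidate list are hypothetical placeholders.
def score(o, i):
    """Stand-in for L_{s,t}(o, i): count word-for-word pairs found in a
    tiny bilingual lexicon (real systems learn richer scores from data)."""
    lexicon = {("this", "ce"), ("course", "cours"),
               ("is", "est"), ("cool", "cool")}
    return sum(pair in lexicon
               for pair in zip(i.lower().split(), o.lower().split()))

i = "This course is cool"
candidates = ["Ce cours est cool", "Ce cours est frais", "Cours cool ce est"]
print(max(candidates, key=lambda o: score(o, i)))  # -> Ce cours est cool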

Evaluation metrics: (for information only; more details next week)
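
As a first concrete taste (our choice of metric and toolkit, not necessarily those presented next week), here is how one very common automatic metric, BLEU, can be computed with NLTK on a toy reference/hypothesis pair:

# Sentence-level BLEU with NLTK (one widespread automatic MT metric).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference  = "ce cours est cool".split()    # human reference translation
hypothesis = "ce cours est frais".split()   # system output

score = sentence_bleu([reference], hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")  # well below 1.0: one word out of four differs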

Hands on

As natural language translation demonstrators, we propose several systems (see the links on the course web page).

The goal of this exercise is to take an engineer's look at the automated translation task: evaluate several tools with respect to some objectives. So think about some objective (anything between "provide a rough idea of the content of a web page" and "automatic translation of the whole literature") and how this objective could be evaluated. Then relate it to evaluation at the different NLP levels. We recommend using at least two different machine translation systems.

  1. Start first with a few personal tests in some of the languages you know.

    Try to find examples where the system performs well, and then some more problematic ones. Try more specifically to find examples with difficulties at each of the language levels: lexical, syntactic, semantic and pragmatic.

    Suggestions:

    This course is cool.

    (from English to another language you know well enough),

    Il mange un avocat.

    (French to English. Ask for an explanation about French if you do not understand).

    Could you find a way to obtain the correct translation of "avocat" in this case, by adding more context? What if you use its plural form?

    All systems have biases, coming either from the corpus or from the model. Is this bias relevant for our task? For example, in Wikipedia, the word "pilot" appears more often in the context of "men" than of "women", and inversely for the word "dancer".

    See also examples below.

  2. (A wrongly claimed historical example; see here for the explanation [pdf] by John Hutchins (1995).) Translate the sentence [Mark 14:38; Matthew 26:41]

    The spirit is willing but the flesh is weak.

    from English to another language and then translate the result back into English (a small script automating such round trips is sketched after this exercise list).

    What do you get?

    By the way, Google recently used this kind of round-trip translation as a data augmentation technique for question-answering systems (ICLR 2018).

  3. Consider now the sentence

    Time flies like an arrow

    1. Why/how could it be a problem?
    2. Try with the same expression in another language you know (and let the system translate it into English).
    3. Translate the obtained result back into English.
  4. Do the same with some idiomatic expressions:

    it's raining cats and dogs

    1. Are the translations correct?
    2. Look at the differences (idiomatic aspect).
    3. Translate back into the original language to study the symmetry.
    4. Other examples (from French):

      il offre un cadeau à son fils.
      il offre un cadeau à la mode.
      il offre un cadeau à la dernière minute.

      la mère a élevé ses fils.
      l'araignée a tissé ses fils.
      on a vu ses fils.

  5. Try to evaluate, for a given task, the performance of this software: what would be the different aspects to evaluate (with respect to the task)? How could the quality of the translation be evaluated automatically?
    Evaluate the robustness/performance at the word level (lexical), the sentence level (syntactic) and the meaning level (semantic/pragmatic).
    For example, you can look at, and find examples for, each of these levels.
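
If you want to automate the round-trip experiments from this list, here is a minimal sketch. It assumes the Hugging Face transformers package and the Helsinki-NLP OPUS-MT English/French models; any pair of translation systems would do, and the online demonstrators remain the primary tool for this session.

# Round-trip (back-and-forth) translation; assumes `pip install transformers`.
from transformers import pipeline

en_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
fr_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

for src in ["The spirit is willing but the flesh is weak.",
            "Time flies like an arrow.",
            "It's raining cats and dogs."]:
    fr = en_fr(src)[0]["translation_text"]
    back = fr_en(fr)[0]["translation_text"]
    print(f"{src}\n -> {fr}\n -> {back}\n")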

Hints

Regarding the first example (here translated into French):

This course is cool
-> Ce cours est cool
   (Note: some systems, especially older ones, propose: "Ce cours est frais")
  

Have all the words been recognized and translated into French?

->  yes

Is the structure of the French sentence correct (according to French syntax)?

->  yes

Is the original English sentence correct?

->  yes

Thus, where does the problem (poor/incorrect translation) come from?

->  the word cool has several meanings.

The translation software has to make a choice between two meanings: "cold" or "trendy". The difficult part is thus at the lexical level (which French word to choose). Its solution, however, is at the pragmatic level.

The problem is not related to the syntactic level since the Part-of-Speech role is the same in both ("cold" or "trendy") situations: both (French) translations are adjectives.

However, with the sentence "This can is empty", "can" could in principle be either a noun or a verb. In this case, a syntactic choice has to be made (i.e. at the level of the grammar), which here clearly indicates it's a noun. Such a case is easier to cope with because of the grammatical context ("This", for instance, in this case).

Going back to the pragmatic level: why isn't it only semantic (i.e. at the level of the meaning of the sentence, without context)?
Without context it could be possible, although not very common, to imagine a situation where the course is "cold", for instance in winter in a cold room.
Thus the meaning of "cool" is really related to the context of the sentence, which deals with pragmatics.

Staying at the semantic level only, the most obvious local choice for "cool" may here be "cold", since the word it relates to ("course") is not animate and "cool" is (maybe) most often used in the meaning "trendy" in relation to persons...
...unless the "trendy" meaning of "cool" has been introduced with broader coverage, as "a nice thing" in general.
(This illustrates the general difficulty of semantics: the "knowledge representation bottleneck": where to stop the knowledge representation?!)

Note: "Rules" expressed in this course are never absolute rules! In the NLP domaine, general rules very often need to be adapted to both the application domain and the targeted objectives. It is thus very crucial to clearly identify them.

In the above example, a context can be created such that the above "semantic rule" becomes wrong (due to pragmatics): Ex: "This winter is cold and this course is cool..."

Let us now consider the second example (which is in French):

 Il mange un avocat 
->  He eats a lawyer
or  He's eating a lawyer
or  It eats a lawyer

...the translation of which seems wrong (expecting "He is eating an avocado")...
...unless we were talking about some animal ("It") which can eat men (e.g. a lion).
Here again, pragmatics seems to play a role. In most cases, however, semantics would be enough (for a well-identified application context).

The problem comes here again from the lexical level, since it's about the meaning of a single word ("avocat"). The solution to this problem is (most often; see the former remark) at the semantic level: generally speaking, fruits ("avocado") are eaten, not men ("lawyer").

This semantic solution could be found using the word, in this case the verb, "mange" (eats), to which "avocat" is related.

For this example:

The spirit is willing but the flesh is weak
-> L'esprit est prêt mais la chair est faible
   (Note: the word "willing" is often translated in "prêt", "disposé", "prompt" or "fort".)
-> The spirit is willing but the flesh is weak

With the proposed translation systems, even if the French translation is not entirely correct, they are able to translate it back correctly. However, with some systems you might get "The spirit is strong but the flesh is weak".

There is only a minor difference between the first and the last sentence. The translation is really good here.

This is not always the case: it depends on the investment spent on developing exceptions (in the broader sense, here: idiomatic expressions).

For the next example:

Time flies like an arrow
-> Le temps file comme une flèche
or Le temps vole comme une flèche
-> Time slips by like an arrow

The problem (as presented in the lectures) comes from the fact that this sentence can have several interpretations in English:

  1. the usual one: time passes as quickly as an arrow flies ("flies" is the verb);
  2. a syntactically valid but odd one: "time flies" (insects of the "time" kind) are fond of an arrow (compare "fruit flies like a banana");
  3. an imperative one: measure the speed of flies the way you would time an arrow.

The results obtained here show that the automated translator knows some of the idiomatic expressions, and some of them only in one direction (from one language to another, but not the way back).
This illustrates one of the limits of the translator's resources (and maybe also a development strategy: separate teams for different languages (N teams) rather than teams for pairs of languages (N² teams)).

Regarding:

it is raining cats and dogs
-> il pleut des chats et des chiens

This is clearly a wrong translation, done word by word instead of as the idiomatic expression ("il pleut des cordes" in French). This idiomatic expression seems to be missing.

Actually, an automated translator can produce the translation at different levels, depending on the ambiguities:

[figure: pyramid of the translation levels]

For idiomatic expressions, it should not "go too high", i.e. it should not try to extract some meaning but translate them directly at the lexical level (i.e. the idiomatic expression is one lexical entity as such), or maybe, in some specific cases where a bit of syntax is required (verb agreement), at the syntactic level, but definitely not "higher".
For most usual sentences, the syntactic level is also sufficient. However, in some difficult semantic/pragmatic cases -- such as the ones we are deliberately focusing on in this practical session -- higher-level knowledge is required.

Regarding the evaluation of the translator:

The different parts to be considered are the key steps the system must go through: words, sentences, meaning; i.e. each of these aspects must be considered (in proportion to their importance for the application).

At the lexical level, the first example provides a good illustration as the word "cool" has several meanings. The system must thus choose among them (in different contexts).

At the syntactic level, we can consider a sentence like "This can is empty" as an illustrative example. After "This", there cannot be a verb; this should enable the system to choose between the noun and the verb translations of "can".

At the semantic level, the first example sentence "This course is cool" is a good illustration of what could be targeted.

At the pragmatic level, it is really a challenge: it is in fact only possible to target this level in very specific and limited domains.

Finally, when you evaluate the performance of translation software, it doesn't make sense to simply evaluate "how well you can understand the result". You should evaluate with a very specific goal in mind.

Conclusions / Key ideas:

Translation is a very hard NLP problem. The main difficulty is to capture the context (semantics), but also the relations among words, e.g. a fruit is edible whereas a person is not. Even state-of-the-art systems for translation (and other tasks) are far from producing broadly acceptable results.

For any NLP application:



Language identification

The aim of the application considered here is to automatically detect the language (or "type of language") of a text.

Some applications of such a module (in a broader sense) include:

The performance of such systems is usually around 99% accuracy.

Here are a few language identifiers we propose you compare:

Pick at least two of them and compare them.

First try with a few personal examples.

Then try the following sentence (in French but with three borrowed English words):

Le bug dans le soft a engendré un crash du système

Not so bad...

Try now with the same sentence but replacing the "full words" (i.e. meaningful words) with rubbish, e.g.:

Le xwt dans le xwt a xwtyxwy un xwt du wxt

None of the words "xwt..." are French. Surprising, isn't it?

Try now without the "grammatical words":

xwt xwt xwtyxwy xwt wxt

and with the original "full words":

bug soft engendré crash système

In most cases, such an identification system combines three techniques:

  1. presence of very specific characters, like œ (for French), č (in some Eastern European languages), Chinese characters, etc.
  2. recognition of the most frequent (and discriminative) words, such as the (short) "grammatical words" (remember Zipf's law?). This explains the behaviour observed above.
  3. identification of frequent n-grams of characters (i.e. frequent sequences of n characters; typically 1 <= n <= 6).

To show this latter aspect, type some text without any "grammatical word" and only with words looking like English, for instance:

computerish Rajmanism wisoriperaught

Try other examples (both correct words and look-alikes), in different languages.

Try your family name.
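
If you prefer to run all of the above tests programmatically, here is a minimal sketch; it assumes the langdetect package (our choice here; any of the proposed identifiers would do).

# Running the tests above with `langdetect` (pip install langdetect).
# Seeding the detector factory makes its results deterministic.
from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0
tests = [
    "Le bug dans le soft a engendré un crash du système",
    "Le xwt dans le xwt a xwtyxwy un xwt du wxt",
    "xwt xwt xwtyxwy xwt wxt",
    "bug soft engendré crash système",
    "computerish Rajmanism wisoriperaught",
]
for t in tests:
    print(detect(t), "<-", t)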

Hints

Example of n-grams of characters, for "chat":

bi-grams : ch ha at

tri-grams : cha hat

which are used to compute the probability that a sequence of letters corresponds to a word in language L:

P(chat|L) = P(c|L) * P(h|c,L) * P(a|ch,L) * P(t|cha,L)

which is approximated by (Markov chain):

P(chat|L) ~ P(c|L) * P(h|c,L) * P(a|h,L) * P(t|a,L)

i.e.:

P(chat|L) ~ P(ch|L) * P(ha|L)/P(h|L) * P(at|L)/P(a|L)

where you see the use of bi-gram and mono-gram statistics in a given language L.
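
To make this computation concrete, here is a toy character-bigram identifier implementing the Markov-chain scoring above. The two one-sentence "corpora" are placeholders (real systems train on far more text) and the smoothing is the simplest possible.

# Toy character-bigram language identifier (add-alpha smoothed).
import math
from collections import Counter

def bigram_model(corpus):
    """Character unigram and bigram counts from a training text."""
    text = " ".join(corpus.lower().split())
    return Counter(text), Counter(text[k:k+2] for k in range(len(text) - 1))

def log_prob(word, model, alpha=0.1):
    """log P(word|L) ~ sum over k of log P(c_k | c_{k-1}, L)."""
    unigrams, bigrams = model
    vocab = len(unigrams)
    return sum(math.log((bigrams[a + b] + alpha) /
                        (unigrams[a] + alpha * vocab))
               for a, b in zip(word, word[1:]))

models = {
    "fr": bigram_model("le chat mange la souris dans le jardin"),
    "en": bigram_model("the cat eats the mouse in the garden"),
}
print(max(models, key=lambda lang: log_prob("chat", models[lang])))  # -> fr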

If there are enough occurrences of a letter sequence that is characteristic of a language, the n-gram model is considered before the others; e.g. "bip" or "pib" is typical of Romanian.

Other examples of applications of language identification, where you look at domain-specific words rather than grammatical words:

Conclusions / Key ideas:

Language identification is heavily based on Zipf's law.

As with any statistical technique, each of the above techniques works only when there is enough material:

The fundamental rules of corpus-based approaches:


Readings, illustrating the positioning of the course (w.r.t. ML)

Finally, we propose that you read the three following blog articles which, we find, are a good complementary illustration of the positioning of this course w.r.t. ML courses:


Follow-up

Students able to read French can proceed here, if they wish, for a funny example of how ambiguous language can be. (Sorry! I'm unable to create such an example in English.)


If you want to "play" with other demos, you can follow the WWW links related to this course.


F.A.Q.

For those of you who can read French, there are also a few questions with answers on the French version of the page (reproduced from email questions of former French speaking students).


Last modified: Tue Sep 13, 2022