Named Entity Recognition

(some slides)

Named entity recognition (NER) is an NLP task that consists of tagging groups of words that correspond to an entity, e.g. a family name, a company name, or a protein.
In this practical session, you will explore three important methodological approaches to NER (see also slide 6 of the presentation):

  1. Lexicon-based NER
    • match occurrences of lexicon entries in text;
    • add some preprocessing (stemming, rewriting, synonyms) to increase matching;
    • our target example here is recognizing brain regions in the biomedical scientific literature;
  2. Machine Learning (ML) NER
    • different ML models (MaxEnt, HMM, conditional random fields (CRF)), trained and evaluated on an annotated corpus, are used to tag NEs;
    • the ML features are selected based on domain knowledge, and training requires (costly) annotated data;
    • our example here: brain regions again;
  3. Rule-based (or regular-expression-based) NER
    • besides general tagging techniques that can be used for NER, hand-written regular expressions can also be used for complex but regular entities;
    • our target example here is measures (units and numbers, ratios).
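
To make the third approach more concrete, here is a minimal Python sketch of regular-expression-based recognition of measures. It is not part of BLUIMA: the unit list, the pattern and the function name `find_measures` are assumptions chosen for illustration, and a realistic unit list would be much longer.

```python
import re

# Illustrative unit list (an assumption, far from complete).
UNITS = r"(?:mm|cm|ms|mV|Hz|kHz|mg|ml|%)"
NUMBER = r"\d+(?:\.\d+)?"

# A measure is either a ratio of two numbers ("3:1")
# or a number followed by a known unit ("2.5 ms").
# The negative lookahead prevents matching "5 msec" as "5 ms".
MEASURE = re.compile(
    rf"(?P<ratio>{NUMBER}\s*:\s*{NUMBER})"
    rf"|(?P<quantity>{NUMBER}\s*{UNITS}(?![A-Za-z]))"
)

def find_measures(text):
    """Return (kind, surface, start, end) for every measure found in text."""
    return [
        (m.lastgroup, m.group(), m.start(), m.end())
        for m in MEASURE.finditer(text)
    ]

if __name__ == "__main__":
    sample = "Spikes lasted 2.5 ms at 40 Hz, with a 3:1 ratio."
    for kind, surface, start, end in find_measures(sample):
        print(kind, repr(surface), start, end)
```

Such patterns work well for entities with a regular surface form, but every new unit or notation has to be added by hand.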

For this practical session, we will use a system called BLUIMA that was developed at EPFL's Blue Brain project, together with the LIA. BLUIMA is an integrated suite of software components for natural language processing of neuroscientific literature (neuroNLP). BLUIMA is based on the high-performance Apache UIMA framework and provides UIMA components wrapping state-of-the-art NLP tools so they can be used interchangeably in processing pipelines. BLUIMA also includes original models and tools specific to neuroscience and provides corpus readers for neuroscientific corpora (more on BLUIMA).

Not used in this practical session but also worth mentioning is Sherlok, a RESTful annotation server based on UIMA, which includes Bluima and more.

0) Installing BLUIMA

For this practical session, you need to install some software on your computer.

  • Dependencies
  • Download and unzip the release (331 MB).
  • cd release_20130529/
  • Edit bin/run_pipeline, at the bottom, change -Xmx6G to whatever RAM you have available (e.g. -Xmx2G works)

1) Simple BLUIMA example (= test)

To test that your installation was successful, run from the command line:

  • ./bin/run_pipeline pipelines/examples/1_simple/simple.pipeline

This will run the NLP pipeline defined in the above simple.pipeline file. A pipeline consists of several components, which are executed in the order in which they are defined in simple.pipeline. Usually, a pipeline starts with a collection reader (which provides the documents to work on), followed by several annotators (which incrementally add meta-information to the document in the form of annotations). Open the pipelines/examples/1_simple/simple.pipeline file and see which components are defined for this pipeline. You can find a description of each component in the file release_20130529/DOCUMENTATION.html. See also slide 3 of the presentation.
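
The reader-then-annotators idea can be sketched in a few lines of Python. This is only a conceptual illustration, not how UIMA or BLUIMA is implemented: the `Document` class, the two toy annotators and `run_pipeline` are hypothetical names invented here.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    text: str
    # Each annotation is a (start, end, type) tuple over the text.
    annotations: list = field(default_factory=list)

def collection_reader(texts):
    """Toy reader: yields one Document per input string."""
    for text in texts:
        yield Document(text)

def whitespace_tokenizer(doc):
    """Annotator: adds one 'Token' annotation per whitespace-separated token."""
    pos = 0
    for tok in doc.text.split():
        start = doc.text.index(tok, pos)
        end = start + len(tok)
        doc.annotations.append((start, end, "Token"))
        pos = end

def capitalized_annotator(doc):
    """Annotator: uses existing 'Token' annotations to add 'Capitalized' ones."""
    for start, end, kind in list(doc.annotations):
        if kind == "Token" and doc.text[start].isupper():
            doc.annotations.append((start, end, "Capitalized"))

def run_pipeline(reader, annotators):
    """Apply each annotator, in the given order, to every document."""
    docs = []
    for doc in reader:
        for annotate in annotators:
            annotate(doc)
        docs.append(doc)
    return docs

if __name__ == "__main__":
    docs = run_pipeline(
        collection_reader(["The thalamus projects to Cortex"]),
        [whitespace_tokenizer, capitalized_annotator],
    )
    print(docs[0].annotations)
```

Because the second annotator depends on the tokens produced by the first, swapping the two would yield no 'Capitalized' annotations, which is why the order of components in a pipeline file matters.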

2) More advanced BLUIMA example: named entity recognition

The command below runs a more complex pipeline. Again, analyse which components are defined and what they do.

  • ./bin/run_pipeline pipelines/examples/2_complete/complete.pipeline
  • To see its results, we need one more tool: Brat.

3) View extracted documents with Brat

Brat is a web-based tool to visualize the output documents processed with a BLUIMA pipeline. Let's launch it and see what information was extracted by the above two pipelines:

4) Brain Region NER, lexicon-based

The above NER uses a list (lexicon) of brain regions to annotate text. In the pipeline below, we evaluate this NER against a corpus of annotated brain regions. See also slide 5 of the presentation.
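
As an illustration of what lexicon matching involves, here is a minimal Python sketch. It is not BLUIMA's matcher: the four lexicon entries, the lowercase-only normalization and the longest-match strategy are assumptions made for this example.

```python
import re

# Toy lexicon of brain regions (illustrative entries only).
LEXICON = {"hippocampus", "subthalamic nucleus", "nucleus accumbens", "thalamus"}
MAX_LEN = max(len(entry.split()) for entry in LEXICON)

def tokenize(text):
    """Return (token, start, end) triples for every word in the text."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]

def lexicon_ner(text):
    """Return (start, end) character spans of lexicon entries found in text."""
    tokens = tokenize(text)
    spans = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate first, so that multi-word entries win.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            candidate = " ".join(tok.lower() for tok, _, _ in tokens[i:i + n])
            if candidate in LEXICON:
                spans.append((tokens[i][1], tokens[i + n - 1][2]))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return spans

if __name__ == "__main__":
    text = "The Subthalamic nucleus and the thalamus were stained."
    print([text[s:e] for s, e in lexicon_ner(text)])
```

Lowercasing is the simplest form of preprocessing; stemming or synonym expansion would be added at the same place, applied to both the lexicon entries and the text tokens.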

In the annotated documents, the Gold annotations (tags) correspond to the true brain regions (= the reference). These have been manually created by domain experts (neuroscientists). The red BrainRegion annotations are generated by the lexicon-based NER.
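
Comparing generated annotations with the Gold ones is typically summarized by precision, recall and F1. The Python sketch below assumes exact span matching (a predicted span counts only if its start and end both agree with a Gold span); the evaluation in the pipeline may use a different matching criterion, and the spans in the example are made up.

```python
def evaluate(gold, predicted):
    """Compute precision, recall and F1 for exact-match (start, end) spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

if __name__ == "__main__":
    gold = [(4, 23), (32, 40), (50, 61)]            # made-up Gold spans
    predicted = [(4, 23), (32, 40), (70, 75), (80, 85)]  # made-up NER output
    p, r, f1 = evaluate(gold, predicted)
    print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Here two of the four predicted spans are correct (precision 0.5) and two of the three Gold spans are found (recall 2/3).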

5) Brain Region NER, machine learning-based

As above, we evaluate a brain region NER against the annotated corpus. This time, the model is machine-learning based (see the model features and also slide 9 of the presentation).
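
To give an idea of what "features" means for a sequence tagger such as a CRF, here is a small Python sketch of a per-token feature function. The features shown are generic examples chosen for illustration; they are an assumption and not the feature set used by the BLUIMA model.

```python
def token_features(tokens, i):
    """Return a dictionary of simple features describing tokens[i] in context."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "is_all_caps": tok.isupper(),          # e.g. abbreviations such as "STN"
        "has_digit": any(c.isdigit() for c in tok),
        "suffix3": tok[-3:].lower(),           # e.g. "eus" in "nucleus"
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

if __name__ == "__main__":
    sentence = ["The", "STN", "projects", "to", "the", "pallidum"]
    for i, tok in enumerate(sentence):
        print(tok, token_features(sentence, i))
```

A tagger is then trained on such feature dictionaries paired with the Gold label of each token, which is where the annotated corpus is needed.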

6) Brain Region NER, rule-based with Apache Ruta

Now it is time for you to create a rule-based NER. For this, we will use Ruta (site, documentation), a workbench to write NLP rules and test them.

  • Download Eclipse (or use your own, if recent enough)
  • Install RUTA plugin
    • Help > Install New Software...
      • Work with: http://www.apache.org/dist/uima/eclipse-update-site/
      • Add...
        • Name: RUTA, OK
      • Select Apache UIMA Eclipse tooling and runtime support and Apache UIMA Ruta
        • Next, Next, Finish
  • Open Ruta
    • Window > Open Perspective > Other...
      • Select: Uima Ruta
  • Install project

    • File > Import
      • General > Existing Projects into Workspace, Next
      • Select archive file
      • Browse... to project from above
        • Finish
    • The project should appear in your Script Browser on the left
  • Now let's apply the rules on some text.

    • Open script script/ch/epfl/bbp/uima/types/Main.ruta. This is where rules are defined. Check Ruta documentation to understand the syntax.
    • Right click, Debug As, UIMA Ruta
    • Open Console view to see progress, wait until it's <terminated>
  • View results
    • Open output/scratch_bre.txt.xmi
      • Select Type System MainTypeSystem.xml
    • Click on the Eclipse view Annotation Browser View (tabbed window on the right)
      • Select first 5 checkboxes, to reveal the annotations we added.
  • Inspect results
    • This lets you see which rules have been applied to which sentence.
    • Window > Open Perspective > Ruta Explain
    • See what rules have been applied
      • Click on Eclipse View Applied Rules
      • Select rule [3/6] ("(" W{REGEXP("[A-Z]{2,5}") (third one)
      • Click on (STN) in the Eclipse view Matched Rules (window at the bottom) to see which text has been matched
  • Your turn now
    • Uncomment some of the lines in Main.ruta, and see their effect
    • View results of output/rnd_500.txt.xmi and see if/how you can improve them
    • Again: see Ruta's documentation
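
If the Ruta syntax is unfamiliar, it may help to see roughly what the abbreviation rule inspected above is looking for, written as a plain Python regular expression. The rule is shown only partially in this handout, so the sketch assumes that it matches two to five uppercase letters enclosed in parentheses, as in "(STN)"; the function name `find_abbreviations` is hypothetical.

```python
import re

# Assumed reading of the rule: "(" + 2 to 5 uppercase letters + ")".
ABBREVIATION = re.compile(r"\(([A-Z]{2,5})\)")

def find_abbreviations(text):
    """Return (start, end, abbreviation) for each parenthesised abbreviation."""
    return [(m.start(1), m.end(1), m.group(1)) for m in ABBREVIATION.finditer(text)]

if __name__ == "__main__":
    text = "The subthalamic nucleus (STN) and the globus pallidus (GP) were recorded (a)."
    print(find_abbreviations(text))
```

Unlike this regular expression, Ruta rules operate on annotations rather than raw characters, so they can combine such patterns with the output of earlier rules.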