Named Entity Recognition
Named entity recognition (NER) is an NLP task that consists of tagging groups of words that correspond to an entity, e.g. a family name, a company name or a protein.
In this practical session, you will explore three important methodological approaches to NER (see also slide 6 of the presentation):
- Lexicon-based NER
  - match occurrences of lexicon entries in the text;
  - add some preprocessing (stemming, rewriting, synonyms) to increase matching;
  - our target example here is the recognition of brain regions in the scientific biomedical literature;
- Machine learning (ML) NER
  - different ML models (MaxEnt, HMM, conditional random fields (CRF)), trained and evaluated on an annotated corpus, are used to tag NEs;
  - the ML features are selected based on domain knowledge; this approach requires (costly) annotated data;
  - our example here: brain regions again;
- Rule-based (or regular expression-based) NER
  - besides general tagging techniques that can be used for NER, hand-written regular expressions can also be used for complex but regular entities;
  - our target example here is measures (units and numbers, ratios).
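To make the third approach more concrete, here is a minimal Python sketch of hand-written regular expressions for measures. It is not part of BLUIMA; the unit list, the pattern and the function name find_measures are assumptions chosen for illustration, and a real system would need many more units and number formats.

```python
import re

# Hypothetical, deliberately small unit list (an assumption for this sketch).
UNITS = r"(?:mm|cm|nm|ms|mV|Hz|kg|mg|g|s|%)"
NUMBER = r"\d+(?:\.\d+)?"

# Either a ratio such as "3:1", or a number followed by a unit such as "10 Hz".
MEASURE = re.compile(
    rf"(?P<ratio>\b\d+:\d+\b)|(?P<value>{NUMBER})\s?(?P<unit>{UNITS})(?![A-Za-z])"
)

def find_measures(text):
    """Return (start, end, matched_text) for each measure found in text."""
    return [(m.start(), m.end(), m.group(0)) for m in MEASURE.finditer(text)]

if __name__ == "__main__":
    sentence = "Slices of 0.3 mm were recorded at 10 Hz with a 3:1 ratio."
    for start, end, surface in find_measures(sentence):
        print(start, end, surface)
```

The negative lookahead after the unit keeps a number followed by an ordinary word (for instance "10 slices") from being tagged as a measure in seconds.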
For this practical session, we will use a system called BLUIMA that was developed at EPFL's Blue Brain project, together with the LIA. BLUIMA is an integrated suite of software components for natural language processing of neuroscientific literature (neuroNLP). BLUIMA is based on the high-performance Apache UIMA framework and provides UIMA components wrapping state-of-the-art NLP tools so they can be used interchangeably in processing pipelines. BLUIMA also includes original models and tools specific to neuroscience and provides corpus readers for neuroscientific corpora (more on BLUIMA).
Not used in this practical session, but also worth mentioning, is Sherlok, a RESTful annotation server based on UIMA that includes BLUIMA and more.
0) Installing BLUIMA
For this practical session, you need to install some software on your computer.
- Dependencies
  - Java 6 (JRE or JDK) or higher
  - Python 2.6 or higher for the visualization
- Download and unzip the release (331 MB), then change into its directory:
  cd release_20130529/
- Edit bin/run_pipeline: at the bottom, change -Xmx6G to whatever RAM you have available (e.g. -Xmx2G works)
1) Simple BLUIMA example (= test)
To test that your installation was successful, run from the command line:
./bin/run_pipeline pipelines/examples/1_simple/simple.pipeline
This will run the NLP pipeline as defined in the above simple.pipeline file. A pipeline consists of different components, which are executed in the order in which they are defined in simple.pipeline. Usually, a pipeline starts with a collection reader (which provides the documents to work on), followed by several annotators (which incrementally add meta-information to the document in the form of annotations). Open the pipelines/examples/1_simple/simple.pipeline file and see which components are defined for this pipeline. You can find a description of each component in the file release_20130529/DOCUMENTATION.html. See also slide 3 of the presentation.
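The reader-then-annotators idea can be sketched in a few lines of Python. This is only a conceptual illustration, not the UIMA or BLUIMA API: the Document class, the two annotators and run_pipeline are hypothetical names introduced for this sketch.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Document:
    text: str
    # Each annotation is a (type, start, end) tuple added by some annotator.
    annotations: list = field(default_factory=list)

def collection_reader(texts):
    """Provide the documents the pipeline works on."""
    for text in texts:
        yield Document(text)

def token_annotator(doc):
    """Add one Token annotation per word."""
    for m in re.finditer(r"\w+", doc.text):
        doc.annotations.append(("Token", m.start(), m.end()))

def acronym_annotator(doc):
    """Build on the Token annotations added by the previous component."""
    for kind, start, end in list(doc.annotations):
        if kind == "Token" and end - start >= 2 and doc.text[start:end].isupper():
            doc.annotations.append(("Acronym", start, end))

def run_pipeline(texts, annotators):
    """Run every annotator, in the given order, on every document."""
    docs = []
    for doc in collection_reader(texts):
        for annotate in annotators:
            annotate(doc)
        docs.append(doc)
    return docs

if __name__ == "__main__":
    docs = run_pipeline(["The STN projects to the thalamus."],
                        [token_annotator, acronym_annotator])
    print(docs[0].annotations)
```

The order matters: in this sketch acronym_annotator finds nothing if it runs before token_annotator, which is why a pipeline file lists its components in execution order.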
2) A more advanced BLUIMA example: named entity recognition
The command below runs a more complex pipeline. Again, analyse which components are defined and what they do.
./bin/run_pipeline pipelines/examples/2_complete/complete.pipeline
To see its results, we need one more tool: Brat.
3) View extracted documents with Brat
Brat is a web-based tool to visualize the output documents processed with a BLUIMA pipeline. Let's launch it and see what information was extracted by the above two pipelines:
- cd brat/
- mkdir work (only needed the first time)
- python standalone.py (ctrl-D to exit)
- visit http://127.0.0.1:8001
- browse; for instance, visit http://127.0.0.1:8001/index.xhtml#/complete_pipeline__0/11584811
4) Brain Region NER, lexicon-based
The above NER uses a lexicon (a list) of brain regions to annotate text. In the pipeline below, we evaluate this NER against a corpus of annotated brain regions. See also slide 5 of the presentation.
./bin/run_pipeline pipelines/examples/7_brainregion_ner/evaluate_lexicon-based_brainregion_ner.pipeline
- visit http://127.0.0.1:8001/index.xhtml#/lex_brainregions_ner_0/10753309
In the annotated documents, the Gold annotations (tags) correspond to the true brain regions (= the reference). These have been created manually by domain experts (neuroscientists). The red BrainRegion annotations are generated by the lexicon-based NER.
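As a rough illustration of what a lexicon-based tagger does, here is a small Python sketch. It is not BLUIMA's implementation: the four-entry lexicon and the function names are assumptions made for this example, and the only preprocessing shown is case-insensitive matching.

```python
import re

# Hypothetical mini-lexicon; a real brain region lexicon has thousands of entries.
LEXICON = ["thalamus", "subthalamic nucleus", "hippocampus", "nucleus"]

def build_matcher(lexicon):
    """Compile the lexicon into one regular expression."""
    # Longest entries first, so "subthalamic nucleus" wins over "nucleus".
    entries = sorted(lexicon, key=len, reverse=True)
    pattern = "|".join(re.escape(entry) for entry in entries)
    return re.compile(rf"\b(?:{pattern})\b", re.IGNORECASE)

def tag_brain_regions(text, matcher):
    """Return (start, end, matched_text) for each lexicon hit."""
    return [(m.start(), m.end(), m.group(0)) for m in matcher.finditer(text)]

if __name__ == "__main__":
    matcher = build_matcher(LEXICON)
    sentence = "The Subthalamic nucleus projects to the thalamus."
    for hit in tag_brain_regions(sentence, matcher):
        print(hit)
```

Anything not in the lexicon (a plural, a spelling variant, a new abbreviation) is missed by this kind of matcher, which is what preprocessing such as stemming and synonym expansion tries to compensate for.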
5) Brain Region NER, machine learning-based
As above, we evaluate a brain region NER against the annotated corpus. This time, the model is machine learning-based (see the model features and also slide 9 of the presentation).
./bin/run_pipeline pipelines/examples/7_brainregion_ner/evaluate_crf-based_brainregion_ner.pipeline
- visit http://127.0.0.1:8001/index.xhtml#/crf_brainregions_ner_0/10753309
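Evaluating an NER against gold annotations usually comes down to precision, recall and F1. The Python sketch below uses exact span matching, which is an assumption made for illustration; the evaluation components in the BLUIMA pipelines may use different matching criteria. The function name and the example spans are hypothetical.

```python
def evaluate(gold, predicted):
    """Compare two collections of (start, end) spans using exact matching.

    Returns (precision, recall, f1).
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

if __name__ == "__main__":
    # Made-up spans: 3 gold regions, 4 predictions, 2 of them correct.
    gold = [(4, 23), (40, 48), (60, 71)]
    predicted = [(4, 23), (40, 48), (80, 85), (90, 95)]
    precision, recall, f1 = evaluate(gold, predicted)
    print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

With exact matching, a prediction that covers only part of a gold region counts as both a false positive and a false negative, so scores under this criterion tend to be stricter than under overlap-based matching.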
6) Brain Region NER, rule-based with Apache Ruta
Now it is time for you to create a rule-based NER. For this, we will use Ruta (site, documentation), a workbench for writing and testing NLP rules.
- Download Eclipse (or use your own, if recent enough)
- Install RUTA plugin
  - Help > Install New Software...
    - Work with: http://www.apache.org/dist/uima/eclipse-update-site/
    - Add...
    - Name: RUTA, OK
    - Select Apache UIMA Eclipse tooling and runtime support and Apache UIMA Ruta
    - Next, Next, Finish
- Open Ruta
  - Window > Open Perspective > Other...
  - Select: Uima Ruta
- Install project
  - File > Import
  - General > Existing Project into Workspace, Next
  - Select archive file
  - Browse... to project from above
  - Finish
  - The project should appear in your Script Browser on the left
- Now let's apply the rules to some text.
  - Open script script/ch/epfl/bbp/uima/types/Main.ruta. This is where rules are defined. Check the Ruta documentation to understand the syntax.
  - Right click, Debug As, UIMA Ruta
  - Open Console view to see progress, wait until it's <terminated>
- View results
  - Open output/scratch_bre.txt.xmi
    - Select Type System MainTypeSystem.xml
  - Click on Eclipse View Annotation Browser View (tabbed window on the right)
    - Select the first 5 checkboxes to reveal the annotations we added.
- Inspect results
  - This lets you see which rules have been applied to which sentence.
  - Window > Open Perspective > Ruta Explain
  - See which rules have been applied
    - Click on Eclipse View Applied Rules
    - Select rule [3/6] ("(" W{REGEXP("[A-Z]{2,5}") (third one)
    - Click on (STN) in the Eclipse View Matched Rules (window at the bottom) to see which text has been found
- Your turn now
  - Uncomment some of the lines in Main.ruta, and see their effect
  - View the results of output/rnd_500.txt.xmi and see if/how you can improve them
  - Again: refer to Ruta's documentation
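To get a feel for what the rule selected in the Applied Rules view matches, here is a rough Python analogue of the part of it that is visible above: an opening parenthesis followed by a word of 2 to 5 uppercase letters. The rule is shown truncated, so the closing parenthesis and the function name below are assumptions, and this is not Ruta syntax.

```python
import re

# "(" followed by 2 to 5 uppercase letters; the closing ")" is an assumption,
# since the Ruta rule is only partially displayed.
ABBREVIATION = re.compile(r"\(([A-Z]{2,5})\)")

def find_abbreviations(text):
    """Return (start, end, abbreviation) for each parenthesized abbreviation."""
    return [(m.start(1), m.end(1), m.group(1)) for m in ABBREVIATION.finditer(text)]

if __name__ == "__main__":
    sentence = "the subthalamic nucleus (STN) and the globus pallidus (GP)"
    for hit in find_abbreviations(sentence):
        print(hit)
```

A pattern like this also fires on parenthesized uppercase strings that are not brain regions, so it is best treated as a starting point to refine with additional conditions, as the commented-out lines in Main.ruta invite you to do.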