Named Entity Recognition

Named Entity Recognition (NER) is the process of extracting relevant terms (or words) from text: sentences, paragraphs, or articles. Typically in NER, one scans through all the words in a text, tagging each with the most relevant tag from a given dictionary of tags. If a person does this manually, it is very difficult to keep track of how many times a particular tag has been assigned, or to remember where in the text a particular tag appears most often, let alone assess the similarity of texts based on the frequency of these tags. For this reason, NER is done on a large scale using pre-trained algorithms.

For example, consider the following sentence:

The European Union, with U.S. backing, has threatened to refer Iran to the U.N. Security Council, which could impose sanctions if it finds Tehran has violated the Nuclear Non-Proliferation treaty.

When this sentence is run through an NER algorithm, the most likely output is the following:

Organization: European Union, U.N. Security Council
Geopolitical Entity: U.S., Iran, Tehran
Artifact: Nuclear Non-Proliferation treaty
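
To see this kind of output end to end, here is a minimal sketch using spaCy's pretrained English pipeline (spaCy is one of the standard libraries mentioned later in this write-up). The exact entity labels and spans depend on the model version, and spaCy's label set (ORG, GPE, ...) differs slightly from the tags used in this work.

```python
# Minimal NER sketch with spaCy's small pretrained English model.
# Assumes the model has been installed: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
text = ("The European Union, with U.S. backing, has threatened to refer Iran "
        "to the U.N. Security Council, which could impose sanctions if it finds "
        "Tehran has violated the Nuclear Non-Proliferation treaty.")

doc = nlp(text)
for ent in doc.ents:
    # ent.label_ is spaCy's own tag (ORG, GPE, LAW, ...), not the art/eve/geo/... tags used below.
    print(f"{ent.text} -> {ent.label_}")
```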

This Wikipedia write-up on NER sums up the basics of NER, as well as some application areas, pretty well.


Where is it used?


Some applications of NER as listed in this article are:

1. Classifying content for news providers
2. Efficient search algorithms
3. Powering content recommendations
4. Customer support
5. Research papers

There are many more applications. NER can be used very effectively wherever information has to be selectively extracted from text.


How does it work?


The algorithms that power an NER product are classifiers: algorithms that use statistical or machine-learning tools to compute a score (usually a probability) for each tag in a pre-specified pool of tags. A word is assigned the tag with the maximum score (or probability).
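
As a toy illustration of that scoring-and-argmax step (a hypothetical example, not the classifier used later), assume the classifier returns one probability per tag for a given word and we pick the highest:

```python
import numpy as np

# Hypothetical per-tag probabilities a classifier might return for one word.
tags = ["art", "eve", "geo", "gpe", "nat", "org", "per", "tim", "o"]
scores = np.array([0.01, 0.01, 0.05, 0.80, 0.01, 0.07, 0.02, 0.01, 0.02])

# The word is assigned the tag with the maximum score (probability).
predicted_tag = tags[int(np.argmax(scores))]
print(predicted_tag)  # -> gpe
```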


About this work


In this work, a large dataset of about 50,000 sentences has been analyzed to build a Named Entity Recognition tool; this is roughly 10x the length of a mid-size book (ref). The final aim is to assign a tag to every word. The dataset has been obtained from this Kaggle database. For our feature-engineering experiments we use only the first column, which gives the individual words, and the third column, which gives the corresponding tags.
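
A minimal sketch of how those two columns might be read from the Kaggle CSV is shown below; the file name and encoding are assumptions and may need adjusting to the actual download, but the column positions follow the description above.

```python
import pandas as pd

# Hypothetical file name/encoding; adjust to match the downloaded Kaggle file.
df = pd.read_csv("ner_dataset.csv", encoding="latin1")

# First column: individual words; third column: corresponding tags.
words = df.iloc[:, 0].astype(str).tolist()
tags = df.iloc[:, 2].astype(str).tolist()

print(len(words), "words,", len(set(tags)), "distinct tags")
```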

Even though many deep learning models and standard libraries like spaCy are available, we have chosen a Logistic Regression classifier because our basic intention is to study how small changes in the selection of features affect classification accuracy. All the code and metric files can be found in this GitHub repository. The code is written in Python using the NumPy module.

We are going to consider the following tags:

| Tag | Description |
|-----|-------------|
| art | Artifact |
| eve | Event |
| geo | Geographical Entity |
| gpe | Geopolitical Entity |
| nat | Natural Phenomenon |
| org | Organization |
| per | Person |
| tim | Time indicator |

Model 1


Data Matrices

Model 1 is the simplest model imaginable: for any given datapoint, the word itself is the only feature. In other words, the feature matrix is a single-column matrix of dimension N x 1, where N is the total number of words in the entire dataset.

Feature matrix X:

$\begin{pmatrix} \text{word } 1 \\ \text{word } 2 \\ \text{word } 3 \\ \vdots \\ \text{word } N \end{pmatrix}$

Label matrix Y:

$\begin{pmatrix} \text{tag } 1 \\ \text{tag } 2 \\ \text{tag } 3 \\ \vdots \\ \text{tag } N \end{pmatrix}$
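
A minimal sketch of this setup, assuming scikit-learn is used (the repository may differ): each word in the N x 1 feature column is one-hot encoded, since logistic regression needs numeric inputs, and a Logistic Regression classifier is fitted against the N x 1 tag column.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the N x 1 feature column and N x 1 label column above.
words = ["The", "European", "Union", "backed", "Iran", "."]
tags = ["o", "b-org", "i-org", "o", "b-gpe", "o"]

# One feature per datapoint: the word itself, one-hot encoded.
vec = DictVectorizer()
X = vec.fit_transform([{"word": w} for w in words])

clf = LogisticRegression(max_iter=1000)
clf.fit(X, tags)

# Tags predicted for the training words.
print(clf.predict(X))
```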

Results

| Tag | Precision | Recall | F1 Score | Number of datapoints |
|-------|---------|---------|---------|-------|
| b-art | 0.33333 | 0.04651 | 0.08163 | 43 |
| b-eve | 0.66667 | 0.29630 | 0.41026 | 27 |
| b-geo | 0.79778 | 0.76673 | 0.78195 | 3751 |
| b-gpe | 0.92446 | 0.95269 | 0.93837 | 1670 |
| b-nat | 1.00000 | 0.20000 | 0.33333 | 15 |
| b-org | 0.72855 | 0.38951 | 0.50763 | 2136 |
| b-per | 0.76718 | 0.36022 | 0.49024 | 1674 |
| b-tim | 0.88656 | 0.69536 | 0.77940 | 2068 |
| i-art | 0.00000 | 0.00000 | 0.00000 | 22 |
| i-eve | 0.00000 | 0.00000 | 0.00000 | 25 |
| i-geo | 0.66972 | 0.53441 | 0.59446 | 683 |
| i-gpe | 1.00000 | 0.18750 | 0.31579 | 32 |
| i-nat | 0.00000 | 0.00000 | 0.00000 | 4 |
| i-org | 0.63462 | 0.01887 | 0.03665 | 1749 |
| i-per | 0.76443 | 0.56616 | 0.65052 | 1708 |
| i-tim | 0.49796 | 0.19396 | 0.27918 | 629 |
| o | 0.94022 | 0.99789 | 0.96820 | 85194 |

Weighted average F1-score = 0.90987
Weighted average F1-score (without o-tag) = 0.60380
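
The per-tag numbers and the weighted averages above are standard classification metrics; a sketch of how they could be computed with scikit-learn, including the variant that drops the dominant o-tag, is shown below (the toy y_true/y_pred values are illustrative).

```python
from sklearn.metrics import classification_report, f1_score

# Toy gold and predicted tag sequences, flattened to one entry per word.
y_true = ["b-geo", "o", "o", "b-org", "i-org", "o"]
y_pred = ["b-geo", "o", "o", "b-org", "o", "o"]

# Per-tag precision/recall/F1 and support, as in the table above.
print(classification_report(y_true, y_pred, zero_division=0))

# Weighted average F1 over all tags, and again with the o-tag excluded.
f1_all = f1_score(y_true, y_pred, average="weighted", zero_division=0)
tags_without_o = sorted(set(y_true) - {"o"})
f1_no_o = f1_score(y_true, y_pred, labels=tags_without_o,
                   average="weighted", zero_division=0)
print(f1_all, f1_no_o)
```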


Model 2


In Model 2, the Model 1 feature matrix has been altered as shown below. For any given datapoint, instead of considering only the given word as the feature, the word immediately before and the word immediately after are also included. The feature matrix is modified in this way for every word in the dataset. The tokens 'BOS' and 'EOS', indicating 'Beginning of Sentence' and 'End of Sentence', have been added at the beginning and end of each sentence.

Feature matrix X:

$\begin{pmatrix} \text{BOS} & \text{word } 1 & \text{word } 2 \\ \text{word } 1 & \text{word } 2 & \text{word } 3 \\ \text{word } 2 & \text{word } 3 & \text{word } 4 \\ \vdots & \vdots & \vdots \\ \text{word } N{-}1 & \text{word } N & \text{EOS} \end{pmatrix}$

Label matrix Y:

$\begin{pmatrix} \text{tag } 1 \\ \text{tag } 2 \\ \text{tag } 3 \\ \vdots \\ \text{tag } N \end{pmatrix}$
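
A minimal sketch of how the rows of this windowed feature matrix could be built from a tokenised sentence is shown below; the function name is illustrative and the padding tokens follow the description above, so this is a sketch rather than the exact code from the repository.

```python
def window_features(sentence):
    """Return (previous word, word, next word) rows for one sentence,
    padding the ends with 'BOS' and 'EOS' as described above."""
    padded = ["BOS"] + sentence + ["EOS"]
    return [(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]

sentence = ["Iran", "rejected", "the", "accusation", "."]
for row in window_features(sentence):
    print(row)
# ('BOS', 'Iran', 'rejected'), ('Iran', 'rejected', 'the'), ..., ('accusation', '.', 'EOS')
```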

Results

| Tag | Precision | Recall | F1 Score | Number of datapoints |
|-------|---------|---------|---------|-------|
| b-art | 0.30769 | 0.09302 | 0.14286 | 43 |
| b-eve | 0.80000 | 0.29630 | 0.43243 | 27 |
| b-geo | 0.86643 | 0.77493 | 0.81813 | 3750 |
| b-gpe | 0.94223 | 0.94787 | 0.94504 | 1669 |
| b-nat | 1.00000 | 0.33333 | 0.50000 | 15 |
| b-org | 0.78608 | 0.50773 | 0.61696 | 2135 |
| b-per | 0.84945 | 0.64115 | 0.73074 | 1674 |
| b-tim | 0.90988 | 0.81084 | 0.85751 | 2067 |
| i-art | 0.65841 | 0.73939 | 0.69655 | 683 |
| i-eve | 1.00000 | 0.71875 | 0.83636 | 32 |
| i-geo | 1.00000 | 0.25000 | 0.40000 | 4 |
| i-gpe | 0.78629 | 0.39337 | 0.52439 | 1749 |
| i-nat | 0.87068 | 0.75381 | 0.80804 | 1706 |
| i-org | 0.80802 | 0.60890 | 0.69447 | 1749 |
| i-per | 0.76443 | 0.56616 | 0.65052 | 629 |
| i-tim | 0.80802 | 0.60890 | 0.69447 | 629 |
| o | 0.96149 | 0.99603 | 0.97846 | 85134 |

Weighted average F1-score = 0.94240
Weighted average F1-score (without o-tag) = 0.75401


Model 2 improves the weighted average F1-score (without the o-tag) by almost 0.15 over Model 1.