Named Entity Recognition

Named Entity Recognition (NER) is the process of extracting relevant terms (or words) from text: sentences, paragraphs, or articles. Typically in NER, one scans through all the words in a text, tagging each with the most relevant tag from a given dictionary of tags. If a person does this manually, it is very difficult to keep track of how many times a particular tag has been assigned, or to remember where in the text a particular tag appears most often, let alone assess the similarity of texts based on the frequency of these tags. For this reason, NER is done on a large scale using pre-trained algorithms.

For example, consider the following sentence:

The European Union, with U.S. backing, has threatened to refer Iran to the U.N. Security Council, which could impose sanctions if it finds Tehran has violated the Nuclear Non-Proliferation treaty.

When this sentence is run through an NER algorithm, the most likely output is the following:

Organization: European Union, U.N. Security Council
Geopolitical Entity: U.S., Iran, Tehran
Artifact: Nuclear Non-Proliferation treaty
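
To see this kind of output end to end, here is a minimal sketch using spaCy's pretrained English pipeline (spaCy is one of the standard libraries mentioned later in this write-up). The exact entity labels and spans depend on the model version, and spaCy's label set (ORG, GPE, ...) differs slightly from the tags used in this work.

```python
# Minimal NER sketch with spaCy's small pretrained English model.
# Assumes the model has been installed: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
text = ("The European Union, with U.S. backing, has threatened to refer Iran "
        "to the U.N. Security Council, which could impose sanctions if it finds "
        "Tehran has violated the Nuclear Non-Proliferation treaty.")

doc = nlp(text)
for ent in doc.ents:
    # ent.label_ is spaCy's own tag (ORG, GPE, LAW, ...), not the art/eve/geo/... tags used below.
    print(f"{ent.text} -> {ent.label_}")
```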

This Wikipedia write-up on NER sums up the basics of NER, as well as some application areas, pretty well.


Where is it used?


Some applications of NER as listed in this article are:

1. Classifying content for news providers
2. Efficient search algorithms
3. Powering content recommendations
4. Customer support
5. Research papers

There are many more applications. NER can be used very effectively wherever information has to be selectively extracted from text.


How does it work?


The algorithms that power an NER product are classifiers: algorithms that use statistical or machine-learning tools to compute a score (usually a probability) for each tag in a pre-specified pool of tags. A word is assigned the tag with the maximum score (or probability).
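
As a toy illustration of that scoring-and-argmax step (a hypothetical example, not the classifier used later), assume the classifier returns one probability per tag for a given word and we pick the highest:

```python
import numpy as np

# Hypothetical per-tag probabilities a classifier might return for one word.
tags = ["art", "eve", "geo", "gpe", "nat", "org", "per", "tim", "o"]
scores = np.array([0.01, 0.01, 0.05, 0.80, 0.01, 0.07, 0.02, 0.01, 0.02])

# The word is assigned the tag with the maximum score (probability).
predicted_tag = tags[int(np.argmax(scores))]
print(predicted_tag)  # -> gpe
```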


About this work


In this work, a large dataset of about 50,000 sentences has been analyzed to build a Named Entity Recognition tool; this is roughly 10x the length of a mid-size book (ref). The final aim is to assign a tag to every word. The dataset has been obtained from this Kaggle database. For our feature-engineering experiments we use only the first column, which gives the individual words, and the third column, which gives the corresponding tags.
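
A minimal sketch of how those two columns might be read from the Kaggle CSV is shown below; the file name and encoding are assumptions and may need adjusting to the actual download, but the column positions follow the description above.

```python
import pandas as pd

# Hypothetical file name/encoding; adjust to match the downloaded Kaggle file.
df = pd.read_csv("ner_dataset.csv", encoding="latin1")

# First column: individual words; third column: corresponding tags.
words = df.iloc[:, 0].astype(str).tolist()
tags = df.iloc[:, 2].astype(str).tolist()

print(len(words), "words,", len(set(tags)), "distinct tags")
```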

Even though many deep learning models and standard libraries like spaCy are available, we have chosen a Logistic Regression classifier because our basic intention is to study how small changes in the selection of features affect classification accuracy. All the code and metric files can be found in this GitHub repository. The code is written in Python using the NumPy module.

We are going to consider the following tags:

| Tag | Description |
|-----|-------------|
| art | Artifact |
| eve | Event |
| geo | Geographical Entity |
| gpe | Geopolitical Entity |
| nat | Natural Phenomenon |
| org | Organization |
| per | Person |
| tim | Time indicator |

Model 1


Data Matrices

Model 1 is the simplest model imaginable: for any given datapoint, the word itself is the only feature. In other words, the feature matrix is a single-column matrix of dimension N x 1, where N is the total number of words in the entire dataset.

Feature matrix X:

$\begin{pmatrix} \text{word } 1 \\ \text{word } 2 \\ \text{word } 3 \\ \vdots \\ \text{word } N \end{pmatrix}$

Label matrix Y:

$\begin{pmatrix} \text{tag } 1 \\ \text{tag } 2 \\ \text{tag } 3 \\ \vdots \\ \text{tag } N \end{pmatrix}$
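
A minimal sketch of this setup, assuming scikit-learn is used (the repository may differ): each word in the N x 1 feature column is one-hot encoded, since logistic regression needs numeric inputs, and a Logistic Regression classifier is fitted against the N x 1 tag column.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the N x 1 feature column and N x 1 label column above.
words = ["The", "European", "Union", "backed", "Iran", "."]
tags = ["o", "b-org", "i-org", "o", "b-gpe", "o"]

# One feature per datapoint: the word itself, one-hot encoded.
vec = DictVectorizer()
X = vec.fit_transform([{"word": w} for w in words])

clf = LogisticRegression(max_iter=1000)
clf.fit(X, tags)

# Tags predicted for the training words.
print(clf.predict(X))
```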

Results

| Tag | Precision | Recall | F1 Score | Number of datapoints |
|-------|---------|---------|---------|-------|
| b-art | 0.33333 | 0.04651 | 0.08163 | 43 |
| b-eve | 0.66667 | 0.29630 | 0.41026 | 27 |
| b-geo | 0.79778 | 0.76673 | 0.78195 | 3751 |
| b-gpe | 0.92446 | 0.95269 | 0.93837 | 1670 |
| b-nat | 1.00000 | 0.20000 | 0.33333 | 15 |
| b-org | 0.72855 | 0.38951 | 0.50763 | 2136 |
| b-per | 0.76718 | 0.36022 | 0.49024 | 1674 |
| b-tim | 0.88656 | 0.69536 | 0.77940 | 2068 |
| i-art | 0.00000 | 0.00000 | 0.00000 | 22 |
| i-eve | 0.00000 | 0.00000 | 0.00000 | 25 |
| i-geo | 0.66972 | 0.53441 | 0.59446 | 683 |
| i-gpe | 1.00000 | 0.18750 | 0.31579 | 32 |
| i-nat | 0.00000 | 0.00000 | 0.00000 | 4 |
| i-org | 0.63462 | 0.01887 | 0.03665 | 1749 |
| i-per | 0.76443 | 0.56616 | 0.65052 | 1708 |
| i-tim | 0.49796 | 0.19396 | 0.27918 | 629 |
| o | 0.94022 | 0.99789 | 0.96820 | 85194 |

Weighted average F1-score = 0.90987
Weighted average F1-score (without o-tag) = 0.60380
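
The per-tag numbers and the weighted averages above are standard classification metrics; a sketch of how they could be computed with scikit-learn, including the variant that drops the dominant o-tag, is shown below (the toy y_true/y_pred values are illustrative).

```python
from sklearn.metrics import classification_report, f1_score

# Toy gold and predicted tag sequences, flattened to one entry per word.
y_true = ["b-geo", "o", "o", "b-org", "i-org", "o"]
y_pred = ["b-geo", "o", "o", "b-org", "o", "o"]

# Per-tag precision/recall/F1 and support, as in the table above.
print(classification_report(y_true, y_pred, zero_division=0))

# Weighted average F1 over all tags, and again with the o-tag excluded.
f1_all = f1_score(y_true, y_pred, average="weighted", zero_division=0)
tags_without_o = sorted(set(y_true) - {"o"})
f1_no_o = f1_score(y_true, y_pred, labels=tags_without_o,
                   average="weighted", zero_division=0)
print(f1_all, f1_no_o)
```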


Model 2


In Model 2, the Model 1 feature matrix has been altered as shown below. For any given datapoint, instead of considering only the given word as the feature, the word immediately before and the word immediately after are also included. The feature matrix is modified in this way for every word in the dataset. The tokens 'BOS' and 'EOS', indicating 'Beginning of Sentence' and 'End of Sentence', have been added at the beginning and end of each sentence.

Feature matrix X:

$\begin{pmatrix} \text{BOS} & \text{word } 1 & \text{word } 2 \\ \text{word } 1 & \text{word } 2 & \text{word } 3 \\ \text{word } 2 & \text{word } 3 & \text{word } 4 \\ \vdots & \vdots & \vdots \\ \text{word } N{-}1 & \text{word } N & \text{EOS} \end{pmatrix}$

Label matrix Y:

$\begin{pmatrix} \text{tag } 1 \\ \text{tag } 2 \\ \text{tag } 3 \\ \vdots \\ \text{tag } N \end{pmatrix}$
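
A minimal sketch of how the rows of this windowed feature matrix could be built from a tokenised sentence is shown below; the function name is illustrative and the padding tokens follow the description above, so this is a sketch rather than the exact code from the repository.

```python
def window_features(sentence):
    """Return (previous word, word, next word) rows for one sentence,
    padding the ends with 'BOS' and 'EOS' as described above."""
    padded = ["BOS"] + sentence + ["EOS"]
    return [(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]

sentence = ["Iran", "rejected", "the", "accusation", "."]
for row in window_features(sentence):
    print(row)
# ('BOS', 'Iran', 'rejected'), ('Iran', 'rejected', 'the'), ..., ('accusation', '.', 'EOS')
```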

Results

| Tag | Precision | Recall | F1 Score | Number of datapoints |
|-------|---------|---------|---------|-------|
| b-art | 0.30769 | 0.09302 | 0.14286 | 43 |
| b-eve | 0.80000 | 0.29630 | 0.43243 | 27 |
| b-geo | 0.86643 | 0.77493 | 0.81813 | 3750 |
| b-gpe | 0.94223 | 0.94787 | 0.94504 | 1669 |
| b-nat | 1.00000 | 0.33333 | 0.50000 | 15 |
| b-org | 0.78608 | 0.50773 | 0.61696 | 2135 |
| b-per | 0.84945 | 0.64115 | 0.73074 | 1674 |
| b-tim | 0.90988 | 0.81084 | 0.85751 | 2067 |
| i-art | 0.65841 | 0.73939 | 0.69655 | 683 |
| i-eve | 1.00000 | 0.71875 | 0.83636 | 32 |
| i-geo | 1.00000 | 0.25000 | 0.40000 | 4 |
| i-gpe | 0.78629 | 0.39337 | 0.52439 | 1749 |
| i-nat | 0.87068 | 0.75381 | 0.80804 | 1706 |
| i-org | 0.80802 | 0.60890 | 0.69447 | 1749 |
| i-per | 0.76443 | 0.56616 | 0.65052 | 629 |
| i-tim | 0.80802 | 0.60890 | 0.69447 | 629 |
| o | 0.96149 | 0.99603 | 0.97846 | 85134 |

Weighted average F1-score = 0.94240
Weighted average F1-score (without o-tag) = 0.75401


Model 2 improves the weighted average F1-score (without the o-tag) by almost 0.15 over Model 1.