Named Entity Recognition (NER) is the process of extracting relevant terms (or words) from text: sentences, paragraphs, or articles. Typically, NER scans through all the words in a text and tags each with the most relevant tag from a given dictionary of tags. If a person does this manually, it is very difficult to keep track of how many times a particular tag has been assigned, or to remember in which portion of the text a particular tag appears most often, let alone to assess the similarity of texts based on the frequency of these tags. For this reason, NER is done on a large scale using pre-trained algorithms.
For example, consider the following sentence:
The European Union, with U.S. backing, has threatened to refer Iran to the U.N. Security Council, which could impose sanctions if it finds Tehran has violated the Nuclear Non-Proliferation treaty.
When this sentence is run through an NER algorithm, the most likely output is the following:
Organization: European Union, U.N. Security Council
Geopolitical Entity: U.S., Iran, Tehran
Artifact: Nuclear Non-Proliferation treaty
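As a quick illustration, an off-the-shelf pre-trained pipeline such as spaCy (mentioned again below) can produce a comparable tagging. The snippet is only a sketch: it assumes the `en_core_web_sm` model is installed, and spaCy uses its own label scheme (ORG, GPE, etc.) rather than the tag set used later in this article.

```python
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = ("The European Union, with U.S. backing, has threatened to refer Iran "
        "to the U.N. Security Council, which could impose sanctions if it finds "
        "Tehran has violated the Nuclear Non-Proliferation treaty.")

doc = nlp(text)
for ent in doc.ents:
    # Exact output depends on the model version, e.g. "European Union -> ORG"
    print(ent.text, "->", ent.label_)
```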
This Wikipedia write-up on NER sums up the basics of NER, as well as some of its application areas, quite well.
Some applications of NER as listed in this article are:
1. Classifying content for news providers
2. Efficient search algorithms
3. Powering content recommendations
4. Customer support
5. Research papers
There are many more applications. NER can be used very effectively wherever information has to be selectively extracted from text.
The algorithms that power an NER product are essentially classifiers. Classifiers are algorithms that use statistical or machine learning tools to compute a score (usually a probability) for each tag from a pre-specified pool of tags; a word is then assigned the tag with the maximum score (or probability).
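As a toy illustration of this argmax step (the per-tag scores below are made up for illustration and are not produced by any trained model):

```python
import numpy as np

tags = ["art", "eve", "geo", "gpe", "nat", "org", "per", "tim", "o"]

# Hypothetical per-tag probabilities for a single word, e.g. "Tehran"
scores = np.array([0.01, 0.01, 0.10, 0.70, 0.01, 0.05, 0.05, 0.02, 0.05])

# The word is assigned the tag with the maximum score
predicted_tag = tags[int(np.argmax(scores))]
print(predicted_tag)  # gpe
```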
In this work, a large dataset of about 50,000 sentences has been analyzed to build a Named Entity Recognition tool. This is roughly 10x the length of a mid-size book (ref). The final aim is to assign a tag to each and every word. The dataset has been obtained from this Kaggle database. For our feature engineering experiments, we selectively use the first column, which gives the individual words, and the third column, which gives the corresponding tags.
Even though many deep learning models and standard libraries like spaCy are available, we have primarily chosen a Logistic Regression classifier because our basic intention is to study how a small change in the selection of features affects classification accuracy. All the code and metric files can be found in this GitHub repository. All code is written in Python using the NumPy module.
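As a rough sketch, the word and tag columns can be pulled out of the CSV along these lines; the file name and column names here are assumptions (adjust them to the actual Kaggle file), and pandas is used only for convenience rather than being the repository's approach.

```python
import pandas as pd

# Hypothetical file name; the Kaggle data ships as a CSV with one word per row.
df = pd.read_csv("ner_dataset.csv", encoding="latin1")

# Column names are assumptions -- adjust to the actual file.
words = df["Word"].astype(str).values
tags = df["Tag"].str.lower().values  # normalise e.g. "B-geo" to "b-geo"

print(words[:5], tags[:5])
```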
We are going to consider the following tags:
Tag | Description |
---|---|
art | Artifact |
eve | Event |
geo | Geographical Entity |
gpe | Geopolitical Entity |
nat | Natural Phenomenon |
org | Organization |
per | Person |
tim | Time indicator |
Model 1 is the simplest model imaginable. For any given datapoint, the individual word corresponding to that datapoint is the only feature. In other words, the feature matrix is a single-column matrix of dimension N x 1, where N is the total number of words in the entire dataset.
Feature matrix X:
$\begin{pmatrix}word \hspace{0.1cm} 1\\ word\hspace{0.1cm}2\\ word\hspace{0.1cm}3\\ ...\\ word\hspace{0.1cm}N\\\end{pmatrix}$
Label matrix Y:
$\begin{pmatrix}tag \hspace{0.1cm} 1\\ tag\hspace{0.1cm}2\\ tag\hspace{0.1cm}3\\ ...\\ tag\hspace{0.1cm}N\\\end{pmatrix}$
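A minimal training sketch for this single-word feature, using scikit-learn's LogisticRegression and a one-hot encoding of the word column (the repository may implement the encoding and training differently):

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins; in practice use the full word/tag arrays from the dataset.
words = ["The", "European", "Union", "backed", "Iran", "on", "Monday"]
tags  = ["o",   "b-org",    "i-org", "o",      "b-gpe", "o", "b-tim"]

# Model 1: the word itself is the only feature. One-hot encoding turns the
# N x 1 column of words into an N x V sparse indicator matrix (V = vocabulary size).
vec = DictVectorizer(sparse=True)
X = vec.fit_transform([{"word": w} for w in words])
y = np.asarray(tags)

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy on the toy data
```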
Tag | Precision | Recall | F1 Score | Number of datapoints |
---|---|---|---|---|
b-art | 0.33333 | 0.04651 | 0.08163 | 43 |
b-eve | 0.66667 | 0.29630 | 0.41026 | 27 |
b-geo | 0.79778 | 0.76673 | 0.78195 | 3751 |
b-gpe | 0.92446 | 0.95269 | 0.93837 | 1670 |
b-nat | 1.00000 | 0.20000 | 0.33333 | 15 |
b-org | 0.72855 | 0.38951 | 0.50763 | 2136 |
b-per | 0.76718 | 0.36022 | 0.49024 | 1674 |
b-tim | 0.88656 | 0.69536 | 0.77940 | 2068 |
i-art | 0.00000 | 0.00000 | 0.00000 | 22 |
i-eve | 0.00000 | 0.00000 | 0.00000 | 25 |
i-geo | 0.66972 | 0.53441 | 0.59446 | 683 |
i-gpe | 1.00000 | 0.18750 | 0.31579 | 32 |
i-nat | 0.00000 | 0.00000 | 0.00000 | 4 |
i-org | 0.63462 | 0.01887 | 0.03665 | 1749 |
i-per | 0.76443 | 0.56616 | 0.65052 | 1708 |
i-tim | 0.49796 | 0.19396 | 0.27918 | 629 |
o | 0.94022 | 0.99789 | 0.96820 | 85194 |
Weighted average F1-score = 0.90987
Weighted average F1-score (without o-tag) = 0.60380
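The two summary numbers are support-weighted averages of the per-tag F1 scores. The short check below recomputes them from the table above (a sketch, not the repository's metric code):

```python
import numpy as np

# Per-tag F1 scores and datapoint counts, copied from the Model 1 table above.
f1 = {"b-art": 0.08163, "b-eve": 0.41026, "b-geo": 0.78195, "b-gpe": 0.93837,
      "b-nat": 0.33333, "b-org": 0.50763, "b-per": 0.49024, "b-tim": 0.77940,
      "i-art": 0.00000, "i-eve": 0.00000, "i-geo": 0.59446, "i-gpe": 0.31579,
      "i-nat": 0.00000, "i-org": 0.03665, "i-per": 0.65052, "i-tim": 0.27918,
      "o": 0.96820}
support = {"b-art": 43, "b-eve": 27, "b-geo": 3751, "b-gpe": 1670, "b-nat": 15,
           "b-org": 2136, "b-per": 1674, "b-tim": 2068, "i-art": 22, "i-eve": 25,
           "i-geo": 683, "i-gpe": 32, "i-nat": 4, "i-org": 1749, "i-per": 1708,
           "i-tim": 629, "o": 85194}

def weighted_f1(tag_list):
    """Support-weighted average of per-tag F1 over the given tags."""
    n = np.array([support[t] for t in tag_list], dtype=float)
    f = np.array([f1[t] for t in tag_list])
    return float(np.sum(f * n) / np.sum(n))

print(weighted_f1(list(f1)))                     # ~0.9099 (with o-tag)
print(weighted_f1([t for t in f1 if t != "o"]))  # ~0.6038 (without o-tag)
```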
In Model 2, the feature matrix of Model 1 has been altered as shown below. For any given datapoint, instead of considering only the given word as the feature, the word immediately before and the word immediately after the given word are also considered. In this manner, the feature matrix is modified to account for all the words. The tokens 'BOS' and 'EOS', indicating 'Beginning of Sentence' and 'End of Sentence', have been added at the beginning and end of each sentence.
Feature matrix X:
$\begin{pmatrix}
BOS\hspace{0.3cm}word\hspace{0.1cm}1\hspace{0.3cm}word\hspace{0.1cm}2\\ word\hspace{0.1cm}1\hspace{0.3cm}word\hspace{0.1cm}2\hspace{0.3cm}word\hspace{0.1cm}3\\
word\hspace{0.1cm}2\hspace{0.3cm}word\hspace{0.1cm}3\hspace{0.3cm}word\hspace{0.1cm}4\\
...\\
word\hspace{0.1cm}N-1 \hspace{0.3cm}word\hspace{0.1cm}N\hspace{0.3cm}EOS\\\end{pmatrix}$
Label matrix Y:
$\begin{pmatrix}tag \hspace{0.1cm} 1\\ tag\hspace{0.1cm}2\\ tag\hspace{0.1cm}3\\ ...\\ tag\hspace{0.1cm}N\\\end{pmatrix}$
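A minimal sketch of how these context-window features can be assembled per sentence, again with scikit-learn's DictVectorizer and LogisticRegression as stand-ins for whatever encoding the repository actually uses:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy sentences as (word, tag) pairs; in practice these come from the dataset,
# grouped by sentence.
sentences = [
    [("Iran", "b-gpe"), ("violated", "o"), ("the", "o"), ("treaty", "o")],
    [("The", "o"), ("European", "b-org"), ("Union", "i-org"), ("agreed", "o")],
]

X_dicts, y = [], []
for sent in sentences:
    padded = ["BOS"] + [w for w, _ in sent] + ["EOS"]  # pad each sentence
    for i, (_, tag) in enumerate(sent, start=1):
        # Model 2 feature: previous word, current word, next word
        X_dicts.append({"prev": padded[i - 1], "word": padded[i], "next": padded[i + 1]})
        y.append(tag)

vec = DictVectorizer(sparse=True)
X = vec.fit_transform(X_dicts)

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.score(X, y))  # training accuracy on the toy data
```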
Tag | Precision | Recall | F1 Score | Number of datapoints |
---|---|---|---|---|
b-art | 0.30769 | 0.09302 | 0.14286 | 43 |
b-eve | 0.80000 | 0.29630 | 0.43243 | 27 |
b-geo | 0.86643 | 0.77493 | 0.81813 | 3750 |
b-gpe | 0.94223 | 0.94787 | 0.94504 | 1669 |
b-nat | 1.00000 | 0.33333 | 0.50000 | 15 |
b-org | 0.78608 | 0.50773 | 0.61696 | 2135 |
b-per | 0.84945 | 0.64115 | 0.73074 | 1674 |
b-tim | 0.90988 | 0.81084 | 0.85751 | 2067 |
i-art | 0.65841 | 0.73939 | 0.69655 | 683 |
i-eve | 1.00000 | 0.71875 | 0.83636 | 32 |
i-geo | 1.00000 | 0.25000 | 0.40000 | 4 |
i-gpe | 0.78629 | 0.39337 | 0.52439 | 1749 |
i-nat | 0.87068 | 0.75381 | 0.80804 | 1706 |
i-org | 0.80802 | 0.60890 | 0.69447 | 1749 |
i-per | 0.76443 | 0.56616 | 0.65052 | 629 |
i-tim | 0.80802 | 0.60890 | 0.69447 | 629 |
o | 0.96149 | 0.99603 | 0.97846 | 85134 |
Weighted average F1-score = 0.94240
Weighted average F1-score (without o-tag) = 0.75401
When Model 2 is used instead of Model 1, the weighted average F1-score (without the o-tag) improves by almost 0.15.