What are some metrics for Named Entity Recognition?#

Contributor(s): Lee Xin Jie, Senior AI Engineer (100E)

Named Entity Recognition#

Named Entity Recognition (NER) refers to the task of locating and classifying named entities in text documents.

This article assumes prior knowledge of the IOB (also known as BIO) format for entity tagging. IOB stands for inside, outside and beginning: outside (O) indicates a token that does not belong to any entity, beginning (B) indicates a token that starts an entity, and inside (I) indicates a token that is inside (but does not start) an entity. You may refer to this article for more information on the IOB format.
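As a quick illustration, the example sentence used later in this article ("The news agency is here", where "news agency" is an Org entity) could be tagged in IOB format as follows. This is a minimal Python sketch; the tag strings (B-ORG, I-ORG, O) are an assumed naming convention rather than something prescribed by the article.

```python
# Tokens of the example sentence used later in this article.
tokens = ["The", "news", "agency", "is", "here"]

# "news agency" is a single Org entity: its first token is tagged B-ORG
# (beginning), its second token I-ORG (inside), and all other tokens O
# (outside). The exact tag strings are an assumed convention.
iob_tags = ["O", "B-ORG", "I-ORG", "O", "O"]

for token, tag in zip(tokens, iob_tags):
    print(f"{token}\t{tag}")
```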

NER can be thought of as a token-level classification problem. Hence, NER models are often tuned using metrics such as F1, precision and recall at the token level during model training. When it comes to evaluating NER models on their downstream task performance, it may be more beneficial to aggregate the individual tokens into full named-entities and evaluate NER models at the full named-entity level against gold standard annotations. This also makes the evaluation more relatable for the client.
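To make the token-level view concrete, the sketch below scores flat per-token predictions with scikit-learn, excluding the O tag so that only entity tokens are counted. The predicted labels here are made up for illustration; only the idea of computing precision, recall and F1 over individual token labels comes from the paragraph above.

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical gold and predicted IOB tags for a single sentence.
gold = ["O", "B-ORG", "I-ORG", "O", "O"]
pred = ["O", "B-ORG", "O", "O", "O"]  # the model misses the second entity token

# Micro-averaged token-level precision, recall and F1 over the entity
# tags only (the O tag is excluded from the average).
precision, recall, f1, _ = precision_recall_fscore_support(
    gold, pred, average="micro", labels=["B-ORG", "I-ORG"]
)
print(precision, recall, f1)
```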

There are multiple evaluation schemes for evaluating NER models at the full named-entity level. This article will focus on the SemEval (International Workshop on Semantic Evaluation) scheme, which is one of the most popular. The SemEval scheme is built off the Message Understanding Conference (MUC) scheme.

Before we dive deeper into the SemEval scheme, let us take a brief look at the MUC scheme. The MUC scheme introduces 5 categories to reflect the correctness of the full named-entities, as listed in the table below (a small sketch illustrating these categories follows the table). For a further read-up on the MUC scheme, refer to this link.

| Category | Explanation |
| --- | --- |
| Correct (COR) | The output of a system and the golden annotation are the same |
| Incorrect (INC) | The output of a system and the golden annotation don't match |
| Partial (PAR) | The system and the golden annotation are somewhat "similar" but not the same |
| Missing (MIS) | A golden annotation is not captured by a system |
| Spurious (SPU) | The system produces a response which doesn't exist in the golden annotation |

Source
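To make these five categories concrete, here is a minimal sketch that assigns a MUC category to a single predicted entity given the golden annotations. The (start, end, type) span representation and the helper function are assumptions made for illustration, and the matching rules are deliberately simplified.

```python
def muc_category(pred, golds):
    """Assign a (simplified) MUC category to one predicted entity.

    Entities are (start, end, type) tuples over token indices; this
    representation is an assumption made for illustration only.
    """
    start, end, etype = pred
    for g_start, g_end, g_type in golds:
        if (start, end, etype) == (g_start, g_end, g_type):
            return "COR"  # boundary and type both match exactly
        if (start, end) == (g_start, g_end):
            return "INC"  # same boundary, wrong type
        if start < g_end and g_start < end:
            return "PAR"  # boundaries overlap but are not identical
    return "SPU"          # no golden annotation matches at all


# Golden annotation: "news agency" (tokens 1-2) is an Org entity.
golds = [(1, 3, "Org")]

print(muc_category((1, 3, "Org"), golds))  # COR
print(muc_category((1, 3, "Loc"), golds))  # INC
print(muc_category((1, 2, "Org"), golds))  # PAR
print(muc_category((4, 5, "Org"), golds))  # SPU
# A golden annotation with no overlapping prediction would be counted as MIS.
```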

In the SemEval scheme, there are 4 ways of measuring overall performance. They differ in what counts as a correct, incorrect, partial, missed or spurious prediction. Specifically, these 4 options assess the correctness of the full named-entities with respect to their boundaries and their entity classes. All 4 options can be applied to the computation of F1, precision and recall. A detailed explanation can be found in this blog, and a brief usage sketch of a library that implements this scheme follows the table below.

| Evaluation Scheme | Explanation |
| --- | --- |
| Strict | Exact boundary match over the surface string and exact entity type match |
| Exact | Exact boundary match over the surface string, regardless of the type |
| Partial | Partial boundary match over the surface string, regardless of the type |
| Type | The entity type must match, and some overlap between the system-tagged entity and the golden annotation is required |

Source
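These four measures are also implemented in open-source tooling. The snippet below is a rough usage sketch of the nervaluate Python package, which grew out of the blog referenced above and reports scores under the strict, exact, partial and ent_type measures; the spans are made up, and the exact call signature and return shape should be checked against the version you install.

```python
from nervaluate import Evaluator

# Golden and predicted entities for two toy documents, expressed as spans
# (label plus start/end token offsets). The spans are made up for illustration.
true = [
    [{"label": "ORG", "start": 1, "end": 2}],
    [{"label": "PER", "start": 0, "end": 1}],
]
pred = [
    [{"label": "ORG", "start": 1, "end": 3}],  # overlapping but inexact boundary
    [{"label": "LOC", "start": 0, "end": 1}],  # correct boundary, wrong type
]

evaluator = Evaluator(true, pred, tags=["ORG", "PER", "LOC"])

# The first element of evaluate()'s return value holds the overall scores;
# per-tag breakdowns follow (the exact tuple shape can vary by version).
results = evaluator.evaluate()[0]

# Overall precision/recall/F1 under each measure: "strict", "exact",
# "partial" and "ent_type".
print(results["strict"])
print(results["ent_type"])
```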

To illustrate the SemEval scheme with an example, consider the same example as before.

| Token | Gold Labels |
| --- | --- |
| The | Other |
| news | Org |
| agency | Org |
| is | Other |
| here | Other |

SemEval example

Next, we compute the total number of gold annotations with the formula:

\(Possible (POS) = COR + INC + PAR + MIS = TP + FN\)

The total number of annotations produced by the system is:

\(ACTUAL (ACT) = COR + INC + PAR + SPU = TP + FP\)

To compute the precision and recall for Strict and Exact:

\(Precision = \frac{COR}{ACT} = \frac{TP}{TP + FP}\)

\(Recall = \frac{COR}{POS} = \frac{TP}{TP + FN}\)

To compute the precision and recall for Partial and Type:

\(Precision = \frac{COR + 0.5 \times PAR}{ACT}\)

\(Recall = \frac{COR + 0.5 \times PAR}{POS}\)
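A small helper makes the difference between the two pairs of measures explicit: the 0.5 partial credit is applied only under Partial and Type. The helper below is an illustrative sketch; the function name and the partial_credit flag are assumptions, not part of the SemEval definition.

```python
def precision_recall(cor, inc, par, mis, spu, partial_credit=False):
    """Precision and recall from the five MUC-style counts.

    partial_credit=True applies the Partial/Type rule, where a partial
    match earns half a correct prediction; partial_credit=False gives
    the Strict/Exact rule.
    """
    possible = cor + inc + par + mis   # POS: total golden annotations
    actual = cor + inc + par + spu     # ACT: total system annotations
    numerator = cor + 0.5 * par if partial_credit else cor
    precision = numerator / actual if actual else 0.0
    recall = numerator / possible if possible else 0.0
    return precision, recall
```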

To compute the F1, precision and recall for the example above:

| Measure | Partial | Type | Exact | Strict |
| --- | --- | --- | --- | --- |
| Correct | 2 | 2 | 2 | 1 |
| Incorrect | 0 | 2 | 2 | 3 |
| Partial | 2 | 0 | 0 | 0 |
| Missed | 1 | 1 | 1 | 1 |
| Spurious | 1 | 1 | 1 | 1 |
| Precision | 0.6 (a1) | 0.4 (b1) | 0.4 (c1) | 0.2 (d1) |
| Recall | 0.6 (a2) | 0.4 (b2) | 0.4 (c2) | 0.2 (d2) |
| F1 | 0.6 | 0.4 | 0.4 | 0.2 |

Reference for the calculations (F1 is the harmonic mean of precision and recall; since precision equals recall in each column here, F1 takes the same value):

a1) \(\frac{2 + 0.5 \times 2}{2 + 0 + 2 + 1} = \frac{3}{5}\)

a2) \(\frac{2 + 0.5 \times 2}{2 + 0 + 2 + 1} = \frac{3}{5}\)

b1) \(\frac{2 + 0.5 \times 0}{2 + 2 + 0 + 1} = \frac{2}{5}\)

b2) \(\frac{2 + 0.5 \times 0}{2 + 2 + 0 + 1} = \frac{2}{5}\)

c1) \(\frac{2}{2 + 2 + 0 + 1} = \frac{2}{5}\)

c2) \(\frac{2}{2 + 2 + 0 + 1} = \frac{2}{5}\)

d1) \(\frac{1}{1 + 3 + 0 + 1} = \frac{1}{5}\)

d2) \(\frac{1}{1 + 3 + 0 + 1} = \frac{1}{5}\)
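As a sanity check, the per-measure counts from the table above reproduce these precision, recall and F1 values when plugged into the formulas. The snippet mirrors the precision_recall helper sketched earlier; the dictionary layout is just for illustration.

```python
# (COR, INC, PAR, MIS, SPU, partial_credit) per measure, from the table above.
counts = {
    "Partial": (2, 0, 2, 1, 1, True),
    "Type": (2, 2, 0, 1, 1, True),
    "Exact": (2, 2, 0, 1, 1, False),
    "Strict": (1, 3, 0, 1, 1, False),
}

for measure, (cor, inc, par, mis, spu, partial_credit) in counts.items():
    possible = cor + inc + par + mis          # POS
    actual = cor + inc + par + spu            # ACT
    numerator = cor + 0.5 * par if partial_credit else cor
    precision = numerator / actual
    recall = numerator / possible
    f1 = 2 * precision * recall / (precision + recall)
    print(f"{measure}: precision={precision:.1f}, recall={recall:.1f}, f1={f1:.1f}")
    # Expected: Partial 0.6, Type 0.4, Exact 0.4, Strict 0.2 for all three scores
```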

In general, Partial will be the most lenient evaluation, while Strict will be the most stringent. The scores for Type and Exact will usually fall between those of Partial and Strict.

If getting both the boundaries and the classes of the full named-entities right is important, then Strict is the measure you should prioritise. If it is only critical to get the boundaries correct, then Exact is the more appropriate measure to optimise for. Likewise, if it is only critical to get the classes correct and partial boundary overlaps are acceptable, then Type is the more appropriate measure. Lastly, if only partial boundary overlaps are required and it is not important to get the classes correct, then Partial can be the preferred measure. Generally, Type and Exact are the more commonly preferred measures. You should check the client's requirements before deciding which to prioritise.

References#