Custom Named Entity Recognition (Part 3)
Label Your Data
Properly labeling or
tagging your data is a crucial step in developing a custom entity extraction
model. Labels serve to indicate instances of particular entities within the
text that are used for training the model. Three things to focus on are:
- Consistency -
Label your data the same way across all files for training. Consistency allows
your model to learn without any conflicting inputs.
- Precision -
Label your entities precisely, without unnecessary extra words. Precision
ensures only the correct data is included in your extracted entity.
- Completeness -
Label your data completely, and don't miss any entities. Completeness helps
your model always recognize the entities present.
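Some of these checks can be automated before training. The snippet below is a minimal sketch, not part of Language Studio: it assumes each label is a hypothetical (offset, length, category) tuple, and the `check_labels` helper and sample documents are made up for illustration. It flags spans that carry stray punctuation or whitespace (precision) and phrases labeled with more than one category (consistency). Completeness still needs a human review, since a script cannot know which entities were missed.

```python
from collections import defaultdict

EDGE_CHARS = " .,;:"  # characters that should not sit at the edge of a label

def check_labels(docs):
    """docs: list of (name, text, [(offset, length, category), ...])."""
    problems = []
    categories_by_phrase = defaultdict(set)
    for name, text, labels in docs:
        for offset, length, category in labels:
            span = text[offset:offset + length]
            # Precision: the span should not include extra edge characters.
            if span != span.strip(EDGE_CHARS):
                problems.append(f"{name}: imprecise span {span!r}")
            categories_by_phrase[span.strip(EDGE_CHARS).lower()].add(category)
    # Consistency: the same phrase should always get the same category.
    for phrase, categories in categories_by_phrase.items():
        if len(categories) > 1:
            problems.append(
                f"inconsistent categories for {phrase!r}: {sorted(categories)}"
            )
    return problems

docs = [
    ("a.txt", "Contoso Ltd shipped the order.", [(0, 11, "Organization")]),
    ("b.txt", "Payment from Contoso Ltd. arrived.", [(13, 12, "Vendor")]),
]
problems = check_labels(docs)
for problem in problems:
    print(problem)
```

Here the second document's label includes the trailing period and uses a different category for the same company, so both issues are reported.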
How To Label Your Data?
Language Studio offers a
straightforward approach for annotating your data. It enables you to view the
file, mark the start and end of your entities, and specify their type.
Each label you create
is saved in an auto-generated JSON file, stored in your storage account
alongside your dataset. The model uses this file to learn how to identify
custom entities. You can also supply this file yourself when setting
up your project (for instance, if you are importing labels from a different
project), but it must conform to the accepted custom NER data format.
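To make "mark the start and end of your entities, and specify their type" concrete, here is a rough sketch of how a label can be expressed as an offset, a length, and a category. The `make_label` helper, the sample sentence, and the field names in `document_entry` are illustrative assumptions and a simplification; consult the official custom NER data format for the exact schema the service expects.

```python
import json

def make_label(text: str, phrase: str, category: str) -> dict:
    """Build an offset/length label for the first occurrence of phrase."""
    offset = text.find(phrase)
    if offset == -1:
        raise ValueError(f"{phrase!r} not found in text")
    return {"category": category, "offset": offset, "length": len(phrase)}

text = "Invoice 1042 was issued to Contoso Ltd on 3 May 2023."
labels = [
    make_label(text, "Contoso Ltd", "Organization"),
    make_label(text, "3 May 2023", "Date"),
]

# Simplified, hypothetical entry for one document in a labels file.
document_entry = {"location": "invoice1.txt", "language": "en", "labels": labels}
print(json.dumps(document_entry, indent=2))
```

Note that Python string indices count code points, which may differ from the index type a service uses when the text contains characters outside the Basic Multilingual Plane.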
Train and Evaluate Your Model
Training and assessing
your model is a cyclical process that involves incorporating more data and
labels into your training dataset to enhance the model's accuracy. To identify
which data and labels require refinement, Language Studio offers scoring on the
View model details page located in the left-hand pane.
Individual entities and
your overall model score are broken down into three metrics to explain how
they're performing and where they need to improve:
- Precision -
The ratio of successful entity recognitions to all attempted recognitions. A
high score means that as long as the entity is recognized, it's labeled
correctly.
- Recall -
The ratio of successful entity recognitions to the actual number of entities in
the document. A high score means it finds the entity or entities well,
regardless of whether it assigns them the right label.
- F1 Score -
The harmonic mean of precision and recall, providing a single scoring metric.
Scores are available both
per entity and for the model as a whole. You may find an entity scores well,
but the whole model doesn't.
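The three metrics can be worked through by hand from counts of true positives, false positives, and false negatives. The sketch below uses the standard formulas; the counts are invented for illustration, and the overall score is computed by pooling the counts (micro-averaging), which is an assumption and may not match how Language Studio aggregates its model-level score.

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, f1) from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical (tp, fp, fn) counts per entity type.
counts = {
    "Organization": (8, 2, 2),
    "Date": (3, 1, 6),
}

for entity, (tp, fp, fn) in counts.items():
    p, r, f1 = prf(tp, fp, fn)
    print(f"{entity:<13} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")

# Pool the counts across entities for a model-level (micro-averaged) score.
pooled = tuple(sum(column) for column in zip(*counts.values()))
overall = prf(*pooled)
print(f"{'Overall':<13} precision={overall[0]:.2f} "
      f"recall={overall[1]:.2f} f1={overall[2]:.2f}")
```

In this made-up example the Organization entity scores 0.80 on all three metrics, while the low recall of the Date entity pulls the pooled recall down, which shows how one entity can score well while the whole model does not.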
How To Interpret Metrics?
Our model should ideally
perform well in both precision and recall, which indicates that entity
recognition is effective. A low score for both metrics indicates that the model
is having trouble identifying items in the document and, when it does, it is not
confidently assigning the right label.
If precision is low but
recall is high, it means that the model recognizes the entity well but doesn't
label it as the correct entity type.
If precision is high but
recall is low, it means that the model doesn't always recognize the entity, but
when the model extracts the entity, the correct label is applied.
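The four cases above can be summarized as a small decision helper. This is only a sketch: the `interpret` function and the 0.7 cut-off between "high" and "low" are arbitrary assumptions chosen for illustration, not thresholds defined by the service.

```python
def interpret(precision: float, recall: float, threshold: float = 0.7) -> str:
    """Map a precision/recall pair to the reading described above."""
    high_precision = precision >= threshold
    high_recall = recall >= threshold
    if high_precision and high_recall:
        return "entity recognition is effective"
    if high_recall:
        return "entities are found but often given the wrong type"
    if high_precision:
        return "entities are sometimes missed, but found ones are typed correctly"
    return "entities are hard to find and often mislabeled"

for scores in [(0.9, 0.85), (0.4, 0.9), (0.9, 0.4), (0.3, 0.2)]:
    print(scores, "->", interpret(*scores))
```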
Confusion Matrix
The confusion matrix is
located on a separate tab at the top of the same View model details page.
This view gives a comprehensive picture of the model and highlights its
shortcomings by providing a visual table of all the entities and how each
one was predicted.
The confusion matrix
allows you to visually identify where to add data to improve your model's
performance.
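To see what such a table contains, a confusion matrix can be rebuilt from pairs of actual and predicted entity types. The pairs below are invented for illustration, and this is only a sketch of the idea rather than how Language Studio builds its view.

```python
from collections import Counter

# Hypothetical (actual, predicted) entity types from an evaluation run.
pairs = (
    [("Organization", "Organization")] * 3
    + [("Organization", "Person")] * 2
    + [("Person", "Person")] * 4
    + [("Person", "Organization")] * 1
    + [("Date", "Date")] * 3
)

matrix = Counter(pairs)
types = sorted({entity_type for pair in pairs for entity_type in pair})

# Rows are actual types, columns are predicted types.
print(f"{'actual/predicted':<18}" + "".join(f"{t:>14}" for t in types))
for actual in types:
    row = "".join(f"{matrix[(actual, predicted)]:>14}" for predicted in types)
    print(f"{actual:<18}{row}")

# The largest off-diagonal cell shows where more labeled data would help most.
confusions = {pair: n for pair, n in matrix.items() if pair[0] != pair[1]}
worst = max(confusions, key=confusions.get)
print(f"Most frequent confusion: {worst[0]} predicted as {worst[1]}")
```

In this sample, Organization entities are most often mistaken for Person, so adding more labeled examples that distinguish the two would be the first thing to try.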
Conclusion
We have successfully
learnt about labelling data as well as training and evaluating our model.