Custom Named Entity Recognition (Part 3)

 




Label Your Data

Properly labeling (or tagging) your data is a crucial step in developing a custom entity extraction model. Labels mark the instances of each entity in the text, and the model is trained on those marked instances. Three things to focus on are:

  • Consistency - Label your data the same way across all files for training. Consistency allows your model to learn without any conflicting inputs.

  • Precision - Label your entities precisely, without unnecessary extra words. Precision ensures only the correct data is included in your extracted entity.

  • Completeness - Label your data completely, and don't miss any entities. Completeness helps your model always recognize the entities present.
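
The consistency and precision points above can be checked mechanically before training. The following is a minimal sketch, assuming your labels are available as (text, entity type) pairs; the function names and the sample labels are hypothetical and not part of Language Studio.

```python
from collections import defaultdict


def find_inconsistent_labels(labels):
    """Return texts that were tagged with more than one entity type."""
    types_by_text = defaultdict(set)
    for text, entity_type in labels:
        # Normalize case and surrounding whitespace before comparing.
        types_by_text[text.strip().lower()].add(entity_type)
    return {
        text: sorted(types)
        for text, types in types_by_text.items()
        if len(types) > 1
    }


def find_imprecise_labels(labels):
    """Return labels whose text starts or ends with whitespace or punctuation."""
    return [
        (text, entity_type)
        for text, entity_type in labels
        if text != text.strip(" \t\n.,;:")
    ]


# Hypothetical sample labels, for illustration only.
sample_labels = [
    ("Contoso Ltd", "Organization"),
    ("contoso ltd", "Person"),   # same text, conflicting type
    ("Seattle, ", "City"),       # trailing comma and space
]

print(find_inconsistent_labels(sample_labels))
print(find_imprecise_labels(sample_labels))
```

A check like this only catches mechanical problems. Completeness (missed entities) still needs a manual review of the documents.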

How To Label Your Data?

Language Studio offers a straightforward approach for annotating your data. It enables you to view the file, mark the start and end of your entities, and specify their type.

Each label you create is saved in an auto-generated JSON file, stored in your storage account alongside your dataset. The model uses this file to learn how to identify your custom entities. You can also supply this file yourself when setting up a project (for instance, if you are bringing in labels from a different project), but it must conform to the accepted custom NER data formats.
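
To make the idea of "marking the start and end of an entity" concrete, here is a small sketch that records a label as a category plus a character offset and length. The field names (`location`, `labels`, `category`, `offset`, `length`) and the file name are illustrative assumptions, not the exact schema of the generated file; consult the custom NER data format documentation before producing a file for import.

```python
import json


def make_label(document_text, entity_text, category):
    """Build a label record from the first occurrence of entity_text."""
    offset = document_text.find(entity_text)
    if offset == -1:
        raise ValueError(f"{entity_text!r} not found in document")
    # Offsets here count Python characters; a service may count differently
    # (for example in UTF-16 code units), so treat this as an assumption.
    return {"category": category, "offset": offset, "length": len(entity_text)}


# Hypothetical document and entity types, for illustration only.
text = "Loan agreement signed by Jane Doe on 2023-01-15."
record = {
    "location": "loan1.txt",
    "labels": [
        make_label(text, "Jane Doe", "BorrowerName"),
        make_label(text, "2023-01-15", "Date"),
    ],
}

print(json.dumps(record, indent=2))
```

Slicing the document with each label's offset and length should return exactly the entity text, which is a quick way to confirm a label is precise.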

Train and Evaluate Your Model

Training and assessing your model is a cyclical process that involves incorporating more data and labels into your training dataset to enhance the model's accuracy. To identify which data and labels require refinement, Language Studio offers scoring on the View model details page located in the left-hand pane.

Individual entities and your overall model score are broken down into three metrics to explain how they're performing and where they need to improve:

  • Precision - The ratio of successful entity recognitions to all attempted recognitions. A high score means that as long as the entity is recognized, it's labeled correctly.

  • Recall - The ratio of successful entity recognitions to the actual number of entities in the document. A high score means it finds the entity or entities well, regardless of whether it assigns them the right label.

  • F1 Score - The harmonic mean of precision and recall, providing a single scoring metric.
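
The three metrics above can be computed from counts of true positives, false positives, and false negatives. This is a generic sketch of the standard formulas, not Language Studio's own implementation, and the counts in the example are made up.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall, and F1 from raw counts."""
    predicted = true_positives + false_positives   # all attempted recognitions
    actual = true_positives + false_negatives      # all entities really present
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts: 8 correct, 2 wrong predictions, 8 missed entities.
precision, recall, f1 = precision_recall_f1(8, 2, 8)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because F1 is a harmonic mean, it stays low when either precision or recall is low, which is why it works as a single summary score.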

Scores are available both per entity and for the model as a whole. You may find an entity scores well, but the whole model doesn't.

How To Interpret Metrics?

Our model should ideally perform well in both precision and recall, which indicates that entity recognition is effective. A low score for both metrics indicates that the model is having trouble identifying items in the document and, when it does, it is not confidently assigning the right label.

If precision is low but recall is high, it means that the model recognizes the entity well but doesn't label it as the correct entity type.

If precision is high but recall is low, it means that the model doesn't always recognize the entity, but when the model extracts the entity, the correct label is applied.
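
The four cases above can be summarized in a small helper. This is only a sketch: the 0.7 cutoff between "low" and "high" is an arbitrary assumption chosen for illustration, and the function name is hypothetical.

```python
def interpret_scores(precision, recall, threshold=0.7):
    """Map a precision/recall pair to the interpretation described above."""
    high_precision = precision >= threshold
    high_recall = recall >= threshold
    if high_precision and high_recall:
        return "Entity recognition is effective."
    if high_recall:
        # Low precision, high recall.
        return "Entities are found but often given the wrong type."
    if high_precision:
        # High precision, low recall.
        return "Entities are often missed, but extracted ones are labeled correctly."
    return "Entities are hard to find and often mislabeled when found."


print(interpret_scores(0.9, 0.4))
```

Adjust the threshold to whatever level of quality your scenario requires.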

Confusion Matrix

The confusion matrix is located on a separate tab at the top of the same View model details page. This view gives a comprehensive picture of the model and highlights its shortcomings by providing a visual table of all the entities and how each one performed.

The confusion matrix allows you to visually identify where to add data to improve your model's performance.
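
For intuition about what the confusion matrix contains, the sketch below builds one from (actual, predicted) entity type pairs. The entity types and pairs are hypothetical, and the layout (rows for actual types, columns for predicted types) is an assumption that may differ from how Language Studio draws it.

```python
from collections import Counter


def build_confusion_matrix(pairs):
    """Return (labels, matrix); rows are actual types, columns are predicted."""
    counts = Counter(pairs)
    labels = sorted({label for pair in pairs for label in pair})
    matrix = [[counts[(actual, predicted)] for predicted in labels]
              for actual in labels]
    return labels, matrix


# Hypothetical (actual, predicted) pairs, for illustration only.
pairs = [
    ("Person", "Person"),
    ("Person", "Organization"),   # a Person mistaken for an Organization
    ("Organization", "Organization"),
    ("City", "City"),
    ("Person", "Person"),
]

labels, matrix = build_confusion_matrix(pairs)
for label, row in zip(labels, matrix):
    print(f"{label:<14}", row)
```

Counts on the diagonal are correct predictions. Any non-zero count off the diagonal shows which two entity types are being confused, which suggests where additional labeled examples would help.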

Conclusion

In this part, we covered how to label data and how to train and evaluate a custom NER model.







