Custom Named Entity Recognition (Part 2)
Azure AI Language Project Life Cycle
Creating an entity extraction model typically follows a similar path to most
Azure AI Language service features:
- Define entities:
Understand your data and the entities you want to identify, and define them as
clearly as possible. For example, decide exactly which parts of a bank
statement you want to extract.
- Tag data:
Label, or tag, your existing data, marking which text in your dataset relates
to which entity. It is vital to do this precisely and thoroughly, as any
incorrect or missed labels will reduce the effectiveness of the trained model.
A good variety of possible input documents is useful. For example, label bank
name, customer name, customer address, specific loan or account terms, loan or
account amount, and account number.
- Train model:
Train your model once your entities are labeled. Training teaches your model
how to recognize the entities you label.
- View model:
After your model is trained, view its results. The results page includes a
score from 0 to 1 that is based on the model's precision and recall on the test
data. Here, you can see which entities worked well (such as customer name) and
which entities need improvement (such as account number).
- Improve model:
Improve your model by seeing which entities failed to be identified, and which
entities were incorrectly extracted. Find out what data needs to be added to
your model's training to improve performance. This page shows you how entities
failed, and which entities (such as account number) need to be differentiated
from other similar entities (such as loan amount).
- Deploy model:
Once your model performs as desired, deploy it to make it available via the
API. For example, once the model is deployed you can send requests to it to
extract bank statement entities.
- Extract entities:
Use your model for extracting entities.
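To make the "View model" and "Improve model" steps more concrete, here is a
minimal sketch of how per-entity precision, recall, and F1 can be computed from
labeled and predicted spans. It is an illustration only, not the service's own
evaluation code: the function name per_entity_scores, the tuple layout
(document id, start, end, entity type), and the sample data are all assumptions
made for this example.

```python
from collections import Counter


def per_entity_scores(gold, predicted):
    """Return precision, recall, and F1 per entity type.

    gold and predicted are sets of (doc_id, start, end, entity_type) tuples.
    A prediction counts as correct only if the span and the type both match.
    """
    tp = Counter(span[3] for span in gold & predicted)   # correct extractions
    fp = Counter(span[3] for span in predicted - gold)   # wrongly extracted
    fn = Counter(span[3] for span in gold - predicted)   # missed labels

    scores = {}
    for entity in set(tp) | set(fp) | set(fn):
        p_den = tp[entity] + fp[entity]
        r_den = tp[entity] + fn[entity]
        precision = tp[entity] / p_den if p_den else 0.0
        recall = tp[entity] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        scores[entity] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Hypothetical test data: labels made by a person vs. model output.
gold = {
    ("doc1", 0, 10, "CustomerName"),
    ("doc1", 20, 30, "AccountNumber"),
    ("doc2", 5, 15, "AccountNumber"),
    ("doc2", 40, 48, "LoanAmount"),
}
predicted = {
    ("doc1", 0, 10, "CustomerName"),
    ("doc1", 20, 30, "AccountNumber"),
    ("doc2", 40, 48, "AccountNumber"),  # a loan amount mistaken for an account number
}

scores = per_entity_scores(gold, predicted)
for entity in sorted(scores):
    print(entity, scores[entity])
```

In this made-up data, customer name is extracted perfectly, while account
number is both missed once and confused with loan amount once, which is the
kind of pattern that tells you where more labeled examples are needed.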
Considerations for Data Selection and Refining Entities
For optimal performance, you'll need both high-quality data to train the model
and clearly defined entity types.
High-quality data lets you spend less time tweaking and get better outcomes
from your model.
- Distribution -
use an appropriate distribution of document types. Training on a more diverse
dataset helps your model avoid learning incorrect relationships in the data.
- Accuracy -
use data that is as close to real-world data as possible. Fake data works to
start the training process, but it will likely differ from real data in ways
that can cause your model to extract entities incorrectly.
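As a rough way to check the distribution point above, the following sketch
counts how often each document type appears in a training set and flags types
that fall below a chosen share. The document type names, the helper names, and
the 15% threshold are all assumptions for illustration; what counts as an
"appropriate" distribution depends on your own data.

```python
from collections import Counter


def type_shares(doc_types):
    """Return the fraction of the dataset that each document type makes up."""
    counts = Counter(doc_types)
    total = sum(counts.values())
    return {doc_type: n / total for doc_type, n in counts.items()}


def underrepresented(doc_types, min_share=0.1):
    """Return document types whose share is below min_share, sorted by name."""
    shares = type_shares(doc_types)
    return sorted(t for t, share in shares.items() if share < min_share)


# Hypothetical training set: one document type label per training document.
training_doc_types = (
    ["bank_statement"] * 8
    + ["loan_agreement"] * 1
    + ["account_opening_form"] * 1
)

print(type_shares(training_doc_types))
print(underrepresented(training_doc_types, min_share=0.15))
```

Here bank statements dominate, so the other two types would be flagged as
candidates for collecting more examples.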
Additionally, entities must be well thought out and as clearly specified as
possible. Avoid ambiguous entities (such as two names next to each other on a
bank statement), as they make it difficult for the model to differentiate
between them. If some ambiguous entities are unavoidable, make sure your model
has more examples to learn from so it can distinguish between them.
Keeping your entities distinct will also go a long way in aiding your model's
performance. For example, trying to extract something like "Contact info" that
could be a phone number, social media handle, or email address would require
many examples of each form to teach your model accurately. Instead, break it
down into more specific entities such as "Phone", "Email", and "Social media"
and let the model classify whichever sort of contact information it finds.
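If you already have data labeled with a broad entity such as "Contact info",
a small helper can suggest a more specific label for each value before a
person reviews it. The sketch below is only an assumed pre-labeling aid: the
function name refine_contact_label and the regular expressions are
deliberately simple and hypothetical, and they are not part of Azure AI
Language.

```python
import re

# Deliberately simple, illustrative patterns; real data needs stricter rules.
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
HANDLE = re.compile(r"^@\w+$")
PHONE = re.compile(r"^\+?[\d\s().-]{7,}$")


def refine_contact_label(text):
    """Suggest a specific entity label for a value labeled "Contact info".

    Falls back to the broad label when no pattern matches, so a person can
    decide how to label it.
    """
    value = text.strip()
    if EMAIL.match(value):
        return "Email"
    if HANDLE.match(value):
        return "Social media"
    if PHONE.match(value):
        return "Phone"
    return "Contact info"


for sample in ["jane@example.com", "@jane_doe", "+1 (555) 010-9999", "call me"]:
    print(sample, "->", refine_contact_label(sample))
```

The suggestions should still be checked by hand, since mislabeled training
data reduces the effectiveness of the trained model.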
Conclusion
We have learnt about the Azure AI Language project life cycle, as well as
considerations for data selection and refining entities.