Understand How To Build Text Classification Projects
Introduction
Custom text classification projects are your workspace for building, training,
improving, and deploying your classification model. You can work on a project
in two ways: through Language Studio or through the REST API. The lab uses
Language Studio, which provides a graphical interface, but the REST API offers
the same functionality. The steps for creating your model are the same
whichever approach you prefer.
Azure AI Language Project Life Cycle
- Define Labels:
Once you understand the data you want to classify, identify the possible labels
you want to categorize it into.
- Tag Data:
Tag, or label, your existing data, specifying the label or labels each file
falls under. Labeling data is important since it's how your model will learn
how to classify future files. Best practice is to have clear differences
between labels to avoid ambiguity, and provide good examples of each label for
the model to learn from.
- Train Model:
Train your model with the labeled data.
- View Model:
After your model is trained, view the results of the model. Your model is
scored between 0 and 1, based on the precision and recall of the data tested.
Take note of which labels didn't perform well.
- Improve Model:
Improve your model by checking which files were assigned the wrong label,
reviewing your label distribution, and finding out what data to add to improve
performance. Try to find more examples of each label to add to your dataset
before retraining your model.
- Deploy Model:
Once your model performs as desired, deploy your model to make it available via
the API. Your model might be named "GameGenres", and once deployed
can be used to classify game summaries.
- Classify Text:
Use your model for classifying text.
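The View Model and Improve Model steps can be illustrated with a small sketch of how a per-label score might be derived from precision and recall. It assumes the F1 score (the harmonic mean of precision and recall) as the way the two are combined. The function name `per_label_scores` and the sample genres are hypothetical and are not part of the Azure API.

```python
from collections import Counter

def per_label_scores(true_labels, predicted_labels):
    """Return {label: (precision, recall, f1)} for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # this label was predicted, but it was wrong
            fn[truth] += 1  # the correct label was missed
    scores = {}
    for label in set(true_labels) | set(predicted_labels):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores

# Hypothetical test-set results for a game-genre classifier.
truth = ["Action", "Action", "Puzzle", "Puzzle", "Strategy", "Strategy"]
preds = ["Action", "Action", "Puzzle", "Strategy", "Strategy", "Action"]

scores = per_label_scores(truth, preds)
# The label with the lowest F1 is the first candidate for more examples.
weakest = min(scores, key=lambda label: scores[label][2])
print(weakest, scores[weakest])
```

Here the weakest label is the one to prioritize when gathering additional examples for retraining.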
How To Split Datasets For Training
When labeling your data, you can specify which dataset each file belongs to:
- Training - The training dataset teaches your model which data should be
classified under which label: the machine learning algorithm is fed the data
and labels from this dataset. It is the larger of the two datasets, roughly 80%
of your labeled data.
- Testing - After your model has been trained, it is verified against the
labeled testing dataset. To assess the model's performance, Azure feeds the
data from the testing dataset into the model and compares the results with the
labels you assigned. The outcome of that comparison determines your model's
score and shows you how to improve its predictive performance.
During the Train Model step, there are two options for how your data is split:
- Automatic Split - Azure takes all of your data, randomly splits it into the specified percentages, and uses the resulting sets to train and test the model. This option is best when you have a larger dataset, when the data is naturally more consistent, or when the distribution of your data extensively covers your classes.
- Manual Split - Manually specify which files should be in each dataset. When you submit the training job, the Azure AI Language service will tell you the split of the dataset and the distribution. This split is best used with smaller datasets to ensure the correct distribution of classes and variation in data are present to correctly train your model.
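To make the automatic option concrete, here is a minimal sketch of a random split, assuming an 80/20 ratio. It is not how Azure implements the split internally; the function `automatic_split`, its parameters, and the file names are hypothetical.

```python
import random

def automatic_split(files, train_fraction=0.8, seed=0):
    """Shuffle the files and return (training, testing) lists."""
    shuffled = list(files)                  # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)   # seeded for a repeatable split
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical file names standing in for labeled game summaries.
files = [f"summary_{i}.txt" for i in range(10)]
training, testing = automatic_split(files)
print(len(training), len(testing))
```

Because the split is random, a small dataset can end up with a label missing from one of the two sets, which is why the manual option suits smaller datasets.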
To use the automatic
split, put all files into the training dataset when labeling your data (this
option is the default). To use the manual split, specify which files should be
in testing versus training during the labeling of your data.
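When deciding a manual split, it can help to plan the assignment per label before tagging files. The sketch below is one hypothetical way to do that: it reserves a fixed number of files per label for testing while always leaving at least one for training, then counts the resulting distribution. The function `manual_split` and the sample files are assumptions for illustration, not part of Language Studio or the REST API.

```python
from collections import Counter, defaultdict

def manual_split(labeled_files, test_per_label=1):
    """Map each filename to 'Training' or 'Testing', label by label.

    labeled_files: list of (filename, label) pairs.
    """
    by_label = defaultdict(list)
    for filename, label in labeled_files:
        by_label[label].append(filename)

    assignment = {}
    for names in by_label.values():
        # Reserve the last files of each label for testing,
        # but always keep at least one file for training.
        n_test = min(test_per_label, len(names) - 1)
        split_at = len(names) - n_test
        for name in names[:split_at]:
            assignment[name] = "Training"
        for name in names[split_at:]:
            assignment[name] = "Testing"
    return assignment

# Hypothetical labeled files.
labeled = [
    ("a1.txt", "Action"), ("a2.txt", "Action"), ("a3.txt", "Action"),
    ("p1.txt", "Puzzle"), ("p2.txt", "Puzzle"),
    ("s1.txt", "Strategy"),
]
assignment = manual_split(labeled)
# Count files per (label, dataset) pair to review the distribution.
distribution = Counter((label, assignment[name]) for name, label in labeled)
print(distribution)
```

A label that has no testing files in the resulting distribution, such as the single Strategy file here, is a sign that more examples of that label are needed.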
Conclusion
We have now covered the Azure AI Language project life cycle and how to split
datasets for training.