Understanding End-to-End Machine Learning Process (Part 4 of 5)

By Ashwin Venugopal - February 01, 2023

To read part 1, please click here

To read part 2, please click here

To read part 3, please click here

To read part 5, please click here

Defining Labels & Engineering Features

Next, we create and transform features, a step typically referred to as feature engineering, and create labels where they are missing.

Labeling

Labeling is also known as annotation. Although it is the least exciting part of an ML project, it is the most important one in the whole process. Labeling data requires deep insight into the context of the dataset as well as the prediction process. Proper labels greatly improve prediction performance, and the act of labeling also helps you study the dataset in depth. Mislabeling introduces label noise that can degrade every downstream step in the ML pipeline, so it should be avoided.

Techniques and tooling are also available to make labeling faster, because an ML algorithm can be used not only for the project's target task but also to learn how to label the data. Such models can start proposing labels while you annotate the dataset manually.
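As a rough illustration of a model proposing labels during manual annotation, here is a minimal sketch. It assumes a nearest-centroid rule, which is a stand-in chosen for brevity and not a technique named in this article. The function names, the `min_margin` threshold, and the sample data are all hypothetical.

```python
import math


def centroid(points):
    """Return the mean of a list of equal-length feature vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def propose_labels(labeled, unlabeled, min_margin=1.0):
    """Suggest a label for each unlabeled item using nearest centroids.

    labeled:    list of (features, label) pairs annotated by hand so far.
    unlabeled:  list of feature vectors still waiting for a label.
    min_margin: hypothetical threshold; if the two closest centroids are
                nearly equally far away, no label is proposed (None) and
                the item is left for a human annotator.
    """
    by_label = {}
    for features, label in labeled:
        by_label.setdefault(label, []).append(features)
    centroids = {label: centroid(pts) for label, pts in by_label.items()}

    proposals = []
    for features in unlabeled:
        ranked = sorted(
            (math.dist(features, c), label) for label, c in centroids.items()
        )
        best_dist, best_label = ranked[0]
        margin = ranked[1][0] - best_dist if len(ranked) > 1 else float("inf")
        proposals.append((best_label if margin >= min_margin else None, margin))
    return proposals


# Invented sample data: four hand-labeled points and three unlabeled ones.
labeled = [
    ([1.0, 1.0], "cat"), ([1.2, 0.8], "cat"),
    ([5.0, 5.0], "dog"), ([4.8, 5.2], "dog"),
]
unlabeled = [[1.1, 0.9], [3.0, 3.0], [5.1, 4.9]]
proposals = propose_labels(labeled, unlabeled)
```

The point halfway between the two groups gets no proposal, which keeps ambiguous items in the hands of the annotator and limits the label noise described above.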

Feature Engineering

Now, we can transform existing features or add new ones with the help of the knowledge gathered in the previous steps. Typically, we have to perform one or more of the following actions:

Feature creation- Create new features from a given set of features or from additional information sources.
Feature transformation- Transform single features to make them useful and stable for the ML algorithm being used.
Feature extraction- Create derived features from the original data.
Feature selection- Choose the most prominent and predictive features.
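The four actions above can be sketched on a tiny dataset. Everything in this example is an assumption made for illustration: the records, the column names, and the helper functions are invented, and variance thresholding is only one simple selection criterion (it drops constant features but does not measure how predictive a feature is).

```python
import math
import statistics
from datetime import date

# Hypothetical raw records; the column names are invented for this sketch.
rows = [
    {"price": 120.0, "area_m2": 60.0, "sold_on": "2023-01-14", "floors": 1},
    {"price": 450.0, "area_m2": 150.0, "sold_on": "2023-06-03", "floors": 1},
    {"price": 90.0, "area_m2": 30.0, "sold_on": "2022-11-27", "floors": 1},
]


def engineer(row):
    """Build the feature dictionary for one raw record."""
    sold = date.fromisoformat(row["sold_on"])
    return {
        # Feature creation: combine two existing features into a new one.
        "price_per_m2": row["price"] / row["area_m2"],
        # Feature transformation: log-compress a skewed positive value.
        "log_price": math.log(row["price"]),
        # Feature extraction: derive a simpler feature from the raw date.
        "sold_month": sold.month,
        # Passed through unchanged; constant in this sample.
        "floors": row["floors"],
    }


def select_by_variance(records, min_variance=1e-9):
    """Feature selection: keep only features whose values actually vary."""
    names = list(records[0].keys())
    return [
        name for name in names
        if statistics.pvariance([r[name] for r in records]) > min_variance
    ]


features = [engineer(r) for r in rows]
kept = select_by_variance(features)
```

In this sample `floors` is identical for every record, so the selection step drops it while the three derived features are kept.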

Training Models

The following steps define the process of training an ML model:

Define Your ML Task- First, we have to define the ML task we are facing, which is generally determined by the business decision behind the use case. Depending on the amount of labeled data available, you also choose between supervised and unsupervised learning methods, or one of the other categories.
Pick a Suitable Model- This might be logistic regression, a gradient-boosted tree ensemble, or a deep neural network (DNN), to name a few popular choices. The choice depends on the training infrastructure as well as the shape and type of the data.
Pick or Implement a Loss Function & an Optimizer- While experimenting with your data, you may already have found a strategy for testing model performance and picked a data split, loss function, and optimizer. If not, you have to decide at this point what you want to measure and optimize.
Pick a Dataset Split- Splitting your data into different sets (like training, validation, and test sets) offers you extra insight into the performance of your training and optimization process and also allows you to avoid overfitting your model to your training data.
Train a Simple Model Using Cross-Validation- Once all the preceding choices are made, you can train your ML model using cross-validation on a training and validation set, without leaking training data into validation. Next, you have to interpret the error metric of the validation runs.
Tune the Model- Finally, you can tune the model by adjusting its hyperparameters, use model stacking or other advanced methods, or go back to the initial data and work on it before training the model again.
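The steps above can be sketched end to end in a few lines. This is a minimal example under stated assumptions, not a recommended setup: the task is a two-class classification on synthetic data, the model is a k-nearest-neighbours classifier (chosen because it needs no optimizer), the metric is a plain error rate instead of a trained loss, and tuning is a small search over the hyperparameter `k`. All function names and values are hypothetical.

```python
import math
import random
from collections import Counter


def knn_predict(train, point, k):
    """Classify `point` by majority vote among its k nearest training examples."""
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def error_rate(train, evaluation, k):
    """Fraction of misclassified evaluation examples (the metric to minimize)."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in evaluation)
    return wrong / len(evaluation)


def cross_validate(data, k, folds=5):
    """Average validation error over `folds` disjoint train/validation splits."""
    errors = []
    for i in range(folds):
        # Every example lands in exactly one validation fold, and the
        # training part of each fold excludes it, so nothing leaks.
        validation = data[i::folds]
        training = [ex for j, ex in enumerate(data) if j % folds != i]
        errors.append(error_rate(training, validation, k))
    return sum(errors) / folds


# Synthetic, invented data: two Gaussian clusters with labels "a" and "b".
rng = random.Random(0)
data = [
    ([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label)
    for center, label in ((0.0, "a"), (4.0, "b"))
    for _ in range(50)
]
rng.shuffle(data)

# Dataset split: hold out a test set that is never used for tuning.
split = int(0.8 * len(data))
train_set, test_set = data[:split], data[split:]

# Tune the model: pick the k with the lowest cross-validated error.
cv_errors = {k: cross_validate(train_set, k) for k in (1, 3, 5, 7)}
best_k = min(cv_errors, key=cv_errors.get)

# Final check on the held-out test set with the chosen hyperparameter.
test_error = error_rate(train_set, test_set, best_k)
```

Keeping the test set out of the tuning loop means `test_error` is an estimate on data the chosen `k` has never seen, which is the guard against overfitting that the dataset split is meant to provide.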
