Understanding End-to-End Machine Learning Process (Part 3 of 5)
Excavating Data & Sources
- In-house data sources- If the project is run in or with a company, look internally first. Internal data is advantageous because it is free of cost, often standardized, and it is usually easier to find someone who knows the data and how to obtain it. However, it can be hard to find exactly what you are looking for, as internal data is often poorly documented and its quality can be questionable due to bias in the data.
- Open data sources- You can also use freely available datasets. They are typically large (terabytes of data), can cover long time periods, and are generally well structured and documented. However, some data fields may be hard to understand, quality can vary due to bias in the data, and you are often required to publish your results. A short sketch of pulling such a dataset follows this list.
- Data seller (data as a service, or DaaS)- Finally, you can buy data from a data seller, either by choosing an existing dataset or by commissioning a new one. This can save time and give you easy access to an individualized, preprocessed dataset. However, it is expensive, you still have to do the work of making the data useful for your specific problem, and it may raise questions regarding privacy and ethics.
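To illustrate the open-data route, here is a minimal sketch that pulls a public dataset from OpenML via scikit-learn. The dataset name "titanic" is only a placeholder for whatever public dataset fits your project, and scikit-learn is assumed to be installed:

```python
# Minimal sketch: fetching a public dataset from OpenML via scikit-learn.
# The dataset name "titanic" is just a placeholder example.
from sklearn.datasets import fetch_openml

titanic = fetch_openml("titanic", version=1, as_frame=True)
df = titanic.frame          # pandas DataFrame with features and target
print(df.shape)             # quick sanity check on size
print(df.head())            # inspect the first few samples
```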
Preparing & Cleaning Data
Storing & Preparing Data
At this point you have layered data in your storage: raw, cleaned, labeled, and processed versions of the dataset.
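One simple way to keep these layers separate is to give each stage its own folder and persist intermediate results. This is only a sketch: the directory names, file names, and CSV format are assumptions you would adapt to your own storage setup.

```python
# Sketch of a layered storage convention: each processing stage gets its own folder.
# Paths and file names here are hypothetical; adapt them to your storage system.
from pathlib import Path
import pandas as pd

stages = ["raw", "cleaned", "labeled", "processed"]
root = Path("data")
for stage in stages:
    (root / stage).mkdir(parents=True, exist_ok=True)

df_raw = pd.read_csv(root / "raw" / "samples.csv")    # raw layer as delivered
df_clean = df_raw.drop_duplicates()                   # first cleaning step (see below)
df_clean.to_csv(root / "cleaned" / "samples.csv", index=False)
```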
Cleaning Data
In this step we look for inconsistencies and structural errors in the data itself. The following points show what to look out for (a short pandas sketch illustrating these checks follows the list):
- Duplicates- You might find duplicate samples caused by mistakes in copying data or by combining different data sources. Duplicates are easy to delete, but first make sure they are not two genuinely different samples that merely look identical.
- Irrelevant information- Datasets usually contain features that are unnecessary for your project. You can remove the obvious ones right away; others can be dropped later, after analyzing the data more carefully.
- Structural errors- These are problems in the values themselves, such as different entries with the same meaning (for example, US and United States) or simple typos. They can be standardized or cleaned up by listing all the distinct values of a feature.
- Anomalies (outliers)- These are unlikely values for which you have to decide whether they are genuine or erroneous, which you can only do once you know the distribution of the feature.
- Missing values- As the name suggests, these are cells in your data that are either blank or hold some generic placeholder value. They can be handled in several ways besides deleting the entire sample, and you can also postpone the decision until further analysis gives you better ideas for replacing them.
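The sketch below walks through these checks with pandas. The column names ("internal_id", "country", "price"), the file path, and the replacement rules are hypothetical examples, not a prescription:

```python
# Hedged sketch of the cleaning checklist above; column names are hypothetical.
import pandas as pd

df = pd.read_csv("data/raw/samples.csv")

# Duplicates: inspect exact copies first to be sure they are not genuinely
# different samples that merely look identical, then drop them.
print(df[df.duplicated(keep=False)])
df = df.drop_duplicates()

# Irrelevant information: drop features that are obviously not needed.
df = df.drop(columns=["internal_id"], errors="ignore")

# Structural errors: list the distinct values of a feature and standardize them.
print(df["country"].value_counts())
df["country"] = df["country"].replace({"US": "United States", "U.S.": "United States"})

# Anomalies (outliers): flag values far outside the bulk of the distribution.
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)]
print(f"{len(outliers)} potential outliers to review")

# Missing values: count them, then decide whether to drop or impute.
print(df.isna().sum())
df["price"] = df["price"].fillna(df["price"].median())
```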
Now you can further analyze the cleaned version of your dataset.
Analyzing Data
In this step, we calculate statistical properties for each feature, visualize them, find correlated features, and measure feature importance, that is, the impact of each feature on the label (also known as the target variable).
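The following sketch shows one common way to get these numbers, using a random forest as a stand-in for model-based feature importance. The "target" column, the file path, and the numeric-only selection are assumptions made for illustration:

```python
# Sketch: summary statistics, correlations with the label, and a model-based
# feature-importance estimate. Column names and paths are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("data/cleaned/samples.csv")
features = df.drop(columns=["target"]).select_dtypes("number")
target = df["target"]

print(features.describe())            # per-feature statistical properties
print(features.corrwith(target))      # correlation of each feature with the label

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(features.fillna(features.median()), target)
importance = pd.Series(model.feature_importances_, index=features.columns)
print(importance.sort_values(ascending=False))
```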
You can also apply techniques such as dimensionality reduction, which maps a complex high-dimensional sample to a two- or three-dimensional vector representation, making it much easier to spot similarities between samples.
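For example, a PCA projection to two dimensions (one of several dimensionality-reduction techniques you could pick; the file path and numeric-only selection are again placeholders) looks like this:

```python
# Sketch: project high-dimensional samples to 2-D with PCA and plot them
# for visual inspection of groups of similar samples.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features = pd.read_csv("data/cleaned/samples.csv").select_dtypes("number").dropna()
scaled = StandardScaler().fit_transform(features)   # put features on a common scale

coords = PCA(n_components=2).fit_transform(scaled)
plt.scatter(coords[:, 0], coords[:, 1], s=10)
plt.xlabel("component 1")
plt.ylabel("component 2")
plt.title("2-D PCA projection of the samples")
plt.show()
```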