Understanding End-to-End Machine Learning Process (Part 3 of 5)
Excavating Data & Sources
- In-house data sources- If the project is run in or with a company, look internally first. Internal data is advantageous because it is free of cost, often standardized, and it is usually easier to find someone who knows the data and how to obtain it. However, it can be hard to find exactly what you are looking for, as internal data is often poorly documented and its quality can be questionable due to bias in the data.
- Open data sources- You can also use freely available datasets. They are typically large (terabytes of data), can cover long time periods, and are generally well structured and documented. However, some data fields may be hard to understand, quality can vary due to bias in the data, and you are often required to publish your results. A short sketch of pulling such a dataset follows this list.
- Data seller (data as a service, or DaaS)- Finally, you can buy data from a data seller, either by choosing an existing dataset or by commissioning a new one. This can save time and give you easy access to an individualized, preprocessed dataset. However, it is expensive, you still have to do the work of making the data useful for your specific problem, and it may raise questions regarding privacy and ethics.
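To illustrate the open-data route, here is a minimal sketch that pulls a public dataset from OpenML via scikit-learn. The dataset name "titanic" is only a placeholder for whatever public dataset fits your project, and scikit-learn is assumed to be installed:

```python
# Minimal sketch: fetching a public dataset from OpenML via scikit-learn.
# The dataset name "titanic" is just a placeholder example.
from sklearn.datasets import fetch_openml

titanic = fetch_openml("titanic", version=1, as_frame=True)
df = titanic.frame          # pandas DataFrame with features and target
print(df.shape)             # quick sanity check on size
print(df.head())            # inspect the first few samples
```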
Preparing & Cleaning Data
Storing & Preparing Data
At this point you have layered data in your storage: raw, cleaned, labeled, and processed versions of the dataset.
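One simple way to keep these layers separate is to give each stage its own folder and persist intermediate results. This is only a sketch: the directory names, file names, and CSV format are assumptions you would adapt to your own storage setup.

```python
# Sketch of a layered storage convention: each processing stage gets its own folder.
# Paths and file names here are hypothetical; adapt them to your storage system.
from pathlib import Path
import pandas as pd

stages = ["raw", "cleaned", "labeled", "processed"]
root = Path("data")
for stage in stages:
    (root / stage).mkdir(parents=True, exist_ok=True)

df_raw = pd.read_csv(root / "raw" / "samples.csv")    # raw layer as delivered
df_clean = df_raw.drop_duplicates()                   # first cleaning step (see below)
df_clean.to_csv(root / "cleaned" / "samples.csv", index=False)
```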
Cleaning Data
In this step we look for inconsistencies and structural errors in the data itself. The following points show what to look out for (a short pandas sketch illustrating these checks follows the list):
- Duplicates- You might find duplicate samples caused by mistakes in copying data or by combining different data sources. Duplicates are easy to delete, but first make sure they are not two genuinely different samples that merely look identical.
- Irrelevant information- Datasets usually contain features that are unnecessary for your project. You can remove the obvious ones right away; others can be dropped later, after analyzing the data more carefully.
- Structural errors- These are problems in the values themselves, such as different entries with the same meaning (for example, US and United States) or simple typos. They can be standardized or cleaned up by listing all the distinct values of a feature.
- Anomalies (outliers)- These are unlikely values for which you have to decide whether they are genuine or erroneous, which you can only do once you know the distribution of the feature.
- Missing values- As the name suggests, these are cells in your data that are either blank or hold some generic placeholder value. They can be handled in several ways besides deleting the entire sample, and you can also postpone the decision until further analysis gives you better ideas for replacing them.
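The sketch below walks through these checks with pandas. The column names ("internal_id", "country", "price"), the file path, and the replacement rules are hypothetical examples, not a prescription:

```python
# Hedged sketch of the cleaning checklist above; column names are hypothetical.
import pandas as pd

df = pd.read_csv("data/raw/samples.csv")

# Duplicates: inspect exact copies first to be sure they are not genuinely
# different samples that merely look identical, then drop them.
print(df[df.duplicated(keep=False)])
df = df.drop_duplicates()

# Irrelevant information: drop features that are obviously not needed.
df = df.drop(columns=["internal_id"], errors="ignore")

# Structural errors: list the distinct values of a feature and standardize them.
print(df["country"].value_counts())
df["country"] = df["country"].replace({"US": "United States", "U.S.": "United States"})

# Anomalies (outliers): flag values far outside the bulk of the distribution.
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)]
print(f"{len(outliers)} potential outliers to review")

# Missing values: count them, then decide whether to drop or impute.
print(df.isna().sum())
df["price"] = df["price"].fillna(df["price"].median())
```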
Now you can further analyze the cleaned version of your dataset.
Analyzing Data
In this step, we calculate statistical properties for each feature, visualize them, find correlated features, and measure feature importance, that is, the impact of each feature on the label (also known as the target variable).
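The following sketch shows one common way to get these numbers, using a random forest as a stand-in for model-based feature importance. The "target" column, the file path, and the numeric-only selection are assumptions made for illustration:

```python
# Sketch: summary statistics, correlations with the label, and a model-based
# feature-importance estimate. Column names and paths are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("data/cleaned/samples.csv")
features = df.drop(columns=["target"]).select_dtypes("number")
target = df["target"]

print(features.describe())            # per-feature statistical properties
print(features.corrwith(target))      # correlation of each feature with the label

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(features.fillna(features.median()), target)
importance = pd.Series(model.feature_importances_, index=features.columns)
print(importance.sort_values(ascending=False))
```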
You can also apply techniques such as dimensionality reduction, which maps a complex high-dimensional sample to a two- or three-dimensional vector representation, making it much easier to spot similarities between samples.
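For example, a PCA projection to two dimensions (one of several dimensionality-reduction techniques you could pick; the file path and numeric-only selection are again placeholders) looks like this:

```python
# Sketch: project high-dimensional samples to 2-D with PCA and plot them
# for visual inspection of groups of similar samples.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features = pd.read_csv("data/cleaned/samples.csv").select_dtypes("number").dropna()
scaled = StandardScaler().fit_transform(features)   # put features on a common scale

coords = PCA(n_components=2).fit_transform(scaled)
plt.scatter(coords[:, 0], coords[:, 1], s=10)
plt.xlabel("component 1")
plt.ylabel("component 2")
plt.title("2-D PCA projection of the samples")
plt.show()
```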