Performing Data Analysis & Visualization (Part 2 of 3)

 






To read part 1, please click here
To read part 3, please click here









Exploring & Analyzing Tabular Datasets

Tabular datasets let us use the full spectrum of mathematical and statistical functions to analyze as well as transform our dataset. However, we generally don't have the time or resources to run every dataset through all the possible techniques in our arsenal. Hence, in order to get a good understanding of a dataset, we can start by checking the following aspects of every feature and target vector in the dataset (a short sketch follows the list):
  • Data Type- Lets us know whether the content of the vector is continuous, ordinal, nominal, or a text string, whether it is stored in the correct programmatic data type, or whether it requires a data type conversion.

  • Missing Data- Are there any missing values? How do we handle them?

  • Inconsistent Data- Are dates and times stored in different ways? Are the same categories written in different ways? Are there different categories with the same meaning in the given context?

  • Unique Values- How many unique values exist for a categorical feature, and what are they? Should we create a subset of them?

  • Statistical Properties- What are the mean, median, and variance of a feature? Are there any outliers? What are the minimum and maximum values? What is the most common value?

  • Statistical Distribution- How are the values distributed? Is there a data skew? Would normalization or scaling be useful?

  • Correlation- How are the different features correlated with each other? Are there features containing similar information that could be omitted? How strongly are the features correlated with the target?
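
As a quick illustration, here is a minimal sketch of how these checks might look in a notebook using pandas; the file name is a placeholder and the DataFrame df stands in for your own dataset:

```python
import pandas as pd

# Load the dataset into a DataFrame (hypothetical file name)
df = pd.read_csv("dataset.csv")

# Data types: are the columns stored in the expected programmatic types?
print(df.dtypes)

# Missing data: number of null values per column
print(df.isna().sum())

# Unique values: cardinality of each column
print(df.nunique())

# Statistical properties: mean, quartiles, min/max, most common value
print(df.describe(include="all"))

# Statistical distribution: skewness of the numeric columns
print(df.skew(numeric_only=True))

# Correlation: pairwise correlation between the numeric features (and the target)
print(df.corr(numeric_only=True))
```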

Since it's important to understand the features for your modeling, we can do that by checking the relationship between the features and the target variable in some of the following ways (a sketch follows the list):
  1. Regression coefficient- Used in regression.
  2. Feature importance- Used in classification.
  3. High error rates for categorical values- Used in binary classification.
These steps will allow you to understand the data and gain thorough knowledge of the preprocessing tasks needed for your data, features, and target variables.
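
As a hedged illustration of the feature-importance approach, the sketch below fits a random forest classifier on a synthetic stand-in dataset and prints the resulting importances; for a regression problem, the analogous check would be the coefficients of a fitted linear model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real dataset, used only for illustration
X, y = make_classification(n_samples=500, n_features=6, n_informative=3, random_state=0)

# Tree-based feature importance for a classification target
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
for index, importance in enumerate(clf.feature_importances_):
    print(f"feature_{index}: {importance:.3f}")

# For a continuous target, LinearRegression().fit(X, y).coef_ would play the same role
```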

After successfully uploading the data to a storage service in Azure, we can bring up a notebook environment and start exploring the data.
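
One possible way to do this, assuming the file was uploaded to the default datastore of an Azure Machine Learning workspace (the path is a placeholder), is to load it through the Azure ML SDK and convert it into a pandas DataFrame:

```python
from azureml.core import Workspace, Dataset

# Connect to the workspace (assumes a config.json for the workspace is available locally)
ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Hypothetical path to the uploaded file inside the default datastore
dataset = Dataset.Tabular.from_delimited_files(path=(datastore, "data/my_dataset.csv"))

# Convert the tabular dataset into a pandas DataFrame for exploration
df = dataset.to_pandas_dataframe()
print(df.head())
```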

Handling Missing Values & Outliers

One of the first things to do with a new dataset is to look for missing values, as this gives a deeper understanding of the data and of the actions to be taken. You should also check for outliers and special values in your data, especially the ones given below (a sketch follows the list):
  1. The null values (look for Null, "Null", " ", NaN, etc.)
  2. The minimum and maximum values.
  3. The most common value (MODE)
  4. The zero value
  5. Any unique values
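
A minimal pandas sketch of these checks could look like the following; the file and column names are placeholders, and strings such as "Null" or blanks are treated as hidden missing values:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("dataset.csv")  # hypothetical file name

# 1. Null values: also treat placeholder strings such as "Null" or blanks as missing
df = df.replace(["Null", "null", "", " "], np.nan)
print(df.isna().sum())

# 2. Minimum and maximum values of the numeric columns
print(df.min(numeric_only=True))
print(df.max(numeric_only=True))

# 3. The most common value (mode) of every column
print(df.mode().iloc[0])

# 4. How often the zero value occurs in the numeric columns
print((df.select_dtypes("number") == 0).sum())

# 5. Unique values of a hypothetical categorical column
print(df["category_column"].unique())
```
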
After identifying these values, various preprocessing techniques can be used to impute missing values as well as to normalize or exclude dimensions. The following are typical options for dealing with missing values (a sketch of some of them follows the list):
  • Deletion- Deletes whole rows or columns of the dataset, which can lead to bias or insufficient data for training.

  • New Category- Adds a category called Missing for categorical features.

  • Column Average- Fills in the mean, median, or mode value of the entire data column, or of a subset of the column based on relationships with other features.

  • Interpolation- Fills in an interpolated value according to the column's data.

  • Hot-deck imputation- Fills in the logically previous value from the sorted records of the data column (useful in time-series datasets).
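
The sketch below illustrates a few of these options with pandas on a small example frame; the column names and values are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Small example frame with hypothetical columns and a few missing values
df = pd.DataFrame({
    "timestamp": pd.date_range("2021-01-01", periods=5, freq="D"),
    "color": ["red", None, "blue", "red", None],
    "age": [25, np.nan, 40, 31, np.nan],
    "temperature": [20.5, np.nan, 22.0, np.nan, 23.5],
    "sensor_value": [1.0, np.nan, 1.2, np.nan, 1.4],
})

# Deletion: drop all rows that contain at least one missing value
df_dropped = df.dropna()

# New category: mark missing entries of a categorical feature explicitly
df["color"] = df["color"].fillna("Missing")

# Column average: fill numeric gaps with the column mean (median or mode work the same way)
df["age"] = df["age"].fillna(df["age"].mean())

# Interpolation: fill a value interpolated from the surrounding entries of the column
df["temperature"] = df["temperature"].interpolate()

# Hot-deck style fill for sorted (time-series) data: carry the previous value forward
df = df.sort_values("timestamp")
df["sensor_value"] = df["sensor_value"].ffill()
print(df)
```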

Some of the options for dealing with outliers are (a capping sketch follows the list):
  • Erroneous observations- If the value is clearly wrong, you can either drop the record or replace the outlier with the mean of the column.

  • Leave as-is- If the value carries important information, you can leave it as-is, as long as the model doesn't get distorted by it.

  • Cap or floor- You can cap or floor the value to a maximum deviation from the mean.
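
As an illustration of the cap-or-floor option, the sketch below clips a numeric column to three standard deviations around its mean; the column name, the data, and the factor of three are assumptions made for the example:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical numeric column with a few extreme values appended
df = pd.DataFrame({"income": np.append(rng.normal(50_000, 10_000, 100), [500_000, -200_000])})

# Cap or floor the value at +/- 3 standard deviations from the column mean
mean, std = df["income"].mean(), df["income"].std()
df["income"] = df["income"].clip(lower=mean - 3 * std, upper=mean + 3 * std)
print(df["income"].describe())
```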

However, it is more useful to statistically analyze the column distribution and correlations in order to handle the missing values and outliers.  












To read part 1, please click here
To read part 3, please click here