Performing Data Analysis on a Tabular Dataset

Initial Exploration & Cleansing of the Melbourne Housing Dataset

Here, we will look at the contents of a dataset loaded from a datastore registered in Azure Machine Learning, and start with some basic cleaning of the raw data:
  • Install the required packages via Python's pip, either individually or using the requirements file found in the accompanying GitHub repository: pandas, seaborn, plotly, scikit-learn, numpy, missingno, umap-learn, and statsmodels.

  • You can either create a new Jupyter notebook or follow along in the notebook mentioned earlier.

  • Now, connect to your ML workspace via the configuration file, as shown in the first sketch after this list.

  • After that, retrieve the data from your defined ML datastore, here called yourname, and load it into a tabular dataset object.

  • Since the methods offered by the tabular dataset object are limited, we convert it into a pandas DataFrame, which gives us a first look at our data.

  • Now we can look at the shape of the dataset, which tells us how many rows and columns it contains: raw_df.shape.

  • Next, we inspect the unique values, the number of missing values, and the data type of each feature (second sketch after this list).

  • Now we can start removing some of the less important features. We keep the original DataFrame, raw_df, untouched and create a new one called df (third sketch after this list). This lets us add the removed features back at any time, and since every row in a DataFrame has an index, we can still match filtered rows against their original values.

  • After that, we can rename some columns to make them easier to understand.

  • Look for duplicates using keep=False, which flags every row that has a duplicate, so we can inspect each pair before removing one of them. When dropping duplicates, the inplace argument lets us overwrite the current DataFrame directly.

  • Check the categorical features that seem to have missing categories and concentrate on the one with many affected entries. This call shows the list of unique values in the column: df['CouncilArea'].unique().

  • Finally, we can check whether we can get rid of either the suburb or the postcode column by building a DataFrame that counts how many suburbs are assigned to each postcode, and vice versa (fourth sketch after this list). Looking for the postcodes that have been mapped to multiple suburbs gives us the respective list.
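
With the list complete, here is the first sketch: connecting to the workspace and loading the data. It assumes the azureml-core SDK, a config.json downloaded from the Azure portal into the working directory, a datastore registered as yourname, and a hypothetical file name melb_data.csv:

    from azureml.core import Workspace, Datastore, Dataset

    # Connect to the workspace described by config.json in the current directory
    ws = Workspace.from_config()

    # Reference the registered datastore and point a tabular dataset at the CSV
    # ('melb_data.csv' is a placeholder for your actual file path)
    datastore = Datastore.get(ws, 'yourname')
    dataset = Dataset.Tabular.from_delimited_files(path=(datastore, 'melb_data.csv'))

    # The TabularDataset API is limited, so convert to a pandas DataFrame
    raw_df = dataset.to_pandas_dataframe()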
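
The second sketch covers the basic exploration: the shape of the DataFrame and a per-feature summary of unique values, missing values, and data types, using plain pandas:

    import pandas as pd

    # (number of rows, number of columns)
    print(raw_df.shape)

    # One row per feature: distinct values, missing entries, and dtype
    summary = pd.DataFrame({
        'unique': raw_df.nunique(),
        'missing': raw_df.isnull().sum(),
        'dtype': raw_df.dtypes,
    })
    print(summary)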
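
The third sketch shows the pruning, renaming, and de-duplication steps. The dropped columns and the rename mapping are illustrative choices, not the definitive ones from this walkthrough:

    # Keep raw_df untouched and work on a copy called df; the shared row
    # index lets us match pruned rows back to the original at any time
    df = raw_df.drop(columns=['Address', 'Method', 'SellerG'])  # example columns

    # Rename columns for readability (example mapping)
    df = df.rename(columns={'Lattitude': 'Latitude', 'Longtitude': 'Longitude'})

    # keep=False marks every member of a duplicate group, not just the
    # extra copies, so we can inspect the pairs before removing them
    print(df[df.duplicated(keep=False)])

    # inplace=True overwrites df directly instead of returning a new DataFrame
    df.drop_duplicates(inplace=True)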
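
The fourth sketch covers the categorical checks, assuming the dataset's CouncilArea, Suburb, and Postcode columns:

    # Distinct categories in CouncilArea, including NaN for missing entries
    print(df['CouncilArea'].unique())

    # Count how many suburbs share each postcode, and vice versa
    suburbs_per_postcode = df.groupby('Postcode')['Suburb'].nunique()
    postcodes_per_suburb = df.groupby('Suburb')['Postcode'].nunique()

    # Entries greater than 1 show where the two columns diverge,
    # i.e. postcodes that have been mapped to multiple suburbs
    print(suburbs_per_postcode[suburbs_per_postcode > 1])
    print(postcodes_per_suburb[postcodes_per_suburb > 1])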

So far, we have done some basic exploration and initial pruning of our dataset.