Ingesting Data Into Azure (Part 2 of 2)

 







To read part 1, please click here






Understanding Tooling for Automated Ingestion & Transformation of Data

Several Azure services can automatically move and transform data, and they also integrate easily with pipelines and MLOps workflows in Azure Machine Learning.

Azure Data Factory

Azure Data Factory is an enterprise-ready solution for moving and transforming data in Azure. It connects to hundreds of different sources and lets you build pipelines that transform the integrated data, calling multiple other Azure services along the way. Its main building blocks are pipelines, datasets, data flows, and Power Query (a minimal Python SDK sketch follows this list):

  • Pipelines- The main attraction of Azure Data Factory. Complex pipelines can orchestrate multiple services to pull data from a source, transform it, and store it in a sink.

  • Datasets- Used in pipelines as a source or a sink. Before building a pipeline, you define a dataset that points to the specific data in a datastore that you want to read from or write to.

  • Data Flows- They let you do the actual processing or transformation of data within Data Factory itself, instead of calling a different service to do the heavy lifting.

  • Power Query- It lets you do interactive data exploration with the Power Query M language inside Data Factory, something that is otherwise generally only possible in Power BI or Excel.
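
To make the pipeline concept more concrete, here is a minimal sketch using the azure-mgmt-datafactory Python SDK to publish a single copy activity. The subscription, resource group, factory, and dataset names are all placeholders, and it assumes both blob datasets have already been defined in the factory:

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, CopyActivity, DatasetReference, BlobSource, BlobSink
)

# Placeholder subscription ID; authentication via DefaultAzureCredential
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# A single copy activity: read from one blob dataset, write to another
copy_step = CopyActivity(
    name="CopyRawToCurated",
    inputs=[DatasetReference(reference_name="RawBlobDataset", type="DatasetReference")],
    outputs=[DatasetReference(reference_name="CuratedBlobDataset", type="DatasetReference")],
    source=BlobSource(),
    sink=BlobSink(),
)

# Publish the pipeline to the factory; it can then be triggered or scheduled
adf_client.pipelines.create_or_update(
    "my-resource-group", "my-data-factory", "CopyMelbourneHousing",
    PipelineResource(activities=[copy_step]),
)

Once published, a pipeline like this can be run on demand or on a schedule from Data Factory.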

Azure Synapse Spark Pools

We already know that Azure Databricks and Azure Synapse allow you to run Spark jobs in a Spark pool. Because Apache Spark can transform and preprocess extremely large datasets thanks to the distributed nature of the node pool underneath, it is very helpful for slicing and filtering datasets even before the actual machine learning process starts.
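
For example, a short PySpark sketch like the following could run in a Synapse or Databricks notebook to trim the Melbourne Housing snapshot before training; the abfss path is an assumption about how the pool reaches the mlfiles container, so adjust it to your setup:

from pyspark.sql import functions as F

# 'spark' is the session that Synapse and Databricks notebooks provide by default.
# The path below is an assumed ADLS Gen2 (abfss) endpoint for the mlfiles container.
raw_path = "abfss://mlfiles@mldemoblob8765.dfs.core.windows.net/melb_data.csv"

df = spark.read.csv(raw_path, header=True, inferSchema=True)

# Drop rows without a sale price and keep only a few columns for modelling
curated = (
    df.dropna(subset=["Price"])
      .filter(F.col("Price") > 0)
      .select("Suburb", "Rooms", "Landsize", "Price")
)

curated.write.mode("overwrite").parquet(
    "abfss://mlfiles@mldemoblob8765.dfs.core.windows.net/curated/melb_data"
)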

We can also run notebooks from either Azure Data Factory or the integration engine in Azure Synapse, so we get access to these services automatically. In addition, we can add a Synapse Spark pool as a so-called linked service in the Azure Machine Learning workspace, which lets us address both ML compute targets and the Spark pool as compute targets through the Azure Machine Learning SDK; this gives us another good option for building a clean end-to-end MLOps workflow.
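
A rough sketch of that last option, following the Azure Machine Learning SDK v1 pattern for linked services, might look like this; the resource group, Synapse workspace, and pool names are placeholders:

from azureml.core import Workspace, LinkedService, SynapseWorkspaceLinkedServiceConfiguration
from azureml.core.compute import ComputeTarget, SynapseCompute

ws = Workspace.from_config()

# Register the Synapse workspace as a linked service (placeholder names)
link_config = SynapseWorkspaceLinkedServiceConfiguration(
    subscription_id=ws.subscription_id,
    resource_group="my-resource-group",
    name="my-synapse-workspace",
)
linked_service = LinkedService.register(ws, name="synapselink", linked_service_config=link_config)

# Attach one of the Synapse Spark pools as a compute target for ML pipelines
attach_config = SynapseCompute.attach_configuration(
    linked_service=linked_service,
    type="SynapseSpark",
    pool_name="sparkpool01",
)
spark_target = ComputeTarget.attach(ws, "synapse-spark", attach_config)
spark_target.wait_for_completion()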

Copying Data to Blob Storage

We will follow the steps given below, working with the Melbourne Housing dataset created by Anthony Pino and available at https://www.kaggle.com/anthonypino/melbourne-housing-market, to make the file available in our mldemoblob datastore:
  • Firstly, download the melb_data.csv file from https://www.kaggle.com/dansbecker/melbourne-housing-snapshot and store it in a suitable folder on your device.

  • Now, navigate to that folder and run the following command in the Azure CLI, replacing the storage account name with your own (a Python alternative is sketched after these steps):

az storage blob upload \
--account-name mldemoblob8765 \
--file ./melb_data.csv \
--container-name mlfiles \
--name melb_data.csv
  • In order to verify the upload, install Azure Storage Explorer and sign in to your Azure account in that application; then navigate to your storage account and open the mlfiles container. You will see your file where it should be. Alternatively, you can simply drag and drop the file into the container here, which creates the blob automatically.

  • Finally, explore the application itself. For example, when you right-click the container, you can choose Get Shared Access Signature, which opens a wizard that lets you create a SAS token directly in Storage Explorer instead of using the command line.
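
If you would rather script the upload than use the CLI, a minimal sketch with the azure-storage-blob package could look like this; the connection string is a placeholder and the mlfiles container is assumed to exist already:

from azure.storage.blob import BlobClient

# Placeholder connection string; the mlfiles container must already exist
blob = BlobClient.from_connection_string(
    conn_str="<storage-account-connection-string>",
    container_name="mlfiles",
    blob_name="melb_data.csv",
)

# Upload the local file, overwriting any existing blob with the same name
with open("./melb_data.csv", "rb") as data:
    blob.upload_blob(data, overwrite=True)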

Now, our raw dataset file is available in our storage account and therefore in our ML datastore.







To read part 1, please click here















