Ingesting Data & Managing Datasets (Part 1 of 2)

By Ashwin Venugopal - February 21, 2023

To read part 2, please click here

Choosing Data Storage Solutions for Azure Machine Learning

If you want to train an ML model on remote compute targets such as VMs, you have to ensure that every executable can access the training data efficiently. This matters even more when several people need to access the data in parallel for experimentation, labeling, and training from multiple environments and machines. To achieve this, we have to manage the data efficiently for the different types of usage in Azure.

Organizing Data in Azure Machine Learning

In Azure Machine Learning, data is managed as datasets and data storage as datastores.

A datastore is an abstraction of a physical data storage system, used to link an existing storage system to an Azure Machine Learning workspace. To connect existing storage to the workspace, you create a datastore and provide the connection and authentication details; after that, users can access the storage through the datastore object. This makes it easy to give access to data storage to the developers, data engineers, and scientists who collaborate in an Azure Machine Learning workspace. At present, the following services can be connected as datastores to a workspace:

Azure Blob containers
Azure file share
Azure Data Lake
Azure Data Lake Gen2
Azure SQL Database
Azure Database for PostgreSQL
Databricks File System
Azure Database for MySQL
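
As a rough sketch of what connecting existing storage can look like, the snippet below registers a Blob container as a datastore. It assumes the Azure ML Python SDK v1 (the azureml-core package) and a workspace config.json on disk. The environment variable names and the fallback datastore name training_data are hypothetical, chosen only for illustration.

```python
import os


def blob_datastore_settings(env=os.environ):
    """Collect connection and authentication details for a Blob container.

    The AML_* variable names are hypothetical; use whatever secret
    management your team already has in place.
    """
    return {
        "datastore_name": env.get("AML_DATASTORE_NAME", "training_data"),
        "container_name": env["AML_BLOB_CONTAINER"],
        "account_name": env["AML_STORAGE_ACCOUNT"],
        "account_key": env["AML_STORAGE_KEY"],
    }


def register_blob_datastore(settings):
    """Register the container described by `settings` in the workspace."""
    # Imported here so the helper above works without the SDK installed.
    from azureml.core import Datastore, Workspace

    ws = Workspace.from_config()  # assumes a config.json for your workspace
    return Datastore.register_azure_blob_container(workspace=ws, **settings)
```

Keeping the credentials in a separate helper means the account key never has to appear in the script itself; once registered, collaborators only need the datastore name.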

A dataset is an abstraction of data in general. Azure Machine Learning supports two types of datasets:

Tabular Datasets, which are used to define tabular data, for example from comma- or other delimiter-separated files, from Parquet and JSON files, or from SQL queries. They can also be defined directly from publicly available URLs (called a Direct Dataset), much as popular libraries such as pandas and requests fetch data from a URL.

File Datasets, which are used to define any binary data from files and folders, such as images, audio, and video data.

Both types can be registered in your workspace (these are then called registered datasets), after which they are available in Azure ML Studio under Datasets.
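
To make the two dataset types and the registration step more concrete, here is a minimal sketch, again assuming the Azure ML Python SDK v1 (azureml-core). The suffix-to-factory mapping and the function names are my own illustration, not a rule from the SDK; the datastore name and paths you pass in are up to you.

```python
import os

# Illustrative mapping from file suffix to a Dataset.Tabular factory method.
# Note that the SDK's JSON support is for JSON Lines files.
TABULAR_FACTORIES = {
    ".csv": "from_delimited_files",
    ".parquet": "from_parquet_files",
    ".jsonl": "from_json_lines_files",
}


def tabular_factory_name(path):
    """Return the tabular factory for `path`, or None for a File Dataset."""
    suffix = os.path.splitext(path)[1].lower()
    return TABULAR_FACTORIES.get(suffix)


def create_and_register(datastore_name, relative_path, registered_name):
    """Define a dataset on a datastore path and register it in the workspace."""
    # Imported here so the helper above works without the SDK installed.
    from azureml.core import Dataset, Datastore, Workspace

    ws = Workspace.from_config()  # assumes a config.json for your workspace
    datastore = Datastore.get(ws, datastore_name)
    source = (datastore, relative_path)

    factory = tabular_factory_name(relative_path)
    if factory is not None:
        dataset = getattr(Dataset.Tabular, factory)(path=source)
    else:
        # Anything else (images, audio, video, ...) becomes a File Dataset.
        dataset = Dataset.File.from_files(path=source)

    return dataset.register(
        workspace=ws, name=registered_name, create_new_version=True
    )
```

A registered dataset can later be fetched by name with Dataset.get_by_name(ws, registered_name), so training scripts do not need to know where the data physically lives.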

Understanding the Default Storage Accounts of Azure Machine Learning

The default datastore is backed by an Azure Blob storage account that is created automatically with Azure Machine Learning when you set up the initial workspace. It is used internally to store snapshots, logs, figures, models, and more while experiment runs execute.

The default datastore in your workspace can be accessed and used just like a custom datastore by creating a datastore reference. After accessing the default datastore and connecting any custom datastores, we have to ensure efficient data storage for the various ML use cases.
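
Getting a reference to the default datastore might look like the sketch below, assuming the Azure ML Python SDK v1 (azureml-core). The helper names are hypothetical; the summary function only relies on the workspace exposing get_default_datastore(), so it can be tried out with a stand-in object as well.

```python
def load_workspace():
    """Load the workspace from a local config.json (assumed to exist)."""
    # Imported here so the rest of the module works without the SDK installed.
    from azureml.core import Workspace

    return Workspace.from_config()


def describe_default_datastore(workspace):
    """Return a small summary of the workspace's default datastore."""
    datastore = workspace.get_default_datastore()
    return {
        "name": datastore.name,
        "type": datastore.datastore_type,
        # Only Blob-backed datastores carry a container name.
        "container": getattr(datastore, "container_name", None),
    }
```

Calling describe_default_datastore(load_workspace()) returns the summary for your own workspace; the same datastore object can then be used wherever a custom datastore would be.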
