Ingesting Data & Managing Datasets (Part 2 of 2)

By Ashwin Venugopal - February 22, 2023

To read part 1, please click here

Exploring Options for Storing Training Data in Azure

Database systems can be classified according to the type of data and data access into the following types:

Relational Database Management Systems (RDBMSs): These are generally used to store normalized transactional data using B-tree-based ordered indices. Typical queries filter, group, and aggregate results by joining multiple rows from multiple tables. Azure supports several RDBMSs, such as Azure SQL Database, Azure Database for PostgreSQL, and Azure Database for MySQL.

NoSQL: These are key-value-based storage systems used to store denormalized data with hash-based or ordered indices. Typical queries access a single record from a collection that is distributed according to a partition key. Azure supports several NoSQL services, such as Azure Cosmos DB and Azure Table storage.

Both database technologies can be used to store data for machine learning, depending on your use case. While RDBMSs are good for storing training data, NoSQL systems are well suited to storing lookup data such as training labels, or ML results such as recommendations, predictions, or feature vectors.

Creating a Datastore & Ingesting Data

Creating Blob Storage & Connecting it with the Azure ML Workspace

Creating Blob Storage

First, open a terminal of your choice, log in to Azure, and check that you are working in the correct subscription.
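The login and subscription check can be sketched as a small shell function. This is only a sketch: the function name select_subscription and the subscription placeholder are assumptions made for illustration, and the commands are wrapped in a function so that nothing runs until you call it.

```shell
# Hypothetical helper: log in and make sure the right subscription is active.
select_subscription() {
  # $1 = subscription name or ID (placeholder, replace with your own)
  az login
  az account set --subscription "$1"
  # Print the active subscription so it can be checked by eye.
  az account show --output table
}

# Usage (opens an interactive login):
#   select_subscription "<your-subscription-id>"
```

If you are already logged in, running only az account show is enough to confirm which subscription is active.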

To create a storage account, we first explore the available options and required settings by running this command:

$ az storage account create -h

Now we create our storage account, choosing a globally unique name for it.
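As a rough sketch, creating the account might look like the following. Storage account names must be 3 to 24 characters long and contain only lowercase letters and digits, so the sketch checks that first. The function names, the resource group, the region, and the choice of Standard_LRS and StorageV2 are assumptions for illustration, not settings taken from the original walkthrough.

```shell
# Succeeds only if the name is 3-24 characters of lowercase letters and digits.
is_valid_storage_name() {
  printf '%s' "$1" | grep -Eq '^[a-z0-9]{3,24}$'
}

# Hypothetical helper: create a general-purpose v2 storage account.
create_storage_account() {
  # $1 = storage account name, $2 = resource group, $3 = Azure region
  is_valid_storage_name "$1" || {
    echo "invalid storage account name: $1" >&2
    return 1
  }
  az storage account create \
    --name "$1" \
    --resource-group "$2" \
    --location "$3" \
    --sku Standard_LRS \
    --kind StorageV2
}

# Usage (placeholder names):
#   create_storage_account mldemostorage123 my-resource-group westeurope
```

The local check only covers the naming format. Whether the name is actually free across Azure is decided by the service when the account is created.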

Finally, we create a container in our new blob storage.
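A minimal sketch of the container step follows. The function name and the account name in the usage line are placeholders. The sketch assumes --auth-mode key, in which case the CLI looks up the account key itself, so your identity needs permission to list keys on the storage account.

```shell
# Hypothetical helper: create a blob container in an existing storage account.
create_container() {
  # $1 = storage account name, $2 = container name
  az storage container create \
    --name "$2" \
    --account-name "$1" \
    --auth-mode key
}

# Usage (placeholder account name, container name as used later in this post):
#   create_container mldemostorage123 mlfiles
```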

With this part complete, we can connect the storage to our Azure ML workspace.

Creating a Datastore in Azure ML

First, we check what is required to create a datastore by running this command:

$ az ml datastore create -h

Its output tells us that the name of the resource group, the name of the ML workspace, and a YAML file are required.

For the YAML file, we go to https://docs.microsoft.com/en-us/azure/machine-learning/reference-yaml-datastore-blob, which gives the required schema of the file along with some examples. The most secure of these are the ones that limit access via a SAS token.

After that, we can either download the blobdatastore.yml file from the GitHub repository or create a file with the same name.
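If you create the file yourself, the sketch below writes one from the shell. It follows the SAS-token example in the schema reference linked above. It is not necessarily identical to the file in the GitHub repository, and the function name and all values passed to it are placeholders.

```shell
# Hypothetical helper: write a blob datastore definition that uses a SAS token.
write_blob_datastore_yaml() {
  # $1 = datastore name, $2 = storage account name, $3 = container name,
  # $4 = SAS token, $5 = output file (defaults to blobdatastore.yml)
  cat > "${5:-blobdatastore.yml}" <<EOF
\$schema: https://azuremlschemas.azureedge.net/latest/azureBlob.schema.json
name: $1
type: azure_blob
description: Blob datastore accessed with a SAS token
account_name: $2
container_name: $3
credentials:
  sas_token: "$4"
EOF
}

# Usage (placeholder values):
#   write_blob_datastore_yaml mldemoblob mldemostorage123 mlfiles "<sas-token>"
```

Because the file contains a credential, it is worth keeping it out of version control.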

Now we have to generate a SAS token with permissions to add, create, delete, list, read, and write in the mlfiles container. You can either choose an expiration date that is far enough in the future or, more commonly, select a much shorter expiry and rotate the key accordingly.
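One way to generate such a token from the CLI is sketched below. The permission string acdlrw stands for add, create, delete, list, read, and write. The function names and the seven-day expiry in the usage line are assumptions, and sas_expiry relies on GNU date as found on Linux (macOS and BSD date use different flags).

```shell
# Print a UTC expiry timestamp N days from now, e.g. 2023-03-01T12:00Z.
# Assumes GNU date.
sas_expiry() {
  # $1 = number of days until the token expires
  date -u -d "+$1 days" '+%Y-%m-%dT%H:%MZ'
}

# Hypothetical helper: generate a SAS token for one container.
generate_container_sas() {
  # $1 = storage account name, $2 = container name, $3 = days until expiry
  az storage container generate-sas \
    --account-name "$1" \
    --name "$2" \
    --permissions acdlrw \
    --expiry "$(sas_expiry "$3")" \
    --auth-mode key \
    --output tsv
}

# Usage (placeholder account name, token valid for 7 days):
#   generate_container_sas mldemostorage123 mlfiles 7
```

The token printed by this command is the value that goes into the sas_token field of the YAML file.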

Navigate to the directory containing the YAML file and create the datastore in the Azure ML workspace.
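The registration step can be sketched as follows. It assumes the Azure ML CLI extension is installed (az extension add --name ml). The function name, resource group, and workspace name are placeholders.

```shell
# Hypothetical helper: register the datastore described by a YAML file.
register_datastore() {
  # $1 = YAML file, $2 = resource group, $3 = ML workspace name
  az ml datastore create \
    --file "$1" \
    --resource-group "$2" \
    --workspace-name "$3"
}

# Usage (placeholder names):
#   register_datastore blobdatastore.yml my-resource-group my-ml-workspace
```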

These steps ensure that we have registered a datastore connected to our blob storage using a SAS token.
