Ingesting Data & Managing Datasets (Part 1 of 2)
Choosing Data Storage Solutions for Azure Machine Learning
Organizing Data in Azure Machine Learning
In Azure Machine Learning, data is managed as datasets and data storage as datastores.
A datastore is an abstraction of a physical data storage system that links an existing storage service to an Azure Machine Learning workspace. You create a datastore by providing the connection and authentication details for the existing storage; once the datastore is created, users can access that storage through the datastore object. This makes it easy to share data storage with the developers, data engineers, and data scientists collaborating in an Azure Machine Learning workspace (a registration sketch follows the list below). At present, the following services can be connected to a workspace as datastores:
- Azure Blob containers
- Azure File Share
- Azure Data Lake
- Azure Data Lake Gen2
- Azure SQL Database
- Azure Database for PostgreSQL
- Databricks File System
- Azure Database for MySQL
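As a minimal sketch of how a storage service is registered as a datastore, the snippet below connects an Azure Blob container to a workspace using the azureml-core (SDK v1) API; the storage account, container, datastore name, and key are placeholders.

```python
from azureml.core import Workspace, Datastore

# Load the workspace from an existing config.json
ws = Workspace.from_config()

# Register an existing Azure Blob container as a datastore
# (account name, container name, and key below are placeholders)
blob_datastore = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name="training_blob",       # name used inside the workspace
    container_name="training-data",       # existing blob container
    account_name="mystorageaccount",
    account_key="<storage-account-key>",  # a SAS token can be used instead
)

print(blob_datastore.name, blob_datastore.datastore_type)
```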
A dataset is an abstraction of the data itself, and Azure Machine Learning supports two types of datasets:
Tabular datasets define tabular data, for example from comma- or otherwise delimiter-separated files, from Parquet and JSON files, or from SQL queries. They can also be defined directly from publicly available URLs (a so-called direct dataset), much like fetching data from URLs with popular libraries such as pandas and requests.
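For illustration, here is a sketch of defining a tabular dataset both from CSV files on the (placeholder) datastore registered above and directly from a hypothetical public URL, again with the azureml-core SDK:

```python
from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config()
datastore = Datastore.get(ws, "training_blob")  # placeholder datastore name

# Tabular dataset from delimited files stored on the datastore
sales_ds = Dataset.Tabular.from_delimited_files(path=(datastore, "sales/2023/*.csv"))

# Direct dataset: tabular data referenced straight from a public URL (hypothetical)
web_ds = Dataset.Tabular.from_delimited_files(
    path="https://example.com/data/diabetes.csv"
)

# Materialize into a pandas DataFrame for exploration
df = sales_ds.to_pandas_dataframe()
```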
File datasets define any file-based or binary data from files and folders, such as images, audio, and video.
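A corresponding sketch for a file dataset, assuming an images/ folder exists on the same placeholder datastore:

```python
from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config()
datastore = Datastore.get(ws, "training_blob")  # placeholder datastore name

# File dataset covering every file under images/ on the datastore
image_ds = Dataset.File.from_files(path=(datastore, "images/**"))

# Download the referenced files locally, e.g. for debugging or local training
local_paths = image_ds.download(target_path="./images", overwrite=True)
```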
Both types can be registered in your workspace (they are then known as registered datasets) and are available in your Azure ML Studio under Datasets.
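Registering is a single call on the dataset object; the dataset name, description, and URL below are placeholders.

```python
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()

# Define a tabular dataset from a (hypothetical) public URL and register it
ds = Dataset.Tabular.from_delimited_files(
    path="https://example.com/data/diabetes.csv"
)
registered = ds.register(
    workspace=ws,
    name="diabetes-tabular",
    description="Example registered dataset",
    create_new_version=True,
)

# Registered datasets can later be retrieved by name, in code or in the Studio UI
same_ds = Dataset.get_by_name(ws, name="diabetes-tabular")
```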
Understanding the Default Storage Accounts of Azure Machine Learning
Every Azure Machine Learning workspace comes with a default datastore backed by the workspace's storage account. It can be accessed and used just like a custom datastore by obtaining a datastore reference, as sketched below. Once the default datastore is accessible and any custom datastores are connected, the remaining task is to choose efficient data storage for the various ML use cases.
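As a sketch, the default datastore can be retrieved straight from the workspace object; the local folder and target path below are placeholders.

```python
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()

# Every workspace ships with a default datastore (a blob container in the
# workspace's storage account, typically named "workspaceblobstore")
default_ds = ws.get_default_datastore()
print(default_ds.name)

# It behaves like any other datastore: upload local files to it...
default_ds.upload(src_dir="./local-data", target_path="data", overwrite=True)

# ...and reference the uploaded files from a dataset
ds = Dataset.Tabular.from_delimited_files(path=(default_ds, "data/*.csv"))
```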