Using Datasets in Azure Machine Learning

By Ashwin Venugopal - February 26, 2023

Creating New Datasets

There are multiple ways to create new datasets, but they fall into two main types, tabular and file, each with its own set of constructors. Pick the one that matches the type of dataset you would like to create:

Dataset.Tabular.from_* for tabular datasets
Dataset.File.from_* for file-based datasets
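As a rough sketch of the two constructor families, the helper below builds one dataset of each type from public URLs. It assumes the azureml-core package (SDK v1) is installed. The function name and its parameters are made up for this illustration; only from_delimited_files and from_files are SDK calls.

```python
def create_datasets(csv_url, file_url):
    """Return an unregistered (tabular, file) dataset pair.

    Hypothetical helper for illustration; both arguments are assumed
    to be publicly reachable URLs.
    """
    # Imported here so the module still loads where the SDK is missing.
    from azureml.core import Dataset

    # Tabular: the delimited text is parsed into rows and columns.
    tabular = Dataset.Tabular.from_delimited_files(path=csv_url)

    # File: the file is referenced as-is, without parsing.
    files = Dataset.File.from_files(path=file_url)
    return tabular, files
```

Neither call copies the data; each one creates a reference to the source that is resolved when the dataset is actually read.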

Tabular datasets can be further divided into two kinds: a direct dataset, where the data is accessed from its original location via a public URL, and a dataset whose data is stored on either the default datastore or a custom datastore.
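For the datastore-backed case, a minimal sketch could look like the following. It assumes an existing Workspace object and a CSV file already uploaded to the workspace's default datastore; the helper name and the relative path are placeholders, not part of the SDK.

```python
def tabular_from_default_datastore(workspace, relative_path):
    """Create a tabular dataset from a file on the default datastore.

    Hypothetical helper; `relative_path` is the file's path inside the
    datastore, for example "data/sales.csv".
    """
    from azureml.core import Dataset

    # Every workspace has a default datastore; a custom one could be
    # passed in instead.
    datastore = workspace.get_default_datastore()

    # A (datastore, path) tuple points at data stored in that datastore.
    return Dataset.Tabular.from_delimited_files(path=(datastore, relative_path))
```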

A Dataset object can be accessed or passed around in the current environment through its object reference. It can also be registered in the workspace and then accessed by its dataset name, in which case it is called a registered dataset.
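Registering a dataset and fetching it back by name might be sketched as below. The helper name is invented for this example, and it assumes you already hold a Workspace object and an unregistered dataset; register and Dataset.get_by_name are the SDK v1 calls being illustrated.

```python
def register_and_fetch(dataset, workspace, name):
    """Register `dataset` under `name`, then look it up again by name.

    Hypothetical helper for illustration only.
    """
    from azureml.core import Dataset

    # create_new_version=True avoids an error if the name is already
    # registered and stores this definition as a new version instead.
    dataset.register(workspace=workspace, name=name, create_new_version=True)

    # Any later session with access to the workspace can do this lookup.
    return Dataset.get_by_name(workspace, name=name)
```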

Exploring Data in Datasets

There are many ways to explore registered datasets in Azure ML. A tabular dataset can be loaded and analyzed programmatically in an Azure Machine Learning workspace. Once you have a reference to the dataset, you can convert it into an in-memory pandas DataFrame or a lazily loaded Spark or Dask DataFrame by calling one of the following methods:

to_pandas_dataframe() to create an in-memory pandas DataFrame
to_spark_dataframe() to create a lazily loaded Spark DataFrame
to_dask_dataframe() to create a lazily loaded Dask DataFrame
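Because to_pandas_dataframe() pulls everything into memory, it can help to look at a few rows first. The sketch below assumes `dataset` is a tabular dataset reference; the helper name and the default of five rows are arbitrary choices for this example.

```python
def preview_as_pandas(dataset, rows=5):
    """Load only the first `rows` records into a pandas DataFrame.

    Hypothetical helper; `dataset` is assumed to be a TabularDataset.
    """
    # take() returns a new dataset limited to the first `rows` records,
    # so only that slice is materialized in memory.
    return dataset.take(rows).to_pandas_dataframe()
```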

Lazy datasets load data into memory only when it is explicitly needed, while non-lazy ones load all the data into memory at once and are therefore limited by the available memory.

Now that you have loaded the DataFrame, you can run your favorite pandas methods to explore the dataset.

Using External Datasets with Open Datasets

A common way to improve the prediction performance of an ML model is to add information to your training data by joining external datasets to it. This is easiest when you work with transactional data containing dates, because the dates give you a natural key for the join and a source of additional features for the training dataset. Typical features derived from dates include weekdays, weekends, time to or since weekends, holidays, time to or since holidays, sports events, and concerts. You can also join additional country-specific data, such as population, economic, sociological, health, and labor data, which can offer extra insights that improve your model's performance.
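To make the idea of date-derived features concrete, here is a small pandas sketch. The column names (date, amount, holiday_name) and the sample rows are invented for this example; in practice the holiday table would come from an external dataset.

```python
import pandas as pd


def add_date_features(sales, holidays):
    """Return a copy of `sales` with weekday, weekend and holiday flags.

    Both frames are assumed to have a `date` column; `holidays` is also
    assumed to have a `holiday_name` column.
    """
    out = sales.copy()
    out["date"] = pd.to_datetime(out["date"])
    out["weekday"] = out["date"].dt.dayofweek  # Monday=0 ... Sunday=6
    out["is_weekend"] = out["weekday"] >= 5

    hol = holidays.copy()
    hol["date"] = pd.to_datetime(hol["date"])

    # Left join keeps every transaction; days without a holiday get NaN.
    out = out.merge(hol, on="date", how="left")
    out["is_holiday"] = out["holiday_name"].notna()
    return out


# Made-up sample data, only to show the shape of the inputs.
sales = pd.DataFrame(
    {"date": ["2023-01-01", "2023-01-02", "2023-01-07"],
     "amount": [120.0, 80.0, 95.5]}
)
holidays = pd.DataFrame(
    {"date": ["2023-01-01"], "holiday_name": ["New Year's Day"]}
)
features = add_date_features(sales, holidays)
```

The left join is a deliberate choice here: an inner join would silently drop every transaction that did not fall on a holiday.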

Azure Open Datasets is a service that provides access to curated datasets in categories such as transportation, health and genomics, labor and economics, population and safety, and common datasets, which can be used to boost your model's performance.

Azure Open Datasets also lets you access these curated datasets conveniently in the form of Azure Machine Learning datasets, right from within your Azure Machine Learning workspace.
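As one possible example, the public holidays collection can be loaded through the azureml-opendatasets package. The sketch below assumes that package is installed and that the PublicHolidays class accepts start_date and end_date as shown; the helper name is made up, and the columns of the returned DataFrame should be checked against the dataset's catalog page.

```python
from datetime import datetime


def load_public_holidays(start, end):
    """Return public holidays between `start` and `end` as a DataFrame.

    Hypothetical helper; `start` and `end` are datetime objects.
    """
    # Imported here so the module still loads where the package is missing.
    from azureml.opendatasets import PublicHolidays

    holidays = PublicHolidays(start_date=start, end_date=end)
    return holidays.to_pandas_dataframe()


# Example date range for a call such as
# load_public_holidays(year_start, year_end).
year_start = datetime(2023, 1, 1)
year_end = datetime(2023, 12, 31)
```

A table like this is a natural candidate for joining onto transactional data by date.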
