Using Datasets in Azure Machine Learning
Creating New Datasets
- Dataset.Tabular.from_* for tabular datasets
- Dataset.File.from_* for file-based datasets
A tabular dataset can be further divided into two kinds: a direct dataset, where the data is accessed from its original location via a public URL, or a dataset whose data is stored on either the default or a custom datastore.
A Dataset object can be accessed or passed around in the current environment through its object reference. It can also be registered in the workspace and then accessed by its name, in which case it is called a registered dataset.
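As a minimal sketch of the two creation paths and of registration, assuming the Azure Machine Learning Python SDK v1 (the azureml-core package) is installed: the function names, the CSV location, and the dataset name are hypothetical placeholders chosen for illustration, and the imports sit inside the functions only so that the definitions can be loaded without the SDK present.

```python
def create_direct_dataset(csv_url):
    """Direct dataset: the data stays at its original public URL."""
    from azureml.core import Dataset  # requires azureml-core (SDK v1)

    return Dataset.Tabular.from_delimited_files(path=csv_url)


def create_datastore_dataset(workspace, relative_path):
    """Dataset backed by a file on the workspace's default datastore."""
    from azureml.core import Dataset

    datastore = workspace.get_default_datastore()
    # A (datastore, relative path) tuple points at a file inside the datastore.
    return Dataset.Tabular.from_delimited_files(path=[(datastore, relative_path)])


def register_dataset(workspace, dataset, name):
    """Register the dataset so it can later be fetched by name."""
    return dataset.register(workspace=workspace, name=name, create_new_version=True)


def load_registered_dataset(workspace, name):
    """Fetch a previously registered dataset by its name."""
    from azureml.core import Dataset

    return Dataset.get_by_name(workspace, name=name)
```

With a workspace obtained through Workspace.from_config(), a typical flow would be to call create_direct_dataset with a CSV URL, pass the result to register_dataset, and later retrieve it in another script with load_registered_dataset.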
Exploring Data in Datasets
- to_pandas_dataframe() to create an in-memory pandas DataFrame
- to_spark_dataframe() to create a lazily loaded Spark DataFrame
- to_dask_dataframe() to create a lazily loaded Dask DataFrame
Lazy datasets load data into memory only when it is explicitly needed, while non-lazy ones load all the data into memory at once and are therefore limited by the available memory.
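The difference between lazy and non-lazy loading can be sketched with the Python standard library alone. This is only an analogy under stated assumptions, not how Spark, Dask, or Azure Machine Learning implement laziness internally; the sample data and the function names are made up for illustration.

```python
import csv
import io

# Hypothetical sample data standing in for a much larger file.
SAMPLE_CSV = "id,value\n1,10\n2,20\n3,30\n4,40\n"


def load_all(stream):
    """Non-lazy: materialize every row in memory at once."""
    return list(csv.DictReader(stream))


def iter_rows(stream):
    """Lazy: yield one row at a time, only when the caller asks for it."""
    for row in csv.DictReader(stream):
        yield row


eager_rows = load_all(io.StringIO(SAMPLE_CSV))  # all rows are held in memory
lazy_rows = iter_rows(io.StringIO(SAMPLE_CSV))  # nothing has been read yet
first = next(lazy_rows)                         # reads just the first row
# The remaining rows are streamed one by one and never held together.
remaining_total = sum(int(row["value"]) for row in lazy_rows)
```

The eager list grows with the size of the file, while the generator keeps only the current row in memory, which is why lazily loaded DataFrames can handle data larger than the available memory.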
Once you have loaded the DataFrame, you can run your favorite pandas methods to explore the dataset.
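A first look at the data might be sketched as follows. The explore helper and its report keys are hypothetical; it assumes the dataset argument is a tabular dataset exposing to_pandas_dataframe(), and it uses only standard pandas DataFrame members (shape, dtypes, head, describe, isna).

```python
def explore(dataset, rows=5):
    """Load a tabular dataset into pandas and collect a few quick summaries."""
    df = dataset.to_pandas_dataframe()  # loads all the data into memory
    return {
        "shape": df.shape,           # (number of rows, number of columns)
        "dtypes": df.dtypes,         # inferred type of each column
        "head": df.head(rows),       # first few rows
        "summary": df.describe(),    # basic statistics for numeric columns
        "missing": df.isna().sum(),  # count of missing values per column
    }
```

Because to_pandas_dataframe() is not lazy, this approach suits datasets that fit comfortably in memory.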
Using External Datasets with Open Datasets
Azure Open Datasets is a service that provides access to curated datasets in categories such as transportation, health and genomics, labor and economics, population and safety, and common datasets, which can be used to boost your model's performance.
Azure Open Datasets also lets you access these curated datasets conveniently in the form of Azure Machine Learning datasets, right from within your Azure Machine Learning workspace.
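As a hedged sketch of what this can look like in code, assuming the separate azureml-opendatasets package is installed and taking its Diabetes sample dataset as an arbitrary example; the function name is hypothetical, and the import is placed inside the function so the definition loads without the package.

```python
def load_diabetes_dataframe():
    """Fetch the Diabetes open dataset and return it as a pandas DataFrame."""
    # Requires the azureml-opendatasets package.
    from azureml.opendatasets import Diabetes

    # The open dataset is exposed as an Azure Machine Learning tabular dataset.
    dataset = Diabetes.get_tabular_dataset()
    return dataset.to_pandas_dataframe()
```

The tabular dataset returned by get_tabular_dataset() behaves like any other Azure Machine Learning dataset, so it can be explored or registered in the same way.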