What are Azure Open Datasets?
Azure Open Datasets are curated public datasets that you can use to add scenario-specific features to machine learning solutions for more accurate models. Open Datasets are available in the cloud on Microsoft Azure. They’re integrated into Azure Machine Learning and readily available to Azure Databricks. You can also access the datasets through APIs and use them in other products, such as Power BI and Azure Data Factory. Datasets include public-domain data for weather, census, holidays, public safety, and location that help you train machine learning models and enrich predictive solutions. You can also share your public datasets through Azure Open Datasets.
Curated and Prepared Datasets
Curated open public datasets in Azure Open Datasets are optimized for consumption in machine learning workflows. Data scientists often spend most of their time cleaning and preparing data for advanced analytics. To save you time, Open Datasets are copied to the Azure cloud, and then preprocessed. At regular intervals, data is pulled from the sources - for example, by an FTP connection to the National Oceanic and Atmospheric Administration (NOAA). Next, the data is parsed into a structured format, and then enriched as needed, with features such as ZIP Code or the locations of the nearest weather stations. Datasets are cohosted with cloud compute in Azure, to make access and manipulation easier.
Available Dataset Categories
Transportation
Taxi trip records from New York City, including yellow and green taxi data with pickup/drop-off times, locations, distances, fares, and passenger counts.
Health and Genomics
COVID-19 datasets tracking cases, deaths, and recoveries worldwide, plus genomics datasets including 1000 Genomes, gnomAD, and ClinVar for genetic variation research.
Labor and Economics
US labor statistics, employment data, consumer price index, and producer price index datasets for economic analysis and forecasting.
Population and Safety
US population data by county and ZIP code, plus 311 safety data from major cities including Boston, Chicago, New York, San Francisco, and Seattle.
Supplemental Datasets
Common datasets for machine learning, including MNIST handwritten digits, the Diabetes dataset, public holidays data covering 38 countries, and simulated sales data.
Access Methods
With an Azure account, you can access open datasets through code or through the Azure service interface. The data is colocated with Azure cloud compute resources for use in your machine learning solutions.
Python SDK
Access datasets programmatically using the azureml-opendatasets Python package.
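As a minimal sketch of what SDK access can look like, the snippet below loads a short window of the New York City yellow taxi data into a pandas DataFrame. It assumes the package has been installed with `pip install azureml-opendatasets` and uses that package's `NycTlcYellow` class. The helper names `date_window` and `load_yellow_taxi` are hypothetical and exist only for this illustration, and whether the end date is treated as inclusive is not stated here, so treat the window boundaries as approximate.

```python
from datetime import datetime, timedelta
from typing import Tuple


def date_window(first_day: str, days: int) -> Tuple[datetime, datetime]:
    """Return (start, end) datetimes for a window of `days` days.

    `first_day` is an ISO date string such as "2018-05-01".
    Hypothetical helper, not part of azureml-opendatasets.
    """
    start = datetime.fromisoformat(first_day)
    return start, start + timedelta(days=days)


def load_yellow_taxi(start: datetime, end: datetime):
    """Load NYC yellow taxi trips between `start` and `end` as a DataFrame.

    Hypothetical wrapper around the package's NycTlcYellow class.
    """
    # Imported lazily so the helpers above can be used (and tested)
    # without the package installed or network access available.
    from azureml.opendatasets import NycTlcYellow

    trips = NycTlcYellow(start_date=start, end_date=end)
    return trips.to_pandas_dataframe()
```

Calling `load_yellow_taxi(*date_window("2018-05-01", 7))` would then download roughly one week of trips. Keeping the window short is a deliberate choice in this sketch, since the taxi data is large and a wide date range can take a long time to load.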
Azure Machine Learning
Open Datasets are available through the Azure Machine Learning UI and SDK. You can create datasets from Open Datasets and use them in your ML experiments.
Azure Databricks
Access datasets in Azure Databricks notebooks using the Python SDK or direct access to Azure Blob Storage.
Direct Access
You don’t need an Azure account to access Open Datasets. You can access them from any Python environment, with or without Spark.
Benefits
No Extra Storage Cost
Datasets are lazily evaluated and data remains in its existing location
Improved Performance
ML workflows run faster because the data is cohosted with Azure compute
Data Integrity
No risk of unintentionally changing your original data sources
Regular Updates
Datasets are updated at regular intervals from trusted sources
Request or Contribute Datasets
If you can’t find the data you want, you can:
- Request a dataset: Email aod@microsoft.com with details about the dataset you need
- Contribute a dataset: Share your public datasets by emailing aod@microsoft.com
Next Steps
Browse Catalog
Explore all available datasets in the catalog
Create Dataset
Learn how to create datasets from Open Datasets
Public Holidays
View the Public Holidays dataset
Python SDK Reference
View the Python SDK documentation