Getting Datasets for Data Analysis tasks - Useful sites for finding datasets
Open data fuels innovation. It enables people to focus more on research than on data collection, which is both time-consuming and expensive.
The video above is a hands-on version of this article. Read the article first, then follow the video for the hands-on part.
In this series, Getting Datasets for Data Analysis tasks, we are looking at ways to access datasets from the internet. In the first part, we learnt to streamline Google search to find only specific files on the web. In this part, let’s look at some of the sites that host free and openly available datasets for data analysis tasks. Some sources, like the UCI Machine Learning Repository, Kaggle datasets and Data.gov, are already well known in the data science community, so I shall not cover them in this article.
Is the data FAIR?
Making data available publicly is vital for the benefit of the research community and society as a whole. However, shared data should follow some essential guidelines so that it can be put to maximum use. In “The FAIR Guiding Principles for Scientific Data Management and Stewardship,” Wilkinson et al. lay down principles for data management and data sharing. FAIR is an acronym for data that is Findable, Accessible, Interoperable and Reusable.
Let’s now look at some of the useful sites for finding open and publicly available datasets, quickly and without much hassle.
1. Google Dataset Search
Google Dataset Search is a search engine dedicated to finding datasets. It searches over metadata supplied by data providers, which means it indexes the description of a dataset rather than its contents. So if a dataset is publicly available and properly described, there is a good chance that it will show up in Dataset Search. At the time of the launch, Dataset Search had indexed almost 25 million datasets from across the globe. It relies on keyword search and, like regular Google search, offers a neat autocomplete option while you type.
If you wish to make your own datasets discoverable in Google Dataset Search, describe their properties on your own web page using schema.org, an open standard. Once a dataset on your site is described this way, others can find it in Dataset Search.
🔗 Link to the site: https://datasetsearch.research.google.com/
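To make the schema.org step more concrete, here is a minimal sketch of how such a description could be generated. It assumes the markup is embedded as JSON-LD using the schema.org `Dataset` and `DataDownload` types. The helper names, the dataset and every URL in the example are hypothetical and only there for illustration.

```python
import json


def dataset_jsonld(name, description, url, license_url, download_url,
                   encoding_format="text/csv"):
    """Build a schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
        # Each downloadable file is described as a DataDownload entry.
        "distribution": [
            {
                "@type": "DataDownload",
                "encodingFormat": encoding_format,
                "contentUrl": download_url,
            }
        ],
    }


def as_script_tag(jsonld):
    """Wrap the JSON-LD in the <script> element that goes into the page's HTML."""
    body = json.dumps(jsonld, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Hypothetical dataset and URLs, for illustration only.
example = dataset_jsonld(
    name="City Air Quality Readings",
    description="Hourly air quality measurements from a made-up sensor network.",
    url="https://example.org/datasets/air-quality",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    download_url="https://example.org/datasets/air-quality.csv",
)

print(as_script_tag(example))
```

The printed snippet would be pasted into the HTML of the page that hosts the dataset. Which properties a search engine actually requires or recommends is best checked against its own documentation.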
2. OpenML
OpenML is an open data science platform meant to democratize machine learning research. It provides a large amount of data from a variety of domains ranging from healthcare to education to climate change.
Every dataset on this site has a dedicated webpage, and the data can be downloaded in multiple formats such as CSV, JSON and XML. OpenML can also be used to build machine learning models, and those models can then be uploaded online so that others can use them.
OpenML is essentially designed for collaborative data science where people can share their code and results.
🔗 Link to the datasets: https://www.openml.org/search?type=data
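Besides the web pages, dataset descriptions can also be read programmatically. The sketch below assumes that OpenML's REST API serves a JSON description at `/api/v1/json/data/<id>` and that the payload keeps its fields under a `data_set_description` key. Both are assumptions worth checking against the OpenML API documentation, and the helper names are my own.

```python
import json
from urllib.request import urlopen

# Assumed base URL of the OpenML REST API (JSON flavour).
OPENML_API = "https://www.openml.org/api/v1/json"


def description_url(dataset_id):
    """Return the assumed API URL for a dataset's description."""
    return f"{OPENML_API}/data/{dataset_id}"


def summarize(payload):
    """Pick a few fields out of a dataset description payload."""
    desc = payload["data_set_description"]
    return {
        "id": int(desc["id"]),
        "name": desc["name"],
        "format": desc.get("format"),
        "download_url": desc.get("url"),
    }


def fetch_summary(dataset_id, timeout=30):
    """Download a dataset description and summarize it (needs network access)."""
    with urlopen(description_url(dataset_id), timeout=timeout) as response:
        return summarize(json.load(response))


# Usage (requires an internet connection):
# print(fetch_summary(61))
```

If you already use scikit-learn, its `fetch_openml` function is a more convenient route, since it downloads the data itself and not just the description.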
3. FiveThirtyEight
FiveThirtyEight is a site that hosts interactive articles. It presents compelling analytical stories backed by interesting, curated datasets. These datasets have been made openly available to the public via the site's GitHub repository. Anybody can use them to perform analyses of their own on topics ranging from politics to sports.
Some of the interesting datasets on their site include the following:
- Airline Safety: the data behind the story “Should Travellers Avoid Flying Airlines That Have Had Crashes in the Past?”
- Avengers: the data behind the story “Joining The Avengers Is As Deadly As Jumping Off A Four-Story Building”
🔗 Link to the datasets: https://github.com/fivethirtyeight/data
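Since the datasets live in a GitHub repository, a single file can be read directly from its raw URL without cloning anything. The sketch below assumes the repository's default branch is `master` and that each dataset sits in its own folder, for example `airline-safety/airline-safety.csv`. The helper names are hypothetical, and the two-row sample uses made-up values so the parsing step can be tried offline.

```python
import csv
import io
from urllib.request import urlopen

# Assumed raw-content base URL for the repository's master branch.
RAW_BASE = "https://raw.githubusercontent.com/fivethirtyeight/data/master"


def raw_csv_url(folder, filename):
    """Return the raw download URL for a file in the repository."""
    return f"{RAW_BASE}/{folder}/{filename}"


def parse_rows(csv_text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def download_rows(folder, filename, timeout=30):
    """Download a CSV file from the repository and parse it (needs network access)."""
    with urlopen(raw_csv_url(folder, filename), timeout=timeout) as response:
        return parse_rows(response.read().decode("utf-8"))


# Offline demonstration with made-up values, not real FiveThirtyEight data.
sample = "airline,incidents\nExample Air,3\nSample Airways,0\n"
rows = parse_rows(sample)
print(rows[0]["airline"], rows[0]["incidents"])

# Usage against the real repository (requires an internet connection):
# rows = download_rows("airline-safety", "airline-safety.csv")
```

Note that `csv.DictReader` returns every value as a string, so numeric columns need converting before any analysis. With pandas installed, passing the same raw URL to `pandas.read_csv` handles that conversion for you.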
4. Awesome-Public-Datasets on GitHub
`awesome-public-datasets` is a GitHub repository that lists high-quality public datasets, neatly organised by topic. The repository describes itself as follows:
“This is a list of topic-centric public data sources in high quality. They are collected and tidied from blogs, answers, and user responses. Most of the data sets listed below are free; however, some are not.”
The categories covered in the repository include agriculture, biology, climate and weather, economics, education, finance, healthcare and sports, among others.
🔗 Link to the datasets: https://github.com/awesomedata/awesome-public-datasets
5. BuzzFeed News
BuzzFeed News is an American news website published by BuzzFeed, an American internet media, news and entertainment company. It has open-sourced the data, analysis, libraries, tools, and guides behind many of its stories on GitHub.
You can find some interesting datasets, for instance:
- College Tuition and Minimum Wage Analysis
- Government Surveillance Planes Analysis
🔗 Link to the datasets: https://github.com/BuzzFeedNews/everything
These were some of the sites that host or aggregate open datasets. This isn’t in any way an exhaustive list, but these are some of my favourites. If you are looking to work on machine learning projects, I hope these sites prove useful.