5 Real-World datasets for honing your Exploratory Data Analysis skills

The best way to learn data science is by doing it


If you are just getting started in data science and looking for some interesting datasets to play with, this article is for you. Many courses and books never really move beyond the classic Titanic and Iris datasets. There is no harm in that, but these datasets have become so familiar that many people know offhand how many missing values or string columns they contain. This article offers a fresh set of datasets to tinker with.

Palmer Archipelago penguin data

A drop-in replacement for the Iris dataset



The overused Iris flower dataset, also known as Fisher’s Iris dataset, is a multivariate dataset introduced by the British statistician and biologist Ronald Fisher. The Palmer penguins dataset is a drop-in replacement for the classic Iris data. It consists of attributes of three penguin species: Adélie, Gentoo, and Chinstrap. It is a great introductory dataset for data exploration and visualization.

The data folder contains two CSV files:

  • penguins_size.csv: Includes variables like species, body_mass, gender, island, etc.
  • penguins_lter.csv: Original combined data for three penguin species.
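
As a rough starting point for exploring the smaller file, the sketch below loads a CSV with pandas and collects a few first-look facts: shape, missing values per column, and per-species counts and means. The inline sample rows and the column names body_mass_g and sex are assumptions made for illustration; check the real headers in penguins_size.csv before relying on them.

```python
import io

import pandas as pd

# Tiny made-up stand-in for penguins_size.csv so the sketch runs on its own.
# Column names here are assumptions; the real file may name them differently.
SAMPLE_CSV = """species,island,body_mass_g,sex
Adelie,Torgersen,3750,MALE
Adelie,Torgersen,3800,FEMALE
Adelie,Dream,,
Gentoo,Biscoe,5000,MALE
Gentoo,Biscoe,4500,FEMALE
Chinstrap,Dream,3500,FEMALE
"""


def first_look(df: pd.DataFrame) -> dict:
    """Collect a few basic facts worth checking before any plotting."""
    return {
        "shape": df.shape,
        "missing_per_column": df.isna().sum().to_dict(),
        "rows_per_species": df["species"].value_counts().to_dict(),
        "mean_body_mass_by_species": (
            df.groupby("species")["body_mass_g"].mean().to_dict()
        ),
    }


# With the real file this would be: penguins = pd.read_csv("penguins_size.csv")
penguins = pd.read_csv(io.StringIO(SAMPLE_CSV))
summary = first_look(penguins)
for key, value in summary.items():
    print(key, "->", value)
```

Keeping the checks in one function makes it easy to rerun the same first look on penguins_lter.csv, provided the column names are adjusted to match that file.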

Link to Dataset: https://www.kaggle.com/parulpandey/palmer-archipelago-antarctica-penguin-data

Starter Notebook: https://www.kaggle.com/parulpandey/penguin-dataset-the-new-iris

COVID-19 Clinical Trials dataset

Database of COVID-19 related clinical studies being conducted worldwide



ClinicalTrials.gov is a database of privately and publicly funded clinical studies conducted around the world. It is maintained by the U.S. National Library of Medicine at the National Institutes of Health. The COVID-19 Clinical Trials dataset consists of the COVID-19-related clinical trials listed on the site.

The dataset consists of XML files, where each XML file corresponds to one study. The filename is the NCT number, a unique identifier of a study in the ClinicalTrials.gov repository. A CSV file is also provided; it does not contain as much information as the XML files but is sufficient for many analyses. The starter notebook explains how to convert the XML files into a pandas DataFrame.
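
The XML-to-DataFrame idea can be sketched with the standard library alone. This is not the starter notebook's code, only a minimal illustration: the two study records are invented and heavily trimmed, and the tag names (id_info/nct_id, brief_title, overall_status) are assumptions about the file layout that should be checked against a real file.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Invented, heavily trimmed study records keyed by filename (the NCT number).
# Real files carry many more fields; the tag names below are assumptions.
SAMPLE_STUDIES = {
    "NCT00000001.xml": """<clinical_study>
  <id_info><nct_id>NCT00000001</nct_id></id_info>
  <brief_title>Example study A</brief_title>
  <overall_status>Recruiting</overall_status>
</clinical_study>""",
    "NCT00000002.xml": """<clinical_study>
  <id_info><nct_id>NCT00000002</nct_id></id_info>
  <brief_title>Example study B</brief_title>
</clinical_study>""",
}

# Output column name -> path of the tag relative to the root element.
FIELDS = {
    "nct_id": "id_info/nct_id",
    "title": "brief_title",
    "status": "overall_status",
}


def study_to_row(xml_text: str) -> dict:
    """Turn one study's XML into a flat dict with one key per column."""
    root = ET.fromstring(xml_text)
    # findtext returns None when a tag is absent; pandas treats that as missing.
    return {name: root.findtext(path) for name, path in FIELDS.items()}


studies = pd.DataFrame(
    [study_to_row(text) for text in SAMPLE_STUDIES.values()]
)
print(studies)
```

With the real dataset, the dictionary of strings would be replaced by reading each XML file from disk; adding a field is then a one-line change to FIELDS.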

Link to Dataset: https://www.kaggle.com/parulpandey/covid19-clinical-trials-dataset

Starter Notebook: EDA on COVID-19 Clinical Trials

Article: Extracting information from XML files into a Pandas dataframe

Forbes Highest-Paid Athletes 1990–2020

Who earns the most in Sports?



This dataset consists of a complete list of the world’s highest-paid athletes since Forbes’s first list in 1990. In 2002, Forbes changed the reporting period from the full calendar year to June-to-June; consequently, there are no records for 2001. The dataset covers records through 2020.
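
A gap like the missing 2001 records is easy to overlook in a time series, so it is worth checking for explicitly. The sketch below lists the years absent from a year column; the column name Year and the sample rows are hypothetical and only illustrate the check.

```python
import pandas as pd


def missing_years(df: pd.DataFrame, year_col: str = "Year") -> list:
    """Return the years between the first and last record that have no rows."""
    years = set(df[year_col].astype(int))
    full_range = set(range(min(years), max(years) + 1))
    return sorted(full_range - years)


# Made-up rows for illustration only; not taken from the Forbes lists.
sample = pd.DataFrame(
    {
        "Year": [1999, 2000, 2002, 2002, 2003],
        "Name": ["Athlete A", "Athlete B", "Athlete A", "Athlete C", "Athlete B"],
    }
)

gaps = missing_years(sample)
print(gaps)  # [2001]
```

Knowing about the gap up front helps when plotting earnings over time, since a line chart would otherwise silently connect 2000 to 2002.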

Link to Dataset: https://www.kaggle.com/parulpandey/covid19-clinical-trials-dataset

Starter Notebook: 💰Who earned the most in Sports in 2020🏆?

IT Salary Survey for EU region (2018–2020)

Annual Anonymous IT Salary Survey for the European region



An anonymous salary survey has been conducted annually since 2015 among European IT specialists, with a stronger focus on Germany. This year, 1,238 respondents volunteered to participate in the survey. The data has been made publicly available by the authors and shared on Kaggle to reach a wider audience. The dataset contains rich information about salary patterns among IT professionals in the EU region.

Link to Dataset: https://www.kaggle.com/parulpandey/2020-it-salary-survey-for-eu-region

Article: IT Salary Survey December 2020

U.S. International Air Traffic data (1990–2020)

Airport and airline traffic by US and international carriers



This dataset comes from the U.S. International Air Passenger and Freight Statistics Report. As part of the T-100 program, USDOT receives traffic reports from US and international airlines operating to and from US airports. There are two datasets available:

  • Departures: Data on all flights between US gateways and non-US gateways, irrespective of origin and destination.
  • Passengers: Data on the total number of passengers for each month and year between a pair of airports, as serviced by a particular airline.
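
Because the Passengers data is recorded per month, airport pair, and airline, a natural first step is to roll it up to yearly totals per route. The sketch below does this with a pandas groupby; the column names (Year, Month, usg_apt, fg_apt, carrier, Total) and the sample rows are assumptions for illustration and should be checked against the actual file.

```python
import pandas as pd

# Made-up rows shaped like the Passengers data: one row per month, airport
# pair, and airline. Column names are assumptions, not confirmed headers.
passengers = pd.DataFrame(
    {
        "Year": [2019, 2019, 2019, 2020],
        "Month": [1, 1, 2, 1],
        "usg_apt": ["JFK", "JFK", "JFK", "JFK"],  # US gateway airport
        "fg_apt": ["LHR", "LHR", "LHR", "LHR"],   # foreign gateway airport
        "carrier": ["AA", "BA", "AA", "AA"],
        "Total": [100, 150, 120, 90],
    }
)


def yearly_route_totals(df: pd.DataFrame) -> pd.DataFrame:
    """Sum passengers per year and airport pair across months and airlines."""
    return (
        df.groupby(["Year", "usg_apt", "fg_apt"], as_index=False)["Total"]
        .sum()
        .sort_values(["Year", "Total"], ascending=[True, False])
        .reset_index(drop=True)
    )


route_totals = yearly_route_totals(passengers)
print(route_totals)
```

Adding carrier to the groupby keys would give per-airline totals for each route instead.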

Link to Dataset: https://www.kaggle.com/parulpandey/us-international-air-traffic-data


There is no better way to learn something than by doing, and the field of data science is no different. All these datasets are available on Kaggle and can be analyzed in Kaggle's Dockerized notebook environment, which means most of the libraries you would need for your analysis are already installed. The starter notebooks can help you get started quickly. You can begin by exploring one of the datasets and then turn your analysis into a blog post to share your results with the community.

Originally published here
