Originally published here
GitHub's contribution graph shows your repository contributions over the past year. A filled-up contribution graph is not only pleasing to the eye but also points to your hard work (unless you have hacked it). The graph, though pretty, also conveys considerable information about your activity. However, if you look closely, it is just a heatmap displaying some time-series data. Therefore, as a weekend activity, I tried to replicate the graph for some basic time-series data, and I'm sharing the process with you in this article.
Dataset and some preprocessing
The dataset that I'm going to use in this article comes from the Tabular Playground Series (TPS) competitions on Kaggle. The TPS competitions are month-long competitions launched on the 1st of every month. I'll be using the dataset from the July edition of the TPS competition.
The dataset is time-series data, where the task is to predict air pollution measurements over time based on basic weather information (temperature and humidity) and the input values of 5 sensors.
Let’s import the basic libraries and parse the dataset in pandas.
import pandas as pd
import numpy as np
import datetime as dt
from datetime import datetime

# Parse the date_time column as datetimes while reading the CSV
data = pd.read_csv('train.csv', parse_dates=['date_time'])
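As a quick sanity check on what parse_dates does, here is a minimal sketch using a tiny in-memory CSV instead of the real train.csv. The rows and values are invented for illustration; only the date_time and deg_C column names come from the dataset used in this article.

```python
import io
import pandas as pd

# Tiny made-up sample shaped like train.csv (values are invented).
csv_text = """date_time,deg_C
2010-03-10 18:00:00,13.1
2010-03-10 19:00:00,13.2
2010-03-10 20:00:00,12.6
"""

sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["date_time"])

# With parse_dates the column holds datetimes rather than plain strings,
# which is what makes the .dt accessor available later on.
print(sample.dtypes)
print(sample["date_time"].dt.day_name().tolist())
```

Without parse_dates, the column would be read as strings and calls such as .dt.year would raise an error.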
This dataset is decent enough for our purpose. Let's get to work.
Creating a basic heat map using the Seaborn library
Seaborn is a statistical data visualization library in Python. It is based on matplotlib but has some great default themes and plotting options. Creating a heatmap essentially means replacing numbers with colors. To be more precise, it means plotting the data as a color-encoded matrix. Let's see how we can achieve this via code. But before that, we will have to convert our data into the desired format.
# Importing the seaborn library along with other dependencies
import seaborn as sns
import matplotlib.pyplot as plt
import datetime as dt
from datetime import datetime

# Creating new features from the data
data['year'] = data.date_time.dt.year
data['month'] = data.date_time.dt.month
data['Weekday'] = data.date_time.dt.day_name()
Next, we subset the data to include only the year 2010 and discard all columns except month, Weekday, and deg_C. We'll then pivot the dataset to get a matrix-like structure.
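# Keep only 2010 and the columns needed for the pivot
data_2010 = data[data['year'] == 2010]
data_2010 = data_2010[['month', 'Weekday', 'deg_C']]

# Rows are weekdays, columns are months, cells are mean temperatures
pivoted_data = pd.pivot_table(data_2010, values='deg_C', index=['Weekday'], columns=['month'], aggfunc='mean')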
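To see what the pivot produces, here is a small sketch on synthetic daily data (random values, not the Kaggle dataset). It also shows one optional refinement that is my own addition: pivot_table sorts the Weekday index alphabetically, so I reindex the rows into calendar order. The names toy, pivoted, and weekday_order are made up for this example.

```python
import numpy as np
import pandas as pd

# Synthetic daily temperatures for 2010 (random values, illustration only).
rng = np.random.default_rng(0)
dates = pd.date_range("2010-01-01", "2010-12-31")  # daily by default
toy = pd.DataFrame({
    "date_time": dates,
    "deg_C": rng.normal(15, 8, size=len(dates)),
})
toy["month"] = toy.date_time.dt.month
toy["Weekday"] = toy.date_time.dt.day_name()

pivoted = pd.pivot_table(toy, values="deg_C", index="Weekday",
                         columns="month", aggfunc="mean")

# pivot_table sorts the index alphabetically (Friday, Monday, ...),
# so reindex to get Monday-to-Sunday order before plotting.
weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]
pivoted = pivoted.reindex(weekday_order)

print(pivoted.shape)  # 7 weekday rows by 12 month columns
```

The same reindex call could be applied to pivoted_data if you prefer the rows in calendar order.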
Now that our data is in the form of a matrix, plotting a heatmap with seaborn is a piece of cake.
plt.figure(figsize = (16,6))
sns.heatmap(pivoted_data, linewidths=5, cmap='YlGn',linecolor='white', square=True)
The heatmap displays the average temperature (in degrees Celsius) for each weekday and month of 2010. We can clearly see that July was the hottest month of that year. To emulate GitHub's contribution plot, some parameters have been used:
- pivoted_data: the dataset used.
- linewidths: the width of the lines that divide each cell.
- cmap: the colormap; 'YlGn' runs from yellow to green.
- linecolor: the color of the lines dividing the cells.
- square: ensures each cell is square-shaped.
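To make the idea of "replacing the numbers with colors" concrete, here is a rough sketch of a color encoding in plain Python. This is not how seaborn does it internally (a colormap like 'YlGn' interpolates continuously); instead it buckets values into five discrete shades, similar in spirit to a contribution graph. The function value_to_level, the PALETTE list, and the temperatures are all hypothetical and chosen only for illustration.

```python
def value_to_level(value, vmin, vmax, n_levels=5):
    """Map a number to one of n_levels discrete shades (0 = lightest)."""
    if vmax == vmin:
        return 0
    fraction = (value - vmin) / (vmax - vmin)
    fraction = min(max(fraction, 0.0), 1.0)  # clamp into [0, 1]
    # The top value would land on n_levels, so cap it at the last shade.
    return min(int(fraction * n_levels), n_levels - 1)

# Illustrative five-step green palette, lightest to darkest.
PALETTE = ["#ebedf0", "#9be9a8", "#40c463", "#30a14e", "#216e39"]

temps = [2.0, 9.5, 14.0, 21.3, 30.0]  # made-up temperatures
levels = [value_to_level(t, min(temps), max(temps)) for t in temps]
colors = [PALETTE[level] for level in levels]

print(levels)  # [0, 1, 2, 3, 4]
print(colors)
```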
This was a good attempt, but there is still scope for improvement. We are still not close to GitHub's contribution plot. Let's give it another try with another library.
Creating Calendar heatmaps using calmap
Instead of tinkering with seaborn, we can use calmap, a dedicated Python library. It creates beautiful calendar heatmaps from time-series data along the lines of GitHub's contribution plot, and that too in a single line of code.
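What makes a calendar heatmap different from the pivot above is the layout: one row per weekday and one column per week of the year. Here is a minimal sketch of that mapping using only the standard library. It illustrates the general idea and is not calmap's actual implementation; grid_position is a hypothetical helper.

```python
from datetime import date

def grid_position(day, year):
    """Return (row, column) of a date in a weekday-by-week grid for `year`.

    Rows are weekdays (0 = Monday ... 6 = Sunday); columns are weeks,
    counted from the week that contains January 1st.
    """
    jan1 = date(year, 1, 1)
    row = day.weekday()
    # Shift by Jan 1st's weekday so a new column starts every Monday.
    column = ((day - jan1).days + jan1.weekday()) // 7
    return row, column

print(grid_position(date(2010, 1, 1), 2010))    # (4, 0): a Friday in the first column
print(grid_position(date(2010, 1, 4), 2010))    # (0, 1): the first Monday starts column 1
print(grid_position(date(2010, 12, 31), 2010))  # (4, 52): the last column
```

So a full year fits into a grid of 7 rows and 53 columns, and each cell is colored by that day's value.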
# Installing the calmap library
pip install calmap

# Importing the library
import calmap
We'll use the same dataset as in the previous section and use the yearplot() function to create the plot.
# Setting the date_time column as the index
data = data.set_index('date_time')

# Plotting the calendar heatmap for the year 2010
plt.figure(figsize=(20, 10))
calmap.yearplot(data['deg_C'], cmap='YlGn', fillcolor='lightgrey', daylabels='MTWTFSS', dayticks=[0, 2, 4, 6], linewidth=2)
Above, we have customized the linewidth and the fillcolor, i.e., the color to use for days without data. You can set these values as per your requirements. More information can be obtained from the documentation.
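One thing worth checking is how the values get aggregated per day. As far as I recall from the calmap documentation, yearplot resamples the series by day through its how parameter, which defaults to 'sum'. If the readings are more frequent than daily (I'm assuming hourly here), a daily sum of temperatures is probably not what you want. The sketch below uses made-up hourly values to show the difference between a daily sum and a daily mean in pandas.

```python
import numpy as np
import pandas as pd

# Synthetic hourly readings for three days (values are made up:
# each day simply counts 0, 1, ..., 23).
hours = pd.date_range("2010-03-10", periods=72, freq=pd.Timedelta(hours=1))
hourly = pd.Series(np.tile(np.arange(24, dtype=float), 3),
                   index=hours, name="deg_C")

daily_sum = hourly.resample("D").sum()
daily_mean = hourly.resample("D").mean()

print(daily_sum.iloc[0])   # 276.0, the sum of 0..23
print(daily_mean.iloc[0])  # 11.5, the mean of 0..23
```

A daily mean series like this could be passed to yearplot instead of the raw column. Whether to also pass how=None or how='mean' depends on the calmap and pandas versions you have installed, so treat that part as an assumption to verify.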
It is also possible to plot all years as subplots into one figure using the
As you can see, there isn’t much data for 2011, but I’m sure you have got the idea.
Heatmaps are useful visualization tools and help convey a pattern by giving a perspective of depth using colors. They help visualize the concentration of values across the two dimensions of a matrix in a way that is more obvious to the human eye than mere numbers. This article shows you how to brighten and jazz up your heatmaps and have fun while creating them.