Learn how to cluster your data in Tableau easily
Consider a situation where you have sales data belonging to your company. Let’s say you want to discover patterns in your consumers’ spending capacity. If you could uncover distinct groups or associations within the data, your company could target each group based on its preferences. The basic idea behind this intuition is called clustering, and Tableau has a built-in feature that automatically clusters similar data points based on certain attributes. In this article, we will explore this functionality and see how to apply clustering to a real-world dataset.
What is Clustering?
Clustering, also known as cluster analysis, is an unsupervised machine learning technique that groups similar items together based on some similarity metric.
The figure below visualizes the working of the K-Means algorithm very intuitively. In K-Means clustering, the algorithm splits the dataset into k clusters, where every cluster has a centroid calculated as the mean value of all the points in that cluster. In the figure below, we start by randomly defining 4 centroid points. The algorithm then assigns each data point to its nearest centroid (cross). Each centroid then shifts to a new position, since the mean value of the data points assigned to it changes. This entire process is repeated until there is no further observable change in the centroids’ positions.
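The loop described above (assign each point to its nearest centroid, recompute each centroid as the mean of its points, repeat until nothing changes) can be sketched in a few lines of plain Python. This is only a minimal illustration of the general K-Means idea, not Tableau's implementation; the function names, the `initial_centroids` parameter, and the sample points are all made up for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, initial_centroids=None, max_iter=100, seed=0):
    """Plain K-Means: returns (labels, centroids).

    If initial_centroids is None, k data points are picked at random
    as the starting centroids (an illustrative choice).
    """
    if initial_centroids is None:
        centroids = random.Random(seed).sample(points, k)
    else:
        centroids = list(initial_centroids)

    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # no assignment changed, so the centroids are stable
        labels = new_labels

        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_point(members)
    return labels, centroids


# Hypothetical two-dimensional data with two obvious groups.
sample_points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
sample_labels, sample_centroids = k_means(
    sample_points, k=2, initial_centroids=[(1, 1), (1, 2)]
)
print(sample_labels, sample_centroids)
```

Even though both starting centroids sit inside the same group here, the second one is pulled toward the far group after the first update, which is the "centroid shifts" behaviour described in the text.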
Clustering in Tableau
Tableau also uses the K-Means clustering algorithm under the hood. It uses the Calinski-Harabasz criterion to assess cluster quality. The Calinski-Harabasz criterion is defined as:
$$\mathrm{CH} = \frac{SS_B / (k - 1)}{SS_W / (N - k)}$$
Here $SS_B$ is the overall between-cluster variance, $SS_W$ the overall within-cluster variance, $k$ the number of clusters, and $N$ the number of observations.
This ratio measures how cohesive and how well separated the clusters are. A higher value indicates a better clustering: points within a cluster lie close together (low within-cluster distance), while the clusters themselves lie far apart (high between-cluster distance).
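To make the ratio concrete, here is a small sketch that computes it for a labeled set of points. It assumes the usual definitions, which the article does not spell out: the between-cluster term (SSB) is the size-weighted sum of squared distances from each cluster centroid to the overall mean, and the within-cluster term (SSW) is the sum of squared distances from each point to its own centroid. The function names and data are illustrative, and this is not Tableau's own code.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def calinski_harabasz(points, labels):
    """(SSB / (k - 1)) / (SSW / (N - k)) for the given cluster labels."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    n = len(points)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("need at least 2 clusters and fewer clusters than points")

    overall_mean = mean_point(points)
    ssb = 0.0  # between-cluster sum of squares
    ssw = 0.0  # within-cluster sum of squares
    for members in clusters.values():
        centroid = mean_point(members)
        ssb += len(members) * squared_distance(centroid, overall_mean)
        ssw += sum(squared_distance(p, centroid) for p in members)

    if ssw == 0:
        return float("inf")  # every point sits exactly on its centroid
    return (ssb / (k - 1)) / (ssw / (n - k))


# Hypothetical data: a sensible labeling scores far higher than a mixed-up one.
demo_points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
good_score = calinski_harabasz(demo_points, [0, 0, 0, 1, 1, 1])
poor_score = calinski_harabasz(demo_points, [0, 1, 0, 1, 0, 1])
print(good_score, poor_score)
```

Comparing the two scores shows why a higher value is preferred: the labeling that keeps nearby points together has a small within-cluster term and a large between-cluster term.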
Now that we have an idea of what clustering is, it is time to look at how it can be applied in Tableau.
Using clustering to uncover patterns in the dataset
Clustering helps to uncover patterns in a dataset. Suppose that you are an analyst at a tourism company. For the company, it would be useful to understand the patterns in people’s traveling habits, such as which age group likes to travel more. Your job is to use the World Indicators sample data to identify the countries that have enough of the right kind of customers.
In this tutorial, we will be working with Tableau Public, which is free. Download Tableau Public from the official website and follow the installation instructions. If the following screen appears when you click the Tableau icon, you are good to go.
Connecting to the Dataset
The World Economic Indicators dataset consists of useful indicators driving the economies of the various countries of the world like life expectancy, ease of doing business, population, etc. The dataset has been obtained from the United Nations website. The dataset can be accessed from here.
- Download the dataset to your system.
- Import the data into the Tableau workspace from your computer. Use the Data Interpreter, found under the Sheets tab, to clean and realign the data.
Formatting the Data Source
In the worksheet, the columns from your data source are shown as fields in the Data pane on the left side. The Data pane contains a variety of fields organized by table. Many of these fields can be grouped together under a single folder, which also makes the data fields easier to browse.
- Select Business Tax Rate, Days to Start Business, Ease of Business, Hours to do Tax, and Lending Interest, then choose Folders > Create Folder.
- Name the folder Business. All the above fields are now included in this folder.
- Similarly, create three more folders, Population among them, in the same way as shown above, and add the relevant fields to each. This is how the Data pane will look after the formatting:
- Double-click Country in the Data pane. Tableau creates a map view with a filled circle representing each country. On the Marks card, change the mark type to Map.
Identifying the variables for clustering
The next step in clustering is to identify the variables that the clustering algorithm will use. In Tableau, variables correspond to fields. There is no single best set of variables that yields ideal clusters, so you can experiment with several combinations until you get useful results. In our case, let’s work with the following fields:
- Population Urban
Urban population is a good indicator of population density in a country. The higher the density, the more business opportunities become available.
- Population 65+
The population aged over 65 represents senior citizens. Many senior citizens like traveling, so this could be a useful indicator.
- Life Expectancy Female and Life Expectancy Male
In countries with a higher life expectancy, people tend to live longer and are likely to be more interested in traveling.
- Tourism Per Capita
This field doesn’t exist in the dataset, but it can be created as a calculated field from the Tourism Outbound and Population Total fields as follows:
Tourism Per Capita = SUM([Tourism Outbound])/SUM([Population Total])
Tourism Outbound represents the money (in US dollars) that people spend annually on international travel. To get the average spending per person, we divide this field by the population of each country.
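For readers who want to check the arithmetic outside Tableau, the calculated field can be mimicked in plain Python. The sketch below assumes one row per country per year and sums both fields per country before dividing, which mirrors the SUM(...)/SUM(...) form of the formula when the view is broken down by country. The country name, the dictionary keys, and the figures are invented for illustration.

```python
from collections import defaultdict


def tourism_per_capita(rows):
    """Per country: sum of outbound spending divided by sum of population."""
    outbound = defaultdict(float)
    population = defaultdict(float)
    for row in rows:
        outbound[row["country"]] += row["tourism_outbound"]
        population[row["country"]] += row["population_total"]
    # Skip countries with no recorded population to avoid dividing by zero.
    return {
        country: outbound[country] / population[country]
        for country in outbound
        if population[country] > 0
    }


# Hypothetical rows (not taken from the real dataset).
example_rows = [
    {"country": "Examplestan", "year": 2011,
     "tourism_outbound": 100.0, "population_total": 10.0},
    {"country": "Examplestan", "year": 2012,
     "tourism_outbound": 300.0, "population_total": 10.0},
]
print(tourism_per_capita(example_rows))
```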
Adding a selected field to the view
Before moving ahead, we need to change the default aggregation from SUM to AVERAGE. Tableau makes it possible to aggregate measures or dimensions, though aggregating measures is more common. Whenever we add a measure to the view, an aggregation is applied to that measure by default. The type of aggregation to use depends on the context of the view.
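A tiny example shows why the choice of aggregation matters. Assuming the data holds one row per country per year (an assumption about the dataset layout, with invented numbers), adding up a measure such as life expectancy across years gives a meaningless total, while averaging gives a representative value per country. The `aggregate` helper is purely illustrative and is not a Tableau function.

```python
from statistics import mean


def aggregate(values, how):
    """Apply a named aggregation to a list of yearly values."""
    if how == "SUM":
        return sum(values)
    if how == "AVG":
        return mean(values)
    raise ValueError(f"unsupported aggregation: {how}")


# Hypothetical life expectancy values for one country over three years.
life_expectancy_by_year = [70, 71, 72]

print(aggregate(life_expectancy_by_year, "SUM"))  # total across years
print(aggregate(life_expectancy_by_year, "AVG"))  # typical value per year
```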
Change the aggregation for all the selected fields and then drag them onto Detail on the Marks card as follows:
Clustering in Tableau is a simple drag and drop process. The following steps outline the clustering process:
- Click on the Analytics pane and drag Cluster onto the view. Tableau clusters the data automatically. It is that simple.
- Although Tableau can automatically decide the number of clusters to create, we can also control the number of clusters and the variables used to compute them. Drag a field into the box to include it in the clustering algorithm, or drag it out to exclude it.
- We shall go with 4 clusters and the default variables. Note that some countries did not fall into any cluster and have been marked as Not Clustered.
- The cluster is created as a new pill and can be seen on the Color shelf. Drag this pill onto the Data pane to save it as a group.
So here, we have clustered the countries in relation to the chosen measures. But how do we make sense of these results, and how do we make business decisions based on the clusters? The next section addresses these concerns.
Click on the Clusters field in the Marks card and click on the Describe Clusters option.
This displays a document that contains a detailed description of the clusters. There are two tabs in the document: Summary and Models.
The Summary tab gives a summary of the results and the average values of each variable for every cluster.
From the above results, we can infer that Cluster 2 has:
- Highest average life expectancy for both males and females
- Highest tourism per capita
- Highest average urban population
This means it has a wealthy urban population with a higher life expectancy and seems to be a good market for the senior tourism industry. Let us see which countries are included in this cluster.
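If the cluster assignments are exported from Tableau, the same kind of reading (average of each variable per cluster, then the members of the most promising cluster) can be reproduced in a few lines. The sketch below is an illustration under assumptions: the record layout, field names, country names, and values are all invented and are not the results shown in the article.

```python
from collections import defaultdict


def summarize_clusters(records, variables):
    """Average of each variable within each cluster."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["cluster"]].append(record)
    return {
        cluster: {
            var: sum(m[var] for m in members) / len(members)
            for var in variables
        }
        for cluster, members in grouped.items()
    }


def countries_in_cluster(records, cluster):
    """Alphabetical list of the countries assigned to one cluster."""
    return sorted(r["country"] for r in records if r["cluster"] == cluster)


# Hypothetical exported rows: one per country, with its cluster label.
exported = [
    {"country": "A", "cluster": 1, "life_expectancy": 60.0, "tourism_per_capita": 50.0},
    {"country": "B", "cluster": 1, "life_expectancy": 64.0, "tourism_per_capita": 70.0},
    {"country": "C", "cluster": 2, "life_expectancy": 80.0, "tourism_per_capita": 900.0},
    {"country": "D", "cluster": 2, "life_expectancy": 82.0, "tourism_per_capita": 1100.0},
]

summary = summarize_clusters(exported, ["life_expectancy", "tourism_per_capita"])
# Pick the cluster with the highest average tourism spending per person.
best = max(summary, key=lambda c: summary[c]["tourism_per_capita"])
print(best, summary[best], countries_in_cluster(exported, best))
```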
The Models tab displays various statistical values for each variable’s average and shows their statistical significance. You can read more about the cluster model statistics here.
Thus, as an analyst, you can present this list to the sales team so that they can focus on these prospective clients. Clustering has provided us with some useful insights. From here, you can experiment with different fields, set a threshold for population or income, and so on. There are many ways to cluster the data, but the basic principle stays the same.
In this article, we learned how to perform cluster analysis on a given dataset in Tableau with a simple drag-and-drop mechanism. Clustering is an essential tool and, when coupled with Tableau, puts the power of a statistical analysis technique in analysts’ hands.
References and further study
Find Clusters in Data — A self guide by Tableau which goes deeper into the concept of cluster analysis.
Originally published here