Getting more value from Pandas’ value_counts()

Originally posted at Towards Data Science on Medium

Data exploration is an important part of the machine learning pipeline. Before we decide which models to train, and how many, we need an idea of what our data contains. The Pandas library has a number of useful functions for this purpose, and value_counts is one of them. It returns the count of each unique value in a pandas Series, such as a single column of a dataframe. Most of the time, however, we end up using value_counts with its default parameters. So in this short article, I’ll show you how to get more out of it by changing those defaults.

value_counts()

The value_counts() method returns a Series containing the counts of unique values. This means that, for any column in a dataframe, it returns how many times each unique entry occurs in that column.
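As a minimal sketch of this behaviour, the snippet below uses a small made-up Series (the name ports and its values are illustrative assumptions, not the Titanic data) to show that the unique values become the index of the result and the counts become its values.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the behaviour
ports = pd.Series(["S", "C", "S", "Q", "S", "C"])

counts = ports.value_counts()

# The result is a Series: unique values form the index, counts form the values
print(counts)

# Individual counts can be looked up by label
print(counts["S"])

# With the defaults, the most frequent value comes first
print(counts.index.tolist())
```

Because the result is an ordinary Series, it can be indexed, sorted, or plotted like any other.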

Syntax

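Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)

Parameters

normalize: if True, return relative frequencies instead of counts.
sort: if True (the default), sort the result by frequency.
ascending: if True, sort in ascending order; the default is descending.
bins: group numeric values into the given number of intervals instead of counting each distinct value.
dropna: if True (the default), exclude NaN from the result.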

Basic usage

Let’s see the basic usage of this method on a dataset. I’ll be using the Titanic dataset for the demo. I have also published an accompanying notebook on Kaggle, in case you want to go straight to the code.

Importing the dataset

Let’s begin by importing the necessary libraries and the dataset. This is a fundamental step in every data analysis process.

# Importing necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Reading in the data
train = pd.read_csv('../input/titanic/train.csv')

Explore the first few rows of the dataset

train.head()

Calculating the number of null values

train.isnull().sum()

Thus, the Age, Cabin and Embarked columns have null values. With this, we have a basic idea of what our dataset looks like. Let’s now see how we can use value_counts() in five different ways to explore this data further.

1. value_counts() with default parameters

Let’s call value_counts() on the Embarked column of the dataset. This returns the number of times each unique value occurs in this column.

train['Embarked'].value_counts()
-------------------------------------------------------------------

S      644
C      168
Q       77

The method returns the count of each unique value in the column, in descending order and excluding null values. We can quickly see that most passengers embarked from Southampton, followed by Cherbourg and then Queenstown.

2. value_counts() with relative frequencies of the unique values

Sometimes a percentage is a better criterion than the raw count. By setting normalize=True, the object returned will contain the relative frequencies of the unique values. The normalize parameter is set to False by default.

train['Embarked'].value_counts(normalize=True)
-------------------------------------------------------------------

S    0.724409
C    0.188976
Q    0.086614

Knowing that 72% of passengers embarked from Southampton is often more informative than knowing that 644 passengers did.
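One detail worth keeping in mind is that, with the default dropna=True, the fractions are computed over the non-null entries only. The sketch below illustrates this on a small made-up Series (the data is an assumption chosen for illustration) and converts the fractions into rounded percentages.

```python
import pandas as pd

# Hypothetical toy data with one missing value
ports = pd.Series(["S", "S", "S", "C", None])

# Relative frequencies; the missing value is dropped before normalizing,
# so the fractions are computed over the 4 non-null entries
shares = ports.value_counts(normalize=True)

# Express the shares as percentages rounded to one decimal place
percentages = (shares * 100).round(1)
print(percentages)
```

The same applies to the Titanic output above: 0.724409 is 644 divided by the 889 passengers whose port of embarkation is known.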

3. value_counts() in ascending order

The Series returned by value_counts() is in descending order by default. We can reverse the order by setting the ascending parameter to True.

train['Embarked'].value_counts(ascending=True)
-------------------------------------------------------------------

Q     77
C    168
S    644
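If the goal is to order the result by label rather than by frequency, one option is to chain sort_index() onto the result. The following is a small sketch on made-up data (the Series is an illustrative assumption) that contrasts the two orderings.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the orderings
ports = pd.Series(["S", "C", "S", "Q", "S", "C"])

# Rarest value first
ascending_counts = ports.value_counts(ascending=True)
print(ascending_counts)

# Ordered alphabetically by label instead of by frequency
by_label = ports.value_counts().sort_index()
print(by_label)
```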

4. value_counts() displaying the NaN values

By default, null values are excluded from the result. They can be included by setting the dropna parameter to False.

train['Embarked'].value_counts(dropna=False)
-------------------------------------------------------------------

S      644
C      168
Q       77
NaN      2

We can easily see that there are two null values in the column.
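A quick way to sanity-check the null count is to compare it with isnull().sum(), which was used earlier. The sketch below does this on a small made-up Series (the data is an assumption for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values
ports = pd.Series(["S", "S", "S", "C", np.nan, np.nan])

without_nan = ports.value_counts()            # NaN excluded (default)
with_nan = ports.value_counts(dropna=False)   # NaN gets its own row

# The difference between the two totals is the number of missing values
n_missing_from_counts = with_nan.sum() - without_nan.sum()
n_missing = ports.isnull().sum()

print(with_nan)
print(n_missing_from_counts, n_missing)
```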

5. value_counts() to bin continuous data into discrete intervals

This is one of my favorite uses of value_counts(), and an underutilized one too. value_counts() can bin continuous data into discrete intervals with the help of the bins parameter. This option works only with numerical data. It is similar to the pd.cut function. Let’s see how it works using the Fare column.

# applying value_counts on a numerical column without the bins parameter

train['Fare'].value_counts()

This doesn’t convey much information, as the output contains a separate row for every distinct value of Fare. Instead, let’s group the fares into seven bins.

train['Fare'].value_counts(bins=7)

Binning makes the distribution much easier to read. We can easily see that most passengers paid 73.19 or less for their ticket. We can also see that five bins would have served our purpose, since no passenger falls into the last two bins of the output.
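To make the relationship with pd.cut concrete, here is a rough sketch on made-up fares (the values and the choice of three bins are assumptions for illustration). It bins the data with value_counts(bins=...) and then reproduces the same counts by calling pd.cut first and counting the resulting categories.

```python
import pandas as pd

# Hypothetical fares, chosen only for illustration
fares = pd.Series([5, 10, 15, 40, 45, 80, 95])

# Counts per equal-width interval; sort=False keeps the bins in interval order
binned = fares.value_counts(bins=3, sort=False)
print(binned)

# Roughly equivalent: cut into intervals first, then count each category
cut_counts = pd.cut(fares, bins=3, include_lowest=True).value_counts(sort=False)
print(cut_counts)
```

Passing sort=False is a design choice here: it lists the intervals from lowest to highest, which is usually easier to read for binned data than ordering by frequency.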

Thus, we can see that value_counts() is a handy tool, and we can do some interesting analysis with this single line of code.

References

pandas.Series.value_counts documentation