Loading large datasets in Pandas

The pandas library is a vital member of the data science ecosystem. However, because it cannot analyze datasets larger than memory, it is a little tricky to use for big data. Consider a situation where we want to analyze a large dataset using only pandas. What kind of problems can we run into? For instance, let’s take a 3GB file of yellow taxi trip data for March 2016. To perform any sort of analysis, we have to load it into memory. We readily use the pandas read_csv() function to read the file as follows:

import pandas as pd
df = pd.read_csv('yellow_tripdata_2016-03.csv')

When I ran the cell/file, my system threw the following MemoryError. (Whether you see this error depends on the memory capacity of the system you are using.)

[Image: the memory error raised by read_csv(). Image by Author]

Any Alternatives?

Before criticizing pandas, it is important to understand that pandas is not always the right tool for every task. Pandas lacks multiprocessing support, and other libraries are better at handling big data. One such alternative is Dask, which provides a pandas-like API for working with larger-than-memory datasets. Even the pandas documentation explicitly mentions that for big data:

it’s worth considering not using pandas. Pandas isn’t the right tool for all situations.

In this article, however, we shall look at a method called chunking, by which you can load larger-than-memory datasets in pandas. This method can sometimes offer a workable way around the out-of-memory problem in pandas, but it may not work all the time, as we shall see later in the article. Essentially, we will look at two ways to import large datasets in Python:

  • Using pd.read_csv() with chunksize
  • Using SQL and pandas
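As a rough illustration of the two approaches listed above, here is a minimal sketch. It is not the exact code from the full article: the helper names (chunked_sum, csv_to_sqlite), the column names, the table name, and the tiny stand-in CSV are all assumptions made so that the example runs anywhere. With the real taxi file you would pass its path and a much larger chunksize.

```python
import os
import sqlite3
import tempfile

import pandas as pd


def chunked_sum(csv_path, column, chunksize=100_000):
    """Sum one numeric column while holding only one chunk in memory."""
    total, rows = 0.0, 0
    # With chunksize set, read_csv yields DataFrames of at most that many rows
    for chunk in pd.read_csv(csv_path, usecols=[column], chunksize=chunksize):
        total += float(chunk[column].sum())
        rows += len(chunk)
    return total, rows


def csv_to_sqlite(csv_path, conn, table, chunksize=100_000):
    """Append a CSV to a SQLite table chunk by chunk (hypothetical helper)."""
    for chunk in pd.read_csv(csv_path, chunksize=chunksize):
        chunk.to_sql(table, conn, if_exists="append", index=False)


# Tiny stand-in for the taxi file; the column names here are assumptions
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo_trips.csv")
pd.DataFrame(
    {
        "passenger_count": [1, 2, 1, 3, 1],
        "trip_distance": [1.0, 2.5, 3.0, 4.5, 5.0],
    }
).to_csv(demo_path, index=False)

# Way 1: aggregate directly over the chunks
total, rows = chunked_sum(demo_path, "trip_distance", chunksize=2)
print(total, rows)

# Way 2: load the chunks into SQLite, then pull back only what is needed
conn = sqlite3.connect(":memory:")
csv_to_sqlite(demo_path, conn, "trips", chunksize=2)
solo = pd.read_sql_query(
    "SELECT COUNT(*) AS n FROM trips WHERE passenger_count = 1", conn
)
conn.close()
print(solo)
```

In both cases only one chunk is held in memory at a time. The SQL route has the added benefit that the filtering or aggregation happens inside the database, so only the query result comes back as a DataFrame.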
