The pandas library is a vital member of the data science ecosystem. However, because it loads data fully into memory by default, it cannot directly analyze datasets larger than memory, which makes it tricky to use for big data. Consider a situation where we want to analyze a large dataset using only pandas. What kind of problems can we run into? For instance, let’s take a 3 GB file summarising yellow taxi trip data for March 2016. To perform any sort of analysis, we have to load it into memory. We would readily use the pandas read_csv() function to perform the read as follows:
import pandas as pd
df = pd.read_csv('yellow_tripdata_2016-03.csv')
When I ran the cell, my system threw a MemoryError. (Whether you see this error depends on the memory capacity of the system you are using.)
Before criticizing pandas, it is important to understand that pandas is not always the right tool for every task. pandas lacks built-in multiprocessing support, and other libraries are better at handling big data. One such alternative is Dask, which provides a pandas-like API for working with larger-than-memory datasets. Even the pandas documentation explicitly mentions that for big data:
it’s worth considering not using pandas. Pandas isn’t the right tool for all situations.
In this article, however, we shall look at a method called chunking, by which you can load out-of-memory datasets in pandas piece by piece. This method can sometimes offer a healthy way out of the out-of-memory problem in pandas, but it may not work all the time, as we shall see later in the article. Essentially, we will look at two ways to import large datasets in Python:
- Using SQL and pandas
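As a rough illustration of the chunking idea described above, here is a minimal sketch that passes the `chunksize` parameter to `read_csv()`, so that pandas returns an iterator of smaller DataFrames instead of one large one. To keep the example runnable anywhere, it writes a tiny stand-in CSV first. The file name `trips_sample.csv` and the columns `passenger_count` and `trip_distance` are assumptions made for illustration, not necessarily the schema of the real taxi file, and the chunk size of 4 is deliberately tiny.

```python
import os
import tempfile

import pandas as pd

# Build a small stand-in CSV so the sketch runs anywhere. With the real
# data you would point read_csv at 'yellow_tripdata_2016-03.csv' instead.
# The column names below are illustrative assumptions.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "trips_sample.csv")
sample = pd.DataFrame({
    "passenger_count": [1, 2, 1, 3, 1, 2, 4, 1, 2, 1],
    "trip_distance": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
})
sample.to_csv(csv_path, index=False)

total_rows = 0
total_distance = 0.0
counts = None

# With chunksize set, read_csv yields DataFrames of at most 4 rows each,
# so only one chunk needs to be held in memory at a time.
for chunk in pd.read_csv(csv_path, chunksize=4):
    total_rows += len(chunk)
    total_distance += chunk["trip_distance"].sum()

    # Combine per-chunk value counts into a running total.
    chunk_counts = chunk["passenger_count"].value_counts()
    if counts is None:
        counts = chunk_counts
    else:
        counts = counts.add(chunk_counts, fill_value=0)

passenger_counts = counts.astype(int).sort_index()
mean_distance = total_distance / total_rows

print(total_rows)
print(mean_distance)
print(passenger_counts)
```

The key design point is that each chunk is reduced to a small summary (a row count, a sum, a value count) before the next one is read. For a multi-gigabyte file, a much larger chunk size (for example, hundreds of thousands of rows) would typically be more practical. Operations that need the whole dataset at once, such as a global sort, do not fit this pattern as neatly.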