Chunksize in Pandas

The pandas library in Python allows us to work with DataFrames. Data is organized into rows and columns in a DataFrame.

We can read data from multiple sources into a DataFrame.

In real-life situations, we often deal with datasets that contain thousands or even millions of rows. How such a dataset is read into a DataFrame depends on its source.

Chunksize in Pandas

Sometimes, we use the chunksize parameter while reading large datasets to divide the dataset into chunks of data. We specify the size of these chunks with the chunksize parameter.

Because only one chunk is held in memory at a time, this lowers peak memory usage and lets us process files that would not fit in memory all at once.

First, let us read a CSV file without using the chunksize parameter in the read_csv() function. In our example, we will read a sample dataset containing movie ratings.

import pandas as pd
df = pd.read_csv('ratings.csv')
print(df.shape)
print(df.info)

Output:

(25000095, 4)
<bound method DataFrame.info of           userId  movieId  rating   timestamp
0              1      296     5.0  1147880044
1              1      306     3.5  1147868817
2              1      307     5.0  1147868828
3              1      665     5.0  1147878820
4              1      899     3.5  1147868510
...          ...      ...     ...         ...
25000090  162541    50872     4.5  1240953372
25000091  162541    55768     2.5  1240951998
25000092  162541    56176     2.0  1240950697
25000093  162541    58559     4.0  1240953434
25000094  162541    63876     5.0  1240952515

[25000095 rows x 4 columns]>

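In the above example, we read the given dataset and display its details. The shape attribute returns the number of rows and columns, 25000095 and 4, respectively.

We also print df.info. Because info is a method and is referenced here without parentheses, the output is the bound method's representation, which includes a preview of the DataFrame. Calling df.info() instead prints a summary of the columns, their data types, and memory usage.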
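
As a minimal sketch of the difference, the example below calls info() with parentheses on a small DataFrame. The DataFrame is built by hand from the first rows shown above purely for illustration, and the memory_usage() call is an addition that is not part of the original example.

```python
import pandas as pd

# Small illustrative DataFrame; the real ratings.csv is far larger.
df = pd.DataFrame({
    "userId": [1, 1, 1],
    "movieId": [296, 306, 307],
    "rating": [5.0, 3.5, 5.0],
    "timestamp": [1147880044, 1147868817, 1147868828],
})

# info() is a method: calling it prints the columns, dtypes and memory usage.
df.info()

# memory_usage() reports bytes per column (plus the index);
# the sum approximates the DataFrame's in-memory footprint.
total_bytes = int(df.memory_usage(deep=True).sum())
print(total_bytes)
```

On the full dataset, the same memory_usage() call gives a rough idea of how much memory the complete DataFrame occupies.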

We can see that this dataset contains 25000095 rows, and loading all of them at once takes a lot of the computer's memory. In such cases, we can use the chunksize parameter.

For this, let us first understand what iterators are in Python.

An iterable, such as a list, can be looped over using a for loop. Internally, the for loop calls the iter() function on the object to create an iterator.

We can then retrieve the elements one at a time with the next() function.
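
The following sketch shows roughly what a for loop does behind the scenes. The list of ratings is a made-up example.

```python
ratings = [5.0, 3.5, 5.0]

# iter() returns an iterator over the list.
it = iter(ratings)

# Each next() call returns the following element.
first = next(it)
second = next(it)
third = next(it)

# Once the iterator is exhausted, next() raises StopIteration.
# A for loop catches this internally and simply stops.
try:
    next(it)
    exhausted = False
except StopIteration:
    exhausted = True

print(first, second, third, exhausted)
```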

When we use the chunksize parameter, read_csv() returns an iterator instead of a DataFrame. We can iterate through this object to get the chunks one at a time.

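import pandas as pd

# With chunksize set, read_csv() returns an iterator of DataFrames
chunks = pd.read_csv('ratings.csv', chunksize=10000000)
for chunk in chunks:
    # Each chunk is a DataFrame with at most 10000000 rows
    print(chunk.shape)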

Output:

(10000000, 4)
(10000000, 4)
(5000095, 4)

In the above example, we pass a value to the chunksize parameter, and the dataset is read in chunks with the given number of rows. For our dataset, we get three chunks when we set chunksize to 10000000; the last chunk holds the remaining 5000095 rows.

The returned object is not a DataFrame but rather a TextFileReader object (pandas.io.parsers.TextFileReader; the exact module path can differ between pandas versions).
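
A short sketch can confirm this. It reads from an in-memory string instead of ratings.csv so that it runs on its own. The five rows and the chunksize of 2 are illustrative assumptions.

```python
import io
import pandas as pd

# In-memory stand-in for ratings.csv (first five rows only).
csv_text = """userId,movieId,rating,timestamp
1,296,5.0,1147880044
1,306,3.5,1147868817
1,307,5.0,1147868828
1,665,5.0,1147878820
1,899,3.5,1147868510
"""

reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)

# The reader itself is not a DataFrame.
is_dataframe = isinstance(reader, pd.DataFrame)
print(type(reader).__name__, is_dataframe)

# The reader is an iterator, so next() returns the first chunk, which is a DataFrame.
first_chunk = next(reader)
print(type(first_chunk).__name__, first_chunk.shape)
```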

We can iterate through the object and access the values. Note that the number of columns is the same for each chunk, which means that the chunksize parameter only splits the data by rows.
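
Since each chunk is an ordinary DataFrame, we can compute a result incrementally without ever holding the whole dataset in memory. The sketch below computes the mean rating this way. It uses a small in-memory sample rather than ratings.csv, and the chunksize of 2 is chosen only to produce several chunks.

```python
import io
import pandas as pd

# In-memory stand-in for ratings.csv (first five rows only).
csv_text = """userId,movieId,rating,timestamp
1,296,5.0,1147880044
1,306,3.5,1147868817
1,307,5.0,1147868828
1,665,5.0,1147878820
1,899,3.5,1147868510
"""

total = 0.0   # running sum of ratings
count = 0     # running number of rows
shapes = []   # shape of each chunk, for inspection

for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    shapes.append(chunk.shape)
    total += chunk["rating"].sum()
    count += len(chunk)

mean_rating = total / count
print(shapes)
print(mean_rating)
```

Only the running totals are kept between iterations, so memory usage is bounded by the size of one chunk.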

This parameter is also available in other functions that read data from other sources, such as pandas.read_json (where it requires lines=True), pandas.read_stata, pandas.read_sql_table, and pandas.read_sas. Check the official documentation before using this parameter to confirm that the function you need supports it.
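
As one hedged illustration, the sketch below uses chunksize with pandas.read_json. This function only supports chunksize for line-delimited JSON, so lines=True is required. The JSON records are made up for the example.

```python
import io
import pandas as pd

# Line-delimited JSON: one record per line (illustrative data).
json_lines = (
    '{"userId": 1, "rating": 5.0}\n'
    '{"userId": 1, "rating": 3.5}\n'
    '{"userId": 2, "rating": 4.0}\n'
)

# With lines=True and chunksize set, read_json() returns an iterator of DataFrames.
reader = pd.read_json(io.StringIO(json_lines), lines=True, chunksize=2)

sizes = [len(chunk) for chunk in reader]
print(sizes)
```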

Author: Manav Narula
Manav is an IT Professional who has a lot of experience as a core developer in many live projects. He is an avid learner who enjoys learning new things and sharing his findings whenever possible.