How to Vectorize a Function in Pandas

Hira Arif Feb 02, 2024

Vectorization means rewriting an operation so that it runs on a whole array at once rather than on one element at a time. In Python, this replaces an explicit loop with a single array expression, which usually runs much faster because the per-element work happens in optimized compiled code.

The Pandas library is a popular Python tool for data analysis and manipulation. Vectorization is commonly used in Pandas, as in numerical computing generally, to improve code performance.

A Pandas data frame is a two-dimensional, labeled data structure built on top of NumPy arrays, combining features of R data frames and Python dictionaries. Like a dictionary, it lets you look up a column by name, but it also arranges the data in rows and columns, like an Excel table or a database table, and adds data analysis and manipulation capabilities.
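As a quick illustration of that description, the sketch below builds a tiny data frame from a Python dictionary. The names and amounts are invented for illustration; only the column name TotalPay matches a column used later in this article.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: names and amounts are invented for illustration.
data = {
    "EmployeeName": ["A", "B", "C"],
    "TotalPay": [1000.0, 2000.0, 3000.0],
}
df_small = pd.DataFrame(data)

# Dictionary-style access returns one column as a pandas Series.
total_pay = df_small["TotalPay"]

# .values exposes the column as a NumPy array, which is what makes
# whole-column (vectorized) arithmetic possible.
pay_array = total_pay.values
print(type(pay_array), pay_array.dtype)
```

The same `.values` attribute is what the timing examples in this article operate on.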

Vectorize a Function in Pandas

Let’s install the pandas library, which provides data frames.

PS C:\> pip install pandas

To perform vectorization on a data frame, we first load one with pandas. The code below reads a CSV file into a data frame and enlarges it by concatenating 100 copies of it.

Example Code (saved in demo.py):

import pandas as pd

small_df = pd.read_csv("Salaries.csv")
df = pd.concat([small_df] * 100, ignore_index=True)

Now run the code below to count the rows of the data frame.

Example Code (saved in demo.py):

print(f"No of rows: {len(df)}")

OUTPUT (printed on console):

No of rows: 14865400
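If Salaries.csv is not at hand, a synthetic stand-in is enough to follow along. The sketch below assumes nothing about the real file except a numeric TotalPay column; the row count, seed, and value range are arbitrary choices, so row counts, timings, and printed arrays will differ from the ones shown in this article.

```python
import numpy as np
import pandas as pd

# Assumption: only a numeric "TotalPay" column is needed for the examples.
rng = np.random.default_rng(seed=0)
small_df = pd.DataFrame({"TotalPay": rng.uniform(0, 200_000, size=1_000)})

# Same enlargement step as above: stack 100 copies and renumber the index.
df = pd.concat([small_df] * 100, ignore_index=True)
print(f"No of rows: {len(df)}")
```

Because ignore_index=True is passed, the combined frame gets a fresh 0-based index instead of 100 repeated copies of the original one.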

Let’s measure the time taken by an operation performed on the data frame without vectorization by running the code below.

Example Code (saved in demo.py):

import time
import numpy as np

start_time = time.process_time()
pay_with_tax = np.zeros(len(df))
for idx, pay in enumerate(df.TotalPay.values):
    pay_with_tax[idx] = pay * 1.05 + 1
end_time = time.process_time()

print("Without using Vectorization")
print(f"pay_with_tax = {pay_with_tax}")
print(f"Computation time = {(1000*(end_time - start_time ))}ms")

The function np.zeros() takes len(df) as the size and creates an array of zeros of that size. The for loop iterates over the TotalPay column of the data frame; on each iteration, enumerate() supplies the position as idx and the value as pay.

The loop body computes the pay including tax for each value and stores it in pay_with_tax at that position.

OUTPUT (printed on console):

[Screenshot: console output of the non-vectorized run]
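To make the loop's mechanics concrete, here is a minimal sketch of the same pattern on three hypothetical pay values, small enough to trace by hand.

```python
import numpy as np

pays = np.array([100.0, 200.0, 300.0])  # hypothetical pay values
pay_with_tax = np.zeros(len(pays))      # starts as three zeros

for idx, pay in enumerate(pays):
    # idx is the position (0, 1, 2); pay is the value at that position.
    pay_with_tax[idx] = pay * 1.05 + 1

print(pay_with_tax)
```

Each pass through the loop handles exactly one element in the Python interpreter, which is why the cost grows so quickly with the number of rows.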

Vectorized operations can take advantage of SIMD (Single Instruction, Multiple Data) instructions, which apply one operation to several values at once. In Pandas, whole-column operations act on the entire batch of values in compiled code, so no Python loop is needed.

Let’s run the code below, which uses vectorization, and measure the time taken to calculate pay_with_tax.

Example Code (saved in demo.py):

start_time = time.process_time()
pay_with_tax = df.TotalPay.values * 1.05 + 1
end_time = time.process_time()

print("Using Vectorization")
print(f"pay_with_tax = {pay_with_tax}")
print(f"Computation time = {(1000*(end_time - start_time ))}ms")

OUTPUT (printed on console):

[Screenshot: console output of the vectorized run]
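A timing comparison only means something if both versions compute the same result. The sketch below checks that first and then times each approach. It runs on hypothetical random data rather than Salaries.csv, and the helper names with_loop and with_vectorization are made up for this example. It uses time.perf_counter(), which measures wall-clock time, in place of time.process_time().

```python
import time

import numpy as np
import pandas as pd

# Hypothetical stand-in data; the article's df comes from Salaries.csv.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({"TotalPay": rng.uniform(0, 200_000, size=100_000)})
values = df["TotalPay"].values


def with_loop(values):
    """Non-vectorized: one Python-level iteration per element."""
    out = np.zeros(len(values))
    for idx, pay in enumerate(values):
        out[idx] = pay * 1.05 + 1
    return out


def with_vectorization(values):
    """Vectorized: one expression over the whole array."""
    return values * 1.05 + 1


start = time.perf_counter()
loop_result = with_loop(values)
loop_ms = 1000 * (time.perf_counter() - start)

start = time.perf_counter()
vec_result = with_vectorization(values)
vec_ms = 1000 * (time.perf_counter() - start)

# Both approaches must agree before the timings mean anything.
assert np.allclose(loop_result, vec_result)
print(f"loop: {loop_ms:.2f} ms, vectorized: {vec_ms:.2f} ms")
```

The exact numbers depend on the machine and the data size, so treat them as relative rather than absolute.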

You can also apply NumPy's mathematical and statistical functions, such as sqrt and mean, with small changes to the above code.

Example Code (saved in demo.py):

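import numpy as np

values = df["TotalPay"].values

# non vectorized: np.sqrt is called once per element
pay_sqrt = np.zeros(len(df))
for idx, pay in enumerate(values):
    pay_sqrt[idx] = np.sqrt(pay)

# vectorized: np.sqrt is called once on the whole array
pay_sqrt = np.sqrt(values)

# vectorized reduction: the mean of the whole column in one call
mean_pay = df["TotalPay"].mean()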
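Functions that contain a branch can often be vectorized as well. The sketch below shows one way to do it with np.where. The 50,000 threshold, the two rates, and the helper name pay_with_tax_scalar are hypothetical and do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical pay values and tax rule, for illustration only.
df_demo = pd.DataFrame({"TotalPay": [20_000.0, 50_000.0, 80_000.0]})


def pay_with_tax_scalar(pay):
    # Scalar version: one value in, one value out.
    rate = 1.10 if pay > 50_000 else 1.05
    return pay * rate + 1


# Non-vectorized: apply() calls the Python function once per row.
looped = df_demo["TotalPay"].apply(pay_with_tax_scalar)

# Vectorized: np.where picks the rate for every row in one array operation.
pay = df_demo["TotalPay"].values
vectorized = pay * np.where(pay > 50_000, 1.10, 1.05) + 1
```

Here np.where(condition, a, b) evaluates the condition for the whole array and selects a or b element by element, so the if/else moves out of Python and into the array operation.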

You can see the difference in computation time with and without vectorization. Industry data sets can range from millions to trillions of rows.

Computing over this much data with a non-vectorized approach is time-consuming. Vectorization in Pandas data frames therefore makes data analysis and manipulation much faster.
