Vectorize a Function in Pandas

Vectorization rewrites an operation so that it acts on a whole array of values at once instead of on one value at a time. In Python, this speeds up data processing because the explicit loop is replaced by a single array operation that runs in optimized, compiled code.
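As a minimal sketch of the idea, assuming NumPy is installed, the same arithmetic can be written as an explicit loop over a list or as a single expression over an array. The sample values are made up for illustration.

```python
import numpy as np

# Made-up sample values, used only for illustration
values = [10.0, 20.0, 30.0]

# Loop version: Python handles one element per iteration
doubled_loop = []
for v in values:
    doubled_loop.append(v * 2)

# Vectorized version: one expression applied to the whole array
doubled_vec = np.array(values) * 2

print(doubled_loop)          # [20.0, 40.0, 60.0]
print(doubled_vec.tolist())  # [20.0, 40.0, 60.0]
```

Both versions produce the same numbers; the difference is only in where the iteration happens.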

The Pandas library is a popular Python tool for data analysis and manipulation. Vectorization is commonly used with Pandas in numerical computing to improve code performance.

A Pandas data frame is a data structure built on top of NumPy arrays, offering functionality similar to both R data frames and Python dictionaries. It works like a Python dictionary of columns, but with data analysis and manipulation capabilities, and it is laid out in rows and columns like an Excel table or a database table.
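The dictionary and table analogy can be illustrated with a short sketch. The column names and values below are invented for this example and are not taken from the article's data.

```python
import pandas as pd

# Hypothetical sample data; the names and amounts are invented for this sketch
data = {
    "EmployeeName": ["Ann", "Bob", "Cara"],
    "TotalPay": [1000.0, 2000.0, 3000.0],
}
df_demo = pd.DataFrame(data)

# Dictionary-style access returns a whole column
print(df_demo["TotalPay"].tolist())  # [1000.0, 2000.0, 3000.0]

# Table-style view: (rows, columns)
print(df_demo.shape)  # (3, 2)
```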

Vectorize a Function in Pandas

Let’s install the Python library `pandas` so that we can work with data frames.

``````
PS C:\> pip install pandas
``````

To perform vectorization on a data frame, we first load it with the Python library `pandas`. Let’s run the code below to load a data frame and enlarge it by concatenating 100 copies of it.

Example Code (saved in `demo.py`):

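``````
import pandas as pd
# Placeholder file name (the source does not name it); needs a TotalPay column
small_df = pd.read_csv("data.csv")
df = pd.concat([small_df] * 100, ignore_index=True)
``````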

Now run the code below to count the rows of the enlarged data frame.

Example Code (saved in `demo.py`):

``````
print(f'No of rows: {len(df)}')
``````

OUTPUT (printed on console):

``````
No of rows: 14865400
``````

Let’s measure how long an operation on the data frame takes without vectorization by running the code below.

Example Code (saved in `demo.py`):

``````
import time
import numpy as np

start_time = time.process_time()
pay_with_tax = np.zeros(len(df))  # pre-allocate the result array
for idx, pay in enumerate(df.TotalPay.values):
    pay_with_tax[idx] = pay * 1.05 + 1  # scale by 1.05 and add 1
end_time = time.process_time()

print("Without using Vectorization")
print(f"pay_with_tax = {pay_with_tax}")
print(f"Computation time = {1000 * (end_time - start_time)}ms")
``````

The function `np.zeros()` takes `len(df)` as the size and creates an array of zeros of that length. The `for` loop iterates over the `TotalPay` column of the data frame, with `idx` as the position and `pay` as the value at that position.

It calculates the pay including tax for each `pay` and stores the result at position `idx` of `pay_with_tax`.

OUTPUT (printed on console):

Vectorization applies a single operation to many values at once, in the style of SIMD (Single Instruction, Multiple Data). In Pandas, operations on whole columns act as a batch API that speeds up the computation without explicit loops.
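To make this concrete, the sketch below applies the same formula with and without a loop on a tiny data frame and checks that both give the same values. The `TotalPay` figures are made-up stand-ins for the real column.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the TotalPay column of the real data frame
df_demo = pd.DataFrame({"TotalPay": [100.0, 200.0, 300.0]})

# Loop: one Python-level multiplication and addition per row
looped = np.zeros(len(df_demo))
for idx, pay in enumerate(df_demo["TotalPay"].values):
    looped[idx] = pay * 1.05 + 1

# Vectorized: the arithmetic operators act on the whole column at once
vectorized = df_demo["TotalPay"] * 1.05 + 1

# Both approaches should agree up to floating-point rounding
print(np.allclose(looped, vectorized.values))  # True
```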

Let’s run the code below, which uses vectorization, to measure the time taken to calculate `pay_with_tax`.

Example Code (saved in `demo.py`):

``````
start_time = time.process_time()
pay_with_tax = df.TotalPay.values * 1.05 + 1
end_time = time.process_time()

print("Using Vectorization")
print(f"pay_with_tax = {pay_with_tax}")
print(f"Computation time = {(1000*(end_time - start_time ))}ms")
``````

OUTPUT (printed on console):

You can also apply functions from the `numpy` library, such as `mean` and `sqrt`, with small changes to the above code.

Example Code (saved in `demo.py`):

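``````
import numpy as np

pay = df["TotalPay"].values

# non vectorized: add up the values in a Python loop, then divide
total = 0.0
for value in pay:
    total += value
mean_pay = total / len(pay)

# vectorized: np.mean reduces the whole array in one call
# (Series.apply(np.mean) typically calls np.mean once per value instead)
mean_pay = np.mean(pay)
``````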
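As a rough sketch of how NumPy functions fit in, assuming a small made-up `TotalPay` column with non-negative values, `np.sqrt` works element by element while `np.mean` reduces the whole column to one number. Both accept the column data directly, so no loop is needed.

```python
import numpy as np
import pandas as pd

# Made-up, non-negative values chosen so the square roots are exact
df_demo = pd.DataFrame({"TotalPay": [4.0, 9.0, 16.0]})

# Element-wise: one result per row, returned as a Series
roots = np.sqrt(df_demo["TotalPay"])

# Reduction: the whole column collapses to a single number
average = np.mean(df_demo["TotalPay"].values)

print(roots.tolist())  # [2.0, 3.0, 4.0]
print(average)
```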

Comparing the two computation times shows the difference that vectorization makes. In industry, data sets can run from millions to trillions of rows.

Computing this data with a non-vectorized approach is time-consuming. Vectorized operations on Pandas data frames therefore help with fast data analysis and manipulation.
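For a more repeatable comparison than a single timed run, one option is the standard-library `timeit` module. The sketch below is only an illustration under stated assumptions: it uses a synthetic column of 100,000 random values rather than the article's data, the helper names `with_loop` and `with_vectorization` are hypothetical, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: 100,000 random pay values
rng = np.random.default_rng(seed=0)
df_demo = pd.DataFrame({"TotalPay": rng.uniform(0, 100_000, size=100_000)})


def with_loop(frame):
    """Apply the formula one element at a time."""
    values = frame["TotalPay"].values
    result = np.zeros(len(values))
    for idx, pay in enumerate(values):
        result[idx] = pay * 1.05 + 1
    return result


def with_vectorization(frame):
    """Apply the formula to the whole column at once."""
    return frame["TotalPay"].values * 1.05 + 1


# Run each version a few times and keep the best time, in seconds
loop_time = min(timeit.repeat(lambda: with_loop(df_demo), number=1, repeat=3))
vec_time = min(timeit.repeat(lambda: with_vectorization(df_demo), number=1, repeat=3))

print(f"Loop:       {loop_time * 1000:.2f} ms")
print(f"Vectorized: {vec_time * 1000:.2f} ms")
```

Taking the minimum of several runs reduces the effect of background activity on the machine, which a single `time.process_time()` measurement cannot filter out.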
