# Vectorize a Function in Pandas

Vectorization rewrites a computation so that it operates on whole arrays instead of one element at a time. In Python, this speeds up data processing by replacing explicit loops with array operations that run in optimized, compiled code.
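To make the idea concrete, here is a minimal sketch using NumPy on a small made-up array. The values and variable names are illustrative assumptions, not part of the dataset used later.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration
values = np.array([1.0, 2.0, 3.0, 4.0])

# Loop version: one Python-level operation per element
doubled_loop = np.empty_like(values)
for i, v in enumerate(values):
    doubled_loop[i] = v * 2

# Vectorized version: a single expression over the whole array
doubled_vec = values * 2

print(np.array_equal(doubled_loop, doubled_vec))  # True
```

Both versions produce the same array; the vectorized one simply leaves the looping to NumPy.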

The Pandas library is a popular Python tool for data analysis and manipulation. Vectorization is commonly used with Pandas in numerical computing to improve code performance.

A Pandas data frame is a two-dimensional, labeled data structure that provides the functionality of both R data frames and Python dictionaries. It works like a Python dictionary that maps column names to columns, but adds data analysis and manipulation capabilities, much like an Excel table or a database table with rows and columns.
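As a small illustration of the dictionary-like view, the sketch below builds a data frame from a plain Python dictionary. The names and numbers are made up for this example.

```python
import pandas as pd

# Hypothetical data; the column names are assumptions for illustration only
people = {"Name": ["Ann", "Bob", "Cy"], "TotalPay": [100.0, 200.0, 300.0]}
frame = pd.DataFrame(people)

print(frame.shape)              # (3, 2): three rows, two columns
print(frame["TotalPay"].sum())  # 600.0
```

Each dictionary key becomes a column, and a column can be looked up by name much like a dictionary value.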

## Vectorize a Function in Pandas

Let’s install the Python library `pandas` to import data frames.

```
PS C:\> pip install pandas
```

To perform vectorization on a data frame, we first load it with the Python library `pandas`. Let’s run the code below to import a data frame and enlarge it through concatenation.

Example Code (saved in `demo.py`):

```
import pandas as pd
small_df = pd.read_csv('Salaries.csv')
df = pd.concat([small_df]*100, ignore_index=True)
```

Now run the code below to calculate the total number of rows of the data frame for data analysis.

Example Code (saved in `demo.py`):

```
print(f'No of rows: {len(df)}')
```

OUTPUT (printed on console):

```
No of rows: 14865400
```
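If `Salaries.csv` is not at hand, the same enlargement step can be tried on a tiny in-memory data frame. This is only a sketch: the three values below are invented, and `small` and `big` are hypothetical names.

```python
import pandas as pd

# Hypothetical stand-in for the CSV contents
small = pd.DataFrame({"TotalPay": [100.0, 200.0, 300.0]})

# Repeat the frame 100 times; ignore_index=True renumbers the rows 0..N-1
big = pd.concat([small] * 100, ignore_index=True)

print(len(big))       # 300
print(big.index[-1])  # 299
```

The row count is the original count multiplied by 100, which is consistent with the output above (14865400 rows from a file of 148654 rows).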

Let’s measure how long an operation on the data frame takes without vectorization by running the code below.

Example Code (saved in `demo.py`):

```
import time
import numpy as np

start_time = time.process_time()
# Pre-allocate the result array, one slot per row
pay_with_tax = np.zeros(len(df))
# Process one value at a time in a Python loop
for idx, pay in enumerate(df.TotalPay.values):
    pay_with_tax[idx] = pay * 1.05 + 1
end_time = time.process_time()
print("Without using Vectorization")
print(f"pay_with_tax = {pay_with_tax}")
print(f"Computation time = {(1000*(end_time - start_time ))}ms")
```

The function `np.zeros()` takes `len(df)` as the size and creates an array of zeros of that size. The `for` loop iterates over the `TotalPay` column of the data frame, with `idx` as the position and `pay` as the value at that position.

It calculates the pay with tax for each `pay` and stores it at the matching position in `pay_with_tax`.

OUTPUT (printed on console):

Vectorization applies one operation to many values at once, following the SIMD (Single Instruction, Multiple Data) approach. In Pandas, this batch-style API operates on entire columns, which speeds up the operations without using loops.
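A minimal sketch of this column-at-a-time style, on a made-up data frame, is shown here. The `PayWithTax` column name is an assumption introduced for illustration.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the real data
demo = pd.DataFrame({"TotalPay": [1000.0, 2000.0, 3000.0]})

# One expression updates every row; no explicit loop is written
demo["PayWithTax"] = demo["TotalPay"] * 1.05 + 1

print(demo)
```

The arithmetic is written once and Pandas applies it to every row of the column.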

Let’s run the code below, which uses vectorization, to measure the time taken to calculate `pay_with_tax`.

Example Code (saved in `demo.py`):

```
start_time = time.process_time()
pay_with_tax = df.TotalPay.values * 1.05 + 1
end_time = time.process_time()
print("Using Vectorization")
print(f"pay_with_tax = {pay_with_tax}")
print(f"Computation time = {(1000*(end_time - start_time ))}ms")
```

OUTPUT (printed on console):

You can also apply other operations from the `numpy` library, such as `mean` and `sqrt`, with small changes to the above code.

Example Code (saved in `demo.py`):

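```
import numpy as np

# Non-vectorized: accumulate the sum in a Python loop, then divide
total = 0.0
for pay in df.TotalPay.values:
    total += pay
mean_pay = total / len(df)

# Vectorized: NumPy computes the mean over the whole array at once
mean_pay = np.mean(df.TotalPay.values)
```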
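One point worth checking with a small experiment is the difference between calling a function once per element through `Series.apply` and calling a NumPy function on the whole column. The sketch below uses a made-up Series; the values are chosen so that the square roots are exact.

```python
import numpy as np
import pandas as pd

# Hypothetical pay values for illustration
pay = pd.Series([100.0, 400.0, 900.0], name="TotalPay")

# Element-wise: apply calls the lambda once per value
roots_apply = pay.apply(lambda x: x ** 0.5)

# Vectorized: the NumPy function processes the whole column in one call
roots_vec = np.sqrt(pay)

print(roots_vec.tolist())  # [10.0, 20.0, 30.0]
print(np.allclose(roots_apply, roots_vec))  # True
```

Both give the same result, but only the second form hands the whole column to NumPy in a single call.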

Comparing the two computation times shows the difference between the looped and the vectorized version. Industry datasets can run from millions to trillions of rows.

Computing over such data with a non-vectorized approach is time-consuming. Vectorized operations on Pandas data frames therefore help keep data analysis and manipulation fast.
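The comparison in this article can also be reproduced without `Salaries.csv`. The sketch below is one way to do it, assuming a synthetic `TotalPay` column of random values; `demo_df` and the other names are hypothetical, and `time.perf_counter()` is swapped in for `time.process_time()` as a wall-clock timer. Timings depend on the machine, so none are shown.

```python
import time
import numpy as np
import pandas as pd

# Synthetic stand-in for the TotalPay column (assumption: random values)
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({"TotalPay": rng.uniform(0, 200_000, size=100_000)})

# Loop version: one multiplication and addition per Python iteration
start = time.perf_counter()
loop_result = np.zeros(len(demo_df))
for idx, pay in enumerate(demo_df["TotalPay"].values):
    loop_result[idx] = pay * 1.05 + 1
loop_ms = 1000 * (time.perf_counter() - start)

# Vectorized version: the same arithmetic on the whole array
start = time.perf_counter()
vec_result = demo_df["TotalPay"].values * 1.05 + 1
vec_ms = 1000 * (time.perf_counter() - start)

# Confirm both approaches agree before comparing their timings
print(np.allclose(loop_result, vec_result))
print(f"loop: {loop_ms:.1f} ms, vectorized: {vec_ms:.1f} ms")
```

Checking that the two results match before comparing timings guards against measuring two different computations.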