SciPy stats.zscore Function

Lakshay Kapoor Jan 30, 2023
  1. the scipy.stats.zscore Function
  2. Calculating the z-score for a One-dimensional Array in Python
  3. Calculating the z-score for a Multi-Dimensional Array in Python
  4. Calculating the z-score for a Pandas Dataframe in Python

The z-score is a statistical measure that tells how many standard deviations a particular value lies from the mean. It is calculated with the following formula.

z = (X - μ) / σ

In which,

  • X is a particular value from the data
  • μ is the mean value
  • σ is the standard deviation
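
As a rough check of the formula above, the following sketch computes z-scores by hand with NumPy and compares them with scipy.stats.zscore. The array values and the variable names (values, mu, sigma, manual_z) are made up for illustration and are not part of the original tutorial.

```python
import numpy as np
from scipy import stats

# Illustrative data: mean is 5.0 and population standard deviation is 2.0
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = values.mean()      # μ, the mean value
sigma = values.std()    # σ, the standard deviation (NumPy divides by N by default)

# z = (X - μ) / σ applied to every element at once
manual_z = (values - mu) / sigma

print(manual_z)
print(np.allclose(manual_z, stats.zscore(values)))  # compare with SciPy's result
```

With these numbers the value 9.0 sits two standard deviations above the mean, so its z-score is 2, while the two values equal to the mean get a z-score of 0.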

This tutorial will show how to calculate the z-score value of any data in Python using the SciPy library.

the scipy.stats.zscore Function

The scipy.stats.zscore function of the SciPy library computes the z-score of each value in the input data, relative to the mean and standard deviation of that data. Its signature is scipy.stats.zscore(a, axis=0, ddof=0, nan_policy='propagate').

Following are the parameters of the scipy.stats.zscore function.

a (array) An array-like object containing the raw input data.
axis (int or None) The axis along which the z-score is computed. The default is 0, which for a two-dimensional array means each column is standardized separately. If axis is None, the function computes over the whole array.
ddof (int) The degrees-of-freedom correction used when computing the standard deviation: the divisor is N - ddof, where N is the number of values. The default is 0.
nan_policy (str) Decides how NaN values in the input are handled. It accepts one of three string values: 'propagate' (the default), 'raise', or 'omit'. With 'propagate', a NaN in the input makes the mean and standard deviation NaN, so every z-score computed along that axis is NaN. With 'raise', the function raises an error. With 'omit', NaN values are ignored when computing the mean and standard deviation; they remain NaN in the output but do not affect the z-scores of the other values.
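
The three nan_policy options can be compared side by side. The sketch below uses a small made-up array (with_nan) and made-up variable names, and it assumes that 'raise' surfaces as a ValueError, which is how SciPy reports NaN input under that policy.

```python
import numpy as np
from scipy import stats

with_nan = np.array([1.0, 2.0, np.nan, 4.0, 5.0])  # illustrative data

# 'propagate' (the default): the mean and standard deviation become NaN,
# so every z-score in the result is NaN.
propagated = stats.zscore(with_nan)

# 'omit': the NaN is skipped when computing the mean and standard deviation.
# It stays NaN in the output; the other values get ordinary z-scores.
omitted = stats.zscore(with_nan, nan_policy="omit")

# 'raise': the call fails instead of returning a result.
try:
    stats.zscore(with_nan, nan_policy="raise")
    raised = False
except ValueError:
    raised = True

print(propagated)
print(omitted)
print(raised)
```

A practical consequence is that 'omit' is usually the option to reach for when a few missing values should not wipe out the whole result.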

All the parameters except a are optional, so they do not need to be specified on every call to scipy.stats.zscore.
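
Since the optional parameters are normally passed by keyword, here is a small sketch of what changing ddof does. The three-element array and the variable names are illustrative assumptions, chosen so that the arithmetic is easy to follow.

```python
import numpy as np
from scipy import stats

sample = np.array([2.0, 4.0, 6.0])  # illustrative data, mean is 4.0

# Default call: only the required argument a is given (ddof=0, divide by N)
population_z = stats.zscore(sample)

# ddof=1 divides by N - 1, giving a standard deviation of exactly 2.0 here
sample_z = stats.zscore(sample, ddof=1)

# Spelling out every default explicitly gives the same result as the first call
explicit_z = stats.zscore(sample, axis=0, ddof=0, nan_policy="propagate")

print(population_z)
print(sample_z)
```

With ddof=1 the z-scores are -1, 0 and 1; with the default ddof=0 the standard deviation is smaller, so the z-scores are slightly larger in magnitude.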

Now, let us use the scipy.stats.zscore function on a one-dimensional array, a multi-dimensional array, and a Pandas DataFrame.

Calculating the z-score for a One-dimensional Array in Python

import numpy as np
import scipy.stats as stats

input_data = np.array([5, 10, 20, 35, 25, 22, 19, 19, 50, 45, 62])

stats.zscore(input_data)

Output:

array([-1.3916106 , -1.09379511, -0.49816411,  0.39528239, -0.20034861,
       -0.37903791, -0.55772721, -0.55772721,  1.28872889,  0.99091339,
        2.00348608])

Note that each z-score tells how many standard deviations its corresponding value lies from the mean. A negative sign means the value is that many standard deviations below the mean, and a positive sign means it is that many standard deviations above the mean. A z-score of 0 means the value is equal to the mean.
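
The interpretation above can be checked directly on the same input_data array. The sketch below is only an illustration: the names below, above, and recovered are made up, and the last step simply inverts the formula as X = μ + z × σ.

```python
import numpy as np
from scipy import stats

input_data = np.array([5, 10, 20, 35, 25, 22, 19, 19, 50, 45, 62])
z = stats.zscore(input_data)

mean = input_data.mean()
std = input_data.std()

# Negative z-scores belong to values below the mean, positive ones to values above it
below = input_data[z < 0]
above = input_data[z > 0]
print(below)
print(above)

# Inverting the formula recovers the original values: X = mean + z * std
recovered = mean + z * std
print(np.allclose(recovered, input_data))
```

The z-scores themselves always have a mean of 0 and a standard deviation of 1, which is why they are often used to put differently scaled data on a common footing.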

Calculating the z-score for a Multi-Dimensional Array in Python

import numpy as np
import scipy.stats as stats

data = np.array([[5, 10, 20, 35], [25, 22, 19, 19], [50, 45, 62, 28], [24, 45, 15, 30]])

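# Round to two decimals to keep the output readable
np.round(stats.zscore(data), 2)

Output:

array([[-1.31, -1.36, -0.47,  1.21],
       [-0.06, -0.56, -0.52, -1.55],
       [ 1.5 ,  0.96,  1.72,  0.  ],
       [-0.13,  0.96, -0.73,  0.35]])

Because axis defaults to 0, each column is standardized separately, using that column's own mean and standard deviation.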
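
For a two-dimensional array, the axis argument decides which values share a mean and standard deviation. The following sketch, using the same 4×4 data array, compares the default column-wise result with a manual NumPy computation and then shows the row-wise and whole-array variants. The variable names are illustrative.

```python
import numpy as np
from scipy import stats

data = np.array([[5, 10, 20, 35], [25, 22, 19, 19], [50, 45, 62, 28], [24, 45, 15, 30]])

# Default (axis=0): every column is standardized with its own mean and std
column_z = stats.zscore(data)
manual_column_z = (data - data.mean(axis=0)) / data.std(axis=0)
print(np.allclose(column_z, manual_column_z))

# axis=1: every row is standardized with its own mean and std
row_z = stats.zscore(data, axis=1)

# axis=None: one mean and one std for all 16 values
overall_z = stats.zscore(data, axis=None)

print(column_z.shape, row_z.shape, overall_z.shape)  # all keep the input shape
```

In each case the output has the same shape as the input; only the grouping used for the mean and standard deviation changes.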

Calculating the z-score for a Pandas Dataframe in Python

In this example, we use the randint() function from NumPy's random module (np.random.randint), which generates random integers and returns them as a NumPy array. We then wrap that array in a Pandas DataFrame. Because the values are random, each run produces different numbers.

import pandas as pd
import numpy as np
import scipy.stats as stats

input_data = pd.DataFrame(
    np.random.randint(0, 30, size=(4, 4)), columns=["W", "X", "Y", "Z"]
)
print(input_data)

Output:

    W   X   Y   Z
0   7   9   2  15
1  11  23  15  28
2  28  11  25   2
3  11  19  14  15

input_data.apply(stats.zscore)

Output:

          W         X         Y         Z
0 -0.894534 -1.135815 -1.471534  0.000000
1 -0.400998  1.310556  0.122628  1.414214
2  1.696529 -0.786334  1.348907 -1.414214
3 -0.400998  0.611593  0.000000  0.000000

Note that the apply() method of a Pandas DataFrame calls the function passed to it once per column by default (axis=0), not once per value. Here, stats.zscore receives each column in turn, so every column is standardized with its own mean and standard deviation.
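
As a cross-check, the same result can be obtained with plain Pandas arithmetic. The sketch below rebuilds the DataFrame printed above with fixed values (so the numbers are reproducible) and is only one possible way to do it. Note that DataFrame.std() defaults to ddof=1, so ddof=0 is passed to match SciPy's default.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Fixed copy of the randomly generated DataFrame shown in the output above
input_data = pd.DataFrame(
    {"W": [7, 11, 28, 11], "X": [9, 23, 11, 19], "Y": [2, 15, 25, 14], "Z": [15, 28, 2, 15]}
)

# Column-by-column z-scores through apply()
via_apply = input_data.apply(stats.zscore)

# Equivalent Pandas arithmetic; ddof=0 matches the default used by stats.zscore
via_pandas = (input_data - input_data.mean()) / input_data.std(ddof=0)

print(via_apply)
print(np.allclose(via_apply.to_numpy(), via_pandas.to_numpy()))
```

Leaving out ddof=0 in the Pandas version would give slightly smaller z-scores, because the sample standard deviation (ddof=1) is larger than the population one.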


Lakshay Kapoor is a final year B.Tech Computer Science student at Amity University Noida. He is familiar with programming languages and their real-world applications (Python/R/C++). Deeply interested in the area of Data Sciences and Machine Learning.
