Scipy scipy.stats.pearsonr Method

Bhuwan Bhatt Jan 30, 2023
  1. Syntax of scipy.stats.pearsonr():
  2. Example Codes : scipy.stats.pearsonr() Method to Find Correlation Coefficient
  3. Example Codes : Using scipy.stats.pearsonr() Method to Find Correlation Between Variables within a CSV File
The SciPy scipy.stats.pearsonr() method computes the Pearson correlation coefficient, which measures the strength and direction of the linear relationship between two variables. It also returns the p-value for a test of the null hypothesis that the two variables are uncorrelated.

The Pearson correlation coefficient ranges from -1 to +1. A value near -1 indicates a strong negative linear relationship between the variables, a value near 0 indicates little or no linear relationship, and a value near +1 indicates a strong positive linear relationship.

A positive relationship means that as one variable's value increases, the other's tends to increase as well. A negative relationship means that as one variable's value increases, the other's tends to decrease.
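The three reference points on this scale can be checked with a small sketch. The arrays below are made up for illustration: two exact straight lines and one symmetric curve, which has a clear relationship to x but no linear one.

```python
from scipy import stats

# Illustrative data only: x is symmetric around 0.
x = [-2, -1, 0, 1, 2]

y_up = [2 * v + 1 for v in x]     # exact increasing straight line
y_down = [-3 * v + 4 for v in x]  # exact decreasing straight line
y_curve = [v ** 2 for v in x]     # symmetric curve, not a line

# pearsonr returns (r, p); only r is needed here.
r_up, _ = stats.pearsonr(x, y_up)
r_down, _ = stats.pearsonr(x, y_down)
r_curve, _ = stats.pearsonr(x, y_curve)

print("increasing line:", r_up)
print("decreasing line:", r_down)
print("symmetric curve:", r_curve)
```

The two lines give coefficients of +1 and -1 (up to floating-point rounding), while the curve gives a coefficient of 0 even though y depends entirely on x. This is why a value near 0 should be read as "no linear relationship" rather than "no relationship".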

Syntax of scipy.stats.pearsonr():

scipy.stats.pearsonr(x, y)

Parameters

x Array-like input containing the values of the first variable or attribute.
y Array-like input containing the values of the second variable or attribute. It must have the same length as x.

Return

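It returns two values that can be unpacked as a tuple. In recent SciPy releases the return value is a result object with statistic and pvalue attributes, and it still unpacks as (r, p).

  1. r : The Pearson correlation coefficient. It shows the strength and direction of the linear relationship between x and y.
  2. p value: The p-value of the test. It is the probability of observing a correlation at least as extreme as r if the null hypothesis were true. It is compared against a significance level (commonly 0.05) to decide whether to reject the null hypothesis.

The null hypothesis states that there is no linear relationship between the variables under consideration.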
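To make the meaning of r concrete, the sketch below computes it directly from the textbook definition (the sum of products of deviations from the mean, divided by the square root of the product of the summed squared deviations) and compares it with the first value returned by pearsonr. The variable names dx, dy, and r_manual are chosen for this illustration only.

```python
import numpy as np
from scipy import stats

x = np.array([3, 6, 9, 12], dtype=float)
y = np.array([12, 10, 11, 11], dtype=float)

# Deviations of each value from its own mean.
dx = x - x.mean()
dy = y - y.mean()

# Pearson r from its definition.
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same quantity from SciPy, unpacked as (r, p).
r, p = stats.pearsonr(x, y)

print("manual r:", r_manual)
print("scipy r: ", r)
```

Both values should agree up to floating-point rounding.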

Example Codes : scipy.stats.pearsonr() Method to Find Correlation Coefficient

from scipy import stats

# Two samples of equal length
arr1 = [3, 6, 9, 12]
arr2 = [12, 10, 11, 11]

# pearsonr returns the correlation coefficient and the p-value
r, p = stats.pearsonr(arr1, arr2)

print("The pearson correlation coefficient is:", r)
print("The p-value is:", p)

Output:

The pearson correlation coefficient is: -0.31622776601683794
The p-value is: 0.683772233983162

Here, two arrays of equal length are passed as arguments to the pearsonr function. The coefficient is negative but small in magnitude (about -0.32): the first array increases linearly while the second array shows no consistent trend, so the linear relationship between them is weak.

Since the p-value (0.683772233983162) is greater than 0.05, we fail to reject the null hypothesis. The data provide no significant evidence of a linear relationship, which is not the same as proving that the null hypothesis is true.
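As a rough check on where this p-value comes from, the sketch below recomputes it with the standard two-sided t-test for a correlation coefficient, which is mathematically equivalent to the default test under the usual normality assumption. The names t_stat, p_manual, alpha, and reject_null are introduced here for illustration, and alpha = 0.05 is an assumed significance level.

```python
import math
from scipy import stats

arr1 = [3, 6, 9, 12]
arr2 = [12, 10, 11, 11]
r, p = stats.pearsonr(arr1, arr2)

# t statistic for testing "true correlation is zero", with n - 2 degrees of freedom.
n = len(arr1)
t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))

# Two-sided p-value from the t distribution's survival function.
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)

alpha = 0.05  # assumed significance level
reject_null = bool(p < alpha)

print("p from pearsonr:", p)
print("p from t-test:  ", p_manual)
print("reject null hypothesis:", reject_null)
```

With only four data points the test has very little power, so a weak correlation like this one cannot be distinguished from no correlation at all.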

Example Codes : Using scipy.stats.pearsonr() Method to Find Correlation Between Variables within a CSV File

import pandas as pd
from scipy import stats

# Load the CSV file into a DataFrame
data = pd.read_csv("dataset.csv")

# Keep only the two columns of interest and drop rows with missing values
newdata = data[["price", "mileage"]].dropna()

r, p = stats.pearsonr(newdata["price"], newdata["mileage"])
print("The pearson correlation coefficient between price and mileage is:", r)
print("The p-value is:", p)

Output:

The pearson correlation coefficient between price and mileage is: -0.4008381863293672
The p-value is: 4.251481046096957e-97

Here, we use the pandas library to load dataset.csv as a pandas DataFrame. The file contains car data with the columns name, price, mileage, brand, and year of manufacture. We select only the price and mileage columns and then call dropna() to remove rows that have a missing value in either column, so the two series passed to pearsonr have equal length and contain no missing values.

On analyzing the output, we can see that the Pearson correlation coefficient is negative (about -0.40), meaning price and mileage have a moderate negative linear relationship in this dataset. Cars with lower prices tend to have higher mileage, and as price increases, mileage tends to decrease.

Since the p-value is extremely small (approximately 0), the null hypothesis of no linear relationship is rejected, and the correlation is statistically significant.
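The dataset.csv file is not included here, so the same steps can be tried on a small in-memory table instead. The sketch below uses made-up values (they are not taken from the original dataset) with the same price and mileage column names, including two missing entries to show what dropna() removes.

```python
import pandas as pd
from scipy import stats

nan = float("nan")

# Hypothetical car data for illustration only.
data = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e", "f", "g"],
    "price": [5000.0, 7500.0, nan, 12000.0, 15000.0, 20000.0, 26000.0],
    "mileage": [24.0, 22.5, 21.0, 19.0, nan, 16.5, 14.0],
})

# Column selection keeps only price and mileage;
# dropna() then removes the rows where either value is missing.
newdata = data[["price", "mileage"]].dropna()

r, p = stats.pearsonr(newdata["price"], newdata["mileage"])

print("rows before:", len(data), "rows after:", len(newdata))
print("r:", r)
print("p:", p)
```

Two of the seven rows are dropped because each has one missing value. Dropping rows rather than filling them in keeps the price and mileage values paired correctly, which pearsonr relies on.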