Variance Inflation Factor in Python

  1. Variance Inflation Factor in Python
  2. Performance of VIF in Detecting Influential Observations
  3. Calculate Variance Inflation Factor (VIF) in Python

This article describes the variance inflation factor (VIF), examines how well it detects predictors that are influenced by other predictors, and demonstrates how to calculate it in Python with statsmodels.

Variance Inflation Factor in Python

The variance inflation factor (VIF) measures the amount of collinearity among predictor variables in a multiple regression model. For a given predictor, it is the ratio of the variance of that predictor's estimated coefficient in the full model to the variance the coefficient would have if the predictor were uncorrelated with the other predictors.

A variance inflation factor of 1 indicates no collinearity. In contrast, a VIF greater than 1 suggests that some collinearity is present, and the larger the value, the stronger it is. The VIF can be used to assess whether the inclusion of a given predictor variable in a multiple regression model is warranted.

If the VIF for a given predictor is high, it may indicate that the predictor is redundant with other predictors in the model. In that case, we can often remove it with little effect on the model fit.
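The effect of dropping a redundant predictor can be sketched with synthetic data. The example below is only an illustration under stated assumptions: the variables `x1`, `x2`, `x3` and the helper `r_squared` are made up for this sketch, and the response is generated so that it does not depend on `x2` beyond what `x1` already explains. Under those conditions the fit barely changes when `x2` is removed; with real data the change should be checked rather than assumed.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an OLS fit of y on the columns of X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(42)
n = 400
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)   # nearly a copy of x1 (assumed redundant predictor)
x3 = rng.normal(size=n)
y = 2.0 * x1 + 1.0 * x3 + rng.normal(size=n)

r2_full = r_squared(np.column_stack([x1, x2, x3]), y)
r2_reduced = r_squared(np.column_stack([x1, x3]), y)   # x2 dropped

print("R^2 with x2:   ", r2_full)
print("R^2 without x2:", r2_reduced)
```

The two R^2 values should be almost identical, because `x2` carries nearly the same information as `x1`.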

The VIF value that counts as too high depends on the specific context and data set; 5 and 10 are common rule-of-thumb cutoffs. In general, though, the VIF is a valuable tool for identifying potential issues with multicollinearity in your data.

Performance of VIF in Detecting Influential Observations

There are several ways to detect influence in a regression analysis. Note that the VIF does not flag influential observations, that is, individual data points; it flags predictors whose values are largely determined by other predictors.

The VIF measures how much the variance of a predictor's estimated coefficient is inflated by collinearity with other predictors in the model. A high VIF indicates that the predictor is largely explained by the other predictors in the model.

The VIF is calculated separately for each predictor in a regression model. A VIF of 1 indicates that the predictor is uncorrelated with every other predictor in the model.

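A VIF greater than 1 suggests that the predictor is correlated with other predictors in the model. The VIF can also be applied to categorical predictors encoded as dummy variables, although dummy columns derived from the same variable are correlated by construction, so their VIFs should be read with care.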

The VIF can identify which predictors are most affected by collinearity in a regression model. However, it is important to remember that the VIF is only a measure of collinearity and does not indicate whether a predictor is important in the model.

Calculate Variance Inflation Factor (VIF) in Python

To use the VIF in Python, we can use the statsmodels library. The VIF is calculated with the variance_inflation_factor function from the statsmodels.stats.outliers_influence module.

For each predictor, the function fits an ordinary least squares regression of that predictor on all the other columns we pass in and converts the resulting R^2 into a VIF. One way to see the effect of collinearity is to calculate the VIFs for a set of predictors, add an interaction term built from those predictors, and calculate the VIFs again.

The model with the interaction term included will usually have higher VIFs, indicating that the interaction term is causing multicollinearity. The VIF is calculated for each predictor variable in the model; it describes the predictors, not the model's overall fit.
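The comparison with and without an interaction term can be sketched on synthetic data. This is a minimal illustration, not the statsmodels implementation: the helper `vifs` is a made-up name, and it relies on the standard identity that the VIFs equal the diagonal of the inverse correlation matrix of the predictors. The predictors are given non-zero means on purpose, since that is what makes the product term track its components.

```python
import numpy as np

def vifs(X):
    """VIFs as the diagonal of the inverse correlation matrix of the columns of X."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

rng = np.random.default_rng(1)
n = 1000
x1 = rng.normal(loc=5.0, size=n)   # independent predictors with non-zero means
x2 = rng.normal(loc=5.0, size=n)

base = vifs(np.column_stack([x1, x2]))
with_inter = vifs(np.column_stack([x1, x2, x1 * x2]))   # add the interaction x1 * x2

print("without interaction:", base)
print("with interaction:   ", with_inter)
```

Without the interaction term both VIFs should sit close to 1; with it, all three should be much larger. If the predictors were centered at zero before forming the product, the increase would be far smaller.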

The VIF can be calculated using the following formula:

VIF_j = \frac{1}{1-R_j^{2}}

Here R_j^2 is the coefficient of determination obtained by regressing the j-th predictor variable on all the other predictor variables.
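The formula can be applied by hand with nothing more than NumPy. The sketch below is an illustration under assumptions: the function name `vif_from_scratch` and the synthetic columns are invented for this example, and an intercept is included in each auxiliary regression.

```python
import numpy as np

def vif_from_scratch(X, j):
    """VIF of column j: regress X[:, j] on the other columns plus an intercept."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(target)), others])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + np.sqrt(1 - 0.81) * rng.normal(size=n)   # correlation with x1 near 0.9
x3 = rng.normal(size=n)                                   # unrelated to x1 and x2
X = np.column_stack([x1, x2, x3])

vif_values = [vif_from_scratch(X, j) for j in range(X.shape[1])]
print(vif_values)
```

With a correlation of about 0.9 between `x1` and `x2`, their VIFs should land near 1 / (1 - 0.81), roughly 5, while the VIF of `x3` should stay close to 1.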

The VIF is typically used to assess multicollinearity in a linear regression model. However, because it depends only on the predictor variables and not on the response, we can also use it in other regression models, such as logistic regression and Poisson regression.

The VIF does not assess the model's overall fit. Instead, it identifies predictor variables that are highly correlated with other predictor variables in the model.

In statistics, every estimated regression coefficient has a sampling error, summarized by its variance. However, the variance of a coefficient on its own does not tell us how much of that uncertainty is caused by collinearity.

Variance inflation factor (VIF) is a statistical measure of the effects of multicollinearity in a regression analysis. It can be read as the ratio VIF = λ 1 / λ 2, where λ 1 is the variance of a variable's coefficient in the regression model, and λ 2 is the variance that coefficient would have in a second regression model in which the variable is uncorrelated with the other predictors.
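The variance-ratio reading can be made explicit with the standard expression for the variance of an ordinary least squares coefficient. The notation here is an assumption introduced for this sketch: \sigma^2 is the error variance, x_{ij} is the value of predictor j for observation i, \bar{x}_j is its mean, and R_j^2 is the coefficient of determination from regressing predictor j on the other predictors.

```latex
\operatorname{Var}(\hat{\beta}_j)
  = \frac{\sigma^2}{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}
    \cdot \frac{1}{1 - R_j^2}
```

The first factor is the variance the coefficient would have if predictor j were uncorrelated with the others (R_j^2 = 0), and the second factor is the VIF. For example, R_j^2 = 0.9 gives a VIF of 10, so the standard error of the coefficient is inflated by \sqrt{10}, about 3.16.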

As a common rule of thumb, VIF > 10 indicates serious multicollinearity among the independent variables, and some analysts use a stricter cutoff of 5. Let’s learn VIF via the Python code example below:

Example Code:

import statsmodels.api as sm
import statsmodels.tools.tools as smt
import statsmodels.stats.outliers_influence as smo

# Load the HousePrices data set from the R package AER (downloaded on first use)
hp = sm.datasets.get_rdataset(dataname="HousePrices", package="AER", cache=True).data
print(hp.iloc[:, 0:5].head(3))

# Keep the predictors only: lotsize, bedrooms, bathrooms, stories (column 0 is price)
ivar = hp.iloc[:, 1:5]
print(ivar.head(3))

# Append an intercept column so the auxiliary regression includes a constant
ivarc = smt.add_constant(data=ivar, prepend=False)

# VIF of the predictor at column index 0 (lotsize)
vif_lotsize = smo.variance_inflation_factor(exog=ivarc.values, exog_idx=0)
print(vif_lotsize)

Output:

     price  lotsize  bedrooms  bathrooms  stories
0  42000.0     5850         3          1        2
1  38500.0     4000         2          1        1
2  49500.0     3060         3          1        1
   lotsize  bedrooms  bathrooms  stories
0     5850         3          1        2
1     4000         2          1        1
2     3060         3          1        1
1.047054041442195

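As we have seen, the VIF can be calculated in several ways, including with ready-made functions such as the one in statsmodels. In addition, VIF is one of the many metrics that can help you understand the relationship between one predictor variable and the other predictors in a model.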

It is essential to know that VIF is a practical diagnostic rather than a formal statistical test. Its cutoffs are rules of thumb, so the VIF value should be weighed against the goals of the analysis when deciding if multicollinearity is a problem.

Zeeshan Afridi

Zeeshan is a detail-oriented software engineer who helps companies and individuals make their lives easier with software solutions.