How to Perform Stepwise Regression in Python

Muhammad Maisam Abbas Feb 02, 2024
  1. Stepwise Regression in Python
  2. Stepwise Regression With the statsmodels Library in Python
  3. Stepwise Regression With the sklearn Library in Python
  4. Stepwise Regression With the mlxtend Library in Python

This tutorial discusses methods for performing stepwise regression in Python.

Stepwise Regression in Python

Stepwise regression is a method used in statistics and machine learning to select a subset of features for building a linear regression model. Stepwise regression aims to minimize the model’s complexity while maintaining a high accuracy level.

This method is particularly useful in cases where the number of features is large, and it’s unclear which features are important for the prediction.

Stepwise Regression With the statsmodels Library in Python

The statsmodels library provides the OLS() class, which fits an ordinary least squares model. statsmodels does not ship a ready-made stepwise selection function: OLS() fits a single model on whatever columns it is given, and its summary reports the coefficient p-values that a stepwise procedure relies on.

A stepwise procedure built on OLS() typically combines forward selection and backward elimination. It starts with an empty model and adds variables one by one based on the significance of their coefficients. Variables that are no longer significant are eliminated from the model.

Here is an example of how to fit an OLS model in statsmodels and inspect its summary.

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Load the data
data = pd.read_csv("data.csv")

# Define the dependent and independent variables
x = data.drop("EstimatedSalary", axis=1)
y = data["EstimatedSalary"]

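# Fit an OLS model on all features (add_constant adds the intercept term,
# which OLS() does not include by default)
result = sm.OLS(y, sm.add_constant(x)).fit()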

# Print the summary of the model
print(result.summary())

Output:

[Image: OLS regression summary table]

We first load the data in the above code example and define the dependent and independent variables. Then, we fit the model with the OLS() class from statsmodels.api and print the model summary, which includes information such as the coefficients of the variables, p-values, and R-squared value. The p-values are what a stepwise procedure uses to decide which variables to add or drop.

Stepwise Regression With the sklearn Library in Python

The sklearn library provides the RFE (Recursive Feature Elimination) class, which performs a backward-elimination style of feature selection. This method starts with all features and recursively eliminates features based on their importance.

The RFE method uses a specified estimator (such as a linear regression model) to estimate the importance of the features and recursively removes the least important feature at each iteration.

Here is an example of how to use the RFE method in sklearn.

import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Load the data
data = pd.read_csv("data.csv")

# Define the dependent and independent variables
x = data.drop("EstimatedSalary", axis=1)
y = data["EstimatedSalary"]

# Create a linear regression estimator
estimator = LinearRegression()

# Create the RFE object and specify the number of features to select
selector = RFE(estimator, n_features_to_select=5)

# Fit the RFE object to the data
selector = selector.fit(x, y)

# Print the selected features
print(x.columns[selector.support_])

Output:

Index(['Tenure', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'Exited'], dtype='object')

We first load the data in the above code example and define the dependent and independent variables. Then, we create a linear regression estimator and an RFE object.

We set the number of features to select as 5, which means that the final model will only include the top 5 features according to their importance. Next, we fit the RFE object to the data and print the selected features.

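It’s worth noting that the RFE class reads feature importance from the fitted estimator (its coef_ or feature_importances_ attribute), so it is important to use an appropriate estimator for the data. With a linear model, importance is the size of each coefficient, which depends on the scale of the feature. RFE can also be used with other estimators that expose one of these attributes, such as a random forest or a linear-kernel SVM.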
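Because RFE with a linear model compares coefficient sizes, features measured on very different scales are not directly comparable. The sketch below standardizes the features first and then inspects the ranking_ attribute, which records the order of elimination. The data is synthetic and the column names f0 to f3 are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption): f3 is on a much larger scale than the rest
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=["f0", "f1", "f2", "f3"])
X["f3"] *= 1000
y = 2.0 * X["f0"] + 0.5 * X["f1"] + 0.01 * X["f3"] + rng.normal(scale=0.5, size=200)

# Standardize so that coefficient sizes are comparable across features
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)

selector = RFE(LinearRegression(), n_features_to_select=2).fit(X_scaled, y)

# support_ marks the kept features; ranking_ is 1 for kept features and
# grows with how early a feature was eliminated
selected = list(X.columns[selector.support_])
ranking = dict(zip(X.columns, selector.ranking_))
print(selected)
print(ranking)
```

Here f3 has a raw coefficient of only 0.01 but the largest effect on y once the features are standardized, so it is kept alongside f0.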

Stepwise Regression With the mlxtend Library in Python

The mlxtend library provides the SequentialFeatureSelector class (imported below as SFS) for performing stepwise feature selection. It supports forward selection, backward elimination, and floating variants of both that can revisit earlier decisions.

In forward mode, the selector starts with an empty model and adds features one by one, each time choosing the feature that most improves the cross-validated score. With floating=True, features that no longer help the score can be removed again after each addition.

Here is an example of how to use the SFS class in mlxtend.

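import sys

import joblib
import pandas as pd

# Compatibility shim for old mlxtend releases that import sklearn.externals.joblib.
# It must run before mlxtend is imported; recent releases should not need it.
sys.modules["sklearn.externals.joblib"] = joblib

from sklearn.linear_model import LinearRegression
from mlxtend.feature_selection import SequentialFeatureSelector as SFS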

# Load the data
data = pd.read_csv("data.csv")

# Define the dependent and independent variables
x = data.drop("EstimatedSalary", axis=1)
y = data["EstimatedSalary"]

# Create a linear regression estimator
estimator = LinearRegression()

# Create the SFS object and specify the number of features to select
sfs = SFS(estimator, k_features=5, forward=True, floating=False, scoring="r2", cv=5)

# Fit the SFS object to the data
sfs = sfs.fit(x, y)

# Print the column indices of the selected features
# (sfs.k_feature_names_ holds the corresponding names)
print(sfs.k_feature_idx_)

Output:

(1, 2, 4, 6, 7)

We first load the data in this example and define the dependent and independent variables. Then, we create a linear regression estimator and an SFS object.

We set the number of features to select as 5, which means that the final model will only include the 5-feature subset that forward selection arrives at, judged by 5-fold cross-validated R-squared. Next, we fit the SFS object to the data and print the indices of the selected features.

It’s worth noting that the SequentialFeatureSelector class of mlxtend scores each candidate subset by cross-validating the specified estimator, so it is important to use an appropriate estimator for the data. The class also allows us to set the direction of the selection process (forward), whether floating steps are used (floating), the scoring metric (scoring), and the number of cross-validation folds (cv).

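In summary, stepwise regression is a powerful technique for feature selection in linear regression models. The statsmodels, sklearn, and mlxtend libraries support it in different ways: statsmodels supplies the OLS fits and p-values for a hand-written procedure, sklearn's RFE eliminates features according to the estimator's importance values, and mlxtend's SequentialFeatureSelector searches feature subsets by cross-validated score.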

The choice of method will depend on the problem’s specific requirements and the availability of the data. It is important to note that stepwise regression can be prone to overfitting, and using it in combination with other feature selection techniques and cross-validation is recommended.
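One way to combine feature selection with cross-validation is to put the selector inside a pipeline, so that the features are re-selected on each training fold and the held-out fold never influences the choice. The sketch below assumes sklearn's RFE as the selector and uses synthetic data; the step names "select" and "model" are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data (an assumption): only columns "a" and "c" drive y
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=["a", "b", "c", "d", "e"])
y = 3.0 * X["a"] - 2.0 * X["c"] + rng.normal(scale=0.5, size=200)

# Selection and the final model are fitted together on each training fold
pipeline = Pipeline(
    [
        ("select", RFE(LinearRegression(), n_features_to_select=2)),
        ("model", LinearRegression()),
    ]
)

# One R-squared value per fold, computed on data the selector did not see
scores = cross_val_score(pipeline, X, y, cv=5, scoring="r2")
print(scores.mean())
```

Selecting features on the full dataset first and cross-validating afterwards tends to give an overly optimistic score, which is the overfitting risk mentioned above.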
