Multiple Regression in Python


Shivam Arora · Oct-12, 2021 · Jun-19, 2021 · Python Regression
  1. Use the statsmodels.api Module to Perform Multiple Linear Regression in Python
  2. Use the numpy.linalg.lstsq Method to Perform Multiple Linear Regression in Python
  3. Use the scipy.optimize.curve_fit() Method to Perform Multiple Linear Regression in Python

This tutorial will discuss multiple linear regression and how to implement it in Python.

Multiple linear regression is a model that estimates the relationship between two or more independent variables and a single response variable by fitting a linear equation to the observed data. It quantifies how much the dependent variable changes in response to changes in the independent variables. In standard multiple linear regression, all the independent variables are taken into account simultaneously.
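Concretely, the model predicts the response as a weighted sum of the predictors plus an intercept. The sketch below illustrates this for two predictors; the coefficient values are made-up placeholders, not fitted estimates:

```python
# Multiple linear regression prediction: y_hat = b0 + b1*x1 + b2*x2
b0, b1, b2 = 1.5, 0.3, -0.1  # hypothetical intercept and slopes

def predict(x1, x2):
    # Weighted sum of the predictors plus the intercept
    return b0 + b1 * x1 + b2 * x2

print(predict(2.0, 4.0))  # 1.5 + 0.3*2.0 - 0.1*4.0, approximately 1.7
```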

Use the statsmodels.api Module to Perform Multiple Linear Regression in Python

The statsmodels.api module in Python provides functions to implement linear regression. We will use the OLS() function, which performs ordinary least squares regression.

We can either import a dataset using the pandas module or create our own dummy data to perform multiple regression. We then separate the dependent and independent variables before fitting the linear regression model between them.
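For example, if the data lives in a pandas DataFrame, the dependent and independent variables can be separated by selecting columns. The column names and values below are made up purely for illustration:

```python
import pandas as pd

# Hypothetical dummy data; in practice it might come from pd.read_csv()
df = pd.DataFrame({
    "x1": [0, 2, 4, 1, 5],
    "x2": [4, 1, 2, 3, 4],
    "y":  [1, 2, 3, 4, 3],
})

features = df[["x1", "x2"]]  # independent variables
target = df["y"]             # dependent (response) variable
print(features.shape, target.shape)  # (5, 2) (5,)
```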

We create a regression model using the OLS() function, passing in the dependent and independent variables, and then fit the model using the fit() function. In our example, we have created some arrays to demonstrate multiple regression.

See the code below.

import statsmodels.api as sm
import numpy as np

y = [1,2,3,4,3,4,5,3,5,5,4,5,4,5,4,5,6,0,6,3,1,3,1] 
X = [[0,2,4,1,5,4,5,9,9,9,3,7,8,8,6,6,5,5,5,6,6,5,5],
     [4,1,2,3,4,5,6,7,5,8,7,8,7,8,7,8,6,8,9,2,1,5,6],
     [4,1,2,5,6,7,8,9,7,8,7,8,7,4,3,1,2,3,4,1,3,9,7]]

def reg_m(y, x):
    # Build the design matrix: start with the first predictor plus a
    # column of ones for the intercept, then prepend each remaining
    # predictor as an additional column.
    ones = np.ones(len(x[0]))
    X = sm.add_constant(np.column_stack((x[0], ones)))
    for ele in x[1:]:
        X = sm.add_constant(np.column_stack((ele, X)))
    results = sm.OLS(y, X).fit()
    return results

print(reg_m(y, X).summary())

Output:

 OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.241
Model:                            OLS   Adj. R-squared:                  0.121
Method:                 Least Squares   F-statistic:                     2.007
Date:                Wed, 16 Jun 2021   Prob (F-statistic):              0.147
Time:                        23:57:15   Log-Likelihood:                -40.810
No. Observations:                  23   AIC:                             89.62
Df Residuals:                      19   BIC:                             94.16
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1            -0.0287      0.135     -0.213      0.834      -0.311       0.254
x2             0.2684      0.160      1.678      0.110      -0.066       0.603
x3             0.1339      0.160      0.839      0.412      -0.200       0.468
const          1.5123      0.986      1.534      0.142      -0.551       3.576
==============================================================================
Omnibus:                        9.472   Durbin-Watson:                   2.447
Prob(Omnibus):                  0.009   Jarque-Bera (JB):                7.246
Skew:                          -1.153   Prob(JB):                       0.0267
Kurtosis:                       4.497   Cond. No.                         29.7
==============================================================================

The summary() function allows us to print the results and coefficients of the regression. The R-squared and adjusted R-squared values indicate how well the model fits the data.

Use the numpy.linalg.lstsq Method to Perform Multiple Linear Regression in Python

The numpy.linalg.lstsq method returns the least-squares solution to an equation of the form Ax = B by computing the vector x that minimizes the Euclidean norm ||B - Ax||.

We can use it to perform multiple regression as shown below.

import numpy as np

y = [1,2,3,4,3,4,5,3,5,5,4,5,4,5,4,5,6,0,6,3,1,3,1] 
X = [[0,2,4,1,5,4,5,9,9,9,3,7,8,8,6,6,5,5,5,6,6,5,5],
     [4,1,2,3,4,5,6,7,5,8,7,8,7,8,7,8,6,8,9,2,1,5,6],
     [4,1,2,5,6,7,8,9,7,8,7,8,7,4,3,1,2,3,4,1,3,9,7]]
X = np.transpose(X)  # transpose so that each row is one observation
X = np.c_[X, np.ones(X.shape[0])]  # add bias term
linreg = np.linalg.lstsq(X, y, rcond=None)[0]
print(linreg)

Output:

[ 0.1338682   0.26840334 -0.02874936  1.5122571 ]

Comparing the coefficients for each variable with those from the previous method shows that the estimates match. Here the final result is returned in a NumPy array.
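As a sanity check of how lstsq works, here is a minimal sketch on synthetic data: we fit a line to points that lie exactly on y = 2x + 1, then multiply the design matrix by the coefficient vector and confirm that the residual norm is essentially zero.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])  # exactly y = 2x + 1
A = np.c_[x, np.ones(x.shape[0])]   # columns: predictor, bias term

coef, residuals, rank, sv = np.linalg.lstsq(A, y, rcond=None)
print(coef)                          # approximately [2. 1.]

y_hat = A @ coef                     # fitted values
print(np.linalg.norm(y - y_hat))     # near zero for this exact fit
```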

Use the scipy.optimize.curve_fit() Method to Perform Multiple Linear Regression in Python

The curve_fit() method takes a model function that computes predicted values from the independent variables and a set of parameters, then uses non-linear least squares to fit that function to the given data. Note that the model function in this example uses only the first two predictor rows.

See the code below.

from scipy.optimize import curve_fit
import numpy as np

def function_calc(x, a, b, c):
    # Linear model on the first two predictor rows: y = a + b*x1 + c*x2
    return a + b*x[0] + c*x[1]

y = [1,2,3,4,3,4,5,3,5,5,4,5,4,5,4,5,6,0,6,3,1,3,1]
X = [[0,2,4,1,5,4,5,9,9,9,3,7,8,8,6,6,5,5,5,6,6,5,5],
     [4,1,2,3,4,5,6,7,5,8,7,8,7,8,7,8,6,8,9,2,1,5,6],
     [4,1,2,5,6,7,8,9,7,8,7,8,7,4,3,1,2,3,4,1,3,9,7]]

popt, pcov = curve_fit(function_calc, X, y)
print(popt)
print(pcov)

Output:

[1.44920591 0.12720273 0.26001833]
[[ 0.84226681 -0.06637804 -0.06977243]
 [-0.06637804  0.02333829 -0.01058201]
 [-0.06977243 -0.01058201  0.02288467]]
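The diagonal of pcov holds the estimated variance of each fitted parameter, so taking its square root gives the parameter standard errors. A minimal sketch on synthetic, roughly linear data:

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b):
    # Simple line: y = a + b*x
    return a + b * x

xdata = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ydata = np.array([1.1, 2.9, 5.2, 6.8, 9.1])  # roughly y = 1 + 2x

popt, pcov = curve_fit(model, xdata, ydata)
perr = np.sqrt(np.diag(pcov))  # one-sigma standard errors of a and b
print(popt)  # fitted [a, b], close to [1, 2]
print(perr)
```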
