Multiple Regression in Python

Shivam Arora Oct 10, 2023

This tutorial will discuss multiple linear regression and how to implement it in Python.

Multiple linear regression is a model that estimates the relationship between two or more independent variables and a single response variable by fitting a linear equation to the observed data. It quantifies how the dependent variable changes in response to changes in the independent variables. In standard multiple linear regression, all the independent variables are taken into account simultaneously.
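In equation form, the response is modeled as an intercept plus one slope per predictor. A minimal sketch of this idea (the predictors, coefficients, and noise level below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two illustrative predictors
x1 = rng.uniform(0, 10, size=50)
x2 = rng.uniform(0, 10, size=50)

# Chosen "true" coefficients: intercept b0 plus one slope per predictor
b0, b1, b2 = 1.5, 0.8, -0.3

# The model: y = b0 + b1*x1 + b2*x2, plus random noise
y = b0 + b1 * x1 + b2 * x2 + rng.normal(scale=0.5, size=50)
```

Fitting a multiple regression to data like this should recover coefficients close to `b0`, `b1`, and `b2`.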

Use the `statsmodels.api` Module to Perform Multiple Linear Regression in Python

The `statsmodels.api` module in Python provides functions for implementing linear regression. We will use the `OLS()` function, which performs ordinary least squares regression.

We can either import a dataset using the `pandas` module or create our own dummy data to perform multiple regression. We split the data into dependent and independent variables before fitting the linear regression model.
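For the `pandas` route, the split looks like the following sketch (the DataFrame stands in for a `pd.read_csv()` call, and the column names are hypothetical):

```python
import pandas as pd

# Hypothetical data standing in for something like pd.read_csv("housing.csv")
df = pd.DataFrame({
    "area": [50, 60, 80, 100, 120],
    "rooms": [1, 2, 2, 3, 4],
    "price": [110, 130, 170, 205, 250],
})

# Independent variables (X) and the dependent variable (y)
X = df[["area", "rooms"]]
y = df["price"]
```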

We create a regression model using the `OLS()` function. Then, we pass the independent and dependent variables to this function and fit the model using the `fit()` function. In our example, we have created some arrays to demonstrate multiple regression.

See the code below.

```python
import statsmodels.api as sm
import numpy as np

y = [1, 2, 3, 4, 3, 4, 5, 3, 5, 5, 4, 5, 4, 5, 4, 5, 6, 0, 6, 3, 1, 3, 1]
X = [
    [0, 2, 4, 1, 5, 4, 5, 9, 9, 9, 3, 7, 8, 8, 6, 6, 5, 5, 5, 6, 6, 5, 5],
    [4, 1, 2, 3, 4, 5, 6, 7, 5, 8, 7, 8, 7, 8, 7, 8, 6, 8, 9, 2, 1, 5, 6],
    [4, 1, 2, 5, 6, 7, 8, 9, 7, 8, 7, 8, 7, 4, 3, 1, 2, 3, 4, 1, 3, 9, 7],
]


def reg_m(y, x):
    # Build the design matrix: stack each predictor row as a column,
    # newest row first, and append a column of ones for the intercept.
    # Because of this stacking order, x1 in the summary corresponds to
    # the LAST row of X, x2 to the second row, and x3 to the first.
    ones = np.ones(len(x[0]))
    design = np.column_stack((x[0], ones))
    for ele in x[1:]:
        design = np.column_stack((ele, design))
    results = sm.OLS(y, design).fit()
    return results


print(reg_m(y, X).summary())
```

Output:

```
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.241
Method:                 Least Squares   F-statistic:                     2.007
Date:                Wed, 16 Jun 2021   Prob (F-statistic):              0.147
Time:                        23:57:15   Log-Likelihood:                -40.810
No. Observations:                  23   AIC:                             89.62
Df Residuals:                      19   BIC:                             94.16
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1            -0.0287      0.135     -0.213      0.834      -0.311       0.254
x2             0.2684      0.160      1.678      0.110      -0.066       0.603
x3             0.1339      0.160      0.839      0.412      -0.200       0.468
const          1.5123      0.986      1.534      0.142      -0.551       3.576
==============================================================================
Omnibus:                        9.472   Durbin-Watson:                   2.447
Prob(Omnibus):                  0.009   Jarque-Bera (JB):                7.246
Skew:                          -1.153   Prob(JB):                       0.0267
Kurtosis:                       4.497   Cond. No.                         29.7
==============================================================================
```

The `summary()` function prints the results and coefficients of the regression. The `R-squared` and `Adj. R-squared` values tell us how well the model fits the data.

Use the `numpy.linalg.lstsq` Method to Perform Multiple Linear Regression in Python

The `numpy.linalg.lstsq` method returns the least-squares solution to the equation `Ax = B` by computing a vector `x` that minimizes the Euclidean 2-norm `||B - Ax||`.

We can use it to perform multiple regression as shown below.

```python
import numpy as np

y = [1, 2, 3, 4, 3, 4, 5, 3, 5, 5, 4, 5, 4, 5, 4, 5, 6, 0, 6, 3, 1, 3, 1]
X = [
    [0, 2, 4, 1, 5, 4, 5, 9, 9, 9, 3, 7, 8, 8, 6, 6, 5, 5, 5, 6, 6, 5, 5],
    [4, 1, 2, 3, 4, 5, 6, 7, 5, 8, 7, 8, 7, 8, 7, 8, 6, 8, 9, 2, 1, 5, 6],
    [4, 1, 2, 5, 6, 7, 8, 9, 7, 8, 7, 8, 7, 4, 3, 1, 2, 3, 4, 1, 3, 9, 7],
]

X = np.transpose(X)  # transpose so each row is one observation
X = np.c_[X, np.ones(X.shape[0])]  # append a column of ones for the intercept
linreg = np.linalg.lstsq(X, y, rcond=None)[0]
print(linreg)
```

Output:

```
[ 0.1338682   0.26840334 -0.02874936  1.5122571 ]
```

We can match each coefficient against the previous method's output and see that the values agree; note that `lstsq` returns the coefficients in the order of the rows of `X`, with the intercept term last. Here the final result is a NumPy array.

Use the `scipy.optimize.curve_fit()` Method to Perform Multiple Linear Regression in Python

The `curve_fit()` method fits a user-defined model function to the given data using non-linear least squares. Because our model function is linear in its parameters, this fit is equivalent to a linear regression.

See the code below.

```python
import numpy as np
from scipy.optimize import curve_fit


def function_calc(x, a, b, c):
    # Linear model using the first two predictor rows of x
    return a + b * x[0] + c * x[1]


y = [1, 2, 3, 4, 3, 4, 5, 3, 5, 5, 4, 5, 4, 5, 4, 5, 6, 0, 6, 3, 1, 3, 1]
X = [
    [0, 2, 4, 1, 5, 4, 5, 9, 9, 9, 3, 7, 8, 8, 6, 6, 5, 5, 5, 6, 6, 5, 5],
    [4, 1, 2, 3, 4, 5, 6, 7, 5, 8, 7, 8, 7, 8, 7, 8, 6, 8, 9, 2, 1, 5, 6],
    [4, 1, 2, 5, 6, 7, 8, 9, 7, 8, 7, 8, 7, 4, 3, 1, 2, 3, 4, 1, 3, 9, 7],
]

popt, pcov = curve_fit(function_calc, np.array(X), y)
print(popt)
print(pcov)
```

Output:

```
[1.44920591 0.12720273 0.26001833]
[[ 0.84226681 -0.06637804 -0.06977243]
 [-0.06637804  0.02333829 -0.01058201]
 [-0.06977243 -0.01058201  0.02288467]]
```