Polynomial Regression in R

Polynomial regression can be defined as linear regression in which the relationship between the independent x and dependent y will be modeled as the nth degree polynomial. This tutorial demonstrates how to perform polynomial regression in R.

Polynomial Regression in R

The polynomial regression will fit a nonlinear relationship between x and the mean of y. It will add the polynomial or quadratic terms to the regression.

This regression is used for one resultant variable and a predictor. The polynomial regression is mainly used in:

  1. Progression of epidemic diseases
  2. Calculation of the growth rate of tissues
  3. Distribution of isotopes of carbon in sediments

We can use ggplot2 to plot the polynomial regression in R. If this package is not yet installed; first, we need to install it:

install.packages('ggplot2')

Here is the step-by-step process of polynomial regression.

Data Creation

We create a data frame with the data of delftstack students: the number of hours studied, final exam marks, and the total number of students in the class, which is 60.

Example:

#create data frame
delftstack <- data.frame(hours = runif(60, 6, 20), marks=60)
delftstack$marks = delftstack$marks + delftstack$hours^3/160 + delftstack$hours*runif(60, 1, 2)

#view the head of the data
head(delftstack)

This code will create the data used later for the polynomial regression.

Output:

      hours     marks
1  7.106636  71.33509
2  8.501039  74.93339
3 18.051042 124.92229
4 19.153316 141.40656
5 18.306620 118.47464
6  6.240467  70.53522

Data Visualization

The next step is to visualize the data. Before creating the regression model, we need to show the relationship between the number of hours studied and final exam marks.

Example:

# Visualization
library(ggplot2)

ggplot(delftstack, aes(x=hours, y=marks)) + geom_point()

The code above will plot the graph of our data:

Visualization Plot

Fit the Polynomial Regression Models

The next step is to fit the polynomial regression models with degrees 1 to 6 and k-fold cross-validation with k=10.

Example:

#shuffle data
delftstack.shuffled <- delftstack[sample(nrow(df)),]

# number of k-fold cross-validation
K <- 10

#define the degree of polynomials to fit
degree <- 6

# now create k equal-sized folds
fold <- cut(seq(1,nrow(delftstack.shuffled)),breaks=K,labels=FALSE)

#The object to hold MSE's of models
mse_object = matrix(data=NA,nrow=K,ncol=degree)

#K-fold cross validation
for(i in 1:K){
    #testing and training data
    test_indexes <- which(fold==i,arr.ind=TRUE)
    test_data <- delftstack.shuffled[test_indexes, ]
    train_data <- delftstack.shuffled[-test_indexes, ]

    # using k-fold cv for models evaluation
    for (j in 1:degree){
        fit.train = lm(marks ~ poly(hours,j), data=train_data)
        fit.test = predict(fit.train, newdata=test_data)
        mse_object[i,j] = mean((fit.test-test_data$marks)^2)
    }
}

# MSE for each degree
colMeans(mse_object)

Output:

[1] 26.13112 15.45428 15.87187 16.88782 18.13103 19.10502

There are six models, and the MSEs for each model are given in the output of the code above. This output is for degrees h=1 to h=6, respectively.

The model with the lowest MSE will be our polynomial regression model, which is h=2 because the MSE value of h=2 is less than all others.

Analyze the Final Model

Finally, let’s analyze the final model and show the summary of the best-fitted model:

Example:

#fitting the best model
best_model = lm(marks ~ poly(hours,2, raw=T), data=delftstack)

#summary of the best model
summary(best_model)

The code above will show the summary of the best-fitted model.

Output:

Call:
lm(formula = marks ~ poly(hours, 2, raw = T), data = delftstack)

Residuals:
   Min     1Q Median     3Q    Max
-8.797 -2.598  0.337  2.443  9.872

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
(Intercept)              68.42847    5.54533  12.340  < 2e-16 ***
poly(hours, 2, raw = T)1 -1.07557    0.93476  -1.151    0.255
poly(hours, 2, raw = T)2  0.22958    0.03577   6.418 2.95e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.204 on 57 degrees of freedom
Multiple R-squared:  0.9669,    Adjusted R-squared:  0.9657
F-statistic: 831.9 on 2 and 57 DF,  p-value: < 2.2e-16

From the output, we can see that marks = 68.42847 - 1.07557*(hours) + .22958*(hours)2. Using this equation, we can predict how many marks a student will get based on the number of hours studied.

For example, if a student studied for 5 hours, the calculation will be:

marks = 68.42847 - 1.07557*(5) + .22958*(5)2

marks = 68.42847 - 1.07557*5 + .22958*25

marks = 68.42847 - 5.37785 + 5.7395

marks = 68.79012

If a student studies for 5 hours, he will get 68.79012 marks in the final exam.

Lastly, we can plot the fitted model to check how well it corresponds to the raw data:

ggplot(delftstack, aes(x=hours, y=marks)) +
    geom_point() +
    stat_smooth(method='lm', formula = y ~ poly(x,2), size = 1) +
    xlab('Hours Studied') +
    ylab('Marks')

Output (Plot):

Final Plot

Related Article - R Regression

  • Piecewise Regression in R
  • Lasso Regression in R
  • Logistic Regression in R