How to Use the Predict Function on a Linear Regression Model in R

Jesse John Feb 02, 2024

After building a linear regression model with the lm() function, we can use it to predict values of the response (also called output or dependent) variable for new values of the feature (also called input or independent) variables.

In this article, we will look at the basic arguments of R’s predict() function in the context of linear regression. In particular, we will see that the function expects the input to be in a specific format with specific column names.

Use the predict() Function on a Linear Regression Model in R

To demonstrate the predict() function, we will first build a linear regression model with some sample data.

Observe the column names in the data frame, and note how they are used in the linear regression formula.

Example Code:

Feature = c(15:24)
set.seed(654)
Response = 2* c(15:24) + 5 + rnorm(10, 0,3)
DFR = data.frame(Response, Feature)
DFR

# The arguments formula and data are named for clarity.
LR_mod = lm(formula = Response~Feature, data = DFR)
LR_mod

Output:

> Feature = c(15:24)
> set.seed(654)
> Response = 2* c(15:24) + 5 + rnorm(10, 0,3)
> DFR = data.frame(Response, Feature)
> DFR
   Response Feature
1  32.71905      15
2  35.83089      16
3  44.06888      17
4  40.71729      18
5  43.28590      19
6  47.45182      20
7  50.19730      21
8  51.81954      22
9  53.22364      23
10 51.69406      24
> # The arguments formula and data are named for clarity.
> LR_mod = lm(formula = Response~Feature, data = DFR)
> LR_mod

Call:
lm(formula = Response ~ Feature, data = DFR)

Coefficients:
(Intercept)      Feature
      2.096        2.205

Use predict() to Predict the Response

Now that we have a linear regression model, we can use R’s predict() function to predict values of the response corresponding to new values of the feature variables.

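To predict the response for new data from a linear regression model, the predict() function needs two arguments (given the model object alone, it returns the fitted values for the original data).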

  1. A model object.
  2. New data.

Two considerations are important here.

  1. We need to provide the new data as a data frame, even though the feature in our example is a single variable.

    If we give a vector to the predict() function, we get an error.

  2. If the column name of the feature variable in the new data frame differs from the name of the corresponding column in the original data frame, predict() does not use the new values. In our example, it issues a warning and returns predictions for the original feature values instead.

Example Code:

# First, let us create new values for the feature variable.
NewFeature = c(20.5, 16.5, 22.5)

# If we provide a vector, we get an error.
predict(object = LR_mod, newdata = NewFeature)

# Make a data frame.
DFR2 = data.frame(NewFeature)

# Not an error this time, but a warning and the wrong output.
# R saw 3 rows in the new data but did not use them because no column is named Feature.
predict(LR_mod, newdata = DFR2)

Output:

> # First, let us create new values for the feature variable.
> NewFeature = c(20.5, 16.5, 22.5)

> # If we provide a vector, we get an error.
> predict(object = LR_mod, newdata = NewFeature)
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
  'data' must be a data.frame, environment, or list

> # Make a data frame.
> DFR2 = data.frame(NewFeature)

> # Not an error this time, but a warning and the wrong output.
> # R saw 3 rows in the new data but did not use them because no column is named Feature.
> predict(LR_mod, newdata = DFR2)
       1        2        3        4        5        6        7        8        9       10
35.17674 37.38209 39.58745 41.79280 43.99816 46.20351 48.40887 50.61422 52.81958 55.02494
Warning message:
'newdata' had 3 rows but variables found have 10 rows
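
The ten values in the output above appear to be the fitted values for the original data. Because DFR2 has no column named Feature, R falls back on the Feature vector that exists in the environment where the model was built. The following is a minimal sketch to check this, assuming the same sample data as above; the names mismatched and no_newdata are introduced here only for illustration.

```r
# Rebuild the sample model so that this block runs on its own.
Feature = c(15:24)
set.seed(654)
Response = 2 * Feature + 5 + rnorm(10, 0, 3)
DFR = data.frame(Response, Feature)
LR_mod = lm(formula = Response ~ Feature, data = DFR)

# New data whose column name does not match the model's feature name.
DFR2 = data.frame(NewFeature = c(20.5, 16.5, 22.5))

# The warning is suppressed here only to keep the output short.
mismatched = suppressWarnings(predict(object = LR_mod, newdata = DFR2))

# With no newdata at all, predict() returns the fitted values.
no_newdata = predict(object = LR_mod)

# Compare the two results, ignoring the element names.
all.equal(unname(mismatched), unname(no_newdata))
```

If the comparison returns TRUE, the new values in DFR2 were never used, which is why the warning should not be ignored.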

We must ensure two things to get the correct output from the predict() function.

  1. We must pass a data frame to the newdata argument of the predict() function. This was done above after the first error.
  2. The column names of the feature variables must be the same as those used in the original data frame to build the model. We will make this change in the following code segment.
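
The two checks above can also be made explicit in code. The following is only a sketch: safe_predict is a hypothetical helper, not part of base R, and it assumes a model whose formula uses plain column names, as in our example. It stops with an error instead of letting predict() return values for the wrong data.

```r
# Rebuild the sample model so that this block runs on its own.
Feature = c(15:24)
set.seed(654)
Response = 2 * Feature + 5 + rnorm(10, 0, 3)
DFR = data.frame(Response, Feature)
LR_mod = lm(formula = Response ~ Feature, data = DFR)

# Hypothetical helper: validate newdata before calling predict().
safe_predict = function(model, newdata) {
  # Check 1: the new data must be a data frame.
  if (!is.data.frame(newdata)) {
    stop("newdata must be a data frame")
  }
  # Check 2: every feature variable in the formula must be a column of newdata.
  needed = all.vars(delete.response(terms(model)))
  missing_cols = setdiff(needed, names(newdata))
  if (length(missing_cols) > 0) {
    stop("newdata lacks column(s): ", paste(missing_cols, collapse = ", "))
  }
  predict(object = model, newdata = newdata)
}

# A correctly named data frame passes both checks.
safe_predict(model = LR_mod, newdata = data.frame(Feature = c(20.5, 16.5, 22.5)))
```

Calling safe_predict() with a plain vector, or with a data frame whose column is named NewFeature, stops with an error rather than a warning.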

It is also good practice to name the arguments.

With these aspects addressed, we get the expected output from the predict() function.

Example Code:

# The feature column of the new data frame is given the same name as in the original data frame.
DFR3 = data.frame(Feature = NewFeature)

# Finally, predict() works as expected.
predict(LR_mod, newdata = DFR3)

Output:

> # The feature column of the new data frame is given the same name as in the original data frame.
> DFR3 = data.frame(Feature = NewFeature)

> # Finally, predict() works as expected.
> predict(LR_mod, newdata = DFR3)
       1        2        3
47.30619 38.48477 51.71690
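
As a sanity check, the predictions can be reproduced by hand from the model coefficients, since a simple linear regression predicts intercept + slope * feature. The following is a minimal sketch using the same sample data; the names b, by_hand, and from_predict are introduced here only for illustration.

```r
# Rebuild the sample model so that this block runs on its own.
Feature = c(15:24)
set.seed(654)
Response = 2 * Feature + 5 + rnorm(10, 0, 3)
DFR = data.frame(Response, Feature)
LR_mod = lm(formula = Response ~ Feature, data = DFR)

# New feature values, in a data frame with the matching column name.
NewFeature = c(20.5, 16.5, 22.5)
DFR3 = data.frame(Feature = NewFeature)

# Extract the coefficients and compute the predictions manually.
b = coef(LR_mod)
by_hand = b[["(Intercept)"]] + b[["Feature"]] * NewFeature

# Compare with the result of predict(), ignoring the element names.
from_predict = predict(object = LR_mod, newdata = DFR3)
all.equal(by_hand, unname(from_predict))
```
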
Author: Jesse John

Jesse is passionate about data analysis and visualization. He uses the R statistical programming language for all aspects of his work.