How to Calculate the R-Squared Statistic in R

Jesse John Feb 02, 2024
  1. Formula for the R-Squared Statistic
  2. Obtain the R-Squared Statistic From the Linear Regression Model
  3. Calculate the R-Squared Statistic Manually
  4. Create a Custom Function to Compute R-Squared
  5. Getting Help
  6. Conclusion

The R-squared statistic is a number used to assess how well a linear regression model fits the data. It gives the proportion of the variance of the dependent variable that is explained by the model’s independent variables.

The R-squared statistic pertains to linear regression models only. In a linear regression model, the dependent variable is quantitative.

The model assumes that the dependent variable depends linearly on the independent variables. R-squared is also relevant for simple extensions of the linear model, such as models with polynomial and interaction terms.

The other points to note about the R-squared statistic are:

  • It is a proportion, so it does not have a unit.
  • Since it is a proportion, it always ranges from 0 to 1.
  • A value of 0 means that the model explains no variance.
  • A value of 1 means that the model explains all variance.
  • Usually, a model that gives a higher value of R-squared is considered better.
  • Adding more independent variables never decreases the value of R-squared, even when the new variables are unrelated to the dependent variable. This can overfit the model to the data, so it is sometimes better to choose a model with a slightly lower R-squared over one that overfits the data.
  • The R-squared statistic is valid only for the data used to fit the model. Although the model can predict values of the dependent variable for a different data set, we should not compute an R-squared statistic from such predictions.

Formula for the R-Squared Statistic

Suppose we try to predict a variable Y using some independent variables. We build our linear regression model using complete observations in which the values of all variables, including the dependent variable, Y, are known.

The arithmetic mean of Y is the estimated value of Y in the absence of a linear regression model. It is taken as the baseline.

The differences between the observed values of Y and the arithmetic mean of Y are the total deviations. The sum of the squares of these differences is called the Total Sum of Squares, TSS.

The differences between the observed values of Y and Y that the linear regression model predicts are the residual deviations. These deviations remain even after we attempt to predict Y using the independent variables in a linear regression model.

The sum of the squares of these differences is called the Residual Sum of Squares, RSS. The R-squared statistic is computed as (TSS - RSS)/TSS, or equivalently 1 - RSS/TSS. This is the proportion of Y’s variance that the model explains.

A better model will usually have a higher R-squared statistic because RSS will be lower.
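
To see the formula in action, the following sketch computes TSS, RSS, and R-squared directly. The vectors y_obs and y_pred are made-up numbers standing in for observed values and a model’s predictions, purely to illustrate the arithmetic.

Example Code:

# Hypothetical observed values and model predictions (made-up numbers).
y_obs = c(3, 5, 4, 7, 6)
y_pred = c(3.5, 4.5, 5, 6, 6.5)

# Total Sum of Squares: squared deviations of Y from its mean.
TSS = sum((y_obs - mean(y_obs))^2)

# Residual Sum of Squares: squared deviations of Y from the predictions.
RSS = sum((y_obs - y_pred)^2)

# R-squared: the proportion of variance explained.
(TSS - RSS) / TSS

Output:

[1] 0.725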

Obtain the R-Squared Statistic From the Linear Regression Model

The basic installation of R includes the linear modeling function, lm(), as part of the stats package, which is loaded by default.

It helps us easily build linear regression models. The first argument for this function is the specification of the model.

We will specify that Y depends on X using the formula Y ~ X. The second argument is the source of the data: the data frame in which R will find the variables X and Y.

The summary() function reports the R-squared statistic of the model. In the following example, we will build a simple linear regression model and get R to report the value of R-squared.

Example Code:

# Create sample data.
# Independent variable, X.
X = c(1, 2, 3, 4, 5)
# Dependent variable, Y, roughly of the form 2X + 1, with added error.
Y = c(2, 6, 7, 10, 9)

# Make a data frame, df.
df = data.frame(X, Y)

# Build the linear regression model named lin.mod.
lin.mod = lm(formula = Y ~ X, data = df)

# Check R^2 in the model summary. It is labeled Multiple R-squared.
summary(lin.mod)

Output:

Call:
lm(formula = Y ~ X, data = df)

Residuals:
   1    2    3    4    5
-1.2  1.0  0.2  1.4 -1.4

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.4000     1.5319   0.914    0.428
X             1.8000     0.4619   3.897    0.030 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.461 on 3 degrees of freedom
Multiple R-squared:  0.8351,	Adjusted R-squared:  0.7801
F-statistic: 15.19 on 1 and 3 DF,  p-value: 0.02998

The call to the summary() function generates a lot of output. The value labeled Multiple R-squared is what we set out to obtain.
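
If we want just the number rather than the full summary, we can extract it from the summary object, which stores the statistic in its r.squared component (the adj.r.squared component holds the adjusted version).

Example Code:

# Extract only the R-squared statistic from the model summary.
summary(lin.mod)$r.squared

Output:

> summary(lin.mod)$r.squared
[1] 0.8350515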

Calculate the R-Squared Statistic Manually

The R-squared statistic can also be obtained using the function for the correlation coefficient, cor(). However, we must take care to use it only in the right context:

  • In the case of simple linear regression, R-squared is equal to the square of the correlation coefficient between Y and X, where X is the independent variable.
  • In the case of multiple linear regression, R-squared equals the square of the correlation coefficient between Y and the values of Y predicted by the model (we illustrate this below, after the simple case).

The following example illustrates this calculation with the simple linear regression model we created.

Example Code:

# The cor() function gives the correlation coefficient.
# Its square is equal to the R-squared statistic.
cor(X,Y)^2

Output:

> cor(X,Y)^2
[1] 0.8350515

We find that the square of the correlation coefficient is the same as the R-squared value reported by the summary of the linear regression model.
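
For a multiple linear regression model, we would instead square the correlation between Y and the model’s predictions. The fitted() function from the stats package returns these predicted values, so we can check the idea on our simple model, where it gives the identical result.

Example Code:

# Square of the correlation between Y and the model's predicted values.
# For simple linear regression, this matches cor(X, Y)^2.
cor(Y, fitted(lin.mod))^2

Output:

[1] 0.8350515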

Create a Custom Function to Compute R-Squared

If we often need to manually compute the R-squared statistic between two numeric vectors of equal length, we can write a custom function to simplify our task.

The custom function will be of the form function_name = function(arguments) {body of function}. It will be called as function_name(arguments). We will use the cor() function inside our custom function.

Example Code:

# Define the custom function.
VectorRSq = function(x, y) {cor(x, y)^2}

# Call the custom function.
VectorRSq(X, Y)

Output:

> VectorRSq(X, Y)
[1] 0.8350515
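
Since the function is intended for two numeric vectors of equal length, we could optionally build those checks into it. The variant below, with the hypothetical name VectorRSq2, is a small refinement of our own, not something required by R.

Example Code:

# A variant of the custom function with simple input validation.
VectorRSq2 = function(x, y) {
  stopifnot(is.numeric(x), is.numeric(y), length(x) == length(y))
  cor(x, y)^2
}

# Called the same way: VectorRSq2(X, Y) returns 0.8350515 as before.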

Getting Help

To learn more about the R-squared statistic, read the chapter on Linear Regression in the excellent textbook, An Introduction to Statistical Learning, available online for free.

To learn more about making custom functions in R, read the chapter on Functions in Hadley Wickham’s online book R for Data Science.

For help with the lm(), cor(), or c() functions, click Help > Search R Help in the RStudio menu and enter the function name, without parentheses, in the search box.

Conclusion

The R-squared statistic is used to assess how well a linear regression model fits the data. It is only valid for the data used to create the model. The summary generated from the linear modeling function, lm(), gives us the value of the R-squared statistic.

If required, we can compute the R-squared statistic between two numeric vectors of equal length using the cor() function. It can be done directly or through a custom function.

Author: Jesse John

Jesse is passionate about data analysis and visualization. He uses the R statistical programming language for all aspects of his work.