# Logistic Regression in R

1. Logistic Regression in R
2. Steps to Perform Logistic Regression in R

Logistic regression is a model for a categorical response variable with two possible values, such as `True`/`False` or `0`/`1`. It estimates the probability of one of the two outcomes given the predictor variables.

This tutorial will demonstrate how to perform logistic regression in R.

## Logistic Regression in R

In R, the `glm()` function fits generalized linear models, including logistic regression. Three of its arguments matter most here.

The first is `formula`, a symbolic description of the relationship between the response and the predictor variables. The second is `data`, the data set containing the values of these variables. The third is `family`, an R object that specifies the details of the model; for logistic regression, its value is `binomial`.
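As a minimal sketch of how these three arguments fit together, the example below fits a logistic regression on `mtcars`, a data set that ships with base R. The choice of `am` as the response and `wt` as the predictor, and the name `toy_model`, are assumptions made only for illustration.

```r
# mtcars ships with base R; am is a 0/1 column (0 = automatic, 1 = manual)
toy_model <- glm(
  am ~ wt,             # formula: response ~ predictors
  data   = mtcars,     # data: the data frame holding those columns
  family = binomial    # family: binomial selects logistic regression
)

# Estimated intercept and slope on the log-odds scale
print(coef(toy_model))
```

The same call pattern is used later in this tutorial with the `Default` data set.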

The mathematical expression for logistic regression is given below:

``````
y = 1 / (1 + e^-(a + b1*x1 + b2*x2 + b3*x3 + ...))
``````

Where:

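• `a` (the intercept) and `b1`, `b2`, `b3`, ... are numeric constants, the coefficients of the model
• `y` is the predicted probability that the response variable equals `1`
• `x1`, `x2`, `x3`, ... are the predictor variables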
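As a quick check of the formula above, the following sketch evaluates it by hand and compares the result with base R's `plogis()`, which computes the same logistic function. The coefficient and predictor values are made up for illustration.

```r
# Hypothetical coefficients and predictor values, chosen only for illustration
a <- -1.5
b <- c(0.8, -0.3)
x <- c(2.0, 1.0)

# Linear predictor: a + b1*x1 + b2*x2
eta <- a + sum(b * x)

# Logistic function written out as in the formula above
y_manual <- 1 / (1 + exp(-eta))

# plogis() from the stats package is the built-in logistic function
y_builtin <- plogis(eta)

print(c(manual = y_manual, builtin = y_builtin))
```

Whatever the value of the linear predictor, the result always lies strictly between 0 and 1, which is why it can be read as a probability.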

## Steps to Perform Logistic Regression in R

Now let’s perform logistic regression in R. Here is the step-by-step process.

### Load the Data

Let’s use the `Default` data set from the ISLR package. First, we need to install the package if it is not already installed.

``````
install.packages('ISLR')
``````

Once the package is successfully installed, the next step is to load the data.

``````
library(ISLR)
data_set <- ISLR::Default

# the total number of observations in the data
nrow(data_set)

# the summary of the data set
summary(data_set)
``````

The code loads the `Default` data set from ISLR and shows the number of observations and a summary of the data.

Output:

``````
[1] 10000

 default    student       balance           income
 No :9667   No :7056   Min.   :   0.0   Min.   :  772
 Yes: 333   Yes:2944   1st Qu.: 481.7   1st Qu.:21340
                       Median : 823.6   Median :34553
                       Mean   : 835.4   Mean   :33517
                       3rd Qu.:1166.3   3rd Qu.:43808
                       Max.   :2654.3   Max.   :73554
``````

The data set contains 10000 individuals.

The `default` column shows whether the individual defaulted, and `student` indicates whether the individual is a student. The `balance` column is the individual’s average balance, and `income` is the individual’s income.

### Train and Test Samples

The next step is to split the data set into a training and testing set to train and test the model.

``````
# this will make the example reproducible
set.seed(1)

# assign each row to the training set with probability 0.7, otherwise to the testing set
sample <- sample(c(TRUE, FALSE), nrow(data_set), replace=TRUE, prob=c(0.7,0.3))
train <- data_set[sample, ]
test <- data_set[!sample, ]
``````
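Because each row is assigned at random, the split above is only approximately 70/30. The sketch below illustrates this on a stand-in data frame and shows one possible alternative that draws an exact share of the rows. The names `toy`, `mask`, and `idx` are hypothetical and not part of the tutorial's code.

```r
set.seed(1)

# Stand-in data frame with 10000 rows (hypothetical; replace with your own data)
toy <- data.frame(id = seq_len(10000))

# Approach used above: each row goes to training with probability 0.7
mask <- sample(c(TRUE, FALSE), nrow(toy), replace = TRUE, prob = c(0.7, 0.3))
train_a <- toy[mask, , drop = FALSE]
test_a  <- toy[!mask, , drop = FALSE]
print(mean(mask))  # close to, but usually not exactly, 0.7

# Alternative: draw exactly 70% of the row indices without replacement
idx <- sample(seq_len(nrow(toy)), size = round(0.7 * nrow(toy)))
train_b <- toy[idx, , drop = FALSE]
test_b  <- toy[-idx, , drop = FALSE]
print(c(train = nrow(train_b), test = nrow(test_b)))
```

Either approach works for this tutorial; the random mask simply means the training set size varies slightly with the seed.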

### Create the Logistic Regression Model

We use `glm()` to create the logistic regression model with `family = binomial`.

``````
# the logistic regression model
logistic_model <- glm(default~student+balance+income, family="binomial", data=train)

# just disable scientific notation for summary
options(scipen=999)

# model summary
summary(logistic_model)
``````

The code above creates a logistic regression model and shows the model summary from the data above.

Output:

``````
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.4881  -0.1327  -0.0509  -0.0176   3.5912

Coefficients:
                 Estimate    Std. Error z value             Pr(>|z|)
(Intercept) -10.961385538   0.602044982 -18.207 < 0.0000000000000002 ***
studentYes   -0.835485760   0.284855225  -2.933              0.00336 **
balance       0.005893470   0.000289649  20.347 < 0.0000000000000002 ***
income       -0.000001611   0.000009942  -0.162              0.87124
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1997.4  on 7014  degrees of freedom
Residual deviance: 1050.9  on 7011  degrees of freedom
AIC: 1058.9

Number of Fisher Scoring iterations: 8
``````
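The estimates in the summary are on the log-odds scale, which can be hard to read directly. As a rough sketch, exponentiating them gives odds ratios. The values below are copied from the output above; with the fitted model in memory, `exp(coef(logistic_model))` should give the same numbers. The names `coefs`, `odds_ratios`, and `or_balance_100` are chosen here for illustration.

```r
# Coefficient estimates copied from the summary output above
coefs <- c(
  "(Intercept)" = -10.961385538,
  studentYes    = -0.835485760,
  balance       =  0.005893470,
  income        = -0.000001611
)

# exp(coefficient) is the multiplicative change in the odds of default
# for a one-unit increase in that predictor, holding the others fixed
odds_ratios <- exp(coefs[-1])
print(round(odds_ratios, 4))

# A one-dollar change in balance is tiny, so rescale to a 100-dollar increase
or_balance_100 <- exp(100 * coefs[["balance"]])
print(round(or_balance_100, 2))
```

Read this way, being a student multiplies the odds of default by roughly 0.43, and each additional `$100` of balance multiplies them by roughly 1.8, other predictors held fixed.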

The logistic regression model has been successfully created. Then the next step is to use the model to make predictions.

### Use the Model to Make Predictions

Once the logistic regression model is fitted, we can use it to predict the probability that an individual will `default` based on their `student` status, `balance`, and `income`.

``````
# defining two individuals
demo <- data.frame(balance = 1500, income = 3000, student = c("Yes", "No"))

#predict the probability of defaulting
predict(logistic_model, demo, type="response")
``````

The code above predicts the probability of defaulting for the two individuals defined in `demo`.

Output:

``````
         1          2
0.04919576 0.10659389
``````

For an individual with a `balance` of `$1500`, an `income` of `$3000`, and a `student` status of `Yes`, the predicted probability of defaulting is about `0.0492`. For the same `balance` and `income` with a `student` status of `No`, the probability of `default` is about `0.1066`.
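These probabilities can be reproduced by hand from the fitted coefficients, which shows what `predict()` does with `type="response"`. The sketch below copies the estimates from the model summary and applies the logistic formula; the variable names are chosen here for illustration.

```r
# Coefficient estimates copied from the model summary above
intercept <- -10.961385538
b_student <- -0.835485760   # applies when student == "Yes"
b_balance <-  0.005893470
b_income  <- -0.000001611

# Same two individuals as in demo: student = "Yes" and student = "No"
is_student <- c(1, 0)
balance    <- 1500
income     <- 3000

# Linear predictor (log-odds), then the logistic transform
log_odds <- intercept + b_student * is_student +
  b_balance * balance + b_income * income
prob <- plogis(log_odds)

print(round(prob, 4))
```

Up to rounding of the printed coefficients, the two values agree with the output of `predict()` above.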

Now let’s calculate the probability of defaulting of each individual in our test data set.

``````
# probability of default for the test dataset
prediction <- predict(logistic_model, test, type="response")
``````

The code calculates the probability of defaulting for each individual in our test data set. The result is a numeric vector with one probability per test observation, so it is too long to print here.

### The Logistic Regression Model Diagnostics

Now it’s time to check how well our model performs on the test data set. We find the optimal probability cutoff using the `optimalCutoff()` function from the `InformationValue` package.

Example:

``````
library(InformationValue)

# from "Yes" and "No" to 1's and 0's
test$default <- ifelse(test$default=="Yes", 1, 0)

# optimal cutoff probability to use to maximize accuracy
optimal <- optimalCutoff(test$default, prediction)
optimal
``````

The optimal probability cutoff is shown below. Any individual with a predicted probability above this cutoff is classified as defaulting.

Output:

``````
[1] 0.5209985
``````
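To make the role of the cutoff concrete, the sketch below turns a few probabilities into 0/1 class labels. The probabilities are hypothetical, and the cutoff is the rounded value reported above; in the tutorial's code it is stored in `optimal`.

```r
# Hypothetical predicted probabilities for five individuals
probs <- c(0.02, 0.48, 0.53, 0.91, 0.10)

# Cutoff reported above (rounded); in the tutorial it is stored in `optimal`
cutoff <- 0.521

# Probabilities above the cutoff are classified as default (1), others as 0
predicted_class <- ifelse(probs > cutoff, 1, 0)
print(predicted_class)
```

Only the third and fourth individuals exceed the cutoff, so only they are labeled as defaults.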

Next, we can use a confusion matrix to compare our predictions with the actual defaults. In the output, the columns are the actual values and the rows are the predicted values.

Example:

``````
confusionMatrix(test$default, prediction)
``````

Output:

``````
     0  1
0 2868 71
1   10 36
``````

We can also calculate the true positive rate (sensitivity), the true negative rate (specificity), and the misclassification error:

``````
# sensitivity
sensitivity(test$default, prediction)

# specificity
specificity(test$default, prediction)

# total misclassification error rate
misClassError(test$default, prediction, threshold=optimal)
``````

Output:

``````
[1] 0.3364486

[1] 0.9965254

[1] 0.0265
``````
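The sensitivity and specificity above follow directly from the counts in the confusion matrix. The sketch below recomputes them by hand; the row and column layout (rows predicted, columns actual) is inferred from the reported values, and the names `cm`, `sens`, `spec`, and `err` are chosen here for illustration.

```r
# Counts copied from the confusion matrix above:
# rows are predicted classes, columns are actual classes
cm <- matrix(
  c(2868, 10,    # actual 0: predicted 0, predicted 1
      71, 36),   # actual 1: predicted 0, predicted 1
  nrow = 2,
  dimnames = list(predicted = c("0", "1"), actual = c("0", "1"))
)

true_pos  <- cm["1", "1"]
false_neg <- cm["0", "1"]
true_neg  <- cm["0", "0"]
false_pos <- cm["1", "0"]

sens <- true_pos / (true_pos + false_neg)   # share of actual defaults caught
spec <- true_neg / (true_neg + false_pos)   # share of non-defaults kept
err  <- (false_pos + false_neg) / sum(cm)   # overall misclassification rate

print(round(c(sensitivity = sens, specificity = spec, error = err), 4))
```

The error rate computed from this matrix is about 2.71%, slightly above the 2.65% reported by `misClassError()`. A likely reason is that the matrix was built without passing `threshold=optimal`, so it may reflect the function's default cutoff rather than the optimal one.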

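The total misclassification error rate for our model is 2.65%, which is low. However, the error rate alone can be misleading here: only about 3% of the individuals in the data set default, and the sensitivity of about 0.34 shows that the model identifies only around a third of the actual defaults in the test set.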

Finally, we can plot the ROC curve for the test data set with `plotROC(test$default, prediction)`, which is the last step in the complete code below.

### Complete Example Code

Here is the complete code used in this tutorial for your convenience.

``````
install.packages('ISLR')

library(ISLR)
data_set <- ISLR::Default

## The total observations in data
nrow(data_set)

#the summary of the dataset
summary(data_set)

#this will make the example reproducible
set.seed(1)

# assign each row to the training set with probability 0.7, otherwise to the testing set
sample <- sample(c(TRUE, FALSE), nrow(data_set), replace=TRUE, prob=c(0.7,0.3))
train <- data_set[sample, ]
test <- data_set[!sample, ]

# the logistic regression model
logistic_model <- glm(default~student+balance+income, family="binomial", data=train)

# just disable scientific notation for summary
options(scipen=999)

#view model summary
summary(logistic_model)

#defining two individuals
demo <- data.frame(balance = 1500, income = 3000, student = c("Yes", "No"))

#predict the probability of defaulting
predict(logistic_model, demo, type="response")

#probability of default for the test dataset
prediction <- predict(logistic_model, test, type="response")

install.packages('InformationValue')

library(InformationValue)

#from "Yes" and "No" to 1's and 0's
test$default <- ifelse(test$default=="Yes", 1, 0)

# optimal cutoff probability to use to maximize accuracy
optimal <- optimalCutoff(test$default, prediction)
optimal

confusionMatrix(test$default, prediction)

# sensitivity
sensitivity(test$default, prediction)

# specificity
specificity(test$default, prediction)

# total misclassification error rate
misClassError(test$default, prediction, threshold=optimal)

# the ROC curve
plotROC(test$default, prediction)
``````
