This article shows how to create dummy variables using the dummy_cols() function of the fastDummies package in R. The terms dummy variable and dummy column are used interchangeably.
fastDummies Package in R
We need to install the fastDummies package and load it.
# Install the fastDummies package.
install.packages("fastDummies")
# Load the fastDummies package.
library(fastDummies)
We will now create a small data frame with a categorical variable.
# Vectors.
cv = c("Bd", "Ba", "F", NA, "F", "F", "Ba")
nv = 1:7
# Data frame.
orig_datf = data.frame(Num_V = nv, Cat_V = as.factor(cv))
# View the data frame and its structure.
orig_datf
str(orig_datf)
> str(orig_datf)
'data.frame': 7 obs. of 2 variables:
 $ Num_V: int 1 2 3 4 5 6 7
 $ Cat_V: Factor w/ 3 levels "Ba","Bd","F": 2 1 3 NA 3 3 1
As displayed, our data frame has a categorical variable with three factor levels.
By default, R orders factor levels alphabetically, and NA is not counted as a level. This detail matters when we create dummy variables, because it determines which level is the first.
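The default ordering can be checked directly in base R. The sketch below reuses the values of cv from above; the variable names default_levels and custom_levels are only illustrative.

```r
# Same values as the article's categorical vector.
cv = c("Bd", "Ba", "F", NA, "F", "F", "Ba")

# Default: levels are the sorted unique values; NA is not a level.
default_levels = levels(factor(cv))
default_levels

# Explicit order: the levels argument decides which level comes first.
custom_levels = levels(factor(cv, levels = c("F", "Ba", "Bd")))
custom_levels
```

Setting the levels explicitly when the factor is created is one way to control which level is treated as the first.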
dummy_cols() Function to Create Dummy Columns in R
If we do not specify the columns from which to create dummy variables, the function creates dummy columns from all factor or character type columns.
new_datf_default_all = dummy_cols(orig_datf)
new_datf_default_all
names(new_datf_default_all)
> names(new_datf_default_all)
[1] "Num_V" "Cat_V" "Cat_V_Ba" "Cat_V_Bd" "Cat_V_F" "Cat_V_NA"
Observe the following in the list of columns.
- Because the categorical variable had 3 categories, we see 3 new columns.
- Because our categorical column had missing values (NA), we also have one column that marks the NAs with the value 1. In the rows where the original column was NA, all the other dummy columns are NA as well.
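To make the contents of these columns concrete, the same encoding can be sketched by hand in base R. This only mirrors the output described above; it is not how fastDummies is implemented, and the manual_* names are made up for the illustration.

```r
cat_v = factor(c("Bd", "Ba", "F", NA, "F", "F", "Ba"))

# Comparing NA with a level yields NA, so these indicators are NA in row 4.
manual_ba = as.integer(cat_v == "Ba")
manual_f  = as.integer(cat_v == "F")

# The NA indicator is never missing: 1 where the original is NA, otherwise 0.
manual_na = as.integer(is.na(cat_v))

# Side-by-side view of the original column and the hand-made dummies.
data.frame(cat_v, manual_ba, manual_f, manual_na)
```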
Create Dummy Variables From Selected Columns in R
To create dummy variables from only selected columns, we can use the select_columns argument. We can pass a single column name as a string, or several column names as a character vector.
# Pass a single column name as a string.
new_datf_select_cols = dummy_cols(orig_datf, select_columns = "Cat_V")
# Pass column names in a vector. Our data frame has only one
# categorical column, so the vector holds a single name here.
new_datf_select_cols = dummy_cols(orig_datf, select_columns = c("Cat_V"))
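Since our data frame has only one categorical column, a vector with more than one name can be sketched on a second, hypothetical data frame. The data frame two_cat_datf and its column Cat_W are invented for this illustration, and the sketch assumes the fastDummies package is installed.

```r
library(fastDummies)

# Hypothetical data frame with two categorical columns (Cat_W is made up).
two_cat_datf = data.frame(
  Num_V = 1:4,
  Cat_V = as.factor(c("Ba", "F", "Bd", "F")),
  Cat_W = as.factor(c("x", "y", "x", "y"))
)

# Pass both column names in one character vector.
multi_datf = dummy_cols(two_cat_datf, select_columns = c("Cat_V", "Cat_W"))
names(multi_datf)
```

Each selected column should contribute its own set of dummy columns, named with the original column name as a prefix.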
Remove One Column to Avoid Multicollinearity in R
When we create dummy variables from all levels of a factor column, the new columns are linearly dependent. In each row (ignoring missing values), exactly one of them is 1, so given the values of all the other columns, we can work out the value of the last one.
In a model with an intercept (such as linear regression), this means the coefficients cannot be estimated uniquely. Therefore, we need to remove one of the dummy columns for each original column from which we are creating dummy variables.
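The dependence can be verified numerically. The following base R sketch builds the dummy columns by hand for a factor without missing values (not with dummy_cols()) and compares the rank of the design matrix with and without one dummy column; all variable names are illustrative.

```r
cat_v = factor(c("Bd", "Ba", "F", "F", "F", "Ba"))

# One 0/1 indicator column per level, built by hand.
all_dummies = sapply(levels(cat_v), function(lv) as.integer(cat_v == lv))

# Every row contains exactly one 1, so the dummies add up to the intercept column.
row_totals = rowSums(all_dummies)

# Intercept plus all three dummies: 4 columns, but the rank is lower.
full_rank = qr(cbind(Intercept = 1, all_dummies))$rank

# Intercept plus two dummies: 3 columns with full column rank.
reduced_rank = qr(cbind(Intercept = 1, all_dummies[, -1]))$rank

c(full = full_rank, reduced = reduced_rank)
```

Both matrices have rank 3, which shows that the fourth column of the full matrix carries no additional information.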
The dummy_cols() function gives us two options. We can set either remove_first_dummy = TRUE or remove_most_frequent_dummy = TRUE.
The following code examines both options.
# Remove first.
new_datf_remove_first = dummy_cols(orig_datf, remove_first_dummy = TRUE)
# After removing first.
names(new_datf_remove_first)
# Remove most frequent.
new_datf_remove_most_frequent = dummy_cols(orig_datf, remove_most_frequent_dummy = TRUE)
# After removing most frequent.
names(new_datf_remove_most_frequent)
> # After removing first.
> names(new_datf_remove_first)
[1] "Num_V" "Cat_V" "Cat_V_Bd" "Cat_V_F" "Cat_V_NA"
> # After removing most frequent
> names(new_datf_remove_most_frequent)
[1] "Num_V" "Cat_V" "Cat_V_Ba" "Cat_V_Bd"
Notice the following in the output of the two commands.
- The argument remove_first_dummy = TRUE removed the column corresponding to the first level of the factor.
- The argument remove_most_frequent_dummy = TRUE dropped the column corresponding to the level that appeared most frequently in the original column. However, it also had the effect of dropping the column that showed where the NAs were. Even setting ignore_na = FALSE did not affect the output.
We can use the following workaround if we want to keep the NA column and drop the most frequent level.
- Relevel the factor column using the relevel() function, making the most frequent value the first level.
- Then use remove_first_dummy = TRUE.
releveled_datf = orig_datf
# Relevel the desired column manually.
releveled_datf$Cat_V = relevel(releveled_datf$Cat_V, ref = "F")
# View the new levels.
levels(releveled_datf$Cat_V)
# NOW, remove first.
releveled_datf_remove_first = dummy_cols(releveled_datf, remove_first_dummy = TRUE)
# After removing first.
names(releveled_datf_remove_first)
> levels(releveled_datf$Cat_V)
[1] "F" "Ba" "Bd"
> # After removing first.
> names(releveled_datf_remove_first)
[1] "Num_V" "Cat_V" "Cat_V_Ba" "Cat_V_Bd" "Cat_V_NA"
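In the code above, the reference level "F" was typed in by hand. As a possible variation, the most frequent level can be looked up with base R before releveling. This is a minimal sketch; the names most_frequent and releveled are illustrative, and ties are resolved in favor of the first maximum.

```r
cat_v = factor(c("Bd", "Ba", "F", NA, "F", "F", "Ba"))

# table() leaves out NA by default; which.max() returns the first maximum.
most_frequent = names(which.max(table(cat_v)))
most_frequent

# Make that level the first (reference) level.
releveled = relevel(cat_v, ref = most_frequent)
levels(releveled)
```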
Interpret Dummy Variables
In a linear regression, the intercept includes the effect of the base level of the original column, that is, the level whose dummy column we removed.
Rows that belong to the removed level have the value 0 in all the dummy columns created from the same original column. Therefore, the effect of that level is included in the intercept.
The coefficient for each dummy column corresponds to the difference caused by that factor level compared to the base level. This can be a positive or negative effect compared to the baseline, depending on the value of this coefficient.
Because of this interpretation, it is useful to drop the column corresponding to the most frequent level, so that every coefficient is read as a difference from the most common category.
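A small worked example may help with this interpretation. The data below are hypothetical: the response y is made up so that the group means are easy to read (10 for F, 15 for Ba, 7 for Bd), and the dummy columns are built by hand in base R rather than with dummy_cols().

```r
# Hypothetical data: y is invented so that the group means are round numbers.
toy = data.frame(
  Cat_V = factor(c("F", "F", "F", "Ba", "Ba", "Bd", "Bd"),
                 levels = c("F", "Ba", "Bd")),
  y     = c(9, 10, 11, 14, 16, 6, 8)
)

# Dummy columns for the non-base levels; "F" (the most frequent) is the base.
toy$Cat_V_Ba = as.integer(toy$Cat_V == "Ba")
toy$Cat_V_Bd = as.integer(toy$Cat_V == "Bd")

# Regress y on the two remaining dummy columns.
fit = lm(y ~ Cat_V_Ba + Cat_V_Bd, data = toy)
coefs = coef(fit)
coefs
```

The intercept equals the mean of the base level F (10). The coefficient of Cat_V_Ba is 5, the amount by which the mean of Ba exceeds that of F, and the coefficient of Cat_V_Bd is -3, because the mean of Bd is 3 below that of F.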