How to Drop Multiple Columns From a Data Frame Using Dplyr

Jesse John Feb 02, 2024
  1. How to Set Up the R Session
  2. Use dplyr to Drop Multiple Columns by Name Directly in R
  3. Use dplyr to Drop Multiple Columns Using a Character Vector in R
  4. Use dplyr to Drop Consecutive Columns in R
  5. Use dplyr to Drop Columns Using Pattern Matching Functions in R
  6. Use dplyr to Drop Column Names in a Numeric Range in R
  7. Use dplyr to Drop Multiple Columns Using a Function in R
  8. Conclusion

When working with tabular data, we often need to select columns for display. We can either select the columns we want to display or remove the columns that we do not want to display.

In this article, we will learn various ways to use the select() function of the dplyr package to drop multiple columns from a data frame.

How to Set Up the R Session

dplyr is an R package for performing common data manipulation tasks. Its select() function is designed to select columns from a data frame.

The ! operator is used to take the complement of a set of variables. It will help us drop columns using the select() function.

We will load the dplyr package in the following code, create a data frame, and then select two particular columns from this data frame. The dplyr package can be loaded directly or by loading the tidyverse package.

We will create a data frame with eight columns and three rows.

We will use the pipe operator %>% to make our code readable. This operator helps us avoid nesting function calls and saving intermediate results as objects.

The select() function takes the data frame’s name followed by the columns’ names (or positions) to select. In the example code in this article, we will supply the data frame’s name using the pipe operator.

Example Code:

# Load the dplyr package directly.
# Alternately, load the entire tidyverse by running the following one line of code.
# library(tidyverse) # Un-comment to run.
library(dplyr)

# We will create a small data frame for this article.
Col1 = c(10, 11, 12)
Col2 = c(20, 21, 22)
Col7 = c(70, 71, 72)
Col8 = c(80, 81, 82)
dplyrA = c('dA1', 'dA2', 'dA3')
dplyrAA = c('AA1', 'AA2', 'AA3')
Bdplyr = c('dB1', 'dB2', 'dB3')
BBdplyr = c('BB1', 'BB2', 'BB3')

dplyr_df = data.frame(Col1, Col2, Col7, Col8, dplyrA, dplyrAA, Bdplyr, BBdplyr)


# Check the type of object that we created.
class(dplyr_df)

# Display the data frame.
dplyr_df

# Select two columns using their names.
dplyr_df %>% select(Col2, BBdplyr)

Output of the last command:

> dplyr_df %>% select(Col2, BBdplyr)
  Col2 BBdplyr
1   20     BB1
2   21     BB2
3   22     BB3

When column names are listed directly in the select() function, they are specified like variables. Unlike strings, they are not given in quotes.

Use dplyr to Drop Multiple Columns by Name Directly in R

There are three equivalent ways to drop multiple columns by name directly.

In the first method, we will combine column names into a vector of variables using the c() function. To drop all the columns in this vector, we will use the ! operator. It gives the complement of those variables.

In the second method, we take the intersection of the complement of each column that we want to drop. The & operator gives us an intersection.

In the third method, we complement a union of column names. The | operator gives us a union.

Example Code:

# Select the complement of a vector of column names.
dplyr_df %>% select(!c(Col1, dplyrA, BBdplyr))

# Select the intersection of the complement of each column.
dplyr_df %>% select(!Col1 & !dplyrA & !BBdplyr)

# Select the complement of the union of column names.
dplyr_df %>% select(!(Col1 | dplyrA | BBdplyr))

Output (identical for all three methods):

  Col2 Col7 Col8 dplyrAA Bdplyr
1   20   70   80     AA1    dB1
2   21   71   81     AA2    dB2
3   22   72   82     AA3    dB3
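
One related pitfall is worth checking. As far as we understand the selection semantics, separate arguments to select() are combined as a union, so passing each complement as its own argument should not drop anything. The sketch below rebuilds the article's data frame so that it runs on its own. The object names wrong and right are only illustrative.

```r
library(dplyr)

# Rebuild the article's example data frame so this block is self-contained.
dplyr_df <- data.frame(
  Col1 = c(10, 11, 12), Col2 = c(20, 21, 22),
  Col7 = c(70, 71, 72), Col8 = c(80, 81, 82),
  dplyrA = c('dA1', 'dA2', 'dA3'), dplyrAA = c('AA1', 'AA2', 'AA3'),
  Bdplyr = c('dB1', 'dB2', 'dB3'), BBdplyr = c('BB1', 'BB2', 'BB3')
)

# Two separate complements: each column belongs to at least one of them,
# so their union is expected to contain every column.
wrong <- dplyr_df %>% select(!Col1, !dplyrA)

# One complement of a combined selection: both columns are dropped.
right <- dplyr_df %>% select(!c(Col1, dplyrA))

ncol(wrong)
names(right)
```

If the two results differ as expected, this shows why the three methods above all wrap the column names in a single expression.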

The select() function also accepts column positions. Using positions is equivalent to using column names directly, as long as the column order does not change.

Example Code:

# Select the complement of a vector of column positions.
dplyr_df %>% select(!c(1, 5, 8))

# Select the intersection of the complement of each column.
dplyr_df %>% select(!1 & !5 & !8)

# Select the complement of the union of column positions.
dplyr_df %>% select(!(1 | 5 | 8))
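
No output is shown for the position-based commands, so here is a small self-contained check that dropping by position and dropping by name give the same result. The object names by_name and by_position are only illustrative.

```r
library(dplyr)

# Rebuild the article's example data frame so this block is self-contained.
dplyr_df <- data.frame(
  Col1 = c(10, 11, 12), Col2 = c(20, 21, 22),
  Col7 = c(70, 71, 72), Col8 = c(80, 81, 82),
  dplyrA = c('dA1', 'dA2', 'dA3'), dplyrAA = c('AA1', 'AA2', 'AA3'),
  Bdplyr = c('dB1', 'dB2', 'dB3'), BBdplyr = c('BB1', 'BB2', 'BB3')
)

# Columns 1, 5 and 8 are Col1, dplyrA and BBdplyr.
by_name <- dplyr_df %>% select(!c(Col1, dplyrA, BBdplyr))
by_position <- dplyr_df %>% select(!c(1, 5, 8))

# Compare the two results.
identical(by_name, by_position)
```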

Use dplyr to Drop Multiple Columns Using a Character Vector in R

Rather than directly specify column names in the select() function, we can save the column names in an object and use that object in the function.

However, there are two key differences when this approach is used.

  1. The column names need to be stored as a character vector, not a vector of variable names. In other words, the names have to be strings surrounded by quotes.
  2. We will need to use a selection helper function, either all_of() or any_of(). We will use all_of() in the example code.

Example Code:

# Create a character vector using the names of the columns to remove.
# Note the quotes around the column names.
to_remove = c('Col2', 'Col7', 'dplyrAA', 'Bdplyr')

# Select the complement of the column names in the vector 'to_remove'.
dplyr_df %>% select(!all_of(to_remove))

Output:

> dplyr_df %>% select(!all_of(to_remove))
  Col1 Col8 dplyrA BBdplyr
1   10   80    dA1     BB1
2   11   81    dA2     BB2
3   12   82    dA3     BB3
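
The two helpers differ in how they treat names that are not in the data frame: all_of() raises an error, while any_of() ignores the missing names. The sketch below illustrates this. It rebuilds the article's data frame, and the column name 'NoSuchColumn' and the object names lenient and strict are made up for the example.

```r
library(dplyr)

# Rebuild the article's example data frame so this block is self-contained.
dplyr_df <- data.frame(
  Col1 = c(10, 11, 12), Col2 = c(20, 21, 22),
  Col7 = c(70, 71, 72), Col8 = c(80, 81, 82),
  dplyrA = c('dA1', 'dA2', 'dA3'), dplyrAA = c('AA1', 'AA2', 'AA3'),
  Bdplyr = c('dB1', 'dB2', 'dB3'), BBdplyr = c('BB1', 'BB2', 'BB3')
)

# 'NoSuchColumn' is a hypothetical name that is not in the data frame.
to_remove <- c('Col2', 'NoSuchColumn')

# any_of() ignores the missing name and drops only Col2.
lenient <- dplyr_df %>% select(!any_of(to_remove))

# all_of() stops with an error; tryCatch() turns it into a message here.
strict <- tryCatch(
  dplyr_df %>% select(!all_of(to_remove)),
  error = function(e) 'error: a listed column does not exist'
)

names(lenient)
strict
```

Choosing all_of() is the safer default when a missing column indicates a mistake. any_of() is convenient when the same code runs on data frames that may not contain every listed column.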

Use dplyr to Drop Consecutive Columns in R

To drop consecutive columns, we will use the : operator. We can use column names or column positions. Both give the same output.

We will remove columns 2 to 7 from our data frame, that is, the columns from Col2 to Bdplyr. We will be left with the first and last columns, Col1 and BBdplyr.

Example Code:

# Drop a range of columns specified by column numbers.
dplyr_df %>% select(!(2:7))

# Drop a range of columns specified by column names.
# Note that the variable names are not in quotes.
dplyr_df %>% select(!(Col2:Bdplyr))

Output is identical for both commands:

  Col1 BBdplyr
1   10     BB1
2   11     BB2
3   12     BB3
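
A range can also be combined with individual columns inside c() before taking the complement. The following sketch, which rebuilds the article's data frame so that it runs on its own, drops the range Col2 to Col8 together with BBdplyr. The object name kept is only illustrative.

```r
library(dplyr)

# Rebuild the article's example data frame so this block is self-contained.
dplyr_df <- data.frame(
  Col1 = c(10, 11, 12), Col2 = c(20, 21, 22),
  Col7 = c(70, 71, 72), Col8 = c(80, 81, 82),
  dplyrA = c('dA1', 'dA2', 'dA3'), dplyrAA = c('AA1', 'AA2', 'AA3'),
  Bdplyr = c('dB1', 'dB2', 'dB3'), BBdplyr = c('BB1', 'BB2', 'BB3')
)

# Drop the consecutive columns Col2 to Col8 and the single column BBdplyr.
kept <- dplyr_df %>% select(!c(Col2:Col8, BBdplyr))

names(kept)
```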

Use dplyr to Drop Columns Using Pattern Matching Functions in R

We can use pattern matching functions to drop multiple columns. These functions take a string or a vector of strings as an argument.

They return all columns that match the pattern. To drop those columns, we use the ! operator.

It is important to note that, by default, these functions are not case-sensitive. So the string cat is matched by cat, Cat, CAT, etc.

  1. The starts_with() function matches column names from the start of the names.
  2. The ends_with() function matches column names from the end of the names.
  3. The contains() function matches any part of the column names.

In the example code, we use strings that match at least two column names each, so we can check the output to verify that each function worked as expected.

Example Code:

# Look at the column names in our data frame.
names(dplyr_df)

# Four columns start with 'Col'. We will drop them.
dplyr_df %>% select(!starts_with('Col'))

# There are two column names that end with 'A'. We will drop them.
dplyr_df %>% select(!ends_with('A'))

# There are four column names that contain the string 'dplyr'.
# We will drop these four columns.
dplyr_df %>% select(!contains('dplyr'))

# We can give a vector of strings as an argument to these functions.
# We will drop columns that start with 'Co' or 'B'.
# 6 columns should get dropped.
dplyr_df %>% select(!starts_with(c('Co', 'B')))

The output of the first and last commands:

> # Look at the column names in our data frame.
> names(dplyr_df)
[1] "Col1"    "Col2"    "Col7"    "Col8"    "dplyrA"  "dplyrAA" "Bdplyr"  "BBdplyr"

> dplyr_df %>% select(!starts_with(c('Co', 'B')))
  dplyrA dplyrAA
1    dA1     AA1
2    dA2     AA2
3    dA3     AA3
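
Because matching ignores case by default, a lower-case prefix such as 'col' still matches Col1 to Col8. These helpers accept an ignore.case argument to switch that off. The sketch below contrasts the two behaviours. It rebuilds the article's data frame, and the object names default_match and exact_case are only illustrative.

```r
library(dplyr)

# Rebuild the article's example data frame so this block is self-contained.
dplyr_df <- data.frame(
  Col1 = c(10, 11, 12), Col2 = c(20, 21, 22),
  Col7 = c(70, 71, 72), Col8 = c(80, 81, 82),
  dplyrA = c('dA1', 'dA2', 'dA3'), dplyrAA = c('AA1', 'AA2', 'AA3'),
  Bdplyr = c('dB1', 'dB2', 'dB3'), BBdplyr = c('BB1', 'BB2', 'BB3')
)

# Default: 'col' matches 'Col1' to 'Col8', so those four columns are dropped.
default_match <- dplyr_df %>% select(!starts_with('col'))

# Case-sensitive: no column name starts with lower-case 'col',
# so nothing is dropped.
exact_case <- dplyr_df %>% select(!starts_with('col', ignore.case = FALSE))

names(default_match)
ncol(exact_case)
```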

Besides these three functions, dplyr provides another pattern matching helper function that works with regular expressions.

The matches() function takes a regular expression as an argument. Like the other three helpers, it is not case-sensitive by default.

For example, we will drop columns that have one or more l characters followed immediately by 7 or y anywhere in their name. Users need to be familiar with regular expressions to take advantage of this function.

Example Code:

dplyr_df %>% select(!matches('l+[7y]'))

Output:

> dplyr_df %>% select(!matches('l+[7y]'))
  Col1 Col2 Col8
1   10   20   80
2   11   21   81
3   12   22   82
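
Before dropping columns with a regular expression, it can help to preview which names the pattern matches. The sketch below uses base R's grepl() for the preview. This is an approximation of what matches() does, not its actual implementation. It rebuilds the article's data frame, and the object names pattern, matched and kept are only illustrative.

```r
library(dplyr)

# Rebuild the article's example data frame so this block is self-contained.
dplyr_df <- data.frame(
  Col1 = c(10, 11, 12), Col2 = c(20, 21, 22),
  Col7 = c(70, 71, 72), Col8 = c(80, 81, 82),
  dplyrA = c('dA1', 'dA2', 'dA3'), dplyrAA = c('AA1', 'AA2', 'AA3'),
  Bdplyr = c('dB1', 'dB2', 'dB3'), BBdplyr = c('BB1', 'BB2', 'BB3')
)

pattern <- 'l+[7y]'

# Preview the matching names with base R, ignoring case like matches() does.
matched <- names(dplyr_df)[grepl(pattern, names(dplyr_df), ignore.case = TRUE)]
matched

# Drop the matching columns and keep the rest.
kept <- dplyr_df %>% select(!matches(pattern))
names(kept)
```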

Use dplyr to Drop Column Names in a Numeric Range in R

Sometimes, we may have a data frame with column names that begin with a fixed string and end with numbers. dplyr provides the num_range() selection helper function to help us select and drop columns that share a common prefix and end in a specified numeric range.

To illustrate, we will first create a data frame with six columns. The first argument to num_range() is the prefix, and the second is the numeric range specified with the : operator.

The ! operator (complement) helps us drop the selected columns.

Example Code:

# Create vectors of the same length.
MyVar10 = seq(1, 5)
MyVar11 = seq(6, 10)
MyVar12 = seq(11, 15)
MyVar13 = seq(16, 20)
MyVar14 = seq(21, 25)
MyVar15 = seq(26, 30)

# Combine the vectors into a data frame.
num_df = data.frame(MyVar10, MyVar11, MyVar12, MyVar13, MyVar14, MyVar15)
num_df

# Drop columns that end in the range 12 to 14.
num_df %>% select(!num_range('MyVar', 12:14))

The output of the last two commands:

> num_df
  MyVar10 MyVar11 MyVar12 MyVar13 MyVar14 MyVar15
1       1       6      11      16      21      26
2       2       7      12      17      22      27
3       3       8      13      18      23      28
4       4       9      14      19      24      29
5       5      10      15      20      25      30
> # Drop columns that end in the range 12 to 14.
> num_df %>% select(!num_range('MyVar', 12:14))
  MyVar10 MyVar11 MyVar15
1       1       6      26
2       2       7      27
3       3       8      28
4       4       9      29
5       5      10      30
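
Column numbers are sometimes padded with leading zeros, as in x01 and x02. The num_range() helper has a width argument for this case. The sketch below uses a small hypothetical data frame, padded_df, whose column names were made up for the illustration.

```r
library(dplyr)

# A hypothetical data frame whose column numbers are zero-padded to width 2.
padded_df <- data.frame(x01 = 1:3, x02 = 4:6, x03 = 7:9, x10 = 10:12)

# width = 2 makes the range 1:2 match 'x01' and 'x02'.
kept <- padded_df %>% select(!num_range('x', 1:2, width = 2))

names(kept)
```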

Use dplyr to Drop Multiple Columns Using a Function in R

The where() helper function applies a function that returns TRUE or FALSE to the column data. The columns for which the function returns TRUE are selected.

As usual, to drop columns, we use the ! operator.

In the example, we use a simple custom function to select all columns whose mean is greater than 10. The code drops these and returns the remaining columns.

This example code works because all columns in the data frame are numeric. With real data that mixes column types, the function would first have to check that a column is numeric before computing its mean.

Example Code:

# Since all columns are numeric, there is no error.
# Otherwise, calculate the mean only for numeric columns.
num_df %>% select(!where(function(y) {mean(y)>10}))

Output:

> num_df %>% select(!where(function(y) {mean(y)>10}))
  MyVar10 MyVar11
1       1       6
2       2       7
3       3       8
4       4       9
5       5      10
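
For a data frame that mixes column types, one possible approach is to test for a numeric column before computing the mean. The sketch below assumes a small hypothetical data frame, mixed_df, with one character column. The function name mean_above_10 is also made up for the illustration.

```r
library(dplyr)

# A hypothetical data frame with two numeric columns and one character column.
mixed_df <- data.frame(
  MyVar10 = 1:5,
  MyVar12 = 11:15,
  Label = c('a', 'b', 'c', 'd', 'e')
)

# Return TRUE only for numeric columns whose mean exceeds 10.
# The && operator skips the mean() call when the column is not numeric.
mean_above_10 <- function(y) is.numeric(y) && mean(y) > 10

# Drop the columns for which the function returns TRUE.
kept <- mixed_df %>% select(!where(mean_above_10))

names(kept)
```

The character column is kept because the function returns FALSE for it instead of trying to compute a mean.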

References and Help

The dplyr package is part of the Tidyverse collection of packages.

The select() function is documented on the dplyr reference page "Subset columns using their names and types". The selection helper functions are all linked from this page.

The tidyselect package forms the backend of the dplyr selection functions. Its Selection Language web page gives more details and examples.

The pipe operator, %>%, is provided by the magrittr package of the tidyverse.

If the select() function is not working as expected, we should check whether another loaded package also has a select() function. A quick way to check is to use the package name as a prefix when calling the function: dplyr::select().

If it works with the package prefix, we have two options: always use the prefix or load dplyr (or tidyverse) last. Functions in packages loaded later mask functions of the same name from packages loaded earlier.
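
As a minimal sketch of the prefixed form, the call below names the package explicitly, so it does not depend on which packages are attached or in what order. The small data frame small_df and the object name result are made up for the illustration.

```r
# A small hypothetical data frame; dplyr is not attached with library() here.
small_df <- data.frame(Col1 = c(10, 11), Col2 = c(20, 21), Col3 = c(30, 31))

# Call select() through its package prefix and drop two columns.
result <- dplyr::select(small_df, !c(Col1, Col2))

names(result)
```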

For help with R functions in RStudio, click Help > Search R Help and type the function name in the search box without parentheses.

Alternatively, type a question mark followed by the function name at the command prompt in the R Console. For example, ?select.

Conclusion

The dplyr package provides many selection helper functions and operators which allow us to drop multiple columns from a data frame using a single line of code.

We use the complement operator ! to drop the selected columns in all cases.

Author: Jesse John

Jesse is passionate about data analysis and visualization. He uses the R statistical programming language for all aspects of his work.