Remove Duplicate Rows by Column in R

  1. Use the distinct Function of the dplyr Package to Remove Duplicate Rows by Column in R
  2. Use group_by, filter and duplicated Functions to Remove Duplicate Rows by Column in R
  3. Use group_by and slice Functions to Remove Duplicate Rows by Column in R

This article demonstrates three ways to remove duplicate rows by column in R, all based on the dplyr package.

Use the distinct Function of the dplyr Package to Remove Duplicate Rows by Column in R

The dplyr package, one of the most widely used data manipulation libraries in R, provides the distinct function. distinct selects the unique rows of a given data frame. It takes the data frame as its first argument, followed by the variables to consider when determining uniqueness. Multiple column variables can be supplied, but the following code snippet demonstrates single-variable examples only. The optional .keep_all argument defaults to FALSE, in which case only the listed columns are returned; if the user explicitly passes TRUE, the function keeps all columns of the data frame, taking their values from the first row found for each distinct value. Note that dplyr code commonly uses the pipe operator, %>%, which supplies the value on its left as the first argument of the function on its right. Namely, x %>% f(y) becomes f(x, y).

library(dplyr)

df1 <- data.frame(id = c(1, 2, 2, 3, 3, 4, 5, 5),
                 gender = c("F", "F", "M", "F", "B", "B", "F", "M"),
                 variant = c("a", "b", "c", "d", "e", "f", "g", "h"))

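# Keep the first row for each distinct value of the given column
t1 <- df1 %>% distinct(id, .keep_all = TRUE)
t2 <- df1 %>% distinct(gender, .keep_all = TRUE)
# variant has no repeated values, so every row is kept
t3 <- df1 %>% distinct(variant, .keep_all = TRUE)

# The same approach applied to the built-in mtcars data set
df2 <- mtcars

tmp1 <- df2 %>% distinct(cyl, .keep_all = TRUE)
tmp2 <- df2 %>% distinct(mpg, .keep_all = TRUE)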
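
To make the effect of .keep_all and of the pipe more concrete, the following is a minimal sketch that reuses the same df1 data frame. The variable names (ids_only, first_per_id, same_result, by_pair) are chosen here for illustration and do not come from the original example.

```r
library(dplyr)

df1 <- data.frame(id = c(1, 2, 2, 3, 3, 4, 5, 5),
                  gender = c("F", "F", "M", "F", "B", "B", "F", "M"),
                  variant = c("a", "b", "c", "d", "e", "f", "g", "h"))

# Default (.keep_all = FALSE): only the id column is returned
ids_only <- df1 %>% distinct(id)

# .keep_all = TRUE: the first row for each id, with all columns kept
first_per_id <- df1 %>% distinct(id, .keep_all = TRUE)

# The piped form and the direct call give the same result
same_result <- identical(first_per_id, distinct(df1, id, .keep_all = TRUE))

# Two variables: rows are unique by the (id, gender) combination.
# In this data every combination occurs once, so no row is dropped.
by_pair <- df1 %>% distinct(id, gender, .keep_all = TRUE)

dim(ids_only)      # 5 rows, 1 column
dim(first_per_id)  # 5 rows, 3 columns
nrow(by_pair)      # 8 rows
```

Comparing by_pair with first_per_id shows that adding a variable narrows what counts as a duplicate: a row is dropped only when it matches an earlier row on every listed column.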

Use group_by, filter and duplicated Functions to Remove Duplicate Rows by Column in R

Another way to remove duplicate rows by column values is to group the data frame by the column variable and then filter the rows using the filter and duplicated functions. The first step is done with the group_by function, which is part of the dplyr package. Next, the grouped output is piped to the filter function, where the negated duplicated call keeps only the first row of each group and eliminates the rest. Note that the result is still a grouped data frame.

library(dplyr)

df1 <- data.frame(id = c(1, 2, 2, 3, 3, 4, 5, 5),
                 gender = c("F", "F", "M", "F", "B", "B", "F", "M"),
                 variant = c("a", "b", "c", "d", "e", "f", "g", "h"))

# Group by the column, then drop rows flagged as repeats within each group
t1 <- df1 %>% group_by(id) %>% filter(!duplicated(id))
t2 <- df1 %>% group_by(gender) %>% filter(!duplicated(gender))
t3 <- df1 %>% group_by(variant) %>% filter(!duplicated(variant))

# The same approach applied to the built-in mtcars data set
df2 <- mtcars

tmp3 <- df2 %>% group_by(cyl) %>% filter(!duplicated(cyl))
tmp4 <- df2 %>% group_by(mpg) %>% filter(!duplicated(mpg))
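
As a rough check of what this pipeline returns, the sketch below compares it with a plain base R subset that uses duplicated directly. The names grouped, plain and base_result are illustrative only, and the base R line is shown as a point of comparison rather than as part of the method described above.

```r
library(dplyr)

df1 <- data.frame(id = c(1, 2, 2, 3, 3, 4, 5, 5),
                  gender = c("F", "F", "M", "F", "B", "B", "F", "M"),
                  variant = c("a", "b", "c", "d", "e", "f", "g", "h"))

# The grouped pipeline keeps the first row of each id group
grouped <- df1 %>% group_by(id) %>% filter(!duplicated(id))

# The result still carries the grouping
still_grouped <- inherits(grouped, "grouped_df")

# ungroup() drops the grouping so later steps act on the whole table
plain <- grouped %>% ungroup()

# Base R comparison: duplicated() marks every repeat of an earlier id
base_result <- df1[!duplicated(df1$id), ]

# Both approaches keep the same rows, in the original row order
same_rows <- identical(as.character(plain$variant),
                       as.character(base_result$variant))
```

Calling ungroup is worth considering before further processing, since operations such as summarise or mutate would otherwise continue to run once per group.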

Use group_by and slice Functions to Remove Duplicate Rows by Column in R

Alternatively, one can use the group_by function together with slice to remove duplicate rows by column values. slice is also part of the dplyr package, and it selects rows by position. When the data frame is grouped, slice selects the rows at the given position within each group, so slice(1) keeps the first row of every group, as demonstrated in the following sample code.

library(dplyr)

df1 <- data.frame(id = c(1, 2, 2, 3, 3, 4, 5, 5),
                 gender = c("F", "F", "M", "F", "B", "B", "F", "M"),
                 variant = c("a", "b", "c", "d", "e", "f", "g", "h"))

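# Group by the column, then keep the first row of each group
t1 <- df1 %>% group_by(id) %>% slice(1)
t2 <- df1 %>% group_by(gender) %>% slice(1)
t3 <- df1 %>% group_by(variant) %>% slice(1)

# The same approach applied to the built-in mtcars data set
df2 <- mtcars

tmp5 <- df2 %>% group_by(cyl) %>% slice(1)
tmp6 <- df2 %>% group_by(mpg) %>% slice(1)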
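
One practical difference from distinct is row order, and slice also makes it easy to keep a row other than the first. The sketch below illustrates both points on the same df1 data frame; the names by_slice, by_distinct and last_per_id are illustrative, and slice(n()) is just one way to pick the last row of each group.

```r
library(dplyr)

df1 <- data.frame(id = c(1, 2, 2, 3, 3, 4, 5, 5),
                  gender = c("F", "F", "M", "F", "B", "B", "F", "M"),
                  variant = c("a", "b", "c", "d", "e", "f", "g", "h"))

# slice on grouped data returns rows in group order (B, F, M)
by_slice <- df1 %>% group_by(gender) %>% slice(1) %>% ungroup()

# distinct keeps rows in order of first appearance (F, M, B)
by_distinct <- df1 %>% distinct(gender, .keep_all = TRUE)

# n() is the group size, so slice(n()) keeps the last row of each group
last_per_id <- df1 %>% group_by(id) %>% slice(n()) %>% ungroup()

by_slice$variant     # "e" "a" "c"
by_distinct$variant  # "a" "c" "e"
last_per_id$variant  # "a" "c" "e" "f" "h"
```

Both by_slice and by_distinct contain the same three rows; only their ordering differs, which matters if later code relies on the original row order.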