How to Remove Duplicate Rows by Column in R

Jinku Hu Feb 02, 2024
  1. Use the distinct Function of the dplyr Package to Remove Duplicate Rows by Column in R
  2. Use group_by, filter and duplicated Functions to Remove Duplicate Rows by Column in R
  3. Use group_by and slice Functions to Remove Duplicate Rows by Column in R

This article demonstrates three ways to remove duplicate rows by column in R, all based on the dplyr package.

Use the distinct Function of the dplyr Package to Remove Duplicate Rows by Column in R

The distinct function comes from dplyr, one of the most widely used data manipulation packages in R. distinct selects unique rows in the given data frame. It takes the data frame as the first argument, followed by the variables to consider when deciding which rows are unique. Multiple columns can be supplied, in which case rows are compared by the combination of their values, but the following code snippet demonstrates single-variable examples. The optional .keep_all argument defaults to FALSE, which returns only the listed columns; if the user explicitly passes TRUE, the function keeps all other columns in the data frame as well, taking their values from the first row found for each distinct value. Note that dplyr code commonly uses the pipe operator %>%, which supplies the value on its left as the first argument of the function on its right. Namely, the notation x %>% f(y) becomes f(x, y).

library(dplyr)

df1 <- data.frame(id = c(1, 2, 2, 3, 3, 4, 5, 5),
                 gender = c("F", "F", "M", "F", "B", "B", "F", "M"),
                 variant = c("a", "b", "c", "d", "e", "f", "g", "h"))

# Keep the first row found for each distinct value of one column
t1 <- df1 %>% distinct(id, .keep_all = TRUE)
t2 <- df1 %>% distinct(gender, .keep_all = TRUE)
t3 <- df1 %>% distinct(variant, .keep_all = TRUE)

# The same approach applied to the built-in mtcars data set
df2 <- mtcars

tmp1 <- df2 %>% distinct(cyl, .keep_all = TRUE)
tmp2 <- df2 %>% distinct(mpg, .keep_all = TRUE)
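
The points made about the pipe, the .keep_all argument, and multiple columns can be checked with a short sketch. It reuses the sample data frame from the snippet above; the names piped, plain, ids_only and by_pair are illustrative and not part of the original example.

```r
library(dplyr)

# Same sample data as in the article
df1 <- data.frame(id = c(1, 2, 2, 3, 3, 4, 5, 5),
                  gender = c("F", "F", "M", "F", "B", "B", "F", "M"),
                  variant = c("a", "b", "c", "d", "e", "f", "g", "h"))

# The piped form and the plain function call are equivalent
piped <- df1 %>% distinct(id, .keep_all = TRUE)
plain <- distinct(df1, id, .keep_all = TRUE)
identical(piped, plain)

# With the default .keep_all = FALSE only the listed column is returned
ids_only <- df1 %>% distinct(id)
names(ids_only)

# With two columns, rows are compared by the (id, gender) combination.
# No pair repeats in this data, so all 8 rows are kept.
by_pair <- df1 %>% distinct(id, gender, .keep_all = TRUE)
nrow(by_pair)
```

Comparing nrow(piped) with nrow(by_pair) shows how the choice of columns changes which rows count as duplicates.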

Use group_by, filter and duplicated Functions to Remove Duplicate Rows by Column in R

Another way to remove duplicate rows by column values is to group the data frame by the column and then filter rows using the filter and duplicated functions. The first step is done with the group_by function, which is part of the dplyr package. Next, the grouped result is piped to the filter function, which keeps only the rows for which the base R function duplicated returns FALSE, that is, the first occurrence of each value. Note that the result is still a grouped tibble.

library(dplyr)

df1 <- data.frame(id = c(1, 2, 2, 3, 3, 4, 5, 5),
                 gender = c("F", "F", "M", "F", "B", "B", "F", "M"),
                 variant = c("a", "b", "c", "d", "e", "f", "g", "h"))

# Within each group, keep only the rows that are not flagged as duplicates
t1 <- df1 %>% group_by(id) %>% filter(!duplicated(id))
t2 <- df1 %>% group_by(gender) %>% filter(!duplicated(gender))
t3 <- df1 %>% group_by(variant) %>% filter(!duplicated(variant))

# The same approach applied to the built-in mtcars data set
df2 <- mtcars

tmp3 <- df2 %>% group_by(cyl) %>% filter(!duplicated(cyl))
tmp4 <- df2 %>% group_by(mpg) %>% filter(!duplicated(mpg))
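
To see what duplicated contributes here, the following sketch prints the flags it returns and compares the grouped filter with an ungrouped one. The ungrouped variant and the trailing ungroup() call are additions for illustration, not part of the original snippet, and the names flags, no_group and grouped are hypothetical.

```r
library(dplyr)

# Same sample data as in the article
df1 <- data.frame(id = c(1, 2, 2, 3, 3, 4, 5, 5),
                  gender = c("F", "F", "M", "F", "B", "B", "F", "M"),
                  variant = c("a", "b", "c", "d", "e", "f", "g", "h"))

# duplicated() marks every repeat after the first occurrence of a value
flags <- duplicated(df1$id)
flags
# FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE

# Filtering on the negated flags works without grouping as well
no_group <- df1 %>% filter(!duplicated(id))

# Grouped version, with the grouping dropped afterwards
grouped <- df1 %>% group_by(id) %>% filter(!duplicated(id)) %>% ungroup()

# Both variants keep the same rows
identical(as.character(no_group$variant), as.character(grouped$variant))
```

For a single column the two variants select the same rows, so the grouping step mainly matters when later operations should run per group.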

Use group_by and slice Functions to Remove Duplicate Rows by Column in R

Alternatively, one can use the group_by function together with slice to remove duplicate rows by column values. slice is also part of the dplyr package, and it selects rows by position. When the data frame is grouped, slice selects the rows at the given position within each group, so slice(1) keeps the first row of every group, as demonstrated in the following sample code.

library(dplyr)

df1 <- data.frame(id = c(1, 2, 2, 3, 3, 4, 5, 5),
                 gender = c("F", "F", "M", "F", "B", "B", "F", "M"),
                 variant = c("a", "b", "c", "d", "e", "f", "g", "h"))

# Keep the first row of each group
t1 <- df1 %>% group_by(id) %>% slice(1)
t2 <- df1 %>% group_by(gender) %>% slice(1)
t3 <- df1 %>% group_by(variant) %>% slice(1)

# The same approach applied to the built-in mtcars data set
df2 <- mtcars

tmp5 <- df2 %>% group_by(cyl) %>% slice(1)
tmp6 <- df2 %>% group_by(mpg) %>% slice(1)
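
A small sketch, again on the sample data frame, shows how the grouped slice behaves in practice: the output follows the order of the groups rather than the original row order, and a different position such as n() selects a different row per group. The ungroup() calls and the names first_per_gender, distinct_variant and last_per_gender are additions for illustration.

```r
library(dplyr)

# Same sample data as in the article
df1 <- data.frame(id = c(1, 2, 2, 3, 3, 4, 5, 5),
                  gender = c("F", "F", "M", "F", "B", "B", "F", "M"),
                  variant = c("a", "b", "c", "d", "e", "f", "g", "h"))

# First row of each gender group; ungroup() drops the grouping afterwards
first_per_gender <- df1 %>% group_by(gender) %>% slice(1) %>% ungroup()
first_per_gender$variant
# "e" "a" "c"  (groups come out ordered B, F, M)

# distinct() keeps the same rows, but in order of first appearance
distinct_variant <- df1 %>% distinct(gender, .keep_all = TRUE) %>% pull(variant)
distinct_variant
# "a" "c" "e"

# slice(n()) keeps the last row of each group instead of the first
last_per_gender <- df1 %>% group_by(gender) %>% slice(n()) %>% ungroup()
last_per_gender$variant
# "f" "g" "h"
```

If the original row order matters, the distinct approach may be the more convenient choice, while slice is more flexible about which row of each group is kept.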