How to Use the group_by Function in R Dplyr

Jesse John Feb 02, 2024
  1. Set Up dplyr Package in R
  2. Use the group_by() Function in R
  3. Use group_by() With summarize() in R
  4. Use group_by() With filter() in R
  5. Use group_by() With mutate() in R
  6. Ungroup a Tibble in R
  7. References

The group_by() function of the dplyr package groups the rows of a data frame or tibble by the values in one or more columns. We can then use these groups to create summaries, select specific groups for further analysis, or create new columns based on group properties.

Set Up dplyr Package in R

We need to install and load the dplyr package and create a small tibble to illustrate the working of the group_by() function.

Example Code:

# Install dplyr. Or install the tidyverse.
# UNCOMMENT THE FOLLOWING LINE TO INSTALL.
# install.packages("dplyr")

# Load dplyr
library(dplyr)

# Create vectors.
set.seed(11)
Col_code = sample(2200:7200, 10, replace=FALSE)
set.seed(222)
Col_one = sample(c("RD", "GN", "YW"), 10, replace = TRUE)
set.seed(4444)
Col_two = sample(c(3, 6), 10, replace = TRUE)

# Create a tibble.
my_t = tibble(Col_code, Col_one, Col_two)

# View the tibble.
my_t

Use the group_by() Function in R

Nothing seems to happen when we use group_by() on a tibble: the rows and columns are returned unchanged. The group_by() function only records which columns define the groups, and later functions use that information.

Example Code:

# Use group_by().
group_by(my_t, Col_two)

Output:

# A tibble: 10 x 3
# Groups:   Col_two [2]
   Col_code Col_one Col_two
      <int> <chr>     <dbl>
 1     3985 RD            6
 2     2233 GN            6
 3     2895 YW            6
 4     3120 GN            6
 5     6439 YW            3
 6     4819 GN            6
 7     2573 GN            6
 8     5484 RD            6
 9     6509 GN            3
10     4309 RD            3

The code returns a tibble with the same number of rows as the original. But notice the second line of the output, # Groups: Col_two [2].

It shows that the specified column has been marked for grouping and that it contains two distinct values, so there are two groups.
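
Besides reading the printed header, we can query the grouping directly. The following is a minimal sketch that uses a small, hypothetical tibble (demo_t, which is not the article's my_t) and dplyr's is_grouped_df(), group_vars(), n_groups(), and group_keys() functions.

```r
library(dplyr)

# Hypothetical demo tibble, used only for illustration.
demo_t = tibble(
  colour = c("RD", "GN", "GN", "YW", "RD", "GN"),
  size   = c(3, 6, 6, 3, 6, 3)
)

# Mark one column for grouping.
grouped_t = group_by(demo_t, size)

# Is the tibble grouped?
is_grouped_df(demo_t)
is_grouped_df(grouped_t)

# Which columns define the groups?
group_vars(grouped_t)

# How many groups are there, and what are their key values?
n_groups(grouped_t)
group_keys(grouped_t)
```

These checks are handy inside longer pipelines, where it is easy to lose track of whether a tibble is still grouped.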

Use group_by() With summarize() in R

In many cases, group_by() is used in combination with summarize(). The function can also be spelt as summarise().

Once the data is grouped, summarize() computes its result separately for each group. In the example, we will use the n() function inside summarize() to count the number of rows in each group.

We will also use the pipe operator, %>%, to improve the readability of the code.

Example Code:

# Group by one column.
my_t %>% group_by(Col_two) %>% summarize(n())

Output:

# A tibble: 2 x 2
  Col_two `n()`
    <dbl> <int>
1       3     3
2       6     7

We have one row for each value of Col_two in the output.

We can group by more than one column at a time, as shown below.

Example Code:

# Group by more than one column.
my_t %>% group_by(Col_one, Col_two) %>% summarize(Num_Rows = n())

Output:

# A tibble: 6 x 3
# Groups:   Col_one [3]
  Col_one Col_two Num_Rows
  <chr>     <dbl>    <int>
1 GN            3        1
2 GN            6        4
3 RD            3        1
4 RD            6        2
5 YW            3        1
6 YW            6        1

In the output, we have one row for each combination of Col_one and Col_two that exists in the original tibble. The last column, created with n(), shows how many rows each combination has.
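
Counting rows per group is common enough that dplyr also provides count() as a shortcut. The sketch below, again on a hypothetical tibble named demo_t, compares the two forms. Note that count() names its result column n and returns an ungrouped tibble.

```r
library(dplyr)

# Hypothetical demo tibble, used only for illustration.
demo_t = tibble(
  colour = c("RD", "GN", "GN", "YW", "RD", "GN"),
  size   = c(3, 6, 6, 3, 6, 3)
)

# Long form: group, then count rows with n().
by_hand = demo_t %>%
  group_by(colour, size) %>%
  summarize(n = n(), .groups = "drop")

# Short form: count() groups, counts, and ungroups in one step.
short = demo_t %>% count(colour, size)

by_hand
short
```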

The summarize() function can calculate several group statistics, for example, the mean.

Example Code:

# Calculate the mean.
# The output has 3 significant digits by default.
my_t %>% group_by(Col_one, Col_two) %>% summarize(mean(Col_code))

# Convert the output to a data frame to see the decimal places.
my_t %>% group_by(Col_one, Col_two) %>% summarize(mean(Col_code)) %>% as.data.frame()

Output:

> my_t %>% group_by(Col_one, Col_two) %>% summarize(mean(Col_code))
`summarise()` has grouped output by 'Col_one'. You can override using the `.groups` argument.
# A tibble: 6 x 3
# Groups:   Col_one [3]
  Col_one Col_two `mean(Col_code)`
  <chr>     <dbl>            <dbl>
1 GN            3            6509
2 GN            6            3186.
3 RD            3            4309
4 RD            6            4734.
5 YW            3            6439
6 YW            6            2895

> # Convert the output to a data frame to see the decimal places.
> my_t %>% group_by(Col_one, Col_two) %>% summarize(mean(Col_code)) %>% as.data.frame()
`summarise()` has grouped output by 'Col_one'. You can override using the `.groups` argument.
  Col_one Col_two mean(Col_code)
1      GN       3        6509.00
2      GN       6        3186.25
3      RD       3        4309.00
4      RD       6        4734.50
5      YW       3        6439.00
6      YW       6        2895.00

The first command returned a tibble. By default, tibbles print numbers with three significant digits, so the decimal part is hidden; for example, 3186.25 is printed as 3186. with a trailing dot.

We can convert the output to a data frame to get the output in the usual format.

We can also use the num() function of the tibble package (or pillar package) to show decimal digits. We will use a negative integer to specify the maximum number of decimal digits to display.

Example Code:

library(tibble)
my_t %>% group_by(Col_one, Col_two) %>% summarize(tMean = num(mean(Col_code),digits=-2))

Output:

# A tibble: 6 x 3
# Groups:   Col_one [3]
  Col_one Col_two    tMean
  <chr>     <dbl> <num:.2>
1 GN            3  6509
2 GN            6  3186.25
3 RD            3  4309
4 RD            6  4734.5
5 YW            3  6439
6 YW            6  2895
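
A single call to summarize() can also compute several statistics at once, each as its own named column. The following sketch assumes a hypothetical tibble, demo_t, with a numeric column named code; the column names are illustrative only.

```r
library(dplyr)

# Hypothetical demo tibble, used only for illustration.
demo_t = tibble(
  size = c(3, 6, 6, 3, 6, 3),
  code = c(100L, 200L, 300L, 400L, 500L, 600L)
)

# One row per group, one column per statistic.
stats_t = demo_t %>%
  group_by(size) %>%
  summarize(
    Num_Rows  = n(),
    Mean_Code = mean(code),
    Min_Code  = min(code),
    Max_Code  = max(code)
  )

stats_t
```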

A specific point to note is that, by default, summarize() drops the last grouping level. To see this effect, we will save the intermediate results to new objects and use the group_vars() function to check the grouping.

Example Code:

# Create a tibble with two levels of groupings.
tib_2_gr = my_t %>% group_by(Col_one, Col_two)

# Check that the tibble is grouped by two variables.
group_vars(tib_2_gr)

# Use the summarize() function once.
tib_1_gr = my_t %>% group_by(Col_one, Col_two) %>% summarize(Num_Rows = n())

# Check that the new tibble is grouped by only one variable after using summarize().
group_vars(tib_1_gr)

Output:

> group_vars(tib_2_gr)
[1] "Col_one" "Col_two"

> group_vars(tib_1_gr)
[1] "Col_one"
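
The message printed by summarise() in the earlier output mentions the .groups argument, which controls this behavior. As a short sketch on a hypothetical tibble (demo_t), .groups = "drop" removes all grouping from the result, while .groups = "keep" retains every grouping column.

```r
library(dplyr)

# Hypothetical demo tibble, used only for illustration.
demo_t = tibble(
  colour = c("RD", "GN", "GN", "YW", "RD", "GN"),
  size   = c(3, 6, 6, 3, 6, 3)
)

# Drop all grouping from the summarized result.
dropped = demo_t %>%
  group_by(colour, size) %>%
  summarize(Num_Rows = n(), .groups = "drop")

# Keep both grouping columns in the summarized result.
kept = demo_t %>%
  group_by(colour, size) %>%
  summarize(Num_Rows = n(), .groups = "keep")

# Compare the grouping of the two results.
group_vars(dropped)
group_vars(kept)
```

Setting .groups explicitly also suppresses the informational message.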

Use group_by() With filter() in R

Unlike SQL, which has separate WHERE and HAVING clauses, dplyr’s filter() function works on both ungrouped and grouped data.

We will first use filter() on a value from the original data in a grouped tibble.

Example Code:

# Create a tibble with groups.
t_fil = my_t %>% group_by(Col_one, Col_two)

# Remove rows where Col_one is 'RD'.
t_fil %>% filter(Col_one != "RD")

Output:

# A tibble: 7 x 3
# Groups:   Col_one, Col_two [4]
  Col_code Col_one Col_two
     <int> <chr>     <dbl>
1     2233 GN            6
2     2895 YW            6
3     3120 GN            6
4     6439 YW            3
5     4819 GN            6
6     2573 GN            6
7     6509 GN            3

The data is filtered, and the grouping is still present. We can now use summarize() to get group summaries.

Next, let us use a filter on a value that we calculate for a group.

Example Code:

# First summarize.
t_fil %>% summarize(AVE = num(mean(Col_code), digits=-2))

# Now filter the summarized data.
# We will provide the new summary column to the filter function.
t_fil %>% summarize(AVE = num(mean(Col_code), digits=-2)) %>% filter(AVE > 4000)

Output of the last command:

# A tibble: 4 x 3
# Groups:   Col_one [3]
  Col_one Col_two      AVE
  <chr>     <dbl> <num:.2>
1 GN            3   6509
2 RD            3   4309
3 RD            6   4734.5
4 YW            3   6439
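
On a grouped tibble, filter() can also evaluate a group-level expression directly, without summarizing first. In that case, the original rows are kept rather than one row per group. The sketch below uses a hypothetical tibble (demo_t) and an arbitrary threshold of 350.

```r
library(dplyr)

# Hypothetical demo tibble, used only for illustration.
demo_t = tibble(
  colour = c("RD", "GN", "GN", "YW", "RD", "GN"),
  code   = c(100L, 200L, 300L, 400L, 500L, 600L)
)

# Keep all rows of the groups whose mean code exceeds 350.
big_groups = demo_t %>%
  group_by(colour) %>%
  filter(mean(code) > 350)

# Keep only the row with the smallest code in each group.
smallest = demo_t %>%
  group_by(colour) %>%
  filter(code == min(code))

big_groups
smallest
```

The first form plays a role similar to SQL's HAVING, except that it returns the member rows of the qualifying groups instead of their summaries.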

Use group_by() With mutate() in R

In the following example, we will see that mutate() acts within the defined groups. The tibble is grouped by Col_one, and the new column gives the minimum value of Col_code within each group, repeated on every row of that group.

Example Code:

# Group data.
t_mut = my_t %>% group_by(Col_one)

# Mutate based on grouping.
t_mut %>% mutate(MIN_GR_CODE = min(Col_code)) %>% arrange(.by_group = TRUE)

# If we use summarize(), we do not get the columns that were not grouped.
t_mut %>% summarize(MIN_GR_CODE = min(Col_code))

Output:

> # Mutate based on grouping.
> t_mut %>% mutate(MIN_GR_CODE = min(Col_code)) %>% arrange(.by_group = TRUE)
# A tibble: 10 x 4
# Groups:   Col_one [3]
   Col_code Col_one Col_two MIN_GR_CODE
      <int> <chr>     <dbl>       <int>
 1     2233 GN            6        2233
 2     3120 GN            6        2233
 3     4819 GN            6        2233
 4     2573 GN            6        2233
 5     6509 GN            3        2233
 6     3985 RD            6        3985
 7     5484 RD            6        3985
 8     4309 RD            3        3985
 9     2895 YW            6        2895
10     6439 YW            3        2895

> # If we use summarize(), we do not get the columns that were not grouped.
> t_mut %>% summarize(MIN_GR_CODE = min(Col_code))
# A tibble: 3 x 2
  Col_one MIN_GR_CODE
  <chr>         <int>
1 GN             2233
2 RD             3985
3 YW             2895
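
Grouped mutate() is useful whenever each row has to be compared with its own group. As a sketch on a hypothetical tibble (demo_t), we can compute each row's difference from its group mean and its rank within the group; min_rank() is dplyr's ranking function.

```r
library(dplyr)

# Hypothetical demo tibble, used only for illustration.
demo_t = tibble(
  colour = c("RD", "GN", "GN", "YW", "RD", "GN"),
  code   = c(100L, 200L, 300L, 400L, 500L, 600L)
)

centred = demo_t %>%
  group_by(colour) %>%
  mutate(
    Group_Mean    = mean(code),        # mean of the row's own group
    Diff          = code - Group_Mean, # distance from that group mean
    Rank_In_Group = min_rank(code)     # 1 = smallest code in the group
  ) %>%
  ungroup()

centred
```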

Ungroup a Tibble in R

Once we finish our analysis on a tibble grouped using group_by(), we should use the ungroup() function to remove the grouping. This will ensure that subsequent analysis will be on ungrouped data.

Like other dplyr functions, ungroup() returns a new tibble and does not modify its input. We need to save the ungrouped version of the tibble as an object to make the change persistent.

Example Code:

# View a grouped tibble.
tib_2_gr
# The grouping is mentioned as the second line in the output.

# We can also check the grouping using the group_vars() function.
group_vars(tib_2_gr)

# ungroup() the tibble.
ungroup(tib_2_gr)

# Check the groups.
group_vars(tib_2_gr)
# The groups are still there because we did not save the change.

# Save to the same object name.
tib_2_gr = ungroup(tib_2_gr)

# Now check the groupings.
group_vars(tib_2_gr)
# There is no grouping.

Output of last two commands:

> # Save to the same object name.
> tib_2_gr = ungroup(tib_2_gr)
>
> # Now check the groupings.
> group_vars(tib_2_gr)
character(0)
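
To see why leftover grouping matters, the following sketch runs the same mutate() call on a grouped and an ungrouped version of a hypothetical tibble (demo_t). The grouped version computes one mean per group, while the ungrouped version computes a single overall mean.

```r
library(dplyr)

# Hypothetical demo tibble, used only for illustration.
demo_t = tibble(
  colour = c("RD", "GN", "GN", "YW", "RD", "GN"),
  code   = c(100L, 200L, 300L, 400L, 500L, 600L)
)

grouped_t = demo_t %>% group_by(colour)

# Grouping still present: the mean is computed per colour.
with_groups = grouped_t %>% mutate(Ave = mean(code))

# Grouping removed first: the mean is computed over all rows.
without_groups = grouped_t %>% ungroup() %>% mutate(Ave = mean(code))

with_groups
without_groups
```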

References

For more details, refer to dplyr’s documentation.

For num(), see tibble’s documentation.

Author: Jesse John