# How to Sort Data Frame by Column in R

Sheeraz Gul Feb 22, 2024

Sorting data frames by column is a fundamental operation in data analysis and manipulation tasks using the R programming language. Whether organizing datasets for analysis or preparing data for visualization, the ability to sort data frames efficiently is essential for extracting meaningful insights.

In this article, we’ll explore several methods to achieve this task, including built-in functions like `order()`, tools from popular packages like `dplyr`, as well as specialized functions like `setorder()` and `setkey()` from the `data.table` package.

## How to Sort a Data Frame by Column in R Using the `order()` Function

In R, one efficient way to sort a data frame by one or more columns is by using the `order()` function. This function provides a straightforward approach to rearrange the rows of a data frame based on the values of one or more columns.

The syntax of the `order()` function is as follows:

``````order(..., na.last = TRUE, decreasing = FALSE)
``````

Parameters:

• `...`: One or more numeric, character, or factor vectors to be sorted.
• `na.last`: A logical value indicating how missing values (NA) should be handled. If `TRUE`, NA values are put at the end of the sorted vector; if `FALSE`, they are put at the beginning.
• `decreasing`: A logical value indicating whether the sort should be in decreasing order (`TRUE`) or increasing order (`FALSE`). The default is `FALSE` (i.e., ascending order).

The `order()` function returns a vector of integers that represents the permutation that rearranges its first argument into ascending or descending order. If multiple vectors are provided, the sorting is first based on the first vector, then the second, and so on, with ties broken by subsequent vectors.

This function does not actually sort the original data; rather, it returns the indices that would sort the data. This permutation can then be used to reorder the rows of the data frame accordingly.

### Example: Sorting a Data Frame by Column Using the `order()` Function

Let’s consider a scenario where we have a data frame `employee_data` containing employee IDs and their corresponding salaries. We want to sort this data frame by the `salary` column in ascending order.

``````employee_data <- data.frame(
employeeId = c(10, 15, 14, 12, 13),
salary = c(3000, 2500, 1000, 3500, 2000)
)

print("Original Data Frame:")
print(employee_data)

sorted_indices <- order(employee_data\$employeeId)
sorted_df <- employee_data[sorted_indices, ]

print("Sorted Data Frame by Employee ID:")
print(sorted_df)
``````

In the provided example, we start by defining our data frame `employee_data` with the `data.frame()` function, specifying two vectors: `employeeId` and `salary`. These vectors represent the employee IDs and their corresponding salaries, respectively.

We then utilize the `order()` function to determine the order of indices that would sort the `salary` column in ascending order. This function extracts the values from the `salary` column, sorts them, and returns the corresponding indices.

Finally, we use these sorted indices to rearrange the rows of the `employee_data` data frame, resulting in `sorted_df`, which now displays the original data frame sorted by the `salary` column in ascending order.

Output:

In the output, we can observe that the original data frame is sorted based on the `employeeId` column in ascending order, resulting in the rearrangement of rows accordingly.

## How to Sort a Data Frame by Column in R Using the `arrange()` Function From the `dplyr` Package

In addition to the `order()` function, the `dplyr` package offers a convenient way to sort data frames by column using the `arrange()` function.

This function enables easy sorting of data frames by one or more columns. It allows you to specify column names along with sorting orders (ascending or descending) directly within the function call.

The syntax of the `arrange()` function is as follows:

``````arrange(.data, ..., .by_group = FALSE)
``````

Parameters:

• `.data`: The data frame to be sorted.
• `...`: The column names or expressions used for sorting. Multiple columns can be specified, separated by commas.
• `.by_group`: A logical value indicating whether to arrange within groups (if any). The default value is `FALSE`.

The `arrange()` function sorts rows of a data frame based on the specified columns. It arranges the data in ascending order by default.

To sort in descending order, you can use the `desc()` function from the `dplyr` package.

### Example: Sorting a Data Frame by Column Using the `arrange()` Function

Let’s continue with the scenario of sorting a data frame containing employee IDs and salaries, this time using the `arrange()` function.

``````library(dplyr)

employee_data <- data.frame(
employeeId = c(10, 15, 14, 12, 13),
salary = c(3000, 2500, 1000, 3500, 2000)
)

print("Original Data Frame:")
print(employee_data)

sorted_df <- arrange(employee_data, employeeId)

print("Sorted Data Frame by Employee ID:")
print(sorted_df)
``````

Here, we start by loading the `dplyr` package using the `library(dplyr)` command. This package provides a suite of functions for data manipulation tasks, including the `arrange()` function, which we’ll use for sorting the data frame.

Next, we create a sample data frame named `employee_data` using the `data.frame()` function. This data frame contains the same columns as the previous example.

After creating the data frame, we display the original contents using the `print()` function for comparison purposes. This allows us to observe the initial arrangement of rows in the data frame before sorting.

Then, we utilize the `arrange()` function from the `dplyr` package to sort the data frame based on the `employeeId` column in ascending order. This is achieved by specifying the column name (`employeeId`) as the argument to the `arrange()` function.

``````sorted_df <- arrange(employee_data, employeeId)
``````

The sorted data frame is stored in a new variable named `sorted_df`. This variable holds the rearranged version of the original data frame, with rows sorted based on the values in the `employeeId` column.

Finally, we print both the original and sorted data frames using the `print()` function to visually inspect the changes.

Output:

As observed in the output, the data frame has been sorted based on the `employeeId` column in ascending order, similar to the result obtained using the `order()` function.

## How to Sort a Data Frame by Column in R Using the `setorder()` Function From the `data.table` Package

In addition to base R functions and packages like `dplyr`, the `data.table` package offers efficient ways to manipulate and sort large datasets. One of its key functions, `setorder()`, provides a high-performance method for sorting data frames by one or more columns.

The `data.table` package is known for its fast and memory-efficient methods for working with large datasets in R. The `setorder()` function is one such feature, designed specifically for sorting data frames by columns.

Unlike other sorting functions, such as `order()` or `arrange()`, which return a sorted copy of the data, `setorder()` sorts the data table in place. This results in significant memory and time savings, particularly for large datasets.

The syntax of the `setorder()` function is as follows:

``````setorder(x, ..., na.last = NA, by = NULL)
``````

Parameters:

• `x`: The data table to be sorted.
• `...`: The column names or expressions used for sorting. Multiple columns can be specified, separated by commas.
• `na.last`: A logical value indicating whether missing values should be placed at the end (`TRUE`) or the beginning (`FALSE`) of the sorted result. The default value is `NA`.
• `by`: A variable or expression by which to group the data table before sorting.

### Example: Sorting a Data Frame by Column Using the `setorder()` Function

Let’s continue with our example of sorting a data frame containing employee IDs and salaries, this time using the `setorder()` function from the `data.table` package.

``````library(data.table)

employee_data <- data.frame(
employeeId = c(10, 15, 14, 12, 13),
salary = c(3000, 2500, 1000, 3500, 2000)
)

setDT(employee_data)

print("Original Data Table:")
print(employee_data)

setorder(employee_data, employeeId)

print("Sorted Data Table by Employee ID:")
print(employee_data)
``````

In this example code, we begin by loading the `data.table` package using the `library(data.table)` command. This package provides enhanced functionalities for data manipulation, including the `setorder()` function.

Next, we create a sample data frame named `employee_data` containing employee IDs and corresponding salaries, similar to the previous examples.

Before sorting, we convert the data frame to a `data.table` object using the `setDT()` function. This conversion is necessary as the `setorder()` function specifically operates on `data.table` objects.

We then display the original contents of the data table using the `print()` function for reference.

Using the `setorder()` function, we sort the data table based on the `employeeId` column in ascending order. Unlike other sorting methods, `setorder()` modifies the original data table in place, resulting in improved efficiency for large datasets.

Finally, we print the sorted data table using the `print()` function to observe the changes.

Output:

As we can see in the output, the data table has been sorted based on the `employeeId` column in ascending order.

## How to Sort a Data Frame by Column in R Using the `setkey()` Function From the `data.table` Package

In addition to the `setorder()` function, the `data.table` package offers another method for sorting data frames by column: the `setkey()` function. This function not only sorts the data frame but also sets the specified column as the primary key, enabling subsequent fast lookups and joins based on that column.

The syntax of the `setkey()` function is as follows:

``````setkey(x, ..., verbose = getOption("datatable.verbose"))
``````

Parameters:

• `x`: The data table to be sorted and keyed.
• `...`: The column names or expressions used for sorting and keying. Multiple columns can be specified, separated by commas.
• `verbose`: A logical value indicating whether verbose output should be printed. The default value is obtained from the global option `datatable.verbose`.

The `setkey()` function both sorts the data table and sets the specified column(s) as the primary key(s). This primary key is used for indexing and facilitates fast data access, especially when performing joins and subsetting operations.

Similar to `setorder()`, `setkey()` operates directly on the data table, resulting in efficient memory usage and performance improvements, particularly for large datasets.

### Example: Sorting a Data Frame by Column Using the `setkey()` Function

Let’s continue with our example of sorting a data frame containing employee IDs and salaries, this time using the `setkey()` function from the `data.table` package.

``````library(data.table)

employee_data <- data.frame(
employeeId = c(10, 15, 14, 12, 13),
salary = c(3000, 2500, 1000, 3500, 2000)
)

setDT(employee_data)

print("Original Data Table:")
print(employee_data)

setkey(employee_data, employeeId)

print("Sorted and Keyed Data Table by Employee ID:")
print(employee_data)
``````

Here, we begin by loading the `data.table` package using the `library(data.table)` command. Next, we create a sample data frame named `employee_data` containing employee IDs and corresponding salaries.

To use `data.table`, we convert this data frame into a `data.table` object using the `setDT()` function. This step prepares our data for efficient handling and manipulation.

Before sorting, we display the original contents of the data table using the `print()` function. This allows us to observe the initial arrangement of rows in the data table before any sorting or keying operations.

Using the `setkey()` function, we sort the data table based on the `employeeId` column and set it as the primary key. Unlike other sorting methods, such as `setorder()`, `setkey()` not only sorts the data table but also establishes the specified column as the primary key.

Lastly, we print the sorted and keyed data table using the `print()` function to observe the changes.

Output:

As we can see, the data table has been sorted based on the `employeeId` column and set as the primary key. This enables efficient data access based on the sorted column, facilitating faster lookups and joins in subsequent operations.

## Conclusion

In conclusion, sorting a data frame by column in R is a fundamental operation in data analysis and manipulation tasks. Throughout this article, we’ve explored various methods to accomplish this task, including the `order()` function, the `arrange()` function from the `dplyr` package, the `setorder()` function, and the `setkey()` function from the `data.table` package.

Each method offers its advantages and may be preferred depending on the specific requirements of the task at hand. The `order()` function provides a basic yet effective way to sort data frames by a single column, while the `arrange()` function offers more flexibility and ease of use, especially for those familiar with the `dplyr` package.

On the other hand, the `setorder()` function from the `data.table` package provides efficient in-place sorting, making it ideal for handling large datasets. Additionally, the `setkey()` function not only sorts the data but also sets the specified column as the primary key, enabling efficient indexing and fast data access.

By understanding and utilizing these sorting methods, data analysts and scientists can efficiently organize and manipulate data frames to derive meaningful insights and make informed decisions in R programming.

Author: Sheeraz Gul

Sheeraz is a Doctorate fellow in Computer Science at Northwestern Polytechnical University, Xian, China. He has 7 years of Software Development experience in AI, Web, Database, and Desktop technologies. He writes tutorials in Java, PHP, Python, GoLang, R, etc., to help beginners learn the field of Computer Science.