How to Count Unique Lines in a File in Linux

Olorunfemi Akinlua Feb 02, 2024
  1. Use the sort and uniq Commands to Count Unique Lines in a File
  2. Use the awk Command to Count Unique Lines in a File

Counting the unique lines in a file is a common task in Linux, and several different tools and methods can be used to perform this operation. In general, the appropriate method will depend on the specific requirements and constraints of the task, such as the size of the input file, the performance and memory requirements, and the format and content of the data.

Use the sort and uniq Commands to Count Unique Lines in a File

One approach for counting unique lines in a file in Linux is to use the sort and uniq commands. The sort command sorts the input data in a specified order, and the uniq command filters out duplicate lines from the sorted data.

The data.txt file contains the content below for the examples in this article.

arg1
arg2
arg3
arg2
arg2
arg1

To count the number of unique lines in the file, you can use the following command:

sort data.txt | uniq -c | wc -l

Output:

3

This command sorts the data.txt file in ascending order (the default) and pipes the output to the uniq command. With the -c option, uniq collapses runs of duplicate lines in the sorted data and prefixes each remaining line with the number of times it appeared in the input.

The output is then piped to the wc command, which counts the number of lines in the input and prints the result to the terminal.
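If you only need the total number of unique lines, not the per-line counts, sort's -u option de-duplicates during the sort, so the separate uniq step can be dropped. A minimal sketch (the printf line recreates the sample data.txt from above):

```shell
# Recreate the sample file used throughout this article
printf 'arg1\narg2\narg3\narg2\narg2\narg1\n' > data.txt

# sort -u sorts and removes duplicates in one pass,
# so only wc -l is needed to get the count
sort -u data.txt | wc -l
# prints 3
```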

The sort and uniq commands are simple and efficient tools for counting unique lines in a file and are suitable for most common scenarios. However, they have some limitations and drawbacks, such as the need for sorting the input data, which can be slow and memory-intensive for large files.

In addition, the uniq command only removes adjacent duplicate lines, so running it on unsorted input will report more unique lines than actually exist whenever duplicates are not next to each other.
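The adjacency limitation is easy to demonstrate on the sample file, where the two arg1 lines are separated by other content:

```shell
# Recreate the sample file
printf 'arg1\narg2\narg3\narg2\narg2\narg1\n' > data.txt

# Without sorting first, uniq only collapses the two adjacent
# arg2 lines, leaving: arg1, arg2, arg3, arg2, arg1
uniq data.txt | wc -l
# prints 5, not the expected 3
```

This is why the sort step is mandatory before uniq when the input is not already sorted.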

Use the awk Command to Count Unique Lines in a File

Another approach for counting unique lines in a file in Linux is to use the awk command, a powerful text processing tool that can perform various operations on text files. The awk command has a built-in associative array data structure, which can store and count the occurrences of each line in the input.

For example, to count the number of unique lines in a file called data.txt, you can use the following command:

awk '!a[$0]++' data.txt | wc -l

Output:

3

This command uses awk to read the data.txt file and applies the condition !a[$0]++ to each input line. Here, $0 is the entire line, and a is an associative array indexed by line content; the ++ post-increment returns the current value of a[$0] and then increments it, so it counts how many times each line has been seen.

Because ++ is a post-increment, the value handed to the ! operator is the count before incrementing: 0 the first time a line appears, and 1 or more on every repeat.

The ! operator negates that value, so the condition is true only on a line's first occurrence. Since a bare condition with no action block makes awk print the current line when the condition is true, each distinct line is printed exactly once.

The output is then piped to the wc command, which counts the number of lines in the input and prints the result to the terminal.
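The same associative array can also report how often each line occurs, which the sort/uniq pipeline needed the -c flag for. A short sketch; note that awk's for (key in array) loop visits keys in an unspecified order:

```shell
# Recreate the sample file
printf 'arg1\narg2\narg3\narg2\narg2\narg1\n' > data.txt

# Count every distinct line, then print "count line" pairs
# in the END block (iteration order is arbitrary)
awk '{a[$0]++} END {for (line in a) print a[line], line}' data.txt
```

Piping the result through sort gives a stable ordering, e.g. sort -rn to list the most frequent lines first.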

The awk command also provides several options and features that can be used to control its behavior and customize its output. For example, you can use the -F option to specify a different field separator or the -v option to define a variable that can be used in the script.

You can also use the printf function to format the output of the awk command in various ways.

Here is an example of a more complex awk script that uses these features on a file called data.txt, treating each line as a comma-separated list of fields and counting the occurrences of each distinct first field:

awk -F, '{a[$1]++} END {for (i in a) { printf "%s,%d\n", i, a[i] }}' data.txt | wc -l

Output:

3

This script uses the -F option to specify the , character as the field separator, and it defines an a array that stores the occurrence count of each distinct first field.

The awk command reads each line of the data.txt file and increments a[$1], the entry for the line's first field. Because the sample data.txt contains no commas, $1 is the whole line and the count is the same as before; on genuinely comma-separated data, the script counts distinct first fields instead of distinct lines.

The END block of the script is executed after all input lines have been read, and it iterates over the a array with a for loop. The printf function formats the output, printing each distinct first field together with its count.

The output is then piped to the wc command, which counts the number of lines in the input and prints the result to the terminal.
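To see the field separator actually matter, the script can be run against input that contains commas. The users.csv file and its contents below are hypothetical, invented only to illustrate the behavior:

```shell
# Hypothetical CSV input: "name,role" per line
printf 'alice,admin\nbob,user\nalice,user\ncarol,admin\n' > users.csv

# Count occurrences of each distinct first field (the name);
# alice appears twice, so only 3 distinct names remain
awk -F, '{a[$1]++} END {for (i in a) printf "%s,%d\n", i, a[i]}' users.csv | wc -l
# prints 3
```

Here the two alice lines differ as whole lines but share a first field, so the field-based count (3) is smaller than the unique-line count (4).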

In conclusion, there are several ways to count unique lines in a file in Linux, and the appropriate method will depend on the specific requirements and constraints of the task. The sort and uniq commands are simple and efficient tools for counting unique lines, and the awk command provides more advanced features and options for customizing the output and behavior of the script.
