Sort data on the Linux command-line

Unix was first used to process text files, so Unix and Linux both contain a variety of commands that let you act on text contained in files. Even in 2023, text-processing problems come up all the time. Knowing how to work the Linux command line can make some of them easier.

This came up recently when I was part of a meeting where we discussed product names, based on an 8-page list of different names from a web search. An important part of our discussion required knowing the most common instances of the names. Many of the names were repeated; others were minor variations, differing only in hyphenation or capitalization.

How could we simplify the list so we could discuss it? One person suggested running the data through a spreadsheet; another proposed using a statistical analysis package. I opened a Linux terminal, copied the list of names into a text file, and ran a few Linux commands to reduce the list of names to a manageable grouping. That meant we could discuss the list immediately.

Here’s how I used the Linux command line to quickly find repeated names in a long list.

Remove hyphens

Let’s say I have a list of names that are similar, except some have hyphens and others do not, plus one line that is different. I’ll save this in a plain text file called hyphens:

$ cat hyphens
Lorem-Ipsum
Lorem Ipsum
Dolor Sit Amet
Lorem Ipsum
Lorem-Ipsum
Lorem Ipsum
Lorem-Ipsum

For this comparison, it doesn’t matter whether the hyphen is there or not, so let’s remove it. You can use the Linux tr command to easily translate or convert one character to another. In this case, we want “Lorem-Ipsum” to be the same as “Lorem Ipsum.” The easiest way to do that is to convert the hyphen to a space:

$ tr - ' ' < hyphens
Lorem Ipsum
Lorem Ipsum
Dolor Sit Amet
Lorem Ipsum
Lorem Ipsum
Lorem Ipsum
Lorem Ipsum

Now every instance of “Lorem-Ipsum” will become “Lorem Ipsum.” To count the identical names, we can use the sort command to generate a sorted list, then the uniq command to print unique instances of each. The -c option to uniq prints a count of the repeated instances, which is the count we want:

$ tr - ' ' < hyphens | sort | uniq -c
      1 Dolor Sit Amet
      6 Lorem Ipsum
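
Because the goal is to find the most common names, it can help to rank the counts. The following is a minimal sketch, assuming the same sample data (recreated here in a temporary file so the example runs on its own). A second sort with -n (numeric) and -r (reverse) puts the largest count first:

```shell
# Recreate the sample list in a temporary file (illustrative setup only)
hyphens=$(mktemp)
printf '%s\n' 'Lorem-Ipsum' 'Lorem Ipsum' 'Dolor Sit Amet' 'Lorem Ipsum' \
    'Lorem-Ipsum' 'Lorem Ipsum' 'Lorem-Ipsum' > "$hyphens"

# Count identical names, then sort the counts numerically, largest first
ranked=$(tr - ' ' < "$hyphens" | sort | uniq -c | sort -rn)
printf '%s\n' "$ranked"

rm -f "$hyphens"
```

With a long list, adding head to the end of the pipeline would show only the top few entries.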

Remove trailing spaces

When I processed the 8-page list of names, I realized some of the entries had spaces at the ends of the lines. The name “Lorem Ipsum” with a space after it is different from the name “Lorem Ipsum” with no space. If your list might have trailing spaces, you can use a quick sed command to remove them.

Let’s start with a different list of names, saved in a file called spaces, where some lines have trailing spaces. Here, we can use the -E or --show-ends option with cat to display a marker at the end of each line, effectively showing where we have trailing spaces:

$ cat --show-ends spaces
Lorem Ipsum  $
Lorem Ipsum  $
Lorem Ipsum$
Lorem Ipsum  $
Lorem Ipsum$
Lorem Ipsum    $

sed is the standard stream editor, and allows you to perform several kinds of automated manipulations on text files. sed acts on regular expressions. A regular expression is a pattern of characters that matches text on lines in the file. For example, in a regular expression, ^ means the start of a line and $ means the end of a line. Also, + means one or more of the preceding character and * is zero or more of the preceding character.
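
As a small illustration of these metacharacters (a sketch with made-up input, not part of the original workflow): in sed's default basic regular expressions, a bare + is treated as a literal plus sign, so the first command assumes a sed that accepts the -E option for extended regular expressions, as GNU and BSD sed do.

```shell
# ' +' matches one or more spaces; replace each run with a single space.
# -E selects extended regular expressions, where + is a repetition operator.
squeezed=$(printf 'Lorem    Ipsum\n' | sed -E -e 's/ +/ /g')
printf '%s\n' "$squeezed"

# '^ *' matches zero or more spaces at the start of a line (no -E needed for *)
trimmed=$(printf '   Lorem Ipsum\n' | sed -e 's/^ *//')
printf '%s\n' "$trimmed"
```

Both commands should print Lorem Ipsum with single spacing and no leading spaces.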

To strip all trailing spaces from a file, we should match a space followed by *$, which means any number of spaces at the end of a line, and replace it with nothing, which deletes the extra spaces:

$ sed -e 's/ *$//' spaces | cat -e
Lorem Ipsum$
Lorem Ipsum$
Lorem Ipsum$
Lorem Ipsum$
Lorem Ipsum$
Lorem Ipsum$
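
The pattern above removes only space characters. If a list might also end lines with tabs (an assumption here, since the original list is not shown), the POSIX character class [[:space:]] covers both. A minimal sketch with made-up input:

```shell
# Two made-up lines: one ends in a tab and a space, the other in two spaces
cleaned=$(printf 'Lorem Ipsum\t \nLorem Ipsum  \n' | sed -e 's/[[:space:]]*$//')

# cat -e marks the end of each line with $, so leftover whitespace would be visible
printf '%s\n' "$cleaned" | cat -e
```

The same class also matches a carriage return, so this variation should tidy up lines that came from a file with Windows-style line endings as well.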

Convert to lowercase

Some of the names in my file were the same, except for variations on capitalization. But we weren’t interested in differences of uppercase and lowercase letters; we just wanted the names that were otherwise the same.

You can easily convert text to uppercase or lowercase with the tr command. tr can do more than translate single characters; it can also convert between character groups.

For this example, let’s start with three variations on the same name, stored in a file called capitals. One capitalizes both “Lorem” and “Ipsum,” one capitalizes only “Lorem,” and the last doesn’t use any capitalization:

$ cat capitals
Lorem Ipsum
Lorem ipsum
lorem ipsum

To convert all uppercase letters to lowercase letters, specify [:upper:] and [:lower:] as character groups in the tr command. This tr command converts the text to all lowercase:

$ tr '[:upper:]' '[:lower:]' < capitals
lorem ipsum
lorem ipsum
lorem ipsum
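
If you would rather keep one of the original spellings in the output, an alternative sketch is to compare case-insensitively instead of converting. This assumes a uniq that supports the -i option (the GNU and BSD versions do). Which spelling is printed for the group depends on the sort order, so treat it as a representative label only:

```shell
# Recreate the sample list in a temporary file (illustrative setup only)
capitals=$(mktemp)
printf '%s\n' 'Lorem Ipsum' 'Lorem ipsum' 'lorem ipsum' > "$capitals"

# sort -f ignores case when ordering; uniq -i ignores case when comparing,
# and -c prefixes each group with its count
grouped=$(sort -f "$capitals" | uniq -i -c)
printf '%s\n' "$grouped"

rm -f "$capitals"
```

All three lines should collapse into a single group with a count of 3.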

Replace words

In the 8-page list of names, some phrases used “and” while others used an ampersand. For example, the list might contain both “Lorem and Ipsum” and “Lorem & Ipsum.” But for our discussion, the difference was unimportant. Replacing one word with another is a great use case for sed.

Let’s start with a sample list of words which are basically the same except two spell out the word “and” while one uses an ampersand. Assume this plain text file called and:

$ cat and
Lorem and Ipsum
Lorem & Ipsum
Lorem and Ipsum

One option is to convert “and” to an ampersand, but this needs care. On the right side of a sed substitution, the ampersand is a special character that stands for the text that matched the regular expression, so a naive replacement changes nothing:

$ sed -e 's/and/&/' and
Lorem and Ipsum
Lorem & Ipsum
Lorem and Ipsum

That sed command means find and replace any instances of “and” with the text that matched it, which means replace “and” with “and.” If you want to replace the text with a literal ampersand, you need to “escape” it with a backslash:

$ sed -e 's/and/\&/' and
Lorem & Ipsum
Lorem & Ipsum
Lorem & Ipsum

This works well, but it can cause problems for words that contain the letters “and,” like the word “ampersand.” The same sed command blindly replaces the first “and” it finds on each line, even inside another word:

$ echo ampersand | sed -e 's/and/\&/' 
ampers&
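
If you do need to convert in this direction, one way to limit the damage is to match the surrounding spaces as well, so that only a standalone word is replaced. This is a sketch with a made-up input line, and it would miss an “and” at the very start or end of a line:

```shell
# Match ' and ' (with spaces) so the "and" inside "ampersand" is left alone
result=$(printf 'ampersand and asterisk\n' | sed -e 's/ and / \& /')
printf '%s\n' "$result"
```

Here the word “ampersand” should pass through unchanged while the standalone “and” becomes an ampersand.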

Instead, it’s more reliable to go the other way: convert an ampersand to the word “and.” This is also easier for humans to read:

$ sed -e 's/&/and/' and
Lorem and Ipsum
Lorem and Ipsum
Lorem and Ipsum
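
One caveat, sketched here with a made-up input line: without the g flag, sed replaces only the first match on each line. If a name might contain more than one ampersand, adding g to the end of the substitution replaces all of them:

```shell
# Without g, only the first ampersand on the line is replaced
first_only=$(printf 'One & Two & Three\n' | sed -e 's/&/and/')

# With g, every ampersand on the line is replaced
every=$(printf 'One & Two & Three\n' | sed -e 's/&/and/g')

printf '%s\n%s\n' "$first_only" "$every"
```

The first command should leave the second ampersand in place, while the second converts both.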

Putting it all together

To process a long list of names, ignoring capitalization and hyphens, and assuming “and” and ampersand mean the same, we can combine these commands to generate a simplified list. I’ll demonstrate with a randomized list called names of over two hundred entries that are just variations of two phrases:

$ wc -l names
238 names

In this sample, the lines differ in capitalization, hyphens, trailing spaces, and ampersands. Using the sort and uniq commands alone isn’t enough to reduce the list, because the variations in how each line is written mean uniq treats lines that differ only in spelling details as distinct entries:

$ sort names | uniq -c | wc -l
24

By using both tr and sed, I can quickly reduce the list to a more meaningful set:

$ tr - ' ' < names | sed -e 's/ *$//' | tr '[:upper:]' '[:lower:]' | sed -e 's/&/and/' | sort | uniq -c
    199 lorem ipsum
     39 one and two
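
For repeated use, the pipeline can be wrapped in a small shell function. This is a sketch, not the exact command above: the function name normalize_names and the sample lines are made up for illustration, the two sed expressions are merged into one call, the g flag is added to the ampersand substitution, and a final sort -rn ranks the counts.

```shell
# normalize_names: hypothetical helper that reads names on standard input
# and prints each distinct normalized name with its count, most common first
normalize_names() {
    tr - ' ' |
        tr '[:upper:]' '[:lower:]' |
        sed -e 's/ *$//' -e 's/&/and/g' |
        sort | uniq -c | sort -rn
}

# Made-up sample input covering each kind of variation
summary=$(printf '%s\n' 'Lorem-Ipsum' 'lorem ipsum  ' 'LOREM IPSUM' \
    'One & Two' 'one and two' | normalize_names)
printf '%s\n' "$summary"
```

With a real list, the function could be fed from a file instead, for example normalize_names < names.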

Processing plain text files like this may not be an everyday task, but it is exactly what Unix was written to do. Using a few tools on the Linux command line, every Linux systems administrator can quickly reduce a list of similar entries to a manageable grouping. Let the command line do the hard work for you.

Author

Jim Hall

Jim Hall is an open source software advocate, developer, and technical writer. At work, Jim is CEO of Hallmentum, providing workshops and training to organizations. Jim is also the editor-in-chief of Technically We Write, an article-based community website about technical writing and technical communication.

Published by Jim Hall on 2023-06-23
