Unix was first used to process text files, so Unix and Linux both include a variety of commands that act on text in files. Even in 2023, text-processing problems come up all the time, and knowing your way around the Linux command line can make some of them easier.

This came up recently when I was part of a meeting where we discussed product names, based on an 8-page list of different names from a web search. An important part of our discussion required knowing the most common instances of the names. Many of the names were repeated; others were minor variations on the names, such as differences in hyphenation or capitalization.

How could we simplify the list so we could discuss it? One person suggested running the data through a spreadsheet; another proposed using a statistical analysis package. I opened a Linux terminal, copied the list of names into a text file, and ran a few Linux commands to reduce the list of names to a manageable grouping. That meant we could discuss the list immediately.

Here’s how I used the Linux command line to quickly find repeated names in a long list.

Remove hyphens

Let’s say I have a list of names that are similar, except some have hyphens and others do not, plus one line that is different. I’ll save this in a plain text file called hyphens:
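As a stand-in for the real data, here is a small hypothetical sample that fits that description. The names are placeholders for illustration, not entries from the actual list:

```shell
# Hypothetical sample data: four variations of one name, plus one different line
printf '%s\n' 'Lorem-Ipsum' 'Lorem Ipsum' 'Lorem-Ipsum' 'Lorem Ipsum' 'Dolor Sit Amet' > hyphens

# Show the file as saved
cat hyphens
```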

For our comparison, it doesn’t matter whether the hyphen is there, so let’s remove it. You can use the Linux tr command to translate or convert one character to another. In this case, we want “Lorem-Ipsum” to be the same as “Lorem Ipsum.” The easiest way to do that is to convert the hyphen to a space:
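A minimal sketch of that conversion, assuming a hypothetical sample file (the names are placeholders). Because tr only reads standard input, the file is redirected into it:

```shell
# Recreate the hypothetical sample so this snippet runs on its own
printf '%s\n' 'Lorem-Ipsum' 'Lorem Ipsum' 'Lorem-Ipsum' 'Lorem Ipsum' 'Dolor Sit Amet' > hyphens

# Translate every hyphen to a space
tr '-' ' ' < hyphens
```

The output goes to the terminal; the hyphens file itself is left unchanged.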

Now every instance of “Lorem-Ipsum” will become “Lorem Ipsum.” To count the identical names, we can use the sort command to generate a sorted list, then the uniq command to print unique instances of each. The -c option to uniq prints a count of the repeated instances, which is the count we want:
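One way to chain these commands, again assuming hypothetical sample data. uniq only collapses adjacent duplicate lines, which is why sort comes first:

```shell
# Recreate the hypothetical sample so this snippet runs on its own
printf '%s\n' 'Lorem-Ipsum' 'Lorem Ipsum' 'Lorem-Ipsum' 'Lorem Ipsum' 'Dolor Sit Amet' > hyphens

# Hyphens to spaces, then sort so duplicates are adjacent, then count each unique line
tr '-' ' ' < hyphens | sort | uniq -c
```

With this sample, the count for “Lorem Ipsum” is 4 and the count for “Dolor Sit Amet” is 1.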

Remove trailing spaces

When I processed the 8-page list of names, I realized some of the entries had spaces at the ends of the lines. The name “Lorem Ipsum” with a space after it is different from the name “Lorem Ipsum” with no space. If your list might have trailing spaces, you can use a quick sed command to remove them.

Let’s start with a different list of names where some lines have trailing spaces, saved in a file called spaces. Here, we can use the -E or --show-ends option with GNU cat to display a $ marker at the end of each line, which shows where we have trailing spaces:
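A small sketch, assuming GNU cat (the -E option is not available in every cat implementation) and a hypothetical sample file in which the second and fourth lines end in spaces:

```shell
# Hypothetical sample data: lines 2 and 4 have trailing spaces
printf '%s\n' 'Lorem Ipsum' 'Lorem Ipsum ' 'Dolor Sit Amet' 'Lorem Ipsum  ' > spaces

# Mark the end of every line with $ so trailing spaces become visible
cat -E spaces
```

Any space that appears immediately before the $ marker is a trailing space.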

sed is the standard stream editor, and it lets you perform several kinds of automated edits on text files. sed finds text using regular expressions, which are patterns of characters that match text on lines in the file. For example, in a regular expression, ^ means the start of a line and $ means the end of a line. Also, * means zero or more of the preceding character, and + means one or more of the preceding character in extended regular expressions (the -E option to sed).

To strip all trailing spaces from a file, match a space followed by *$ (any number of spaces at the end of a line) and replace it with nothing, which deletes the extra spaces:
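As a sketch, assuming the same kind of hypothetical sample file, the substitution could look like this:

```shell
# Recreate the hypothetical sample so this snippet runs on its own
printf '%s\n' 'Lorem Ipsum' 'Lorem Ipsum ' 'Dolor Sit Amet' 'Lorem Ipsum  ' > spaces

# Replace "zero or more spaces at the end of the line" with nothing
sed 's/ *$//' spaces
```

sed writes the cleaned lines to standard output, so the spaces file stays as it was.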

Convert to lowercase

Some of the names in my file were the same, except for variations on capitalization. But we weren’t interested in differences of uppercase and lowercase letters; we just wanted the names that were otherwise the same.

You can easily convert text to uppercase or lowercase with the tr command. tr can do more than translate single characters; it can also convert between character groups.

For this example, let’s start with three variations on the same name, stored in a file called capitals. One uses uppercase for both “Lorem” and “Ipsum,” one uses capitalization only on “Lorem,” and the last doesn’t use any capitalization:
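A hypothetical version of that file, with placeholder names chosen to match the description:

```shell
# Hypothetical sample data: the same name with three capitalizations
printf '%s\n' 'Lorem Ipsum' 'Lorem ipsum' 'lorem ipsum' > capitals

cat capitals
```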

To convert all uppercase letters to lowercase letters, specify [:upper:] and [:lower:] as character groups in the tr command. This tr command converts the text to all lowercase:
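A minimal sketch of the conversion, assuming a hypothetical capitals file with placeholder names:

```shell
# Recreate the hypothetical sample so this snippet runs on its own
printf '%s\n' 'Lorem Ipsum' 'Lorem ipsum' 'lorem ipsum' > capitals

# Map every uppercase letter to its lowercase equivalent
tr '[:upper:]' '[:lower:]' < capitals
```

The quotes keep the shell from treating the square brackets as a filename pattern. With this sample, all three lines come out identical.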

Replace words

In the 8-page list of names, some phrases used “and” while others used an ampersand. For example, “Lorem and Ipsum” and “Lorem & Ipsum.” But for our discussion, the difference was unimportant. Replacing one word with another is a great use case for sed.

Let’s start with a sample list of names that are the same except that two spell out the word “and” while one uses an ampersand, saved in a plain text file called and:
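A hypothetical version of that file, with placeholder names chosen to match the description:

```shell
# Hypothetical sample data: two lines spell out "and", one uses an ampersand
printf '%s\n' 'Lorem and Ipsum' 'Lorem & Ipsum' 'Lorem and Ipsum' > and

cat and
```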

The ampersand needs some care in sed. On the left side of a substitution, where the regular expression goes, it is an ordinary character. On the right side, in the replacement, it is special: it means “the text that was matched.” If you want to convert “and” to an ampersand, you need to be careful about this special case:
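A sketch of the pitfall, assuming a hypothetical sample file. The unescaped ampersand in the replacement stands for the matched text, so nothing changes:

```shell
# Recreate the hypothetical sample so this snippet runs on its own
printf '%s\n' 'Lorem and Ipsum' 'Lorem & Ipsum' 'Lorem and Ipsum' > and

# Unescaped & in the replacement means "whatever matched", so "and" becomes "and"
sed 's/and/&/' and
```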

That sed command means find and replace any instances of “and” with the text that matched it, which means replace “and” with “and.” If you want to replace the text with a literal ampersand, you need to “escape” it with a backslash:
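A sketch of the escaped form, again assuming a hypothetical sample file:

```shell
# Recreate the hypothetical sample so this snippet runs on its own
printf '%s\n' 'Lorem and Ipsum' 'Lorem & Ipsum' 'Lorem and Ipsum' > and

# \& in the replacement is a literal ampersand
sed 's/and/\&/' and
```

With this sample, every line comes out as “Lorem & Ipsum.”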

This works well, but it can cause problems for words with the letters “and” next to each other, like the word “ampersand.” The same sed command blindly replaces all instances of “and” with an ampersand:
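A one-line illustration of the problem, feeding the word straight into sed:

```shell
# The "and" inside "ampersand" matches too
printf '%s\n' 'ampersand' | sed 's/and/\&/'
# prints: ampers&
```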

Instead, it’s more reliable to go the other way: convert an ampersand to the word “and.” This is also easier for humans to read:
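A sketch of the reverse substitution, assuming a hypothetical sample file. The g flag is my addition; it replaces every ampersand on a line rather than only the first:

```shell
# Recreate the hypothetical sample so this snippet runs on its own
printf '%s\n' 'Lorem and Ipsum' 'Lorem & Ipsum' 'Lorem and Ipsum' > and

# On the left side of the substitution, & is an ordinary character and needs no escape
sed 's/&/and/g' and
```

With this sample, every line comes out as “Lorem and Ipsum.”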

Putting it all together

To process a long list of names, ignoring capitalization and hyphens, and assuming “and” and ampersand mean the same, we can combine these commands to generate a simplified list. I’ll demonstrate with a randomized list called names of over two hundred entries that are just variations of two phrases:
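To try this yourself, here is one way to build a comparable file. The phrases, the eight variations, and the loop are all assumptions made for illustration, not the actual list:

```shell
# Hypothetical stand-in for the "names" file: 26 rounds of 8 variations = 208 lines.
# It is not shuffled; sorting makes the order irrelevant anyway.
i=0
while [ "$i" -lt 26 ]; do
  printf '%s\n' 'Lorem Ipsum' 'lorem ipsum' 'Lorem-Ipsum' 'Lorem Ipsum ' \
                'Dolor and Amet' 'Dolor & Amet' 'dolor and amet' 'Dolor-and-Amet '
  i=$((i + 1))
done > names

wc -l < names     # counts the lines: 208
head -n 8 names   # shows one round of the eight variations
```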

In this sample, the lines differ in capitalization, hyphens, trailing spaces, and ampersands. Using the sort and uniq commands alone isn’t enough to reduce the list, because the variations in how each line is written mean uniq treats every spelling as a separate entry:
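A sketch of that first attempt, run against a hypothetical generated file (the phrases and variations are assumptions for illustration):

```shell
# Hypothetical stand-in for the "names" file: 26 rounds of 8 variations = 208 lines
i=0
while [ "$i" -lt 26 ]; do
  printf '%s\n' 'Lorem Ipsum' 'lorem ipsum' 'Lorem-Ipsum' 'Lorem Ipsum ' \
                'Dolor and Amet' 'Dolor & Amet' 'dolor and amet' 'Dolor-and-Amet '
  i=$((i + 1))
done > names

# Without any cleanup, every spelling variation counts as its own entry
sort names | uniq -c
```

With this sample, the result is eight separate groups of 26 lines each, even though only two phrases are really present.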

By using both tr and sed, I can quickly reduce the list to a more meaningful set:
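One possible pipeline that combines the earlier steps, run against a hypothetical generated file. The order of the steps and the g flag are my choices, not necessarily the exact command used in the meeting:

```shell
# Hypothetical stand-in for the "names" file: 26 rounds of 8 variations = 208 lines
i=0
while [ "$i" -lt 26 ]; do
  printf '%s\n' 'Lorem Ipsum' 'lorem ipsum' 'Lorem-Ipsum' 'Lorem Ipsum ' \
                'Dolor and Amet' 'Dolor & Amet' 'dolor and amet' 'Dolor-and-Amet '
  i=$((i + 1))
done > names

# 1. hyphens to spaces
# 2. strip trailing spaces, then turn each ampersand into "and"
# 3. fold everything to lowercase
# 4. sort and count the unique lines
tr '-' ' ' < names \
  | sed -e 's/ *$//' -e 's/&/and/g' \
  | tr '[:upper:]' '[:lower:]' \
  | sort | uniq -c
```

With this sample, the 208 lines reduce to two entries, “dolor and amet” and “lorem ipsum,” each with a count of 104. Converting hyphens before stripping trailing spaces means a line that ends in a hyphen is cleaned up too.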

Processing plain text files like this may not feel like an everyday task today, but it is exactly what Unix was written to do. Using a few tools on the Linux command line, every Linux systems administrator can quickly reduce a list of similar entries to a manageable grouping. Let the command line do the hard work for you.

Author

  • Jim Hall

    Jim Hall is an open source software advocate, developer, and technical writer. At work, Jim is CEO of Hallmentum, providing workshops and training to organizations. Jim is also the editor-in-chief of Technically We Write, an article-based community website about technical writing and technical communication.

