Do you want to get better at Linux? I have a few command-line tricks you can learn faster than you can drink your early morning coffee. Here are my must-learn commands, not because you need them to use Linux, but because they’ll make you better at it:

  • find
  • xargs and nproc
  • taskset
  • numactl
  • inotify-tools

In this article, I’m going to present you with a series of challenges, along with the tools that demonstrate how to solve each one.

1. Directories with lots of files

You may have encountered this problem once or twice: you try to run ls on a directory with a very large number of files, and the command throws an “Argument list too long” error:
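
(An illustrative transcript; /tmp/manyfiles stands in for a directory holding a few hundred thousand files.)

    $ ls /tmp/manyfiles/*
    -bash: /usr/bin/ls: Argument list too long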

The reason is that a POSIX system has a limit on the maximum number of bytes you can pass as an argument:
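
You can query the limit with getconf; on a typical Linux system it comes back as roughly two million bytes:

    $ getconf ARG_MAX
    2097152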

Two million bytes may not be enough, depending on who you ask, but it’s a protection against attacks, or against innocent mistakes with bad consequences. In any case, you can bypass this limitation with a few different tricks.

Use a shell built-in

Bash doesn’t have the ARG_MAX limitation by default:
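
(A sketch; the path is illustrative.)

    # printf is a Bash built-in: the glob expands inside the shell itself,
    # so execve and its ARG_MAX check are never involved
    printf '%s\n' /tmp/manyfiles/*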

This is probably the simplest solution, but let’s look at another way.

Use find when you want formatting options

You can use this well-known find flag:
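
(A sketch, assuming the flag in question is -print, which writes one path per line; the path is illustrative.)

    find /tmp/manyfiles/ -type f -print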

Or with formatting, to mimic ls:
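
(A sketch using GNU find’s -printf; this format string, printing permissions, owner, group, size, modification date, and name, is just one possibility.)

    find /tmp/manyfiles/ -type f -printf "%M %u %g %10s %TY-%Tm-%Td %f\n"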

This is fast and also the most complete solution.

Use xargs

There is, of course, yet another way. The following works:
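
(A sketch; the path is illustrative.)

    find /tmp/manyfiles -type f -print0 | xargs -0 ls -l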

This works, but it’s admittedly inefficient: you’re forking three processes just to display the contents of a directory, and on top of that, xargs throttles how many files get passed to each invocation of ls.

Let’s move on to a different problem.

2. Run more programs at once

First you walk, then you run. That’s a serial process. Suppose you want to compress all files in a given directory. You might first do this:
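
(A sketch with a plain loop; the directory is illustrative.)

    # compress one file per gzip run, strictly in sequence
    for file in /tmp/manyfiles/*; do gzip "$file"; done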

That’s a serial process, so it takes a long time. Running the gzip command like this processes one file at a time. So you might instead try something that processes files in parallel:
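
(A sketch of one tempting attempt: hand the whole glob to gzip at once and push the job into the background. The path is illustrative.)

    gzip /tmp/manyfiles/* &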

But ARG_MAX strikes again. So what if you do this:
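
(A sketch; it also breaks on file names containing spaces, but that’s beside the point here.)

    # fork a background gzip for every file that find returns
    for file in $(find /tmp/manyfiles -type f); do gzip "$file" & done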

That either makes your server run out of memory, or it all but crushes your server under a very heavy CPU load, because you’re forking a gzip instance for every file found.

Is there a better way?

Parallelism and throttling (the art of self control)

What you need is a way to throttle your compression requests, so you don’t launch more processes than the number of CPUs you have.

Let’s try compressing files again with find and xargs:
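
(A sketch; the directory is illustrative.)

    find /tmp/manyfiles -type f -print0 | xargs -0 -I % -P $(($(nproc)-1)) gzip %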

That looks like a fancy one-liner. Let me explain how it works:

  1. Use find to get all the files in a given directory, using the null character as the separator so that files with unusual names (spaces, newlines, and so on) are handled safely.
  2. nproc tells you how many CPUs you have; subtracting 1 from that with Bash arithmetic and nested sub-shells looks like this: $(($(nproc)-1))
  3. Finally, xargs runs no more than -P processes at a time. In my case, that’s 8 CPUs minus 1, for a total of 7 parallel jobs. The percent (%) character is dynamically replaced with the name of the file to compress.

There are other ways to get the number of CPUs on a machine. You can parse /proc/cpuinfo, for example.
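
Something like this:

    grep -c ^processor /proc/cpuinfo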

There are also more efficient compression algorithms out there, but gzip is available on pretty much any Linux or Unix system.

Now it’s time to look at our next problem.

3. Control CPU affinity with taskset

Even when you limit the number of parallel jobs, CPU-intensive work can still slow down other processes on your machine as they all compete for resources. There are a few things you can do to keep the performance of your server under control, like using taskset.

The taskset command can set or retrieve the CPU affinity of a running process (by PID), or it can launch a new command with a given CPU affinity.

CPU affinity is a scheduler property. It binds a process to a given set of CPUs on a system.

The kernel is normally pretty good about keeping running processes glued to a specific CPU to avoid context switching, but if you want to enforce which CPUs a process runs on, you can use taskset. In general, you want to leave one of your CPUs free for operating system tasks.
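
A sketch, assuming an 8-CPU machine where you keep CPU 0 free for the operating system (the PID and file name are made up):

    # launch a new process pinned to CPUs 1-7
    taskset -c 1-7 gzip /tmp/manyfiles/bigfile
    # move an already running process (PID 4321) onto CPUs 1-7
    taskset -cp 1-7 4321
    # read back the current affinity of that PID
    taskset -cp 4321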

4. Overcome physical limitations with numactl

From What is NUMA and why you should care:

There are physical limitations to hardware that are encountered when many CPUs and lots of memory are required. The important limitation is that there is limited communication bandwidth between the CPUs and the memory. One architecture modification that was introduced to address this is Non-Uniform Memory Access (NUMA).

Most desktop machines only have a single NUMA node, like mine:
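
(An illustrative transcript of numactl --hardware; the CPU count and memory sizes will differ on your machine.)

    $ numactl --hardware
    available: 1 nodes (0)
    node 0 cpus: 0 1 2 3 4 5 6 7
    node 0 size: 32064 MB
    node 0 free: 10282 MB
    node distances:
    node   0
      0:  10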

If you have more than one NUMA node, you may want to “pin” (set the affinity) your program so that it uses a CPU and memory in the same node. For example, on a machine with 16 cores (0-7 on node 0, and 8-15 on node 1), you could force your compression program to run on all CPUs on node 1, and to use the memory of node 1:
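
(A sketch; the program and path are illustrative.)

    numactl --cpunodebind=1 --membind=1 gzip /tmp/manyfiles/bigfile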

Enough CPU talk. Time to learn how to watch things.

5. Keep an eye on things

The watch command lets you run a command periodically, and it can even show you the differences between calls. Here’s watch displaying the output of the ls command every 10 seconds:
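
(The path is illustrative; -n sets the interval in seconds and -d highlights what changed between updates.)

    watch -n 10 -d ls -l /tmp/manyfiles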

That’s fine for detecting changes within a directory, but it’s not easy to automate and it’s definitely not efficient. Wouldn’t it be nice if the kernel could tell you about changes to directories?

A better way to watch with inotify-tools

You may need to install this separately, but it should be easy to do. On Ubuntu:
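
    sudo apt install inotify-tools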

On Fedora or similar:
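
    sudo dnf install inotify-tools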

To monitor for events on a given directory, run inotifywait:
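
(The directory is illustrative; with no options, inotifywait waits for a single event of any type and then exits.)

    inotifywait /tmp/manyfiles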

Open another terminal and touch some files to simulate an event:
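
(The file name is illustrative.)

    touch /tmp/manyfiles/file1.txt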

The original terminal receives the first event and then exits, printing something along these lines:
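
    Setting up watches.
    Watches established.
    /tmp/manyfiles/ CREATE file1.txt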

That’s not very useful. The command detects only the first event. To make it listen forever, add the --monitor option:
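
(Directory illustrative, as before.)

    inotifywait --monitor /tmp/manyfiles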

If you touch a file again in a separate terminal, you see all the events as they happen, along these lines:
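
    /tmp/manyfiles/ OPEN file1.txt
    /tmp/manyfiles/ ATTRIB file1.txt
    /tmp/manyfiles/ CLOSE_WRITE,CLOSE file1.txt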

This is much less taxing on the operating system than repeatedly polling the directory for changes and filtering the differences yourself.

Commands for quality of life

There is so much more to explore. The tips above have introduced you to some important concepts, so why not learn more about them?

  • The Ubuntu forum has a great conversation about xargs, find, ulimit, and much more. Knowledge is power.
  • Red Hat has a nice page about NUMA, taskset, and interrupt handling. If you’re serious about fine-tuning the performance of your processes, then you have to read it.
  • If you liked inotify and want to use it from a Python script, take a look at pyinotify.
  • The find command can be intimidating, but this tutorial makes it easy to understand.
  • The source code for this tutorial is available in my Git repository.

Author


Jose Vicente Nunez

System Administrator / DevOps.