Do you want to get better at Linux? I have a few command-line tricks you can learn faster than drinking your early morning coffee. Here are my must-learn commands, not because you need to know them to use Linux, but because you’ll want to know them to make you better at Linux:
- find
- xargs and nproc
- taskset
- numactl
- inotify-tools
In this article, I’m going to present you with a series of challenges, and demonstrate the tools that solve each one.
1. Directories with lots of files
You may have encountered this problem once or twice. You tried to run `ls` on a directory with a very large number of files, and the command throws an "argument list too long" error:

```
$ ls *
-bash: /usr/bin/ls: Argument list too long
```
The reason is that a POSIX system has a limit on the maximum number of bytes you can pass as arguments:

```
$ getconf ARG_MAX
2097152
```
Two million bytes may not be enough, depending on who you ask, but it’s a protection against attacks or innocent mistakes with bad consequences. In any case, you can bypass this limitation with a few different tricks.
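If you want to reproduce the error yourself, you can fill a scratch directory with empty files. This is a sketch; the path and the file count are arbitrary, and the exact threshold depends on your system’s `ARG_MAX`:

```shell
# Create a scratch directory with enough files to overflow ARG_MAX
mkdir -p /tmp/argmax_demo && cd /tmp/argmax_demo

# seq feeds the names to xargs, which batches the touch calls,
# so creating the files doesn't hit the very limit we're demonstrating
seq -w 1 200000 | xargs touch

ls *   # fails: the expanded glob exceeds ARG_MAX
```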
Use a shell built-in
Bash built-ins like `echo` don’t have the `ARG_MAX` limitation, because no new program is executed:

```
$ echo *
...
test_file055554 test_file111110 test_file166666 test_file222222 test_file277778 test_file333334 test_file388890 test_file444446
test_file055555 test_file111111 test_file166667 test_file222223 test_file277779 test_file333335 test_file388891 test_file444447
test_file055556 test_file111112 test_file166668 test_file222224 test_file277780 test_file333336 test_file388892 test_file444448
```
This is probably the simplest solution, but let’s look at another way.
Use find when you want formatting options
You can use the well-known `-ls` flag of `find`:

```
find /data/test_xargs -type f -ls
```
Or with formatting, to mimic `ls`:

```
find /data/test_xargs -type f -printf '%f\n'
```
This is fast and also the most complete solution.
Use xargs
There is, of course, yet another way. The following works:
```
find /data/test_xargs -type f -print0 | xargs -0 ls
```
This works but is admittedly inefficient. You’re forking three processes just to display the contents of the directory, and on top of that, `xargs` is throttling how many files get passed to each `ls` invocation.
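You can see that batching with a toy pipeline; here `echo` stands in for `ls`, and `-n 3` caps each invocation at three arguments so the splitting is visible:

```shell
# Seven null-separated inputs, three per invocation: three echo runs
printf '%s\0' a b c d e f g | xargs -0 -n 3 echo
# a b c
# d e f
# g
```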
Let’s move on to a different problem.
2. Run more programs at once
First you walk, then you run. That’s a serial process. Suppose you want to compress all files in a given directory. You might first do this:
```
gzip *
```
That takes a long time, because `gzip` processes one file at a time. So you might instead try something that processes files in parallel:
```
$ for file in $(ls data/test_xargs/*); do gzip $file & done
-bash: /usr/bin/ls: Argument list too long
```
But `ARG_MAX` strikes again. So what if you do this:
```
for file in $(find $PWD); do gzip $file & done
wait
echo "All files compressed"
```
That either makes your server run out of memory, or it all but crushes your server under a very heavy CPU load, because you’re forking a `gzip` instance for every file found.
Is there a better way?
Parallelism and throttling (the art of self control)
What you need is a way to throttle your compression requests, so you don’t launch more processes than the number of CPUs you have.
Let’s try compressing files again with `find` and `xargs`:

```
find /data/test_xargs -type f -print0 | xargs -0 -P $(($(nproc)-1)) -I % gzip %
```
That looks like a fancy one-liner. Let me explain how it works:
- Use `find` to get all files in a given directory, with the null character as separator so that files with unusual names are handled correctly.
- `nproc` tells you how many CPUs you have; subtract 1 with Bash arithmetic and a sub-shell, like this: `$(($(nproc)-1))`.
- Finally, `xargs` runs no more than `-P` processes at a time. In my case, that’s 8 CPUs minus 1, for a total of 7 parallel jobs. The percent (`%`) character is dynamically replaced with the name of the file to compress.
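You can convince yourself that `-P` really runs jobs in parallel with a toy version of the pipeline, substituting `sleep` for the actual compression:

```shell
# Four 1-second "jobs" limited to 2 at a time: about 2 seconds total,
# instead of the 4 seconds a serial run would take
time (printf '%s\0' 1 2 3 4 | xargs -0 -P 2 -I % sleep 1)
```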
There are other ways to get the number of CPUs on a machine. You can parse `/proc/cpuinfo`, for example.
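For example, here are two alternatives to `nproc`, shown as a quick sketch:

```shell
# Count the processor entries in /proc/cpuinfo
grep -c ^processor /proc/cpuinfo

# Or ask for the number of online processors via getconf
getconf _NPROCESSORS_ONLN
```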
There are also more efficient compression algorithms out there, but `gzip` is available on pretty much any Linux or Unix system.
It is time to see our next problem.
3. Maximize execution time with taskset
Despite limiting the number of CPUs, intensive jobs can slow down other processes on your machine as they all compete for resources. There are a few things you can do to keep the performance of your server under control, like using taskset.
The taskset command can set or retrieve the CPU affinity of a running process (by PID), or it can launch a new command with a given CPU affinity.
CPU affinity is a scheduler property. It binds a process to a given set of CPUs on a system.
The kernel is normally pretty good about keeping running processes glued to a specific CPU to avoid context switching, but if you want to enforce which CPUs a process runs on, you can use `taskset`. In general, you want to leave one of your CPUs free for operating system tasks.
```
find /data/test_xargs -type f -print0 | \
    taskset -c 1,2,3,4,5,6,7 xargs -0 -P $(($(nproc)-1)) -I % gzip %
```

The affinity is applied to `xargs`, so the `gzip` processes it forks inherit it, and CPU 0 stays free for the rest of the system.
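Remember that `taskset` can also inspect or change the affinity of a process that’s already running. A quick sketch using a throwaway background process:

```shell
# Start a long-running process to experiment on
sleep 300 &
pid=$!

# Show its current affinity list
taskset -cp "$pid"

# Restrict it to CPU 0 only, then confirm
taskset -cp 0 "$pid"
taskset -cp "$pid"

# Clean up
kill "$pid"
```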
4. Overcome physical limitations with numactl
From What is NUMA and why you should care:
There are physical limitations to hardware that are encountered when many CPUs and lots of memory are required. The important limitation is that there is limited communication bandwidth between the CPUs and the memory. One architecture modification that was introduced to address this is Non-Uniform Memory Access (NUMA).
Most desktop machines only have a single NUMA node, like mine:
```
$ numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 15679 MB
node 0 free: 5083 MB
node distances:
node   0
  0:  10

# Or with lscpu
$ lscpu | rg NUMA
NUMA node(s):        1
NUMA node0 CPU(s):   0-7
```
If you have more than one NUMA node, you may want to “pin” (set the affinity) your program so that it uses a CPU and memory in the same node. For example, on a machine with 16 cores (0-7 on node 0, and 8-15 on node 1), you could force your compression program to run on all CPUs on node 1, and to use the memory of node 1:
```
find /data/test_xargs -type f -print0 | \
    numactl --physcpubind=8-15 --membind=1 \
    xargs -0 -P $(($(nproc)-1)) -I % gzip %
```

Here the binding is applied to `xargs`, so the forked `gzip` processes inherit it.
Enough CPU talk. Time to learn how to watch things.
5. Keep an eye on things
The `watch` command allows you to run a command periodically, and it can even show you the differences between calls. Here’s `watch` displaying the output of the `ls` command every 10 seconds:

```
$ watch -n 10 ls
Every 10.0s: ls                    orangepi5: Wed May 24 22:46:33 2023

test_file000001.gz  test_file000002.gz  test_file000003.gz
test_file000004.gz  test_file000005.gz  test_file000006.gz
test_file000007.gz  test_file000008.gz  test_file000009.gz
test_file000010.gz
...
```
That’s fine to detect changes within a directory, but it’s not easy to automate and it’s definitely not efficient. Wouldn’t it be nice if the kernel was able to tell you about changes to directories?
A better way to watch with inotify-tools
You may need to install this separately, but it should be easy to do. On Ubuntu:
```
sudo apt install inotify-tools
```
On Fedora or similar:
```
sudo dnf install inotify-tools
```
To monitor for events on a given directory, run `inotifywait`:

```
$ inotifywait --recursive /data/test_xargs/
Setting up watches.  Beware: since -r was given, this may take a while!
Watches established.
```
Open another terminal and touch some files to simulate an event:
```
$ pwd
/data/test_xargs
$ touch test_file285707.gz test_file357136.gz test_file428565.gz
```
The original terminal gets the first event and then exits:
```
Watches established.
/data/test_xargs/ OPEN test_file285707.gz
```
That’s not very useful. The command detects only the first event. To make it listen forever, add the `--monitor` option:

```
inotifywait --recursive --monitor /data/test_xargs/
```
If you touch a file again in a separate terminal, you see all events:
```
Setting up watches.  Beware: since -r was given, this may take a while!
Watches established.
/data/test_xargs/ OPEN test_file285707.gz
/data/test_xargs/ ATTRIB test_file285707.gz
/data/test_xargs/ CLOSE_WRITE,CLOSE test_file285707.gz
/data/test_xargs/ OPEN test_file357136.gz
/data/test_xargs/ ATTRIB test_file357136.gz
/data/test_xargs/ CLOSE_WRITE,CLOSE test_file357136.gz
/data/test_xargs/ OPEN test_file428565.gz
/data/test_xargs/ ATTRIB test_file428565.gz
/data/test_xargs/ CLOSE_WRITE,CLOSE test_file428565.gz
```
This is less taxing on the operating system than repeatedly polling a directory for changes and filtering the differences yourself.
Commands for quality of life
There is so much more to explore. The tips above have introduced you to some important concepts, so why not learn more about them?
- The Ubuntu forum has a great conversation about `xargs`, `find`, `ulimit`, and much more. Knowledge is power.
- Red Hat has a nice page about NUMA, `taskset`, and interrupt handling. If you’re serious about fine-tuning the performance of your processes, then you have to read it.
- If you liked `inotify` and want to use it from a Python script, take a look at pyinotify.
- The `find` command can be intimidating, but this tutorial makes it easy to understand.
- The source code for this tutorial is available in my Git repository.