
Efficiency of find -exec vs. find | xargs

This is a quick tip for anyone writing a cron job to purge large numbers of old files.

Without xargs, this is a pretty common way to do such a purge, in this case of all files older than 31 days:

find /path/to/junk/files -type f -mtime +31 -exec rm -f {} \;

But that executes rm once for every single file to be removed, which adds a ton of overhead just to fork and exec rm so many times. Even on modern operating systems with efficient fork implementations, it can easily increase the I/O, load, and runtime by 10 times or more compared with running a single rm command with many file arguments.
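
To see the one-process-per-file behavior for yourself, substitute echo for rm; each line printed corresponds to one command that would have been forked (same placeholder path as above):

find /path/to/junk/files -type f -mtime +31 -exec echo rm -f {} \;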

Instead do this:

find /path/to/junk/files -type f -mtime +31 -print0 | xargs -0 -r rm -f

That will run rm once for each very long list of files to be removed, so the overhead of fork & exec is incurred very rarely, and the job can spend most of its effort actually unlinking files. (The xargs -r option says not to run the command at all if there is no input to xargs.)
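
If you want to measure the difference on your own system, a rough comparison along these lines works (the test directory and file count here are made up; adjust to taste):

mkdir /tmp/purge-test && cd /tmp/purge-test
seq -f 'junk%g' 1 100000 | xargs touch

# One rm per file
time find . -type f -exec rm -f {} \;

# Repopulate, then batch the removals through xargs
seq -f 'junk%g' 1 100000 | xargs touch
time find . -type f -print0 | xargs -0 -r rm -f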

How long can the argument list to xargs be? It depends on the system, but xargs --show-limits will tell us. Here's output from a RHEL 5 x86_64 system (using findutils 4.2.27):

% xargs --show-limits
Your environment variables take up 2293 bytes
POSIX lower and upper limits on argument length: 2048, 129024
Maximum length of command we could actually use: 126731
Size of command buffer we are actually using: 126731

The numbers are similar on Debian Etch and Lenny.

And here's output from an Ubuntu 10.04 x86_64 system (using findutils 4.4.2):

% xargs --show-limits
Your environment variables take up 1370 bytes
POSIX upper limit on argument length (this system): 2093734
POSIX smallest allowable upper limit on argument length (all systems): 4096
Maximum length of command we could actually use: 2092364
Size of command buffer we are actually using: 131072

Roughly 2 megabytes of arguments is a lot. But even the POSIX minimum of 4 kB is a lot better than processing one file at a time.
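
For comparison, you can ask the system for its raw exec argument limit directly; xargs works within this number, minus the environment and a little headroom:

% getconf ARG_MAX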

It usually doesn't make much of a difference, but we can tune even further: to fit the maximum number of files into each invocation, first change to the base directory so that the relative pathnames are shorter:

cd /path/to/junk/files && find . -type f -mtime +31 -print0 | xargs -0 -r rm -f

That way each file argument is shorter, e.g. ./junkfile compared to /path/to/junk/files/junkfile.
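
A rough way to see the effect is to count how many batches xargs builds in each case, substituting echo for rm so nothing is actually removed (same placeholder paths; each echo invocation prints one line, so wc -l counts the batches):

find /path/to/junk/files -type f -mtime +31 -print0 | xargs -0 -r echo | wc -l

cd /path/to/junk/files && find . -type f -mtime +31 -print0 | xargs -0 -r echo | wc -l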

The above assumes you're using GNU findutils, which includes find -print0 and xargs -0 for processing ASCII NUL-delimited filenames for safety when filenames include embedded spaces, newlines, etc.
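
For example, a filename with embedded spaces survives the NUL-delimited pipeline but would be mangled by a plain find | xargs (a contrived illustration; the directory is hypothetical):

mkdir -p /tmp/nul-demo && touch '/tmp/nul-demo/a file with spaces'

# Unsafe: xargs splits on whitespace and hands rm several bogus names
find /tmp/nul-demo -type f | xargs rm -f

# Safe: names stay NUL-delimited end to end
find /tmp/nul-demo -type f -print0 | xargs -0 -r rm -f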

5 comments:

tante said...

One thing many people forget is that especially for deleting you shouldn't use "-exec ..." but just "-delete".
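
Applied to the purge above, that looks like this (GNU find; shown here for illustration):

find /path/to/junk/files -type f -mtime +31 -delete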

Joshua Tolley said...

Note that with xargs' -n option, you *can* force xargs to run once per file, if whatever you're passing the file names to can only handle one at a time. Of course, you could also use -exec for that. Another helpful argument is GNU xargs' --no-run-if-empty option (or the short version: -r). Surprisingly, by default xargs runs the command once even if there is no input.
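
For illustration, forcing one file name per rm invocation looks like this (roughly equivalent to the -exec form above):

find /path/to/junk/files -type f -mtime +31 -print0 | xargs -0 -r -n 1 rm -f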

Jon Jensen said...

tante, good point about -delete, but note also that there are some pretty serious limitations (can't use -prune) and risks (can accidentally delete everything before other matches!) to using it. Quoting from the man page:

"Warnings: Don't forget that the find command line is evaluated as an expression, so putting -delete first will make find try to delete everything below the starting points you specified. When testing a find command line that you later intend to use with -delete, you should explicitly specify -depth in order to avoid later surprises. Because -delete implies -depth, you cannot usefully use -prune and -delete together."

Because of that, I find xargs a better option in most cases, but certainly I like -delete better than -exec rm.

Brian said...

You don't have to use xargs to get this behaviour. Just use -exec command {} +. Note the plus sign instead of the semicolon.

From the man page:

This variant of the -exec action runs the specified command on the selected files, but the command line is built by appending each selected file name at the end; the total number of invocations of the command will be much less than the number of matched files. The command line is built in much the same way that xargs builds its command lines. Only one instance of ‘{}’ is allowed within the command. The command is executed in the starting directory.
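
Applied to the original purge, that form would be (shown for illustration):

find /path/to/junk/files -type f -mtime +31 -exec rm -f {} +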

Jon Jensen said...

That's neat, Brian. I didn't know about that.

It looks like it's an extension only available in more recent GNU versions of find, but it's nice to have.