[RndTbl] fast counting with find

Adam Thompson athompso at athompso.net
Sat Nov 5 17:11:07 CDT 2011


> -----Original Message-----
> From: roundtable-bounces at muug.mb.ca
> [mailto:roundtable-bounces at muug.mb.ca] On Behalf Of Trevor Cordes
> Sent: Saturday, November 05, 2011 10:00
> To: MUUG RndTbl
> Subject: [RndTbl] fast counting with find
>
> I found myself needing a type of -limit -quit option in find.  I couldn't
> see a built-in way to do it, even with GNU find.  GNU find does let you
> count to 1 and quit, simply by using -quit, but not count to X then quit.
>
> Why do I want to quit at all?  Why not just do find|wc -l?  The dirs I'm
> scanning have about 200k files and are sometimes over NFS.  Either way, a
> full find|wc takes a long time and a lot of resources, especially if the
> find has to do a stat (for mtime, etc).  With find|wc my 1 find command
> took 10+ mins.  With my new method, it's a few seconds.
>
> Here's the best solution I could think up.  It's sub-optimal I'm sure
> (requires execs and a temp file), but I couldn't see an easier way to do
> it within the confines of find (without writing my own find, which I
> didn't want to do in this case).


Doesn't "find /path -args | head -1000 | wc -l" give you nearly the same
result?  It may generate more disk I/O in the background (depending on
pipe buffering and signalling semantics) but should be just as fast when
used interactively.

(For the pedantic among us, that should read "find /path -args -print | 
head -n 1000 | wc -l" since direct specification of the line count to 
head(1) in option-style syntax is deprecated in POSIX.)
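
As a rough sketch, the pipeline could be wrapped in a small shell function.
The name count_up_to and its argument order are made up for illustration;
nothing here comes from Trevor's original script.  It counts lines, so a
file name containing a newline would be counted more than once.

```shell
# count_up_to LIMIT PATH [find-expression...]
# Hypothetical helper (the name is illustrative): prints the number of
# matches find produces, but never more than LIMIT.
count_up_to() {
    limit=$1
    shift
    # head exits after LIMIT lines, so find is not left to walk the
    # whole tree; wc then counts what head passed through.
    find "$@" -print | head -n "$limit" | wc -l
}

# Example: count at most 1000 regular files under /path
# count_up_to 1000 /path -type f
```

A result equal to LIMIT means "at least LIMIT"; anything smaller is the
exact count.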

head(1) exits as soon as it has counted 1000 (or whatever maximum number
of) lines.  That closes the read end of the pipe, so find's next write(2)
to it raises SIGPIPE and find exits more or less immediately.  The extra
I/O comes from the window in between: head(1) blocks in read(2) until the
pipe(7) holds enough data, reads it, counts to X and terminates, but in
the interim find(1) has kept working to fill its next BUFSIZ's worth of
output, and it only learns that the reader is gone when it tries to write
that buffer.  Meanwhile, wc never sees SIGPIPE, because it only reads from
its pipe and SIGPIPE is delivered to writers; it processes its input
normally and exits when head(1) exits and the pipe returns EOF.
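
The termination behaviour is easy to see with a writer that never stops on
its own.  This is only an illustration, using yes(1) as a stand-in for a
long-running find; the variable name is arbitrary.

```shell
# yes(1) would print "y" forever; the pipeline only finishes because
# head exits after 1000 lines and the next write by yes fails
# (normally yes is killed by SIGPIPE at that point).
lines=$(yes | head -n 1000 | wc -l)
echo "lines counted: $lines"
```

If the writer were not stopped when head exits, this command would never
return.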

Based on a naïve test I just ran, the extra disk I/O involved is too small
for a human to notice, but the numbers I generated indicated that the pipe
writer produces anywhere between another 100 bytes and 4 KB of output
after head(1) has exited.  Four kilobytes of find(1) output could easily
indicate a couple dozen megabytes of disk I/O, or even more in
pathological cases.  However, this isn't an issue (as long as you don't
have disk IOPS contention from other processes) because the wall time
remains the same.

Or am I misinterpreting what you want to accomplish altogether?

-Adam




