[RndTbl] fast counting with find
athompso at athompso.net
Sat Nov 5 17:11:07 CDT 2011
> -----Original Message-----
> From: roundtable-bounces at muug.mb.ca [mailto:roundtable-
> bounces at muug.mb.ca] On Behalf Of Trevor Cordes
> Sent: Saturday, November 05, 2011 10:00
> To: MUUG RndTbl
> Subject: [RndTbl] fast counting with find
> I found myself needing a type of -limit -quit option in find. I don't
> see a built-in way to do it, even with GNU find. GNU find does let you
> count to 1 and quit, simply by using -quit, but not count to X then
> quit.
> Why do I want to quit at all? Why not just do find|wc -l? The dirs I'm
> scanning have about 200k files and are sometimes over NFS. Either way,
> a full find|wc takes a long time and a lot of resources, especially if
> the find has to do a stat (for mtime, etc). With find|wc my 1 find
> took 10+ mins. With my new method, it's a few seconds.
> Here's the best solution I could think up. It's sub-optimal, I'm sure
> (requires execs and a temp file), but I couldn't see an easier way to
> do it within the confines of find (without writing my own find, which
> I didn't want to do in this case).
Doesn't "find /path -args | head -1000 | wc -l" give you nearly the same
result? It may generate more disk I/O in the background (depending on
pipe buffering and signalling semantics) but should be just as fast when used
(For the pedantic among us, that should read "find /path -args -print |
head -n 1000 | wc -l" since direct specification of the line count to
head(1) in option-style syntax is deprecated in POSIX.)
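As a minimal sketch of the capped count, here is the pipeline run against a scratch directory. The directory, the file names and the limit of 10 are made up for the demonstration; substitute the real path and find arguments.

```shell
# Hypothetical demo: count at most $limit files under a scratch directory.
demo_dir=$(mktemp -d)

# Populate the scratch directory with 50 empty files.
i=1
while [ "$i" -le 50 ]; do
    : > "$demo_dir/file$i"
    i=$((i + 1))
done

limit=10

# Capped count: head stops reading after $limit lines, so wc never sees more.
capped=$(find "$demo_dir" -type f -print | head -n "$limit" | wc -l)

# Uncapped count, for comparison.
total=$(find "$demo_dir" -type f -print | wc -l)

# Some wc implementations pad their output with spaces; normalise to integers.
capped=$((capped))
total=$((total))

echo "capped=$capped total=$total"

rm -rf "$demo_dir"
```

If the capped count comes back equal to the limit, there are at least that many matches; anything lower is the exact count.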
head(1) will exit immediately upon counting 1000 (or whatever maximum
number of) rows, which will generate SIGPIPE to find, which will (more or
less) immediately exit. This might generate more I/O if the read(2) in
head(1) blocks until the pipe(7) fills enough to satisfy a read(2) call:
head(1) read(2)s from the pipe, counts to X and terminates, which sends
SIGPIPE to find(1), but in the interim find(1) has continued processing to
fill the next BUFSIZ's worth of pipe(7). Meanwhile, wc does not receive
SIGPIPE because it never write(2)s to a pipe whose reader has gone away; it
processes normally, exiting when head(1) generates EOF on the pipe.
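The early termination of the writer can be observed with a small sketch. Here yes(1) stands in for a find that never runs out of output (that substitution is mine, purely for illustration), and the temp file exists only to capture the writer's exit status portably.

```shell
# Sketch: yes(1) plays the part of a find with unlimited output.
status_file=$(mktemp)
limit=1000

# The subshell records the writer's exit status once the writer has died.
# head exits after $limit lines; the writer's next write to the pipe fails.
count=$( (yes; echo "$?" > "$status_file") | head -n "$limit" | wc -l )
count=$((count))    # strip any padding from wc's output

writer_status=$(cat "$status_file")
rm -f "$status_file"

echo "lines counted: $count, writer exit status: $writer_status"
```

In common shells the writer's status is typically 141 (128 plus 13, the signal number of SIGPIPE), though I wouldn't rely on the exact value. If SIGPIPE happens to be ignored in the calling environment, the writer sees EPIPE from write(2) instead and exits with an ordinary error status.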
Based on some naïve tests I just ran, the amount of extra disk I/O involved
is below the threshold of human measurement, but the numbers I generated
indicated that the pipe writer generates anywhere between another 100 bytes
and 4 KB of output. Four kilobytes of find(1) output could easily indicate
a couple dozen megabytes of disk I/O, or even more in pathological cases.
However, this isn't an issue (as long as you don't have disk IOPS
contention from other processes) because the wall time remains the same.
Or am I misinterpreting what you want to accomplish altogether?