[RndTbl] Command line challenge: trim garbage from start and end of a file.

Sat Dec 25 14:50:10 CST 2010

On 2010-11-10 Sean Walberg wrote:
> Adam and I were having an offline discussion, and some testing shows
> that AWK outperforms SED by a slight margin:

I know it's an old thread... but I had to have a go at you awk/sed
weenies. ;-)

My solution uses a perl regex:

perl -e '$/=undef;open I,$ARGV[0];$_=<I>;/(?:^|\n)(output start\n.*\noutput end\n)/s and print $1' infile

It's not a filter (it requires a filename), but it could probably be
made into one easily.

I recall reading in perl books that perl's regex engine is faster than
sed/awk, and the one-liner above also takes advantage of the
slurp-whole-file mode you get by undefining $/.

On my computer the awk/sed/perl times compare like so:

time sed -n '/output start/,/output end/p' < infile > /dev/null
0.264+0.002c 0:00.26s 100.0% 0+0<774k | 1+39cs 0+259pg 0sw 0sg

time awk '/output start/,/output end/' < infile > /dev/null
0.183+0.003c 0:00.18s 100.0% 0+0<774k | 1+28cs 0+298pg 0sw 0sg

time perl -e '$/=undef;open I,$ARGV[0];$_=<I>;/(?:^|\n)(output start\n.*\noutput end\n)/s and print $1' infile > /dev/null
0.032+0.017c 0:00.05s 80.0% 0+0<8168k | 1+19cs 0+4196pg 0sw 0sg

Wow!  But yikes, look at the mem usage (8168k for perl versus 774k for
sed and awk in the output above).  Good thing RAM is plentiful these
days.  In 1980 sed would have been the better bet for sure.

> [sean@bob tmp]$ W=/usr/share/dict/words
> [sean@bob tmp]$ (tail -1000 $W; echo output start; cat $W; echo output end; head -1000 $W) > infile
> [sean@bob tmp]$ wc -l infile
> 481831 infile