[RndTbl] Command line challenge: trim garbage from start and end of a file.

Gilles Detillieux grdetil at scrc.umanitoba.ca
Wed Nov 10 12:37:35 CST 2010


Interesting!  Which version of awk did you test?  I have to admit I 
haven't looked into awk performance in quite some time.  My early 
experience, on older Unix systems (pre-Linux), confirmed what I had read 
about awk being pretty slow.  But I seem to recall that even on older 
Linux systems, gawk wasn't exactly speedy then either.  I imagine the 
GNU awk developers must have remedied that since, though, if that is 
indeed what you were testing.
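
In case it helps answer the version question, here is a rough sketch for
finding out what "awk" actually resolves to.  The awk_flavour function is
a made-up helper name.  I am only assuming that gawk answers --version
and that mawk answers -W version; anything else falls through to
"unknown".

```shell
#!/bin/bash
# Sketch only: awk_flavour is a hypothetical helper, not a standard tool.
awk_flavour() {
    bin=${1:-awk}
    command -v "$bin" > /dev/null 2>&1 || { echo "$bin: not installed"; return 1; }
    # gawk answers --version; mawk documents -W version.  Either may
    # fail on other implementations, so errors are discarded.
    for flag in --version '-W version'; do
        # $flag is deliberately unquoted so "-W version" splits in two.
        line=$("$bin" $flag 2> /dev/null < /dev/null | head -n 1)
        if [ -n "$line" ]; then
            echo "$bin: $line"
            return 0
        fi
    done
    echo "$bin: unknown (no version flag recognised)"
}

awk_flavour awk
```

On many Linux systems "awk" is just a link to gawk or mawk, so the first
line of output is usually enough to tell which one was timed.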

Searching online for discussions on awk performance, I found one from 2002 
suggesting gawk was much faster than nawk, and another from this past 
August that suggested the opposite.  Perhaps the developers of the two 
have been leap-frogging each other with optimizations to their code?
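
For anyone wanting to repeat Sean's comparison against whichever awk
implementations happen to be installed, something like the sketch below
might do.  The candidate names (gawk, mawk, nawk), the make_sample helper
and the seq-generated filler are all my assumptions, not what Sean ran;
his input was built from /usr/share/dict/words.

```shell
#!/bin/bash
# Sketch only.  make_sample is a hypothetical helper, and the candidate
# names in the loop are guesses at what might be installed.

# Build a file shaped like the one in the thread: garbage, a start
# marker, the good data, an end marker, more garbage.
make_sample() {
    f=$(mktemp) || return 1
    {
        seq 1 1000                # leading garbage
        echo 'output start'
        seq 1 100000              # stand-in for the good data
        echo 'output end'
        seq 1 1000                # trailing garbage
    } > "$f"
    echo "$f"
}

infile=$(make_sample)

# Time the same range pattern under each awk that is present.
for impl in gawk mawk nawk; do
    command -v "$impl" > /dev/null 2>&1 || continue
    echo "== $impl"
    time "$impl" '/output start/,/output end/' < "$infile" > /dev/null
done

echo "== sed"
time sed -n '/output start/,/output end/p' < "$infile" > /dev/null

rm -f "$infile"
```

The sample is smaller than Sean's 481831 lines, so raise the seq counts
to get closer to his test, and run it several times before reading
anything into the numbers.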

On 11/10/2010 11:56 AM, Sean Walberg wrote:
> Adam and I were having an offline discussion, and some testing shows 
> that AWK outperforms SED by a slight margin:
> 
> [sean at bob tmp]$ W=/usr/share/dict/words
> [sean at bob tmp]$ (tail -1000 $W; echo output start; cat $W; echo output end; head -1000 $W) > infile
> [sean at bob tmp]$ wc -l infile
> 481831 infile
> [sean at bob tmp]$ time awk '/output start/,/output end/' < infile > /dev/null
> 
> real    0m0.411s
> user    0m0.393s
> sys     0m0.016s
> [sean at bob tmp]$ time sed -n '/output start/,/output end/p' < infile > /dev/null
> 
> real    0m0.678s
> user    0m0.631s
> sys     0m0.029s
> 
> I ran it a bunch more times and the results were similar.  YMMV, 
> benchmarks are lies, etc.
> 
> Sean
> 
> On Wed, Nov 10, 2010 at 11:32 AM, Gilles Detillieux 
> <grdetil at scrc.umanitoba.ca> wrote:
> 
>     I may have misinterpreted the question before.  If you want the "output
>     start" and "output end" marker lines in the output (which I guess your
>     grep pipeline would do), then Adam's sed script will do that.  Mine,
>     using the "d" commands, will output only the data in between.  The
>     shortest awk script to do the same would be:
> 
>     awk '/output start/{s=1};s==1;/output end/{s=0};'
> 
>     or
> 
>     awk '/output end/{s=0};s==1;/output start/{s=1};'
> 
>     The first is a simplification of Adam's, which outputs the output marker
>     lines, while the second, using the same statements in the opposite
>     order, suppresses the markers.  Of perl, awk and sed, I suspect sed is
>     the most lightweight, and probably the quickest, unless perl can
>     outperform sed on larger files.  awk has a reputation for being pretty
>     slow.  I tend to favour sed unless awk or perl makes the job a lot
>     easier.
> 
>     Gilles
> 
>     On 11/10/2010 11:13 AM, Adam Thompson wrote:
>      > The AWK version is functionally identical, and not very much
>      > shorter, or any more elegant:
>      >
>      >     awk '/output start/ {s=1};{if (s==1) print $0};/output end/ {s=0}'
>      >
>      > (the perl version can generally be made that small, too.)
>      >
>      >
>      >
>      > I would instead suggest sed(1), since this is precisely what it’s
>      > designed for:
>      >
>      >     sed -n '/output start/,/output end/p' < infile
>      >
>      >
>      >
>      > -Adam
>      >
>      >
>      >
>      >
>      >
>      > *From:* roundtable-bounces at muug.mb.ca
>      > [mailto:roundtable-bounces at muug.mb.ca] *On Behalf Of* Sean Walberg
>      > *Sent:* Wednesday, November 10, 2010 10:56
>      > *To:* Continuation of Round Table discussion
>      > *Subject:* Re: [RndTbl] Command line challenge: trim garbage from
>      > start and end of a file.
>      >
>      >
>      >
>      > OTTOMH:
>      >
>      >
>      >
>      > perl -n -e 'BEGIN {$state = 0} $state = 1 if ($state == 0 and /output
>      > start/); $state = 2 if ($state == 1 and /output end/)  ; print if
>      > ($state == 1)' < infile > outfile
>      >
>      > I'll bet there's a shorter AWK version though.
>      >
>      >
>      >
>      > Sean
>      >
>      >
>      >
>      > On Wed, Nov 10, 2010 at 10:51 AM, John Lange <john at johnlange.ca> wrote:
>      >
>      > I have files with the following structure:
>      >
>      > garbage
>      > garbage
>      > garbage
>      > output start
>      > .. good data
>      > .. good data
>      > .. good data
>      > .. good data
>      > output end
>      > garbage
>      > garbage
>      > garbage
>      >
>      > How can I extract the good data from the file trimming the garbage
>      > from the beginning and end?
>      >
>      > The following works just fine but it's dirty because I don't like the
>      > fact that I have to pick an arbitrarily large number for the "before"
>      > and "after" values.
>      >
>      > grep -A 999999 "output start" <infile> | grep -B 999999 "output end" > newfile
>      >
>      > Can anyone come up with something more elegant?
>      >
>      > --
>      > John Lange
>      > www.johnlange.ca
> 
>     --
>     Gilles R. Detillieux              E-mail: <grdetil at scrc.umanitoba.ca>
>     Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
>     Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 0J9  (Canada)
>     _______________________________________________
>     Roundtable mailing list
>     Roundtable at muug.mb.ca
>     http://www.muug.mb.ca/mailman/listinfo/roundtable
> 
> 
> 
> 
> -- 
> Sean Walberg <sean at ertw.com>    http://ertw.com/
> 
> 

-- 
Gilles R. Detillieux              E-mail: <grdetil at scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 0J9  (Canada)

