[RndTbl] Command line challenge: trim garbage from start and end of a file.

Sean Walberg sean at ertw.com
Wed Nov 10 12:50:27 CST 2010


I used whatever was on my Fedora 13 box:

[sean at bob ~]$ awk --version
GNU Awk 3.1.8
[sean at bob ~]$ sed --version
GNU sed version 4.2.1

The difference gets much bigger if you use a more complex regexp.

[sean at bob tmp]$ time awk '/.*output.*start.*/,/.*output.*end.*/' < infile > /dev/null

real    0m0.450s
user    0m0.393s
sys     0m0.010s
[sean at bob tmp]$ time sed -n '/.*output.*start.*/,/.*output.*end.*/p' < infile > /dev/null

real    0m1.726s
user    0m1.495s
sys     0m0.017s

Awk didn't seem to blink an eye. Strangely, the leading and trailing .*'s,
although completely superfluous, throw sed for a loop, even when the middle
.* is replaced with a space.
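
For anyone who wants to try this at home, here is a self-contained sketch of
the comparison. It generates its own word list with seq instead of using
/usr/share/dict/words, so the absolute timings will differ from the ones
above; the file and marker names follow the thread.

```shell
# Build a test file shaped like the one earlier in the thread, but from
# generated data so it runs anywhere (the thread used /usr/share/dict/words).
seq 1 200000 > words
(tail -1000 words; echo output start; cat words; echo output end; head -1000 words) > infile

# The plain range expression...
time sed -n '/output start/,/output end/p' < infile > /dev/null
# ...versus the same range padded with the superfluous .*'s.
time sed -n '/.*output.*start.*/,/.*output.*end.*/p' < infile > /dev/null
```

Both commands print the same 200002 lines (the two markers plus everything
between them); only the matching cost differs.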

Sean

On Wed, Nov 10, 2010 at 12:37 PM, Gilles Detillieux <grdetil at scrc.umanitoba.ca> wrote:

> Interesting!  Which version of awk did you test?  I have to admit I
> haven't looked into awk performance in quite some time.  My early
> experience, on older Unix systems (pre-Linux), confirmed what I had read
> about awk being pretty slow.  But I seem to recall that even on older
> Linux systems, gawk wasn't exactly speedy then either.  I imagine the
> GNU awk developers must have remedied that since, though, if that is
> indeed what you were testing.
>
> Searching online for discussions on awk performance found one from 2002
> suggesting gawk was much faster than nawk, and another from this past
> August that suggested the opposite.  Perhaps the developers of the two
> have been leap-frogging each other with optimizations to their code?
>
> On 11/10/2010 11:56 AM, Sean Walberg wrote:
> > Adam and I were having an offline discussion, and some testing shows
> > that AWK outperforms SED by a slight margin:
> >
> > [sean at bob tmp]$ W=/usr/share/dict/words
> > [sean at bob tmp]$ (tail -1000 $W; echo output start; cat $W; echo output end; head -1000 $W) > infile
> > [sean at bob tmp]$ wc -l infile
> > 481831 infile
> > [sean at bob tmp]$ time awk '/output start/,/output end/' < infile > /dev/null
> >
> > real    0m0.411s
> > user    0m0.393s
> > sys     0m0.016s
> > [sean at bob tmp]$ time sed -n '/output start/,/output end/p' < infile > /dev/null
> >
> > real    0m0.678s
> > user    0m0.631s
> > sys     0m0.029s
> >
> > I ran it a bunch more times and the results were similar.  YMMV,
> > benchmarks are lies, etc.
> >
> > Sean
> >
> > On Wed, Nov 10, 2010 at 11:32 AM, Gilles Detillieux
> > <grdetil at scrc.umanitoba.ca> wrote:
> >
> >     I may have misinterpreted the question before.  If you want the
> >     "output start" and "output end" marker lines in the output (which I
> >     guess your grep pipeline would do), then Adam's sed script will do
> >     that.  Mine, using the "d" commands, will output only the data in
> >     between.  The shortest awk script to do the same would be:
> >
> >     awk '/output start/{s=1};s==1;/output end/{s=0};'
> >
> >     or
> >
> >     awk '/output end/{s=0};s==1;/output start/{s=1};'
> >
> >     The first is a simplification of Adam's, which outputs the output
> >     marker lines, while the second, using the same statements in the
> >     opposite order, suppresses the markers.  Of perl, awk and sed, I
> >     suspect sed is the most lightweight, and probably the quickest,
> >     unless perl can outperform sed on larger files.  awk has a
> >     reputation for being pretty slow.  I tend to favour sed unless awk
> >     or perl makes the job a lot easier.
> >
> >     Gilles
> >
> >     On 11/10/2010 11:13 AM, Adam Thompson wrote:
> >      > The AWK version is functionally identical, and not very much
> >      > shorter, or any more elegant:
> >      >
> >      >     awk '/output start/ {s=1};{if (s==1) print $0};/output end/ {s=0}'
> >      >
> >      > (the perl version can generally be made that small, too.)
> >      >
> >      >
> >      >
> >      > I would instead suggest sed(1), since this is precisely what it's
> >      > designed for:
> >      >
> >      >     sed -n '/output start/,/output end/p' < infile
> >      >
> >      >
> >      >
> >      > -Adam
> >      >
> >      >
> >      >
> >      >
> >      >
> >      > *From:* roundtable-bounces at muug.mb.ca
> >      > [mailto:roundtable-bounces at muug.mb.ca] *On Behalf Of *Sean Walberg
> >      > *Sent:* Wednesday, November 10, 2010 10:56
> >      > *To:* Continuation of Round Table discussion
> >      > *Subject:* Re: [RndTbl] Command line challenge: trim garbage
> >      > from start and end of a file.
> >      >
> >      >
> >      >
> >      > OTTOMH:
> >      >
> >      >
> >      >
> >      > perl -n -e 'BEGIN {$state = 0} $state = 1 if ($state == 0 and
> >      > /output start/); $state = 2 if ($state == 1 and /output end/);
> >      > print if ($state == 1)' < infile > outfile
> >      >
> >      > I'll bet there's a shorter AWK version though.
> >      >
> >      >
> >      >
> >      > Sean
> >      >
> >      >
> >      >
> >      > On Wed, Nov 10, 2010 at 10:51 AM, John Lange <john at johnlange.ca> wrote:
> >      >
> >      > I have files with the following structure:
> >      >
> >      > garbage
> >      > garbage
> >      > garbage
> >      > output start
> >      > .. good data
> >      > .. good data
> >      > .. good data
> >      > .. good data
> >      > output end
> >      > garbage
> >      > garbage
> >      > garbage
> >      >
> >      > How can I extract the good data from the file trimming the garbage
> >      > from the beginning and end?
> >      >
> >      > The following works just fine, but it's dirty because I don't
> >      > like the fact that I have to pick an arbitrarily large number
> >      > for the "before" and "after" values:
> >      >
> >      > grep -A 999999 "output start" <infile> | grep -B 999999 "output end" > newfile
> >      >
> >      > Can anyone come up with something more elegant?
> >      >
> >      > --
> >      > John Lange
> >      > www.johnlange.ca <http://www.johnlange.ca>
> >
> >     --
> >     Gilles R. Detillieux              E-mail: <grdetil at scrc.umanitoba.ca>
> >     Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
> >     Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 0J9  (Canada)
> >     _______________________________________________
> >     Roundtable mailing list
> >     Roundtable at muug.mb.ca
> >     http://www.muug.mb.ca/mailman/listinfo/roundtable
> >
> >
> >
> >
> > --
> > Sean Walberg <sean at ertw.com>    http://ertw.com/
> >
> >
>
> --
> Gilles R. Detillieux              E-mail: <grdetil at scrc.umanitoba.ca>
> Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
> Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 0J9  (Canada)
>



-- 
Sean Walberg <sean at ertw.com>    http://ertw.com/