[RndTbl] Command line challenge: trim garbage from start and end of a file.
Sean Walberg
sean at ertw.com
Wed Nov 10 12:50:27 CST 2010
I used whatever was on my Fedora 13 box:
[sean at bob ~]$ awk --version
GNU Awk 3.1.8
[sean at bob ~]$ sed --version
GNU sed version 4.2.1
The difference gets much bigger if you use a more complex regexp.
[sean at bob tmp]$ time awk '/.*output.*start.*/,/.*output.*end.*/' < infile > /dev/null
real 0m0.450s
user 0m0.393s
sys 0m0.010s
[sean at bob tmp]$ time sed -n '/.*output.*start.*/,/.*output.*end.*/p' < infile > /dev/null
real 0m1.726s
user 0m1.495s
sys 0m0.017s
Awk didn't seem to blink an eye. Strangely enough, even though the leading and
trailing .*'s are completely superfluous, they seem to throw sed for a loop,
even if the middle .* is replaced with a space.
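
For anyone who wants to check that the extra .*'s really don't change what
gets selected, here is a minimal sketch. The five-line sample file is made up
for illustration (it is not the infile from the timings), and this only
compares which lines come out, not how fast either form runs.

```shell
# Build a tiny throwaway sample (contents are hypothetical).
sample=$(mktemp)
printf '%s\n' garbage 'output start' 'good data' 'output end' garbage > "$sample"

# Range addresses with the superfluous leading/trailing .* ...
with_stars=$(sed -n '/.*output.*start.*/,/.*output.*end.*/p' "$sample")
# ... and the same range with them dropped.
without_stars=$(sed -n '/output.*start/,/output.*end/p' "$sample")

# An unanchored regexp already matches anywhere in the line,
# so both forms should select exactly the same lines.
if [ "$with_stars" = "$without_stars" ]; then
    echo "same lines selected"
fi
rm -f "$sample"
```

Wrapping each sed call in time against a larger file should show whether the
slowdown reproduces on another sed version.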
Sean
On Wed, Nov 10, 2010 at 12:37 PM, Gilles Detillieux <grdetil at scrc.umanitoba.ca> wrote:
> Interesting! Which version of awk did you test? I have to admit I
> haven't looked into awk performance in quite some time. My early
> experience, on older Unix systems (pre-Linux), confirmed what I had read
> about awk being pretty slow. But I seem to recall that even on older
> Linux systems, gawk wasn't exactly speedy then either. I imagine the
> GNU awk developers must have remedied that since, though, if that is
> indeed what you were testing.
>
> Searching online for discussions on awk performance found one from 2002
> suggesting gawk was much faster than nawk, and another from this past
> August that suggested the opposite. Perhaps the developers of the two
> have been leap-frogging each other with optimizations to their code?
>
> On 11/10/2010 11:56 AM, Sean Walberg wrote:
> > Adam and I were having an offline discussion, and some testing shows
> > that AWK outperforms SED by a slight margin:
> >
> > [sean at bob tmp]$ W=/usr/share/dict/words
> > [sean at bob tmp]$ (tail -1000 $W; echo output start; cat $W; echo output end; head -1000 $W) > infile
> > [sean at bob tmp]$ wc -l infile
> > 481831 infile
> > [sean at bob tmp]$ time awk '/output start/,/output end/' < infile > /dev/null
> >
> > real 0m0.411s
> > user 0m0.393s
> > sys 0m0.016s
> > [sean at bob tmp]$ time sed -n '/output start/,/output end/p' < infile > /dev/null
> >
> > real 0m0.678s
> > user 0m0.631s
> > sys 0m0.029s
> >
> > I ran it a bunch more times and the results were similar. YMMV,
> > benchmarks are lies, etc.
> >
> > Sean
> >
> > On Wed, Nov 10, 2010 at 11:32 AM, Gilles Detillieux <grdetil at scrc.umanitoba.ca> wrote:
> >
> > I may have misinterpreted the question before. If you want the "output
> > start" and "output end" marker lines in the output (which I guess your
> > grep pipeline would do), then Adam's sed script will do that. Mine,
> > using the "d" commands, will output only the data in between. The
> > shortest awk script to do the same would be:
> >
> > awk '/output start/{s=1};s==1;/output end/{s=0};'
> >
> > or
> >
> > awk '/output end/{s=0};s==1;/output start/{s=1};'
> >
> > The first is a simplification of Adam's, which outputs the output marker
> > lines, while the second, using the same statements in the opposite
> > order, suppresses the markers. Of perl, awk and sed, I suspect sed is
> > the most lightweight, and probably the quickest, unless perl can
> > outperform sed on larger files. awk has a reputation for being pretty
> > slow. I tend to favour sed unless awk or perl makes the job a lot
> > easier.
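
A quick way to see the effect of statement order in the two awk one-liners
above is to run both on a small sample. This is only a sketch: the six-line
file below is made up to match the shape John described, and the variable
names are arbitrary.

```shell
# Hypothetical six-line sample: garbage, markers, two data lines, garbage.
sample=$(mktemp)
printf '%s\n' garbage 'output start' 'good data 1' 'good data 2' 'output end' garbage > "$sample"

# Set the flag, print, then clear: the marker lines are included.
with_markers=$(awk '/output start/{s=1}; s==1; /output end/{s=0}' "$sample")

# Clear the flag, print, then set: the marker lines are suppressed.
without_markers=$(awk '/output end/{s=0}; s==1; /output start/{s=1}' "$sample")

printf 'with markers:\n%s\n\nwithout markers:\n%s\n' "$with_markers" "$without_markers"
rm -f "$sample"
```

The bare s==1 pattern prints the current line, so whichever assignment runs
before it decides whether a marker line is printed.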
> >
> > Gilles
> >
> > On 11/10/2010 11:13 AM, Adam Thompson wrote:
> > > The AWK version is functionally identical, and not very much shorter, or
> > > any more elegant:
> > >
> > > awk '/output start/ {s=1};{if (s==1) print $0};/output end/ {s=0}'
> > >
> > > (the perl version can generally be made that small, too.)
> > >
> > >
> > >
> > > I would instead suggest sed(1), since this is precisely what it’s
> > > designed for:
> > >
> > > sed -n '/output start/,/output end/p' < infile
> > >
> > >
> > >
> > > -Adam
> > >
> > >
> > >
> > >
> > >
> > > *From:* roundtable-bounces at muug.mb.ca [mailto:roundtable-bounces at muug.mb.ca] *On Behalf Of *Sean Walberg
> > > *Sent:* Wednesday, November 10, 2010 10:56
> > > *To:* Continuation of Round Table discussion
> > > *Subject:* Re: [RndTbl] Command line challenge: trim garbage from start and end of a file.
> > >
> > >
> > >
> > > OTTOMH:
> > >
> > >
> > >
> > > perl -n -e 'BEGIN {$state = 0} $state = 1 if ($state == 0 and /output start/); $state = 2 if ($state == 1 and /output end/); print if ($state == 1)' < infile > outfile
> > >
> > > I'll bet there's a shorter AWK version though.
> > >
> > >
> > >
> > > Sean
> > >
> > >
> > >
> > > On Wed, Nov 10, 2010 at 10:51 AM, John Lange <john at johnlange.ca> wrote:
> > >
> > > I have files with the following structure:
> > >
> > > garbage
> > > garbage
> > > garbage
> > > output start
> > > .. good data
> > > .. good data
> > > .. good data
> > > .. good data
> > > output end
> > > garbage
> > > garbage
> > > garbage
> > >
> > > How can I extract the good data from the file trimming the garbage
> > > from the beginning and end?
> > >
> > > The following works just fine but it's dirty because I don't like the
> > > fact that I have to pick an arbitrarily large number for the "before"
> > > and "after" values.
> > >
> > > grep -A 999999 "output start" <infile> | grep -B 999999 "output end" > newfile
> > >
> > > Can anyone come up with something more elegant?
> > >
> > > --
> > > John Lange
> > > www.johnlange.ca <http://www.johnlange.ca>
> >
> > --
> > Gilles R. Detillieux E-mail: <grdetil at scrc.umanitoba.ca>
> > Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
> > Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 0J9 (Canada)
> > _______________________________________________
> > Roundtable mailing list
> > Roundtable at muug.mb.ca <mailto:Roundtable at muug.mb.ca>
> > http://www.muug.mb.ca/mailman/listinfo/roundtable
> >
> >
> >
> >
> > --
> > Sean Walberg <sean at ertw.com> http://ertw.com/
> >
> >
>
> --
> Gilles R. Detillieux E-mail: <grdetil at scrc.umanitoba.ca>
> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
> Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 0J9 (Canada)
>
--
Sean Walberg <sean at ertw.com> http://ertw.com/