[RndTbl] Command line challenge: trim garbage from start and end of a file.

Sean Walberg sean at ertw.com
Wed Nov 10 11:56:39 CST 2010


Adam and I were having an offline discussion, and some testing shows that
AWK outperforms SED here, at about 0.41s vs 0.68s wall-clock on a
481,831-line file:

[sean at bob tmp]$ W=/usr/share/dict/words
[sean at bob tmp]$ (tail -1000 $W; echo output start; cat $W; echo output end; head -1000 $W) > infile
[sean at bob tmp]$ wc -l infile
481831 infile
[sean at bob tmp]$ time awk '/output start/,/output end/' < infile > /dev/null

real    0m0.411s
user    0m0.393s
sys     0m0.016s
[sean at bob tmp]$ time sed -n '/output start/,/output end/p' < infile > /dev/null

real    0m0.678s
user    0m0.631s
sys     0m0.029s

I ran it a bunch more times and the results were similar.  YMMV, benchmarks
are lies, etc.
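
For anyone who wants to repeat the comparison, here is a rough sketch of it
as a script. It first checks that the two commands select exactly the same
lines, then times a few runs of each. Assumptions that are not in the
transcript above: the input is generated with seq rather than taken from
/usr/share/dict/words, the line counts and the three-run loop are arbitrary,
and `time` is the bash keyword.

```shell
#!/bin/bash
# Build a sample input: garbage, a marked block, more garbage.
# (Generated numbers stand in for the dictionary file; sizes are arbitrary.)
infile=$(mktemp)
{
    seq 1 1000              # leading garbage
    echo "output start"
    seq 1 200000            # the data we want to keep
    echo "output end"
    seq 1 1000              # trailing garbage
} > "$infile"

# Both commands should emit byte-identical output; compare checksums.
awk_sum=$(awk '/output start/,/output end/' < "$infile" | cksum)
sed_sum=$(sed -n '/output start/,/output end/p' < "$infile" | cksum)
if [ "$awk_sum" = "$sed_sum" ]; then
    echo "outputs match"
else
    echo "outputs differ"
fi

# Time several runs of each; look at the spread, not at a single run.
for i in 1 2 3; do
    echo "awk run $i:"
    time awk '/output start/,/output end/' < "$infile" > /dev/null
    echo "sed run $i:"
    time sed -n '/output start/,/output end/p' < "$infile" > /dev/null
done

rm -f "$infile"
```

Which tool wins will depend on the awk and sed implementations installed, so
treat any numbers from this as local to the machine they were measured on.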

Sean

On Wed, Nov 10, 2010 at 11:32 AM, Gilles Detillieux <
grdetil at scrc.umanitoba.ca> wrote:

> I may have misinterpreted the question before.  If you want the "output
> start" and "output end" marker lines in the output (which I guess your
> grep pipeline would do), then Adam's sed script will do that.  Mine,
> using the "d" commands, will output only the data in between.  The
> shortest awk script to do the same would be:
>
> awk '/output start/{s=1};s==1;/output end/{s=0};'
>
> or
>
> awk '/output end/{s=0};s==1;/output start/{s=1};'
>
> The first is a simplification of Adam's, which outputs the output marker
> lines, while the second, using the same statements in the opposite
> order, suppresses the markers.  Of perl, awk and sed, I suspect sed is
> the most lightweight, and probably the quickest, unless perl can
> outperform sed on larger files.  awk has a reputation for being pretty
> slow.  I tend to favour sed unless awk or perl makes the job a lot easier.
>
> Gilles
>
> On 11/10/2010 11:13 AM, Adam Thompson wrote:
> > The AWK version is functionally identical, and not very much shorter, or
> > any more elegant:
> >
> >     awk '/output start/ {s=1};{if (s==1) print $0};/output end/ {s=0}'
> >
> > (the perl version can generally be made that small, too.)
> >
> >
> >
> > I would instead suggest sed(1), since this is precisely what it’s
> > designed for:
> >
> >     sed -n '/output start/,/output end/p' < infile
> >
> >
> >
> > -Adam
> >
> >
> >
> >
> >
> > *From:* roundtable-bounces at muug.mb.ca
> > [mailto:roundtable-bounces at muug.mb.ca] *On Behalf Of *Sean Walberg
> > *Sent:* Wednesday, November 10, 2010 10:56
> > *To:* Continuation of Round Table discussion
> > *Subject:* Re: [RndTbl] Command line challenge: trim garbage from start
> > and end of a file.
> >
> >
> >
> > OTTOMH:
> >
> >
> >
> > perl -n -e 'BEGIN {$state = 0} $state = 1 if ($state == 0 and /output start/);
> >   $state = 2 if ($state == 1 and /output end/); print if ($state == 1)' < infile > outfile
> >
> > I'll bet there's a shorter AWK version though.
> >
> >
> >
> > Sean
> >
> >
> >
> > On Wed, Nov 10, 2010 at 10:51 AM, John Lange <john at johnlange.ca> wrote:
> >
> > I have files with the following structure:
> >
> > garbage
> > garbage
> > garbage
> > output start
> > .. good data
> > .. good data
> > .. good data
> > .. good data
> > output end
> > garbage
> > garbage
> > garbage
> >
> > How can I extract the good data from the file trimming the garbage
> > from the beginning and end?
> >
> > The following works just fine but it's dirty because I don't like the
> > fact that I have to pick an arbitrarily large number for the "before"
> > and "after" values.
> >
> > grep -A 999999 "output start" <infile> | grep -B 999999 "output end" > newfile
> >
> > Can anyone come up with something more elegant?
> >
> > --
> > John Lange
> > http://www.johnlange.ca
>
> --
> Gilles R. Detillieux              E-mail: <grdetil at scrc.umanitoba.ca>
> Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
> Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 0J9  (Canada)
> _______________________________________________
> Roundtable mailing list
> Roundtable at muug.mb.ca
> http://www.muug.mb.ca/mailman/listinfo/roundtable
>
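
A quick way to see the difference between the variants quoted above is to
run them on a tiny sample shaped like John's file. This is only a sketch:
the sample text and the variable names are made up for illustration, while
the awk and sed commands are the ones from the thread.

```shell
#!/bin/sh
# Small sample in the shape John described.
sample='garbage
output start
good data 1
good data 2
output end
garbage'

# Flag is set before the print test and cleared after it,
# so both marker lines are printed.
with_markers=$(printf '%s\n' "$sample" |
    awk '/output start/{s=1};s==1;/output end/{s=0};')

# Same statements in the opposite order: the flag is cleared before the
# print test and set after it, so both marker lines are suppressed.
without_markers=$(printf '%s\n' "$sample" |
    awk '/output end/{s=0};s==1;/output start/{s=1};')

# The sed address range includes both marker lines.
sed_range=$(printf '%s\n' "$sample" |
    sed -n '/output start/,/output end/p')

printf 'with markers:\n%s\n\n' "$with_markers"
printf 'without markers:\n%s\n\n' "$without_markers"
printf 'sed range:\n%s\n' "$sed_range"
```

The perl one-liner behaves differently from both: it prints the "output
start" line but not the "output end" line, because the state changes to 2
before the print test on the end marker.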



-- 
Sean Walberg <sean at ertw.com>    http://ertw.com/