[RndTbl] RegEx help needed (sed?)

Trevor Cordes trevor at tecnopolis.ca
Thu Dec 1 00:29:13 CST 2016


On 2016-11-29 Kevin McGregor wrote:
> I'm using sed to massage some input. Specifically, I have input lines
> like
> 
> aaaaaaaaaa/BBBBB_ccc@00000
> or
> aaaaaaaaaa/BBBBB@00000
> 
> and I want the output to always be
> 
> BBBBB

The others pegged it with the "greedy" *.

I highly recommend using a pcre engine (or just perl!) as it gives you
non-greedy *?, which can give you performance improvements and also save
you some trouble/syntax.  In fact, in tons of (most?) cases people who
aren't regex ninjas really are thinking non-greedy, so plunking
non-greedy in everywhere usually DWIM.

> sed "s/^.*\/\(.*\)[_@].*$/\1/"

The reason the above (and any greedy .* solution) is inefficient is
that the first .* will eat up every char to the end of string, then
backtrack testing each one for the / char.  Same with the 2nd .*, it
will eat every char to end of string then backtrack looking for a _ or
@!  Inefficient.  The only time this doesn't matter is the last .* since
you're looking for the EOS anyhow.
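
The same greediness is also what gives the wrong answer in the first
place, not just the slow one.  Here is a small sketch that runs the
original sed call on the two sample lines, next to a sed-only variant
that uses a negated character class instead of .* for the capture.  The
variable names are mine, and the second expression is just one possible
fix, not necessarily what the others posted.

```shell
#!/bin/sh
# The two sample lines from the original question.
inputs='aaaaaaaaaa/BBBBB_ccc@00000
aaaaaaaaaa/BBBBB@00000'

# Original expression: the greedy \(.*\) grabs as much as it can, so the
# capture ends at the LAST _ or @ on the line rather than the first.
greedy=$(printf '%s\n' "$inputs" | sed 's/^.*\/\(.*\)[_@].*$/\1/')

# Negated character class: the capture can never cross a _ or @, so it
# stops at the first one without needing a non-greedy quantifier.
fixed=$(printf '%s\n' "$inputs" | sed 's/^.*\/\([^_@]*\)[_@].*$/\1/')

printf 'greedy:\n%s\n' "$greedy"
printf 'fixed:\n%s\n' "$fixed"
```

The greedy version prints BBBBB_ccc for the first line and BBBBB for the
second; the negated-class version prints BBBBB for both.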

May not seem like much, but if you were processing really large strings
(I work on multi-MB strings a lot) or GB-sized lists of these lines then
you'd see a difference.  In fact, with large inputs, you can cause a
regex engine to "hang" pretty easily with intermixed .*'s due to
backtracking.  Like /.*..*..*/ type things, which can happen even when
they aren't so obvious.  .*? will save you in many of those cases.

Plus, using perl we can make the code *much* terser.  I love terse!
Terse is good.

# shortest method, and fastest as we are yanking rather than modifying
# perl allows any char delimiter so we don't get backslashitis
# (these snippets all work on the current line in $_)
use 5.18.0; # gives us say (5.10+), which is print with NL appended
say((m#/([^_@]*)#)[0]);

# more like the sed call (replacement) but with non-greedy:
print s#.*?/(.*?)[_@].*#$1#r;

# or use from a cmd line:
perl -ne 'print((m#/([^_@]*)#)[0]."\n")'
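
To try the perl versions above from a shell, something like the
following sketch should do.  It assumes perl is on the PATH; the
variable names and the two-slash test line are made up for illustration.
The last case is there because non-greedy .*?/ stops at the first slash,
while the greedy ^.*\/ in the sed original runs to the last one, so the
two are only interchangeable when a line has a single slash.

```shell
#!/bin/sh
# The two sample lines from the original question.
inputs='aaaaaaaaaa/BBBBB_ccc@00000
aaaaaaaaaa/BBBBB@00000'

# Extraction: print the capture from the first / up to the first _ or @.
yank=$(printf '%s\n' "$inputs" | perl -ne 'print((m#/([^_@]*)#)[0] . "\n")')

# Substitution with non-greedy quantifiers; -p prints each modified line.
subst=$(printf '%s\n' "$inputs" | perl -pe 's#.*?/(.*?)[_@].*#$1#')

# Hypothetical line with two slashes: .*?/ matches only up to the FIRST
# slash, so the second path component ends up inside the capture.
multi=$(printf '%s\n' 'a/b/CCC_d@0' | perl -pe 's#.*?/(.*?)[_@].*#$1#')

printf 'yank:\n%s\n' "$yank"
printf 'subst:\n%s\n' "$subst"
printf 'multi: %s\n' "$multi"
```

Both yank and subst come out as BBBBB for each sample line; the
two-slash line gives b/CCC, which may or may not be what you want.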

Perl is far and away the best program for working with regexes: they
are 100% native and first-class types in perl, there's no need to mess
with most quoting, you get a choice of delimiter, etc.  Every time I have
to use pcre in other languages (php, etc) I want to vomit with how ugly
the syntax gets.

I highly highly highly highly recommend everyone who ever needs to do
this stuff read the O'Reilly Mastering Regular Expressions book.  It'll
make you grok the "greedy" and backtracking stuff like it's second
nature.  Might be the best book they've ever put out.

Just my $1.02!

