[RndTbl] Wrong time of night for doing regex?

Hartmut W Sager hwsager at marityme.net
Sat Jan 4 12:23:54 CST 2020


After tons more experimenting, I figured it out!  But I don't know whether
it's a bug or a feature in Vedit, or proper regex behaviour (various online
regex documentation didn't help at all).

It turns out, at least in this regex implementation, that a pair of
enclosing parentheses can serve only one of two purposes at a time, never
both.  Those two purposes are:

1.  Mark a group that can then be referred to by a variable like "\3" in
the replacement string.
2.  Enclose a group with alternation (regex terminology) containing several
alternatives separated by the "or" operator "|".

Furthermore, at least in this regex implementation, even the type-2 usage
(above) increments the "\nnn" counter for variables that can be used in the
replacement string, even though the matching "\nnn" variable cannot
actually be used in the replacement string!
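
For comparison only, here is a minimal sketch of how a different engine
(Python's "re" module, not Vedit) treats the same construct.  There, a single
pair of parentheses both groups the alternation and captures the match, so the
type-1/type-2 split described above appears to be specific to Vedit.  The
sample pattern and input strings are made up for illustration.

```python
import re

# Python's re module, NOT Vedit: one pair of parentheses both groups the
# alternation and captures whichever alternative matched.
PATTERN = re.compile(r"abc(def|ghi)jkl")

def bracket_middle(text):
    # \1 refers to the alternative that matched inside the single group.
    return PATTERN.sub(r"[\1]", text)

match = PATTERN.search("abcghijkl")
captured = match.group(1)              # the alternative that matched
replaced = bracket_middle("abcghijkl") # the captured text reused in a replacement
print(captured, replaced)
```

This says nothing about whether Vedit's behaviour is a bug or a design
choice; it only shows that the restriction is not universal.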

The solution I figured out (and tested - it works):  Enclose the search
segment in double (nested) parentheses "((" and "))", and the outer
parentheses are then a type-1 usage which can be referenced in the
replacement string.  But you have to make sure you use the correct "\nnn"
variable by numbering the opening parentheses "(" strictly from left to
right (which is normal in regex).  This unfortunately exhausts the 9
variables "\1" thru "\9" more rapidly.

E.g.
Search string: abc((def|ghi))jkl\s(mn[0-9])op((qrs|tuv))xy([0-9])z
Replacement string: Can use variables \1, \3, \4, \6, but not \2, \5.
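
The left-to-right numbering can be checked with a short sketch in Python's
"re" module (again, not Vedit).  The sample input string and the helper name
are assumptions made up for illustration.  In Python all six groups capture,
so the inner groups 2 and 5 simply duplicate the outer groups 1 and 4; in
Vedit, per the above, only the outer ones are usable.

```python
import re

# Same search string as in the example above, compiled with Python's re.
PATTERN = re.compile(r"abc((def|ghi))jkl\s(mn[0-9])op((qrs|tuv))xy([0-9])z")

def numbered_groups(text):
    """Return {group number: captured text}, or None if there is no match."""
    m = PATTERN.fullmatch(text)
    if m is None:
        return None
    # Groups are numbered by the position of their opening parenthesis.
    return {n: m.group(n) for n in range(1, PATTERN.groups + 1)}

sample = "abcdefjkl mn5opqrsxy7z"  # hypothetical input that fits the pattern
groups = numbered_groups(sample)
for number, text in groups.items():
    print(number, text)
```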

Hartmut W Sager - Tel +1-204-339-8331


On Sat, 4 Jan 2020 at 05:00, Hartmut W Sager <hwsager at marityme.net> wrote:

> This might be the wrong time of night for doing regex (i.e., my mistake),
> or my trusty Vedit text editor has a bug in its regex implementation.
>
> Original search string:
> ^(From AncientBBS[1-2])\s+(Sun|Mon|Tue|Wed|Thu|Fri|Sat)[\s\,]+(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+([0-9][0-9]|\s[0-9])[\s\,]+(19[0-9][0-9])[\s\,]+([0-9][0-9]\:[0-9][0-9]\:[0-9][0-9])\s*$
> Replacement string: <Nah, skip it>
>
> The above search string gives a syntax error.  I am a bit suspicious of
> the ([0-9][0-9]|\s[0-9]) group re operator precedence of the "or", and
> proceeded to stepwise simplification to narrow it down.  I finally got down
> to:
>
> Search string:
> (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+([0-9])[\s\,]+
> Replacement string: \1\s0\2\s
>
> The new search works fine (as did some of the previous stepwise simplified
> ones), but the replacements are baffling me.
> The line
> From AncientBBS1 Thu  Jan  2, 1986  20:50:00
> gets changed to
> From AncientBBS1 Thu   02 1986  20:50:00
>
> I.e., the variable \1 seems to get lost.  In my previous stepwise
> simplified cases, multiple variables got lost when the search worked at all.
>
> Why am I doing this?  I need to massage some old BBS messages into the
> awkward mbox format, whose date format (on the "From " line) of "Tue Nov
> 05 19:02:00 1985" is particularly illogical.  Be that as it may, the two
> sources of these messages I am processing had further sloppiness in their
> dates, done by some ancient BBS bozos.  I did successfully fix a lot of
> that already with regex.
>
> Hartmut W Sager - Tel +1-204-339-8331
>
>
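
For reference, the simplified search and replacement from the quoted message
can be reproduced in Python's "re" module (not Vedit) as a rough sketch of the
result one would expect when \1 is not lost.  Python does not accept "\s" in
a replacement string, so literal spaces are used instead; the function name is
made up for illustration.

```python
import re

# The simplified search string from the quoted message.
MONTH_DAY = re.compile(
    r"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+([0-9])[\s\,]+"
)

def pad_day(line):
    # Keep the month (\1), zero-pad the single-digit day (\2), and collapse
    # the surrounding whitespace and comma to single spaces.
    return MONTH_DAY.sub(r"\1 0\2 ", line)

line = "From AncientBBS1 Thu  Jan  2, 1986  20:50:00"
fixed = pad_day(line)
print(fixed)
```

Because the pattern requires whitespace or a comma right after a single
digit, lines with a two-digit day are left unchanged.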