[RndTbl] Email address validation!

Tue Apr 13 05:59:41 CDT 2021

On 2021-04-11 Adam Thompson wrote:
> TL;DR: you think you know how to validate an email address?  You're
> wrong.
> 
> https://www.netmeister.org/blog/email.html

Here's my official 20-years-wisdom email validation regex I use in many
projects (in utf8 encoding, php single-quote format):

(?:[\\x21\\x23-\\x27\\x2A-\\x2B\\x2D\\x2F-\\x39\\x3F\\x41-\\x5A\\x5E-\\x60\\x61-\\x7A\\x7B-\\x7E\\x{0080}-\\x{FFFF}]+(?:\\x2E[\\x21\\x23-\\x27\\x2A-\\x2B\\x2D\\x2F-\\x39\\x3F\\x41-\\x5A\\x5E-\\x60\\x61-\\x7A\\x7B-\\x7E\\x{0080}-\\x{FFFF}]+)*|\\x22(?:[\\x21\\x23-\\x5B\\x5D-\\x7E\\x{0080}-\\x{FFFF}]|\\x5C[\\x20\\x09\\x21-\\x7E\\x{0080}-\\x{FFFF}])*\\x22)@[\\x2Da-zA-Z0-9\\x{0080}-\\x{FFFF}]+(?:\\.[\\x2Da-zA-Z0-9\\x{0080}-\\x{FFFF}]+)+

(I actually wrote a script that closely matches the RFC5322 grammar and
auto-gens the above craziness.)

Yes, the above regex makes some assumptions and errs on the side of
allowing rather than blocking (to try to capture real customers while
still blocking egregious fuzzers/spammers).
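
For anyone who wants to poke at the pattern outside PHP, here is a rough
Python transcription, offered as a sketch rather than the author's code.
The assumptions: PCRE's `\x{0080}-\x{FFFF}` is written as `\u0080-\uFFFF`,
the doubled backslashes of the PHP single-quote form are collapsed, and the
whole string has to match (the posted pattern carries no anchors of its
own).  The names `EMAIL_RE` and `looks_valid` are made up for illustration.

```python
import re

# Non-ASCII range used throughout the posted pattern (U+0080..U+FFFF).
U = r"\u0080-\uFFFF"

# Unquoted local part: runs of "atom" characters separated by single dots.
ATEXT = (r"[\x21\x23-\x27\x2A-\x2B\x2D\x2F-\x39\x3F\x41-\x5A"
         r"\x5E-\x60\x61-\x7A\x7B-\x7E" + U + "]")
DOT_ATOM = ATEXT + r"+(?:\x2E" + ATEXT + r"+)*"

# Quoted local part: plain characters or backslash-escaped ones.
QTEXT = r"[\x21\x23-\x5B\x5D-\x7E" + U + "]"
QPAIR = r"\x5C[\x20\x09\x21-\x7E" + U + "]"
QUOTED = r"\x22(?:" + QTEXT + "|" + QPAIR + r")*\x22"

# Domain: at least two dot-separated labels.
LABEL = r"[\x2Da-zA-Z0-9" + U + "]+"
DOMAIN = LABEL + r"(?:\." + LABEL + ")+"

EMAIL_RE = re.compile("(?:" + DOT_ATOM + "|" + QUOTED + ")@" + DOMAIN)


def looks_valid(addr: str) -> bool:
    """True if the whole string matches the transcribed pattern."""
    return EMAIL_RE.fullmatch(addr) is not None


if __name__ == "__main__":
    for addr in ("user+tag@example.com", '"a.b"@example.com',
                 "user@localhost", "user@[192.0.2.1]"):
        print(addr, looks_valid(addr))
```

With this transcription the first two sample addresses are accepted and
the last two (dotless domain, bracketed IP literal) are rejected.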

The page you reference is very good, and had some things I haven't
thought of in a while.  However, I don't think he's actually written
something useful for "validation" vis-a-vis 2021.

Any source routing (and uucp bang) is evil and no one uses it anymore.
If a customer of my site wants to try to use it, they deserve to be
blocked. :-)  A good rule of thumb on validation should be "if someone
is smart enough to use a really weird email address feature, they are
smart enough to know why it didn't work and why they should use
something simpler".  We can minorly tick off 0.01% of users (wizards),
that's ok, as long as we keep the 99.99% normal users from having to
think.
So his points 1-3 are ignorable.

Points 4-8 are the important bits I like to address, hence all the
wacky utf8 craziness in my regex, with the big ones being special chars
in the proper places, + feature, and "" rules.  I think I capture some
of the dot rules, but I do allow consecutive dots because --who cares--.

Point 8 I just noticed I only allow code points up to U+FFFF (nothing
that needs 4-byte UTF8), oops, as I wrote this before the db properly
handled 4-byte.  Time to update the regex!
But if anyone is using poomoji as their local part...
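
To make the limit concrete: the character class in the posted pattern
stops at U+FFFF, so any code point that needs a 4-byte UTF-8 sequence
falls outside it.  The Python sketch below compares that range with a
widened one; the widened class (`ALL_PLANES`) is only a guess at what an
updated pattern might use, not the author's actual fix, and the helper
names are invented here.

```python
import re

# Range as posted (U+0080..U+FFFF) next to a hypothetical widened range.
BMP_ONLY = re.compile(r"[\u0080-\uFFFF]")
ALL_PLANES = re.compile(r"[\u0080-\U0010FFFF]")


def utf8_octets(ch: str) -> int:
    """Number of bytes the character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def report(ch: str) -> tuple:
    """(code point, UTF-8 length, matches posted range, matches widened range)."""
    return (f"U+{ord(ch):04X}", utf8_octets(ch),
            bool(BMP_ONLY.fullmatch(ch)), bool(ALL_PLANES.fullmatch(ch)))


if __name__ == "__main__":
    # e-acute, euro sign, pile of poo
    for ch in ("\u00e9", "\u20ac", "\U0001F4A9"):
        print(*report(ch))
```

In PCRE the analogous change would presumably be raising the upper bound
to `\x{10FFFF}`, but that is untested here.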

Point 9, hard (impossible) to do in a regex as above.  Just trim the
local and domain parts and hope for the best.  Anyone going over that, they
know why it's failing...
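
A length check is easy enough to do next to the regex rather than inside
it.  The sketch below assumes point 9 is about the RFC 5321 size limits
(64 octets for the local part, 255 for the domain) and that lengths are
counted in UTF-8 octets; the function name and the decision to reject
rather than truncate are choices made here for illustration.

```python
# Limits from RFC 5321 section 4.5.3.1, counted in octets.
MAX_LOCAL_OCTETS = 64
MAX_DOMAIN_OCTETS = 255


def within_length_limits(addr: str) -> bool:
    """Check local-part and domain sizes, splitting at the last '@'."""
    local, at, domain = addr.rpartition("@")
    if not at:
        return False  # no '@' at all
    local_len = len(local.encode("utf-8"))
    domain_len = len(domain.encode("utf-8"))
    return (0 < local_len <= MAX_LOCAL_OCTETS
            and 0 < domain_len <= MAX_DOMAIN_OCTETS)
```

Splitting at the last `@` keeps a quoted local part containing `@` in one
piece, and counting octets means 33 copies of a 2-byte character already
exceed the 64-octet local-part limit.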

Point 10, you just have to do an MX lookup if you want to avoid the
fuzzers... I disagree with this guy's point 10.  If you want to
register your email address at web sites before you register and set up
your domain... tough!
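
One way such an MX check might look in Python, as a sketch only.  It
assumes the third-party dnspython package for the real lookup
(`dns.resolver.resolve`), and it treats every DNS failure, timeouts
included, as "no MX", which a production check would probably want to
distinguish.  The function names are hypothetical, and the lookup is
passed in as a parameter so the decision logic can be exercised without
touching the network.

```python
from typing import Callable, List


def dnspython_mx_lookup(domain: str) -> List[str]:
    """MX exchange names for a domain via dnspython; [] on any DNS error."""
    # Imported here so the rest of the module works without dnspython.
    import dns.exception
    import dns.resolver
    try:
        answers = dns.resolver.resolve(domain, "MX")
    except dns.exception.DNSException:
        # NXDOMAIN, no MX records, timeout: all lumped together here.
        return []
    return [str(rr.exchange) for rr in answers]


def domain_accepts_mail(
    domain: str,
    lookup: Callable[[str], List[str]] = dnspython_mx_lookup,
) -> bool:
    """True if the lookup returns at least one usable MX host."""
    # A lone "." is the RFC 7505 "null MX": the domain accepts no mail.
    return any(host != "." for host in lookup(domain))
```

Note that this is stricter than SMTP itself, which falls back to the
domain's address records when no MX exists; rejecting such domains is a
policy choice in the spirit of the paragraph above.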

Point 11 is important, but it's pretty much already "free" with no
special logic because punycode just looks like a normal domain to most
code.
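
A quick Python illustration of why punycode comes for free: once an
internationalized domain is converted to its ASCII form it consists only
of letters, digits, hyphens and dots, so even an ASCII-only domain
pattern accepts it.  The sketch uses Python's built-in `idna` codec
(which implements the older IDNA 2003 rules); `ASCII_DOMAIN` and
`to_ascii` are names invented for this example.

```python
import re

# ASCII-only subset of the domain part of the posted pattern.
ASCII_DOMAIN = re.compile(r"[-a-zA-Z0-9]+(?:\.[-a-zA-Z0-9]+)+")


def to_ascii(domain: str) -> str:
    """Convert a Unicode domain to its punycode (ASCII) form."""
    return domain.encode("idna").decode("ascii")


if __name__ == "__main__":
    puny = to_ascii("b\u00fccher.example")
    print(puny)                                # xn--bcher-kva.example
    print(bool(ASCII_DOMAIN.fullmatch(puny)))  # True
```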

Point 12, dotless domains... Tough!  Bite me.  And though there may be
domains like that allowed by *domain* RFCs, do the *email* RFCs allow
it?

Point 13, half-evil.  Everyone's validator will already allow a raw
dotted-quad, but I'm not making special rules for [] syntax.  Only
uber-gurus would try this, and they will know why it fails.

So I agree it's not as easy as it seems, and going down that rabbit
hole will take a few evenings of work and you still won't arrive at
"correct".  But you can get "good enough" if you want to allow 99.99%
of normal users to sign up at your site.  I'm open to arguments and
suggestions, as it's a bi-yearly ritual to try to improve that regex.
Like adding 4-byte UTF8 support so my customers can be poo.