[RndTbl] very strange DNS errors

Wed Apr 20 06:50:51 CDT 2016

Without taking the time to examine these carefully, I'd guess that those domains are being served off less-than-stellar DNS servers, and the fault is likely not at your end.
There's a disgustingly large percentage of DNS service in the wild that's outright held together with chewing gum and baling twine... bad glue records in particular are a problem.
Examine the chain of authoritative servers for each and I'll bet you find some commonalities.
Also there are dozens of DNS "lint" tools that will help you track down other people's errors as well as your own.
Best guess without testing: domain has 3-4 servers listed at gTLD, only 2-3 of those are authoritative for the domain, and something along the line has an illegally-short TTL.
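
A rough sketch of that comparison in Python, in case it saves some squinting at dig output. It assumes dig is on the PATH. The names ns_records, compare and dig_ns are made up for this example, and the 300-second floor for a "short" TTL is an arbitrary assumption, not a limit from any standard. Ask a parent-zone server (pick one from your +trace output) and then one of the listed servers for the domain's NS set, and diff the two:

```python
import subprocess


def ns_records(dig_output):
    """Return {nameserver: ttl} for every IN NS record in dig's record output."""
    records = {}
    for line in dig_output.splitlines():
        # Skip blank lines and dig's ";"-prefixed comment lines.
        if not line.strip() or line.startswith(";"):
            continue
        fields = line.split()
        # Record lines look like: owner TTL class type rdata
        if len(fields) >= 5 and fields[2] == "IN" and fields[3] == "NS":
            records[fields[4].rstrip(".").lower()] = int(fields[1])
    return records


def compare(parent, child, min_ttl=300):
    """Diff the parent's delegation against what a listed server claims.

    min_ttl is an assumed threshold for flagging suspiciously short TTLs.
    """
    return {
        "only_in_parent": sorted(set(parent) - set(child)),
        "only_in_child": sorted(set(child) - set(parent)),
        "short_ttl": sorted(n for n, ttl in child.items() if ttl < min_ttl),
    }


def dig_ns(domain, server):
    """Ask one specific server, non-recursively, for the domain's NS records."""
    cmd = ["dig", "+norecurse", "+noall", "+answer", "+authority",
           domain, "NS", "@" + server]
    return subprocess.run(cmd, capture_output=True, text=True, timeout=15).stdout
```

Usage would be along the lines of compare(ns_records(dig_ns(domain, parent_server)), ns_records(dig_ns(domain, listed_server))), repeated for each listed server. Anything in only_in_parent is a server the delegation points at that the zone itself doesn't list, which is the sort of mismatch I'm guessing at above.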
-Adam

On April 20, 2016 2:03:59 AM CDT, Trevor Cordes <trevor at tecnopolis.ca> wrote:
>Some recurring email bounces have tipped me off to a really strange DNS
>issue on the boxes I admin.  I really need help on this as it's impacting
>a real customer in a real way: delayed emails.
>
>Often, and somewhat randomly, DNS will fail to resolve a domain.  
>Sometimes it will fail to resolve it for >4 hours and trigger a
>diagnostic 
>bounce from my sendmail.
>
>When I run tests at the command line (dig) on the domains in question
>(the 
>ones I've seen in email bounces) I will often very quickly get a
>resolve 
>failure, but usually 5s later another dig to the same domain will
>resolve 
>100% ok!
>
>Every box I am testing on has a similar config with BIND named 9.10
>(9.8 
>on one box) running as the local recursive resolver.  /etc/resolv.conf
>on 
>all is 127.0.0.1.  So that means every lookup that isn't cached is
>going 
>to the root NS's.
>
>When it fails to resolve, named.log logs an entry like:
>20-Apr-2016 00:37:28.276 query-errors: debug 1: client 127.0.0.1#33971 (artscouncil.mb.ca): query failed (SERVFAIL) for artscouncil.mb.ca/IN/A at query.c:7769
>20-Apr-2016 00:37:28.276 query-errors: debug 2: fetch completed at resolver.c:3658 for artscouncil.mb.ca/A in 0.215778: SERVFAIL/success [domain:artscouncil.mb.ca,referral:1,restart:3,qrysent:2,timeout:0,lame:0,neterr:2,badresp:0,adberr:0,findfail:0,valfail:0]
>
>For manual tests it only logs one; for the real-life sendmail problem
>ones, I'll see dozens/hundreds of the same thing, trying hour after hour
>(usually around the sendmail queue retry times).
>
>One of my boxes is *extremely* well connected in the US, and while it 
>seems to have errors slightly less often, it still has them.  All the
>rest 
>are on various levels of Shaw or MTS, res or business.
>
>This seems to have just started popping up maybe 6 months ago.  It
>feels 
>like it's getting worse.
>
>I've set up an easy test, on the actual domains with the most problems:
>
>rndc flush
>dig +short sportmanitoba.ca
>dig +short gymcan.org
>dig +short brandoneagles.ca
>dig +short interactivegym.org
>dig +short artscouncil.mb.ca
>
>rndc flush
>dig +trace sportmanitoba.ca
>dig +trace gymcan.org
>dig +trace brandoneagles.ca
>dig +trace interactivegym.org
>dig +trace artscouncil.mb.ca
>
>Maybe some others can run those tests on their boxes (but only if
>you're 
>running BIND as caching resolver, which many/most people won't be).
>
>Here's where it gets interesting: the +short tests I can get to fail on at
>least 1 of the domains (at random) about 1/8 of the time!  On at least 5
>different boxes out there!  But +trace has never failed once on any box or
>any domain.  It's like +trace does something different, maybe slowing
>the 
>process down or something, that allows it to always succeed.  (Failure
>for 
>the +short is a missing line in output, +trace you have to look at the 
>domain/ip returned near the bottom.)
>
>So, is it just these particular domains??  Something wrong on their
>(DNS) 
>side?  Or is it more domains, not just these?  Is there any way to 
>diagnose what exactly is failing?  I find it bizarre that *all* of
>these 
>domains regularly go down for 4+ hours causing an email bounce!?!  Or
>is 
>there something horribly wrong on my BIND caching DNS servers?
>
>Perhaps they are slow, and my BINDs are just not waiting long enough.
>Is there a way to tell BIND to be more patient waiting for DNS packets to
>come in?
>
>Maybe it's something regarding IPv6?  (I'm doing this all in IPv4 and
>have 
>no current interest in 6.  And I'm only looking for A records.)
>
>I have packet traces of the above sample commands when the lookups
>fail, 
>but I can't really figure out what it's doing, other than one boatload
>of 
>traffic for a tiny dns query.  I can provide a trace privately on
>demand 
>if you think you can help.
>
>Even more odd, I never seem to have a problem with interactive things
>like 
>web browsing.  If this was a problem with all domains, I should see
>this 
>in firefox all the time, but I don't.  Maybe Firefox doesn't even obey 
>resolv.conf and does its own thing, or retries heavily itself?
>
>I also checked to ensure my iptables aren't dropping packets related to
>this.
>
>Lastly, answers of "just use 8.8.8.8" aren't helpful because I also need
>to handle dynamic local, and in some cases, external DNS (often with
>multiple views), all in the same BIND/box (and I like uniformity across
>boxes for ease of admin).  Sure, I could try another resolver, but I see
>no reason BIND can't be made to work, as it has for me for 20 years.  And
>if this is a BIND bug, I want to submit it to help solve it.
>
>Thanks guys!
