[RndTbl] very strange DNS errors

Wed Apr 20 12:24:20 CDT 2016

> On Apr 20, 2016, at 2:03 AM, Trevor Cordes <trevor at tecnopolis.ca> wrote:
> 
> When I run tests at the command line (dig) on the domains in question (the 
> ones I've seen in email bounces) I will often very quickly get a resolve 
> failure, but usually 5s later another dig to the same domain will resolve 
> 100% ok!
> 

I often use a utility called "check-soa" to check that each of the nameservers listed in the NS response for a domain answers with an SOA:
https://github.com/bortzmeyer/check-soa
When running it against each of the 5 domains below, I did see occasional delays in the command-line output. Since none of the latency values it reported went up, I'm guessing the delay came from the initial NS lookup.
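
For anyone without check-soa installed, the same idea can be roughly approximated with dig alone. This is only a sketch of the approach, not how check-soa itself is implemented; the function name check_soa_sketch is made up for illustration, and it assumes bash, dig and awk are available.

```shell
#!/usr/bin/env bash
# Sketch only: ask every nameserver listed for a zone for its SOA record.
# check_soa_sketch is a hypothetical helper name, not part of check-soa.

check_soa_sketch() {
    local zone="$1" ns soa
    # Ask the default resolver for the zone's NS set (drop dig's ";;" notices).
    for ns in $(dig +short NS "$zone" | grep -v '^;'); do
        # Query that nameserver directly, without recursion, one short try.
        soa=$(dig +norecurse +short +time=3 +tries=1 SOA "$zone" "@${ns}" | grep -v '^;')
        if [ -n "$soa" ]; then
            # The third field of the short SOA output is the zone serial.
            echo "${ns} OK serial=$(echo "$soa" | awk '{print $3}')"
        else
            echo "${ns} FAILED"
        fi
    done
}
```

Calling "check_soa_sketch artscouncil.mb.ca" prints one line per nameserver. Unlike check-soa, it reports no latency figures.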

> Every box I am testing on has a similar config with BIND named 9.10 (9.8 
> on one box) running as the local recursive resolver.  /etc/resolv.conf on 
> all is 127.0.0.1.  So that means every lookup that isn't cached is going 
> to the root NS's.
> 
> When it fails to resolve, named.log logs an entry like:
> 20-Apr-2016 00:37:28.276 query-errors: debug 1: client 127.0.0.1#33971 (artscouncil.mb.ca): query failed (SERVFAIL) for artscouncil.mb.ca/IN/A at query.c:7769
> 20-Apr-2016 00:37:28.276 query-errors: debug 2: fetch completed at resolver.c:3658 for artscouncil.mb.ca/A in 0.215778: SERVFAIL/success [domain:artscouncil.mb.ca,referral:1,restart:3,qrysent:2,timeout:0,lame:0,neterr:2,badresp:0,adberr:0,findfail:0,valfail:0]
> 
> For manual tests it only logs one, for the real-life sendmail problem 
> ones, I'll see dozens/hundreds of the same thing trying hour after hour 
> (usually around the sendmail queue retry times).
> 
> One of my boxes is *extremely* well connected in the US, and while it 
> seems to have errors slightly less often, it still has them.  All the rest 
> are on various levels of Shaw or MTS, res or business.
> 
> This seems to have just started popping up maybe 6 months ago.  It feels 
> like it's getting worse.
> 
> I've setup an easy test, on the actual domains with the most problems:
> 
> rndc flush
> dig +short sportmanitoba.ca
> dig +short gymcan.org
> dig +short brandoneagles.ca
> dig +short interactivegym.org
> dig +short artscouncil.mb.ca
> 
> rndc flush
> dig +trace sportmanitoba.ca
> dig +trace gymcan.org
> dig +trace brandoneagles.ca
> dig +trace interactivegym.org
> dig +trace artscouncil.mb.ca
> 
> Maybe some others can run those tests on their boxes (but only if you're 
> running BIND as caching resolver, which many/most people won't be).
> 
> Here's where it gets interesting: the +short tests I can get to fail at 
> least 1 of the domains (at random) about 1/8 of the time!  On at least 5 
> different boxes out there!  But +trace has never failed once on any box or 
> any domain.  It's like +trace does something different, maybe slowing the 
> process down or something, that allows it to always succeed.  (Failure for 
> the +short is a missing line in output, +trace you have to look at the 
> domain/ip returned near the bottom.)
> 

AFAIK +trace only uses your caching resolver to find out which nameservers/IPs to query for the root (.). After that it walks the delegations itself, querying each level's nameservers directly, so it is definitely doing something different.
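
To put a number on how often the resolver path fails, the +short test can be wrapped in a small function that only reports the domains that came back empty. A minimal sketch, assuming bash and a resolver on 127.0.0.1 as described in the original message; short_failures is a hypothetical name, and it does not flush the cache itself, so run "rndc flush" first as in the original test.

```shell
#!/usr/bin/env bash
# Sketch only: report which domains return no A record via the local resolver.
# short_failures is a hypothetical helper name.

short_failures() {
    local server="127.0.0.1" domain answer
    for domain in "$@"; do
        # +short prints nothing on SERVFAIL; drop dig's ";;" notices as well.
        answer=$(dig +short A "$domain" "@${server}" | grep -v '^;')
        if [ -z "$answer" ]; then
            echo "FAIL ${domain}"
        fi
    done
}
```

Usage would be something like "rndc flush; short_failures sportmanitoba.ca gymcan.org brandoneagles.ca interactivegym.org artscouncil.mb.ca", repeated a number of times to see how often a FAIL line shows up.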

> So, is it just these particular domains??  Something wrong on their (DNS) 
> side?  Or is it more domains, not just these?  Is there any way to 
> diagnose what exactly is failing?  I find it bizarre that *all* of these 
> domains regularly go down for 4+ hours causing an email bounce!?!  Or is 
> there something horribly wrong on my BIND caching DNS servers?
> 

4 of the 5 domains are on GoDaddy; the other has its DNS handled by Westman.

> Perhaps they are slow, and my BINDs are just not waiting long enough.  
> Is there a way to tell BIND to be more patient waiting for DNS packets to 
> come in?
> 

Normally DNS has a multi-second timeout. I'm not sure of the technical details of how BIND handles SERVFAILs, but BIND does track which servers respond more quickly and favours those.
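
The query-errors line quoted earlier already carries per-fetch counters in square brackets. In that sample, timeout is 0, neterr is 2, and the fetch finished in about 0.2s, which would point at network-level errors rather than slow answers, if I'm reading the counters right. A minimal sketch for pulling the non-zero counters out of named.log follows; query_error_counters is a hypothetical helper name and it assumes the log format shown in the quoted lines.

```shell
#!/usr/bin/env bash
# Sketch only: print the non-zero counters from BIND "fetch completed" lines.
# Reads log lines on stdin; query_error_counters is a hypothetical name.

query_error_counters() {
    # Keep only the text inside [...], put one counter per line,
    # then drop the domain field and every counter that is zero.
    sed -n 's/.*\[\(.*\)\].*/\1/p' \
        | tr ',' '\n' \
        | grep -v '^domain:' \
        | grep -v ':0$'
}
```

For example, "grep 'fetch completed' named.log | query_error_counters | sort | uniq -c" would give a rough tally of which counters show up most across all the failures.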

> Maybe it's something regarding IPv6?  (I'm doing this all in IPv4 and have 
> no current interest in 6.  And I'm only looking for A records.)
> 

If you don't have v6 addresses on things, generally they won't be asking for AAAA (quad-A) records, so I don't think this should be an issue for you.

> I have packet traces of the above sample commands when the lookups fail, 
> but I can't really figure out what it's doing, other than one boatload of 
> traffic for a tiny dns query.  I can provide a trace privately on demand 
> if you think you can help.
> 
> Even more odd, I never seem to have a problem with interactive things like 
> web browsing.  If this was a problem with all domains, I should see this 
> in firefox all the time, but I don't.  Maybe Firefox doesn't even obey 
> resolv.conf and does its own thing, or retries heavily itself?
> 
> I also checked to ensure my iptables aren't dropping packets related to 
> this.
> 
> Lastly, answers of "just use 8.8.8.8" aren't helpful because I also need 
> to handle dynamic local, and in some cases, external DNS (often with 
> multiple views), all in the same BIND/box (and I like uniformity across 
> boxes for ease of admin).  Sure, I could try another resolver, but I see 
> no reason BIND can't be made to work, as it has for me for 20 years.  And 
> if this is a BIND bug, I want to submit it to help solve it.
> 
> Thanks guys!
> _______________________________________________
> Roundtable mailing list
> Roundtable at muug.mb.ca
> http://www.muug.mb.ca/mailman/listinfo/roundtable

Theodore Baschak - AS395089 - Hextet Systems
https://ciscodude.net/ - https://hextet.systems/
https://theodorebaschak.com/ - http://mbix.ca/
