[RndTbl] very strange DNS errors
theodore at ciscodude.net
Wed Apr 20 12:24:20 CDT 2016
> On Apr 20, 2016, at 2:03 AM, Trevor Cordes <trevor at tecnopolis.ca> wrote:
> When I run tests at the command line (dig) on the domains in question (the
> ones I've seen in email bounces) I will often very quickly get a resolve
> failure, but usually 5s later another dig to the same domain will resolve
> 100% ok!
I often use a utility called "check-soa" to check that each of the nameservers listed in the last NS response responds with an SOA.
When I ran it against each of the 5 domains below, I did see occasional delays in the command-line output. Since none of the reported latency values went up, I'm guessing the delay came from the original NS response.
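For anyone without check-soa handy, the same idea can be approximated with dig. This is only a rough sketch, not the real check-soa tool: it assumes dig is installed, and check_soa is a made-up function name. It asks each listed nameserver directly (no recursion) for the zone's SOA and prints the serial it gets back.

```shell
# check_soa DOMAIN: query every NS of DOMAIN for its SOA and print "nameserver serial".
# Hypothetical helper, loosely modelled on what check-soa reports.
check_soa() {
    domain=$1
    # Lines starting with ';' are dig diagnostics (e.g. timeouts), not nameserver names.
    for ns in $(dig +short NS "$domain" | grep -v '^;'); do
        # +norecurse asks the server for its own data; +time/+tries keep a dead server from stalling the loop.
        serial=$(dig +norecurse +short +time=2 +tries=1 SOA "$domain" "@$ns" |
            awk '$1 !~ /^;/ && NF >= 7 {print $3; exit}')
        echo "$ns ${serial:-NO-RESPONSE}"
    done
}

# Example: check_soa artscouncil.mb.ca
```

A nameserver that prints NO-RESPONSE, or a serial that differs from the others, would be the one to look at first.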
> Every box I am testing on has a similar config with BIND named 9.10 (9.8
> on one box) running as the local recursive resolver. /etc/resolv.conf on
> all is 127.0.0.1. So that means every lookup that isn't cached is going
> to the root NS's.
> When it fails to resolve, named.log logs an entry like:
> 20-Apr-2016 00:37:28.276 query-errors: debug 1: client 127.0.0.1#33971 (artscouncil.mb.ca): query failed (SERVFAIL) for artscouncil.mb.ca/IN/A at query.c:7769
> 20-Apr-2016 00:37:28.276 query-errors: debug 2: fetch completed at resolver.c:3658 for artscouncil.mb.ca/A in 0.215778: SERVFAIL/success [domain:artscouncil.mb.ca,referral:1,restart:3,qrysent:2,timeout:0,lame:0,neterr:2,badresp:0,adberr:0,findfail:0,valfail:0]
> For manual tests it only logs one, for the real-life sendmail problem
> ones, I'll see dozen/hundreds of the same thing trying hour after hour
> (usually around the sendmail queue retry times).
> One of my boxes is *extremely* well connected in the US, and while it
> seems to have errors slightly less often, it still has them. All the rest
> are on various levels of Shaw or MTS, res or business.
> This seems to have just started popping up maybe 6 months ago. It feels
> like it's getting worse.
> I've setup an easy test, on the actual domains with the most problems:
> rndc flush
> dig +short sportmanitoba.ca
> dig +short gymcan.org
> dig +short brandoneagles.ca
> dig +short interactivegym.org
> dig +short artscouncil.mb.ca
> rndc flush
> dig +trace sportmanitoba.ca
> dig +trace gymcan.org
> dig +trace brandoneagles.ca
> dig +trace interactivegym.org
> dig +trace artscouncil.mb.ca
> Maybe some others can run those tests on their boxes (but only if you're
> running BIND as caching resolver, which many/most people won't be).
> Here's where it gets interesting: the +short tests I can get to fail at
> least 1 of the domains (at random) about 1/8 of the time! On at least 5
> different boxes out there! But +trace has never failed once on any box or
> any domain. It's like +trace does something different, maybe slowing the
> process down or something, that allows it to always succeed. (Failure for
> the +short is a missing line in output, +trace you have to look at the
> domain/ip returned near the bottom.)
AFAIK +trace doesn't use your caching resolver except to find out which nameservers/IPs to query for the root (.), so it is definitely doing something different.
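Because +trace walks the delegation chain itself, a clean +trace run says little about what named sees. One way to put a number on the "about 1/8" figure is to loop the +short test through the local resolver and list the names that come back empty. A minimal sketch, assuming dig is installed and the resolver listens on 127.0.0.1 as described; resolve_failures is a made-up helper name.

```shell
# resolve_failures DOMAIN...: print each domain for which the local resolver returned no A record.
# Hypothetical helper; 127.0.0.1 matches the resolv.conf setup described in the quoted message.
resolve_failures() {
    for d in "$@"; do
        # Drop dig's ';;' diagnostic lines so a timeout message is not mistaken for an answer.
        answer=$(dig +short A "$d" @127.0.0.1 | grep -v '^;' || true)
        [ -n "$answer" ] || echo "$d"
    done
}

# Example (rndc flush needs sufficient privileges):
#   rndc flush
#   resolve_failures sportmanitoba.ca gymcan.org brandoneagles.ca interactivegym.org artscouncil.mb.ca
```

Running the example in a loop and counting the printed names per pass would show whether the failures cluster on particular domains.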
> So, is it just these particular domains?? Something wrong on their (DNS)
> side? Or is it more domains, not just these? Is there any way to
> diagnose what exactly is failing? I find it bizarre that *all* of these
> domains regularly go down for 4+ hours causing an email bounce!?! Or is
> there something horribly wrong on my BIND caching DNS servers?
4 of the 5 domains are on GoDaddy; the other has its DNS handled by Westman.
> Perhaps they are slow, and my BINDs are just not waiting long enough.
> Is there a way to tell BIND to be more patient waiting for DNS packets to
> come in?
Normally DNS has a multi-second timeout. I'm not sure of the technical details of how BIND handles SERVFAILs, but BIND does note which servers respond more quickly and favours those.
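The query-errors debug 2 line quoted earlier already carries counters that bear on the patience question: it shows timeout:0 and neterr:2, which suggests the fetch hit network errors rather than waiting out a timeout. To see whether that pattern holds across all the failures, the counters can be pulled out of the log. A small sketch, assuming the log format matches the quoted line; fetch_counters is a made-up name and the log path in the example is a guess.

```shell
# fetch_counters: read named query-errors lines on stdin and print
# "domain timeout=N neterr=N" for each "fetch completed" line that carries counters.
# Hypothetical helper; lines without the [domain:...] block are skipped.
fetch_counters() {
    sed -n 's/.*\[domain:\([^,]*\),.*timeout:\([0-9]*\),.*neterr:\([0-9]*\),.*/\1 timeout=\2 neterr=\3/p'
}

# Example (log path is an assumption, adjust to your logging config):
#   fetch_counters < /var/log/named.log | sort | uniq -c
```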
> Maybe it's something regarding IPv6? (I'm doing this all in IPv4 and have
> no current interest in 6. And I'm only looking for A records.)
If you don't have v6 addresses on things, generally they won't be asking for quad-A (AAAA) records, so I don't think this should be an issue for you.
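To check rather than guess, it may be worth confirming that the box really has no global IPv6 address. A sketch assuming Linux with the iproute2 "ip" command; has_global_v6 is a made-up helper name.

```shell
# has_global_v6: succeed (exit 0) if any interface has a global-scope IPv6 address.
# Hypothetical helper; relies on iproute2's "ip" being installed.
has_global_v6() {
    [ -n "$(ip -6 addr show scope global 2>/dev/null)" ]
}

# Example:
#   if has_global_v6; then echo "global IPv6 address present"; else echo "no global IPv6 address"; fi
```

If an address does turn up, starting named with its -4 option (IPv4 only) would be one way to take IPv6 out of the picture while testing.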
> I have packet traces of the above sample commands when the lookups fail,
> but I can't really figure out what it's doing, other than one boatload of
> traffic for a tiny dns query. I can provide a trace privately on demand
> if you think you can help.
> Even more odd, I never seem to have a problem with interactive things like
> web browsing. If this was a problem with all domains, I should see this
> in firefox all the time, but I don't. Maybe Firefox doesn't even obey
> resolv.conf and does its own thing, or retries heavily itself?
> I also checked to ensure my iptables aren't dropping packets related to
> Lastly, answers of "just use 126.96.36.199" aren't helpful because I also need
> to handle dynamic local, and in some cases, external DNS (often with
> multiple views), all in the same BIND/box (and I like uniformity across
> boxes for ease of admin). Sure, I could try another resolver, but I see
> no reason BIND can't be made to work, as it has for me for 20 years. And
> if this is a BIND bug, I want to submit it to help solve it.
> Thanks guys!
Theodore Baschak - AS395089 - Hextet Systems
https://ciscodude.net/ - https://hextet.systems/
https://theodorebaschak.com/ - http://mbix.ca/