One of two DNS servers going down causes impact

Our computing environment consists of Linux, Solaris, AIX, and Windows. The /etc/resolv.conf file on each *nix host has two nameserver entries. When the second one goes down, we see an impact on AIX-hosted services. We have been racking our brains over this, to no avail so far. We have not seen any impact on non-AIX-hosted services. While the second DNS server remains down, nslookup still returns hostnames immediately.

We are trying to avoid running tcpdump, so we tried to capture DNS traffic from the client with netstat. But netstat does not capture DNS traffic either.

Would you please give us a hand?

Why are you focused on network traffic analysis?

It seems best to audit the DNS server that is failing, or "going down" in your words.

What does "going down" mean? What is crashing, exactly, and why?

Hi Neo,
One of the two DNS servers failed only once, and that caused the impact. We are able to reproduce the problem. We want to find out why the impact was felt even though the other DNS server was fine. The impact was reported only for AIX-hosted services.

--- Post updated at 05:45 AM ---

Since we have not found any cause at the upper layers, we want to investigate at the "netstat" layer. There we found that netstat does not report DNS requests.

Yes, but what actually failed?

What process? What does "failed" mean?

DNS daemon process crashed? Needed to be restarted? A single DNS query failed?

What EXACTLY failed?

Hi Neo,
First-level failure: CPU usage on the DNS server (a Windows DC) spiked. This DNS server appears as the second server in /etc/resolv.conf.
Second (a result of the first): application servers were unable to connect to the DB server. Logs reported --unable to find DB connection stream

When we reproduced the situation (kept the second DNS server down), the application server was unable to connect to the DB server. But "nslookup <host>" worked.

--- Post updated at 07:22 AM ---

Neo,
Also, please note that fixing the root cause (the CPU spike or death of one DNS server) is not what I am after. I want to understand why resiliency did not work: why the app servers were unable to connect while only one DNS server out of two was down.

I've seen this mostly related to DNS query timeouts set on the client side.
The defaults are quite high on most Linux/Unix operating systems; see the AIX man page online.

In practice, if you have, for instance, two DNS servers and the first one in /etc/resolv.conf goes down...
The system will query the first with a timeout of 5 seconds and 4 attempts, totaling 20 seconds, before the second is tried.

This will certainly hit timeouts on the application side, i.e. the application will time out before the system returns a valid DNS answer.

As for nslookup working, I'm unsure. Is this from the same box?
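One possible explanation, offered as an assumption rather than a diagnosis: nslookup sends its own DNS queries and does not go through the system resolver path (getaddrinfo, /etc/netsvc.conf, /etc/hosts) that applications normally use, so a fast nslookup does not prove the application's lookups are fast. A minimal sketch for timing the application-style path without tcpdump, assuming a Python interpreter is available on the host (the function name time_lookup is made up for illustration):

```python
import socket
import time


def time_lookup(hostname, resolve=socket.getaddrinfo):
    """Time one name lookup through the system resolver.

    Returns (elapsed_seconds, result), where result is either the
    resolver's answer or the exception it raised.
    """
    start = time.monotonic()
    try:
        result = resolve(hostname, None)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        result = exc
    return time.monotonic() - start, result
```

Calling time_lookup("<host>") for the DB server name while the second DNS server is down, and comparing the elapsed time with the application's own connect timeout, would show whether the resolver delay alone can explain the failure.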

My suggestion is to change the defaults to lower values and/or implement a local DNS caching mechanism on the AIX box.

Hope that helps
Regards
Peasant.

Obviously the default timeout is too high.
Add two lines to /etc/resolv.conf

options timeout:2
options attempts:2

These values will give a total delay of 2 * 2 = 4 seconds when the first DNS server (nameserver) is down.
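The arithmetic used in this thread can be written down as a tiny sketch. It follows the simplified model from the posts above (the dead first server costs timeout times attempts before the next server answers); real resolver implementations may interleave retries across servers differently, so treat the numbers as an upper-bound estimate, not measured behaviour. The function name is hypothetical.

```python
def worst_case_wait(timeout_s, attempts):
    """Seconds spent on a dead first nameserver under the thread's model:
    every attempt waits the full timeout before the resolver gives up on it."""
    return timeout_s * attempts


# Defaults as quoted earlier in the thread: 5 second timeout, 4 attempts.
default_wait = worst_case_wait(5, 4)
# Values proposed above: options timeout:2 and options attempts:2.
tuned_wait = worst_case_wait(2, 2)

print("default:", default_wait, "s")
print("tuned:  ", tuned_wait, "s")
```

If the application's own connect or lookup timeout is shorter than the first figure but longer than the second, tuning these two options would be enough to hide a single dead nameserver from it.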

Further, ensure that local comes first for hosts in /etc/netsvc.conf (before any reference to bind or dns), so that, for example, a lookup for localhost is answered from /etc/hosts (it must be there, of course) without querying DNS.
See also: AIX ClearCase server is not responsive during DNS outage

All of the above, and you can also add the option
options rotate
to your /etc/resolv.conf, which makes your box use all nameservers, not just the first.
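To make the effect of rotate concrete, here is a small model of which nameserver gets tried first. It assumes plain round-robin over the configured list, which is a simplification (actual resolvers choose the starting server in their own way), and the names ns1 and ns2 are placeholders.

```python
def try_order(nameservers, query_number, rotate=False):
    """Return the order in which nameservers would be tried for one query.

    Without rotate, the configured order is always used. With rotate,
    the starting server advances by one for each successive query.
    """
    servers = list(nameservers)
    if not rotate or not servers:
        return servers
    start = query_number % len(servers)
    return servers[start:] + servers[:start]


servers = ["ns1", "ns2"]  # hypothetical names; ns2 is the one that is down
dead_first = sum(
    1 for q in range(10) if try_order(servers, q, rotate=True)[0] == "ns2"
)
print("queries that start at the dead server:", dead_first, "of 10")
```

Under this model, a dead second nameserver is never consulted first without rotate, while with rotate roughly every other query pays the timeout before falling back to the healthy server.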

Thanks all for answering. I still do not understand why impact is observed when the secondary is down and the primary is still up.

The "normal" approach for a root-cause analysis is to turn on verbose syslog logging for your DNS daemon processes and review the log files for errors and anomalies.

Did you mention your OS and what version of named (or whatever DNS daemon process) you are running, and your current logging configuration for this process?

Sorry if I missed it... please post "the details" again. Thanks.

The DNS servers are Windows 12. I do not have much control over those. Is there any way to increase the verbosity of logging on the AIX client?

The AIX clients are running AIX 6.x, as far as I remember. To me it's a puzzle that the services are impacted when the second server in /etc/resolv.conf goes down. I have not tested what happens if the first one goes down and the second one remains up.

THANKS

Actually, if the second nameserver in /etc/resolv.conf goes down, there is no impact at all, unless options rotate is set.

Are we sure the first one works and is accessible from the host? I agree with MadeInGermany: there should be no need to ever go to the second if the first works fine. You could switch on options debug to see what's really wrong.