Ypbind intermittent issue on Solaris 10

Hi,
I am seeing the messages below on a few Linux clients:

[root@rhel-client22 ~]# cat /var/log/messages | grep -i yp
Jan 24 13:40:42 rhel-client22-prod crond: YPBINDPROC_DOMAIN: Domain not bound
Jan 24 16:00:42 rhel-client22-prod crond: YPBINDPROC_DOMAIN: Domain not bound
[root@rhel-client22 ~]#

Here is the output from the debug screen where the error was printed:

79820: ping host 'wksp-dir1-prod', domain 'pre.abc.com'
79817: ypbindproc_domain_2_svc (pre.abc.com)
79817: Pinging all active servers.
79817: ping host 'wksp-dir1-prod', domain 'pre.abc.com'
79820: Pinging all active servers.
79820: ping host 'wksp-dir1-prod', domain 'pre.abc.com'
79817: trylock = failed
79817: trylock = success
79817: ping host 'wksp-dir1-prod', domain 'pre.abc.com'
79817: Status: YPBIND_FAIL_VAL
79820: Pinging all active servers.
79820: ping host 'wksp-dir1-prod', domain 'pre.abc.com'
79820: Answer for domain 'pre.abc.com' from server 'wksp-dir1-prod'
79820: Pinging all active servers.
79820: ping host 'wksp-dir1-prod', domain 'pre.abc.com'
79820: Pinging all active servers.
79820: ping host 'wksp-dir1-prod', domain 'pre.abc.com

Since this is happening on multiple clients, I checked the NIS master server. The timestamps in the master's log do not match the times when the issue occurs on the clients, but the messages may be related:

Jan 24 03:11:23 wksp-dir1-prod ypbind[22535]: [ID 337329 daemon.error] NIS server not responding for domain "pre.abc.com"; still trying
Jan 24 03:11:23 wksp-dir1-prod ypbind[24552]: [ID 647655 daemon.error] NIS server for domain "pre.abc.com" OK
Jan 24 05:01:58 wksp-dir1-prod ypbind[1685]: [ID 337329 daemon.error] NIS server not responding for domain "pre.abc.com"; still trying
Jan 24 05:11:48 wksp-dir1-prod ypbind[13158]: [ID 337329 daemon.error] NIS server not responding for domain "pre.abc.com"; still trying
Jan 24 05:20:30 wksp-dir1-prod ypbind[24312]: [ID 647655 daemon.error] NIS server for domain "pre.abc.com" OK
Jan 24 15:26:45 wksp-dir1-prod ypbind[7667]: [ID 337329 daemon.error] NIS server not responding for domain "pre.abc.com"; still trying
Jan 24 21:20:52 wksp-dir1-prod ypbind[15666]: [ID 647655 daemon.error] NIS server for domain "pre.abc.com" OK

How should I debug this? This server is itself the NIS server, so why would it fail to reach itself? Any pointers, please?

Thanks
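
To compare the server-side outages with the client errors, it can help to collapse the ypbind messages into outage windows. This is only a minimal sketch: it assumes the two message texts shown above, the helper name nis_outages is made up for illustration, and /var/adm/messages is assumed to be the log file. On Solaris 10 the old /usr/bin/awk may need to be replaced with nawk.

```shell
# Print "down-from -> up-at" pairs from ypbind syslog lines read on stdin.
# Consecutive "not responding" lines before an "OK" count as one outage.
nis_outages() {
    awk '
        /NIS server not responding/ {
            if (!down) { start = $1 " " $2 " " $3; down = 1 }
        }
        /NIS server for domain .* OK/ {
            if (down) { print start " -> " $1 " " $2 " " $3; down = 0 }
        }
        END { if (down) print start " -> (still down at end of log)" }
    '
}

# Usage (log path is an assumption, adjust as needed):
#   grep ypbind /var/adm/messages | nis_outages
```

A long window that overlaps a client-side "Domain not bound" timestamp would point at the server's own binding rather than at the clients.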

Check on the NIS server whether and where it binds:

ypwhich

Check /etc/nsswitch.conf:
it must have files first, and files nis only for a minimum of items (where it really makes sense).

Check with

snoop rpc ypprog

if a NIS client is excessively querying.

Check with

uptime

if the NIS server is busy/loaded with something.

Check with

ps -ef | grep '\<[y]p'

if the yp processes were restarted.
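
Because the problem is intermittent, a one-off check can easily miss it. A rough sketch of a periodic binding check follows, for bash or ksh. It assumes that ypwhich exits non-zero when the domain is not bound; log_binding and the log path /var/tmp/ypwhich.log are made-up names for illustration.

```shell
# Print one timestamped line with the current NIS binding.
# $1 is the command used to query the binding (defaults to ypwhich).
log_binding() {
    cmd=${1:-ypwhich}
    if server=$($cmd 2>/dev/null); then
        echo "$(date '+%b %d %H:%M:%S') bound to $server"
    else
        echo "$(date '+%b %d %H:%M:%S') NOT BOUND"
    fi
}

# Usage: sample every 10 seconds and keep a log to compare with syslog.
#   while :; do log_binding; sleep 10; done >> /var/tmp/ypwhich.log
```

The timestamps use the same month/day/time layout as syslog, so the two logs can be read side by side.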

Thanks for your advice. I ran the commands you suggested and can't find anything wrong so far:

bash-3.2# ypwhich
wksp-dir1-prod
bash-3.2# hostname
wksp-dir1-prod
bash-3.2# uptime
 10:53pm  up 4 day(s),  1:06,  2 users,  load average: 0.17, 0.27, 0.35
bash-3.2# ps -ef | grep '\<[y]p'
    root 24202 23554   1   Jan 20 ?          98:23 /usr/lib/netsvc/yp/ypserv -d
    root 24195 23554   0   Jan 20 ?           0:00 /usr/lib/netsvc/yp/rpc.yppasswdd -D /var/yp/maps -m
    root 24201 23554   0   Jan 20 ?           0:00 /usr/lib/netsvc/yp/ypxfrd
    root 24184 23554   0   Jan 20 ?           0:00 /usr/lib/netsvc/yp/rpc.ypupdated
    root 24231 23554   0   Jan 20 ?           0:25 /usr/lib/netsvc/yp/ypbind
bash-3.2#
bash-3.2# cat /etc/nsswitch.conf | grep -v "#"


passwd:     files nis
group:      files nis
hosts:      files dns
ipnodes:    files
networks:   files
protocols:  files
rpc:        files
ethers:     files
netmasks:   files
bootparams: files
publickey:  files
netgroup:   files nis
automount:  files
aliases:    files
services:   files
printers:       user files

auth_attr:  files
prof_attr:  files
project:    files nis

tnrhtp:     files
tnrhdb:     files
bash-3.2#
bash-3.2# svcs -a | grep yp
online         Jan_20   svc:/system/cryptosvc:default
bash-3.2#
bash-3.2# svcs -a | grep -i rpc/bind
online         Jan_20   svc:/network/rpc/bind:default
bash-3.2#

This is a non-global zone, so snoop can't run on bge1:1 here. I ran it from the global zone instead:
snoop -d bge1 rpc ypprog

But it gives me a huge amount of output, because a lot of servers are querying this master, and the NIS error doesn't appear frequently.
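
One way to make that output manageable is to count packets per source host instead of reading the raw stream. This is a sketch only: it assumes snoop's summary lines have the form "src -> dst ...", and top_talkers is a made-up helper name. The -c option limits the capture to a fixed number of packets.

```shell
# Count packets per source host in snoop summary lines read on stdin,
# busiest talkers first. Lines without "->" as second field are ignored.
top_talkers() {
    awk '$2 == "->" { n[$1]++ } END { for (h in n) print n[h], h }' | sort -rn
}

# Usage (capture a bounded sample rather than watching the live stream):
#   snoop -d bge1 -c 5000 rpc ypprog | top_talkers | head
```

A client that is querying excessively should stand out at the top of the list.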

Yes, your figures are okay.
How many zones are there? Maybe another zone is temporarily using too many resources?

There are 4 zones in total. It is a V240, an old server with only 4GB of RAM, but utilization looks okay. Here is the output from prstat -Z:

ZONEID    NPROC  SWAP   RSS MEMORY      TIME  CPU ZONE
     0       69  348M  366M   8.9% 204:53:40 1.3% global
    10       39   59M   74M   1.8%   1:58:48 0.7% wksp-dir1
     1       31  738M  448M    11% 398:20:41 0.1% wksp-logger1
     4       23   71M   80M   2.0%   4:22:11 0.0% wksp-ntp1
     5       34   86M   78M   1.9%   1:46:46 0.0% wksp-webserv1

For now, I am enabling sar on this server and will check in the morning whether there was any spike when the error occurs.
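
Once sar has collected data, the busy samples can be picked out with a small filter. This is a sketch under assumptions: it expects the Solaris "sar -u" column layout (time, %usr, %sys, %wio, %idle), busy_samples is a made-up helper name, and the 20% idle threshold is arbitrary. The -v option needs nawk or /usr/xpg4/bin/awk on Solaris 10.

```shell
# Print sar -u samples whose %idle is below a threshold (default 20).
# Header lines and the "Average" line are skipped by the field checks.
busy_samples() {
    awk -v min="${1:-20}" '
        $1 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ &&
        $5 ~ /^[0-9]+$/ &&
        $5 + 0 < min + 0 { print $1, "idle=" $5 }
    '
}

# Usage (the data file name depends on the day of the month):
#   sar -u -f /var/adm/sa/sa24 | busy_samples 20
```

Any sample it prints can then be compared with the times of the ypbind errors.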

Yes, that's a good idea.

Do you have a valid Oracle account?
Check for patches for the bge NIC (these are often not part of the recommended patch clusters).