NFS mount hangs

robotronic · July 15, 2008, 2:39am

Last week I installed Windows Services for UNIX on a Windows 2003 Server test machine (s2003) and exported a folder through NFS. The share is named "storage" and is used as a temporary place for very large export/dump files. Unfortunately, this is the only machine with a big, inexpensive disk that is accessible from both the production and test network environments.

I was able to mount the NFS share in read/write mode on two Solaris servers, using the command:

mount -F nfs -o hard,rsize=32768,wsize=32768 s2003:/storage /storage

The two machines are called, respectively:

eprod, SunOS 5.8
etest, SunOS 5.10

I used the mounted file system intensively from both machines without any problems. But today, while running "ls" on eprod, I noticed the message:

NFS server s2003 not responding still trying

On etest I haven't had any issues; the remote share is still mounted and working.

So I thought that maybe there had been a network problem over the weekend, and tried to umount and remount "/storage" on eprod:

eprod/root> umount s2003:/storage
nfs umount: /storage: is busy

eprod/root> fuser /storage        # The command hangs indefinitely (interrupted with ^C)
/storage:

eprod/root> fuser -c /storage     # The command works and it doesn't report anything
/storage:

So I tried the hard way, and it worked:

umount -f /storage

The problem now is that I am no longer able to mount the filesystem:

eprod/root> mount -F nfs -o hard,rsize=32768,wsize=32768 s2003:/storage /storage
NFS server s2003 not responding still trying

The command "hangs" indefinitely... I've already performed basic connection tests from eprod to s2003:

ping works
telnet s2003 on port 2049 works
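
The telnet check above can be sketched as a small script. This is only an illustration, not something run in this thread: the function name and the optional source_port parameter are made up for the example. One caveat it tries to capture: telnet connects from an ephemeral local port, while the netstat output further down shows the stuck SYN coming from the local port named "login" (conventionally 513), so a test from that specific port may behave differently. Binding to a port below 1024 normally needs root.

```python
import socket

def tcp_port_open(host, port, timeout=3.0, source_port=None):
    """Return True if a TCP handshake with host:port completes within timeout.

    source_port is a hypothetical convenience for this sketch: when given,
    the connection is made from that local port instead of an ephemeral one.
    """
    source = ("", source_port) if source_port is not None else None
    try:
        with socket.create_connection((host, port), timeout=timeout,
                                      source_address=source):
            return True
    except OSError:
        # Covers refused connections, timeouts, name lookup failures
        # and failures to bind the requested local port.
        return False

# Example calls (host name taken from this thread, not executed here):
#   tcp_port_open("s2003", 2049)                   # like the telnet test
#   tcp_port_open("s2003", 2049, source_port=513)  # as root, from port "login"
```

If the first call succeeds and the second one times out, that would point at something specific to the source port rather than at general connectivity.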

I've also tried to monitor network traffic with netstat and snoop while issuing the mount command:

eprod/root> netstat -a | grep s2003
eprod.login         s2003.nfsd          0      0 24820      0 SYN_SENT
eprod.53759         s2003.nfsd      65415      0 24820      0 TIME_WAIT
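
A minimal sketch for picking the half-open connections out of output like the above. It assumes the seven-column Solaris TCP layout (local address, remote address, Swind, Send-Q, Rwind, Recv-Q, state); the function names are invented for this example.

```python
def parse_netstat_tcp(text):
    """Parse Solaris-style 'netstat -a' TCP lines into dictionaries.

    Assumes seven whitespace-separated columns with four numeric
    counters in the middle; any other line is skipped.
    """
    conns = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 7:
            continue
        local, remote, swind, sendq, rwind, recvq, state = fields
        if not all(f.isdigit() for f in (swind, sendq, rwind, recvq)):
            continue  # header or unrelated line
        conns.append({"local": local, "remote": remote, "state": state})
    return conns

def stuck_handshakes(conns):
    """Return connections whose SYN has been sent but not yet answered."""
    return [c for c in conns if c["state"] == "SYN_SENT"]
```

Fed the two lines above, stuck_handshakes() returns only the eprod.login to s2003.nfsd entry, which is the connection the hanging mount is waiting on.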

eprod/root> snoop s2003
Using device /dev/ce (promiscuous mode)
       eprod ->  s2003   PORTMAP C GETPORT prog=100005 (MOUNT) vers=3 proto=UDP
       s2003 ->  eprod   PORTMAP R GETPORT port=1048
       eprod ->  s2003   MOUNT3 C Null
       s2003 ->  eprod   MOUNT3 R Null 
       eprod ->  s2003   MOUNT3 C Mount /storage
       s2003 ->  eprod   MOUNT3 R Mount OK FH=7593 Auth=none,unix
       eprod ->  s2003   PORTMAP C GETPORT prog=100003 (NFS) vers=3 proto=TCP
       s2003 ->  eprod   PORTMAP R GETPORT port=2049
       eprod ->  s2003   TCP D=2049 S=51997 Syn Seq=1199030867 Len=0 Win=24820 Options=<nop,nop,sackOK,mss 1460>
       s2003 ->  eprod   TCP D=51997 S=2049 Syn Ack=1199030868 Seq=2849360585 Len=0 Win=16384 Options=<mss 1460,nop,nop,sackOK>
       eprod ->  s2003   TCP D=2049 S=51997     Ack=2849360586 Seq=1199030868 Len=0 Win=24820
       eprod ->  s2003   NFS C NULL3
       s2003 ->  eprod   NFS R NULL3 
       eprod ->  s2003   TCP D=2049 S=51997     Ack=2849360614 Seq=1199030988 Len=0 Win=24820
       eprod ->  s2003   TCP D=2049 S=51997 Fin Ack=2849360614 Seq=1199030988 Len=0 Win=24820
       s2003 ->  eprod   TCP D=51997 S=2049     Ack=1199030989 Seq=2849360614 Len=0 Win=65415
       s2003 ->  eprod   TCP D=51997 S=2049 Fin Ack=1199030989 Seq=2849360614 Len=0 Win=65415
       eprod ->  s2003   TCP D=2049 S=51997     Ack=2849360615 Seq=1199030989 Len=0 Win=24820
       eprod ->  s2003   RLOGIN R port=2049 
       eprod ->  s2003   RLOGIN R port=2049 
       eprod ->  s2003   RLOGIN R port=2049 
       ...
       eprod ->  s2003   RLOGIN R port=2049 
       eprod ->  s2003   RLOGIN R port=2049 
       ...
       ...
       ... and so on
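
Since the tail of the trace is the same packet over and over, a short script can collapse consecutive repeats and make the retransmissions easier to spot. This is a rough sketch that assumes the "source -> destination summary" layout shown above; the function names are hypothetical.

```python
from itertools import groupby

def parse_snoop_line(line):
    """Split one 'src -> dst summary' snoop line into a tuple, or return None."""
    if "->" not in line:
        return None  # banner lines, "..." placeholders, blank lines
    src, rest = line.split("->", 1)
    parts = rest.split(None, 1)
    if not parts:
        return None
    summary = parts[1].strip() if len(parts) > 1 else ""
    return (src.strip(), parts[0], summary)

def collapse_repeats(text):
    """Return [(packet, count), ...], merging consecutive identical packets."""
    packets = [p for p in map(parse_snoop_line, text.splitlines()) if p is not None]
    return [(pkt, sum(1 for _ in grp)) for pkt, grp in groupby(packets)]
```

On the trace above, the result ends with a single ("eprod", "s2003", "RLOGIN R port=2049") entry carrying a count, with no reply from s2003 after it. snoop labels those packets RLOGIN because of the local port used, which matches the eprod.login line in the netstat output.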

I rule out problems on s2003 itself, because I can mount the "/storage" folder from other machines on the same network as eprod.

I've also found that using the UDP protocol instead of TCP (the default) works:

mount -F nfs -o hard,rsize=32768,wsize=32768,proto=udp s2003:/storage /storage

What could be the problem? How can I solve this issue?

Thanks in advance!

zaxxon · July 15, 2008, 4:31am

Did you try to stop/start the NFS daemon service on Windows? Maybe it's "locking" the connection for just those 2 hosts.

You can also check on the Sun boxes whether there is a /var/lib/nfs/rmtab (or something similar), which you might want to empty, or from which you could delete the entries for the 2 specific Sun boxes. Maybe there is a similar file on the Windows server which you can edit.

You can also check:
Linux NFS faq

If this doesn't help and you have already tested a lot, you might consider not using NFS (I don't like it; too many problems on some machines). Maybe try normal Windows shares (SMB) and install/use a Samba client on the Sun boxes?

robotronic · July 15, 2008, 5:15am

Yes, I've already stopped and restarted the NFS service on Windows, and on the Solaris boxes "/etc/rmtab" is empty, even on etest where the filesystem is mounted.

About Samba: as far as I know there is no Samba client for SunOS, except Sharity or Sharity Light, which I played with some time ago with a little success. However, since eprod is a production system, I can't and don't want to install anything on it, also because I don't need a persistent shared directory: it's only for temporarily staging a large amount of data, after which I will umount the shared filesystem from both Solaris boxes.

ramen_noodle · July 15, 2008, 1:20pm

Are you going to be using unix services for windows in production? Doesn't seem like a very wise decision if so, for reasons so obvious I won't go into them.

I've seen this behavior (UDP-based NFS working when TCP-based NFS fails) on an extremely busy network segment on an overutilized client. Are you monitoring these hosts via SNMP?

robotronic · July 16, 2008, 2:36am

You are right. In fact, I want to use that shared storage only for a big and not-so-critical data transfer between the production and test environments, and then umount it. Unfortunately, I don't have enough disk space on either eprod or etest; otherwise I would have created the files I need on eprod and then transferred them via ftp/scp to etest. With shared storage, I also eliminate the need for a long data transfer.

I am not monitoring the network, but I can't rule out that, for a short period, there was a high load on s2003 or network congestion. Even so, I can't believe that I can't repair this situation now; it seems so strange...

ramen_noodle · July 16, 2008, 2:51pm

Why don't you post the output of netstat on both hosts, restart mountd and portmap (or analogues) on the server, and then restart portmap on the client and attempt to remount? It could just be that the Windows nfsd and the Unix-like RPC are buggy.

robotronic · July 16, 2008, 3:30pm

I think the only remaining step I haven't tried yet is restarting the NFS client and/or related processes on eprod. Unfortunately, I don't have the knowledge to do that.

Could you provide some commands for restarting the NFS client service on Solaris 8 (and 10)? I've googled around, but my greatest concern is disabling something vital or disrupting current connections between clients and eprod, because eprod may also be sharing something over NFS that other clients are accessing.

ramen_noodle · July 16, 2008, 4:35pm

Sol 8 and 9 should be: /etc/init.d/rpc restart (or stop and start) and /etc/init.d/nfs.client restart (or stop and start).
Solaris 10 uses svcadm for service restarts and activation. The man page should be of help.

robotronic · July 18, 2008, 6:32am

I've rebooted s2003 and stopped and restarted nfs.client on eprod, but still no luck...

incredible · July 18, 2008, 6:51am

What does "dfshares" output?

robotronic · July 18, 2008, 9:00am

Here it is:

eprod/root> dfshares
RESOURCE                                  SERVER ACCESS    TRANSPORT
     eprod:/usr1                           eprod  -         -