Unable to mount previously-working NFS share from NIM to LPAR

Right, now that I've finally worked out this website, I'll ask my question!

I am having an absolute nightmare with NFS on AIX. I have used it many times and I know what I'm doing, but I cannot fathom what is going on here. I have 2 LPARs sitting on the same physical host. They are configured with an internal and an external network; the internal network is being used here. Nothing has changed in the network connections since this was working. However, when I try to mount any exported filesystem from NIM on LPAR1, I get a timeout:

nfsmnthelp: NIMsvr: Connection timed out

I have checked the following:

  1. /etc/hosts is correct on both, and I have tried using both networks
  2. NFS is started on both NIM and LPAR1. I have tried restarting the services using `stopsrc -g nfs; stopsrc -s portmap` then starting them again
  3. Stopping services, then running `rm -rf /etc/state /etc/sm /etc/sm.bak /etc/xtab /etc/rmtab; startsrc -s portmap; startsrc -g nfs; exportfs -a; showmount -e NIMsvr`. The last command shows the mount is available
  4. Removing the export from NIM and the mount from LPAR1, restarting NFS on both, then adding the export back in and re-mounting (I checked `showmount -e` before and after adding it back in, and the export shows up the second time)
  5. Telnet to port 111 from LPAR1 to NIM works fine

I am out of ideas, can anyone help please? I am about to pull my last few hairs out!

I'm not an AIX expert, so I am only commenting from a generic point of view. That said, have you verified which NFS versions are in use on each side?

On modern Unix systems and storage NFS can come in Versions 2, 3 and 4.

If one implementation is a later version than the other, you can specify the version (2,3, or 4) on the mount command line.

Trying to interoperate between different versions often gives rise to odd behaviour, errors, and malfunctions.
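As far as I know, AIX lets you pin the NFS version with the `vers` mount option. A minimal sketch, assuming the export is /export/archive and using a made-up mount point /mnt/archive; the helper name nfs_mount_cmd is hypothetical and only prints the command so you can review it before running it as root:

```shell
# Print an NFS mount command pinned to a single protocol version.
# Arguments: version (2, 3 or 4), server, exported path, local mount point.
# Nothing is mounted here; the command is only echoed for review.
nfs_mount_cmd() {
    vers=$1
    server=$2
    remote=$3
    local_dir=$4
    echo "mount -o vers=${vers} ${server}:${remote} ${local_dir}"
}

# /mnt/archive is a hypothetical mount point; substitute your own.
nfs_mount_cmd 3 NIMsvr /export/archive /mnt/archive
```

If the mount succeeds with one version and times out with another, that points at a version mismatch rather than the network.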

I'm still thinking about it and if I come up with anything else I'll post again.

And, of course, the access rights must allow the connection: both the options on the NFS share and the permission bits on the directory itself.

So for testing only, you could set the rights on the shared directory to 777 and share the NFS handle '-o rw,root' to allow the incoming NFS mount request to get root rights. Dangerous to leave it like that but it will tell you something if it then works.

Hi,

Thanks for the suggestion, but unfortunately this didn't work. I get the same error as before.

When you said "share the NFS handle '-o rw,root'", I assume you meant to put this into the exports file for that share?

Yes, indeed. Again speaking generically something like:

# share -F nfs -o rw,root  <directory>

or

# share -F nfs -o rw,root=<client>  <directory>

if <client> is in the hosts file of the NFS serving node.

OK yeah so that doesn't work. The share command is also not available on my NIM server:

[root@NIMsvr export]$ share -F nfs -o rw,root /export/archive
share: 1831-186 nfs not found in /etc/exports
share: 1831-186 -o not found in /etc/exports
share: 1831-186 rw,root not found in /etc/exports
share: 1831-190 unknown option: root

Perhaps I've missed something but I'm not sure how relevant that post is. They show use of mountd, then discuss use of TCP and UDP, and the post doesn't get resolved...have I missed something in that thread?

No, you didn't miss anything in that thread, but it did say that different NFS versions use different protocols. The moderator who said that, Bakunin, is very knowledgeable on AIX.

He might chip in when he sees your thread.

Ah OK, that's fine; I was already aware of that. I'm using NFSv3. The TCP and UDP ports are available, and I can reach NIM from LPAR1 on the TCP port:

[root@NIMsvr /]$ rpcinfo -p | grep mountd
    100005    1   tcp  57906  mountd
    100005    2   tcp  57906  mountd
    100005    3   tcp  57906  mountd
    100005    1   udp  38084  mountd
    100005    2   udp  38084  mountd
    100005    3   udp  38084  mountd
[root@LPAR1 ~]$ telnet nimsvr 57906
Trying...
Connected to NIMsvr.
Escape character is '^]'.

Check RPC is running on both servers.

rpcinfo -p <remote server>

It is indeed running. I have recently restarted the NIM server, and it took a while to come up. It was stalling at starting up the rpc.mountd daemon. Then it finished OK:

0513-059 The nfsd Subsystem has been started. Subsystem PID is 5177506.
05/23 09:10:51 tftpd: [00000001] EZZ7001I Starting.
0513-059 The rpc.mountd Subsystem has been started. Subsystem PID is 3080418.
0513-059 The rpc.statd Subsystem has been started. Subsystem PID is 4325578.
0513-059 The rpc.lockd Subsystem has been started. Subsystem PID is 3277042.
Completed NFS services.

See output of the rpcinfo command below:

UKDCMMORA-1:>rpcinfo -p NIMsvr
   program vers proto   port  service
    100000    4   udp    111  portmapper
    100000    3   udp    111  portmapper
    100000    2   udp    111  portmapper
    100000    4   tcp    111  portmapper
    100000    3   tcp    111  portmapper
    100000    2   tcp    111  portmapper
    100003    2   udp   2049  nfs
    100003    3   udp   2049  nfs
    100003    2   tcp   2049  nfs
    100003    3   tcp   2049  nfs
    100003    4   tcp   2049  nfs
    200006    1   udp   2049
    200006    4   udp   2049
    200006    1   tcp   2049
    200006    4   tcp   2049
    100005    1   tcp  32768  mountd
    100005    2   tcp  32768  mountd
    100005    3   tcp  32768  mountd
    100005    1   udp  32805  mountd
    100005    2   udp  32805  mountd
    100005    3   udp  32805  mountd
    400005    1   udp  32806
    100021    1   udp  32971  nlockmgr
    100021    2   udp  32971  nlockmgr
    100021    3   udp  32971  nlockmgr
    100021    4   udp  32971  nlockmgr
    100021    1   tcp  32770  nlockmgr
    100021    2   tcp  32770  nlockmgr
    100021    3   tcp  32770  nlockmgr
    100021    4   tcp  32770  nlockmgr
    100024    1   tcp  32774  status
    100024    1   udp  33021  status
    100133    1   tcp  32774
    100133    1   udp  33021
    200001    1   tcp  32774
    200001    1   udp  33021
    200001    2   tcp  32774
    200001    2   udp  33021
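
The mountd port lookup above can be scripted rather than read off by eye, which helps because the port changed after the restart. A rough sketch, assuming a POSIX shell with awk; the function name mountd_tcp_ports is hypothetical. It filters `rpcinfo -p` output for RPC program 100005 (mountd) and prints the distinct TCP ports:

```shell
# Read `rpcinfo -p` output on stdin and print each distinct TCP port
# registered by mountd (RPC program number 100005), one per line.
mountd_tcp_ports() {
    awk '$1 == "100005" && $3 == "tcp" { print $4 }' | sort -un
}

# Intended usage from LPAR1 (not run here):
#   rpcinfo -p NIMsvr | mountd_tcp_ports
# Each port printed can then be tested with telnet as before.
```

Note that this only covers TCP; a telnet test says nothing about whether the UDP registrations are reachable.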

Hi, I've seen something like this on systems with both an external and an internal network. Make sure your routing is set up correctly on the LPARs. I know, that's pretty generic advice.

Also, is there any firewall between your systems? That is another common reason for timeouts.
If there is a firewall, consider updating /etc/services with new mountd entries, then refresh the rpc.mountd service. By default, I think mountd can use any port between 32768 and 65535, so forcing it to a fixed port reduces the set of firewall rules that need to be modified.
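
To illustrate the /etc/services idea, here is a sketch that only prints the two lines to add. The port 40000 is an arbitrary example, not a recommendation, and the function name mountd_service_entries is made up; I am assuming rpc.mountd picks up `mountd` entries from /etc/services when it starts, so verify with rpcinfo afterwards:

```shell
# Print the /etc/services lines that would pin mountd to one port
# for both TCP and UDP. Nothing is written to /etc/services here.
mountd_service_entries() {
    port=$1
    printf 'mountd\t\t%s/tcp\n' "$port"
    printf 'mountd\t\t%s/udp\n' "$port"
}

# 40000 is a hypothetical port; pick one that is free on the NIM server.
mountd_service_entries 40000

# Assumed follow-up on NIM, as root, after adding the lines by hand:
#   stopsrc -s rpc.mountd; startsrc -s rpc.mountd
#   rpcinfo -p | grep mountd    # confirm mountd now reports the fixed port
```

With mountd fixed, the firewall only needs rules for 111, 2049 and that one port, although the lock and status daemons would still be on dynamic ports.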