NFS with a NAS: permanently inconsistent directory state across clients

Hi,

I am having some NFS directory consistency problems with the setup below on a local (192.) network:

  1. The same NFS directory shows different permissions (mode bits) on different clients.
  2. (More serious) An NFS directory created on client1 cannot be accessed on client2. This affects some directories but not others; when a directory is affected, the problem is persistent.

Setup:

NFS server: Thecus N8800, 16 TB raw, RAID6
Client1: Sun Fire V210, Solaris 5.10 Generic_139555-08
Client2: Sun Fire V100, Solaris 5.10 Generic_118822-23

Both clients nfs-mount. Flags: vers=3,proto=tcp,sec=sys,hard,intr,link,symlink,acl,rsize=32768,wsize=32768,retrans=5,timeo=600
Attr cache: acregmin=3,acregmax=60,acdirmin=30,acdirmax=60

Use case on Client1:

CD to an nfs subdir:

cd /.../nfsdir
ls -la
drwxrwx---+ 56 user group   12288 Mar 13 15:28 .
drwxrwx---+  3 user group      30 Mar 17 13:57 ..
drwxrwx---+  3 user group   53248 Oct  3 04:41 somedir1
drwxrwxrwx+  7 user group    4096 Mar 13 15:29 somedir2

All good. CD to somedir1 works; can LS and see files. Same for somedir2. Note: somedir2 was mkdir'ed on Client1.

Use case on Client2:

CD to the same nfs subdir. Listing files works, but the permissions differ from those listed on Client1:

cd /.../nfsdir
ls -la
drwx------+ 56 user group     12288 Mar 13 14:28 .
drwx------+  3 user group        30 Mar 17 12:57 ..
drwx------+  3 user group     53248 Oct  3 04:41 somedir1
drwx---rwx+  7 user group      4096 Mar 13 14:29 somedir2

PROBLEM1: the group permission for the same dir is different on client1 vs. client2.
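To make the mismatch easier to see, here is a minimal Python sketch that compares the mode strings from the two listings above, entry by entry. The function names (modes_by_name, diff_modes) are made up for illustration, and the parsing assumes the usual nine-column `ls -la` layout with no spaces in file names.

```python
def modes_by_name(listing):
    """Map each entry name to its mode string from `ls -la` style output."""
    modes = {}
    for line in listing.strip().splitlines():
        fields = line.split()
        # Expected columns: mode, links, owner, group, size, month, day, time, name
        if len(fields) >= 9:
            modes[fields[8]] = fields[0]
    return modes


def diff_modes(listing_a, listing_b):
    """Return {name: (mode_a, mode_b)} for entries whose modes differ."""
    a, b = modes_by_name(listing_a), modes_by_name(listing_b)
    return {
        name: (a[name], b[name])
        for name in sorted(a.keys() & b.keys())
        if a[name] != b[name]
    }


# Listings copied from the two clients in this post.
client1 = """
drwxrwx---+ 56 user group   12288 Mar 13 15:28 .
drwxrwx---+  3 user group      30 Mar 17 13:57 ..
drwxrwx---+  3 user group   53248 Oct  3 04:41 somedir1
drwxrwxrwx+  7 user group    4096 Mar 13 15:29 somedir2
"""
client2 = """
drwx------+ 56 user group     12288 Mar 13 14:28 .
drwx------+  3 user group        30 Mar 17 12:57 ..
drwx------+  3 user group     53248 Oct  3 04:41 somedir1
drwx---rwx+  7 user group      4096 Mar 13 14:29 somedir2
"""

for name, (mode1, mode2) in diff_modes(client1, client2).items():
    print(f"{name}: client1={mode1} client2={mode2}")
```

Every entry differs only in the group triplet: client2 shows "---" where client1 shows "rwx".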

CD to somedir1 works; can LS and see files.
PROBLEM2: *cannot* CD to somedir2:

bash: cd: somedir2/: Not a directory

On both clients, 'group' is defined in /etc/group with the same id; 'user' is defined in /etc/passwd with the same id.

When I un-mount and re-mount the nfs dir on client2, I am able to access the directory in question (somedir2). Permissions, however, are still different across clients.

Does anyone have suggestions as to what is going wrong with my NFS setup? I'll be happy to post more information.

Thanks a lot!

What is your NFS server? What does nfsstat show on the server and on each client?

NFS server is a Thecus N8800 Network Attached Storage. It runs Linux and is configurable via a web interface. Unfortunately, I don't see much useful info via the web UI on NFS status. I also have not tried rooting the box.

"nfsstat -m" on the client is in my original post.

nfsstat on client:

Server rpc:
Connection oriented:
calls      badcalls   nullrecv   badlen     xdrcall    dupchecks  dupreqs    
48         0          0          0          0          0          0          
Connectionless:
calls      badcalls   nullrecv   badlen     xdrcall    dupchecks  dupreqs    
0          0          0          0          0          0          0          

Server nfs:
calls     badcalls  
0         0         
Version 2: (0 calls)
null     getattr  setattr  root     lookup   readlink read     wrcache  
0 0%     0 0%     0 0%     0 0%     0 0%     0 0%     0 0%     0 0%     
write    create   remove   rename   link     symlink  mkdir    rmdir    
0 0%     0 0%     0 0%     0 0%     0 0%     0 0%     0 0%     0 0%     
readdir  statfs   
0 0%     0 0%     
Version 3: (0 calls)
null        getattr     setattr     lookup      access      readlink    
0 0%        0 0%        0 0%        0 0%        0 0%        0 0%        
read        write       create      mkdir       symlink     mknod       
0 0%        0 0%        0 0%        0 0%        0 0%        0 0%        
remove      rmdir       rename      link        readdir     readdirplus 
0 0%        0 0%        0 0%        0 0%        0 0%        0 0%        
fsstat      fsinfo      pathconf    commit      
0 0%        0 0%        0 0%        0 0%        
Version 4: (0 calls)
null                compound            
0 0%                0 0%                
Version 4: (0 operations)
reserved            access              close               commit              
0 0%                0 0%                0 0%                0 0%                
create              delegpurge          delegreturn         getattr             
0 0%                0 0%                0 0%                0 0%                
getfh               link                lock                lockt               
0 0%                0 0%                0 0%                0 0%                
locku               lookup              lookupp             nverify             
0 0%                0 0%                0 0%                0 0%                
open                openattr            open_confirm        open_downgrade      
0 0%                0 0%                0 0%                0 0%                
putfh               putpubfh            putrootfh           read                
0 0%                0 0%                0 0%                0 0%                
readdir             readlink            remove              rename              
0 0%                0 0%                0 0%                0 0%                
renew               restorefh           savefh              secinfo             
0 0%                0 0%                0 0%                0 0%                
setattr             setclientid         setclientid_confirm verify              
0 0%                0 0%                0 0%                0 0%                
write               release_lockowner   illegal             
0 0%                0 0%                0 0%                

Server nfs_acl:
Version 2: (0 calls)
null        getacl      setacl      getattr     access      getxattrdir 
0 0%        0 0%        0 0%        0 0%        0 0%        0 0%        
Version 3: (0 calls)
null        getacl      setacl      getxattrdir 
0 0%        0 0%        0 0%        0 0%        

Client rpc:
Connection oriented:
calls      badcalls   badxids    timeouts   newcreds   badverfs   timers     
35344415   0          0          0          0          0          0          
cantconn   nomem      interrupts 
0          0          0          
Connectionless:
calls      badcalls   retrans    badxids    timeouts   newcreds   badverfs   
6          1          0          0          0          0          0          
timers     nomem      cantsend   
3          0          0          

Client nfs:
calls     badcalls  clgets    cltoomany 
35090121  1         35089942  77        
Version 2: (5 calls)
null     getattr  setattr  root     lookup   readlink read     wrcache  
0 0%     3 60%    0 0%     0 0%     0 0%     0 0%     0 0%     0 0%     
write    create   remove   rename   link     symlink  mkdir    rmdir    
0 0%     0 0%     0 0%     0 0%     0 0%     0 0%     0 0%     0 0%     
readdir  statfs   
0 0%     2 40%    
Version 3: (35000040 calls)
null         getattr      setattr      lookup       access       readlink     
0 0%         6894383 19%  105714 0%    295215 0%    87658 0%     0 0%         
read         write        create       mkdir        symlink      mknod        
13428324 38% 13929856 39% 34582 0%     1829 0%      0 0%         0 0%         
remove       rmdir        rename       link         readdir      readdirplus  
2988 0%      147 0%       117 0%       0 0%         423 0%       4927 0%      
fsstat       fsinfo       pathconf     commit       
76 0%        4 0%         0 0%         213797 0%    
Version 4: (0 calls)
null                compound            
0 0%                0 0%                
Version 4: (0 operations)
reserved            access              close               commit              
0 0%                0 0%                0 0%                0 0%                
create              delegpurge          delegreturn         getattr             
0 0%                0 0%                0 0%                0 0%                
getfh               link                lock                lockt               
0 0%                0 0%                0 0%                0 0%                
locku               lookup              lookupp             nverify             
0 0%                0 0%                0 0%                0 0%                
open                openattr            open_confirm        open_downgrade      
0 0%                0 0%                0 0%                0 0%                
putfh               putpubfh            putrootfh           read                
0 0%                0 0%                0 0%                0 0%                
readdir             readlink            remove              rename              
0 0%                0 0%                0 0%                0 0%                
renew               restorefh           savefh              secinfo             
0 0%                0 0%                0 0%                0 0%                
setattr             setclientid         setclientid_confirm verify              
0 0%                0 0%                0 0%                0 0%                
write               
0 0%                

Client nfs_acl:
Version 2: (1 calls)
null        getacl      setacl      getattr     access      getxattrdir 
0 0%        0 0%        0 0%        1 100%      0 0%        0 0%        
Version 3: (90042 calls)
null        getacl      setacl      getxattrdir 
0 0%        90042 100%  0 0%        0 0%        

Umm, no, you didn't provide nfsstat data in your original post.

What's the inode number of the problem directories and/or files?

I suspect the problem is with the server - Solaris is VERY particular about NFS implementations being EXACTLY according to the NFS spec.

Can a 64-bit process access the files/directories?

Ensure the mount points are directories with 755 permission! (umount to check.)

That too.

I see that nfsstat data, and I see non-zero values in the NFSv2 stats. Is the server Linux-based? Or Irix-based? There's nothing like inode numbers greater than 2 gig: the files are invisible to 32-bit applications running on Solaris, since an inode number above 2 gig violates the specs for a 32-bit system.

The inode number of the problem file in question is: 1080094
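As a quick sanity check on the large-inode theory, here is a minimal Python sketch. It assumes the "2 gig" concern means an inode number that does not fit in a signed 32-bit integer; the helper name fits_in_signed_32 is made up for illustration.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit field can hold


def fits_in_signed_32(inode):
    """True if the inode number is representable as a signed 32-bit integer."""
    return 0 <= inode <= INT32_MAX


# Inode number reported for the problem directory in this thread.
print(fits_in_signed_32(1080094))  # True
```

At roughly one million, this inode number is far below the limit, so oversized inodes look like an unlikely cause here.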

Yes, the NFS server is Linux-based.

Re. ensure mount points are 755: yes, they are; otherwise I wouldn't be able to create any directories/files. Note that the problem occurs only with some newly created directories, but when it does occur, it is consistent (directory can never be read).

Regarding the other problem (perhaps related): even with freshly mounted nfs dirs, the two clients show different group permissions for all files/dirs:

Example of same dir:
Client1:
drwxrwx---+ 58 user group
Client2:
drwx------+ 58 user group

Both user and group are defined with same ids in /etc/passwd, /etc/group. What config might be faulty to warrant this behavior?

Thanks.

How many groups are the user(s) that have the problems in?

Linux NFSv2 used to have a non-compliant implementation where group entries would be truncated for users in more than 16 groups. And I see non-zero NFSv2 stats in what you've posted. Seems a bit of a stretch, but are you certain your file system mounts are ALWAYS done with "vers=3"? I remember a subcontractor who ignored direction from the customer and prime contractor to always use that mount option because their engineers were "smarter than the 'idiots' who told them that". Until one day the production system they delivered hiccuped during NFS negotiations and fell back to v2. On a system where many of the data files were MUCH larger than 2 GB....
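One rough way to answer the group-count question is to count a user's entries in the group file. The sketch below is Python, assuming the standard name:password:gid:members layout of /etc/group; the function name groups_for and the sample data are hypothetical, and the primary group from /etc/passwd is not included unless the user is also listed as a member.

```python
def groups_for(user, group_file_text):
    """Return the names of groups that list `user` as a member."""
    names = []
    for line in group_file_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(":")
        if len(parts) < 4:
            continue  # not a well-formed group entry
        members = [m for m in parts[3].split(",") if m]
        if user in members:
            names.append(parts[0])
    return names


# Hypothetical sample; on a real client, read the text from /etc/group instead.
SAMPLE = """\
group::100:user
staff::10:user,other
sysadmin::14:other
"""

found = groups_for("user", SAMPLE)
print(len(found), found)
print("more than 16 groups" if len(found) > 16 else "16 groups or fewer")
```

To run it against a real system, pass open("/etc/group").read() in place of SAMPLE.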

When's the last time your NFS server was rebooted?

The user is only in one group.

Re. "ALWAYS done with vers=3": no, I cannot confirm that. Judging by the nfs flags from nfsstat, it claims to be vers=3, but I did not add this in vfstab.

I will go ahead and add the vers=3 option in vfstab for all of the nfs mount points and observe and report back in a while once I've thoroughly tested. Thank you for this advice already.
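For checking that every NFS entry really carries the option, here is a minimal Python sketch. It assumes the usual seven-field Solaris vfstab layout (device to mount, device to fsck, mount point, FS type, fsck pass, mount at boot, mount options); the host name "nas", the paths, and the function name are hypothetical.

```python
def nfs_entries_missing_vers3(vfstab_text):
    """Return mount points of NFS entries whose options lack vers=3."""
    missing = []
    for line in vfstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # Skip malformed lines and non-NFS file systems.
        if len(fields) != 7 or fields[3] != "nfs":
            continue
        options = fields[6].split(",")
        if "vers=3" not in options:
            missing.append(fields[2])
    return missing


# Hypothetical vfstab content; server name and paths are placeholders.
SAMPLE = """\
#device         device  mount      FS    fsck  mount    mount
#to mount       to fsck point      type  pass  at boot  options
nas:/raid/data  -       /mnt/nas   nfs   -     yes      vers=3,proto=tcp,hard,intr
nas:/raid/home  -       /mnt/home  nfs   -     yes      hard,intr
/dev/dsk/c0t0d0s0 /dev/rdsk/c0t0d0s0 / ufs 1 no -
"""

print(nfs_entries_missing_vers3(SAMPLE))  # ['/mnt/home']
```

An entry that shows up in the result would negotiate its version at mount time rather than being pinned to v3.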

Re. "last time rebooted": 400+ days ago at the time of writing the original post. I have rebooted today.

What do the permissions look like when viewed on the NFS server?

A reply for posterity and momentary closure:

[Not in response to the problem stated in my original post] I ended up rebuilding the NAS (serving via NFS) after replacing all the disks and upgrading firmware. I don't know what un/related problems this may have solved.

I added the NFS v3 option (vers=3) to vfstab. Nevertheless, I continue to see a few client NFS v2 calls via 'nfsstat'. They are only a few and I wonder whether they originate from polling-like calls, i.e. "send a few v2, v3, v4 calls, see which ones work". Just a guess.
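To keep an eye on those v2 calls over time, the per-version totals can be pulled out of saved nfsstat output. This is a minimal Python sketch that assumes the "Version N: (X calls)" lines look like the ones posted earlier in this thread; the function name calls_per_version is made up, and the sample is an abbreviated excerpt of that output.

```python
import re


def calls_per_version(nfsstat_text, section="Client nfs:"):
    """Return {version: calls} from the 'Version N: (X calls)' lines of one section."""
    counts = {}
    in_section = False
    for line in nfsstat_text.splitlines():
        stripped = line.strip()
        # Section headers look like "Client nfs:" or "Server nfs_acl:".
        if re.fullmatch(r"(Server|Client) \w+:", stripped):
            in_section = stripped == section
            continue
        if in_section:
            m = re.match(r"Version (\d+): \((\d+) calls\)", stripped)
            if m:
                counts[int(m.group(1))] = int(m.group(2))
    return counts


# Abbreviated excerpt of the nfsstat output posted earlier in this thread.
SAMPLE = """\
Server nfs:
calls     badcalls
0         0
Version 2: (0 calls)
Version 3: (0 calls)
Version 4: (0 calls)

Client nfs:
calls     badcalls  clgets    cltoomany
35090121  1         35089942  77
Version 2: (5 calls)
Version 3: (35000040 calls)
Version 4: (0 calls)
Version 4: (0 operations)

Client nfs_acl:
Version 2: (1 calls)
Version 3: (90042 calls)
"""

print(calls_per_version(SAMPLE))
```

Comparing the version 2 count between two runs taken some time apart would show whether v2 calls are still being issued after the vfstab change or are left over from earlier.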

I have not yet re-encountered the problem, but will report back if I do. Thanks for all the suggestions for debugging.