Strange "No such file or directory" errors on NFS volumes

We're seeing very strange "No such file or directory" errors on NFS volumes on one of our SUSE servers. Can anyone please help?

We're seeing it on volumes from both our NetApp NAS device and one of our Solaris NFS servers.

Here is what we're seeing:

stg-backup:~ # cd /rmt/sge
stg-backup:/rmt/sge # ls -l
/bin/ls: bin: No such file or directory
/bin/ls: doc: No such file or directory
/bin/ls: lib: No such file or directory
/bin/ls: man: No such file or directory
/bin/ls: mpi: No such file or directory
/bin/ls: pvm: No such file or directory
/bin/ls: ckpt: No such file or directory
/bin/ls: 3rd_party: No such file or directory
/bin/ls: qmon: No such file or directory
/bin/ls: util: No such file or directory
/bin/ls: default: No such file or directory
/bin/ls: install_qmaster: No such file or directory
/bin/ls: sge-5.3p6-doc.tar.gz: No such file or directory
/bin/ls: stg_config: No such file or directory
/bin/ls: README.inst_sgeee: No such file or directory
/bin/ls: ssh_comm_dir: No such file or directory
/bin/ls: inst_sgeee: No such file or directory
/bin/ls: sge-5.3p6-common.tar.gz: No such file or directory
/bin/ls: catman: No such file or directory
/bin/ls: sge-5.3p6-bin-glinux.tar.gz: No such file or directory
/bin/ls: utilbin: No such file or directory
/bin/ls: stg_config.old.tgz: No such file or directory
/bin/ls: examples: No such file or directory
/bin/ls: install_execd: No such file or directory
/bin/ls: inst_sge: No such file or directory
/bin/ls: sge-5.3p7-bin-solaris64.tar: No such file or directory
/bin/ls: sge-5.3p7-common.tar: No such file or directory
/bin/ls: sge-5.3p7-doc.tar: No such file or directory
/bin/ls: core: No such file or directory
/bin/ls: neilcopyrcsge.sh: No such file or directory
/bin/ls: neilstopoldinstallrcsge.sh: No such file or directory
total 0
dr-xr-xr-x  2 root root 0 2008-01-14 11:40 .
drwxr-xr-x  7 root root 0 2008-01-14 13:06 ..
stg-backup:/rmt/sge # ls -l
total 0
stg-backup:/rmt/sge # ls -l
total 0
stg-backup:/rmt/sge # pwd
/rmt/sge
stg-backup:/rmt/sge # logout
Connection to stg-backup closed.
rhobbs@stg-mkc5:~> ssh stg-backup -l root
Password:
Last login: Mon Jan 14 13:05:56 2008 from stg-mkc5.domain.co.uk
stg-backup:~ # cd /rmt/sge
stg-backup:/rmt/sge # ls -l
total 73174
drwxr-xr-x  18 sgeadmin sgeadmin     1024 2008-01-02 23:09 .
drwxr-xr-x   4 root     root            0 2008-01-14 13:09 ..
drwxr-xr-x   3 sgeadmin sgeadmin      512 2005-04-27 14:48 3rd_party
drwxr-xr-x   4 root     root          512 2007-12-04 13:43 bin
drwxr-xr-x   4 sgeadmin sgeadmin      512 2002-03-27 14:30 catman
drwxr-xr-x   2 sgeadmin sgeadmin     1024 2005-04-27 14:48 ckpt
-rw-------   1 root     root      8388403 2008-01-02 23:09 core
drwxr-xr-x   4 sgeadmin sgeadmin      512 2007-12-04 15:33 default
drwxr-xr-x   2 sgeadmin sgeadmin      512 2005-04-27 14:48 doc
drwxr-xr-x   4 sgeadmin sgeadmin      512 2005-04-14 13:39 examples
-rwxr-xr-x   1 sgeadmin sgeadmin     1354 2004-04-07 12:29 install_execd
-rwxr-xr-x   1 sgeadmin sgeadmin     1354 2004-04-07 12:29 install_qmaster
-rwxr-xr-x   1 sgeadmin sgeadmin    77667 2006-02-27 15:53 inst_sge
lrwxrwxrwx   1 sgeadmin sgeadmin        8 2007-12-04 13:37 inst_sgeee -> inst_sge
drwxr-xr-x   4 root     root          512 2007-12-04 13:43 lib
drwxr-xr-x   6 sgeadmin sgeadmin      512 2002-03-27 14:30 man
drwxr-xr-x   3 sgeadmin sgeadmin      512 2005-04-27 14:48 mpi
-rwxr-xr-x   1 root     root          125 2008-01-02 11:46 neilcopyrcsge.sh
-rwxr-xr-x   1 root     root           63 2008-01-02 11:46 neilstopoldinstallrcsge.sh
drwxr-xr-x   3 sgeadmin sgeadmin      512 2005-04-27 14:48 pvm
drwxr-xr-x   4 sgeadmin sgeadmin      512 2005-04-27 14:48 qmon
-rw-r--r--   1 root     bin           396 2004-04-07 12:29 README.inst_sgeee
-rw-r--r--   1 root     root      9312974 2005-04-27 14:46 sge-5.3p6-bin-glinux.tar.gz
-rw-r--r--   1 root     root       822815 2005-04-27 14:46 sge-5.3p6-common.tar.gz
-rw-r--r--   1 root     root      3082603 2005-04-27 14:46 sge-5.3p6-doc.tar.gz
-rw-r--r--   1 root     root     45015040 2007-12-04 10:44 sge-5.3p7-bin-solaris64.tar
-rw-r--r--   1 root     root      2508800 2007-12-04 10:43 sge-5.3p7-common.tar
-rw-r--r--   1 root     root      5580800 2007-12-04 10:43 sge-5.3p7-doc.tar
drwxrwxrwx   2 root     root          512 2007-04-17 13:09 ssh_comm_dir
drwxr-xr-x   4 root     root          512 2006-09-19 09:34 stg_config
-rw-r--r--   1 root     root         8404 2006-07-21 08:38 stg_config.old.tgz
drwxr-xr-x   5 sgeadmin sgeadmin      512 2006-02-27 16:18 util
drwxr-xr-x   4 root     root          512 2007-12-04 13:43 utilbin
stg-backup:/rmt/sge #

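As you can see, "ls" periodically misbehaves, and keeps misbehaving until I log out and in again, at which point it works. In the failing listing, "." shows up as an empty root-owned directory (dr-xr-xr-x, size 0) rather than the real sgeadmin-owned one.

Sometimes it works first time, and other times it fails until I log out and in again.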

I hope someone knows what's causing this, because it's a nightmare! lol

Thanks in advance, people!

Could it be tied to these messages that we're seeing in "/var/log/messages" and "/var/log/warn"?

stg-backup:/var/log # tail messages
Jan 14 14:57:09 stg-backup kernel: svc: bad direction 256, dropping request
Jan 14 14:57:09 stg-backup kernel: svc: short len 20, dropping request
Jan 14 14:57:30 stg-backup kernel: svc: bad direction 256, dropping request
Jan 14 14:57:30 stg-backup kernel: svc: short len 20, dropping request
Jan 14 14:57:39 stg-backup kernel: svc: bad direction 256, dropping request
Jan 14 14:57:39 stg-backup kernel: svc: short len 20, dropping request
Jan 14 14:58:00 stg-backup kernel: svc: bad direction 256, dropping request
Jan 14 14:58:00 stg-backup kernel: svc: short len 20, dropping request
Jan 14 14:58:09 stg-backup kernel: svc: bad direction 256, dropping request
Jan 14 14:58:09 stg-backup kernel: svc: short len 20, dropping request

Posting these messages may be a red herring, but they are also strange and so may be related.
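
To see whether these kernel messages line up with the "ls" failures, it might help to tally them first. Here is a rough Python sketch that counts the "svc:" lines by message type; the function name and the sample lines are only for illustration, and the pattern assumes the syslog format shown above.

```python
import re
from collections import Counter

# Matches lines such as:
#   Jan 14 14:57:09 stg-backup kernel: svc: bad direction 256, dropping request
# Assumption: standard syslog layout, as in the excerpt above.
SVC_RE = re.compile(r"^\w{3}\s+\d+\s+[\d:]+\s+\S+\s+kernel:\s+svc:\s+(.*)$")


def tally_svc_messages(lines):
    """Count kernel 'svc:' messages, grouping by text with numbers masked."""
    counts = Counter()
    for line in lines:
        match = SVC_RE.match(line.strip())
        if match:
            # Replace digits with N so "short len 20" and "short len 24"
            # end up in the same bucket.
            counts[re.sub(r"\d+", "N", match.group(1))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Jan 14 14:57:09 stg-backup kernel: svc: bad direction 256, dropping request",
        "Jan 14 14:57:09 stg-backup kernel: svc: short len 20, dropping request",
        "Jan 14 14:57:30 stg-backup kernel: svc: bad direction 256, dropping request",
    ]
    for message, count in tally_svc_messages(sample).items():
        print(count, message)
```

To run it against the real log, pass the open file instead of the sample list, for example tally_svc_messages(open("/var/log/messages")).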

Strangely, I'm beginning to think it's some kind of environment problem, because I just hit the problem again, and this time I opened a second terminal to see whether the problem would show up in two separate terminals.

Here are the results from both terminals:

TERMINAL 1:

stg-backup:/rmt/project2 # date; ls -l
Mon Jan 14 15:02:08 GMT 2008
total 0
stg-backup:/rmt/project2 #

TERMINAL 2:

stg-backup:/rmt/project2 # date; ls -l
Mon Jan 14 15:02:08 GMT 2008
total 44
drwxrwsr-x   9 root     stg  4096 2008-01-13 20:20 .
drwxr-xr-x   9 root     root    0 2008-01-14 15:01 ..
-rw-r--r--   1 root     stg    21 2008-01-13 21:16 .arkeiaNOBACKUP
-rw-r--r--   1 root     stg   480 2005-06-28 12:52 .arkeiaNOBACKUP.email
-rw-r--r--   1 root     stg    21 2008-01-12 21:15 .arkeiaNOBACKUP.old
drwxrwsr-x  11 stg      stg  4096 2008-01-11 13:27 ASR
drwxrwsr-x   8 mstuttle stg  4096 2007-08-10 15:34 demos
drwxrwxr-x  11 gwebster stg  4096 2007-03-23 18:39 gabe
drwxrwsr-x   2 root     stg  4096 2005-02-04 17:36 home
-rw-r--r--   1 root     stg    99 2004-03-18 17:46 Makefile
drwxr-xr-x   2 root     root 4096 2008-01-08 10:06 mysqlbackup
-rw-r--r--   1 root     stg     6 2004-10-29 07:50 neil.txt
-rw-------   1 root     stg   675 2004-03-22 09:58 nohup.out
lrwxrwxrwx   1 root     stg    20 2005-11-30 09:13 remote -> /rmt/sysadmin/remote
drwxrwxrwx  29 root     root 4096 2008-01-14 15:01 .snapshot
lrwxrwxrwx   1 root     stg    22 2005-09-09 17:00 stguser -> /rmt/stg14/TTS/stguser
drwxrwsrwx   4 kate     stg  4096 2004-07-27 14:48 sysadmin
stg-backup:/rmt/project2 #

As you can see, I had two terminals open on the same machine at exactly the same time, both in the same automounted NFS directory, running exactly the same command.

One terminal failed, and the other worked.

Therefore this cannot be a hardware problem, right?

To prove that this is not a coincidence, I ran the same test three more times, and got exactly the same results: TERMINAL 1 was "broken" and TERMINAL 2 was working.
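
One way to test the per-session theory without a second terminal, as a rough sketch: run a small Python script from the affected shell and compare what "." refers to against what a fresh lookup of the same path returns. The function names are made up for illustration, and the whole thing assumes the broken shell is holding on to a different directory object than the one currently mounted, which is only a guess.

```python
import os


def same_directory(path_a, path_b):
    """Return True if both paths resolve to the same device and inode."""
    stat_a, stat_b = os.stat(path_a), os.stat(path_b)
    return (stat_a.st_dev, stat_a.st_ino) == (stat_b.st_dev, stat_b.st_ino)


def cwd_is_stale():
    """Guess whether the inherited working directory is out of date.

    "." is whatever directory the parent shell handed to this process,
    while os.getcwd() gives an absolute path that is looked up afresh.
    If the two disagree, the shell is probably sitting in an old directory.
    """
    try:
        return not same_directory(".", os.getcwd())
    except OSError:
        # The path could not be resolved at all; treat that as stale too.
        return True


if __name__ == "__main__":
    if cwd_is_stale():
        print("cwd looks stale: '.' differs from a fresh lookup of", os.getcwd())
    else:
        print("cwd matches a fresh lookup of", os.getcwd())
```

If the script reports a stale cwd in TERMINAL 1 but not in TERMINAL 2, that would at least confirm the two shells are looking at different directories under the same name.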

I then logged out of TERMINAL 1 and logged back in, and it worked again:

stg-backup:/rmt/project2 # date; ls -l
Mon Jan 14 15:05:26 GMT 2008
total 0
stg-backup:/rmt/project2 # logout
Connection to stg-backup closed.
rhobbs@stg-mkc5:~> ssh stg-backup -l root
Password:
Last login: Mon Jan 14 15:01:36 2008 from stg-mkc5.crl.toshiba.co.uk
stg-backup:~ # cd /rmt/project2
stg-backup:/rmt/project2 # ls -l
total 44
drwxrwsr-x   9 root     stg  4096 2008-01-13 20:20 .
drwxr-xr-x   7 root     root    0 2008-01-14 15:05 ..
-rw-r--r--   1 root     stg    21 2008-01-13 21:16 .arkeiaNOBACKUP
-rw-r--r--   1 root     stg   480 2005-06-28 12:52 .arkeiaNOBACKUP.email
-rw-r--r--   1 root     stg    21 2008-01-12 21:15 .arkeiaNOBACKUP.old
drwxrwsr-x  11 stg      stg  4096 2008-01-11 13:27 ASR
drwxrwsr-x   8 mstuttle stg  4096 2007-08-10 15:34 demos
drwxrwxr-x  11 gwebster stg  4096 2007-03-23 18:39 gabe
drwxrwsr-x   2 root     stg  4096 2005-02-04 17:36 home
-rw-r--r--   1 root     stg    99 2004-03-18 17:46 Makefile
drwxr-xr-x   2 root     root 4096 2008-01-08 10:06 mysqlbackup
-rw-r--r--   1 root     stg     6 2004-10-29 07:50 neil.txt
-rw-------   1 root     stg   675 2004-03-22 09:58 nohup.out
lrwxrwxrwx   1 root     stg    20 2005-11-30 09:13 remote -> /rmt/sysadmin/remote
drwxrwxrwx  29 root     root 4096 2008-01-14 15:01 .snapshot
lrwxrwxrwx   1 root     stg    22 2005-09-09 17:00 stguser -> /rmt/stg14/TTS/stguser
drwxrwsrwx   4 kate     stg  4096 2004-07-27 14:48 sysadmin
stg-backup:/rmt/project2 #

So now I'm really confused...

The annoying thing is that I have also just noticed that this problem is causing some of the cron jobs that access remote NFS volumes to fail as well!

Argh!

Someone help me, please! lol
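
A possible stopgap for the failing cron jobs, as a rough sketch only: check that the NFS directory actually lists some entries before the job starts, and retry a few times if it does not. The function name, retry count, and delay are all invented for illustration, and it assumes the directory is never legitimately empty. It does not fix the underlying problem.

```python
import os
import time


def directory_ready(path, attempts=3, delay=5.0):
    """Return True once `path` lists at least one entry, retrying on failure.

    Assumption: a healthy mount is never empty, so an empty listing
    (like the "total 0" output above) means the mount is not usable yet.
    """
    for attempt in range(attempts):
        try:
            if os.listdir(path):
                return True
        except OSError:
            # Missing directory, I/O error and so on: count as not ready.
            pass
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


if __name__ == "__main__":
    import sys

    # Usage: python check_mount.py /rmt/project2 && run-the-real-job
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    sys.exit(0 if directory_ready(target, attempts=1, delay=0) else 1)
```

The check uses the absolute path rather than the current directory, so it does not depend on where cron starts the job.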

It seems that we are seeing the same thing here.
Thanks