Intermittent connectivity issues with ROCKS on a compute cluster

gandalf85 · September 13, 2010, 10:49am

I have a cluster set up with a head node and compute nodes running TORQUE and MOAB. The distro is ROCKS 5.3. I've been having connectivity problems for the past couple of weeks. Every couple of hours the network connectivity seems to just stop working: sometimes it starts back up in 10-15 minutes, sometimes I have to reboot the machine.

I have Samba set up, and during an outage the network drive I have mounted on my Windows PC won't respond (often causing Windows Explorer to crash) and I can't open a new PuTTY session. If I already have a PuTTY window open, I can run basic commands like "ls" and "cd", but qstat and pbsnodes don't work. If I'm already logged into the head node through PuTTY, I can ssh into one of the compute nodes. Eventually the PuTTY window crashes, though. Through all of this, I can still ping the server just fine.
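Since ping keeps working while the services stop responding, one way to pin down exactly when each outage starts and ends would be to probe the TCP ports directly from another machine. The following is only a minimal sketch under assumptions: the port numbers are the usual defaults (22 for sshd, 445 for Samba, 15001 for TORQUE's pbs_server) and may not match this cluster, and the function names are made up for illustration.

```python
import socket
import time

# Assumed default ports; adjust to whatever the head node really uses.
PORTS = {"ssh": 22, "smb": 445, "pbs_server": 15001}


def check_port(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def probe_once(host, ports=PORTS, timeout=3.0):
    """Probe each named port once and return a {name: reachable} dict."""
    return {name: check_port(host, port, timeout) for name, port in ports.items()}


def format_result(result, now=None):
    """Render one probe round as a single timestamped log line."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    status = " ".join(
        f"{name}={'up' if ok else 'DOWN'}" for name, ok in sorted(result.items())
    )
    return f"{stamp} {status}"
```

Calling print(format_result(probe_once("wantsh01"))) once a minute from a desktop (the hostname is taken from the logs below) would give a timeline to line up against /var/log/messages.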

The SAMBA logs were reporting all sorts of problems:

[2010/09/10 03:51:29, 0] lib/fault.c:fault_report(42)
INTERNAL ERROR: Signal 7 in pid 9816 (3.0.33-3.15.el5_4)
[2010/09/10 03:51:29, 0] smbd/close.c:close_directory(430)
close_directory: Could not get share mode lock for Pao
Please read the Trouble-Shooting section of the Samba3-HOWTO
[2010/09/10 03:51:29, 0] lib/fault.c:fault_report(44)

From:
[2010/09/10 03:51:29, 0] lib/fault.c:fault_report(41)
[2010/09/10 03:51:29, 0] lib/fault.c:fault_report(45)
[2010/09/10 03:51:29, 0] lib/fault.c:fault_report(42)
[2010/09/10 03:51:29, 0] lib/util.c:smb_panic(1655)
INTERNAL ERROR: Signal 7 in pid 8475 (3.0.33-3.15.el5_4)
PANIC (pid 9816): internal error
Please read the Trouble-Shooting section of the Samba3-HOWTO
[2010/09/10 03:51:30, 0] lib/util.c:log_stack_trace(1759)
[2010/09/10 03:51:30, 0] lib/fault.c:fault_report(44)

I turned off Samba, but I still have the same problems. /var/log/messages contained this:

Sep 10 10:38:02 wantsh01 kernel: bnx2i: iSCSI not supported, dev=eth0
Sep 10 10:38:02 wantsh01 kernel: bnx2i: iSCSI not supported, dev=eth0
Sep 10 10:38:02 wantsh01 kernel: bnx2i: iSCSI not supported, dev=eth1
Sep 10 10:38:02 wantsh01 kernel: bnx2i: iSCSI not supported, dev=eth1

bnx2i is some sort of driver for the Broadcom network card. I updated the Broadcom multi-function drivers and the firmware, but I still have problems. One thing I couldn't get working was the bnx2i iSCSI offload driver; I ran into version issues with the RPMs. I've run MEMTEST and a couple of hardware diagnostic checks and can't find any problems. Here's /var/log/messages from when I rebooted the machine. Note that I hosed the X server somehow, and I'm not really worried about fixing that.

Sep 13 04:49:25 wantsh01 gdm[3930]: Failed to start X server several times in a short time period; disabling display :0
Sep 13 04:49:29 wantsh01 mountd[3527]: Caught signal 15, un-registering and exiting.
Sep 13 04:52:12 wantsh01 kernel: Memory for crash kernel (0x0 to 0x0) notwithin permissible range
Sep 13 04:52:12 wantsh01 kernel: PCI: BIOS Bug: MCFG area at e0000000 is not E820-reserved
Sep 13 04:52:12 wantsh01 kernel: PCI: Not using MMCONFIG.
Sep 13 04:52:13 wantsh01 kernel: intel_rng: FWH not detected
Sep 13 04:52:13 wantsh01 kernel: bnx2i: iSCSI not supported, dev=eth0
Sep 13 04:52:13 wantsh01 kernel: bnx2i: dev eth0 does not support iscsi
Sep 13 04:52:13 wantsh01 kernel: bnx2i: iSCSI not supported, dev=eth1
Sep 13 04:52:13 wantsh01 kernel: bnx2i: dev eth1 does not support iscsi
Sep 13 04:52:13 wantsh01 named[3028]: the working directory is not writable
Sep 13 04:52:19 wantsh01 sshd[3428]: error: Bind to port 22 on 0.0.0.0 failed: Address already in use.
Sep 13 04:52:19 wantsh01 xinetd[3445]: /etc/xinetd.d/RCS is not a regular file. It is being skipped.
Sep 13 04:52:24 wantsh01 smartd[3926]: Problem creating device name scan list
Sep 13 04:52:24 wantsh01 smartd[3926]: Problem creating device name scan list
Sep 13 04:52:24 wantsh01 smartd[3926]: In the system's table of devices NO devices found to scan
Sep 13 04:52:31 wantsh01 gdm[4042]: gdm_slave_xioerror_handler: Fatal X error - Restarting :0
Sep 13 04:52:40 wantsh01 gdm[4188]: gdm_slave_xioerror_handler: Fatal X error - Restarting :0
Sep 13 04:52:49 wantsh01 gdm[4210]: gdm_slave_xioerror_handler: Fatal X error - Restarting :0
Sep 13 04:53:19 wantsh01 gdm[3940]: Failed to start X server several times in a short time period; disabling display :0
Sep 13 04:53:32 wantsh01 dhcpd: receive_packet failed on eth0: Network is down
Sep 13 04:53:33 wantsh01 kernel: bnx2i: dev eth0 does not support iscsi
Sep 13 04:53:33 wantsh01 kernel: bnx2i: iSCSI not supported, dev=eth0
Sep 13 04:53:37 wantsh01 kernel: bnx2i: dev eth1 does not support iscsi
Sep 13 04:53:37 wantsh01 kernel: bnx2i: iSCSI not supported, dev=eth1
Sep 13 04:54:50 wantsh01 snmpd[3379]: c64 32 bit check failed
Sep 13 04:55:20 wantsh01 snmpd[3379]: looks like a 64bit wrap, but prev!=new
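To make a log like this easier to scan, a small script can count messages per daemon and pull out every entry that mentions a given interface. This is just a rough sketch that assumes the plain syslog line format shown above; the regular expression and the function names are my own, not part of any tool.

```python
import re
from collections import Counter

# Matches lines like "Sep 13 04:53:32 wantsh01 dhcpd: message"
# or "Sep 13 04:52:31 wantsh01 gdm[4042]: message".
SYSLOG_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+"
    r"(?P<tag>[^\s:\[]+)(?:\[\d+\])?:\s*"
    r"(?P<msg>.*)$"
)


def parse(lines):
    """Yield (timestamp, tag, message) for each line that looks like syslog."""
    for line in lines:
        m = SYSLOG_RE.match(line.strip())
        if m:
            yield m.group("ts"), m.group("tag"), m.group("msg")


def count_by_tag(lines):
    """Count how many messages each daemon/tag logged."""
    return Counter(tag for _, tag, _ in parse(lines))


def entries_mentioning(lines, needle):
    """Return parsed entries whose message contains needle (e.g. 'eth0')."""
    return [entry for entry in parse(lines) if needle in entry[2]]
```

For example, entries_mentioning(open("/var/log/messages"), "eth0") would list the dhcpd and bnx2i entries for that interface in order, which makes it easier to see what was logged around the time of each outage.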

Thanks for any help, I'd really appreciate some advice.