iSCSI issue on RHEL 5

Hello Friends,

I am facing an issue with the iSCSI configuration on some of our RHEL 5 servers.

When I restart the iSCSI service, it triggers a reboot of the RHEL server.

Could you please help me with this issue?

Below are the system details:

uname -a:
Linux za-rac-prd-01.abc.local 2.6.18-8.el5 #1 SMP Fri Jan 26 14:15:14 EST 2007 x86_64 x86_64 x86_64 GNU/Linux

iSCSI package:
iscsi-initiator-utils-6.2.0.742-0.5.el5
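
If it helps, I can capture something like the following around the next restart. These are the stock RHEL 5 commands and the default log location, nothing specific to our environment, so please correct me if there is a better place to look:

# Active iSCSI sessions and how the service sees them
service iscsi status
iscsiadm -m session

# Which block devices exist and which filesystems are mounted on them
cat /proc/partitions
mount

# Whether the kernel is set to reboot automatically after a panic
# (a non-zero value here would explain a restart turning into a reboot)
sysctl kernel.panic

# Messages logged around the time of the service restart
tail -n 100 /var/log/messages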

Hi All,

In addition to the above details:

The RHEL server is running on a Dell blade and has storage allocated from a Dell EqualLogic array and an MD3200.
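
One thing I am not sure about, so please treat this as an assumption rather than what is actually in our fstab, is whether the iSCSI-backed filesystems need the _netdev mount option so that the netfs script, rather than the iscsi init script, handles mounting and unmounting them. The line below is only an illustration; the device path and mount point are made up:

# /etc/fstab (illustrative entry only; device and mount point are hypothetical)
# _netdev makes the filesystem mount after, and unmount before, the iSCSI service,
# so stopping or restarting iscsi does not pull storage out from under a mounted filesystem.
/dev/sdb1    /u01    ext3    _netdev,defaults    0 0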

I've found that RHEL gets bound up pretty easily when iSCSI/FCoE storage requests don't come back as expected.
I was building a small CAD cluster last year, and the build was fairly heavily automated: PXE boot to a minimal kickstart image, then HP-SAS for provisioning (we were also automating loading, patching, etc. through it, which was dandy... when it worked). Each of these servers used a volume group of ~20 SAN-allocated disks, and there were scripts to initialize, test, and lay down a raw Oracle filesystem.

Well, these machines were running dual 10Gb NICs to each other, and half of the volume group was allocated to each NIC. In between the base OS and Oracle installs, the series of scripts runs the disk-initialization routine. These machines have 16 cores, and the CPU utilization went from almost nothing to 100% while the build took longer than expected. Noticing that the machine was not responding right, I started going through the logs and saw that one core was unresponsive and the kernel was throwing panic messages.

It turns out one of the 10Gb cards was actually faulty, but with the six or so different links and the other 15 cores running normally, this one core was stuck trying to execute a temporary script, and all that script was doing was sending requests via iSCSI and never getting them back. The kernel was so caught up with this that it brought the entire machine to a crawl, as though it were in a race condition. Killing the script made everything hunky-dory again, except for the build, which of course failed. Once the NOC team got to check on the links, they found the problem and replaced the card, and after a little manual scrubbing of the now-complete volume group, the rest went smoothly.

I haven't seen how other PC-based Unix-style OSes fare under these conditions, but I can't help thinking there should be some failsafes to keep one core running one simple script from turning the machine into a $40K paperweight. I assume the priority given to high-end storage subsystems is likely to improve overall performance, as long as everything is working.
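
To be fair, the open-iscsi initiator does expose a few knobs for this in /etc/iscsi/iscsid.conf that bound how long the kernel waits on a dead session before failing the I/O back up the stack. The values below are only illustrative, not a recommendation for any particular array:

# /etc/iscsi/iscsid.conf (excerpt; values shown are illustrative)
# Seconds to wait for a broken session to come back before failing
# outstanding I/O up to the layers above (LVM, filesystem, application).
node.session.timeo.replacement_timeout = 120
# iSCSI NOP-Out "ping" interval and timeout used to detect a dead connection.
node.conn[0].timeo.noop_out_interval = 10
node.conn[0].timeo.noop_out_timeout = 15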