Interesting Disk Error Problem

gull04 · March 16, 2012, 2:28pm

Hi Folks,

I have an interesting problem here. We have just upgraded some machines (test and development) to the latest and greatest, as in:

SunOS ss063a 5.10 Generic_147440-13 sun4u sparc SUNW,Sun-Fire-V440

However, on all the systems we are seeing the same problem: during boot there is a long pause while the same error is displayed over and over, as in:

Mar 16 17:10:26 ss063a scsi: [ID 662515 kern.warning] WARNING: scsi_probe(2): scsi_reset failed(0) lun-reset cap(1)
Mar 16 17:10:53 ss063a last message repeated 3918 times
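
For anyone wanting to gauge how noisy this is across several boxes, here is a rough Python sketch that tallies the warning from syslog lines, folding in the "last message repeated N times" rollup. The function name count_warning and the search string are illustrative assumptions, not part of any Solaris tool, and the sketch assumes a rollup line refers to the message immediately before it.

```python
import re

# syslog collapses consecutive duplicates into a "repeated N times" line
REPEAT_RE = re.compile(r"last message repeated (\d+) times?")

def count_warning(lines, needle="scsi_reset failed"):
    """Count occurrences of `needle`, including syslog repeat rollups."""
    total = 0
    last_matched = False  # did the most recent real message contain the needle?
    for line in lines:
        m = REPEAT_RE.search(line)
        if m:
            # A rollup only counts if the message it repeats was a match.
            if last_matched:
                total += int(m.group(1))
            continue
        last_matched = needle in line
        if last_matched:
            total += 1
    return total

sample = [
    "Mar 16 17:10:26 ss063a scsi: [ID 662515 kern.warning] WARNING: "
    "scsi_probe(2): scsi_reset failed(0) lun-reset cap(1)",
    "Mar 16 17:10:53 ss063a last message repeated 3918 times",
]
print(count_warning(sample))  # 1 original + 3918 repeats = 3919
```

For the two lines above that works out to 3919 warnings in roughly 27 seconds.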

There doesn't seem to be any other problem with the operation of the machine; the disks all seem to be present and functioning correctly.

Has anyone seen this before, or is it something new?

This is Solaris 10 update 10 plus a bit.

Regards

Dave

jim_mcnamara · March 16, 2012, 11:20pm

What's in your scsi_vhci.conf? We ran into a series of goofy errors with iSCSI and with fibre channel on the same boxes (Dell SANs). We had to contact the vendor, since Sunacle will not discuss the deep mysteries of the scsi_vhci.conf files with mere customers. Our boxes came up, too, but we saw degradation in I/O. If you keep your I/O stats, check 'em before & after.

Anyway, as a wild guess, the syntax you need in the conf file may be slightly different from what you currently have, even without any change in your storage.

hicksd8 · March 17, 2012, 9:14am

Firstly, what I am about to say below assumes these are standard "internal" drives and NOT iSCSI or on another RAID controller.

SCSI drives are a huge chunk of electronics and are in themselves highly programmable. All kinds of flags, registers, etc are held on the drive itself in areas technically known as "mode pages". It is possible that some mode page settings got screwed up during your upgrade. This is just a guess.

What you could try is to run the format command, select one of the drives giving a problem, and take the option to "set all mode pages to default".
This won't harm your data.

Just a thought.

Dennis.

gull04 · March 17, 2012, 11:58am

Hi Folks,

There was a QLogic version of a driver in the /kernel/drv/sparcv9 directory which was being loaded on boot. As we were going from update 4 to update 10 on these boxes, it could have been anywhere in between. Removing the offending driver, qla2300, resolved the problem. It took some time to identify, but it was finally resolved.

The most likely reason the problem came to light is that the kernel had at some point received a bugfix that exposed it.
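
If you want to check other boxes for the same leftover, a minimal Python sketch along these lines lists files in the driver directory whose names start with a given prefix. The helper name find_driver_files and the "ql" prefix are assumptions for illustration; the only driver actually named in this thread is qla2300, and a match is just a candidate to look at, not proof that it is the culprit.

```python
from pathlib import Path

def find_driver_files(drv_dir, prefix):
    """Return sorted names of regular files in drv_dir starting with prefix."""
    d = Path(drv_dir)
    if not d.is_dir():
        # Directory is missing (e.g. not a sparcv9 box): nothing to report.
        return []
    return sorted(
        p.name for p in d.iterdir()
        if p.is_file() and p.name.startswith(prefix)
    )

# Directory taken from the post above; the "ql" prefix is a guess that
# would catch qla2300 along with any other QLogic-named modules.
print(find_driver_files("/kernel/drv/sparcv9", "ql"))
```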

Regards

Dave