I have two disks in a Solaris 10 rpool, and it looks like both disks are having issues:
c1t1d0 Soft Errors: 0 Hard Errors: 114 Transport Errors: 329
Vendor: HITACHI Product: H103014SCSUN146G Revision: A2A8 Serial No: 1036FR5NPE
Size: 146.81GB <146810536448 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 114 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c1t3d0 Soft Errors: 0 Hard Errors: 159 Transport Errors: 893
Vendor: HITACHI Product: H103014SCSUN146G Revision: A2A8 Serial No: 1039FASGUE
Size: 146.81GB <146810536448 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 159 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
When I checked /var/adm/messages, I noticed that both disks are throwing errors at the same block. Does that indicate something? I would expect the error blocks to be different.
Oct 11 10:51:04 solaris-10-sparc Error for Command: write(10) Error Level: Retryable
Oct 11 10:51:04 solaris-10-sparc scsi: [ID 107833 kern.notice] Requested Block: 107631425 Error Block: 107631425
Oct 11 10:51:04 solaris-10-sparc scsi: [ID 107833 kern.notice] Vendor: HITACHI Serial Number: 1039FASGUE
Oct 11 10:51:04 solaris-10-sparc scsi: [ID 107833 kern.notice] Sense Key: Unit_Attention
Oct 11 10:51:04 solaris-10-sparc scsi: [ID 107833 kern.notice] ASC: 0x29 (scsi bus reset occurred), ASCQ: 0x2, FRU: 0x17
Oct 11 10:51:04 solaris-10-sparc scsi: [ID 107833 kern.warning] WARNING: /pci@400/pci@0/pci@8/scsi@0/sd@1,0 (sd4):
Oct 11 10:51:04 solaris-10-sparc Error for Command: write(10) Error Level: Retryable
Oct 11 10:51:04 solaris-10-sparc scsi: [ID 107833 kern.notice] Requested Block: 107631425 Error Block: 107631425
Oct 11 10:51:04 solaris-10-sparc scsi: [ID 107833 kern.notice] Vendor: HITACHI Serial Number: 1036FR5NPE
Oct 11 10:51:04 solaris-10-sparc scsi: [ID 107833 kern.notice] Sense Key: Unit_Attention
Oct 11 10:51:04 solaris-10-sparc scsi: [ID 107833 kern.notice] ASC: 0x29 (scsi bus reset occurred), ASCQ: 0x2, FRU: 0x17
Oct 11 10:52:14 solaris-10-sparc scsi: [ID 243001 kern.warning] WARNING: /pci@400/pci@0/pci@8/scsi@0 (mpt0):
Hmm, have you created an rpool mirror with these two disks?
The disks are identical in make, model and geometry, and in a mirror every write goes to the same block on both of them. So if there was an event during a write operation (e.g. a power glitch) that caused a write error on that sector, it could occur on both drives.
I would keep an eye on it and see whether the problem grows. If not, I wouldn't worry too much. There are tools that will tell you the path of the file involved if you would like to know.
If you can take the system down and boot from CD into single-user mode, you could run the 'analyze' function within 'format'. It will take a while to read the whole disk. Make sure that you select only 'read only' and not 'read/write', otherwise you will destroy your data.
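To keep an eye on it, something like the following could be run from time to time. It is only a rough sketch: the function name summarize_scsi_errors is made up for this post, and it assumes the log lines look like the ones quoted above (an "Error Block:" line followed by a "Serial Number:" line).

```shell
# Count SCSI errors per disk serial number and error block.
# Assumes each error report has an "Error Block:" line followed by a
# "Serial Number:" line, as in the excerpt quoted in this thread.
# On Solaris 10, nawk or /usr/xpg4/bin/awk may be a safer choice than awk.
summarize_scsi_errors() {
    awk '
        /Error Block:/   { block = $NF }
        /Serial Number:/ {
            if (block != "") {
                count[$NF " " block]++
                block = ""
            }
        }
        END { for (k in count) print count[k], k }
    ' "$1" | sort
}

# Usage: summarize_scsi_errors /var/adm/messages
```

Each output line is "count serial block". If the counts stay flat and both serials always appear with the same blocks at the same times, that points more towards a shared event than towards two independently failing disks.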
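A plain dd read of the raw device is another non-destructive way to read the whole disk. It is not the same thing as format's analyze, just a simple substitute. This is a sketch under assumptions: read_only_scan is a made-up helper name, and the device path in the usage line assumes an SMI-labelled disk where slice 2 covers the whole disk.

```shell
# Read a device (or file) from start to end and discard the data.
# Nothing is written to the device; output goes to /dev/null only.
read_only_scan() {
    dev=$1
    if dd if="$dev" of=/dev/null bs=1024k 2>/dev/null; then
        echo "$dev: read completed without I/O errors"
    else
        echo "$dev: read FAILED, check /var/adm/messages"
        return 1
    fi
}

# Usage (assumed device path): read_only_scan /dev/rdsk/c1t1d0s2
```

Like analyze, this takes a long time on a 146 GB disk and adds I/O load while it runs.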
Yes, both disks are mirrored in rpool.
I did try a read test on both disks, and it ended with:
analyze> read
Ready to analyze (won't harm SunOS). This takes a long time,
but is interruptable with CTRL-C. Continue? y
pass 0
651/2/824
Warning:Drive may be reserved or has been removed, aborting surface analysis.
analyze>
This is an old server with an uptime of 4 years. Due to the nature of the application running on it, downtime is hard to arrange. But I am trying to figure out whether the disks are bad or whether it is the OS that can't read them properly. In iostat, I see hard and transport errors on both disks.
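For reference, the iostat -En output can be filtered down to just the disks with non-zero hard or transport errors. A small sketch, assuming the summary line format shown at the top of this thread; disks_with_errors is a name invented here.

```shell
# Read `iostat -En` output on stdin and print only devices whose
# hard or transport error counters are non-zero.
# Assumed summary line layout:
#   <dev> Soft Errors: N Hard Errors: N Transport Errors: N
disks_with_errors() {
    awk '/Soft Errors:/ {
        if ($7 + $10 > 0)
            print $1, "hard=" $7, "transport=" $10
    }'
}

# Usage: iostat -En | disks_with_errors
```

Note that these counters are cumulative since boot, so with 4 years of uptime it is worth comparing two runs some time apart rather than reading the absolute numbers.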
My bet is that you need a SCSI HW driver patch.
Check the boot messages (the dmesg command, or /var/adm/messages) to find out which SCSI HW driver detected the disk. For example, if it is "mpt", look for an "mpt driver patch". Oddly, these driver patches are never in a "recommended patch cluster" or "recommended patch set".
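A quick way to pull the driver names out of the log is sketched below. It assumes the WARNING lines end in a device path followed by the driver instance in parentheses, e.g. "(mpt0):" or "(sd4):", as in the messages quoted earlier; scsi_drivers_in_log is just an illustrative name.

```shell
# List the distinct driver names and device paths that appear in
# kernel WARNING lines of the form:
#   ... WARNING: /pci@.../scsi@0 (mpt0):
# The instance number is stripped, so "mpt0" is reported as "mpt".
scsi_drivers_in_log() {
    sed -n 's|.*WARNING: \(/[^ ]*\) (\([a-z_]*\)[0-9]*):.*|\2 \1|p' "$1" | sort -u
}

# Usage: scsi_drivers_in_log /var/adm/messages
```

Once the driver name is known, modinfo | grep mpt should show the loaded module and its version string, which can then be compared against the patch README.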
Maybe start here:
Good question.
It could be included in the general FW patch, or it may have been integrated into the kernel patch. In the latter case it is also included in a "recommended patch set".
The term "reserved" suggests another possible root cause. Do you have LDOMs configured that can access the same disk?
Yes, that is correct. Whether it's a hardware RAID1 or a software RAID1, the system will lock out the mirror so that it cannot be addressed for any other purpose. The RESERVED flag (a SCSI reservation held on the disk) is also used by cluster members to take ownership of a disk.
The RESERVED flag can be reset with a utility, but that wouldn't be advisable in an rpool mirror.
What happens if you 'analyze' the primary disk of the pair?