Disk errors on the same block on Solaris disks

solaris_1977 · October 11, 2023, 6:13pm

I have two disks in a Solaris 10 rpool, and it looks like both disks are having issues:

c1t1d0           Soft Errors: 0 Hard Errors: 114 Transport Errors: 329
Vendor: HITACHI  Product: H103014SCSUN146G Revision: A2A8 Serial No: 1036FR5NPE
Size: 146.81GB <146810536448 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 114 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0

c1t3d0           Soft Errors: 0 Hard Errors: 159 Transport Errors: 893
Vendor: HITACHI  Product: H103014SCSUN146G Revision: A2A8 Serial No: 1039FASGUE
Size: 146.81GB <146810536448 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 159 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0

When I checked /var/adm/messages, I noticed that both disks are reporting an error at the same block. Does that indicate something? I would expect the blocks to be different.

Oct 11 10:51:04 solaris-10-sparc      Error for Command: write(10)               Error Level: Retryable
Oct 11 10:51:04 solaris-10-sparc scsi: [ID 107833 kern.notice]        Requested Block: 107631425                 Error Block: 107631425
Oct 11 10:51:04 solaris-10-sparc scsi: [ID 107833 kern.notice]        Vendor: HITACHI                            Serial Number: 1039FASGUE
Oct 11 10:51:04 solaris-10-sparc scsi: [ID 107833 kern.notice]        Sense Key: Unit_Attention
Oct 11 10:51:04 solaris-10-sparc scsi: [ID 107833 kern.notice]        ASC: 0x29 (scsi bus reset occurred), ASCQ: 0x2, FRU: 0x17
Oct 11 10:51:04 solaris-10-sparc scsi: [ID 107833 kern.warning] WARNING: /pci@400/pci@0/pci@8/scsi@0/sd@1,0 (sd4):
Oct 11 10:51:04 solaris-10-sparc      Error for Command: write(10)               Error Level: Retryable
Oct 11 10:51:04 solaris-10-sparc scsi: [ID 107833 kern.notice]        Requested Block: 107631425                 Error Block: 107631425
Oct 11 10:51:04 solaris-10-sparc scsi: [ID 107833 kern.notice]        Vendor: HITACHI                            Serial Number: 1036FR5NPE
Oct 11 10:51:04 solaris-10-sparc scsi: [ID 107833 kern.notice]        Sense Key: Unit_Attention
Oct 11 10:51:04 solaris-10-sparc scsi: [ID 107833 kern.notice]        ASC: 0x29 (scsi bus reset occurred), ASCQ: 0x2, FRU: 0x17
Oct 11 10:52:14 solaris-10-sparc scsi: [ID 243001 kern.warning] WARNING: /pci@400/pci@0/pci@8/scsi@0 (mpt0):
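
To check whether both disks really keep failing at the same block, the sd messages can be tallied per disk. The following is a minimal sketch, assuming the log layout shown above (an "Error Block:" line followed by a "Serial Number:" line for the same event). The function name error_blocks_by_serial is made up for this example. On Solaris 10, set AWK=nawk or AWK=/usr/xpg4/bin/awk, since the default /usr/bin/awk is the old awk.

```shell
# Hypothetical helper: reads syslog text on stdin and prints
# "serial block count" for every (disk serial, error block) pair.
error_blocks_by_serial() {
    ${AWK:-awk} '
        /Error Block:/ {
            # The failing block number is the last field on this line.
            block = $NF
        }
        /Serial Number:/ && block != "" {
            # The serial number is the last field; pair it with the
            # most recent error block, then wait for the next event.
            count[$NF " " block]++
            block = ""
        }
        END {
            for (key in count) print key, count[key]
        }
    ' | sort
}
```

Usage: cat /var/adm/messages* | error_blocks_by_serial. Note that the excerpt above reports Sense Key Unit_Attention with ASC 0x29 (bus reset) and the iostat output shows Media Error: 0, so a matching block number on both halves of a mirror does not by itself point to a bad sector.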

Can someone suggest something on this?

Thanks

hicksd8 · October 12, 2023, 4:21pm

Hmmm... have you created an rpool mirror with these two disks?

The disks are identical in make, model and geometry, so if there was an event during a write operation (e.g. a power glitch) that caused a write error on that sector, it could occur on both drives.

I would keep an eye on it and see if the problem grows. If not, I wouldn't be worried too much. There are tools that will tell you the path of the file involved if you would like to know.
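
One way to "keep an eye on it" is to record the error counters periodically and compare them. This is a rough sketch, assuming the iostat -En header-line layout shown in the first post. The function name iostat_error_summary and the output path in the usage note are made up for this example. On Solaris 10, set AWK=nawk or AWK=/usr/xpg4/bin/awk.

```shell
# Hypothetical helper: reads `iostat -En` output on stdin and prints
# "device hard_errors transport_errors" for each disk.
iostat_error_summary() {
    ${AWK:-awk} '
        /Soft Errors:/ && /Hard Errors:/ && /Transport Errors:/ {
            dev = $1
            hard = ""
            transport = ""
            # Walk the fields and pick the number after each label.
            for (i = 1; i < NF; i++) {
                if ($i == "Hard" && $(i + 1) == "Errors:") hard = $(i + 2)
                if ($i == "Transport" && $(i + 1) == "Errors:") transport = $(i + 2)
            }
            print dev, hard, transport
        }
    '
}
```

Usage (bash or ksh): iostat -En | iostat_error_summary > /var/tmp/disk-errors.$(date +%Y%m%d), then diff two snapshots a few days apart. These counters accumulate since boot, so on a host with a long uptime the growth between snapshots says more than the absolute numbers.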

If you can take the system down and boot from CD into single-user mode, you could run the 'analyze' function within 'format'. It will take a while to read the whole disk. Make sure that you select only 'read only' and not 'read/write', otherwise you will destroy your data.

solaris_1977 · October 12, 2023, 4:55pm

Yes, both disks are mirrored in rpool.
I tried a read on both disks, and it ended with:

analyze> read
Ready to analyze (won't harm SunOS). This takes a long time,
but is interruptable with CTRL-C. Continue? y

        pass 0
   651/2/824

Warning:Drive may be reserved or has been removed, aborting surface analysis.
analyze>

This is an old server with an uptime of 4 years. Due to the nature of the application running on it, it is hard to schedule downtime. I am trying to figure out whether the disks are bad or whether the OS cannot read them properly. In iostat, I see hard and transport errors on both disks.

MadeInGermany · October 12, 2023, 5:45pm

My bet is that you need a SCSI HW driver patch.
Check the boot messages (dmesg, /var/adm/messages) to find out which SCSI HW driver detected the disks. For example, if it is "mpt", look for an "mpt driver patch". Oddly, these driver patches are never in a "recommended patch cluster" or "recommended patch set".
Maybe start here: Firmware Downloads and Release History for Oracle Server Systems

solaris_1977 · October 12, 2023, 6:24pm

bash-3.2# cat /var/adm/messages* | grep -i mpt | tail -10
Sep 23 23:17:57 solaris-10-sparc scsi: [ID 365881 kern.info] /pci@400/pci@0/pci@8/scsi@0 (mpt0):
Sep 23 23:17:57 solaris-10-sparc      mpt0: IOC Operational.
Sep 23 23:19:54 solaris-10-sparc scsi: [ID 243001 kern.warning] WARNING: /pci@400/pci@0/pci@8/scsi@0 (mpt0):
Sep 23 23:19:54 solaris-10-sparc scsi: [ID 107833 kern.warning] WARNING: /pci@400/pci@0/pci@8/scsi@0 (mpt0):
Sep 23 23:19:54 solaris-10-sparc      mpt_cmd_timeout: Restarting HBA
Sep 23 23:20:04 solaris-10-sparc scsi: [ID 365881 kern.info] /pci@400/pci@0/pci@8/scsi@0 (mpt0):
Sep 23 23:20:04 solaris-10-sparc scsi: [ID 365881 kern.info] /pci@400/pci@0/pci@8/scsi@0 (mpt0):
Sep 23 23:20:04 solaris-10-sparc      mpt0 supports power management.
Sep 23 23:20:07 solaris-10-sparc scsi: [ID 365881 kern.info] /pci@400/pci@0/pci@8/scsi@0 (mpt0):
Sep 23 23:20:07 solaris-10-sparc      mpt0: IOC Operational.
bash-3.2#
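
To see how often the HBA is being restarted, the "mpt_cmd_timeout: Restarting HBA" lines can be counted per day. This is a minimal sketch, assuming the message text and the "Mon DD" timestamp prefix shown in the output above. The function name hba_restarts_per_day is made up for this example. On Solaris 10, set AWK=nawk or AWK=/usr/xpg4/bin/awk.

```shell
# Hypothetical helper: reads syslog text on stdin and prints
# "Mon DD count" for each day with at least one mpt HBA restart.
hba_restarts_per_day() {
    grep 'mpt_cmd_timeout: Restarting HBA' |
    ${AWK:-awk} '
        # Fields 1 and 2 are the month and day of the syslog timestamp.
        { n[$1 " " $2]++ }
        END { for (day in n) print day, n[day] }
    ' | sort
}
```

Usage: cat /var/adm/messages* | hba_restarts_per_day. The output is sorted as text, not chronologically. Comparing these dates with the dates of the disk errors may show whether the disk errors coincide with HBA restarts.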

Should it be the system firmware (SysFW 7.4.11)? Would that update the mpt driver as well?
It is a Netra T5440, and I see this on your link: Firmware Downloads and Release History for Oracle Server Systems

MadeInGermany · October 13, 2023, 7:41am

Good question.
It could be included in the general FW patch, or it may have been integrated into the kernel patch. In the latter case it is also included in a "recommended patch set".

The term "reserved" suggests another possible root cause. Do you have LDOMs configured that can access the same disk?

hicksd8 · October 13, 2023, 9:28am

solaris_1977:

analyze> read
Ready to analyze (won't harm SunOS). This takes a long time,
but is interruptable with CTRL-C. Continue? y

        pass 0
   651/2/824

Warning:Drive may be reserved or has been removed, aborting surface analysis.
analyze>

Yes, that is correct. Whether it's a hardware RAID1 or a software RAID1 the system will lock out the mirror so that it cannot be addressed for any other purpose. The RESERVED flag (on the disk mode page) is also used by cluster members to take ownership of a disk.

The RESERVED flag can be reset with a utility but that wouldn't be advisable in a rpool mirror.

What happens if you 'analyze' the primary disk of the pair?
