Replaced HDD in RAID-z pool & now 2 LDOMs won't boot

I could use some advice as I have never run into this before.

Oracle T4-4 server Solaris 10 SPARC
8 HDD, first 2 for OS (rpool), next 6 for LDOMs (ldompool)
10 LDOMs running Solaris 11.3 & Solaris 10

There were fatal errors in the messages log about a specific drive in the system, so we decided to replace it. I shut down all the LDOMs on ldompool & found a "repurposed" HDD of the same type to replace the one mentioned in the logs. I zpool detached the drive from ldompool, replaced the drive with the spare, & issued a zpool replace ldompool olddrive newdrive (obviously those were the actual drive addresses in the real command).
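For reference, the core of the sequence looked roughly like the sketch below; the device names here are placeholders, not the real drive addresses:

zpool status -v ldompool                # confirm which drive ZFS has flagged
zpool replace ldompool c0t2d0 c0t3d0    # replace the old drive with the new one (placeholder names)
zpool status ldompool                   # resilver progress is reported on the scan: line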
The server worked for approx. 80 hours resilvering the zpool to replace the drive. It did finish, with errors that I captured, & it looked like all the errors were in the 2 LDOMs for the DBs. The other 8 LDOMs in ldompool seem to come up just fine, but the two DB LDOMs both complain about:

WARNING: /virtual-devices@100/channel-devices@200/disk@0: Timeout receiving packet from LDC ... retrying
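(For anyone who wants to see exactly which files the resilver flagged, the permanent-error list comes from something like the command below; the pool name matches ours, but the paths in the real output will of course differ:)

zpool status -v ldompool    # output ends with "Permanent errors have been detected in the following files:"
                            # the files/zvols listed there are the ones to map back to the affected LDOM vdisks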

Before anyone asks: no, there was no backup performed prior to the drive replacement, as it was supposed to be a simple replacement & we suspected there were already data integrity issues, but all 10 of the LDOMs did run. We do have a snapshot from a few years ago, but it would be nice if we didn't have to go back that far.

Any LDOM / zpool experts out there that have run into this?

I have no direct experience with this; I've never had a faulty zpool disk.
From the man page I get:

zpool replace [-f] pool old_device [new_device]

Replaces old_device with new_device. This is equivalent to attaching new_device, waiting for it to resilver, and then detaching old_device.

The size of new_device must be greater than or equal to the minimum size of all the devices in a mirror or raidz configuration.

new_device is required if the pool is not redundant. If new_device is not specified, it defaults to old_device. This form of replacement is useful after an existing disk has failed and has been physically replaced. In this case, the new disk may have the same /dev/dsk path as the old device, even though it is actually a different disk. ZFS recognizes this.

In zpool status output, the old_device is shown under the word replacing with the string /old appended to it. Once the resilver completes, both the replacing and the old_device are automatically removed. If the new device fails before the resilver completes and a third device is installed in its place, then both failed devices will show up with /old appended, and the resilver starts over again. After the resilver completes, both /old devices are removed along with the word replacing.

-f

Forces use of new_device, even if it appears to be in use. Not all devices can be overridden in this manner.
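So, for example, if the new disk went into the same slot as the old one, a minimal sketch per that man page would be (pool and device names illustrative only):

zpool replace ldompool c0t2d0        # new disk at the same /dev/dsk path as the failed one
zpool replace -f ldompool c0t2d0     # only if it complains that the "repurposed" disk appears to be in use
zpool status ldompool                # old device shows under "replacing" with /old until the resilver finishes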

So, maybe you detached the old disk too early?
Was the zpool status DEGRADED or already beyond it?
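A quick way to check the current state (the -x form only reports pools with problems):

zpool status -x             # prints "all pools are healthy" or a summary of the troubled pools
zpool status ldompool       # per-vdev state: ONLINE / DEGRADED / FAULTED, plus READ/WRITE/CKSUM counters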

" Was the zpool status DEGRADED or already beyond it?"
I honestly don't remember what it said. I know in the message log there were tons of fatal errors about the drive in question.

Since it is basically a ZFS RAID 5 (raidz), you "should" be able to replace any drive, failed or not, & once it resilvers the pool should be back online.
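As I understand it, a scrub after the resilver is the way to confirm whether the redundancy actually came back clean; roughly (assuming ldompool is the pool in question):

zpool scrub ldompool        # re-reads everything in the pool and verifies checksums/parity
zpool status -v ldompool    # once the scrub finishes, any unrecoverable files are listed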

I am really not sure why it failed, as I have replaced drives before in other situations but never had issues like this.

Can you try to stop/unbind and then bind the LDOMs which fail to boot?
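Something along these lines; the domain name is just an example, use the LDOMs that will not boot:

ldm stop-domain dbldom1      # clean stop; -f forces it if the domain hangs
ldm unbind-domain dbldom1
ldm bind-domain dbldom1
ldm start-domain dbldom1
ldm list -l dbldom1          # check state and the vdisk/vds bindings afterwards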

Regards
Peasant.

The following article suggests a problem with power management and a workaround:
https://docs.oracle.com/cd/E19604-01/821-0404/auto42/index.html
Anyway, I would contact Oracle support.

"Can you try stop/unbind then bind a ldoms which fail to work ?"
Thanks for the suggestion, but I've done that already with no change.

I saw the article about the power outage issue & tried reset-all at the ok prompt.
Also no change.

How about the output of:

zpool status -v <affectedzpool>
ldm list -l <nonbootable ldom>
ldm list-services

Regards.
