SVM metastat -- needs maintenance

Running Solaris 9 with SVM. I'm not that familiar with it, but the metastat output gives a "needs maintenance" message on 2 of the mirrors. There are no errors in /var/adm/messages. What do I need to do to fix this error? Thanks.

Just try the metasync(1M) command...

This can happen when the "metasync -r" command does not get executed when the system boots, for example if the system comes up only to single-user mode.

This metasync command is normally executed in one of the startup scripts run at boot time.

For Online: DiskSuite[TM] 1.0, the metasync command is located in the /etc/rc.local script. This entry is placed in that file by the metarc command.

For Solstice DiskSuite versions between 3.x and 4.2, inclusive, the metasync command is located in the /etc/rc2.d/S95SUNWmd.sync file.

For Solstice DiskSuite 4.2.1 and later (including the SVM bundled with Solaris 9), the metasync command is located in the file /etc/rc2.d/S95lvm.sync.

In all cases, because this script is not run until the system transitions into run level 3 (multi-user mode), it is normal to see the submirrors in a "Needs maintenance" state until the command has run. I/O to these metadevices works fine while in this state, so there is no cause for concern.
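
If the system is already in multi-user mode and the mirrors still show "Needs maintenance", you can kick off the resync by hand. Roughly what those boot scripts do (exact script contents vary by release; d10 below is just an example mirror name):

# metasync -r              <- resync all mirrors that need it, as the boot script does
# metasync d10             <- or resync one particular mirror by name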

If that doesn't help, you may be in the situation described in bug 82642.

When trying to run the metasync command, the c1t0d0s0 device reported errors in /var/adm/messages:

Sep 15 09:11:17 bobbob scsi: WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w2100002037f396c9,0 (ssd1):
Sep 15 09:11:17 bobbob Error for Command: read(10) Error Level: Retryable
Sep 15 09:11:17 bobbob scsi: Requested Block: 4057844 Error Block: 4057969
Sep 15 09:11:17 bobbob scsi: Vendor: SEAGATE Serial Number: 0107D1MVCF
Sep 15 09:11:17 bobbob scsi: Sense Key: Media Error
Sep 15 09:11:17 bobbob scsi: ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0xe4
Sep 15 09:11:19 bobbob scsi: WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w2100002037f396c9,0 (ssd1):
Sep 15 09:11:19 bobbob Error for Command: read(10) Error Level: Retryable
Sep 15 09:11:19 bobbob scsi: Requested Block: 4057844 Error Block: 4057969
Sep 15 09:11:19 bobbob scsi: Vendor: SEAGATE Serial Number: 0107D1MVCF
Sep 15 09:11:19 bobbob scsi: Sense Key: Media Error
Sep 15 09:11:19 bobbob scsi: ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0xe4

In this case, the same block is being reported as having problems.

Resolution:

The bad block can be fixed by running format --> analyze --> read on the c1t0d0 disk.

# format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c1t0d0 <SUN36G cyl 24620 alt 2 hd 27 sec 107>
          /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w2100002037f396c9,0
       1. c1t1d0 <SUN36G cyl 24620 alt 2 hd 27 sec 107>
          /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w2100002037f8c663,0
Specify disk (enter its number): 0
selecting c1t0d0
format> analyze
analyze> read
Ready to analyze (won't harm SunOS). This takes a long time,
but is interruptable with CTRL-C. Continue? y

    pass 0

Medium error during read: block 4057969 (0x3deb71) (1404/16/101)
ASC: 0x11 ASCQ: 0x0
Sep 15 09:26:59 bobbob scsi: WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w2100002037f396c9,0 (ssd1):
Sep 15 09:26:59 bobbob Error for Command: read(10) Error Level: Retryable
Sep 15 09:26:59 bobbob scsi: Requested Block: 4057969 Error Block: 4057969
Sep 15 09:26:59 bobbob scsi: Vendor: SEAGATE Serial Number: 0107D1MVCF
Sep 15 09:26:59 bobbob scsi: Sense Key: Media Error
Sep 15 09:26:59 bobbob scsi: ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0xe4
Repairing hard error on 4057969 (1404/16/101)...ok.

24619/26/53

    pass 1

24619/26/53

Total of 1 defective blocks repaired.

Now running metasync completes.

# metasync d10
# metastat d10
d10: Mirror
    Submirror 0: d0
      State: Needs maintenance
    Submirror 1: d1
      State: Okay
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 69078879 blocks

d0: Submirror of d10
    State: Needs maintenance
    Invoke: after replacing "Maintenance" components:
                metareplace d10 c1t0d0s0 <new device>
    Size: 69078879 blocks
    Stripe 0:
        Device     Start Block  Dbase        State Hot Spare
        c1t0d0s0          0     No     Last Erred

d1: Submirror of d10
    State: Okay
    Size: 69078879 blocks
    Stripe 0:
        Device     Start Block  Dbase        State Hot Spare
        c1t1d0s0          0     No           Okay

And then metareplace -e can be executed to re-enable the errored component:

# metareplace -e d10 c1t0d0s0
# metastat d10
d10: Mirror
    Submirror 0: d0
      State: Okay
    Submirror 1: d1
      State: Okay
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 69078879 blocks

d0: Submirror of d10
    State: Okay
    Size: 69078879 blocks
    Stripe 0:
        Device     Start Block  Dbase        State Hot Spare
        c1t0d0s0          0     No           Okay

d1: Submirror of d10
    State: Okay
    Size: 69078879 blocks
    Stripe 0:
        Device     Start Block  Dbase        State Hot Spare
        c1t1d0s0          0     No           Okay

regards pressy

Maybe I misunderstood your post, but here is what I did. It looks like nothing is happening, and I don't see anything in the logs.

# metasync d50
# metastat d50
d50: Mirror
    Submirror 0: d51
      State: Needs maintenance
    Submirror 1: d52
      State: Needs maintenance
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 65431680 blocks (31 GB)

d51: Submirror of d50
    State: Needs maintenance
    Invoke: metareplace d50 c1t0d0s5 <new device>
    Size: 65431680 blocks (31 GB)
    Stripe 0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c1t0d0s5          0     No     Maintenance   Yes


d52: Submirror of d50
    State: Needs maintenance
    Invoke: after replacing "Maintenance" components:
                metareplace d50 c1t1d0s5 <new device>
    Size: 65431680 blocks (31 GB)
    Stripe 0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c1t1d0s5          0     No      Last Erred   Yes


Device Relocation Information:
Device   Reloc  Device ID
c1t0d0   Yes    id1,ssd@w2000000c50568c1d
c1t1d0   Yes    id1,ssd@w2000000c50566da1

It looks to me like you lost a disk: c1t1d0s5. I'll bet that "iostat -En" will confirm that. That format command that pressy shows does look interesting, but I don't like trying to repair a disk. I would replace it.
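
A quick way to check that (just a sketch; the egrep simply pulls out the per-device error counter lines):

# iostat -En | egrep "Errors:|Media Error"

Non-zero Hard Errors or Media Error counts against c1t1d0 would back up the bad-disk theory.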

I don't see anything strange in that output:

Also, my root mirror is complaining; it's mentioned in my original post. Anyhow, how can I be sure 1) that it's a disk failure and 2) which disk I need to replace?

With nothing showing up in iostat -En, now I think it probably isn't a bad disk. So I don't know what to tell you. :confused:

I think you need to give more info - I noticed the ssd devices in one of your outputs.

What type of server? Are these internal drives to the server or in arrays?
What type of arrays (if they are)?

Where are your metadb state databases (found with metadb command with no options)?

What are the failing partitions? What's on the failing partitions (OS only, OS and Applications - and of course, what applications)?

I'm assuming you are using the SVM that comes standard with Solaris 9 - if not, please post the version you are running.

Also, what if anything, was changed before you noticed all of this - reboots, upgrades,...etc.?

And you state there are no errors in the messages file - is syslogd running? Do you normally get error messages on this system? Double-check that you are looking at the correct file for errors by checking syslog.conf.
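
Something along these lines should collect most of that (a sketch; the prtdiag path and the SVM package name are assumptions on my part):

# /usr/platform/`uname -i`/sbin/prtdiag | head    <- server model / hardware summary
# metadb                                          <- location and status of the state database replicas
# pkginfo -l SUNWmdu | grep VERSION               <- DiskSuite/SVM package version
# df -k                                           <- what file systems sit on the affected metadevices
# ps -ef | grep syslogd                           <- is syslogd actually running?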

Sorry for being sparse on the details.

This is a 280R. The drives are internal.

The metadb state databases are on slice 7 of the mirrored disks. Here is the output:

As for the failing partitions, the only indication of failures is the metastat output, the application and OS are running fine. Metastat reports problems with / (d10) and /smarts1 (d50). The application is EMC SMARTS.

I just noticed this issue because I wanted to implement SVM monitoring and happened to do a metastat. We have rebooted this machine a couple of times in the last few months, most recently about 2 weeks ago.

syslogd is running:

And here are the contents of syslog.conf. Let me know if I should provide anything else.

First post: "There are no errors in /var/adm/messages."

Last post: in syslog.conf, the only lines not commented out are

mail.debug /var/log/mail
and
*.emerg;*.alert;*.crit /var/log/syslog

If the devices are giving warnings, those may be lost - suggest you add/change

*.emerg;*.alert;*.crit	/var/log/syslog

to

*.emerg;*.alert;*.crit;*.err;*.warning;*.info	/var/log/syslog

and send a HUP signal to syslogd so it will re-read the config file - then check your /var/log/syslog file for possible errors. That may give you a better read on your issue. I don't see that /var/adm/messages would have had anything in it from syslogd.
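
Note that Solaris syslogd wants a tab, not spaces, between the selector and the file name. To make syslogd re-read the file and prove the new entry actually catches warnings, something like this (the logger line is just a test message):

# pkill -HUP syslogd
# logger -p user.err "syslog selector test"
# tail /var/log/syslog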

Also suggest you get the output of the following and save it

metastat -p
cat md.cf
cat md.tab

The last two files should be in /etc/lvm/
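
For example (just a sketch; /var/tmp is an arbitrary place to park the copies):

# metastat -p > /var/tmp/metastat-p.out
# cp /etc/lvm/md.cf /etc/lvm/md.tab /var/tmp/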

Yep, it doesn't make much sense at all. At first I looked in /var/adm/messages, and that's why I posted that path in my original post, but then later today I saw that the application admin had changed the log file to /var/log/syslog. Anyhow, I'll try your suggestions tomorrow and update here. Thanks for your help.

I edited the /etc/syslog.conf file and sent a HUP signal to the syslogd process. I am now getting all messages going to /var/log/syslog. However, I'm still not getting any output related to the volumes that need maintenance.

It's hard to know what to suggest because I don't understand how the box arrived in its current state. That syslog.conf thing scares me. I guess I would first verify that I have good backups. Then I would look at the two disks with prtvtoc to ensure that they are partitioned identically. Then I would look at the special files for the disks to make sure that no one replaced them with text files or something. If the disks are partitioned correctly, no hardware errors are known, and the special files really point to the devices, then it has to be OK to attempt a resync. Or at least, I think so. So I would cross my fingers and try:
metareplace -e d50 c1t0d0s5

No moneyback guarantees. Objects in mirror may be closer than they seem. Packed by weight, not by volume. Your results may vary. etc...
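
For the prtvtoc and device-node checks above, something like this would do (slice and disk names taken from the metastat output earlier in the thread; s2 is the conventional whole-disk slice):

# prtvtoc /dev/rdsk/c1t0d0s2 > /tmp/vtoc.c1t0d0
# prtvtoc /dev/rdsk/c1t1d0s2 > /tmp/vtoc.c1t1d0
# diff /tmp/vtoc.c1t0d0 /tmp/vtoc.c1t1d0
# ls -lL /dev/dsk/c1t0d0s5 /dev/dsk/c1t1d0s5      <- both should be block special files, not plain files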

"saw that the application admin changed the log file to /var/log/syslog"

That's the scary part.

Suggest a call to SUN is in order for your issue - I've never seen such a problem and cannot find anything on SunSolve showing this type of issue.

I called SUN. They requested the logs (which had nothing), the output of format, and iostat -En.

They then suggested (as Perderabo did) running the following:

metareplace -e d50 c1t0d0s5
metareplace -e d50 c1t1d0s5

metareplace -e d10 c1t0d0s0
metareplace -e d10 c1t1d0s0

The disks are now in the "okay" state. SUN did not have a solid explanation as to why the disks went into the maintenance state, but said it doesn't look like a hardware failure.
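
To double-check that nothing is still flagged, a quick grep does the job:

# metastat | grep -i maint

No output means every submirror is back in the Okay state.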