SVM metastat -- needs maintenance

Running Solaris 9 with SVM. I'm not that familiar with it, but the metastat output gives a "needs maintenance" message on 2 of the mirrors. There are no errors in /var/adm/messages. What do I need to do to fix this error? Thanks.

Just try the metasync(1M) command...

This can happen when the "metasync -r" command does not get executed when the system boots, for example if the system comes up only to single-user mode.

This metasync command is normally executed in one of the startup scripts run at boot time.

For Online: DiskSuite[TM] 1.0, the metasync command is located in the /etc/rc.local script. This entry is placed in that file by the metarc command.

For Solstice DiskSuite versions between 3.x and 4.2, inclusive, the metasync command is located in the /etc/rc2.d/S95SUNWmd.sync file.

For Solstice DiskSuite 4.2.1 and later (including the SVM bundled with Solaris 9), the metasync command is located in the file /etc/rc2.d/S95lvm.sync.

In all cases, because this script is not run until the system transitions into run level 3 (multi-user mode), it is normal to see the submirrors in a "Needs maintenance" state until the command has run. I/O to these metadevices works fine while in this state, so there is no cause for concern.
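
If the system is already in multi-user mode and the mirrors still show "Needs maintenance", you can kick off the resync by hand. Roughly what those boot scripts do (exact script contents vary by release; d10 below is just an example mirror name):

# metasync -r              <- resync all mirrors that need it, as the boot script does
# metasync d10             <- or resync one particular mirror by name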

If that doesn't help, you may be in the situation described in bug 82642.

When trying to run the metasync command, the c1t0d0s0 device reported errors in /var/adm/messages:

Sep 15 09:11:17 bobbob scsi: WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w2100002037f396c9,0 (ssd1):
Sep 15 09:11:17 bobbob Error for Command: read(10) Error Level: Retryable
Sep 15 09:11:17 bobbob scsi: Requested Block: 4057844 Error Block: 4057969
Sep 15 09:11:17 bobbob scsi: Vendor: SEAGATE Serial Number: 0107D1MVCF
Sep 15 09:11:17 bobbob scsi: Sense Key: Media Error
Sep 15 09:11:17 bobbob scsi: ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0xe4
Sep 15 09:11:19 bobbob scsi: WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w2100002037f396c9,0 (ssd1):
Sep 15 09:11:19 bobbob Error for Command: read(10) Error Level: Retryable
Sep 15 09:11:19 bobbob scsi: Requested Block: 4057844 Error Block: 4057969
Sep 15 09:11:19 bobbob scsi: Vendor: SEAGATE Serial Number: 0107D1MVCF
Sep 15 09:11:19 bobbob scsi: Sense Key: Media Error
Sep 15 09:11:19 bobbob scsi: ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0xe4

In this case, the same block is being reported as having problems.

Resolution:

The bad block can be fixed by running format --> analyze --> read on the c1t0d0 disk.

# format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c1t0d0 <SUN36G cyl 24620 alt 2 hd 27 sec 107>
          /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w2100002037f396c9,0
       1. c1t1d0 <SUN36G cyl 24620 alt 2 hd 27 sec 107>
          /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w2100002037f8c663,0
Specify disk (enter its number): 0
selecting c1t0d0
format> analyze
analyze> read
Ready to analyze (won't harm SunOS). This takes a long time,
but is interruptable with CTRL-C. Continue? y

    pass 0

Medium error during read: block 4057969 (0x3deb71) (1404/16/101)
ASC: 0x11 ASCQ: 0x0
Sep 15 09:26:59 bobbob scsi: WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w2100002037f396c9,0 (ssd1):
Sep 15 09:26:59 bobbob Error for Command: read(10) Error Level: Retryable
Sep 15 09:26:59 bobbob scsi: Requested Block: 4057969 Error Block: 4057969
Sep 15 09:26:59 bobbob scsi: Vendor: SEAGATE Serial Number: 0107D1MVCF
Sep 15 09:26:59 bobbob scsi: Sense Key: Media Error
Sep 15 09:26:59 bobbob scsi: ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0xe4
Repairing hard error on 4057969 (1404/16/101)...ok.

24619/26/53

    pass 1

24619/26/53

Total of 1 defective blocks repaired.

Now running metasync completes.

# metasync d10
# metastat d10
d10: Mirror
    Submirror 0: d0
      State: Needs maintenance
    Submirror 1: d1
      State: Okay
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 69078879 blocks

d0: Submirror of d10
    State: Needs maintenance
    Invoke: after replacing "Maintenance" components:
                metareplace d10 c1t0d0s0 <new device>
    Size: 69078879 blocks
    Stripe 0:
        Device     Start Block  Dbase        State Hot Spare
        c1t0d0s0          0     No     Last Erred

d1: Submirror of d10
    State: Okay
    Size: 69078879 blocks
    Stripe 0:
        Device     Start Block  Dbase        State Hot Spare
        c1t1d0s0          0     No           Okay

And then metareplace -e can be executed to re-enable the errored component:

# metareplace -e d10 c1t0d0s0
# metastat d10
d10: Mirror
    Submirror 0: d0
      State: Okay
    Submirror 1: d1
      State: Okay
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 69078879 blocks

d0: Submirror of d10
    State: Okay
    Size: 69078879 blocks
    Stripe 0:
        Device     Start Block  Dbase        State Hot Spare
        c1t0d0s0          0     No           Okay

d1: Submirror of d10
    State: Okay
    Size: 69078879 blocks
    Stripe 0:
        Device     Start Block  Dbase        State Hot Spare
        c1t1d0s0          0     No           Okay

regards pressy

Maybe I misunderstood your post, but here is what I did. It looks like nothing is happening, and I don't see anything in the logs.

# metasync d50
# metastat d50
d50: Mirror
    Submirror 0: d51
      State: Needs maintenance
    Submirror 1: d52
      State: Needs maintenance
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 65431680 blocks (31 GB)

d51: Submirror of d50
    State: Needs maintenance
    Invoke: metareplace d50 c1t0d0s5 <new device>
    Size: 65431680 blocks (31 GB)
    Stripe 0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c1t0d0s5          0     No     Maintenance   Yes


d52: Submirror of d50
    State: Needs maintenance
    Invoke: after replacing "Maintenance" components:
                metareplace d50 c1t1d0s5 <new device>
    Size: 65431680 blocks (31 GB)
    Stripe 0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c1t1d0s5          0     No      Last Erred   Yes


Device Relocation Information:
Device   Reloc  Device ID
c1t0d0   Yes    id1,ssd@w2000000c50568c1d
c1t1d0   Yes    id1,ssd@w2000000c50566da1

It looks to me like you lost a disk: c1t1d0s5. I'll bet that "iostat -En" will confirm that. That format command that pressy shows does look interesting, but I don't like trying to repair a disk. I would replace it.
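
A quick way to check that (just a sketch; the egrep simply pulls out the per-device error counter lines):

# iostat -En | egrep "Errors:|Media Error"

Non-zero Hard Errors or Media Error counts against c1t1d0 would back up the bad-disk theory.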

I don't see anything strange in that output:

Also, my root mirror is complaining; it's mentioned in my original post. Anyhow, how can I be sure 1) that it's a disk failure and 2) which disk I need to replace?

With nothing showing up in iostat -En, now I think it probably isn't a bad disk. So I don't know what to tell you. :confused:

I think you need to give more info - I noticed the ssd devices in one of your outputs.

What type of server? Are these internal drives to the server or in arrays?
What type of arrays (if they are)?

Where are your metadb state databases (found with metadb command with no options)?

What are the failing partitions? What's on the failing partitions (OS only, OS and Applications - and of course, what applications)?

I'm assuming you are using the SVM that comes standard with Solaris 9 - if not, please post the version you are running.

Also, what if anything, was changed before you noticed all of this - reboots, upgrades,...etc.?

And you state there are no errors in the messages file - is syslogd running? Do you normally get error messages on this system? Double-check that you are looking at the correct file for errors by checking syslog.conf.
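
Something along these lines should collect most of that (a sketch; the prtdiag path and the SVM package name are assumptions on my part):

# /usr/platform/`uname -i`/sbin/prtdiag | head    <- server model / hardware summary
# metadb                                          <- location and status of the state database replicas
# pkginfo -l SUNWmdu | grep VERSION               <- DiskSuite/SVM package version
# df -k                                           <- what file systems sit on the affected metadevices
# ps -ef | grep syslogd                           <- is syslogd actually running?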

Sorry for being sparse on the details.

This is a 280R. The drives are internal.

The metadb state databases are on slice 7 of the mirrored disks. Here is the output:

As for the failing partitions, the only indication of failures is the metastat output, the application and OS are running fine. Metastat reports problems with / (d10) and /smarts1 (d50). The application is EMC SMARTS.

I just noticed this issue because I wanted to implement SVM monitoring and happened to do a metastat. We have rebooted this machine a couple of times in the last few months, most recently about 2 weeks ago.

syslogd is running:

And here are the contents of syslog.conf. Let me know if I should provide anything else.

First post: "There are no errors in /var/adm/messages."

Last post: in syslog.conf, the only lines not commented out are

mail.debug /var/log/mail
and
*.emerg;*.alert;*.crit /var/log/syslog

If the devices are giving warnings, those may be lost - suggest you add/change

*.emerg;*.alert;*.crit	/var/log/syslog

to

*.emerg;*.alert;*.crit;*.err;*.warning;*.info	/var/log/syslog

and send a HUP signal to syslogd so it will re-read the config file - then check your /var/log/syslog file for possible errors. That may give you a better read on your issue. I don't see that /var/adm/messages would have had anything in it from syslogd.
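
Note that Solaris syslogd wants a tab, not spaces, between the selector and the file name. To make syslogd re-read the file and prove the new entry actually catches warnings, something like this (the logger line is just a test message):

# pkill -HUP syslogd
# logger -p user.err "syslog selector test"
# tail /var/log/syslog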

Also suggest you get the output of the following and save it

metastat -p
cat md.cf
cat md.tab

The last two files should be in /etc/lvm/
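
For example (just a sketch; /var/tmp is an arbitrary place to park the copies):

# metastat -p > /var/tmp/metastat-p.out
# cp /etc/lvm/md.cf /etc/lvm/md.tab /var/tmp/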

Yep, it doesn't make much sense at all. At first I looked in /var/adm/messages, and that's why I posted that path in my original post, but then later today I saw that the application admin had changed the log file to /var/log/syslog. Anyhow, I'll try your suggestions tomorrow and update here. Thanks for your help.

I edited the /etc/syslog.conf file and sent a HUP signal to the syslogd process. I am now getting all messages going to /var/log/syslog. However, I'm still not getting any output related to the volumes that need maintenance.

It's hard to know what to suggest because I don't understand how the box arrived in its current state. That syslog.conf thing scares me. I guess I would first verify that I have good backups. Then I would look at the two disks with prtvtoc to ensure that they are partitioned identically. Then I would look at the special files for the disks to make sure that no one replaced them with text files or something. If the disks are partitioned correctly, no hardware errors are known, and the special files really point to the devices, then it has to be OK to attempt a resync. Or at least, I think so. So I would cross my fingers and try:
metareplace -e d50 c1t0d0s5

No moneyback guarantees. Objects in mirror may be closer than they seem. Packed by weight, not by volume. Your results may vary. etc...
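
For the prtvtoc and device-node checks above, something like this would do (slice and disk names taken from the metastat output earlier in the thread; s2 is the conventional whole-disk slice):

# prtvtoc /dev/rdsk/c1t0d0s2 > /tmp/vtoc.c1t0d0
# prtvtoc /dev/rdsk/c1t1d0s2 > /tmp/vtoc.c1t1d0
# diff /tmp/vtoc.c1t0d0 /tmp/vtoc.c1t1d0
# ls -lL /dev/dsk/c1t0d0s5 /dev/dsk/c1t1d0s5      <- both should be block special files, not plain files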

"saw that the application admin changed the log file to /var/log/syslog"

That's the scary part.

Suggest a call to SUN is in order for your issue - I've never seen such a problem and cannot find anything on SunSolve showing this type of issue.

I called SUN. They requested the logs (which had nothing), the output of format, and iostat -En.

They then suggested (as Perderabo did) running the following:

metareplace -e d50 c1t0d0s5
metareplace -e d50 c1t1d0s5

metareplace -e d10 c1t0d0s0
metareplace -e d10 c1t1d0s0

The disks are now in the "okay" state. SUN did not have a solid explanation as to why the disks went into the maintenance state, but said it doesn't look like a hardware failure.
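
To double-check that nothing is still flagged, a quick grep does the job:

# metastat | grep -i maint

No output means every submirror is back in the Okay state.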