Unable to access /vol1

cerco · December 26, 2013, 10:11am

Hi team,
I'm new to Solaris and I'm French, so my English is not very good, but I'll try to explain my problem.
I have a Sun Fire X4170 server running Solaris 10.
Since last week I have not been able to access /vol1. Below are the warning messages displayed while the server boots:

 WARNING: /pci@0,0/pci8086,340a@3/pci108e,286@0/disk@1,0 (sd2):
          Error for Command: read                    Error Level: Fatal
          Requested Block: 167762                    Error Block: 167762
          Vendor: Sun                                Serial Number:             
          Sense Key: Hardware Error
          ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
  /dev/rdsk/c0t1d0s0: CANNOT READ: DISK BLOCK 135632: I/O error
  /dev/rdsk/c0t1d0s0: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
   
  THE FOLLOWING FILE SYSTEM(S) HAD AN UNEXPECTED INCONSISTENCY: /dev/rdsk/c0t1d0s0 (/vol1)
  fsckall failed with exit code 1.
   
  WARNING - Unable to repair one or more filesystems.
  Run fsck manually (fsck filesystem...).
   
  mount: Please run fsck and try again
  svc:/system/filesystem/local:default: WARNING: /sbin/mountall -l failed: exit status 1
  Reading ZFS config: done.
  Dec 19 00:11:05 svc.startd[7]: svc:/system/filesystem/local:default: Method "/lib/svc/method/fs-local" failed with exit status 95.
  Dec 19 00:11:05 svc.startd[7]: system/filesystem/local:default failed fatally: transitioned to maintenance (see 'svcs -xv' for details)
   
  MYSERVER console login:

I'm stuck and don't know what I can do to fix this problem.
Can someone please help me resolve it?

Thanks in advance

bartus11 · December 26, 2013, 10:20am

Show us:

echo | format
cat /etc/vfstab
mount
metastat -a
zpool status

cerco · December 26, 2013, 9:42pm

Hi bartus11,
Thank you for your reply. Please find below everything you asked for:

#echo | format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c0t0d0 <Sun    -STK RAID INT   -V1.0 cyl 36348 alt 2 hd 255 sec 63>
          /pci@0,0/pci8086,340a@3/pci108e,286@0/disk@0,0
       1. c0t1d0 <DEFAULT cyl 54627 alt 2 hd 255 sec 126>
          /pci@0,0/pci8086,340a@3/pci108e,286@0/disk@1,0
Specify disk (enter its number): Specify disk (enter its number): 
#

# cat /etc/vfstab
#device         device          mount           FS      fsck    mount   mount
#to mount       to fsck         point           type    pass    at boot options
#
fd      -       /dev/fd fd      -       no      -
/proc   -       /proc   proc    -       no      -
/dev/dsk/c0t0d0s0       -       -       swap    -       no      -
/dev/dsk/c0t0d0s1       /dev/rdsk/c0t0d0s1      /       ufs     1       no      -
/dev/dsk/c0t0d0s3       /dev/rdsk/c0t0d0s3      /usr    ufs     1       no      -
/dev/dsk/c0t0d0s4       /dev/rdsk/c0t0d0s4      /var    ufs     1       no      -
/dev/dsk/c0t0d0s5       /dev/rdsk/c0t0d0s5      /opt    ufs     2       yes     -
/dev/dsk/c0t1d0s0       /dev/rdsk/c0t1d0s0      /vol1   ufs     2       yes     -
/dev/dsk/c0t1d0s1       /dev/rdsk/c0t1d0s1      /vol2   ufs     2       yes     -
/devices        -       /devices        devfs   -       no      -
sharefs -       /etc/dfs/sharetab       sharefs -       no      -
ctfs    -       /system/contract        ctfs    -       no      -
objfs   -       /system/object  objfs   -       no      -
swap    -       /tmp    tmpfs   -       yes     -
#

# mount
/ on /dev/dsk/c0t0d0s1 read/write/setuid/devices/intr/largefiles/logging/xattr/onerror=panic/dev=800041 on Thu Dec 19 00:01:27 2013
/devices on /devices read/write/setuid/devices/dev=47c0000 on Thu Dec 19 00:01:11 2013
/system/contract on ctfs read/write/setuid/devices/dev=4800001 on Thu Dec 19 00:01:11 2013
/proc on proc read/write/setuid/devices/dev=4840000 on Thu Dec 19 00:01:11 2013
/etc/mnttab on mnttab read/write/setuid/devices/dev=4880001 on Thu Dec 19 00:01:11 2013
/etc/svc/volatile on swap read/write/setuid/devices/xattr/dev=48c0001 on Thu Dec 19 00:01:11 2013
/system/object on objfs read/write/setuid/devices/dev=4900001 on Thu Dec 19 00:01:11 2013
/etc/dfs/sharetab on sharefs read/write/setuid/devices/dev=4940001 on Thu Dec 19 00:01:11 2013
/usr on /dev/dsk/c0t0d0s3 read/write/setuid/devices/intr/largefiles/logging/xattr/onerror=panic/dev=800043 on Thu Dec 19 00:01:27 2013
/lib/libc.so.1 on /usr/lib/libc/libc_hwcap1.so.1 read/write/setuid/devices/dev=800043 on Thu Dec 19 00:01:27 2013
/dev/fd on fd read/write/setuid/devices/dev=4ac0001 on Thu Dec 19 00:01:27 2013
/var on /dev/dsk/c0t0d0s4 read/write/setuid/devices/intr/largefiles/logging/xattr/onerror=panic/dev=800044 on Thu Dec 19 00:01:29 2013
/tmp on swap read/write/setuid/devices/xattr/dev=48c0002 on Thu Dec 19 00:01:29 2013
/var/run on swap read/write/setuid/devices/xattr/dev=48c0003 on Thu Dec 19 00:01:29 2013
/opt on /dev/dsk/c0t0d0s5 read/write/setuid/devices/intr/largefiles/logging/xattr/onerror=panic/dev=800045 on Thu Dec 19 00:11:05 2013
/vol2 on /dev/dsk/c0t1d0s1 read/write/setuid/devices/intr/largefiles/logging/xattr/onerror=panic/dev=800081 on Thu Dec 19 00:11:05 2013
#

# metastat -a
metastat: MYSERVER: there are no existing databases
#

# zpool status
no pools available
#

Thanks once again

MadeInGermany · December 27, 2013, 2:38am

The disk c0t1d0 (kernel driver name sd2) is broken:
it has an unreadable sector at block 167762.
Replace the disk!
The new disk needs identical (or similar) partitions and new file systems, and the data for /vol1 and /vol2 must be restored from the last backup.
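
As a rough cross-check, the two block numbers in the boot messages (167762 in the sd2 warning, 135632 in the fsck message) can be compared against the geometry that format reports for c0t1d0 (hd 255 sec 126). This is only a sketch: it assumes both numbers are counted in 512-byte sectors, the first from the start of the disk and the second from the start of slice 0. The variable names are made up for illustration.

```shell
# Geometry of c0t1d0 as reported by format: hd 255 sec 126
heads=255
sectors_per_track=126
sectors_per_cyl=$((heads * sectors_per_track))

# Block numbers copied from the boot messages
requested_block=167762   # from the sd2 WARNING (assumed relative to the whole disk)
fsck_block=135632        # from "CANNOT READ: DISK BLOCK" (assumed relative to slice 0)

offset=$((requested_block - fsck_block))

echo "sectors per cylinder: $sectors_per_cyl"
echo "offset between the two block numbers: $offset"
```

Both values come out as 32130, which would be consistent with slice 0 starting at cylinder 1. If that assumption holds, the kernel warning and the fsck message point at the same bad spot rather than at two separate ones.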

cerco · December 27, 2013, 8:44am

Hi,
Is there really no other way to fix this or to repair this sector?
I'm sorry to say that we have no backup of this volume,
and I don't know whether the disks are configured as RAID 5, in which case I could simply replace one disk.

Please help

bartus11 · December 27, 2013, 8:54am

How many disks do you see physically in the server? Also post the output of:

raidctl -S

rbatte1 · December 27, 2013, 12:15pm

You have a simple disk, c0t1d0, but I don't think that this necessarily means you have a failed device. The filesystem /vol2 is on the same disk c0t1d0s1, and that has mounted okay.

Can you run fsck on the command line? Something like this, I think:-

fsck /dev/rdsk/c0t1d0s0

Robin
Liverpool/Blackburn
UK

jlliagre · December 27, 2013, 12:54pm

I doubt there really is a single disk as it is reported as "STK RAID INT", which means there is likely hardware raid behind it.

Unfortunately, having no backup and using UFS instead of ZFS is not the best way to guard against such issues ...

MadeInGermany · December 27, 2013, 1:14pm

In format, select c0t1d0 and run inquiry to check whether it's a simple disk.
Then run analyze on it, using the non-destructive read test.
It will 'repair' bad sectors, i.e. tell the controller to replace them with spare sectors. The contents of the 'repaired' sectors are unknown, so run fsck afterwards (as Robin suggested) to at least ensure file system integrity.

cerco · December 30, 2013, 4:01pm

Good morning team,
I can physically see 8 disks of 300 GB each in the server. Below is the result of the fsck command:

# fsck /dev/dsk/c0t1d0s0
** /dev/rdsk/c0t1d0s0
** Last Mounted on /vol1
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
WARNING: /pci@0,0/pci8086,340a@3/pci108e,286@0/disk@1,0 (sd2):
        Error for Command: read                    Error Level: Retryable
        Requested Block: 167762                    Error Block: 167762
        Vendor: Sun                                Serial Number:             
        Sense Key: Hardware Error
        ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
WARNING: /pci@0,0/pci8086,340a@3/pci108e,286@0/disk@1,0 (sd2):
        Error for Command: read                    Error Level: Retryable
        Requested Block: 167762                    Error Block: 167762
        Vendor: Sun                                Serial Number:             
        Sense Key: Hardware Error
        ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
WARNING: /pci@0,0/pci8086,340a@3/pci108e,286@0/disk@1,0 (sd2):
        Error for Command: read                    Error Level: Retryable
        Requested Block: 167762                    Error Block: 167762
        Vendor: Sun                                Serial Number:             
        Sense Key: Hardware Error
        ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
WARNING: /pci@0,0/pci8086,340a@3/pci108e,286@0/disk@1,0 (sd2):
        Error for Command: read                    Error Level: Retryable
        Requested Block: 167762                    Error Block: 167762
        Vendor: Sun                                Serial Number:             
        Sense Key: Hardware Error
        ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
WARNING: /pci@0,0/pci8086,340a@3/pci108e,286@0/disk@1,0 (sd2):
        Error for Command: read                    Error Level: Retryable
        Requested Block: 167762                    Error Block: 167762
        Vendor: Sun                                Serial Number:             
        Sense Key: Hardware Error
        ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
WARNING: /pci@0,0/pci8086,340a@3/pci108e,286@0/disk@1,0 (sd2):
        Error for Command: read                    Error Level: Fatal
        Requested Block: 167762                    Error Block: 167762
        Vendor: Sun                                Serial Number:             
        Sense Key: Hardware Error
        ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0

CANNOT READ: DISK BLOCK 135632: I/O error
CONTINUE? 
#

but when I type this command, nothing happens:
raidctl -S

rbatte1 · December 31, 2013, 7:17am

To jlliagre,

My assumption that this is a simple device is based on the following output that was provided:

AVAILABLE DISK SELECTIONS:
       0. c0t0d0 <Sun    -STK RAID INT   -V1.0 cyl 36348 alt 2 hd 255 sec 63>
          /pci@0,0/pci8086,340a@3/pci108e,286@0/disk@0,0
       1. c0t1d0 <DEFAULT cyl 54627 alt 2 hd 255 sec 126>
          /pci@0,0/pci8086,340a@3/pci108e,286@0/disk@1,0

The STK RAID device is c0t0d0, whereas the problem is on c0t1d0.

This appears to be either a disk with some bad blocks or a corrupt filesystem. The whole disk is not broken (yet).

To cerco,

You say you have 8 physical disks. That's good to know, but how are they arranged? Are perhaps 7 in a RAID and one is not? From the cylinder numbers, it would almost suggest that you have 5 in a RAID at target zero and 3 as a simple LUN (no protection) at target one, but I can't be sure on the numbers.

I'm guessing that there must be a management tool for the array somewhere, hopefully not part of the server OS, else how would one boot first time to allocate the array? What does that tell you about the arrangement of the disks/LUNs?

What output do you get from MadeInGermany's suggestion to analyse the LUN? It will take a while to run. It might just be that we have to use fsck and read an alternate superblock, but let's not go that way just yet. It's probably best to find out what we can first before taking action.

Thanks,
Robin
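
For what it's worth, the cylinder numbers can be turned into approximate sizes. The sketch below multiplies out the cyl/hd/sec figures from the format listing, assuming 512-byte sectors and ignoring the alternate cylinders; `geometry_bytes` is just a helper name made up here.

```shell
# Size in bytes from a cyl/hd/sec geometry, assuming 512-byte sectors
geometry_bytes() {
  echo $(( $1 * $2 * $3 * 512 ))
}

# Figures copied from the format listing
c0t0d0_bytes=$(geometry_bytes 36348 255 63)
c0t1d0_bytes=$(geometry_bytes 54627 255 126)

echo "c0t0d0: $((c0t0d0_bytes / 1000000000)) GB"
echo "c0t1d0: $((c0t1d0_bytes / 1000000000)) GB"
```

This prints about 298 GB for c0t0d0 and about 898 GB for c0t1d0, i.e. roughly one and three 300 GB disks' worth of usable space. How many physical disks sit behind each LUN depends on the RAID level, which the geometry alone does not reveal.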

cerco · December 31, 2013, 9:33am

Hi MadeInGermany,
Sorry, but I don't quite understand what you are asking me to do. Could you please tell me exactly which commands I need to type to get what you need?
Solaris is not my strong point.

Thanks

rbatte1 · December 31, 2013, 10:09am

I think you need to start the format utility, without the echo | on the front. Just run format on the command line.

It will take you into an interactive disk management session. Select option 1 which should be for the disk in question, c0t1d0 and it should present you a menu of actions you can take. It's been too many years (when Solaris 2.6 was current) to recall what to pick next (it might even be analyze on that menu) but if it's not obvious, paste the menu into the thread to jog my memory. Make sure you pick the read-only / non-destructive test.

Robin

jlliagre · December 31, 2013, 11:20am

@rbatte1 My mistake, you are correct, that's a single disk.

@cerco, you can run "iostat -En" to get some information about your disk. You can also try running "fsck -y /dev/rdsk/c0t1d0s0" to see if it manages to fix your file system.

cerco · December 31, 2013, 4:03pm

@jlliagre please find below:

# iostat -En
c1t0d0           Soft Errors: 2 Hard Errors: 0 Transport Errors: 0 
Vendor: TSSTcorp Product: CDDVDW TS-T633A  Revision: SR00 Serial No:  
Size: 0.00GB <0 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 2 Predictive Failure Analysis: 0 
c0t0d0           Soft Errors: 2 Hard Errors: 0 Transport Errors: 0 
Vendor: Sun      Product: STK RAID INT     Revision: V1.0 Serial No:  
Size: 299.56GB <299563482624 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 2 Predictive Failure Analysis: 0 
c0t1d0           Soft Errors: 2 Hard Errors: 54 Transport Errors: 0 
Vendor: Sun      Product: STK RAID INT     Revision: V1.0 Serial No:  
Size: 898.71GB <898705128960 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 2 Predictive Failure Analysis: 0 
#

# fsck -y /dev/dsk/c0t1d0s0
WARNING: /pci@0,0/pci8086,340a@3/pci108e,286@0/disk@1,0 (sd2):
        Error for Command: write                   Error Level: Retryable
        Requested Block: 167775                    Error Block: 167775
        Vendor: Sun                                Serial Number:             
        Sense Key: Hardware Error
        ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
WARNING: /pci@0,0/pci8086,340a@3/pci108e,286@0/disk@1,0 (sd2):
        Error for Command: write                   Error Level: Fatal
        Requested Block: 167775
** Phase 3a - Check Connectivity
** Phase 3b - Verify Shadows/ACLs
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cylinder Groups
FILESYSTEM MAY STILL BE INCONSISTENT.
107653 files, 247345348 used, 269015561 free (3913 frags, 33626456 blocks, 0.0% fragmentation)

***** FILE SYSTEM WAS MODIFIED *****
***** FILE SYSTEM IS BAD *****

***** PLEASE RERUN FSCK *****
#

@rbatte1 please find below:

# format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
       0. c0t0d0 <Sun    -STK RAID INT   -V1.0 cyl 36348 alt 2 hd 255 sec 63>
          /pci@0,0/pci8086,340a@3/pci108e,286@0/disk@0,0
       1. c0t1d0 <DEFAULT cyl 54627 alt 2 hd 255 sec 126>
          /pci@0,0/pci8086,340a@3/pci108e,286@0/disk@1,0
Specify disk (enter its number): 1
selecting c0t1d0
[disk formatted]
Warning: Current Disk has mounted partitions.
/dev/dsk/c0t1d0s0 is normally mounted on /vol1 according to /etc/vfstab. Please remove this entry to use this device.
/dev/dsk/c0t1d0s1 is currently mounted on /vol2. Please see umount(1M).


FORMAT MENU:
        disk       - select a disk
        type       - select (define) a disk type
        partition  - select (define) a partition table
        current    - describe the current disk
        format     - format and analyze the disk
        fdisk      - run the fdisk program
        repair     - repair a defective sector
        label      - write label to the disk
        analyze    - surface analysis
        defect     - defect list management
        backup     - search for backup labels
        verify     - read and display labels
        save       - save new disk/partition definitions
        inquiry    - show vendor, product and revision
        volname    - set 8-character volume name
        !<cmd>     - execute <cmd>, then return
        quit
format>

jlliagre · December 31, 2013, 5:55pm

Okay, so c0t1d0 is a RAID after all, not a single disk.

If it is installed, what does this command say:

/opt/StorMan/arcconf GETCONFIG 1

?
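
The conclusion comes from the `iostat -En` output above, where c0t1d0 reports the product string "STK RAID INT" together with 54 hard errors. A small filter can condense that output to one line per device. This is only a sketch: `summarize_iostat` is a made-up helper name, and it is run here against lines copied from the thread rather than against a live system.

```shell
# Condense `iostat -En` style text to one line per device.
summarize_iostat() {
  awk '
    /Soft Errors:/ {
      dev = $1
      # The hard error count is the second field after the word "Hard"
      for (i = 1; i < NF; i++) if ($i == "Hard") hard = $(i + 2)
    }
    /^Vendor:/ {
      product = $0
      sub(/.*Product: */, "", product)
      sub(/ *Revision:.*/, "", product)
    }
    /^Size:/ { print dev, "hard=" hard, "product=" product, "size=" $2 }
  '
}

# Sample lines copied from the thread; on the server itself this would be
#   iostat -En | summarize_iostat
sample='c0t0d0           Soft Errors: 2 Hard Errors: 0 Transport Errors: 0
Vendor: Sun      Product: STK RAID INT     Revision: V1.0 Serial No:
Size: 299.56GB <299563482624 bytes>
c0t1d0           Soft Errors: 2 Hard Errors: 54 Transport Errors: 0
Vendor: Sun      Product: STK RAID INT     Revision: V1.0 Serial No:
Size: 898.71GB <898705128960 bytes>'

summary=$(printf '%s\n' "$sample" | summarize_iostat)
printf '%s\n' "$summary"
```

With the sample lines, both c0t0d0 and c0t1d0 show up with the same RAID product string, and only c0t1d0 has a non-zero hard error count.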

cerco · January 2, 2014, 4:02am

@jlliagre,
I think this command doesn't exist. Please see below:

# /opt/Storman/arcconf GETCONFIG 1
bash: /opt/Storman/arcconf: No such file or directory
#

cerco · January 3, 2014, 5:23am

Please, is there any update on this?

rbatte1 · January 3, 2014, 11:07am

Did you follow the advice from fsck that you pasted in the thread?

Robin

cerco · January 3, 2014, 12:48pm

Hi rbatte1,
That has been done, but the result is the same. After rerunning fsck more than 4 times, I still get the same message.
Must I keep rerunning fsck again and again?