Unable to see controller C1 on Solaris 8

Good Morning,

We have a legacy Sun Fire V880 SPARC server running SunOS 5.8. Recently we started getting a maintenance error on one of the disk slices, c1t0d0s4, so before changing the disk we had a look at the output of cfgadm, but we can't see the c1 controller in the list even though all our disks are configured on the c1 controller.

bash-2.03# cfgadm -al
Ap_Id                          Type         Receptacle   Occupant     Condition
SBa                            cpu/mem      connected    configured   ok
SBa::cpu0                      cpu          connected    configured   ok
SBa::cpu1                      cpu          connected    configured   ok
SBa::memory                    memory       connected    configured   ok
SBb                            cpu/mem      connected    unconfigured ok
SBc                            cpu/mem      connected    unconfigured ok
SBd                            cpu/mem      connected    unconfigured ok
c0                             scsi-bus     connected    configured   unknown
c0::dsk/c0t6d0                 CD-ROM       connected    configured   unknown
c2                             scsi-bus     connected    unconfigured unknown
c3                             scsi-bus     connected    unconfigured unknown
pcisch0:hpc1_slot0             unknown      empty        unconfigured unknown
pcisch0:hpc1_slot1             unknown      empty        unconfigured unknown
pcisch0:hpc1_slot2             unknown      empty        unconfigured unknown
pcisch0:hpc1_slot3             unknown      empty        unconfigured unknown
pcisch2:hpc2_slot4             unknown      empty        unconfigured unknown
pcisch2:hpc2_slot5             mult/hp      connected    configured   ok
pcisch2:hpc2_slot6             unknown      empty        unconfigured unknown
pcisch3:hpc0_slot7             unknown      empty        unconfigured unknown
pcisch3:hpc0_slot8             unknown      empty        unconfigured unknown

format output
AVAILABLE DISK SELECTIONS:
       0. c1t0d0 <SUN72G cyl 14087 alt 2 hd 24 sec 424>
          /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e010279761,0
       1. c1t1d0 <SUN72G cyl 14087 alt 2 hd 24 sec 424>
          /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w2100000c50e313eb,0
       2. c1t2d0 <SEAGATE-ST373207FC-0002 cyl 44304 alt 2 hd 4 sec 809>
          /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w21000011c60b82a0,0
       3. c1t3d0 <SUN72G cyl 14087 alt 2 hd 24 sec 424>
          /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w2100000c50e31e4c,0
       4. c1t4d0 <SUN72G cyl 14087 alt 2 hd 24 sec 424>
          /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w2100000c50e313cb,0
       5. c1t5d0 <SUN72G cyl 14087 alt 2 hd 24 sec 424>
          /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w2100000c50e32913,0
       6. c1t8d0 <SUN72G cyl 14087 alt 2 hd 24 sec 424>
          /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e0107dd361,0
       7. c1t9d0 <SUN72G cyl 14087 alt 2 hd 24 sec 424>
          /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e0107dc4a1,0
       8. c1t10d0 <SUN72G cyl 14087 alt 2 hd 24 sec 424>
          /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e0107dd561,0
       9. c1t11d0 <SUN72G cyl 14087 alt 2 hd 24 sec 424>
          /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e0107dd651,0
      10. c1t12d0 <SUN72G cyl 14087 alt 2 hd 24 sec 424>
          /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w21000011c60bae19,0
      11. c1t13d0 <SEAGATE-ST336607FC-0006 cyl 49780 alt 2 hd 2 sec 720>
          /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w21000011c62d6df9,0

bash-2.03# iostat -en
  ---- errors ---
  s/w h/w trn tot
    0   0   0   0 d0
    0   0   0   0 d1
    0   0   0   0 d2
    0   0   0   0 d3
    0   0   0   0 d4
    0   0   0   0 d5
    0   0   0   0 d6
    0   0   0   0 d10
    0   0   0   0 d11
    0   0   0   0 d12
    0   0   0   0 d13
    0   0   0   0 d14
    0   0   0   0 d15
    0   0   0   0 d16
    0   0   0   0 d20
    0   0   0   0 d21
    0   0   0   0 d22
    0   0   0   0 d23
    0   0   0   0 d24
    0   0   0   0 d25
    0   0   0   0 d26
    0   0   0   0 c0t6d0
    0  34   0  34 c1t13d0
    0  34   0  34 c1t5d0
    0  36   0  36 c1t9d0
    0  35   0  35 c1t4d0
  2839  73  12 2924 c1t3d0
    0  35   0  35 c1t1d0
    0  35   0  35 c1t11d0
    0  39  12  51 c1t8d0
    0  34   0  34 c1t10d0
    0  34   0  34 c1t2d0
    0  34   0  34 c1t12d0
   16  55   5  76 c1t0d0

bash-2.03# luxadm -e port

Found path to 1 HBA ports

/devices/pci@8,600000/SUNW,qlc@2/fp@0,0:devctl                     CONNECTED

Any pointers on how to troubleshoot this? Looking at the iostat output it seems to be an issue with the controller, but I am unable to see it in the cfgadm output.

Thanks,
P

Perhaps c1 got hung because of a non-responding disk?
Then a reboot might help.
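
If you do reboot, it might be worth making it a reconfiguration reboot so the device tree gets rebuilt, and cleaning up stale device links afterwards. A rough sketch (run as root; it may make no difference if the HBA or a disk is genuinely dead):

 # touch /reconfigure
 # init 6
 ...after it comes back up...
 # devfsadm -C

('reboot -- -r' achieves the same as the touch/init pair.)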

Thanks for the reply.

I even tried rebooting, but it didn't help. Still the same issue.

If you check the iostat output, there are quite a few hardware errors recorded on all the disks connected to c1.

Is there any way to confirm whether it's a faulty controller that is causing the issue?

Your post of the 'format' command output shows it can see a number of disks on c1. What happens if you try to display the vtoc of each disk?
(Select a disk and then choose option 'p', followed by 'p' again.) Does it error, or does it successfully display the vtoc (partition slices)? If it errors, please post the error.

Also, if you look in /dev/dsk and /dev/rdsk, are the device nodes there for each of these disks/slices? Or has someone deleted them?

What does cfgadm -al -o show_FCP_dev c1 show?
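
For example, all non-destructive, using c1t0d0 from your format output (any of the c1 disks will do):

 # prtvtoc /dev/rdsk/c1t0d0s2                     show the vtoc via the whole-disk slice
 # ls -l /dev/dsk/c1t0d0s* /dev/rdsk/c1t0d0s*     check the device nodes exist
 # cfgadm -al -o show_FCP_dev c1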

I assume your Solaris installation is not patched properly. In the beginning cfgadm would not show FC-AL devices (luxadm was used for that). You need the "SAN Foundation Suite" installed and patched to have cfgadm show FC-AL devices.

The V880 was one of the first servers that used an internal FC-AL loop for the internal hard drives. All the other devices are SCSI and are shown perfectly well in the cfgadm output. GA date for the V880 was November 2001, so in computer years that thing is a dinosaur ;).
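
You can check whether the fp cfgadm plug-in and the SAN packages are actually there. A rough sketch (package names quoted from memory, so treat them as a guess):

 # ls -l /usr/lib/cfgadm/                  the fp plug-in library (fp.so.1) should be listed
 # pkginfo | egrep -i 'SUNWsan|SUNWcfpl'   SAN Foundation Kit and fp cfgadm plug-in packages

If the plug-in library is missing, cfgadm has no way to drive the FC-AL controller.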

Good Morning,

I am able to display the prtvtoc of each disk, no errors were shown, and all the disks and slices are present under /dev/dsk and /dev/rdsk.

Thanks,
P

Below is the output:

bash-2.03# cfgadm -al -o show_FCP_dev c1
Ap_Id                          Type         Receptacle   Occupant     Condition
c1: No matching library found

I tried with luxadm previously, but was getting the error below:

bash-2.03# luxadm replace_device /dev/rdsk/c1t0d0s2
 Error: Could not find the loop address for  the device at physical path. - /dev/rdsk/c1t0d0s2.

But when I look for the physical path, it is there:

bash-2.03# ls -ltr /dev/rdsk/c1t0d0s2
lrwxrwxrwx   1 root     root          74 Jan 20 16:23 /dev/rdsk/c1t0d0s2 -> ../../devices/pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e010279761,0:c,raw

So, any ideas?

Thanks,
P

So you can see the vtocs, and the disk device nodes seem to be there too.

Are there entries in /etc/vfstab which try to mount these disks at boot time? (Maybe post the contents of /etc/vfstab).

Have you tried to mount one of these disks manually (from root user)?
If so, does it work or do you get an error? If so, please post the error.

eg,

 
 # cd /
 # mount /dev/dsk/c1?????? /mnt
 # cd /mnt
 # ls -l
 

Good Morning,

I tried to manually mount each slice. For the slices below I got errors, but the rest I was able to mount and their contents were visible.

bash-2.03# mount /dev/dsk/c1t0d0s0 /mnt/test
mount: the state of /dev/dsk/c1t0d0s0 is not okay
        and it was attempted to be mounted read/write
mount: Please run fsck and try again
bash-2.03# mount /dev/dsk/c1t0d0s1 /mnt/test
mount: the state of /dev/dsk/c1t0d0s1 is not okay
        and it was attempted to be mounted read/write
mount: Please run fsck and try again

Since I have detached the disk, is it worth running fsck on the raw disk c1t0d0s2?

Thanks,
P

No, no, no. You don't run fsck on slice2 (??????s2) of any disk because that defines the whole disk to Solaris and does NOT hold a filesystem.

You cannot currently mount c1t0d0s0 and c1t0d0s1. Take a look again at the vtoc for that disk (see my earlier post as to how to do that) and see whether either or both are filesystems. Quite often ??????s1 is defined as swap space. If both are indeed filesystems then fsck those slices:

 
 # fsck -n /dev/rdsk/c1t0d0s0
 # fsck -n /dev/rdsk/c1t0d0s1
 

Note that I have included the '-n' switch to check the filesystems without correcting anything, first just to see how much damage, if any, there is. Don't use the '-y' switch at the outset because telling it to correct all errors might destroy the filesystem outright if the damage is extensive. If errors are few, simply run again without the '-n' and say 'y' to each question to correct the error(s).

On a production system I would expect there to be entries in /etc/vfstab to mount these filesystems at boot time?
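
By the way, to check whether a slice holds a filesystem or is swap, look at the 'Tag' column in the vtoc (from memory: 2 = root, 3 = swap, 5 = backup/whole disk, 7 = var), e.g.:

 # prtvtoc /dev/rdsk/c1t0d0s2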

Good Morning hicksd,

Below is the output of vfstab

#device         device          mount           FS      fsck    mount   mount
#to mount       to fsck         point           type    pass    at boot options
#
#/dev/dsk/c1d0s2 /dev/rdsk/c1d0s2 /usr          ufs     1       yes     -
fd      -       /dev/fd fd      -       no      -
/proc   -       /proc   proc    -       no      -
/dev/md/dsk/d1  -       -       swap    -       no      -
/dev/md/dsk/d0  /dev/md/rdsk/d0 /       ufs     1       no      -
/dev/md/dsk/d2  /dev/md/rdsk/d2 /var    ufs     1       no      -
swap    -       /tmp    tmpfs   -       yes     -
/dev/md/dsk/d6  /dev/md/rdsk/d6 /export ufs     1       yes     logging
#
# application filesystems
/dev/md/dsk/d3  /dev/md/rdsk/d3 /opt/oracle     ufs     2       yes     logging
/dev/md/dsk/d4  /dev/md/rdsk/d4 /db01           ufs     2       yes     logging
/dev/md/dsk/d5  /dev/md/rdsk/d5 /backup1        ufs     2       yes     logging

The slices mentioned above are part of d0 and d2, and I have run the commands as you said:

bash-2.03# fsck -n /dev/rdsk/c1t0d0s0
** /dev/rdsk/c1t0d0s0 (NO WRITE)
** Last Mounted on /
** Phase 1 - Check Blocks and Sizes
INCORRECT BLOCK COUNT I=391321 (2 should be 0)
CORRECT?  no

INCORRECT BLOCK COUNT I=806459 (2 should be 0)
CORRECT?  no

** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
UNREF FILE I=391321  OWNER=root MODE=100644
SIZE=0 MTIME=May 10 09:50 2017
CLEAR?  no

LINK COUNT FILE I=391324  OWNER=root MODE=100644
SIZE=487 MTIME=May 10 09:50 2017  COUNT 2 SHOULD BE 1
ADJUST?  no

** Phase 5 - Check Cyl groups
FREE BLK COUNT(S) WRONG IN SUPERBLK
SALVAGE?  no

75362 files, 4243601 used, 4019771 free (8515 frags, 501407 blocks,  0.1% fragmentation)
bash-2.03# fsck -n /dev/rdsk/c1t0d0s1
** /dev/rdsk/c1t0d0s1 (NO WRITE)
** Last Mounted on /var
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups

FILE SYSTEM STATE IN SUPERBLOCK IS WRONG; FIX?  no

5942 files, 399666 used, 622069 free (1501 frags, 77571 blocks,  0.1% fragmentation)

But if you check, I was getting the maintenance state on slice 4, not on slice 0 or slice 1?

Thanks,
P

Ah, looking at your vfstab, you've got volume management in use.

Therefore, you should 'fsck' the raw (rdsk) devices in the 'device to fsck' column of vfstab eg,

# fsck -n /dev/md/rdsk/d2

on any filesystem that refuses to mount because it's flagged as 'dirty'.

Again, use the '-n' switch to examine how much damage there really is. If there are millions of errors then be careful. If errors are few then run again without the '-n' and answer 'y' to each question.

You can run 'fsck' on any/all of the devices in your vfstab 'device to fsck' column.
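
If you want to check them all in one go, something along these lines should work (a sketch only; it just pulls the 'device to fsck' column out of vfstab and runs a read-only check on each):

 # awk '$1 !~ /^#/ && $2 ~ /^\/dev\// {print $2}' /etc/vfstab |
 > while read raw
 > do
 >    echo "=== $raw"
 >    fsck -n "$raw"
 > done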

Hi,

I am struggling to understand here.

df -k output

/dev/md/dsk/d0       8263373 4243395 3937345    52%    /
/proc                      0       0       0     0%    /proc
fd                         0       0       0     0%    /dev/fd
mnttab                     0       0       0     0%    /etc/mnttab
/dev/md/dsk/d2       1021735  399723  560708    42%    /var
swap                 7367512      24 7367488     1%    /var/run
swap                 7367952     464 7367488     1%    /tmp
/dev/md/dsk/d4       105034703 80527538 23456818    78%    /db01
/dev/md/dsk/d5       140052804 95566454 43085822    69%    /backup1
/dev/md/dsk/d6       9736204 4422407 5216435    46%    /export
/dev/md/dsk/d3       35018085 14055926 20611979    41%    /opt/oracle

In the above output I am able to mount all the devices, and I was getting 'Needs Maintenance' on: d3,d3,d3

When I had a look at the metastat output, one of the submirrors of d3, i.e. d13 (c1t0d0s4), is showing as needing maintenance, and the rest of the slices on c1t0d0 are OK.

I tried to look for the c1 controller and failed to get any information about it, as I couldn't find the controller via devfsadm, cfgadm or luxadm, as mentioned in the initial post.

Currently I have detached all the submirrors on the c1t0d0 disk from all the mirrors, tried to mount them, and was able to mount them after running fsck on s0 and s1.

But the maintenance slice, c1t0d0s4, I am able to mount, yet I get an error when I run fsck on it.

Now I am not sure if we need to replace the disk; if so, how do we do that when we can't see the controller?

Note: these are legacy systems.

Thanks,
P

What do you mean by 'maintenance slice'?

Please post the vtoc of c1t0d0s4

When you use volume management (meta) the meta database lives on a (usually) very small configured slice. That is where the OS keeps the information as to what is mirrored to what, what volumes are spanned across multiple spindles and the like. That slice is not a filesystem and is not mounted in the usual way (and is therefore not in vfstab).

So again, what do you mean by maintenance slice? Could it be the meta database slice (very small)?
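
Incidentally, you can see exactly where the state database replicas live (and their status) with:

 # metadb -i

and 'metastat' with no arguments will show the state of every metadevice.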

Hi,

'Maintenance' in the following sense.

metastat output for d3

d3: Mirror
    Submirror 0: d13
      State: Needs maintenance
    Submirror 1: d23
      State: Okay
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 71109888 blocks

d13: Submirror of d3
    State: Needs maintenance
    Invoke: metareplace d3 c1t0d0s4 <new device>
    Size: 71109888 blocks
    Stripe 0:
        Device              Start Block  Dbase State        Hot Spare
        c1t0d0s4                   0     No    Maintenance


d23: Submirror of d3
    State: Okay
    Size: 71109888 blocks
    Stripe 0:
        Device              Start Block  Dbase State        Hot Spare
        c1t8d0s4                   0     No    Okay

and the prtvtoc of c1t0d0s4 is below:

* /dev/rdsk/c1t0d0s4 partition map
*
* Dimensions:
*     512 bytes/sector
*     424 sectors/track
*      24 tracks/cylinder
*   10176 sectors/cylinder
*   14089 cylinders
*   14087 accessible cylinders
*
* Flags:
*   1: unmountable
*  10: read-only
*
*                          First     Sector    Last
* Partition  Tag  Flags    Sector     Count    Sector  Mount Directory
       0      2    00          0  16780224  16780223
       1      7    00   16780224   2106432  18886655
       2      5    00          0 143349312 143349311
       3      3    01   18886656  16780224  35666879
       4      0    00   35666880  71109888 106776767
       5      0    00  106776768  19771968 126548735
       6      0    00  126548736  16770048 143318783
       7      0    00  143318784     30528 143349311

The meta database slice resides on c1t0d0s7, and I wasn't referring to it.

Thanks,
P

What error are you getting when you fsck s4?

I understand your concern about being unable to see the c1 controller; however, you seem to be able to see all the disks on c1.

So one side of the mirror of d3 (d13, which is c1t0d0s4) has a problem and needs maintenance. I guess other mirrors on the same drive are still working, so the drive itself hasn't failed and d3 should still be mountable, which is the whole point of mirroring.

Have you tried to break and remake that mirror?
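
Roughly, the sequence would be something like this (a sketch only; double-check the metadevice and slice names against your own metastat output first):

 # metadetach -f d3 d13      detach the faulty submirror ('-f' because it's in maintenance)
 # metattach d3 d13          re-attach it, which forces a full resync from the good side

Alternatively, if the write error was a one-off, 'metareplace -e d3 c1t0d0s4' re-enables the component in place and resyncs it.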

Apologies for not replying.

Yes, I did try breaking and remaking the mirror previously, but I am still getting the error.

bash-2.03# cat /var/adm/messages.0
May 21 09:19:12 xxxx  scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e010279761,0 (ssd15):
May 21 09:19:12 xxxx        Error for Command: write(10)               Error Level: Retryable
May 21 09:19:12 xxxx scsi: [ID 107833 kern.notice]  Requested Block: 45410506                  Error Block: 45410512
May 21 09:19:12 xxxx scsi: [ID 107833 kern.notice]  Vendor: FUJITSU                            Serial Number: 0302V77103_a
May 21 09:19:12 xxxx scsi: [ID 107833 kern.notice]  Sense Key: Hardware Error
May 21 09:19:12 xxxx scsi: [ID 107833 kern.notice]  ASC: 0x3 (<vendor unique code 0x3>), ASCQ: 0x80, FRU: 0x0
May 21 09:19:13 xxxx scsi: [ID 243001 kern.warning] WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0 (fcp0):
May 21 09:19:13 xxxx        FCP: WWN 0x500000e010279761 reset successfully
May 21 09:19:13 xxxx scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e010279761,0 (ssd15):
May 21 09:19:13 xxxx        Error for Command: write(10)               Error Level: Retryable
May 21 09:19:13 xxxx scsi: [ID 107833 kern.notice]  Requested Block: 45410506                  Error Block: 45410512
May 21 09:19:13 xxxx scsi: [ID 107833 kern.notice]  Vendor: FUJITSU                            Serial Number: 0302V77103_a
May 21 09:19:13 xxxx scsi: [ID 107833 kern.notice]  Sense Key: Hardware Error
May 21 09:19:13 xxxx scsi: [ID 107833 kern.notice]  ASC: 0x3 (<vendor unique code 0x3>), ASCQ: 0x80, FRU: 0x0
May 21 09:19:14 xxxx scsi: [ID 243001 kern.warning] WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e010279761,0 (ssd15):
May 21 09:19:14 xxxx        SCSI transport failed: reason 'reset': retrying command
May 21 09:19:15 xxxx scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w500000e010279761,0 (ssd15):
May 21 09:19:15 xxxx        Error for Command: write(10)               Error Level: Retryable
May 21 09:19:15 xxxx scsi: [ID 107833 kern.notice]  Requested Block: 45410506                  Error Block: 45410512
May 21 09:19:15 xxxx scsi: [ID 107833 kern.notice]  Vendor: FUJITSU                            Serial Number: 0302V77103_a
May 21 09:19:15 xxxx scsi: [ID 107833 kern.notice]  Sense Key: Hardware Error
May 21 09:19:15 xxxx scsi: [ID 107833 kern.notice]  ASC: 0x3 (<vendor unique code 0x3>), ASCQ: 0x80, FRU: 0x0
May 21 09:19:16 xxxx md_stripe: [ID 641072 kern.warning] WARNING: md: d13: write error on /dev/dsk/c1t0d0s4
May 21 09:19:17 xxxx md_mirror: [ID 104909 kern.warning] WARNING: md: d13: /dev/dsk/c1t0d0s4 needs maintenance
May 27 01:00:07 xxxx explorer: [ID 702911 daemon.notice] Explorer started
May 27 01:04:33 xxxx explorer: [ID 702911 daemon.notice] Explorer finished

If I go ahead with changing the disk, I can't see the controller to take this disk offline.

Running out of ideas for fixing this.

Thanks,
P

It looks like the disk surface might be damaged within that filesystem on one leg of the mirror. You could try to repair the faulty side of the mirror with format > analyze, but you will need to be very careful about the block range you select for checking.

The vtoc will tell you cyls/blocks where that filesystem sits.

Repairing a Defective Sector (System Administration Guide: Devices and File Systems)

When sectors (blocks) are repaired they are often re-vectored within the drive using sectors reserved by the manufacturer for this purpose.
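
From memory, the sequence in format goes roughly as follows (a sketch only; restrict the range to the s4 slice using the start/last sectors from your prtvtoc, and run a read pass before anything destructive):

 # format
 (choose disk 0, c1t0d0)
 format> analyze
 analyze> setup           set the block/cylinder range to cover only s4
 analyze> read            non-destructive surface read of that range
 analyze> quit
 format> repair <block>   repair any defective block the read pass reports

Once the surface checks out (or the disk is replaced), metareplace/metattach should let d13 resync from the good submirror.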