Sun Fire V440: hard disk or controller broken? WARNING: /pci@1f,700000/scsi@2/sd@0,0 (sd1)

Hi,

I have a Sun Fire V440 server that fails to boot up correctly. A lot of services are not started and the system responds really slowly to commands. During boot I can see the following error:

WARNING: /pci@1f,700000/scsi@2/sd@0,0 (sd1):
        SCSI transport failed: reason 'reset': retrying command
WARNING: /pci@1f,700000/scsi@2/sd@0,0 (sd1):
        Error for Command: read                    Error Level: Retryable
        Requested Block: 689376                    Error Block: 689390
        Vendor: LSILOGIC                           Serial Number: LSI INTERNAL
        Sense Key: Media Error
        ASC: 0x11 (read retries exhausted), ASCQ: 0x1, FRU: 0x0

The first two disks, sd0 and sd1, seem to be configured as RAID 1, so I would assume that one of those disks is bad. But raidctl shows no errors:

RAID    Volume  RAID            RAID            Disk
Volume  Type    Status          Disk            Status
------------------------------------------------------
c1t0d0  IM      RESYNCING       c1t0d0          OK
                                 c1t1d0          OK

But iostat -en shows soft and hard errors for the RAID volume:

bash-3.00# iostat -en
  ---- errors ---
  s/w h/w trn tot
    3   6   0   9 c1t0d0
    0   0   0   0 c1t2d0
    0   0   0   0 c1t3d0
    1   0   0   1 c3t600144F0A549542200005CC83C9C0003d0
    1   0   0   1 ssd3

Is it possible that the RAID controller is broken?

bash-3.00# prtdiag -v
System Configuration: Sun Microsystems  sun4u Sun Fire V440
System clock frequency: 183 MHZ
Memory size: 16GB

==================================== CPUs ====================================
               E$          CPU                    CPU
CPU  Freq      Size        Implementation         Mask    Status      Location
---  --------  ----------  ---------------------  -----   ------      --------
0    1281 MHz  1MB         SUNW,UltraSPARC-IIIi    2.4    on-line      -
1    1281 MHz  1MB         SUNW,UltraSPARC-IIIi    2.4    on-line      -
2    1281 MHz  1MB         SUNW,UltraSPARC-IIIi    2.4    on-line      -
3    1281 MHz  1MB         SUNW,UltraSPARC-IIIi    2.4    on-line      -

================================= IO Devices =================================
Bus     Freq  Slot +      Name +
Type    MHz   Status      Path                          Model
------  ----  ----------  ----------------------------  --------------------
pci     66    MB          pci108e,abba (network)        SUNW,pci-ce
              okay        /pci@1c,600000/network@2

pci     33    MB          isa/su (serial)
              okay        /pci@1e,600000/isa@7/serial@0,3f8

pci     33    MB          isa/su (serial)
              okay        /pci@1e,600000/isa@7/serial

pci     33    MB          isa/rmc-comm-rmc_comm (seria+
              okay        /pci@1e,600000/isa@7/rmc-comm@0,3e8

pci     33    MB          pci10b9,5229 (ide)
              okay        /pci@1e,600000/ide

pci     66    MB          pci108e,abba (network)        SUNW,pci-ce
              okay        /pci@1f,700000/network@1

pci     66    MB          scsi-pci1000,30 (scsi-2)      LSI,1030
              okay        /pci@1f,700000/scsi@2

pci     66    MB          scsi-pci1000,30 (scsi-2)      LSI,1030
              okay        /pci@1f,700000/scsi


============================ Memory Configuration ============================
Segment Table:
-----------------------------------------------------------------------
Base Address       Size       Interleave Factor  Contains
-----------------------------------------------------------------------
0x0                4GB               16          BankIDs 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0x1000000000       4GB               16          BankIDs 16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31
0x2000000000       4GB               16          BankIDs 32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47
0x3000000000       4GB               16          BankIDs 48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63

Bank Table:
-----------------------------------------------------------
           Physical Location
ID       ControllerID  GroupID   Size       Interleave Way
-----------------------------------------------------------
0        0             0         256MB           0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
1        0             0         256MB
2        0             1         256MB
3        0             1         256MB
4        0             0         256MB
5        0             0         256MB
6        0             1         256MB
7        0             1         256MB
8        0             1         256MB
9        0             1         256MB
10       0             0         256MB
11       0             0         256MB
12       0             1         256MB
13       0             1         256MB
14       0             0         256MB
15       0             0         256MB
16       1             0         256MB           0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
17       1             0         256MB
18       1             1         256MB
19       1             1         256MB
20       1             0         256MB
21       1             0         256MB
22       1             1         256MB
23       1             1         256MB
24       1             1         256MB
25       1             1         256MB
26       1             0         256MB
27       1             0         256MB
28       1             1         256MB
29       1             1         256MB
30       1             0         256MB
31       1             0         256MB
32       2             0         256MB           0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
33       2             0         256MB
34       2             1         256MB
35       2             1         256MB
36       2             0         256MB
37       2             0         256MB
38       2             1         256MB
39       2             1         256MB
40       2             1         256MB
41       2             1         256MB
42       2             0         256MB
43       2             0         256MB
44       2             1         256MB
45       2             1         256MB
46       2             0         256MB
47       2             0         256MB
48       3             0         256MB           0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
49       3             0         256MB
50       3             1         256MB
51       3             1         256MB
52       3             0         256MB
53       3             0         256MB
54       3             1         256MB
55       3             1         256MB
56       3             1         256MB
57       3             1         256MB
58       3             0         256MB
59       3             0         256MB
60       3             1         256MB
61       3             1         256MB
62       3             0         256MB
63       3             0         256MB

Memory Module Groups:
--------------------------------------------------
ControllerID   GroupID  Labels         Status
--------------------------------------------------
0              0        C0/P0/B0/D0
0              0        C0/P0/B0/D1
0              1        C0/P0/B1/D0
0              1        C0/P0/B1/D1
1              0        C1/P0/B0/D0
1              0        C1/P0/B0/D1
1              1        C1/P0/B1/D0
1              1        C1/P0/B1/D1
2              0        C2/P0/B0/D0
2              0        C2/P0/B0/D1
2              1        C2/P0/B1/D0
2              1        C2/P0/B1/D1
3              0        C3/P0/B0/D0
3              0        C3/P0/B0/D1
3              1        C3/P0/B1/D0
3              1        C3/P0/B1/D1

============================ Environmental Status ============================
Fan Status:
-------------------------------------------
Location             Sensor          Status
-------------------------------------------
FT0/F0               TACH            okay
FT1/F0               TACH            okay
FT1/F1               TACH            okay
PS0                  FF_PDCT_FAN     okay

Temperature sensors:
-----------------------------------------
Location       Sensor              Status
-----------------------------------------
C0/P0          T_CORE              okay
C1/P0          T_CORE              okay
C2/P0          T_CORE              okay
C3/P0          T_CORE              okay
C0             T_AMB               okay
C1             T_AMB               okay
C2             T_AMB               okay
C3             T_AMB               okay
SCSIBP         T_AMB               okay
MB             T_AMB               okay
------------------------------------
Current sensors:
----------------------------------------
Location             Sensor       Status
----------------------------------------
MB                   FF_SCSIA     okay
MB                   FF_SCSIB     okay
MB                   FF_POK       okay
C0/P0                FF_POK       okay
C1/P0                FF_POK       okay
C2/P0                FF_POK       okay
C3/P0                FF_POK       okay
------------------------------------
Voltage sensors:
-----------------------------------
Location       Sensor        Status
-----------------------------------
MB             V_+1V5        okay
MB             V_VCCTM       okay
MB             V_NET0_1V2D   okay
MB             V_NET1_1V2D   okay
MB             V_NET0_1V2A   okay
MB             V_NET1_1V2A   okay
MB             V_+3V3        okay
MB             V_+3V3STBY    okay
MB/BAT         V_BAT         warning (0.00V)
MB             V_SCSI_CORE   okay
MB             V_+5V         okay
MB             V_+12V        okay
MB             V_-12V        okay
PS0            P_PWR         okay
PS0            FF_POK        okay
-----------------------------------------
Keyswitch:
-----------------------------------------
Location       Keyswitch   State
-----------------------------------------
SYS            SYSCTRL     NORMAL
--------------------------------------------------
Led State:
--------------------------------------------------------------
Location               Led                   State       Color
--------------------------------------------------------------
SYS                    ACT                   on          green
SYS                    SERVICE               on          amber
SYS                    LOCATE                off         white
PS0                    POK                   on          green
PS0                    STBY                  on          green
PS0                    SERVICE               off         amber
PS0                    OK2RM                 off         blue
HDD0                   SERVICE               off         amber
HDD0                   OK2RM                 off         blue
HDD1                   SERVICE               off         amber
HDD1                   OK2RM                 off         blue
HDD2                   SERVICE               off         amber
HDD2                   OK2RM                 off         blue
HDD3                   SERVICE               off         amber
HDD3                   OK2RM                 off         blue

=========================== FRU Operational Status ===========================
---------------------------------
Fru Operational Status:
---------------------------------
Location                Status
---------------------------------
SC                      okay
HDD0                    present
HDD1                    present
HDD2                    present
HDD3                    present
PS0                     okay

================================ HW Revisions ================================
ASIC Revisions:
-------------------------------------------------------------------
Path                   Device           Status             Revision
-------------------------------------------------------------------
/pci@1c,600000         pci108e,a801     okay               4
/pci@1d,700000         pci108e,a801     okay               4
/pci@1e,600000         pci108e,a801     okay               4
/pci@1f,700000         pci108e,a801     okay               4

System PROM revisions:
----------------------
OBP 4.16.4 2004/12/18 05:20 Sun Fire V440,Netra 440
OBDIAG 4.16.4 2004/12/18 05:21

I'm really thankful for any hints, as I have no clue how to proceed with this.

Best Regards,
Oliver

raidctl is not showing "no errors", as you put it.

RESYNCING means that the controller is remirroring the RAID 1 disks because of a problem. Depending on the capacity of the RAID 1 disks (they will typically be exactly the same size) this resyncing shouldn't take very long. However, whilst it is in progress, system response time will be impacted. Once complete, the status should become OPTIMAL.

If the resyncing is falling over for some reason, then the process might be restarting over and over so that OPTIMAL is never achieved. Watch for that. If that is the case, I would be inclined first, if possible, to take the system down and re-seat all SCSI/SATA cables at both ends (disk and mobo) and all disk power supply plugs. Reboot and see if the problem persists. If it does, then most likely one of the disks is faulty. It's possible but unlikely that the RAID controller is faulty. The disks are the only moving parts.

You could remove the faulty RAID 1 drive (the one continuously resyncing), put it in another machine and run diagnostics on it. Perhaps completely reformat it and try again. Otherwise, a new disk is required.

Hi,

the status is shown as optimal. I would guess that if a disk had failed or was failing, raidctl would show that? How can I identify which of the two disks is bad if raidctl claims everything is OK? I have powered down the server many times. I have not replugged all the cables yet. I will give it a try.

Watch for a repeat of resyncing. If it keeps happening, something is wrong (probably with one of the disks). You will also see high disk activity on the disk LEDs, which might be easier to spot than repeatedly running raidctl.
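
If you would rather not eyeball raidctl by hand each time, something like the following could flag a volume that is not OPTIMAL. This is only a rough sketch: it assumes the column layout shown in the raidctl output earlier in this thread (the layout may differ on other Solaris releases), and the function names parse_raidctl and needs_attention are made up for illustration.

```python
import re

# Sample text copied from the raidctl output posted earlier in this thread.
SAMPLE = """\
RAID    Volume  RAID            RAID            Disk
Volume  Type    Status          Disk            Status
------------------------------------------------------
c1t0d0  IM      RESYNCING       c1t0d0          OK
                                 c1t1d0          OK
"""

# Matches Solaris-style disk names such as c1t0d0.
DISK = re.compile(r"^c\d+t\d+d\d+$")


def parse_raidctl(text):
    """Return a list of dicts describing each RAID volume and its member disks."""
    volumes = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 5 and DISK.match(fields[0]):
            # Volume row: volume, type, volume status, first disk, disk status.
            volumes.append({
                "volume": fields[0],
                "type": fields[1],
                "status": fields[2],
                "disks": [(fields[3], fields[4])],
            })
        elif len(fields) == 2 and DISK.match(fields[0]) and volumes:
            # Continuation row: an additional member disk of the last volume.
            volumes[-1]["disks"].append((fields[0], fields[1]))
    return volumes


def needs_attention(volumes):
    """Names of volumes whose status is anything other than OPTIMAL."""
    return [v["volume"] for v in volumes if v["status"] != "OPTIMAL"]


if __name__ == "__main__":
    print(needs_attention(parse_raidctl(SAMPLE)))
```

Feeding it the captured output of raidctl every few minutes and noting when a volume drops back to RESYNCING would show whether the rebuild keeps restarting.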

Do I need to issue a command in order to remove one of the disks? Is the raid hotplug capable?

Yes, the onboard raid is hotplug capable but, of course, you need to be sure that you're pulling the right disk.

With the system down you can pull out and re-seat both of them to try to ensure good connection with the hotplug sockets.

Also, from your original post, it shows that c1t0d0 is the disk being rebuilt (RESYNCING) and c1t1d0 is running OK.

Sorry about the misleading info from raidctl. The resync was shown because I powered down the system, removed one of the disks and powered it back on with a single disk to see whether the error stays. Somehow I was only able to boot when d0 was installed. With d0 pulled, the system was not able to boot. I have now replaced the second disk, d1, with a fresh disk. The RAID seems to have been rebuilt, but the system still shows the same error during boot. I have now removed d0 again and booted only with the new, synced d1 disk, but the same error is shown. Is there a possibility that there is a problem with the filesystem itself and this error has now also been replicated to the new disk?

Well, it's complaining about a media error but, sure, it could indeed be one or more corrupt sectors in the middle of a filesystem.
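
To see whether it is the same spot that fails on every boot, it can help to pull the numbers out of the warning and compare them between boots. A minimal sketch, assuming the message layout is exactly as in the original post; parse_sd_warning is just an illustrative name, not part of any Solaris tool.

```python
import re

# Warning text copied from the original post.
SAMPLE = """\
WARNING: /pci@1f,700000/scsi@2/sd@0,0 (sd1):
        Error for Command: read                    Error Level: Retryable
        Requested Block: 689376                    Error Block: 689390
        Vendor: LSILOGIC                           Serial Number: LSI INTERNAL
        Sense Key: Media Error
        ASC: 0x11 (read retries exhausted), ASCQ: 0x1, FRU: 0x0
"""


def parse_sd_warning(text):
    """Extract the device, block numbers and sense data from one sd warning."""

    def grab(pattern, cast=str):
        # Return the first capture group, converted with cast, or None.
        m = re.search(pattern, text)
        return cast(m.group(1).strip()) if m else None

    return {
        "device": grab(r"WARNING:\s+(\S+)"),
        "instance": grab(r"\((sd\d+)\)"),
        "requested_block": grab(r"Requested Block:\s+(\d+)", int),
        "error_block": grab(r"Error Block:\s+(\d+)", int),
        "sense_key": grab(r"Sense Key:\s+(.+)"),
        "asc": grab(r"ASC:\s+(0x[0-9a-fA-F]+)", lambda s: int(s, 16)),
    }


if __name__ == "__main__":
    info = parse_sd_warning(SAMPLE)
    # Distance of the failing block from the start of the read request.
    print(info, info["error_block"] - info["requested_block"])
```

If the error block stays the same with either disk installed on its own, that would fit the idea of damage that got copied along with the mirror rather than one physically bad drive, but treat that as a hint, not proof.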

This is, of course, the root filesystem. Is this filesystem UFS or ZFS?

A full filesystem check would be a good idea. Get ready to write down any inode numbers it complains about (it might not come out with actual filenames if the filesystem is corrupt).

Do you have a backup? If not, you should take one because filesystem checking can destroy a filesystem faster than you can blink if there is significant corruption and it tries to correct it.

Of course, you will probably need to boot from DVD media to check a root filesystem.

Probably a good idea just to do a root filesystem check with a "no write" flag. That way you get to see what damage there is without the risk that it will try to fix it. It will show you the difference as to whether there is one file corrupt or one million files corrupt.

It's a UFS filesystem. I did a file system check and the results don't look good:

There is no recent backup available. The project team that is responsible for the server did not regularly back up the machine. This morning I tried to ufsdump the filesystem but this failed with a couple of errors. I do have a ufsdump from 2017. Maybe I will try to restore this dump to the new disk just to check whether this works. If it doesn't, there must be a hardware fault somewhere in the system, as the disk is brand new.

That's not too bad on face value.

There's obviously a superblock located at block=16, but the system maintains other copies at other block locations.

AFAIR, you should be able to specify an alternative superblock location and get it to fix the superblock (at least).

Google that, I'll look it up too, or probably another Sun expert on here will chip in. Don't do anything rash yet.

Yep, there's another copy of the superblock at block=32. (And there will be others).

The search for an alternative superblock failed because you spec'd NO WRITE. So it didn't even try.

Use the switch:

-o b=32

on the command line to use superblock at block=32.

With -y the fix would be automatic, but -n means NO WRITE.

With those results I would now be inclined to run fsck without either -n or -y and answer the questions as they come. This allows you to CTRL-C out if you get prompted too many times on individual files.

https://docs.oracle.com/cd/E19253-01/817-5093/gagby/index.html
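
To keep the three variants straight (no-write check, automatic fix, interactive with an alternate superblock), here is a small sketch that only builds the fsck command line and does not run anything. The helper name fsck_ufs_argv and the device /dev/rdsk/c1t0d0s0 are assumptions for illustration; substitute the raw slice that actually holds the root filesystem on your box.

```python
def fsck_ufs_argv(raw_device, mode="interactive", superblock=None):
    """Build an argv list for Solaris fsck on a UFS filesystem.

    mode "check"       adds -n (answer no to every question, nothing written)
    mode "auto"        adds -y (answer yes to every question)
    mode "interactive" adds neither, so fsck prompts for each problem
    superblock         optional alternate superblock, passed as -o b=N
    """
    flags = {"check": ["-n"], "auto": ["-y"], "interactive": []}
    if mode not in flags:
        raise ValueError("mode must be one of: " + ", ".join(sorted(flags)))
    argv = ["fsck", "-F", "ufs"] + flags[mode]
    if superblock is not None:
        argv += ["-o", "b=%d" % superblock]
    argv.append(raw_device)
    return argv


if __name__ == "__main__":
    dev = "/dev/rdsk/c1t0d0s0"  # assumed root slice, adjust to your layout
    print(" ".join(fsck_ufs_argv(dev, "check")))
    print(" ".join(fsck_ufs_argv(dev, "interactive", superblock=32)))
```

The second line printed is the one suggested above: no -n or -y, alternate superblock 32, run from a system booted off DVD media so the root filesystem is not mounted.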