corrupt disk

Hello friends,

I have application X running on HP-UX 11.11 with Oracle 9i Release 2. I recently had a hardware failure on disk /dev/dsk/c2t0d0.

Below is the event data from the system log:

root@a7dmc:/var/adm/syslog [139] > /opt/resmon/bin/resdata -R 155713541 -r /storage/events/enclosures/gazemon/0_1_1_0.0.0 -n 155713537 -a

CURRENT MONITOR DATA:

Event Time..........: Wed Aug 17 23:55:47 2011
Severity............: CRITICAL
Monitor.............: gazemon
Event #.............: 100337
System..............: a7dmc

Summary:
**** Disk at hardware path 0/1/1/0.0.0 : Media failure


Description of Error:

**** The device was unsuccessful in reading data for the current I/O request
**** due to an error on the medium. The maximum number of retries were
**** attempted and the data could not be read. The request was likely processed
**** in a way which could cause damage to or loss of data.

Probable Cause / Recommended Action:

**** Reformatting the medium may fix the problem.

**** Alternatively, the medium in the device is flawed. If the medium is
**** removable, replace the medium with a fresh one.

**** Alternatively, if the medium is not removable, the device has experienced
**** a hardware failure. Contact your HP support representative to have the
**** device checked.

Additional Event Data:
**** System IP Address...: 192.168.0.17
**** Event Id............: 0x4e4c38e300000000
**** Monitor Version.....: B.01.00
**** Event Class.........: I/O
**** Client Configuration File...........:
**** /var/stm/config/tools/monitor/default_gazemon.clcfg
**** Client Configuration File Version...: A.01.01
********* Qualification criteria met.
************** Number of events..: 1
**** Associated OS error log entry id(s):
********* 0x4e4c38e100000000
**** Additional System Data:
********* System Model Number.............: 9000/800/rp3440
********* EMS Version.....................: A.04.20
********* STM Version.....................: A.53.00
**** Latest information on this event:
********* http://docs.hp.com/hpux/content/hardware/ems/scsi.htm#100337

v-v-v-v-v-v-v-v-v-v-v-v-v*** D* E* T* A* I* L* S*** v-v-v-v-v-v-v-v-v-v-v-v-v



Product/Device Identification Information:

**** Logger ID.........: sdisk
**** Product Identifier: SCSI Disk
**** Product Qualifier.: HP146
**** SCSI Target ID....: (not available/applicable)
**** SCSI LUN..........: (not available/applicable)

I/O Log Event Data:

**** Driver Status Code..................: 0x0000007C
**** Length of Logged Hardware Status....: 36 bytes.
**** Offset to Logged Manager Information: 40 bytes.
**** Length of Logged Manager Information: 34 bytes.

Hardware Status:

**** Raw H/W Status:
********* 0x0000: 00 00 00 02** F0 00 03 01** 4B F6 66 28** 00 00 00 00
********* 0x0010: 11 01 00 80** 00 3F 00 28** 00 67 00 01** AF 03 00 00
********* 0x0020: 0E E8 03 51

**** SCSI Status...: CHECK CONDITION (0x02)
********* Indicates that a contingent allegiance condition has occurred.* Any
********* error, exception, or abnormal condition that causes sense data to be
********* set will produce the CHECK CONDITION status.

SCSI Sense Data:

**** Undecoded Sense Data:
********* 0x0000: F0 00 03 01** 4B F6 66 28** 00 00 00 00** 11 01 00 80
********* 0x0010: 00 3F 00 28** 00 67 00 01** AF 03 00 00** 0E E8 03 51

**** SCSI Sense Data Fields:
********* Error Code********************* : 0x70
********* Segment Number***************** : 0x00
********* Bit Fields:
************** Filemark****************** : 0
************** End-of-Medium************* : 0
************** Incorrect Length Indicator : 0
********* Sense Key********************** : 0x03
********* Information Field Valid******** : TRUE
********* Information Field************** : 0x014BF666
********* Additional Sense Length******** : 40
********* Command Specific*************** : 0x00000000
********* Additional Sense Code********** : 0x11
********* Additional Sense Qualifier***** : 0x01
********* Field Replaceable Unit********* : 0x00
********* Sense Key Specific Data Valid** : TRUE
********* Sense Key Specific Data******** : 0x80 0x00 0x3F

********* Sense Key 0x03, MEDIUM ERROR, indicates that the command terminated
********* with a nonrecovered error condition that was probably caused by a
********* flaw in the medium or an error in the recorded data.* This sense key
********* may also be returned if the device is unable to distinguish between a
********* flaw in the medium and a specific hardware failure (sense key 0x04).
********* For the RECOVERED ERROR, HARDWARE ERROR, or MEDIUM ERROR Sense Key,
********* the Sense Key Specific data indicates that 63 retries were attempted.

********* The combination of Additional Sense Code and Sense Qualifier (0x1101)
********* indicates: Read retries exhausted.

SCSI Command Data Block:* (not present in log record)

Manager-Specific Information:

**** Raw Manager Data:
********* 0x0000: 02 08 B5 B9** 00 00 34 00** 00 00 00 02** 00 00 00 00
********* 0x0010: 02 00 00 20** 7A 00 09 0A** 28 00 01 4B** F6 40 00 00
********* 0x0020: 40 00

root@a7dmc:/var/adm/syslog [140] >
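
As a cross-check of the decoded fields in the event above, here is a minimal sketch that pulls the same values out of the first 18 "Undecoded Sense Data" bytes, using the standard SCSI fixed-format sense layout. It assumes a bash-compatible shell, and the conversion to an offset in MB assumes 512-byte blocks, which the log does not state.

```shell
#!/bin/sh
# Bytes 0..17 of the "Undecoded Sense Data" block, copied from the event above.
sense="F0 00 03 01 4B F6 66 28 00 00 00 00 11 01 00 80 00 3F"

# Split into positional parameters: $1 is byte 0, $2 is byte 1, and so on.
set -- $sense

valid=$(( (0x$1 & 0x80) != 0 ))        # byte 0, top bit: Information field valid
sense_key=$(( 0x$3 & 0x0F ))           # byte 2, low nibble: sense key
lba=$(printf '%d' "0x$4$5$6$7")        # bytes 3-6: Information field (failing block)
asc=${13}                              # byte 12: Additional Sense Code
ascq=${14}                             # byte 13: Additional Sense Code Qualifier
retries=$(printf '%d' "0x${17}${18}")  # bytes 16-17: retry count for this sense key

# ASSUMPTION: 512-byte blocks, so 2048 blocks per MB.
offset_mb=$(( lba / 2048 ))

echo "valid=$valid sense_key=$sense_key asc/ascq=0x$asc$ascq retries=$retries"
echo "failing block=$lba (about $offset_mb MB into the disk)"
```

The values agree with the decoded section of the log: sense key 0x03 (MEDIUM ERROR), ASC/ASCQ 0x1101, Information field 0x014BF666 and 63 retries.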


root@a7dmc:/ [131] > ioscan -funC disk
Class     I  H/W Path       Driver  S/W State   H/W Type     Description
=========================================================================
disk      0  0/0/2/0.0.0.0  sdisk   CLAIMED     DEVICE       TEAC    DV-28E-C
                           /dev/dsk/c0t0d0   /dev/rdsk/c0t0d0
disk      1  0/1/1/0.0.0    sdisk   CLAIMED     DEVICE       HP 146 GMAT3147NC
                           /dev/dsk/c2t0d0   /dev/rdsk/c2t0d0
disk      2  0/1/1/0.1.0    sdisk   CLAIMED     DEVICE       HP 146 GMAT3147NC
                           /dev/dsk/c2t1d0   /dev/rdsk/c2t1d0
root@a7dmc:/ [132] >

root@a7dmc:/ [133] > bdf
Filesystem          kbytes    used   avail %used Mounted on
/dev/vg00/lvol3     229376  141008   87696   62% /
/dev/vg00/lvol1     314736   59528  223728   21% /stand
/dev/vg00/lvol8    8192000 2631048 5517560   32% /var
/dev/vg00/lvfeedData
                   24117248 21653916 2424856   90% /var/opt/dmc/feedData
/dev/vg00/lvSORT    393216    1197  367525    0% /var/opt/dmc/SORT
/dev/vg00/lvASCII  60555264 53112744 7384400   88% /var/opt/dmc/ASCII
/dev/vg00/lvol7    5144576 1298368 3816216   25% /usr
/dev/vg00/lvol6    10256384 3694136 6511800   36% /tmp
/dev/vg00/lvol5    5144576 2002544 3117520   39% /opt
/dev/vg00/lvoracle 10256384 3627982 6421300   36% /opt/oracle/product/9.2.0
/dev/vg00/lvol4    5144576   20976 5083632    0% /home


root@a7dmc:/ [134] > vgdisplay -v vg00
--- Volume groups ---
VG Name                     /dev/vg00
VG Write Access             read/write
VG Status                   available
Max LV                      255
Cur LV                      12
Open LV                     12
Max PV                      16
Cur PV                      1
Act PV                      1
Max PE per PV               4384
VGDA                        2
PE Size (Mbytes)            32
Total PE                    4374
Alloc PE                    4088
Free PE                     286
Total PVG                   0
Total Spare PVs             0
Total Spare PVs in use      0

   --- Logical volumes ---
   LV Name                     /dev/vg00/lvol1
   LV Status                   available/syncd
   LV Size (Mbytes)            320
   Current LE                  10
   Allocated PE                10
   Used PV                     1

   LV Name                     /dev/vg00/lvol2
   LV Status                   available/syncd
   LV Size (Mbytes)            4096
   Current LE                  128
   Allocated PE                128
   Used PV                     1

   LV Name                     /dev/vg00/lvol3
   LV Status                   available/syncd
   LV Size (Mbytes)            224
   Current LE                  7
   Allocated PE                7
   Used PV                     1

   LV Name                     /dev/vg00/lvol4
   LV Status                   available/syncd
   LV Size (Mbytes)            5024
   Current LE                  157
   Allocated PE                157
   Used PV                     1

   LV Name                     /dev/vg00/lvol5
   LV Status                   available/syncd
   LV Size (Mbytes)            5024
   Current LE                  157
   Allocated PE                157
   Used PV                     1

   LV Name                     /dev/vg00/lvol6
   LV Status                   available/syncd
   LV Size (Mbytes)            10016
   Current LE                  313
   Allocated PE                313
   Used PV                     1

   LV Name                     /dev/vg00/lvol7
   LV Status                   available/syncd
   LV Size (Mbytes)            5024
   Current LE                  157
   Allocated PE                157
   Used PV                     1

   LV Name                     /dev/vg00/lvol8
   LV Status                   available/syncd
   LV Size (Mbytes)            8000
   Current LE                  250
   Allocated PE                250
   Used PV                     1

   LV Name                     /dev/vg00/lvfeedData
   LV Status                   available/syncd
   LV Size (Mbytes)            23552
   Current LE                  736
   Allocated PE                736
   Used PV                     1

   LV Name                     /dev/vg00/lvSORT
   LV Status                   available/syncd
   LV Size (Mbytes)            384
   Current LE                  12
   Allocated PE                12
   Used PV                     1

   LV Name                     /dev/vg00/lvASCII
   LV Status                   available/syncd
   LV Size (Mbytes)            59136
   Current LE                  1848
   Allocated PE                1848
   Used PV                     1

   LV Name                     /dev/vg00/lvoracle
   LV Status                   available/syncd
   LV Size (Mbytes)            10016
   Current LE                  313
   Allocated PE                313
   Used PV                     1


   --- Physical volumes ---
   PV Name                     /dev/dsk/c2t0d0
   PV Status                   available
   Total PE                    4374
   Free PE                     286
   Autoswitch                  On


root@a7dmc:/ [135] >

I have sourced a new disk in the meantime. Is there a way I can avoid reinstalling the OS, database and application?
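
For sizing the replacement disk, the vgdisplay figures above can be turned into a quick check. This is only a sketch: the numbers are copied from that output, the helper name usable_pe is made up for illustration, and it ignores the small amount of space LVM keeps for its own headers.

```shell
#!/bin/sh
# Figures copied from "vgdisplay -v vg00" above.
pe_size_mb=32
alloc_pe=4088
max_pe_per_pv=4384
lv_pes="10 128 7 157 157 313 157 250 736 12 1848 313"   # "Allocated PE" of each LV

# The per-LV extents should add up to "Alloc PE".
sum=0
for n in $lv_pes; do sum=$((sum + n)); done

needed_mb=$((alloc_pe * pe_size_mb))        # space the new disk must provide
ceiling_mb=$((max_pe_per_pv * pe_size_mb))  # most that vg00 can address on one disk

# usable_pe SIZE_MB: extents vg00 could use on a disk of SIZE_MB (hypothetical helper).
# Extents beyond "Max PE per PV" are treated as unusable in this volume group.
usable_pe() {
    pe=$(( $1 / pe_size_mb ))
    [ "$pe" -gt "$max_pe_per_pv" ] && pe=$max_pe_per_pv
    echo "$pe"
}

echo "sum of LV extents: $sum (vgdisplay reports $alloc_pe)"
echo "new disk must hold at least $needed_mb MB; vg00 can use at most $ceiling_mb MB per disk"
```

A disk of the same model gives 4374 extents, which is enough for the 4088 in use. A larger disk still fits, but vg00 cannot use more than 4384 extents of it.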

What is on the other disk?
Can you give the output of bdf so I can understand... Why weren't these 2 disks mirrored?

The client refused to run RAID 5 on the server. Here is the bdf output:

root@a7dmc:/ [149] > bdf
Filesystem          kbytes    used   avail %used Mounted on
/dev/vg00/lvol3     229376  141008   87696   62% /
/dev/vg00/lvol1     314736   59528  223728   21% /stand
/dev/vg00/lvol8    8192000 2631272 5517344   32% /var
/dev/vg00/lvfeedData
                   24117248 21696496 2382960   90% /var/opt/dmc/feedData
/dev/vg00/lvSORT    393216    1197  367525    0% /var/opt/dmc/SORT
/dev/vg00/lvASCII  60555264 53112744 7384400   88% /var/opt/dmc/ASCII
/dev/vg00/lvol7    5144576 1298368 3816216   25% /usr
/dev/vg00/lvol6    10256384 3694136 6511800   36% /tmp
/dev/vg00/lvol5    5144576 2002544 3117520   39% /opt
/dev/vg00/lvoracle 10256384 3627982 6421300   36% /opt/oracle/product/9.2.0
/dev/vg00/lvol4    5144576   20976 5083632    0% /home
You have mail in /var/mail/root
root@a7dmc:/ [150] >

And the output below is from the other disk:

root@a7dmc:/ [169] >  pvdisplay -v /dev/dsk/c2t1d0|more
--- Physical volumes ---
PV Name                     /dev/dsk/c2t1d0
VG Name                     /dev/vgoradata
PV Status                   available
Allocatable                 yes
VGDA                        2
Cur LV                      12
PE Size (Mbytes)            16
Total PE                    8749
Free PE                     1492
Allocated PE                7257
Stale PE                    0
IO Timeout (Seconds)        default
Autoswitch                  On

   --- Distribution of physical volume ---
   LV Name                      LE of LV  PE for LV
   /dev/vgoradata/lvsystem_1    33        33
   /dev/vgoradata/lvtemporary_1 513       513
   /dev/vgoradata/lvundo_1      257       257
   /dev/vgoradata/lvusers_1     4         4
   /dev/vgoradata/lvdimension   33        33
   /dev/vgoradata/lvredo_1      13        13
   /dev/vgoradata/lvredo_2      13        13
   /dev/vgoradata/lvredo_3      13        13
   /dev/vgoradata/lvCDR1_1      1688      1688
   /dev/vgoradata/lvCDR2_1      1688      1688
   /dev/vgoradata/lvGSM1_1      1501      1501
   /dev/vgoradata/lvGSM2_1      1501      1501

   --- Physical extents ---
   PE    Status   LV                           LE
   00000 current  /dev/vgoradata/lvsystem_1    00000
   00001 current  /dev/vgoradata/lvsystem_1    00001
   00002 current  /dev/vgoradata/lvsystem_1    00002
   00003 current  /dev/vgoradata/lvsystem_1    00003
   00004 current  /dev/vgoradata/lvsystem_1    00004
   00005 current  /dev/vgoradata/lvsystem_1    00005
   00006 current  /dev/vgoradata/lvsystem_1    00006
   00007 current  /dev/vgoradata/lvsystem_1    00007
   00008 current  /dev/vgoradata/lvsystem_1    00008
   00009 current  /dev/vgoradata/lvsystem_1    00009
   00010 current  /dev/vgoradata/lvsystem_1    00010
   00011 current  /dev/vgoradata/lvsystem_1    00011
   00012 current  /dev/vgoradata/lvsystem_1    00012
   00013 current  /dev/vgoradata/lvsystem_1    00013
Standard input

Mirroring is not RAID 5...
Am I reading this wrong, or is your corrupted disk your boot disk, holding the OS (only?).

How good are you at HP-UX?

Do you have the MirrorDisk/UX software installed ($$ option... the same goes for OnlineJFS...)?

Vbe, let me answer your questions.

Do you have the MirrorDisk/UX software installed? >> No.
Am I reading this wrong, or is your corrupted disk your boot disk, holding the OS (only?) >> Correct, my boot disk is the one that is corrupt.

Is there a way I can save what I have? Are there any scenarios you can describe that would help me recover quickly from this problem?

Can you make a recovery tape before anything else?
Normally these boxes can contain up to 3 disks, can you add one?

Do you have a backup of any system or data partitions? I can't see a tape backup device in "ioscan", but maybe you back up across the network?

Did the system crash, or for that matter has it had a cold boot for any reason, since the failure? I'm not recommending that you actually reboot or run "fsck" (your system may be in a precarious state); I'm just wondering whether "fsck" has already been run by the system.

Vbe:

  1. I can add another disk as soon as it arrives.
  2. Please guide me through the steps to make a recovery tape. Or are you suggesting that I make a full backup?

Methyl:

  1. There is no backup of any sort.
  2. The system has not crashed and has not been rebooted since this failure, and fsck has not been run yet.

A recovery tape allows you to boot from the tape and reinstall onto a new disk if needed...
Check what you have in /opt/ignite/bin; depending on the version, you may or may not have make_recovery. If not, which make_*_recovery commands do you have?

And you will also have to take a full backup...
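
A rough sketch of the Ignite-UX check and recovery archive described above. It only prints the commands unless EXECUTE=1 is set. The tape device /dev/rmt/0mn is an assumption (no tape drive shows up in the ioscan output earlier in the thread), and the make_tape_recovery options should be checked against the man page of the Ignite-UX version actually installed.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
# Set EXECUTE=1 to really run them (on the HP-UX host, as root).
TAPE=${TAPE:-/dev/rmt/0mn}   # ASSUMPTION: no-rewind tape device; adjust to your hardware

run() {
    if [ "${EXECUTE:-0}" = "1" ]; then
        "$@" || { echo "FAILED: $*" >&2; exit 1; }
    else
        echo "+ $*"
    fi
}

recovery_plan() {
    # Which recovery commands does this Ignite-UX version provide?
    run sh -c 'ls /opt/ignite/bin/make_*_recovery'
    # Is a tape drive visible at all?
    run ioscan -funC tape
    # Recovery archive that includes all of vg00.
    run /opt/ignite/bin/make_tape_recovery -x inc_entire=vg00 -v -a "$TAPE"
}

recovery_plan
```

If there is no tape drive, make_net_recovery to an Ignite server over the network is the usual alternative. The data in vgoradata needs its own backup either way.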

My idea is to have everything ready in case of failure...
Don't try to reboot your system before then; it may not come back up...
If the new disk is the same size, I would try to include it in the VG, make it bootable, migrate the faulty PV to the new one, and then remove the faulty disk from vg00...

Then cross your fingers and test a reboot...
If the size is bigger, then it depends... If I can't assist you (busy week), I hope methyl can...
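
The migration described above could look roughly like the sketch below on a PA-RISC box such as this rp3440. It is a dry run that only prints the commands unless EXECUTE=1 is set. The new disk's device name (c2t2d0) and hardware path (0/1/1/0.2.0) are assumptions and must be taken from ioscan once the disk is installed. Moving lvol1, lvol2 and lvol3 first and in that order is meant to keep the boot, swap and root volumes at the front of the new disk. Expect pvmove to stop if it hits extents it cannot read on the failing disk, so take the backups first.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
# Set EXECUTE=1 to really run them (on the HP-UX host, as root).
OLD=c2t0d0                      # failing boot disk, from the vgdisplay output above
NEW=${NEW:-c2t2d0}              # ASSUMPTION: device name of the new disk
NEW_HW=${NEW_HW:-0/1/1/0.2.0}   # ASSUMPTION: hardware path of the new disk

run() {
    if [ "${EXECUTE:-0}" = "1" ]; then
        "$@" || { echo "FAILED: $*" >&2; exit 1; }
    else
        echo "+ $*"
    fi
}

migration_plan() {
    # Make the new disk a bootable LVM physical volume.
    run pvcreate -B /dev/rdsk/$NEW
    run mkboot /dev/rdsk/$NEW
    run mkboot -a hpux /dev/rdsk/$NEW
    # Add it to vg00.
    run vgextend /dev/vg00 /dev/dsk/$NEW
    # Move boot, swap and root first, in order, then everything else.
    for lv in lvol1 lvol2 lvol3; do
        run pvmove -n /dev/vg00/$lv /dev/dsk/$OLD /dev/dsk/$NEW
    done
    run pvmove /dev/dsk/$OLD /dev/dsk/$NEW
    # Refresh and display the boot information, then drop the old disk.
    run lvlnboot -R /dev/vg00
    run lvlnboot -v
    run vgreduce /dev/vg00 /dev/dsk/$OLD
    # Point the primary boot path at the new disk.
    run setboot -p "$NEW_HW"
}

migration_plan
```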

I think that the guys on ittoolbox.com are giving you the same sort of advice (get an Ignite backup of vg00, and also get a backup of your data):
http://unix.ittoolbox.com/groups/technical-functional/hp-ux-l/corrupt-disk-and-reinstallation-of-os-4515267

Running a computer server with no resilient disc system and no backup is not sensible. You need both.

Oracle used to advise against RAID disc systems (notably software RAID) on performance grounds. They also used to make recommendations about the file mapping across multiple discs. A modern hardware RAID system or a SAN is transparent to Oracle.

Just in case you don't have a copy, this unsupported document "When good disks go bad" is a good read. Apart from showing how easy it can be to recover in a resilient system, it also shows the commands to test a disc with "dd".
http://bizsupport2.austin.hp.com/bc/docs/support/SupportManual/c01911837/c01911837.pdf
HP links can go dead, so do save a copy.
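
The "dd" test mentioned above is a plain sequential read of the raw device, with the data thrown away. A minimal sketch follows, again as a dry run that only prints the command unless EXECUTE=1 is set. The disk name is the failing one from this thread, and the function name read_test is just for illustration. Bear in mind that a full read puts load on a disk that is already failing, so it is probably better left until the backups exist.

```shell
#!/bin/sh
# Dry run by default: the command is printed, not executed.
# Set EXECUTE=1 to really run it (on the HP-UX host, as root).
DISK=${DISK:-c2t0d0}   # the disk reporting media errors in this thread

run() {
    if [ "${EXECUTE:-0}" = "1" ]; then
        "$@" || { echo "FAILED: $*" >&2; exit 1; }
    else
        echo "+ $*"
    fi
}

# Read the whole raw device and discard the data.
# An I/O error from dd means at least one block could not be read.
read_test() {
    run dd if=/dev/rdsk/$1 of=/dev/null bs=1024k
}

read_test "$DISK"
```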

I've seen your error message both from a disc with a tight SCSI cable ... and from a disc which failed totally a few minutes later. You can also get this message (followed by a SCSI reset) if you replug SCSI devices with the power on.