Disk Failure

I am new to being a Unix admin and have a question about replacing some hardware. I have a K class box using HP-UX 10.20 with three disks. Two of the drives are in one logical volume. Every 3 or 4 days, the syslog is showing that one of these drives is experiencing "POWERFAILED" and then recovering a few seconds or minutes later. My manager feels that the drive should be replaced.

From reading the documentation, it seems to me that I can shut down, replace the one drive that is failing, and then restore the whole logical volume. Do I need to re-create the logical volume before doing a restore? Are there any other steps I need to take when replacing only one drive when the volume group and logical volume span two drives?

Thank you in advance for any help.

;

Hi,
Follow this:
1.- First of all, back up your VG configuration and structure (vgcfgbackup).
2.- Test your disk with diskinfo /dev/rdsk/cXtXdX, dd if=/dev/dsk/cXtXdX of=/dev/null, and ioscan -fnCdisk (if it does not answer, it has definitely failed).
3.- Shut down your system.
4.- Replace the disk.
5.- Boot to single-user mode.
6.- Execute vgcfgrestore -n /dev/vgXX /dev/rdsk/cXtXdX (the -n option names the VG, and the disk is given as the raw device file).
7.- Activate the VG (vgchange -a y /dev/vgXX).
8.- If the data was spread across both disks, you will probably have to restore it from backup.
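
The steps above can be sketched as a small dry-run script. This is only an illustration under assumptions: replace_pv_plan is a hypothetical helper, and vg01 and c0t5d0 are placeholder names, not values from the poster's system. It prints the commands instead of running them, so the plan can be reviewed before anyone touches the box. The vgcfgrestore line uses -n with the raw device file, which is the form I believe the man page documents; check it on your release.

```shell
#!/bin/sh
# Print (do not run) the LVM commands for swapping one physical volume.
# replace_pv_plan is a hypothetical helper; the VG and disk names passed
# to it are placeholders chosen by the caller.
replace_pv_plan() {
    vg=$1      # volume group name, e.g. vg01
    disk=$2    # device file name, e.g. c0t5d0

    echo "vgcfgbackup /dev/$vg"
    echo "ioscan -fnC disk"
    echo "diskinfo /dev/rdsk/$disk"
    echo "dd if=/dev/dsk/$disk of=/dev/null"
    echo "# shut down, swap the drive, boot to single-user mode"
    echo "vgcfgrestore -n /dev/$vg /dev/rdsk/$disk"
    echo "vgchange -a y /dev/$vg"
    echo "# re-create file systems if needed, then restore data from backup"
}
```

Calling replace_pv_plan vg01 c0t5d0 prints the eight lines in order. Since the logical volume spans both disks and is not mirrored, the last line matters: restoring the LVM configuration does not bring back the data that lived on the old drive.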

I hope it helps.

Cristian.

Post the exact text of the error message. I would not immediately suspect a bad drive although it is possible. Does your manager have a good reason for suspecting a bad drive? A bad drive should be diagnosed from the hardware logs. Use the script command to record your session. Then as root, use the command: "cstm". From the cstm prompt, type "runutil logtool". Do a "sl" and pay attention to the output. It will tell you what the current log was renamed to. Type "sr", and when prompted, type in the name of that log. You will get a summary of the errors. Type "fr" to format the raw log. Now type "fl" to finally view the log.

Each logtool command is two letters, and you press return after the two letters. If a command wants more info, it will ask for it. To summarize the logtool commands:

sl [switch log]
sr [select raw]
fr [format raw]
fl [formatted log]

Then "quit" to get out of logtool. And "quit" to get out of cstm.

By the way, "powerfailed" sounds like a disk driver or a lvm driver thought an operation took too long. You might have an overloaded bus or an unreasonable timeout value. This is what I would be checking first.

Here is the exact error message from Syslog. (I could not find cstm on my system).

Jun 7 06:02:04 nvidev vmunix: xvfs: mesg 016 : vx_ilisterr - /fs5 file system error readin inode 473
Jun 7 12:40:07 nvidev vmunix: disc30 56/52.4.0 SCSI even UNKNOWN_RESELECT
Jun 7 12:40:07 nvidev vmunix: LVM: vg[1]: pvnum=0 (dev_t=0x1c00400) is POWERFAILED
Jun 7 12:40:07 nvidev vmunix: LVM: PV 0 has been returned to vg[1].
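
To see how often this really happens and whether it is always the same physical volume, messages like the ones above can be tallied per day. A minimal sketch, assuming the lines have the format shown here; count_powerfailed is a hypothetical helper name, and the syslog path in the usage note is the usual HP-UX location rather than something confirmed for this system.

```shell
#!/bin/sh
# Tally LVM POWERFAILED messages per day and per physical volume.
# Reads syslog text on stdin; count_powerfailed is a hypothetical helper.
count_powerfailed() {
    awk '/LVM:/ && /is POWERFAILED/ {
        pv = "unknown"
        # pick out the pvnum=N field wherever it appears on the line
        for (i = 1; i <= NF; i++)
            if ($i ~ /^pvnum=/) pv = $i
        key = $1 " " $2 " " pv          # month, day, physical volume
        if (!(key in n)) order[++cnt] = key
        n[key]++
    }
    END {
        # print in the order the keys were first seen
        for (i = 1; i <= cnt; i++) print order[i], n[order[i]]
    }'
}
```

Run it as count_powerfailed < /var/adm/syslog/syslog.log. If every line reports the same pvnum, that points at one drive or its path; if several PVs on the same bus show up, the bus, cable, or terminator becomes more suspect.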

Once this happened while one of our programmers was in the middle of something and the whole system froze up. After about 10 minutes of panic from several people, the system cleared up and he was able to save his work. Since this happens 3 or 4 times a week, the manager believes that the drive is failing and would like it replaced before it fails completely.

Just to let you know, we first noticed the problem when the whole system crashed. We restarted the machine and noticed the errors in the syslog. As far as I know, there were no changes to the system before the crash.

Thanx for your help

;

Thanx for the help, Cristian. I was not aware of the vgcfgbackup and would have been lost without it.

Perderabo,
Based on the info from Syslog, would you do a drive replacement? Or should I be looking at something else?

Thanx for the help.

;

Looks like the drive needs replacement from what the log is saying, especially if the manager wants it and is willing to pay for a new drive. Make him feel good :slight_smile:

There is some wisdom in what Just Ice says. The manager wants the drive replaced, so replace it. Drives are not super expensive and it can't hurt to replace one.

I don't know what I would do if I were in your position. There is no way that I would allow myself to be in that position. You have an HP-UX OS without support tools. Well, I would install them pronto. That means finding your support CD. Should you find it...here is the manual. In theory, another option is to download the tools, and that option is mentioned here. But I can't find the support tools for 10.20. I assume you know that 10.20 is no longer supported? Without the output from the diagnostics, I don't know where to point a finger.

So the best path I can see is to replace the drive as your manager wants. After that is done, you will know if it was the drive or not. :smiley: