Recover failed system disk

I have an oldish Solaris 10 system (SunFire x4240) which, due to a recent heating event in the server room, lost its system disk.

I have rsync backups of all the other (data) disks, but apparently I do not have a backup of /. :( I can start the machine up in failsafe mode, but running fsck on the system disk always reports a couple of bad sectors, which I don't seem to be able to repair or ignore (I tried format -> analyze -> read, etc.).

It looks like I can mount the disk read only, so I'm hoping I can copy most of the pertinent info off of it, install Solaris 10 on a fresh replacement disk, and then copy that pertinent system info back onto the new system so that I don't have to recreate network info, users, disk mounts, NIS info, and various other things from scratch.

Once I have the fresh OS installed on a new disk, is it safe to mount the failed disk read-only and use rsync to copy the accessible files on that disk to a safe location - or is there a better way to do this?
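
Roughly what I have in mind, assuming the old disk shows up as something like c1t1d0 once attached (that name is just a placeholder):

    # create a mount point and mount the damaged root filesystem read-only
    mkdir -p /mnt/oldroot
    mount -F ufs -o ro /dev/dsk/c1t1d0s0 /mnt/oldroot
    # pull whatever is readable over to a safe spot on a good disk
    rsync -av /mnt/oldroot/ /backup/oldroot/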

Thanks.

-J

---------- Post updated at 02:38 PM ---------- Previous update was at 10:07 AM ----------

I installed Solaris 10 on the new disk, but now I'm wondering how best to get the files I want off of the old disk. The system has 16 SAS bays; they are laid out as follows:

0: new system disk
1-3: single-volume disks
4-15: RAID 5 array

All bays are filled.

The RAID controller knows about all the disks, but the new system does not (yet), though I can easily mount the single-volume disks. The question is: can I power off the machine, swap out one of the single-volume disks for the bad disk, then power up, mount the bad disk, and copy files off of it, without irrevocably screwing up the RAID controller's knowledge of the disk I pulled out to make space?

Thanks.

-J

It seems to me that the main thing you need to do is save whatever files you can from the root filesystem that you have no backup of.

You should be able to mount it read-only on your new O/S (or you could boot into single-user mode from CD/DVD and mount it under that).

If you have sufficient other storage available, you could then attempt to 'dd' the whole raw partition off to a file and/or find/cpio the whole filesystem off to an archive.
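
For example, the find/cpio variant could look something like this (mount point and archive path are illustrative):

    # archive the read-only-mounted filesystem off to good storage
    cd /mnt/oldroot
    find . -depth -print | cpio -ocB > /backup/oldroot.cpio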

The comments you have already made lead me to assume that you have quite a bit of experience with Solaris, so I haven't gone into much detail.

What RAID controller is it? Make/model?

When you say that fsck doesn't straighten out the filesystem, what command are you using? If you're using the "-n" flag then it will test the filesystem without correcting anything. I definitely would not use the "-y" flag, because the filesystem could be damaged beyond repair before you can stop the operation. I would use neither flag and just see what questions it asks. I would also perhaps use the (often undocumented) "-o full" flag to examine the whole filesystem, although this could take a very long time. Again, don't use -y or -n, so you can abort if needed.
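
For example (the raw device name is illustrative):

    # interactive run: fsck asks before each repair, so you can abort if it looks bad
    fsck -F ufs /dev/rdsk/c1t1d0s0
    # exhaustive check of the whole filesystem; can take a very long time
    fsck -F ufs -o full /dev/rdsk/c1t1d0s0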

---------- Post updated at 11:28 AM ---------- Previous update was at 11:26 AM ----------

I guess the main point I'm making is to mount the filesystem read-only and then back it up. If you subsequently lose that filesystem completely, you can get some data back from the archive.

"quite a bit of experience with Solaris" is probably a big overstatement. I have done system administration only out of necessity for the past 20 years on various *nix systems. So I have experience over many years, but it is infrequent experience. The main sysadmin around here has not dealt with Solaris for years. So while I remember a few things, google is my friend.

Running StorMan, I see the claim that the RAID controller is "Sun STK RAID INT", but that's about all the info I seem to be able to find without rebooting - which I guess I will be doing soon. My main worry right now is that if I swap out a disk (a single-volume disk, not one from the RAID), the controller may lose knowledge of it. However, I have THAT data backed up, so it's not a big deal if I lose it. And I guess if I can actually mount and read the old system disk in the freed-up slot, I should be able to repeat the process when I put back whatever disk I pulled out to make room.

I've already run fsck -y a number of times on the bad disk, so whatever damage that may have done is already done.

I know what to do in general, but I am trying to avoid a misstep that will damage the disk further and prevent me from getting as much info as I can off of it.

Thanks.

-J

"Sun STK Raid int" tells me it's a StorageTek RAID controller often found in Sun boxes.

If you Google search for it there's plenty of info.
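
If the StorMan GUI is on the box then its command-line tool, arcconf, probably is too; something like this should dump the configuration without a reboot (the install path and controller number may differ on your machine):

    # adapter, logical-device and physical-device details
    /usr/StorMan/arcconf GETCONFIG 1 AD
    /usr/StorMan/arcconf GETCONFIG 1 LD
    /usr/StorMan/arcconf GETCONFIG 1 PD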

ALSO, search this forum for it.

There are people on this forum, like DukeNuke2, with more experience of this RAID controller than me.

If you don't know the existing RAID configuration, I'd be inclined to avoid removing any working disks, as they may be part of an array: a RAID5 array, for example, or, even worse, a RAID0, where the loss of one drive takes the array off-line.

You either need to back up the drive where it is now, or remove the drive and connect it to another machine just to take a sector-by-sector image backup. That way, if the data you lose afterwards turns out to be vital, you can write that image out to a new drive.
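
A sector-by-sector image might look something like this (disk names are illustrative; s2 is conventionally the backup slice covering the whole disk):

    # read the whole disk sector by sector, continuing past bad blocks;
    # bs=512 keeps the data lost around each bad sector to a minimum
    dd if=/dev/rdsk/c1t1d0s2 of=/backup/olddisk.img bs=512 conv=noerror,sync
    # later, if needed, write the image back out to a replacement drive
    dd if=/backup/olddisk.img of=/dev/rdsk/c2t0d0s2 bs=512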

So you are saying that despite running fsck -y a number of times the filesystem still isn't fixed? It still shows errors?

So far so good.

I knew the configuration of all of the disks from the RAID controller setup screen (Ctrl-A when booting). I had already removed the failed disk in order to put in a fresh one on which to install the new system (there were no free slots).

The RAID controller must be smart enough, because everything worked smoothly. I shut down the computer, pulled a known single-volume disk with not too much data on it (for which I had a complete backup as well), and inserted the failed disk into that slot. On boot, the controller detected the change, made a new configuration, and came up just fine.

I was able to mount the bad disk and copy everything off of it except for the contents of /usr/lib. This should effectively get me everything (config files, etc.) that I need to rebuild the system the way it was before. I copied the files off of the bad disk by rsyncing what I thought were the most important directories first (in case something bad should happen).

After turning the computer off, re-inserting the disk that I had swapped out, and turning it back on again, the RAID controller once again detected the change, did a reconfiguration, and now everything looks as it did earlier today. I can see and mount all of the disks and access their data - except that now I also have a copy of everything that was on the old system disk (except for /usr/lib), from which I can (hopefully) get the system back into its pre-crash state.

Thanks.

-J

---------- Post updated at 01:57 PM ---------- Previous update was at 12:57 PM ----------

Apparently, I foolishly chose the default disk partitioning when installing the new system. Now / (slice 0) has very little space on it (6.4 GB), while the rest of the space on the disk (124 GB) is mounted as /export/home (slice 7).

So far, I've only made a few minor changes to /, and none to /export/home.
Is there any way to repartition the disk so that the whole thing is allocated to / in slice 0, or will I have to reinstall the system (again)?

User home directories are all on a separate disk anyway.

Thanks.

-J

---------- Post updated at 03:29 PM ---------- Previous update was at 01:57 PM ----------

FYI to anyone who might read this thread in the future:

I was actually able to increase the size of partition 0 to the full disk (except for the swap and boot slices) by basically following the instructions at https://blogs.oracle.com/michel/entry/resize_solaris_partition (modified for my purposes) and running growfs. And it didn't even bork my system! I was prepared to have to reinstall Solaris.
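
In outline, roughly what I did (the device name below is how my disk shows up; adjust for yours, and have backups before trying this):

    # in format(1M): select the system disk, enter the partition menu,
    # delete the /export/home slice and extend slice 0 over the freed space,
    # then write the new label
    format
    # grow the mounted root filesystem into the enlarged slice
    growfs -M / /dev/rdsk/c1t0d0s0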

However, I didn't think to first remove the line from my vfstab which attempts to mount the partition that I removed. This caused problems on boot and ended up requiring another reboot.
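
For the record, the stale entry in /etc/vfstab was the old /export/home line, something like:

    /dev/dsk/c1t0d0s7  /dev/rdsk/c1t0d0s7  /export/home  ufs  2  yes  -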

The system is probably not properly tuned. Of course, now that I have done it, it occurs to me that maybe swap should be much larger. This is what happens when you only do system management occasionally.
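
(For future reference, it looks like swap can be grown without repartitioning; something along these lines, where the size and path are just examples:)

    # list current swap devices
    swap -l
    # create and activate a 4 GB swap file
    mkfile 4096m /swapfile
    swap -a /swapfile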

-J