Solaris Boot Problems, random messages [/sbin/rcS: /etc/dfs/sharetab: cannot create]

Hello All,

One of my Solaris (V240) servers has all of a sudden developed boot problems. Upon a routine reboot, I was faced with the following errors:

Feb 1 07:56:44 sco1-au-tci scsi: WARNING: /pci@1c,600000/scsi@2/sd@0,0 (sd0):
Feb 1 07:56:44 sco1-au-tci Error for Command: read(10) Error Level: Retryable
Feb 1 07:56:44 sco1-au-tci scsi: Requested Block: 114007888 Error Block: 114007903
Feb 1 07:56:44 sco1-au-tci scsi: Vendor: SEAGATE Serial Number: 053532DN34
Feb 1 07:56:44 sco1-au-tci scsi: Sense Key: Media Error
Feb 1 07:56:44 sco1-au-tci scsi: ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0xf
Feb 1 07:56:45 sco1-au-tci scsi: WARNING: /pci@1c,600000/scsi@2/sd@0,0 (sd0):
Feb 1 07:56:45 sco1-au-tci Error for Command: read(10) Error Level: Fatal
Feb 1 07:56:45 sco1-au-tci scsi: Requested Block: 114007888 Error Block: 114007903
Feb 1 07:56:45 sco1-au-tci scsi: Vendor: SEAGATE Serial Number: 053532DN34
Feb 1 07:56:45 sco1-au-tci scsi: Sense Key: Media Error
Feb 1 07:56:45 sco1-au-tci scsi: ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0xf

So I figured, oh ****... the disk is messed up. However, a few checks suggested otherwise: 'iostat -En' showed ALL error counts as 0, and a format -> analyze -> read test, which ran for about 10 hours, came back reporting 0 errors to repair (the exact commands are sketched after the boot messages below). So it appears nothing is particularly wrong with my hardware. After the 2nd reboot I no longer got the errors above, but now I can't seem to get past single-user mode. I get the following errors:

mount: the state of /dev/dsk/c1t0d0s0 is not okay
and it was attempted to be mounted read/write
mount: Please run fsck and try again
/sbin/rcS: /etc/dfs/sharetab: cannot create
failed to open /etc/coreadm.conf
syseventd: Unable to open daemon lock file '/etc/sysevent/syseventd_lock': 'Read-only file system'
INIT: Cannot create /var/adm/utmpx

INIT: failed write of utmpx entry:" "

INIT: failed write of utmpx entry:" "

INIT: SINGLE USER MODE

Type control-d to proceed with normal startup,
(or give root password for system maintenance):
single-user privilege assigned to /dev/console.
Entering System Maintenance Mode
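
The checks I mentioned were roughly the following (the format session is interactive, so the menu path is just summarized):

# iostat -En
(per-device Soft/Hard/Transport error counters; every count was 0 for this disk)
# format
(select c1t0d0, then: analyze -> read, accepting the defaults)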

I am unable to run fsck, because this drive holds an image of a corrupted drive (one that had a bunch of unreadable sectors/blocks). I copied it over with ufsdump/ufsrestore (roughly the pipeline sketched after the fsck output below), which obviously left a gaping hole at the tracks/sectors where the original disk was unreadable. So even though the server does its job without any problems, fsck won't run; it gives me a message like:

[root@sol8-ssw01 /]# fsck -y /dev/rdsk/c1t0d0s0
** /dev/rdsk/c1t0d0s0

CANNOT READ: BLK 143278112
CONTINUE? yes

THE FOLLOWING SECTORS COULD NOT BE READ: 143278112 143278113 143278114 143278115
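
For completeness, the disk-to-disk copy was done with the usual ufsdump/ufsrestore pipeline, something like this (the /mnt mount point and the placeholder device names are illustrative, not my exact ones):

# newfs /dev/rdsk/<new-disk>s0                  (fresh filesystem on the target slice)
# mount /dev/dsk/<new-disk>s0 /mnt
# ufsdump 0f - /dev/rdsk/<old-disk>s0 | (cd /mnt && ufsrestore rf -)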

I have read a whole bunch of the suggestions I found on Google: /var being full (it's not), the WWN in vfstab not matching the /dev and /devices entries, and so on. I don't know what is wrong, and I don't know how to fix it. Any ideas as to why this happened and what I can do?

PLEASE HELP!!!

Have you tried another disk?

I'm also curious about the "routine reboots". Do you routinely reboot Solaris servers? Why?

Solaris has the command "iostat -E" which reports hardware errors. I suggest the OP run that.
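
For a healthy disk, every counter should be zero; the output looks roughly like this (illustrative sample, the vendor/model strings will vary):

c1t0d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: SEAGATE  Product: ST373307LSUN72G  Revision: 0507  Serial No: XXXXXXXX
Size: 73.40GB <73400057856 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0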

System Shock, I am favorably inclined towards routine reboots. My last employer's data center went down due to power problems (despite a super-UPS and an on-site generator!), and dozens of boxes which had been up for months did not reboot. Various changes had been made, and no one had tested the startup scripts. Some of the boxes did not reboot because the battery in the IDPROM had died. I finally figured out how to get them up, but this left them in a state where they would be unbootable should power drop again. Rebooting a few boxes at a time each week would have exposed those issues. Another time, we had to take a box down to move it, and we noticed it had a /reconfigure file sitting in the root directory. The guy who put it there had left over a year ago; we had no idea what the reboot would bring. Also, we were unable to install security patches because they would almost always reboot a box. With a reboot schedule, we can have a reasonable patch-management policy.

System Shock: I don't think the situation is at the point of trying new disks. If it were, I wouldn't be posting my question anywhere; I only replace disks when I know for sure the disk is the problem and not something else. Not to mention that we don't have an on-site ops team, I live on a different continent from the servers, and with it being a weekend and a 16-hour time difference, it isn't easy to play the 'replace disk' card too often or too casually. As for 'routine reboot': exactly as Perderabo said. It exposes a lot of problems that one would never have caught otherwise.

Perderabo: iostat -En was the first thing I tried, and as I said in my original message, it came back with 0 (zero) errors on ALL lines. Plus, format -> analyze -> read showed no errors, so I'm guessing it's not the disk. Also, the media errors only showed up once and haven't appeared on subsequent reboots, which they would have if the disk were damaged.

Darren Dunham already gave you pretty much everything you needed to know about this elsewhere.

This is your disk:

[root@sol8-ssw01 /]# prtvtoc -s /dev/rdsk/c1t0d0s0
*                          First     Sector    Last
* Partition  Tag  Flags    Sector     Count    Sector  Mount Directory
       0      2    00          0  141476928  141476927
       1      3    01  141476928    1872384  143349311
       2      5    00          0  143349312  143349311

141476927 < 143278112

That is, the last sector of slice 0 is 141476927, yet fsck is trying to read block 143278112, which lies beyond the end of the slice.

As you can see from this, you have tried to restore a dump which contains more data than can fit in the slice you restored it into: the filesystem extends past the slice boundary. Re-lay out the disk and try again.
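
One way to double-check that the filesystem really is bigger than its slice, without mounting anything, is to read the size out of the superblock (the numbers on the output line here are made up):

# fstyp -v /dev/rdsk/c1t0d0s0 | grep -w size
ncg 434 size 17909764 blocks 17633302

The "size" field is in fragments; multiply it by the fragment size ("fsize" in the same fstyp -v output, typically 1024) and divide by 512 to get sectors, then compare that against the Sector Count prtvtoc reports for the slice.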

Hi reborg,

thanks for that. I was waiting on Darren to get back to me to confirm that I'm reading/understanding it correctly. What's confusing is that the ufsdump/ufsrestore was done from a disk with the exact same geometry/model/size, and the partition table was copied from that disk as well, so I don't see how there can be more data than the original slice held. Also, would re-laying out the disk mean reinstalling everything from scratch, OS, applications and all?
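
When I next get access, I can diff the two labels to be sure; something like this (the old disk's device name is a placeholder):

# prtvtoc /dev/rdsk/<old-disk>s2 > /tmp/old.vtoc
# prtvtoc /dev/rdsk/c1t0d0s2 > /tmp/new.vtoc
# diff /tmp/old.vtoc /tmp/new.vtoc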

The 2nd question is: will fixing the partitions, and presumably getting fsck to run, fix my original problem of not being able to boot up? Mind you, this server has been successfully booted/rebooted with this same partitioning in the past. It was up for about 178 days, and I rebooted it purely for maintenance when I ran into these errors. They somehow developed all by themselves during the period it was running fat and happy.

Any thoughts on the original problem?

Thanks
\R

You are in Rockville... that total loss of power, did it happen in a data center around Beltsville, by any chance?

The data center was in Rockville, which I would not say is "around Beltsville", but your mileage may vary. It used to be that I both lived and worked in Rockville. Now I live in Rockville but work in Sterling, VA. I'm not sure what to put in my profile now.

^^^ Warp Drive :smiley:

ranjtech, I don't know how or when, but you have ended up with a filesystem larger than the slice on which it resides. I'm guessing someone was playing with fmthard; that being the case, there is a reasonable chance that if you can put things back exactly as they were (if you saved the prtvtoc output from the source disk, for example), you can recover the situation.
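
If that prtvtoc output was saved, putting it back is a one-liner (device name illustrative; fmthard rewrites the disk label, so triple-check you have the right file and the right disk):

# prtvtoc /dev/rdsk/c1t0d0s2 > /var/tmp/c1t0d0.vtoc     (the save, done beforehand)
# fmthard -s /var/tmp/c1t0d0.vtoc /dev/rdsk/c1t0d0s2    (the restore)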

That being said, you should have backups; the question of reinstallation should not even arise. If you don't have backups, it's time to start.
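
Even a plain level-0 ufsdump of each filesystem to tape would do as a starting point (tape device illustrative):

# ufsdump 0uf /dev/rmt/0n /dev/rdsk/c1t0d0s0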

I asked because I lived in the District for many years and worked in Herndon with PSINet, and I remember something like the power loss you describe happening to Digex, a data center that was between Beltsville and Laurel. That was some time ago, though...