Rootvg in read-only state

I have Oracle Linux 5.7. The server hung, so I rebooted it, and now it comes up in maintenance mode asking for the root password. After entering the password, it sits at

(Repair filesystem) 10 # 

I am not sure where to go from here. All filesystems are in volume groups. The non-root storage comes from EMC storage and the root comes from local disks.
The root filesystem is as below:

/dev/cciss/c0d0p2 62G 29G 31G 49% /
/dev/cciss/c0d0p1 62G 29G 31G 49% /boot

I also ran fsck as below and rebooted, but no luck.

fsck /dev/cciss/c0d0p2
fsck /dev/cciss/c0d0p1

Please suggest.

You probably need to reactivate your volume group. See the vgchange man page.
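A minimal sketch of what that usually involves from the maintenance shell (substitute your actual volume group names):

vgscan          # rescan for volume groups
vgchange -ay    # activate all volume groups (or name a specific VG)

After that, the logical volume device nodes should appear under /dev/&lt;vgname&gt;/ and the fsck/mount can be retried.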

It was ASR that rebooted the server. Now I am able to bring the server up:

  • Remounted the root filesystem read-write
  • Commented out all filesystems coming from the SAN in /etc/fstab
  • Rebooted the server
  • Uncommented the SAN filesystems and mounted them all manually.

But I still need to get to the bottom of one issue: if I do a clean reboot, the server again stops in maintenance mode and I have to follow the above process. I have attached a screenshot for reference; the console is a GUI, so I cannot copy the output.
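For reference, the recovery steps above correspond roughly to the following from the repair shell (a sketch only, not the exact session):

mount -o remount,rw /    # remount the root filesystem read-write
vi /etc/fstab            # comment out the SAN filesystem entries
reboot
# once the server is up cleanly:
vi /etc/fstab            # uncomment the SAN entries again
mount -a                 # mount everything manually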
---------------------------------------------------------------------
I was able to get some more information: it hit a "file not found" error for an LV while booting. I think there is something non-standard in /etc/fstab. Here is a copy:
[root@tlprd_spc05 ]# cat /etc/fstab
LABEL=/                 /                       ext3    noatime,defaults        1 1
LABEL=/boot             /boot                   ext2    noatime,defaults        1 2
tmpfs                   /dev/shm                tmpfs   defaults        0 0
devpts                  /dev/pts                devpts  gid=5,mode=620  0 0
sysfs                   /sys                    sysfs   defaults        0 0
proc                    /proc                   proc    defaults        0 0
LABEL=SW-cciss/c0d0p3   swap                    swap    defaults        0 0
/dev/vg_local/lv_oel5u7_dump                     /oel5u7                           ext3 noatime,defaults 1 2
/dev/vg_local/lv_opt_simpana                     /opt/simpana                      ext3 noatime,defaults 1 2
/dev/vg_tssd/lv_tss                              /tss                              ext3 noatime,defaults 1 2
/dev/vg_tssd/lv_tss_apps                         /tss/apps                         ext3 noatime,defaults 1 2
/dev/vg_tssd/lv_tss_oracle_oradata_TSSD          /tss/oracle/oradata/TSSD          ext3 noatime,defaults 1 2
/dev/vg_tssd/lv_tss_oracle_product_11.2.0_TSSD   /tss/oracle/product/11.2.0/TSSD   ext3 noatime,defaults 1 2
/dev/vg_tssq/lv_tss_oracle_oradata_TSSQ          /tss/oracle/oradata/TSSQ          ext3 noatime,defaults 1 2
/dev/vg_tssq/lv_tss_oracle_product_11.2.0_TSSQ   /tss/oracle/product/11.2.0/TSSQ   ext3 noatime,defaults 1 2
/dev/vg_u01_oracle/lv_u01_oracle                 /u01/oracle                       ext3 noatime,defaults 1 2
/dev/vg_u01_oracle/lv_u01_oracle_export          /u01/oracle/export                ext3 noatime,defaults 1 2
/dev/vg_u01_oracle/lv_u01_oracle_housekeeping    /u01/oracle/housekeeping          ext3 noatime,defaults 1 2
/dev/vg_u01_oracle/lv_u01_oracle_product_11.2.0  /u01/oracle/product/11.1.0        ext3 noatime,defaults 1 2
/dev/vg_oe9d/lv_tss_oracle_oradata_OE9D          /tss/oracle/oradata/OE9D          ext3 noatime,defaults 1 2
/dev/vg_oe9d/lv_tss_oracle_product_11.2.0_OE9D   /tss/oracle/product/11.2.0/OE9D   ext3 noatime,defaults 1 2

I think that while the server is coming up, the OS starts looking for the logical volumes from the SAN. Since the HBA may not be completely initialised by that time, it results in the error below:

fsck.ext3: No such file or directory while trying to open /dev/vg_vg_tssd/lv_tss
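For what it's worth, from the (Repair filesystem) shell the SAN volume groups can be checked and activated by hand along these lines (a sketch only, using the vg_tssd names from the fstab above):

pvs                        # are the SAN physical volumes visible yet?
vgs                        # are the volume groups known?
vgchange -ay vg_tssd       # if so, activate the VG
fsck /dev/vg_tssd/lv_tss   # then retry the fsck/mount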

What would be the best way to avoid this issue if I reboot the box again? Please suggest.

I've seen this happen with old IBMsdd drivers/software where it attempts to mount filesystems before the driver is actually loaded. We ended up commenting out the entries in /etc/fstab and hacking /etc/rc.local - you should not do that :slight_smile:

What HBAs do you have in the server? Is the storage managed by multipathd, etc.?

When the server hung for the first time, was that the first time it was rebooted since these logical volumes were added, or has the server been rebooted successfully before?

You are right about the drivers. It seems the same thing is happening here.
The EMC storage is connected to this server via two Brocade HBA cards, and multipathing is managed by EMC PowerPath.
The server hang itself is not a problem, we found the cause of that, but I am worried about the unsuccessful reboots. I am not sure about past reboots, but whenever I reboot this Linux server it always stops at the maintenance level. I suspect /etc/fstab is playing a role here.
I have already added noatime in /etc/fstab; should that not delay the mounting of the SAN filesystems?

No - something lower-level is happening if the device entries do not exist; no manner of changes within /etc/fstab will help that. noatime stops access times being updated within inodes and is (sometimes) a performance boon - it will not delay the mounting of anything.

Does the PowerPath software log anything?

EMC PowerPath does not log anything. However, the device paths are correct. Once I am able to bring the server up without the SAN filesystems, I can run mount -a and it mounts all the devices without any problem.
In /etc/fstab we have enabled fsck to run on the EMC SAN devices, as marked by the 2 in the last field:

/dev/vg_oe9d/lv_tss_oracle_product_11.2.0_OE9D   /tss/oracle/product/11.2.0/OE9D   ext3 noatime,defaults 1 2

I suspect this is forcing an fsck of the SAN filesystems while the system is not yet completely initialised. Is it really required?
If not, we can change this 2 to something else. Your thoughts, please.

All the EMC PowerPath stuff should be loaded nice and early via /etc/modprobe.conf.pp - you can try disabling fscks (change the 2 to 0 for all SAN-based FS) but it *shouldn't* make a difference.
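For example, using the entry quoted above, the last field would change from 2 to 0:

/dev/vg_oe9d/lv_tss_oracle_product_11.2.0_OE9D   /tss/oracle/product/11.2.0/OE9D   ext3 noatime,defaults 1 0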

I found a couple of posts from people with similar issues to yours - one was unanswered and one was identified as an EMC bug - and the EMC solution?! Comment out and mount manually!

If you do end up doing that, I'd suggest adding the mount entries to /etc/rc.local.
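Something along these lines at the end of /etc/rc.local (a sketch only - one line per SAN logical volume, with devices and mount points taken from your fstab):

# make sure the SAN volume groups are active, then mount them
/sbin/vgchange -ay
/bin/mount -t ext3 -o noatime /dev/vg_tssd/lv_tss /tss
/bin/mount -t ext3 -o noatime /dev/vg_tssd/lv_tss_apps /tss/apps
# ...and so on for the remaining SAN logical volumes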

However, in the first instance, I'd get a support case open with EMC ASAP and double check for changes on the system. Have there been any upgrades recently (kernel, for example)?

I will need to take downtime to test changing the fsck field from 2 to 0, so I will look at it later, once I have something in hand.
There were no recent changes made on this system. However, this server was rebooted after a very long time, so we don't know whether this is a new problem or whether it has existed all along without us knowing. I will try to get hold of the EMC guys.

You'll probably find it was the result of some change made in the very long time prior to the server being rebooted and prior to the problem being discovered.

Let us know how you go with EMC!

Cheers,
ZB

I got a response from EMC:
"In the /etc/fstab file , PowerPath devices should be mounted with the _netdev option instead of the defaults option. This will ensure that fsck is run later in the boot sequence"
This should work, I think.

That will definitely work - and, checking my notes, it is exactly what I used to do with OCFS2 filesystems under Linux.

My filesystems are reiserfs. Should this option work with that as well?

It should still work - the _netdev mount option should be independent of the filesystem type. However - your initial /etc/fstab shows them as being ext3 filesystems? In any case, _netdev should do the trick.

Oh yes. You are right.
I got more from EMC:
"This is the recommended setting when using OEL/RHEL 5.x with PowerPath. There is a probability that LVM and PowerPath start working at the same time and clash, resulting in the reported issue.
Actually, setting the _netdev option will delay the mounting of the filesystems on the PowerPath devices. PowerPath doesn't rely on the network, but delaying until network start ensures that LVM has completed its tasks and thus avoids the race condition."

Yep - that all sounds correct. OCFS2 uses a network-based heartbeat, which is why we used the _netdev option - but it should work for PowerPath-based devices too.

Now I might go back and fix those old IBMsdd issues too, with _netdev - should work there too :smiley: