I have a bit of an odd situation that I would like to float here to see if anyone has any ideas.
We are working on doing Disaster Recovery on a number of RHEL 7.4 systems. These are running on Cisco Blade Servers.
The mount point for /boot is on a multipathed SAN LUN. There are a couple of other SAN volumes as well in disk groups; none of those cause problems, but /boot does.
/dev/mapper/360000970000197700328533030344233p1 976M 175M 751M 19% /boot
Above is the mount for the production system. We use SRDF to copy this data to a DR site, then clone the storage and boot up matching hardware at the DR site (the same Cisco blade server models as in production).
When we boot up the DR nodes, the system understandably gets a little 'confused' and mounts /boot on a raw single-path device:
/dev/sdy1 976M 175M 751M 19% /boot
Note that /boot is now on /dev/sd* instead of the multipath device /dev/mapper/*.
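To make that symptom easy to spot from a script, here is a minimal sketch of a check on the /boot source device. The function name is made up for illustration, and it treats any path under /dev/mapper as "good", which is an assumption (LVM volumes live there too, although that should not matter for a plain /boot partition).

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) when the given source device
# is a path under /dev/mapper, fails (exit 1) for anything else.
is_mapper_device() {
    case "$1" in
        /dev/mapper/*) return 0 ;;
        *)             return 1 ;;
    esac
}

# Example use on a live system (findmnt is part of util-linux):
#   src=$(findmnt -n -o SOURCE /boot)
#   is_mapper_device "$src" || echo "WARNING: /boot is on $src"
```

On the DR nodes described above this would flag /dev/sdy1, while the production mount would pass.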
So to fix this:
First, I run 'multipath -W', which resets the /etc/multipath/wwids file so that it only lists the WWIDs of the current multipath devices.
multipath -W
successfully reset wwids
Now the WWIDs from the production side are gone and only the new ones at the DR site remain, so that file is now a happy file.
After that, I need to add a filter to /etc/lvm/lvm.conf to ignore any devices aside from the /dev/mapper devices. (Red Hat support suggested that I add the global_filter as well. It seemed to work OK with just 'filter', but adding it didn't hurt either.)
filter = [ "a|/dev/mapper/.*|", "r|.*|" ]
global_filter = [ "a|/dev/mapper/.*|", "r|.*|" ]
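Since the stock lvm.conf carries several commented-out example filters, it is easy to edit the wrong line. A minimal sketch for listing only the active (uncommented) filter and global_filter lines follows; the function name is hypothetical, and it assumes each setting sits on a single line.

```shell
#!/bin/sh
# Hypothetical helper: print the uncommented "filter =" and
# "global_filter =" lines from an lvm.conf-style file.
# Lines starting with '#' do not match because the pattern only
# allows whitespace before the keyword.
active_lvm_filters() {
    grep -E '^[[:space:]]*(global_)?filter[[:space:]]*=' "$1"
}

# Example use:
#   active_lvm_filters /etc/lvm/lvm.conf
```

With the two lines above in place, this should print exactly those two settings and nothing else.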
Then - create a new initramfs image:
dracut --force --add multipath --include /etc/multipath /etc/multipath
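Before rebooting, it may be worth confirming that the multipath configuration actually landed in the new image. A minimal sketch, assuming the lsinitrd file listing contains paths such as etc/multipath.conf or etc/multipath/wwids (the function name is made up):

```shell
#!/bin/sh
# Hypothetical helper: reads an initramfs file listing on stdin and
# succeeds if any entry mentions etc/multipath.
initramfs_has_multipath() {
    grep -q 'etc/multipath'
}

# Example use (lsinitrd ships with dracut; with no argument it
# inspects the image for the running kernel):
#   lsinitrd | initramfs_has_multipath || echo "multipath config missing"
```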
And reboot.
The server comes back up in either rescue or emergency mode (I'll pay more attention next time) and EACH TIME, running grub2-mkconfig fixes it - and the server boots just fine.
I need to figure out what's going on, if only for my own geeky obsessiveness. The thing is, I saved a backup copy of /boot/grub2/grub.cfg and compared it to the new one that was generated in emergency mode, and there are zero differences. I used Notepad++ to do a file comparison, even adding a character to verify the plug-in was working right, and I can find no difference at all between the two files.
I thought that grub2-mkconfig just generated a new grub.cfg file, but it almost seems like something else is going on here as well.
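One way to look beyond grub.cfg would be to checksum everything under /boot before and after running grub2-mkconfig and diff the two listings. This is only a sketch: the function name and the output paths are made up, and it only catches changed file contents under the directory given, not changes elsewhere on the system.

```shell
#!/bin/sh
# Hypothetical helper: print a SHA-256 checksum for every regular file
# under the directory "$1", sorted by path so two runs diff cleanly.
snapshot_tree() {
    ( cd "$1" && find . -type f -exec sha256sum {} + | sort -k 2 )
}

# Example use from the rescue/emergency shell:
#   snapshot_tree /boot > /root/boot.before
#   ... run grub2-mkconfig exactly as before ...
#   snapshot_tree /boot > /root/boot.after
#   diff /root/boot.before /root/boot.after
```

If the diff comes back empty, whatever makes the next boot succeed is presumably not a file change under /boot.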
Any ideas?
It's not that I can't get these servers back online, it's just that I would like to skip the reboot into rescue mode - as we are looking to automate this process as much as possible.
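For the automation side, the command steps above could be wrapped roughly as follows. This is a sketch under assumptions: the function names and the DRY_RUN switch are made up, the lvm.conf edit is left as a manual step, and dracut's --include is given both a source and a target path. Whether also running grub2-mkconfig here, before the reboot, would avoid the rescue-mode boot is untested.

```shell
#!/bin/sh
# Dry-run by default: set DRY_RUN=0 to really execute the commands.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical wrapper: print the command in dry-run mode, else run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Hypothetical sequence mirroring the manual steps in this post.
dr_fixup() {
    # Reset /etc/multipath/wwids to the current multipath devices.
    run multipath -W
    # (The filter/global_filter edit in /etc/lvm/lvm.conf goes here.)
    # Rebuild the initramfs with multipath support and its config.
    run dracut --force --add multipath --include /etc/multipath /etc/multipath
}

# Example use:
#   dr_fixup              # prints what would be run
#   DRY_RUN=0; dr_fixup   # executes for real, as root
```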
We have recovered these 4 nodes a couple of times - this process seems consistent. I just can't figure out what change grub2-mkconfig is making to the system to get it to boot!
Thanks in advance!