Ext3 filesystems on SAN become read-only

Hi all,
Today I faced a situation where two of my ext3 filesystems on a SAN became read-only. This has happened three times today.

The usual way to get these filesystems back to read-write was to reboot the server.

But now this constant switching to ro mode instead of rw mode crashes my database.

I have read the messages log on the server, and a warning about an EXT3-fs error appears. My first guess was a damaged hard disk, but after reviewing my SAN, everything is OK there.

Does anybody have a clue about this?

Thank you in advance.

We might have a better chance of making a useful guess at what is wrong if you show us the warning messages you found on the server.

Hi,
Thanks for your reply.
Yes, here is part of my messages log.

Aug 11 10:02:16 FriscoDB kernel: end_request: I/O error, dev sdc, sector 350662791
Aug 11 10:02:16 FriscoDB kernel: SCSI error : <2 0 1 0> return code = 0x10000
Aug 11 10:02:16 FriscoDB kernel: end_request: I/O error, dev sdc, sector 527806119
Aug 11 10:02:16 FriscoDB kernel: SCSI error : <2 0 1 0> return code = 0x10000
Aug 11 10:02:16 FriscoDB kernel: end_request: I/O error, dev sdc, sector 527806127
Aug 11 10:02:16 FriscoDB kernel: Aborting journal on device dm-9.
Aug 11 10:02:16 FriscoDB kernel: EXT3-fs error (device dm-9) in ext3_ordered_writepage: IO failure
Aug 11 10:02:16 FriscoDB kernel: SCSI error : <2 0 1 0> return code = 0x10000
Aug 11 10:02:16 FriscoDB kernel: end_request: I/O error, dev sdc, sector 383382207
Aug 11 10:02:16 FriscoDB kernel: SCSI error : <2 0 1 0> return code = 0x10000
Aug 11 10:02:16 FriscoDB kernel: end_request: I/O error, dev sdc, sector 383382215
Aug 11 10:02:16 FriscoDB kernel: SCSI error : <2 0 1 0> return code = 0x10000
Aug 11 10:02:16 FriscoDB kernel: end_request: I/O error, dev sdc, sector 383382223
Aug 11 10:02:16 FriscoDB kernel: SCSI error : <2 0 1 0> return code = 0x10000
Aug 11 10:02:16 FriscoDB kernel: SCSI error : <2 0 1 0> return code = 0x10000
Aug 11 10:02:16 FriscoDB kernel: end_request: I/O error, dev sdc, sector 383382231
Aug 11 10:02:16 FriscoDB kernel: end_request: I/O error, dev sdc, sector 474560831
Aug 11 10:02:16 FriscoDB kernel: SCSI error : <2 0 1 0> return code = 0x10000
Aug 11 10:02:16 FriscoDB kernel: end_request: I/O error, dev sdc, sector 383382239
Aug 11 10:02:16 FriscoDB kernel: ext3_abort called.
Aug 11 10:02:16 FriscoDB kernel: EXT3-fs error (device dm-9): ext3_journal_start_sb: Detected aborted journal
Aug 11 10:02:16 FriscoDB kernel: Remounting filesystem read-only
Aug 11 10:02:16 FriscoDB kernel: SCSI error : <2 0 1 0> return code = 0x10000

Apparently there is a defective physical hard disk, but on my SAN I can't see anything wrong.
After that, the filesystems go read-only and the database crashes.

What is your advice?
Thank you in advance.

A SAN presents LUNs, which are virtual disks, so I doubt it is a disk issue; otherwise the SAN disk subsystem should be complaining. But you also have controllers, fibre cables, etc.
And there is some configuration involved, though the commands are OS/equipment dependent. I'm thinking of queue depth, or a journal log that is too small.
There are plenty of reasons why writes fail apart from disk failure.

What's the output from

multipath -l
pvs
lvdisplay
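It can also help to summarise which block devices the kernel is reporting I/O errors for. A minimal sketch, assuming the log lives at /var/log/messages (adjust the path for your distribution):

```shell
# Count kernel I/O errors per block device from the messages log.
# The log path is an assumption; adjust for your distribution.
LOG=/var/log/messages
grep 'end_request: I/O error' "$LOG" \
    | sed 's/.*dev \([a-z]*\),.*/\1/' \
    | sort | uniq -c | sort -rn
```

If all the errors cluster on one device (sdc in your log), that points at one path or one LUN rather than a filesystem problem.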

Hi all,
Thank you for your opinions.

The fact is that, since this is my production database and the server is off-site, I can't get access to the hardware surrounding the SAN, fibre, etc.

When the system is starting, we see a double path for the SAN LUN.

The weird thing is: when I started to use the "default" path, the disks began to fail after 15 minutes and the failure messages started to appear. Then the filesystems switched to read-only in order to protect the information.

Under these conditions, we commented out the secondary path in the mount points file, in order to see just one. After doing this, the error messages went away.

So, when I run the requested command multipath -l, it appears the command is unavailable.

Is this possible?
I mean, working over a SAN without multipath enabled?
What are these two paths that I see in the boot process?

As far as I know about the hardware configuration, the server has one HBA card with two fibre links distributed over two switches connected to the SAN.
Is it possible for one fibre link to have a failure?
If so, an alert should be thrown to the console, right?

Thank you all.

"multipath" is likely in /sbin.

Hi Mig28mx,

if you can confirm that the issue is present when the second path is active, your problem is probably the compatibility of Linux native multipath with the storage system.

For example, if your storage array is high-end (EMC VMAX, HDS VSP, DS8000), the multipath daemon is normally fully compatible with default settings.

However, if your storage is midrange (EMC VNX, HDS HUS or AMS), you need to write special settings in multipath.conf. You should review the notes about these settings in the storage array's user guide.
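For illustration only, a device section for a midrange array in /etc/multipath.conf often looks something like the sketch below. The values here are placeholders for an EMC CLARiiON/VNX-style array, not the correct ones for your hardware; take the real settings from your vendor's guide.

```
# /etc/multipath.conf -- illustrative fragment only; vendor, product and
# policy values must come from your storage array's user guide.
devices {
    device {
        vendor               "DGC"          # EMC CLARiiON/VNX vendor string
        product              "*"
        path_grouping_policy group_by_prio
        path_checker         emc_clariion   # array-specific health check
        failback             immediate
        no_path_retry        30             # queue I/O briefly on path loss
    }
}
```

A sensible no_path_retry (or queue_if_no_path) matters here: without it, a short path outage can surface as an I/O error to ext3, which then aborts the journal and remounts read-only, exactly as in your log.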

If the issue continues, I think you need to review the SAN switches to see if the error counters on the fibre ports are growing.

If you can share the model of the switches and storage with me, I can be more precise.