Fsck -n on mounted FS - how unreliable ?

greetings all, we had SAN bobbles recently and so I ran fsck against all local FS on all systems.

our domino servers present a moving set of alleged corruption, both JFS and JFS2 FS.

Am I playing whack-a-mole with special effects (i.e. false positives) produced by vigorous filesystem activity while I'm trying to scan it ? Or is this potentially real enough so it's worth taking outages to unmount FS, etc. ?

TIA

It never makes sense to fsck a mounted filesystem, especially a journalled one. The 'moving corruption' may be a race condition between the journal and the disk contents.

If you are truly concerned, scan them offline -- or at least remount them read-only.
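
Before trusting any check results, one could first confirm that the filesystem really is mounted read-only. A minimal sketch in Python, assuming a Unix-like system where `os.statvfs` reports mount flags; the function name `is_mounted_read_only` is made up for illustration and is not part of any fsck tooling:

```python
import os

def is_mounted_read_only(path):
    """Return True if the filesystem containing `path` is mounted read-only."""
    # f_flag holds the mount flags; ST_RDONLY is set for read-only mounts.
    flags = os.statvfs(path).f_flag
    return bool(flags & os.ST_RDONLY)

# "/" is only a placeholder; substitute the mount point you intend to check.
state = "read-only" if is_mounted_read_only("/") else "read-write"
print(f"/ is mounted {state}")
```

This only reports the mount state; it does not remount or check anything itself.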

Agreed.

Running fsck on a mounted filesystem is a worthless task, unless you're lucky and it shows no errors, which is something I've never seen in the few times I've witnessed an fsck of a mounted file system.

That's because an indication of corruption isn't meaningful at all for a mounted file system.

If you're worried about corruption, you umount the filesystem and check it.

And be sure you have backed up those FS just in case: having corrupted data is better than having none...
Although it's been years since I last faced FS issues, I still remember the consequences...

thanks all for your replies. we took an outage, unmounted the filesys and sure enough, no errors. this was not a foreign concept to me, but last time we had a SAN prob I ran fsck on mounted FS and it turned out they did in fact have corruption when I unmounted them to fix them. Didn't want to chance it.

On that note, are any of you aware of a FS checking tool that works through the buffer cache so that the race with the disk contents on a mounted FS isn't there ? I'll unmount if I find I need to fix them, but to check them it would be nice to be able to do it live.

That's rather contradictory, like asking for a bluer orange. If it were possible, it would end up being an even bigger hassle and risk than just unmounting it -- imagine losing 30 minutes of live changes on a corrupted and unrecoverable disk once you've realized "whoops, I really shouldn't have been writing to that". Mounting a bad filesystem read-write may have problems other than corruption -- it could work fine, for example, with one important file missing that you don't realize until later, after it's long past recoverable.

The closest you can get to what you want is mounting it read-only. Actually fixing it while it's mounted, read-only or not, is of course a recipe for a kernel panic.

I suppose it might also be possible at the disk level. Remove one mirror or something and scan it. You'll want to have it unmounted, or mounted read-only, while you do so. Again, though, you don't want to be writing changes to a disk that might be bad; you could lose your current changes.

Or some sort of scratch-disk union mount, so you're saving new changes in a temporary space until you've verified the disk is okay. Merging the two partitions would be difficult though.
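
To make the scratch-disk idea concrete, here is a toy sketch in Python. It is only an in-memory model, assuming a dict stands in for each "disk"; the class `ScratchOverlay` and the file names are hypothetical. Reads fall through to the base, writes and deletes land only in the scratch area, and nothing touches the base until `merge` is called.

```python
class ScratchOverlay:
    """Reads fall through to `base`; writes and deletes land only in `scratch`."""

    _DELETED = object()  # marker for paths removed while the overlay is active

    def __init__(self, base):
        self.base = base      # never modified until merge() is called
        self.scratch = {}     # pending changes

    def read(self, path):
        if path in self.scratch:
            value = self.scratch[path]
            if value is self._DELETED:
                raise KeyError(path)
            return value
        return self.base[path]

    def write(self, path, data):
        self.scratch[path] = data

    def delete(self, path):
        self.scratch[path] = self._DELETED

    def merge(self):
        """Fold pending changes into the base once it has been verified."""
        for path, value in self.scratch.items():
            if value is self._DELETED:
                self.base.pop(path, None)
            else:
                self.base[path] = value
        self.scratch.clear()

base = {"mail.box": "old", "names.nsf": "v1"}
fs = ScratchOverlay(base)
fs.write("mail.box", "new")
fs.delete("names.nsf")
untouched = dict(base)      # base still holds the original contents here
fs.merge()                  # only after the base has been checked
print(untouched, base)
```

The merge here is trivial because whole files are replaced by name. On a real filesystem the merge has to reconcile block allocations and metadata, which is the difficult part.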

I can't think of anything that's faster and less trouble than just doing the job properly in the first place. You can't re-bore your engine while it's running.

"You can't re-bore your engine while it's running."

Classic!

We should have a hall of fame for comments like that!

;0)

Well, you can try, but you'll probably have pieces-parts all over... ;)

And it's not just the cache that's the problem with a mounted file system - it's the fact that things are changing. Every file system check I'm aware of assumes the entire file system is static, and it pretty much has to be to ensure consistency. If some consistency is required between two or more items in the file system and things are changing, there's no way to know whether an inconsistency is there because the file system is corrupt or because it has just been changed. The fsck process can't look at the entire file system at one moment in time - it can only look at the different parts of the file system in some sequence.
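
The sequencing problem can be shown with a toy model. The sketch below is Python and purely illustrative: the "filesystem" is a dict with an allocation bitmap and an inode table, and `check`, `append_block` and the `between_passes` hook are hypothetical names, not how any real fsck is structured. The checker reads the bitmap first and the inodes second; a perfectly valid write that lands between the two passes is reported as corruption.

```python
def check(fs, between_passes=None):
    """Two-pass consistency check of a toy filesystem.

    Pass 1 copies the allocation bitmap, pass 2 walks the inodes.
    `between_passes` stands in for whatever other processes do to the
    filesystem while the scan is still running.
    """
    seen_allocated = set(fs["bitmap"])      # pass 1: bitmap as of "now"
    if between_passes:
        between_passes(fs)                  # live activity during the scan
    errors = []
    for name, blocks in sorted(fs["inodes"].items()):   # pass 2: inodes
        for block in blocks:
            if block not in seen_allocated:
                errors.append(f"{name}: block {block} in use but marked free")
    return errors

def new_fs():
    return {"bitmap": {1, 2, 3}, "inodes": {"a.nsf": [1, 2], "b.nsf": [3]}}

def append_block(fs):
    """A valid write: allocate block 4 and attach it to a.nsf."""
    fs["bitmap"].add(4)
    fs["inodes"]["a.nsf"].append(4)

quiet = check(new_fs())                              # nothing else running
busy = check(new_fs(), between_passes=append_block)  # write during the scan
print(quiet)  # []
print(busy)   # ['a.nsf: block 4 in use but marked free']
```

The filesystem is consistent before and after the write; only the checker's view, stitched together from two different moments, is not.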

I get that I don't want to modify the FS live. However, a really simplistic construction is that the buffer cache contents on top of the disk contents represent the consistent (correct) state of the FS. There is a delay before that consistent state is written through to the disk, depending on when the write-through happens. I said "simplistic" and disclaim right away that I don't know the slightest thing about how that picture changes with journaling.

So again, I don't want to modify it live, but why is it impossible to -check- it live, by looking at the buffer cache, and through to the corresponding on-disk data, and noting (probably with a fair amount of complexity) which inconsistencies -might- be corruption and which ones are unequivocal signs of it? I anticipate immediately that the hard part would be telling the difference between an inconsistency that's the result of write-through lag and one where, no matter how hard the OS tries, the on-disk data can't be made to match what's in the buffer cache because of I/O errors.

Glad to know the conclusion that it's not possible; I'm just curious about the details of why.

But you are modifying it live. It's like trying to read a dictionary while someone else is constantly changing the order of the words -- possibly tearing out pages while they're at it, if they're writing to a corrupt filesystem. If you at least made it read-only, stopped moving things around, it'd be possible to check it.

The kernel makes an awful lot of assumptions about the state of the filesystem. It's the kernel's job, along with the journal, to keep it that way, but if something else unintentionally alters the filesystem tree -- power outage, disk failure, controller glitch, whatever -- these assumptions may be violated, and using it can cause bad things to happen. For example, an index could point to the wrong disk cluster, causing data to be overwritten or two files to unintentionally share contents, or a file could simply not be referenced anywhere and disappear from disk.

fsck makes as few assumptions as possible, but does assume the filesystem isn't changing. That's the tradeoff it makes.
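
For a static filesystem, the kinds of damage mentioned above are straightforward to look for. A toy sketch in Python, assuming a made-up layout where inodes map to block lists and a single directory maps names to inodes; `find_damage` is a hypothetical helper, not a real fsck routine:

```python
from collections import defaultdict

def find_damage(inodes, directory):
    """Report cross-linked blocks and orphaned inodes in a toy filesystem.

    inodes:    inode number -> list of block numbers
    directory: file name    -> inode number
    """
    owners = defaultdict(list)
    for ino, blocks in inodes.items():
        for block in blocks:
            owners[block].append(ino)
    # A block claimed by more than one inode means two files share contents.
    cross_linked = {b: sorted(o) for b, o in owners.items() if len(o) > 1}
    # An inode no directory entry points to is a file that has "disappeared".
    orphans = sorted(set(inodes) - set(directory.values()))
    return cross_linked, orphans

inodes = {10: [1, 2], 11: [2, 3], 12: [4]}
directory = {"a.nsf": 10, "b.nsf": 11}
cross_linked, orphans = find_damage(inodes, directory)
print(cross_linked)  # {2: [10, 11]}
print(orphans)       # [12]
```

Both checks compare two structures against each other, which is why they only mean something if neither structure changes during the scan.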
