File System Corruption on IBM DS8300

Hi All,

We are facing file system corruption on a DS8300 and have put a lot of effort into finding the root cause, but so far without success. The system runs AIX 5.3 with the latest patches. We have upgraded the HBA firmware, the DS8300 firmware, the system firmware and the fabric switch firmware, and recently deployed brand new switches, but the problem still exists.

When the corruption occurs we have to bring down our live services, unmount the affected file system, repair it with the fsck utility and then restart the services, which results in a downtime of about 30-40 minutes. We have raised the problem with IBM every time it has occurred, but when they analyzed it (including the PE packages on the DS) they did not find any abnormality.

Has anyone seen this kind of file system corruption on a DS, or does anyone have any suggestions or ideas?
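For context, the recovery steps described above look roughly like this sketch (the mount point and logical volume names are only placeholders, not our real ones):

# stop the live services first, then:
fuser -kuc /data            # kill anything still holding files open on the mount point
umount /data                # unmount the affected filesystem
fsck -y /dev/fslv00         # let fsck repair the underlying JFS2 logical volume
mount /data                 # remount, then restart the services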

Regards

What data or application is writing or using the 'corrupt' data?
This could be an application problem.
Does the application use raw filesystems? (A quick check is sketched below.)
Get your application provider in the loop and get them talking to IBM.
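A quick way to answer the raw-filesystems question, as a sketch (the volume group name is just an example):

lsvg -o                     # volume groups currently varied on
lsvg -l datavg              # LVs used raw typically show N/A in the MOUNT POINT column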
HTH.

You will have to ask IBM to help you fix this.

There are a few things that I know of which you should have a look at.
If you are using the new 450 disks and space efficient FlashCopy, you must upgrade to the latest level.

There was also a problem where disks that are SAN boot (root) disks have to be recreated in some cases to recover. This can be done by remirroring and moving the disk, but the old disk must then be removed and recreated on the storage.
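A rough outline of that remirror-and-move for a SAN boot disk, assuming hdiskOLD is the affected LUN and hdiskNEW is a freshly created one (both names are placeholders):

extendvg rootvg hdiskNEW        # add the new LUN to rootvg
mirrorvg rootvg hdiskNEW        # mirror rootvg onto it and let it sync
bosboot -ad /dev/hdiskNEW       # make the new disk bootable
unmirrorvg rootvg hdiskOLD      # drop the copies on the old disk
reducevg rootvg hdiskOLD        # remove the old disk from rootvg
bootlist -m normal hdiskNEW     # boot from the new disk from now on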

The file system corruption has occurred sometimes on an application file system and sometimes on a database file system. We are using BEA WebLogic and an Oracle 10g database. Previously the corruption occurred with a time span of about two weeks; now, over the last two months, we have noted that the corruption again appears on the DS8300 after about two weeks. We do not have an HACMP environment, but we are using SDDPCM with MPIO for host attachment, and we make the VGs and filesystems available on one host at a time.

The error reported on the host was FILE SYSTEM CORRUPTION. When we look at the error report with errpt -a, it shows a file name, j2imap.c, and at the end the name of the affected file system is given. Yes, fsck always fixes it; while repairing the file system it displays messages like "superblock marked dirty" but then reports it as fixed.
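For anyone who wants to look at the same entries, the detail can be pulled out of the error log roughly like this (the error ID is a placeholder taken from the summary list):

errpt | more                     # one-line summary, most recent entries first
errpt -a -j <IDENTIFIER> | more  # full detail for one entry, incl. the failing routine (j2imap.c) and the filesystem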

This sounds like asking for trouble: if two machines somehow access the volumes concurrently, it will result in corrupted filesystems, probably even if only one of them actually writes to the disk. I remember once reading the HACMP scripts for taking over shared volumes from one cluster node to another (back in the days when disks were SCSI or SSA), and they were an absolute nightmare of low-level device manipulation to avoid exactly such problems.

Verify you really really always access the LUNs only from one system at a time.
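A quick sanity check is to compare, on every host that can see the LUNs, what is actually active at any given moment (just a sketch, nothing host-specific):

lsvg -o                          # volume groups currently varied on (active) on this host
lspv                             # every hdisk with its PVID and the VG it belongs to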

I hope this helps.

bakunin

we use the same configuration,

MPIO with SDDPCM, two VIO servers for the non-HACMP systems, two for the HACMP systems

p570 POWER6 (9117-MMA) used for this example

Oracle 10g with OCR and ASM, Oracle 10g on JFS2, many DB2 V9.1 on JFS2 with SAP, and a lot of Java applications (which are always candidates for damaging filesystems) - and no problems.

I can't tell you what's wrong on your systems, but I can tell you our settings:


vio-server:
:/home/padmin-->lsdev -Ccadapter | grep fc
fcs0    Available 02-00 4Gb FC PCI Express Adapter (df1000fe)
fcs1    Available 02-01 4Gb FC PCI Express Adapter (df1000fe)
fcs2    Available 03-00 4Gb FC PCI Express Adapter (df1000fe)
fcs3    Available 03-01 4Gb FC PCI Express Adapter (df1000fe)

:/home/padmin-->lsattr -El fcs0
bus_intr_lvl             Bus interrupt level                                False
bus_io_addr   0xff800    Bus I/O address                                    False
bus_mem_addr  0xffe7e000 Bus memory address                                 False
init_link     al         INIT Link flags                                    True
intr_msi_1    66085      Bus interrupt level                                False
intr_priority 3          Interrupt priority                                 False
lg_term_dma   0x800000   Long term DMA                                      True
max_xfer_size 0x200000   Maximum Transfer Size                              True
num_cmd_elems 1024       Maximum number of COMMANDS to queue to the adapter True
pref_alpa     0x1        Preferred AL_PA                                    True
sw_fc_class   2          FC Class for Fabric                                True

All adapters have the same settings; the pathing mode is load balance.
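If it helps for comparison, the adapter attributes can be checked across all adapters and changed in one go roughly like this (a sketch; the -P defers the change to the next reboot of the VIO server):

for a in fcs0 fcs1 fcs2 fcs3; do
    echo "== $a =="; lsattr -El $a -a num_cmd_elems -a max_xfer_size -a lg_term_dma
done
chdev -l fcs0 -a num_cmd_elems=1024 -a max_xfer_size=0x200000 -P   # repeat per adapter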

>pcmpath query device 37

DEV#:  37  DEVICE NAME: hdisk37  TYPE: 2107900  ALGORITHM:  Load Balance
SERIAL: 75DP8911018
===========================================================================
Path#      Adapter/Path Name          State     Mode     Select     Errors
    0           fscsi0/path0           OPEN   NORMAL   25009976         10
    1           fscsi0/path1           OPEN   NORMAL   25033949         10
    2           fscsi1/path2           OPEN   NORMAL   25019659          8
    3           fscsi1/path3           OPEN   NORMAL   25029155          9
    4           fscsi2/path4           OPEN   NORMAL   25018403          8
    5           fscsi2/path5           OPEN   NORMAL   25031846          8
    6           fscsi3/path6           OPEN   NORMAL   25034755          9
    7           fscsi3/path7           OPEN   NORMAL   25022454          9

/home/padmin-->lslpp -l | grep -i sddpc
  devices.sddpcm.53.rte      2.2.0.0  COMMITTED  IBM SDD PCM for AIX V53
  devices.sddpcm.53.rte      2.2.0.0  COMMITTED  IBM SDD PCM for AIX V53

/home/padmin-->oslevel -s
5300-08-01-0819

/home/padmin-->ioslevel
1.5.2.1-FP-11.1



>lscfg -vpl fcs0
  fcs0             U789D.001.DQD21A1-P1-C2-T1  4Gb FC PCI Express Adapter (df1000fe)

        Part Number.................10N7255
        Serial Number...............xxx
        Manufacturer................001F
        EC Level....................A
        Customer Card ID Number.....xxx
        FRU Number.................. 10N7255
        Device Specific.(ZM)........3
        Network Address.............xxx
        ROS Level and ID............02E82752
        Device Specific.(Z0)........2057706D
        Device Specific.(Z1)........00000000
        Device Specific.(Z2)........00000000
        Device Specific.(Z3)........03000909
        Device Specific.(Z4)........FFE01212
        Device Specific.(Z5)........02E82752
        Device Specific.(Z6)........06E12715
        Device Specific.(Z7)........07E12752
        Device Specific.(Z8)........xxx
        Device Specific.(Z9)........ZS2.71A2
        Device Specific.(ZA)........Z1F2.70A5
        Device Specific.(ZB)........Z2F2.71A2
        Device Specific.(ZC)........00000000
        Hardware Location Code......U789D.001.DQD21A1-P1-C2-T1


  PLATFORM SPECIFIC

  Name:  fibre-channel
    Model:  10N7255
    Node:  fibre-channel@0
    Device Type:  fcp
    Physical Location: U789D.001.DQD21A1-P1-C2-T1




lsmcode -d fcs0:

Microcode: 

DISPLAY MICROCODE LEVEL                                                   802110
fcs0    4Gb FC PCI Express Adapter (df1000fe)





sample lpar: 

ussap103:/-->lspath
Enabled hdisk0  vscsi2
Enabled hdisk57 vscsi1
Enabled hdisk2  vscsi2
Enabled hdisk58 vscsi1
Enabled hdisk60 vscsi1
Enabled hdisk55 vscsi1
Enabled hdisk61 vscsi1
Enabled hdisk62 vscsi1
Enabled hdisk56 vscsi1
Enabled hdisk63 vscsi1
Enabled hdisk64 vscsi1
Enabled hdisk59 vscsi1
Enabled hdisk2  vscsi0
Enabled hdisk0  vscsi0
Enabled hdisk55 vscsi3
Enabled hdisk56 vscsi3
Enabled hdisk57 vscsi3
Enabled hdisk58 vscsi3
Enabled hdisk59 vscsi3
Enabled hdisk60 vscsi3
Enabled hdisk61 vscsi3
Enabled hdisk62 vscsi3
Enabled hdisk63 vscsi3
Enabled hdisk64 vscsi3


vscsi0 root disks from vio1
vscsi2 root disks from vio2
vscsi1 data disks from vio1
vscsi3 data disks from vio2


:/-->oslevel -s
5300-07-01-0748

if you need more information, feel free to ask ^^

I would run

filemon -O lf,lv,pv

to trace read/write activity and the files being accessed.

The trace file will be very big!
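A minimal way to run it for a bounded interval (the output path and duration are just examples):

filemon -o /tmp/fmon.out -O lf,lv,pv    # start tracing logical files, LVs and PVs
sleep 300                               # cover a busy period
trcstop                                 # stop tracing; filemon writes its report to /tmp/fmon.out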