Hi All,
We are facing recurring file system corruption on a DS8300. We have put a lot of effort into finding the root cause, but so far without success. The system runs AIX 5.3 with the latest patches. We have upgraded the HBA firmware, the DS8300 firmware, the system firmware and the fabric switch firmware, and we recently deployed brand-new switches, but the problem still exists.
When the problem occurs we have to take our live services down, unmount the affected file system, repair it with the fsck utility and then restart the services, which results in about 30-40 minutes of downtime. We have raised the problem with IBM every time it occurred; they analyzed the PE packages from the DS but did not find any abnormality.
Has anyone else seen this kind of file system corruption on a DS, or does anyone have a suggestion or idea?
Regards
Which application is writing or using the 'corrupt' data?
This could be an application problem.
Does the application use raw filesystems?
Get your application provider in the loop and get them talking to IBM.
HTH.
You will have to ask IBM to help you fix this.
There are a few things I know of that you should have a look at.
If you are using the new 450 disks together with space-efficient FlashCopy, you must upgrade to the latest level.
There was a problem with this combination. Also, disks that are SAN boot root disks will in some cases have to be recreated to recover from the problem. This can be done by remirroring and moving off the disk, but the old disk must then be removed and recreated on the storage.
The corruption has occurred sometimes on the application file system and sometimes on the database file system. We are using BEA WebLogic and an Oracle 10g database. Previously the corruption occurred at intervals of about two weeks, and over the last two months we have again seen it appear on the DS8300 after about two weeks. We do not have an HACMP environment, but we use SDDPCM with MPIO to attach the hosts, and we make the VGs and file systems available on one host at a time.
The error reported on the host was FILE SYSTEM CORRUPTION. In the error report from errpt -a it shows a file name, j2imap.c, and at the end the name of the affected file system. Yes, fsck always fixes it. While fsck is repairing the file system it displays messages along the lines of "super block marked dirty" and then reports the problem as fixed.
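To line the corruption events up against other activity (backups, path failovers, a second host touching the volumes), it may help to list when each one was logged. A minimal sketch, assuming the detailed report has been saved to a file and that each entry carries a `Date/Time:` line ahead of its description text; the function name is made up for illustration.

```shell
# Print the timestamp of every error-report entry whose text contains
# "FILE SYSTEM CORRUPTION".
# $1: file holding saved `errpt -a` output.
# Assumption: each entry has a "Date/Time:" line before its description.
corruption_times() {
  awk '
    /^Date\/Time:/ { sub(/^Date\/Time:[ \t]*/, ""); when = $0 }
    /FILE SYSTEM CORRUPTION/ { print when }
  ' "$1"
}
```

Typical use would be something like `errpt -a > /tmp/errpt.out` followed by `corruption_times /tmp/errpt.out`.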
This sounds like asking for trouble: if somehow two machines concurrently access the volumes it will result in corrupted filesystems, probably even if not both systems actually write to the disk. I remember reading the HACMP scripts for taking over the shared volumes from one cluster node to another once (back in the days when disks were SCSI or SSA) and they were an absolute nightmare of low-level device manipulation to avoid such problems.
Verify you really really always access the LUNs only from one system at a time.
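One rough way to check this is to capture the list of varied-on volume groups on each host at the same moment and compare the two lists. A minimal sketch, assuming the output of `lsvg -o` from each host has been saved to a file; the function name and file names are hypothetical, and rootvg is skipped because every host has its own.

```shell
# Print volume groups that are active on both hosts at the same time.
# $1, $2: files holding `lsvg -o` output (one VG name per line)
#         captured on host A and host B.
shared_active_vgs() {
  awk 'NR == FNR { if (NF) seen[$1] = 1; next }
       NF && ($1 in seen) && $1 != "rootvg" { print $1 }' "$1" "$2"
}
```

Any name printed here would be a volume group that both hosts had varied on at capture time. An empty result only says something about that one moment, so it would need to be repeated around the times the corruption shows up.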
I hope this helps.
bakunin
We use the same configuration:
MPIO with SDDPCM, two VIO servers for the non-HACMP systems and two for the HACMP systems,
on a p570 POWER6 (9117-MMA) in this example.
We run Oracle 10g with OCR and ASM, Oracle 10g on JFS2, many DB2 v9.1 instances on JFS2 with SAP, and a lot of Java applications, which are always candidates for damaging file systems, and we have no problems.
I can't tell you what's wrong on your systems, but I can tell you our settings:
vio-server:
:/home/padmin-->lsdev -Ccadapter | grep fc
fcs0 Available 02-00 4Gb FC PCI Express Adapter (df1000fe)
fcs1 Available 02-01 4Gb FC PCI Express Adapter (df1000fe)
fcs2 Available 03-00 4Gb FC PCI Express Adapter (df1000fe)
fcs3 Available 03-01 4Gb FC PCI Express Adapter (df1000fe)
:/home/padmin-->lsattr -El fcs0
bus_intr_lvl Bus interrupt level False
bus_io_addr 0xff800 Bus I/O address False
bus_mem_addr 0xffe7e000 Bus memory address False
init_link al INIT Link flags True
intr_msi_1 66085 Bus interrupt level False
intr_priority 3 Interrupt priority False
lg_term_dma 0x800000 Long term DMA True
max_xfer_size 0x200000 Maximum Transfer Size True
num_cmd_elems 1024 Maximum number of COMMANDS to queue to the adapter True
pref_alpa 0x1 Preferred AL_PA True
sw_fc_class 2 FC Class for Fabric True
All adapters have the same settings; the path-selection algorithm is load balance.
>pcmpath query device 37
DEV#: 37 DEVICE NAME: hdisk37 TYPE: 2107900 ALGORITHM: Load Balance
SERIAL: 75DP8911018
===========================================================================
Path# Adapter/Path Name State Mode Select Errors
0 fscsi0/path0 OPEN NORMAL 25009976 10
1 fscsi0/path1 OPEN NORMAL 25033949 10
2 fscsi1/path2 OPEN NORMAL 25019659 8
3 fscsi1/path3 OPEN NORMAL 25029155 9
4 fscsi2/path4 OPEN NORMAL 25018403 8
5 fscsi2/path5 OPEN NORMAL 25031846 8
6 fscsi3/path6 OPEN NORMAL 25034755 9
7 fscsi3/path7 OPEN NORMAL 25022454 9
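For comparing a listing like this against another system, a small filter can pick out the paths worth a second look. A minimal sketch, assuming the `pcmpath query device` output has been saved to a file in the column layout shown above; the function name and the error threshold are hypothetical choices, not SDDPCM features.

```shell
# Print every path that is not OPEN/NORMAL or whose error count exceeds a
# threshold.
# $1: file holding saved `pcmpath query device` output.
# $2: maximum tolerated error count per path (hypothetical default: 0).
suspect_paths() {
  awk -v max="${2:-0}" '
    $2 ~ /^fscsi[0-9]+\/path[0-9]+$/ &&
    ($3 != "OPEN" || $4 != "NORMAL" || $6 + 0 > max + 0) {
      print $2, $3, $4, "errors=" $6
    }' "$1"
}
```

With the default threshold of 0, every path in the listing above would be printed, since each shows 8 to 10 errors against roughly 25 million selects. What matters is more likely a path whose count is far out of line with the others, or one that is not OPEN/NORMAL.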
/home/padmin-->lslpp -l | grep -i sddpc
devices.sddpcm.53.rte 2.2.0.0 COMMITTED IBM SDD PCM for AIX V53
devices.sddpcm.53.rte 2.2.0.0 COMMITTED IBM SDD PCM for AIX V53
/home/padmin-->oslevel -s
5300-08-01-0819
/home/padmin-->ioslevel
1.5.2.1-FP-11.1
>lscfg -vpl fcs0
fcs0 U789D.001.DQD21A1-P1-C2-T1 4Gb FC PCI Express Adapter (df1000fe)
Part Number.................10N7255
Serial Number...............xxx
Manufacturer................001F
EC Level....................A
Customer Card ID Number.....xxx
FRU Number.................. 10N7255
Device Specific.(ZM)........3
Network Address.............xxx
ROS Level and ID............02E82752
Device Specific.(Z0)........2057706D
Device Specific.(Z1)........00000000
Device Specific.(Z2)........00000000
Device Specific.(Z3)........03000909
Device Specific.(Z4)........FFE01212
Device Specific.(Z5)........02E82752
Device Specific.(Z6)........06E12715
Device Specific.(Z7)........07E12752
Device Specific.(Z8)........xxx
Device Specific.(Z9)........ZS2.71A2
Device Specific.(ZA)........Z1F2.70A5
Device Specific.(ZB)........Z2F2.71A2
Device Specific.(ZC)........00000000
Hardware Location Code......U789D.001.DQD21A1-P1-C2-T1
PLATFORM SPECIFIC
Name: fibre-channel
Model: 10N7255
Node: fibre-channel@0
Device Type: fcp
Physical Location: U789D.001.DQD21A1-P1-C2-T1
lsmcode -d fcs0:
Microcode:
DISPLAY MICROCODE LEVEL 802110
fcs0 4Gb FC PCI Express Adapter (df1000fe)
sample lpar:
ussap103:/-->lspath
Enabled hdisk0 vscsi2
Enabled hdisk57 vscsi1
Enabled hdisk2 vscsi2
Enabled hdisk58 vscsi1
Enabled hdisk60 vscsi1
Enabled hdisk55 vscsi1
Enabled hdisk61 vscsi1
Enabled hdisk62 vscsi1
Enabled hdisk56 vscsi1
Enabled hdisk63 vscsi1
Enabled hdisk64 vscsi1
Enabled hdisk59 vscsi1
Enabled hdisk2 vscsi0
Enabled hdisk0 vscsi0
Enabled hdisk55 vscsi3
Enabled hdisk56 vscsi3
Enabled hdisk57 vscsi3
Enabled hdisk58 vscsi3
Enabled hdisk59 vscsi3
Enabled hdisk60 vscsi3
Enabled hdisk61 vscsi3
Enabled hdisk62 vscsi3
Enabled hdisk63 vscsi3
Enabled hdisk64 vscsi3
vscsi0 root disks from vio1
vscsi2 root disks from vio2
vscsi1 data disks from vio1
vscsi3 data disks from vio2
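With this layout every hdisk should show two Enabled paths, one through each VIO server. A minimal sketch for checking that from saved `lspath` output, assuming the three-column layout shown above (status, device, parent); the function name is made up for illustration.

```shell
# Print each hdisk whose number of Enabled paths differs from the expected
# count, together with the number found.
# $1: file holding saved `lspath` output (status, device, parent).
# $2: expected number of Enabled paths per disk (default: 2).
disks_missing_paths() {
  awk -v want="${2:-2}" '
    $2 ~ /^hdisk/ { total[$2]++; if ($1 == "Enabled") ok[$2]++ }
    END { for (d in total) if (ok[d] + 0 != want + 0) print d, ok[d] + 0 }
  ' "$1" | sort
}
```

On the listing above this would print nothing, since every disk has two Enabled paths.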
:/-->oslevel -s
5300-07-01-0748
if you need more information, feel free to ask ^^
I would run
filemon -O lf,lv,pv
to trace reads/writes and the files accessed.
Be warned: the trace file will be very big!
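Because of the size, it probably makes sense to sample a short window rather than leave the trace running. A minimal sketch, assuming filemon is started with `-o` to name its report file and stopped with `trcstop`; the function name, the default report path and the default duration are hypothetical.

```shell
# Start filemon, let the workload run for a while, then stop the trace.
# $1: report file (hypothetical default: /tmp/filemon.out)
# $2: seconds to sample (hypothetical default: 60)
run_filemon() {
  out=${1:-/tmp/filemon.out}
  secs=${2:-60}
  if ! command -v filemon >/dev/null 2>&1; then
    echo "filemon not found (AIX only)" >&2
    return 1
  fi
  filemon -o "$out" -O lf,lv,pv   # starts tracing in the background
  sleep "$secs"
  trcstop                         # stops the trace; filemon then writes its report
}
```

Something like `run_filemon /tmp/fmon.out 30` during a busy period would keep the report to a manageable size.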