Memory error causing reboot

hcclnoodles · November 8, 2006, 6:08am

Hi there

I have a box that started receiving soft errors on a DIMM at 4pm. Normally this is OK and we have time to swap the DIMM out, but this time I got the following error, which caused the box to reboot.

NOTE: there were about 6 or 7 normal "soft error encountered" messages before this one.

Nov  7 16:02:37 my.box SUNW,UltraSPARC-IIIi: [ID 667005 kern.info] [AFT3] errID 0x00212a5e.7a11dcbb Above Error is in User Mode
Nov  7 16:02:37 my.box     and is fatal: will reboot
Nov  7 16:02:37 my.box SUNW,UltraSPARC-IIIi: [ID 936573 kern.info] NOTICE: [AFT0] Corrected memory (FRC) Event detected by CPU1 at TL=0, errID 0x00212a5e.7a11ddb5
Nov  7 16:02:37 my.box     AFSR 0x00100002<PRIV,CE>.18000027<FRC,FRU> AFAR 0x00000012.0a625570 INVALID
Nov  7 16:02:37 my.box     Fault_PC 0x100350b0 Esynd 0x0027 INVALID J_AID 0 INVALID
Nov  7 16:02:37 my.box SUNW,UltraSPARC-IIIi: [ID 337726 kern.info] NOTICE: [AFT0] Corrected memory (CE) Event detected by CPU1 at TL=0, errID 0x00212a5e.7a11ddb5
Nov  7 16:02:37 my.box     AFSR 0x00100002<PRIV,CE>.18000027<FRC,FRU> AFAR 0x00000012.0a625570
Nov  7 16:02:37 my.box     Fault_PC 0x100350b0 Esynd 0x0027 INVALID
Nov  7 16:02:37 my.box SUNW,UltraSPARC-IIIi: [ID 568294 kern.info] NOTICE: [AFT0] Corrected remote memory/cache (RCE) Event detected by CPU0 at TL=0, errID 0x00212a5e.7a11dcbb
Nov  7 16:02:37 my.box     AFSR 0x00000001<RUE>.81000000<RCE> AFAR 0x00000011.0a0fffe0 INVALID
Nov  7 16:02:37 my.box     Fault_PC 0xffffffff7dc04884 J_REQ 1 INVALID
Nov  7 16:02:37 my.box unix: [ID 855177 kern.warning] WARNING: [AFT1] initiating reboot due to above error in pid 7744 (apas_OaLgw)
Nov  7 16:02:38 my.box SUNW,UltraSPARC-IIIi: [ID 845842 kern.info] NOTICE: [AFT0] Corrected memory (FRC) Event detected by CPU1 at TL=0, errID 0x00212a5e.7a6b8682

My question is really this: is it possible that when a process tries to access the specific bad area of memory on the DIMM, it can cause the box to reboot? In the above example, it's only when the process (apas_OaLgw) gets involved that anything happens.

Any help would be greatly appreciated.

system · November 8, 2006, 7:09am

Yes, it is possible.

From SunSolve:
"On EDP, LDP, CP, UE, BERR, and TO events the system will panic if the address is in kernel space or if the error occurs while the CPU is at a trap level greater than zero. Otherwise, if the affected address is in use by a process, the process will be killed immediately (sent SIGKILL) and the system will be rebooted (as if a privileged user had entered "init 6")."
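
To make the rule in that quote easier to follow, here is a rough sketch of the decision it describes, written in Python purely for illustration. It is not Solaris code: the `ErrorEvent` class, its field names and the `expected_action` function are all made up for this example, and the quote does not say what happens when the address is not in use by any process, so the sketch reports that case as unspecified.

```python
from dataclasses import dataclass

# Event classes listed in the SunSolve quote above.
QUOTED_EVENTS = {"EDP", "LDP", "CP", "UE", "BERR", "TO"}


@dataclass
class ErrorEvent:
    """Hypothetical record of one error event (names are illustrative only)."""
    kind: str                 # e.g. "UE"
    in_kernel_space: bool     # affected address belongs to the kernel
    trap_level: int           # the TL= value reported in the log line
    in_use_by_process: bool   # affected address is mapped by a user process


def expected_action(ev: ErrorEvent) -> str:
    """Return the outcome the quoted text describes for this event."""
    if ev.kind not in QUOTED_EVENTS:
        # The quote only covers the event classes listed above.
        return "unspecified"
    if ev.in_kernel_space or ev.trap_level > 0:
        return "panic"
    if ev.in_use_by_process:
        # Process gets SIGKILL, then the system reboots as if by "init 6".
        return "kill process and reboot"
    return "unspecified"


if __name__ == "__main__":
    # Illustrative case: an uncorrectable error at TL=0 in a user process.
    print(expected_action(ErrorEvent("UE", False, 0, True)))
```

Under that reading, an uncorrectable error hit by a user process at TL=0 ends in a reboot rather than a panic, which matches the "initiating reboot due to above error in pid 7744" line in the log above.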