Error is “sticky” on board4 J3400.

I get the following messages in the logs:
/var/adm/messages:Aug 3 04:35:33 mhs-apps33-d unix: [ID 220797 kern.warning] WARNING: [AFT0] Sticky Softerr encountered on Memory Module Board 4 J3400

/var/adm/messages:Aug 3 04:35:33 mhs-apps33-d SUNW,UltraSPARC-II: [ID 520797 kern.info] [AFT0] errID 0x0007cb98.604a53e5 Corrected Memory Error on Board 4 J3400 is Sticky

/var/adm/messages:Aug 3 04:35:33 mhs-apps33-d SUNW,UltraSPARC-II: [ID 248118 kern.info] [AFT0] errID 0x0007cb98.7ff04f25 Corrected Memory Error on Board 4 J3400 is Persistent

/var/adm/messages:Aug 3 04:35:33 mhs-apps33-d SUNW,UltraSPARC-II: [ID 532138 kern.info] [AFT0] errID 0x0007cb98.7ffa946e Corrected Memory Error on Board 4 J3400 is Persistent

/var/adm/messages:Aug 3 04:35:33 mhs-apps33-d SUNW,UltraSPARC-II: [ID 361962 kern.info] [AFT0] errID 0x0007cb98.80083481 Corrected Memory Error on Board 4 J3400 is Persistent

/var/adm/messages:Aug 3 04:35:34 mhs-apps33-d SUNW,UltraSPARC-II: [ID 651253 kern.info] [AFT0] errID 0x0007cb98.9400e061 Corrected Memory Error on Board 4 J3400 is Persistent

/var/adm/messages:Aug 3 04

Whereas prtdiag shows the output below:
System Configuration: Sun Microsystems sun4u 8-slot Sun Enterprise E4500/E5500
System clock frequency: 100 MHz
Memory size: 8192Mb

========================= CPUs =========================

                    Run   Ecache   CPU    CPU
Brd  CPU  Module    MHz     MB    Impl.   Mask
---  ---  -------  -----  ------  ------  ----
  0    0     0      400    8.0    US-II   10.0
  0    1     1      400    8.0    US-II   10.0
  2    4     0      400    8.0    US-II   10.0
  2    5     1      400    8.0    US-II   10.0
  4    8     0      400    8.0    US-II   10.0
  4    9     1      400    8.0    US-II   10.0
  6   12     0      400    8.0    US-II   10.0
  6   13     1      400    8.0    US-II   10.0

========================= Memory =========================

                                              Intrlv.  Intrlv.
Brd  Bank     MB  Status   Condition   Speed  Factor   With
---  -----  ----  -------  ----------  -----  -------  -------
  0    0    1024  Active   OK          60ns   4-way    A
  0    1    1024  Active   OK          60ns   4-way    B
  2    0    1024  Active   OK          60ns   4-way    A
  2    1    1024  Active   OK          60ns   4-way    B
  4    0    1024  Active   OK          60ns   4-way    B
  4    1    1024  Active   OK          60ns   4-way    B
  6    0    2048  Active   OK          60ns   2-way    A

========================= IO Cards =========================

     Bus   Freq
Brd  Type  MHz   Slot        Name                          Model
---  ----  ----  ----------  ----------------------------  --------------------
  1  SBus   25   0           lpfs/sd (block)               LP9002S
  1  SBus   25   2           SUNW,qfe                      SUNW,sbus-qfe
  1  SBus   25   2           SUNW,qfe                      SUNW,sbus-qfe
  1  SBus   25   2           SUNW,qfe                      SUNW,sbus-qfe
  1  SBus   25   2           SUNW,qfe                      SUNW,sbus-qfe
  1  SBus   25   3           SUNW,hme
  1  SBus   25   3           SUNW,fas/sd (block)
  1  SBus   25   13          SUNW,socal/sf (scsi-3)        501-3060
  3  SBus   25   0           lpfs/sd (block)               LP9002S
  3  SBus   25   2           SUNW,qfe                      SUNW,sbus-qfe
  3  SBus   25   2           SUNW,qfe                      SUNW,sbus-qfe
  3  SBus   25   2           SUNW,qfe                      SUNW,sbus-qfe
  3  SBus   25   2           SUNW,qfe                      SUNW,sbus-qfe
  3  SBus   25   3           SUNW,hme
  3  SBus   25   3           SUNW,fas/sd (block)
  3  SBus   25   13          SUNW,socal/sf (scsi-3)        501-3060

No failures found in System

No System Faults found

Is there a fault in the DIMM, i.e. is a replacement required?

I would suggest scheduling downtime and changing the memory stick that is coming up in the error message. This could worsen and the server could crash at a bad time (in the middle of the day, for instance).

If you can fix it over a weekend or something where it won't impact anything, do it.

The bad memory is in slot J3400. Get it replaced as soon as possible.

I agree with the two responses above. "Sticky" means a memory error that the kernel can detect and correct but that keeps repeating. That is why prtdiag looks OK: the memory stick hasn't totally failed yet, so as long as the kernel keeps correcting the errors, the hardware diagnostics think it is fine.

If it keeps coming up this regularly, eventually it will become uncorrectable and crash your box. Maybe in 5 minutes, maybe in 5 years. But since you don't know when it will crash, the prudent thing is to replace the failing memory ASAP, before it gets any more serious.
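
If you want a quick picture of how often the errors recur while you wait for the downtime window, a small script can tally the corrected-error messages per module. This is only a rough sketch: the regular expression is based solely on the [AFT0] lines quoted in this thread, the function name tally_corrected_errors is made up for illustration, and it assumes you have Python 3 available (possibly on another machine, after copying the log over).

```python
import re
from collections import Counter

# Pattern derived only from the "[AFT0] ... Corrected Memory Error on
# Board N Jxxxx is <class>" lines quoted in this thread (assumption:
# other Solaris releases may word these messages differently).
CE_RE = re.compile(r"Corrected Memory Error on (Board \d+ J\d+) is (\w+)")


def tally_corrected_errors(lines):
    """Count corrected memory errors per (module, classification) pair.

    `lines` is any iterable of log lines, e.g. an open file object.
    Lines that do not match the pattern are ignored.
    """
    counts = Counter()
    for line in lines:
        match = CE_RE.search(line)
        if match:
            module, classification = match.groups()
            counts[(module, classification)] += 1
    return counts
```

To use it, pass it the open log, for example tally_corrected_errors(open("/var/adm/messages", errors="replace")), and print the resulting counts. A count for one slot that keeps climbing from day to day supports replacing that module sooner rather than later.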