Bizarre Sun T5240 behavior

pyroman · August 2, 2011, 11:21pm

Hi -

I have a T5240 with 7 LDOMS configured. One night, network communication broke for no obvious reason. Nobody was doing anything on the machine at the time. Here is what I saw in messages:

WARNING: nxge3 : nxge_dma_mem_alloc: ddi_dma_mem_alloc kmem alloc failed
WARNING: nxge3 : nxge_alloc_rx_buf_dma: Alloc Failed: dma 12 size_index 10 size requested 4194304
WARNING: nxge3 : ==> nxge_alloc_rx_buf_dma: not enough for channel 12 allocated 0x200000 requested
WARNING: nxge3 : <== nxge_init_rxdma: status 0x40000000
WARNING: nxge3: nxge_grp_dc_add (12): channel init failed
NOTICE: nxge3: xcvr addr: 0x1a - link is down

So, it took down ALL of the interfaces (the entire quad card). It appears that the driver could not allocate memory. The funny thing is, I am not even using nxge3: nothing is plugged into it, and it has never been plumbed. I AM using nxge0, nxge1, and nxge2 as aggr1 for vsw0. I shut down the machine, pulled the power plugs, and then booted.
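To put numbers on the allocation failure, here is a minimal sketch that pulls the two sizes out of the warnings above and converts them. It assumes the "allocated 0x200000" value is a byte count like the "size requested" value; the helper names are made up for illustration.

```python
import re

# Two of the nxge3 warnings, copied from the messages excerpt above.
LOG = """\
WARNING: nxge3 : nxge_alloc_rx_buf_dma: Alloc Failed: dma 12 size_index 10 size requested 4194304
WARNING: nxge3 : ==> nxge_alloc_rx_buf_dma: not enough for channel 12 allocated 0x200000 requested
"""

def to_mib(n_bytes):
    """Convert a byte count to MiB."""
    return n_bytes / (1024 * 1024)

def parse_sizes(log):
    """Return (requested_bytes, allocated_bytes); a value is None if not found.

    Assumption: the hex value after "allocated" is in bytes.
    """
    req = re.search(r"size requested (\d+)", log)
    got = re.search(r"allocated (0x[0-9a-fA-F]+)", log)
    requested = int(req.group(1)) if req else None
    allocated = int(got.group(1), 16) if got else None
    return requested, allocated

requested, allocated = parse_sizes(LOG)
print(f"requested: {requested} bytes = {to_mib(requested):.0f} MiB")
print(f"allocated: {allocated} bytes = {to_mib(allocated):.0f} MiB")
```

If that reading is right, the driver asked for 4 MiB of receive buffer for channel 12 and only got 2 MiB, which is a small request to fail on a machine of this size.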

After it came up, the network looked fine. The vsw0 was working properly. But, "svcs -a" reported the following problems:

maintenance  12:19:58 svc:/ldoms/vntsd:default
maintenance  12:21:36 svc:/ldoms/ldmd:default

So, I could not start my LDOMS. The ldmd service log showed the following:

cat /var/svc/log/ldoms-ldmd:default.log

......
warning: unable to reconfigure CPUs in guest primary
Executing start method ("/opt/SUNWldm/bin/ldmd_start")
Method or service timed out. Killing contract 43

Also, about the same time, in dmesg output:

vdc: [ID 995498 kern.notice] NOTICE: [2] disk access failed.

After a while of poking around and running svcadm enable/restart, etc., we got ldm and vntsd running again, independent of the Service Management Facility. It still shows them in maintenance mode.

I brought up all of the LDOMS successfully and it all seems to be running fine.

I am still not sure what started this whole thing. I am now also getting this message, which I was not getting before:

nxge: [ID 339653 kern.notice] NOTICE: nxge3: xcvr addr:0x1a - link is down.

Any ideas at all?

DGPickett · August 3, 2011, 12:23pm

Maybe someone filled /tmp (which is backed by swap), so malloc()/brk() was failing.

Maybe that link has some configuration parameter set too high, so it tries to malloc() 4GB in a 32-bit app.

Got all patches? http://wesunsolve.net/bugid/id/6768523

Maybe the card is broken? Static discharge?
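For the first suggestion above (a full /tmp), a quick check could be sketched like this. It only uses Python's standard shutil.disk_usage; the 90% threshold and the function name are arbitrary choices for illustration, not anything Solaris defines.

```python
import shutil

def used_fraction(path="/tmp"):
    """Return the fraction (0.0 to 1.0) of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.used / usage.total if usage.total else 0.0

if __name__ == "__main__":
    frac = used_fraction("/tmp")
    print(f"/tmp is {frac:.0%} full")
    # Arbitrary warning threshold, chosen only for this example.
    if frac > 0.90:
        print("/tmp is nearly full; memory allocations may start failing")
```

Since /tmp shares its space with swap on Solaris, a nearly full /tmp at the time of the failure would at least be consistent with the kmem alloc warnings.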

pyroman · August 3, 2011, 4:22pm

Thanks DGPickett - I'll check it out.