Syslog messages

nicknihal · November 5, 2007, 12:07am

My Sun Fire 6800 hung and later rebooted after generating these messages. Can anyone help me review them?
nfssrv: [ID 694464 kern.warning] WARNING: nfsauth upcall failed: RPC: Operation in progress
mountd[664]: [ID 676604 daemon.error] cannot accept connection: 19: error unknown (current state -1)
KAVE00166-W The Store service is delayed and the load on the service is high. Revise the collection items and the collection intervals. (queue length=1)
^Mpanic[cpu16]/thread=2a10004fcc0:
using kernel phase-lock loop 0041, drift correction 0.00000

I am unable to diagnose the problem with the system. Can anyone please share their remarks on these messages?
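As a rough aid for reviewing messages like the ones quoted above, the sketch below splits a syslog line of the form `tag: [ID msgid facility.level] text` into its parts. The regular expression and the function name `parse_line` are assumptions made for illustration; lines that do not follow this layout (such as the panic line) simply return `None`.

```python
import re

# Assumed layout: "tag: [ID <msgid> <facility>.<level>] <message text>"
SYSLOG_RE = re.compile(
    r"^(?P<tag>[^:]+):\s+\[ID (?P<msgid>\d+) "
    r"(?P<facility>\w+)\.(?P<level>\w+)\]\s+(?P<text>.*)$"
)


def parse_line(line):
    """Return a dict of fields, or None if the line has a different layout."""
    match = SYSLOG_RE.match(line.strip())
    return match.groupdict() if match else None


# Lines copied from the post above.
lines = [
    "nfssrv: [ID 694464 kern.warning] WARNING: nfsauth upcall failed: "
    "RPC: Operation in progress",
    "mountd[664]: [ID 676604 daemon.error] cannot accept connection: 19: "
    "error unknown (current state -1)",
    "^Mpanic[cpu16]/thread=2a10004fcc0:",
]

parsed = [parse_line(line) for line in lines]

for line, fields in zip(lines, parsed):
    if fields is None:
        print("unparsed:", line)
    else:
        print(fields["facility"], fields["level"], fields["tag"], "->", fields["text"])
```

Sorting the parsed entries by `facility` and `level` makes it easier to see that the NFS warning comes from the kernel while the mountd error comes from a daemon.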

joerg · November 5, 2007, 12:59am

It could be a problem with NFS.
Maybe an important file system cannot be mounted or is damaged.
To start the system, use single-user mode.

Have you implemented DiskSuite? (Is the system mirrored?)

We need more information about the system. Please provide the contents of /etc/vfstab
and the output of prtdiag -v.

Best regards
joerg

nicknihal · November 5, 2007, 1:24am

Thanks, joerg, for taking an interest in my case.

From what I have diagnosed from the system log, there is no hardware error. But what puzzles me is the panic message from the CPU:
^Mpanic[cpu16]/thread=2a10004fcc0:
using kernel phase-lock loop 0041, drift correction 0.00000
What does this mean? Moreover, a crash dump has been generated under /var/crash/. Is there any simple method to understand these crashes? I don't know how to use the mdb or adb commands with these messages.

Can you help me with this? I am looking forward to your reply.

joerg · November 5, 2007, 1:53am

It is a little bit complicated to lead you through this because my own knowledge of this topic is small. But we can try.
First of all we have to open the core files in:
/var/crash/<name of the host>/

mdb -k unix.0 vmcore.0

$r   <- type this and press Return; press Space to page to the next screen.

Find the %pc register, something like:
%pc = 0x00008732873ff56 open_+4x66
The underlined part is important.

Now we have to disassemble the instruction that caused the problem:
0x00008732873ff56/ai

panic_thread/K   <- simply type it, and don't ask me why!

You get a line with a hex address. Put that address in place of the word "address":
address$<thread

Look for a row with procp in it and take the second hex address for the next command:

address$<proc2u

This shows you the command and arguments that were running at the moment the system crashed.
But this is not necessarily the root cause!
It can be a hardware (CPU or memory) error that caused the crash at the moment the process started. In most cases the problem is a memory error!
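The steps above can be summarized as a short sequence of mdb commands in which each "address" is a placeholder for a value read from the previous step. The sketch below only assembles that sequence as text; the helper name `mdb_steps` and the proc address used in the example are hypothetical, and the thread address is the one shown in the panic message from the first post.

```python
def mdb_steps(pc_addr, panic_thread_addr, proc_addr):
    """Return the walkthrough's mdb commands with the placeholders filled in.

    Each address is written in front of the command or macro it applies to,
    e.g. "<address>$<thread", never after it.
    """
    return [
        "$r",                              # dump registers, look for %pc
        f"{pc_addr}/ai",                   # disassemble at the %pc address
        "panic_thread/K",                  # print the panic thread pointer
        f"{panic_thread_addr}$<thread",    # dump the thread structure
        f"{proc_addr}$<proc2u",            # dump the u-area of the process
    ]


steps = mdb_steps(
    pc_addr="0x00008732873ff56",        # example %pc value from the post
    panic_thread_addr="2a10004fcc0",    # from the panic message
    proc_addr="PROC_ADDR_FROM_T_PROCP",  # hypothetical placeholder
)

for command in steps:
    print(command)
```

Writing the sequence out this way makes it clear that the literal word "address" is never typed into mdb.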

I hope this helps you!

But don't ask me too much about this topic!

Best regards
joerg

nicknihal · November 5, 2007, 2:19am

Hi joerg, I got the panic message but am unable to get through the address$< line. It says:
failed to dereference symbol: unknown symbol name

I think the syntax is going wrong somewhere. Can you guide me further on this?

joerg · November 5, 2007, 2:30am

Please post the complete lines!
Best regards
joerg

nicknihal · November 5, 2007, 5:37am

I followed your guideline but was unable to get the address$<thread step done. See what happens when I try it:

%tba = 0x0000000000000000
%tt = 0x17f
%tl = 0x0
%pil = 0xf
%pstate = 0x016 cle=0 tle=0 mm=TSO red=0 pef=1 am=0 priv=1 ie=1 ag=0

%cwp = 0x06 %cansave = 0x00
%canrestore = 0x00 %otherwin = 0x00
%wstate = 0x00 %cleanwin = 0x00
> panic_thread/K
panic_thread:
panic_thread: 2a10004fcc0
> address$<2a10004fcc0
mdb: failed to dereference symbol: unknown symbol name
> 2a10004fcc0
0x2a10004fcc0: 2a100047cc0
> address$<thread
mdb: failed to dereference symbol: unknown symbol name
>
Can you guide me through this?

joerg · November 5, 2007, 6:02am

Sorry, I think my description was not clear enough:

panic_thread:
panic_thread: 2a10004fcc0
> address$<2a10004fcc0 ----> Change it to
2a10004fcc0$<thread

Please change this.
Best regards
joerg

nicknihal · November 5, 2007, 6:20am

Thanks, it worked. But after that, this comes up. How do I interpret it?
> 2a10004fcc0$<thread
{
t_link = 0x2a100047cc0
t_stk = 0x2a10004fad0
t_startpc = thread_create_intr
t_bound_cpu = cpu0
t_affinitycnt = 0x1
t_bind_cpu = 0xffff
t_flag = 0x809
t_proc_flag = 0
t_schedflag = 0x3
t_preempt = '\003'
t_preempt_lk = '\0'
t_state = 0x4
t_pri = 0xa5
t_epri = 0
t_writer = '\0'
t_pcb = {
val = [ 0x105fa78, 0x18aa911 ]
}
t_lwpchan = {
lc_wchan0 = 0
lc_wchan = 0
}
t_sobj_ops = 0
t_cid = 0
t_clfuncs = sys_classfuncs+0x48
t_cldata = 0
t_ctx = 0
t_lofault = 0
t_onfault = 0
t_ontrap = panic_stack+0x3c68
t_swap = 0x2a10004a000
t_lock = 0
t_lockstat = 0
t_pil = 0x6
t_pi_lock = 0
t_nomigrate = '\0'
t_cpu = cpu0
t_weakbound_cpu = 0
t_lpl = 0x300001ec040
t_lgrp_reserv = [ 0, 0 ]
t_intr = 0x2a100755cc0
t_intr_start = 0xdd2e4128026bb
t_did = 0xb
t_tnf_tpdp = 0x300001cf0e8
t_cpc_ctx = 0
t_cpc_set = 0
t_tid = 0
t_waitfor = 0
t_sigqueue = 0
t_sig = {
__sigbits = [ 0, 0 ]
}
t_extsig = {
__sigbits = [ 0, 0 ]
}
t_hold = {
__sigbits = [ 0, 0 ]
}
t_forw = 0
t_back = 0
t_thlink = 0
t_lwp = 0
t_procp = p0
t_audit_data = 0
t_next = 0x2a100047cc0
t_prev = 0x2a100057cc0
t_whystop = 0
t_whatstop = 0
t_dslot = 0
t_pollstate = 0
t_pollcache = 0
t_cred = 0x60001001ee8
t_start = 2007 Sep 16 08:43:22
t_lbolt = 0
t_stoptime = 0
t_pctcpu = 0
t_sysnum = 0
t_delay_cv = {
_opaque = 0
}
t_delay_lock = {
_opaque = [ 0 ]
}
t_lockp = cpu0+0xf8
t_oldspl = 0xa
t_pre_sys = '\0'
t_lock_flush = 0
t_disp_queue = cpu0_disp
t_disp_time = 0x19c486f6
t_kpri_req = 0xffff4e35
_tu = {
_ts = {
_t_astflag = '\0'
_t_sig_check = '\0'
_t_post_sys = '\0'
_t_trapret = '\0'
}
_t_post_sys_ast = 0
}
t_waitrq = 0x699a151ba0b
t_mstate = 0x7
t_rprof = 0
t_prioinv = 0
t_ts = 0x6000ab84c88
t_tsd = 0
t_stime = 0
t_door = 0
t_plockp = p0lock
t_schedctl = 0
t_sc_uaddr = 0
t_cpupart = cp_default
t_bind_pset = 0xffffffff
t_copyops = 0
t_stkbase = 0x2a10004a000
t_red_pp = 0
t_activefd = {
a_fd = 0
a_nfd = 0
a_stale = 0
a_buf = [ 0 ]
}
t_priforw = 0
t_priback = 0
t_sleepq = 0
t_panic_trap = 0xf059e880
t_lgrp_affinity = 0
t_upimutex = 0
t_nupinest = 0
t_proj = 0x300001e4968
t_unpark = 0
t_release = 0
t_hatdepth = 0
t_joincv = {
_opaque = 0
}
t_taskq = 0
t_anttime = 0
t_pdmsg = 0
t_predcache = 0
t_dtrace_vtime = 0x1
t_dtrace_start = 0
t_dtrace_stop = 0
t_dtrace_sig = 0
_tdu = {
_tds = {
_t_dtrace_on = 0
_t_dtrace_step = 0
_t_dtrace_ret = 0
_t_dtrace_ast = 0
}
_t_dtrace_ft = 0
}
t_dtrace_pc = 0
t_dtrace_npc = 0
t_dtrace_scrpc = 0
t_dtrace_astpc = 0
t_hrtime = 0x312659a12420
How do I diagnose the problem that made the server reboot from the above output?

joerg · November 5, 2007, 6:40am

OK, that's perfect!

And now this please:

0x2a100057cc0$<proc2u

And send me the output!

Best regards
joerg

nicknihal · November 5, 2007, 11:06pm

> 2a10004fcc0$<proc2u
{
mdb: failed to read u_execsw pointer at 2a100050100: no mapping for address
}
>
This is what comes up if I type in address$<proc2u. What might be the issue?

joerg · November 6, 2007, 1:14am

I'm confused!
OK, please provide the output of this command:

2a100047cc0$<proc2u

It is important that you run all the commands in order.

Short explanation:
This locates the panic thread:
> panic_thread/K
panic_thread:
panic_thread: 2a10004fcc0

This is the pointer to the memory where the thread is:

2a10004fcc0$<thread

Inside the output we are looking for the procp pointer. This is a pointer to the proc structure, so we can identify the command or program which caused the panic.
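To make the lookup concrete, here is a minimal sketch that pulls named fields out of pasted `$<thread` output. The function name `parse_fields` is hypothetical, and the sample lines are copied from the output posted earlier in this thread. On that output `t_procp` is printed as the symbol `p0`, while the two hex values `0x2a100047cc0` and `0x2a100057cc0` belong to `t_next` and `t_prev`; reading those as links to neighbouring threads rather than as the proc pointer is an interpretation based on the field names.

```python
import re

# Matches lines such as "t_procp = p0" or "t_next = 0x2a100047cc0".
FIELD_RE = re.compile(r"^\s*(\w+)\s*=\s*(.+?)\s*$")


def parse_fields(text, wanted):
    """Return {field: value} for the first occurrence of each wanted field."""
    found = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match and match.group(1) in wanted and match.group(1) not in found:
            found[match.group(1)] = match.group(2)
    return found


# A few lines copied from the thread structure posted above.
sample = """\
t_link = 0x2a100047cc0
t_startpc = thread_create_intr
t_pil = 0x6
t_procp = p0
t_next = 0x2a100047cc0
t_prev = 0x2a100057cc0
"""

fields = parse_fields(sample, {"t_procp", "t_next", "t_prev", "t_startpc"})

for name in ("t_procp", "t_next", "t_prev", "t_startpc"):
    print(f"{name:10s} {fields.get(name)}")
```

Under that reading, the value to hand to the next macro is whatever follows `t_procp`, not the addresses on the `t_next` and `t_prev` lines.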

The problem for me is that Sun changed the format of the output to a nested, brace-delimited layout.
At the moment I have no core files to test this in the lab.

.....
t_procp = p0
t_audit_data = 0
t_next = 0x2a100047cc0
t_prev = 0x2a100057cc0
t_whystop = 0
.....

And this was (I hope) my mistake:
2a10004fcc0$<proc2u
Change it to:
2a100047cc0$<proc2u
and additionally try this:
2a100057cc0$<proc2u

Please try it and inform me!

Best regards joerg

nicknihal · November 6, 2007, 4:40am

Hey, that command is not mapping:
2a100047cc0$<proc2u

It produces output like this:

> 2a100047cc0$<proc2u
{
mdb: failed to read u_execsw pointer at 2a100048100: no mapping for address
}
> 2a100057cc0$<proc2u
{
mdb: failed to read u_execsw pointer at 2a100058100: no mapping for address
}

It says "failed to read". Does that mean there was no process running at that time?

Regards
nick
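One thing the three error messages in this thread have in common can be checked with simple arithmetic. The sketch below subtracts each address passed to `$<proc2u` from the address at which mdb reported "no mapping". The pairs are copied from the outputs above; the interpretation that follows is an assumption, not something the dump itself states.

```python
# address given to $<proc2u  ->  address mdb failed to read (from the posts)
attempts = {
    0x2A10004FCC0: 0x2A100050100,
    0x2A100047CC0: 0x2A100048100,
    0x2A100057CC0: 0x2A100058100,
}

# Distance between the failing address and the address that was supplied.
offsets = {hex(failed - given) for given, failed in attempts.items()}

for given, failed in attempts.items():
    print(f"{given:#x} -> {failed:#x}  (difference {failed - given:#x})")

print("distinct offsets:", offsets)
```

The difference is 0x440 in every case, which suggests that the macro reads at a fixed offset from whatever address it is given. Since all three addresses came from thread fields (the panic thread itself, `t_next`, and `t_prev`) rather than from `t_procp`, the "no mapping" errors by themselves do not show that no process was running at the time.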

joerg · November 7, 2007, 2:11am

Sorry for the delay.

After a long search I found 5 files to test with, but with only one of them was I able to run my previously learned procedure successfully.

So at this point you need someone with more knowledge about this topic.

Sorry!!

Best regards
joerg