comm: mysqld Not tainted ... Kernel Panic, System totally unresponsive

massoo · August 26, 2010, 10:32am

Hi,

I am experiencing frequent system hangs and hard kernel panics, almost three times a day. The system becomes totally unresponsive, and the only way to recover is a hard power cycle (unplug the power cable and plug it back in after 30 seconds). I enabled kdump, but unfortunately the kdump files are as large as 16GB and I am unable to analyze them. The errors that repeat in /var/log/messages are:

Aug 24 18:05:35 blr-cos-mdb01 kernel: BUG: soft lockup - CPU#0 stuck for 10s! [mysqld:5365]
Aug 24 18:05:35 blr-cos-mdb01 kernel: CPU 0:
Aug 24 18:05:45 blr-cos-mdb01 kernel: Modules linked in: ipv6 xfrm_nalgo crypto_api hidp l2cap bluetooth lockd sunrpc cpufreq_ondemand acpi_cpufreq freq_table dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec i2c_core dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport floppy snd_hda_intel sr_mod tpm_infineon snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq cdrom tpm snd_seq_device snd_pcm_oss snd_mixer_oss tpm_bios snd_pcm e1000e snd_timer shpchp serio_raw snd_page_alloc snd_hwdep pcspkr sg snd soundcore dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ahci libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Aug 24 18:05:45 blr-cos-mdb01 kernel: Pid: 5365, comm: mysqld Not tainted 2.6.18-194.3.1.el5 #1
Aug 24 18:05:45 blr-cos-mdb01 kernel: RIP: 0010: __d_lookup+0xe2/0xff
Aug 24 18:05:45 blr-cos-mdb01 kernel: RSP: 0018:ffff8101a415fc88 EFLAGS: 00000282
Aug 24 18:05:45 blr-cos-mdb01 kernel: RAX: ffff8103c6c864c8 RBX: ffff8103c6c864c8 RCX: 0000000000000015
Aug 24 18:05:45 blr-cos-mdb01 kernel: RDX: 00000000000db1d6 RSI: ffff8101a415fd28 RDI: ffff8103c83274b0
Aug 24 18:05:45 blr-cos-mdb01 kernel: RBP: ffff810417bbf800 R08: 0000000000008001 R09: ffff81041747e5c0
Aug 24 18:05:45 blr-cos-mdb01 kernel: R10: ffff810188c8c580 R11: ffffffff8002c3e0 R12: ffff81040c0840c0
Aug 24 18:05:45 blr-cos-mdb01 kernel: R13: 0000000000000000 R14: ffff8101a394e348 R15: ffff8101a394e348
Aug 24 18:05:45 blr-cos-mdb01 kernel: FS: 00000000405c8940(0063) GS:ffffffff803ca000(0000) knlGS:0000000000000000
Aug 24 18:05:45 blr-cos-mdb01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Aug 24 18:05:45 blr-cos-mdb01 kernel: CR2: 00002aace6224000 CR3: 0000000401db3000 CR4: 00000000000006e0
Aug 24 18:05:45 blr-cos-mdb01 kernel:
Aug 24 18:05:45 blr-cos-mdb01 kernel: Call Trace:
Aug 24 18:05:45 blr-cos-mdb01 kernel: __d_lookup+0xb0/0xff
Aug 24 18:05:45 blr-cos-mdb01 kernel: do_lookup+0x2c/0x1e6
Aug 24 18:05:45 blr-cos-mdb01 kernel: __link_path_walk+0xa01/0xf42
Aug 24 18:05:45 blr-cos-mdb01 kernel: link_path_walk+0x42/0xb2
Aug 24 18:05:45 blr-cos-mdb01 kernel: do_path_lookup+0x275/0x2f1
Aug 24 18:05:45 blr-cos-mdb01 kernel: __path_lookup_intent_open+0x56/0x97
Aug 24 18:05:45 blr-cos-mdb01 kernel: open_namei+0x73/0x6d5
Aug 24 18:05:45 blr-cos-mdb01 kernel: do_page_fault+0x4fe/0x874
Aug 24 18:05:45 blr-cos-mdb01 kernel: do_filp_open+0x1c/0x38
Aug 24 18:05:45 blr-cos-mdb01 kernel: _atomic_dec_and_lock+0x39/0x57
Aug 24 18:05:45 blr-cos-mdb01 kernel: do_sys_open+0x44/0xbe
Aug 24 18:05:45 blr-cos-mdb01 kernel: tracesys+0xd5/0xe0
Aug 24 18:05:45 blr-cos-mdb01 kernel:
Aug 24 18:22:23 blr-cos-mdb01 syslogd 1.4.1: restart.
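
When lockups recur several times a day, it can help to tally which CPU and which process each "BUG: soft lockup" line names before digging into a large dump. The following is a minimal Python sketch under stated assumptions: the line format is taken from the log above, the function name tally_soft_lockups is made up for illustration, and the default path /var/log/messages is simply the file mentioned in this post.

```python
import re
import sys
from collections import Counter

# Matches lines such as:
#   ... kernel: BUG: soft lockup - CPU#0 stuck for 10s! [mysqld:5365]
# The format is assumed from the log excerpt in this thread.
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>.+):(?P<pid>\d+)\]"
)


def tally_soft_lockups(lines):
    """Count soft-lockup reports per (cpu, command name) pair."""
    counts = Counter()
    for line in lines:
        match = LOCKUP_RE.search(line)
        if match:
            counts[(int(match.group("cpu")), match.group("comm"))] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical usage: pass a log path, or fall back to /var/log/messages.
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/messages"
    with open(path, errors="replace") as handle:
        for (cpu, comm), n in tally_soft_lockups(handle).most_common():
            print(f"CPU#{cpu} {comm}: {n}")
```

If the reports are spread across unrelated processes and CPUs rather than concentrated in one code path, that points away from a single software bug and toward something like hardware, though the tally alone cannot prove either.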

I have CentOS release 5.5 (Final) with kernel-2.6.18-194.3.1.el5. The hardware is HP dc7900 with 16 GB RAM, Intel Core 2 Duo E8400/3 GHz/4GB RAM, 160GB HDD. I have installed the following MySQL builds from Percona:
Percona-XtraDB-1.0.6-10.2-5.1.45-10.2.rhel5
Percona-Server-server-51-5.1.47-rel11.1.51.rhel5
Percona-XtraDB-1.0.3-5-5.1.34-5.rhel5
Percona-Server-shared-compat-5.1.43-3
Percona-Server-client-51-5.1.47-rel11.1.51.rhel5
Percona-Server-test-51-5.1.47-rel11.1.51.rhel5
Percona-Server-devel-51-5.1.47-rel11.1.51.rhel5
Percona-Server-shared-51-5.1.47-rel11.1.51.rhel5

The output of uname -a is: Linux blr-cos-mdb01.digi.com 2.6.18-194.3.1.el5 #1 SMP Thu Sep 3 03:28:30 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

What could be the issue, and how can I resolve it?

Regards
Prashant

Corona688 · August 26, 2010, 11:17am

The kernel is not supposed to panic under any circumstances; nothing short of a hardware failure or a kernel bug should cause one. As such, there's no magic technique for avoiding kernel panics. Try upgrading your kernel.

massoo · August 30, 2010, 8:22am

Issue resolved. BAD ... BAD MEMORY. After removing one of the DIMMs, the server no longer crashes.

I wonder how memtest passed the memory!

mark54g · August 30, 2010, 10:38am

Did you actually run memtest86 or memtest86+, or did you rely on the BIOS address test? A BIOS address test will not catch issues like the one you describe. Memtest86+, run for several hours, has never failed to identify bad RAM for me.