Linux server locking up

I've been running Django on MySQL on a Linux server for about 3 months w/o any problems. Suddenly, a couple of days ago, Django began throwing errors about not being able to connect to MySQL, at which point the entire server seizes up. The Apache web server stops serving pages and SSH seizes up as well.

However, I can still ping the machine and it responds in a timely manner.

I don't think it's a configuration issue w/ Apache or Django or MySQL, since the SSH console locks up too.

My background is in applications development and I don't know much about Unix system administration, beyond the very basics.

What tools would I use to begin diagnosing this problem?

Is it possible that the server was hacked, and that there's some kind of Trojan horse or virus on it?

Sounds vaguely like the kernel is buggy, but beyond that we can only speculate. Look in the log files for error messages; the earliest nontrivial events you can find before the crash are the most likely to indicate the root cause, though this too is just a basic rule of thumb.

The first thing you need to do is look at /var/log/messages to see what events occurred around that time, or check any other logs that record the processes running on your machine. If you can grab that and paste it here, someone might be able to help you.
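
For example, a quick way to pull the most suspicious recent entries before the hang (assuming the classic /var/log/messages layout; adjust the path and keywords for your distro):

grep -i -E 'error|panic|oom|mysql' /var/log/messages | tail -n 50   # last 50 suspicious entries
ls -l /var/log/messages*    # rotated logs may hold the window around the crash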

  1. Can you log in AT the console? As root?
    1a. If so, you can run "top" to find and kill the process taking too much memory. It might also be a good idea to run "ps fax", capture the output to a file, and post it here. Lots of options if you can get to this point.
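
A minimal way to capture those snapshots for posting (the output file names here are just placeholders):

ps fax > /tmp/ps-fax.txt               # full process tree
top -b -n 1 > /tmp/top-snapshot.txt    # one batch-mode pass of top
free -m >> /tmp/top-snapshot.txt       # memory and swap summary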

  2. If not, try holding down the ALT key and the SYSREQ key and following this guide: The magic sysreq options introduced. (If the key combinations do nothing, see the note after 2c below.)

2a. First, try ALT-SYSREQ-m to see how much memory is used up. Then ALT-SYSREQ-f to kill the process using the most memory. Then try ALT-SYSREQ-t to see which processes might be consuming all the resources. Then try to log in again.

2b. Now try ALT-SYSREQ-s to sync all the filesystems, then ALT-SYSREQ-e to kill all processes. If after a minute you still can't get a login, then...

2c. Reboot (CTRL-ALT-DEL) or ALT-SYSREQ-c.
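
One caveat worth checking first: many distros ship with the magic SYSREQ keys disabled. A quick sketch of how to check and enable them (the sysctl name below is the standard one, but verify it on your kernel):

cat /proc/sys/kernel/sysrq         # 0 means the keys are disabled
echo 1 > /proc/sys/kernel/sysrq    # enable them until the next reboot
# add "kernel.sysrq = 1" to /etc/sysctl.conf to make it permanent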

  3. Post-mortem analysis. Boot into "single user mode" and...

3a. Check /var/log/messages to see what happened last (already suggested).
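
It also helps to pin down exactly when the box went down versus when it came back; "last -x" reads /var/log/wtmp and shows reboot, shutdown, and runlevel records:

last -x | head -n 20    # most recent reboot/shutdown entries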

3b. Run "sar", "sar -b", "sar -c", "sar -q", and "sar -W" to see what was happening on your system at the time of the crash. Look for "spikes" in the data. You should know a spike when you see one, especially since after a reboot their values should be back to "nominal".
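
sar can also read a specific day's data file and restrict the time window, which makes it easier to zero in on the crash (the path below is the usual RHEL location; Debian-style systems keep the files under /var/log/sysstat):

sar -q -f /var/log/sa/sa14 -s 09:00:00 -e 11:00:00    # load averages for the 14th, 9-11 AM only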

3c. Do a disk scan for bad blocks. Use "e2fsck -f -c" on all the ext2/ext3 partitions in /etc/fstab. WARNING: don't run this on partitions that are currently mounted read-write. Hopefully in single-user mode this won't be a problem. If you find bad blocks or you get messages on the console, replace the drive. If you get filesystem errors, run it again with -p.
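
A sketch of what that looks like from single-user mode (the device and mount point are examples; substitute the entries from your /etc/fstab):

umount /home              # make sure the filesystem is not mounted read-write
e2fsck -f -c /dev/sda3    # -f forces the check even if it looks clean, -c scans for bad blocks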

  4. If you don't find anything useful, do a few more things to try to catch the error next time:

4a. Add a line to syslog.conf:
*.debug;mark.* /var/log/details
and make sure syslogd is run with the -m option (the number is how many minutes apart the "heartbeat" messages are written to the log, so syslogd -m 5 would write a "MARK" line every 5 minutes. At least this way you can nail down when the machine hung.)
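
Where that -m flag goes depends on the distro; on RHEL-style systems it is usually an options variable in /etc/sysconfig/syslog (variable name assumed, check your init script), e.g.:

SYSLOGD_OPTIONS="-m 5"
service syslog restart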

4b. Increase the granularity of the sa1 runs. In some distros this is found in /etc/crontab, while in others it's in /etc/cron.d/sysstat. It's a cron job which looks like this: "/usr/lib/sa/sa1 1 1". Change it to run every 5 minutes instead of the default. In RHEL it looks like this:
*/20 * * * * root /usr/lib/sa/sa1 1 1
Change it to:
*/5 * * * * root /usr/lib/sa/sa1 1 1

4c. Recompile a standard-issue Linux kernel. If it fails during the compile, there's a good chance the problem is (a) memory, (b) power supply, or (c) motherboard, in that order of probability (assuming you have already checked the hard drive). When these devices start to fail, they often show up in strange ways.
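
As a rough sketch of that stress test (the source path and job count are assumptions; any long, heavy compile will do):

cd /usr/src/linux                # wherever your kernel source tree lives
make clean
make -j2 bzImage modules         # CPU- and memory-intensive; random failures here point at hardware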

4d. If the previous step worked, then you can upgrade the kernel. Okay, it might be a buggy kernel, but doubtful unless you're using a distro like Debian which pushes technology to the bleeding edge.

4e. Install lkcd to facilitate taking core dumps during a kernel panic. If the kernel hangs after running a specific process (i.e., mysql) but still runs other processes (like cron), then add a cron job (or submit an "at" job) to panic the kernel a few minutes after your hanging process starts. Then follow this FAQ: Linux Crash HOWTO (kernel rebuild required) so you can analyze the crash.
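
One way to script that forced panic, assuming the sysrq trigger is available on your kernel (it is part of the same magic-SYSREQ support mentioned in step 2):

echo 1 > /proc/sys/kernel/sysrq                            # make sure SysRq is enabled
echo 'echo c > /proc/sysrq-trigger' | at now + 5 minutes   # crash/panic the kernel in 5 minutes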