Find out which process crashed the server

Hi everybody,

I want to find out all the processes that ran before a server crashed. Is that possible?

I've looked in /var/log/messages and found out that the system was out of memory.

A user probably wrote a script (in Perl or Python) that used up all available memory and crashed the server.

I'm using Red Hat Enterprise Linux Server release 5.4 (Tikanga).

Thanks in advance!

Dave

Hm, if you have nothing set up that tracks stuff like this and there are no more details in /var/log/messages, I would set something up with ps aux, maybe in combination with pmap, and compare the memory usage numbers from a fresh start to when memory is about to run low (check with free or cat /proc/meminfo), triggered by cron every 10 minutes maybe.
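Something like this rough sketch could do it (the paths and filenames are just examples I made up, adapt as needed):

    #!/bin/sh
    # memsnap.sh - rough sketch, untested; /var/log/memsnap is a placeholder
    DIR=/var/log/memsnap
    STAMP=`date +%Y%m%d-%H%M`
    mkdir -p $DIR
    # process list sorted by resident memory, biggest first
    ps aux --sort=-rss > $DIR/ps-$STAMP
    # overall memory situation
    free -m > $DIR/free-$STAMP

and in root's crontab, to run it every 10 minutes:

    */10 * * * * /usr/local/bin/memsnap.sh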

Hi zaxxon!

Thanks for the prompt response.

That's a great idea. So there is no native log file or daemon that tracks this sort of information in RHEL? I wanted to find out the script/user that used up all the memory, so I can avoid situations like this in the future.

If there is no such service, I will create the cron job.

Cheers,

Dave

I am not sure if RHEL provides something specifically for this case, but on Linux you usually see messages about processes being killed in a last-ditch attempt to free memory. I usually noticed that when a box was still up and running but no longer had sshd or rsyncd running.
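Those messages end up in the kernel log, so something like this should dig them out (the exact message text can differ between kernel versions):

    grep -i "out of memory" /var/log/messages
    # the kernel ring buffer may still have them too
    dmesg | grep -i "killed process"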

Are messages about processes being killed stored anywhere?

I did notice this line in /var/log/messages

Feb 8 19:46:18 computer-name kernel: Out of memory: Killed process 19136 (emacs-x).

But I don't know what process 19136 is.

Thanks again!

Dave

My crystal ball tells me it was emacs. :wink:

But the process killed isn't necessarily the one that caused the out of memory condition. The kernel tries to identify it but when the whole system is memory starved, EVERYTHING is fighting for memory...

debug0:2> eps() <Enter>
The eps() command will give you a process listing.

This works on SCO Unix boxes that have had a kernel panic.

Ah point taken.

The system has 64 gigs of memory and was probably using 30% of it before someone/something used it all up. What would emacs be doing?

I know it is possible in vi, since I once ran a substitute command on a big file in vi and drained the memory.


I tried looking for the eps() command on my box, RHEL, but couldn't find it. Is it available on RHEL or just SCO Unix?

I think you misunderstood corona.

When a Linux box runs out of memory it starts killing processes (more or less at random, I think) to free memory. In this case emacs was just prey. As corona said, you can't tell from these messages which other processes were using up all the memory and causing this behaviour (killing other processes).

So you might want to, as already said, just write a little script and place it in the crontab that takes a snapshot with ps and pmap every hour or every 10 minutes or whatever, then compare the snapshots and see which processes' memory usage rises between a fresh reboot and the time the trouble starts.
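For the comparison step, even something this simple would show what grew (assuming the snapshot files from the sketch earlier in the thread; the timestamps here are made up for illustration):

    # header plus top 5 memory consumers, right after boot vs. before trouble
    head -6 /var/log/memsnap/ps-20100208-0800
    head -6 /var/log/memsnap/ps-20100208-1940
    # or diff the full lists to see what changed
    diff /var/log/memsnap/ps-20100208-0800 /var/log/memsnap/ps-20100208-1940 | less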

Checking the source, it's got a complex scoring system to measure a process's "badness". It preferentially kills:

  • Things with lots of memory.
  • Things with lots of children (fork bombs).
  • Things that have accumulated little CPU time or haven't been running long, e.g. a freshly started runaway allocation loop; long-lived daemons score lower.
  • Low-priority and/or non-root things (since they're presumably less important).
  • Above all else, swapoff. duh.

But it can only measure the stats, so it gauges what is safe to kill as much as what should be killed.

This doesn't rule out emacs, either! It might have been killed because it was consuming too much memory. Or it might have been killed to make way for a runaway process that had higher priority or access privileges than it, which the OOM killer preferentially keeps.
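If you want to see the numbers the killer works from, kernels of the RHEL 5 era (2.6.18) expose them in /proc. A quick look, with <pid> standing in for whatever process you're curious about:

    # current "badness" score the OOM killer has computed for a process
    cat /proc/<pid>/oom_score
    # bias it by hand (as root): positive values make a kill more likely,
    # and -17 exempts the process entirely
    echo -17 > /proc/<pid>/oom_adj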

I just answered a previous note about memory usage and pointed the user at collectl. There are a couple of things worth noting - collectl is VERY lightweight, on the order of <0.1% of the CPU when sampling system data every 10 seconds! When trying to track down something tricky you ALWAYS need fine-grained timing or you never see those spikes that so often occur when you least expect them. In fact, even if you sample once a second you're still under 1%.

But back to the problem at hand. While you can certainly run ps from cron every hour, there are two reasons why you might not want to. First of all, sampling once an hour isn't really going to help much unless you get real lucky. Second, even if ps did tell you something, you might also want to see other things that happened at the time in question - CPU, memory usage, open files, etc. - but you won't have access to them because you didn't think to ask ahead of time.

With collectl, you just start it running as a daemon and it will collect more than you would have thought to ask for. It will even collect info on your slab usage, and a runaway allocation of slab memory can certainly trigger the out-of-memory killer.

Just note that, by default, collectl only monitors slabs/processes once a minute, because those are higher-load tasks...
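Getting it going can be as simple as the following (flags from memory, so double-check your version's man page):

    # run as a daemon: system data every 10s, process/slab data every 60s
    collectl -s+Z -i 10:60 -f /var/log/collectl -D
    # play a recorded file back later, showing the top consumers
    collectl -p /var/log/collectl/*.raw.gz --top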

-mark

Thanks Mark! collectl seems to be exactly the type of monitoring tool I need to keep track of events - as you pointed out, a much better solution than cron and ps. I'll give it a go. Cheers!

Dave