First, go into the cron folder and update sysstat to run more frequently. On our CentOS systems, inside /etc/cron.d/sysstat, I use:
*/10 * * * * root /usr/lib/sa/sa1 -d -I 30 20
To have this take effect immediately, you need to delete today's sa file. Otherwise, the change will start taking place tomorrow.
rm -f /var/log/sa/sa`date +%d`
The next step is to monitor processes. Something like this should work. Add these two lines to the sysstat cron file:
1 * * * * root find /var/log/sa -name "ps-*" -cmin +300 | xargs rm -f &>/dev/null
* * * * * root ps -N --sort comm,pid -ww
-o tty:1,pid,c,pmem:5,rss:8,sz:8,size:8=TSIZE,vsz:8,nlwp,lstart,wchan,args |
sed -n 's/^? //p' |
awk '$4 != "0" && $5 != "0"'
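&>/var/log/sa/ps-`date +\%H\%M`
(Note: you must put these on exactly TWO lines. For readability, I've broken the second entry across multiple lines. The % signs are backslash-escaped because cron turns an unescaped % into a newline, and the minute is %M, not %m, which is the month and would make every run within an hour overwrite the same file.)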
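If joining the second entry onto one line feels fragile, one option is to move the pipeline into a small script and have cron call that. The sketch below is only an illustration: the function name snapshot_ps and the optional directory argument are my own assumptions, and the ps options are copied unchanged from the entry above.

```shell
#!/bin/bash
# Sketch of a wrapper around the ps pipeline above. The function name and the
# optional output-directory argument are illustrative assumptions.
snapshot_ps() {
    local outdir="${1:-/var/log/sa}"
    # %H%M = hour and minute; no cron-style escaping is needed inside a script.
    local outfile="$outdir/ps-$(date +%H%M)"

    # sed keeps only processes without a tty and strips that column;
    # awk then drops entries whose rss ($4) or sz ($5) is 0.
    ps -N --sort comm,pid -ww \
        -o tty:1,pid,c,pmem:5,rss:8,sz:8,size:8=TSIZE,vsz:8,nlwp,lstart,wchan,args |
        sed -n 's/^? //p' |
        awk '$4 != "0" && $5 != "0"' \
        > "$outfile" 2>&1

    echo "$outfile"    # print the path that was written
}
```

Saved as an executable script (the location is up to you) that ends with a call to snapshot_ps "$@", the cron entry shrinks to the schedule, the user, and the script path.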
Every hour, the first command deletes any ps output that is more than 5 hours old (to keep the directory from filling up). You can raise that limit if 5 hours isn't enough. The second command runs every minute and saves a very detailed ps listing to disk.
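Before trusting the cleanup entry, it can help to see which files it would remove. Here is a minimal sketch that uses the same find test but only prints the matches. The function name stale_snapshots and its two optional arguments are assumptions for illustration.

```shell
# Sketch: list the ps-* files the cleanup entry would delete, without deleting.
# $1 = directory (default /var/log/sa)
# $2 = find-style -cmin value (default +300, i.e. changed more than 300 minutes ago)
stale_snapshots() {
    local dir="${1:-/var/log/sa}" age="${2:-+300}"
    find "$dir" -maxdepth 1 -type f -name 'ps-*' -cmin "$age" -print
}
```

Calling stale_snapshots with no arguments shows what the hourly job would remove. Passing a different age, for example stale_snapshots /var/log/sa +600, previews a longer retention before you edit the cron file.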
If you have a hang, reboot and then run "sar -A", which should now give you very detailed information about everything. You might notice a memory spike followed by IO, or vice versa. Note the time when the problem occurs, and then go into the appropriate ps-* files to see if you can spot the problem process. You might need to look at earlier ps outputs to see a change. The processes are ordered by command name and pid, so you can run "diff" on two ps files to see exactly where a change occurs.
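A plain diff shows every line that changed, which can be noisy when many processes shift by a few kilobytes. As a rough complement, the sketch below lists only the processes whose RSS grew between two snapshots, largest growth first. The function name rss_growth is hypothetical, and it assumes the column layout the ps entry above produces once the tty column is stripped: pid in field 1, rss in field 4, lstart in fields 9 to 13, and the command line from field 15 on.

```shell
# Sketch: show processes whose RSS grew between two ps snapshots.
# Assumed layout per line: $1 = pid, $4 = rss (KiB), $15.. = command line.
# Processes are matched by pid only, so a reused pid could give a false match.
rss_growth() {
    local old="$1" new="$2"
    awk '
        NR == FNR { rss[$1] = $4; next }         # first file: remember RSS per pid
        ($1 in rss) && ($4 > rss[$1]) {           # second file: same pid, larger RSS
            printf "%d\t%s", $4 - rss[$1], $1     # growth in KiB, then pid
            for (i = 15; i <= NF; i++)
                printf "%s%s", (i == 15 ? "\t" : " "), $i
            printf "\n"
        }
    ' "$old" "$new" | sort -rn
}
```

For example, rss_growth /var/log/sa/ps-1000 /var/log/sa/ps-1010 compares the 10:00 and 10:10 snapshots. Each output line is the growth in KiB, the pid, and the command line, separated by tabs.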