I am a newbie sysadmin on AIX; I have worked on HP-UX for 3 years.
I have started a new role within an IBM house, and because there is only me and one other admin, there are a couple of issues I cannot work out:
We have had a production server slowing down while processing batch jobs over the past few nights. I have checked many things, such as nmon stats, vmstat, top procs and the general performance of the machine.
We are getting hit by high wait times in vmstat throughout the evening, and we know of certain business-critical jobs that run during that window. These are running up to 4 hours longer than usual.
Can you tell me the best way to monitor jobs, processes etc. so I can tell the "BOSS" what is causing the issues?
The main problem is that developers run queries against the DBs on the server; we are currently working through a process to stop this.
Use filemon. You can then see if there is a bottleneck in your disks or I/O somewhere. IMO it's probably the best tool there is for checking that, but you have to run it while the slowdown is occurring.
It is difficult to help without more details, but try this:
1. If you use SSA, check that the volume group is in good health and that you have no stale physical volumes, and run defragmentation.
2. Check the system settings for the maximum number of open files per process and the buffer limits per process.
3. See in top which processes take up most of the time, then use lsof to figure out what they have open, and then watch in iostat or vmstat how the picture changes as you go through steps 1 and 2.
Yes, we have checked various subsystems during the issues. We have nmon graphs that show high wait times, and we also have alerting that showed wait times above 60 in the vmstat output.
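For anyone wanting to reproduce that kind of alert from captured vmstat output, here is a minimal sketch. It assumes the usual layout where a header row names the columns and `wa` is the I/O wait percentage; the function names and the sample figures are made up for illustration and are not taken from this server.

```python
# Illustrative vmstat capture; the numbers are invented, not from the real box.
SAMPLE = """\
kthr    memory              page              faults        cpu
----- ----------- ------------------------ ------------ -----------
 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa
 2  5 858018 58593   0  40  55 120  300   0 900 4000 800 10  5 20 65
 1  1 857900 60000   0   0   0   0    0   0 400 2000 500 15  5 75  5
 3  6 858100 58000   0  60  70 200  500   0 950 4200 850  8  6 14 72
"""


def wait_column(vmstat_text):
    """Return the 'wa' value of every data row as a list of ints."""
    wa_index = None
    values = []
    for line in vmstat_text.splitlines():
        fields = line.split()
        if "wa" in fields:
            # Header row: remember where 'wa' sits (headers can repeat).
            wa_index = fields.index("wa")
            continue
        if wa_index is None or len(fields) <= wa_index:
            continue  # banner, separator or blank line
        if fields[wa_index].isdigit():
            values.append(int(fields[wa_index]))
    return values


def high_wait(vmstat_text, threshold=60):
    """Return only the samples whose wait percentage exceeds the threshold."""
    return [wa for wa in wait_column(vmstat_text) if wa > threshold]


if __name__ == "__main__":
    print(high_wait(SAMPLE))
```

Looking the column up by its header name rather than by a fixed position is deliberate, since some AIX configurations print extra columns after `wa`.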
I ran svmon:
--> svmon -G -i 2
               size      inuse       free        pin    virtual
memory      3145689    3087096      58593     182176     858018
pg space    2785280     428652

               work       pers       clnt
pin          182158          0          0
in use       913993    2173103          0
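To put those svmon figures in perspective, here is a rough sketch of the arithmetic. It assumes svmon is counting 4 KB pages (the usual default, but worth confirming on this box); the helper names are hypothetical. Read that way, roughly 70% of the in-use memory is persistent (file) pages rather than working storage, and only about 2% of real memory is free.

```python
PAGE_KB = 4  # assumption: svmon -G is reporting 4 KB pages

# Figures copied from the svmon -G output above.
memory = {"size": 3145689, "inuse": 3087096, "free": 58593}
paging_space = {"size": 2785280, "inuse": 428652}
in_use = {"work": 913993, "pers": 2173103, "clnt": 0}


def pages_to_mb(pages):
    """Convert a page count to megabytes."""
    return pages * PAGE_KB / 1024


def share(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


if __name__ == "__main__":
    print("real memory  : %.0f MB" % pages_to_mb(memory["size"]))
    print("free         : %.1f %%" % share(memory["free"], memory["size"]))
    print("pers of inuse: %.1f %%" % share(in_use["pers"], memory["inuse"]))
    print("work of inuse: %.1f %%" % share(in_use["work"], memory["inuse"]))
    print("pg space used: %.1f %%" % share(paging_space["inuse"], paging_space["size"]))
```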
I also ran topas and noticed lots of page faults due to paging in and out.
hdisk0 and hdisk1 are heavily utilised pretty much all day, as are other system disks, but I tend not to believe everything in topas.
We run ps awux > /tmp/monitoring.date.
This file is updated every 15 minutes, and in it I find the following, which are apparently normal kernel processes (these are the top procs in the file every 15 minutes):
root 2064 8.5 0.0 12 9008 - A 24 Feb 60881:40 kproc
root 1806 8.5 0.0 12 9008 - A 24 Feb 60821:30 kproc
root 1548 8.5 0.0 12 9008 - A 24 Feb 60818:49 kproc
root 1290 8.5 0.0 12 9008 - A 24 Feb 60703:20 kproc
root 2322 8.5 0.0 12 9008 - A 24 Feb 60685:25 kproc
root 1032 8.5 0.0 12 9008 - A 24 Feb 60554:57 kproc
root 774 8.4 0.0 12 9008 - A 24 Feb 60152:51 kproc
root 516 8.1 0.0 12 9008 - A 24 Feb 57866:27 kproc
root 3096 0.0 0.0 64 9052 - A 24 Feb 198:45 kproc
root 2580 0.0 0.0 12 9004 - A 24 Feb 139:47 kproc
root 2838 0.0 0.0 16 9012 - A 24 Feb 1:41 kproc
root 3354 0.0 0.0 16 9012 - A 24 Feb 1:10 kproc
root 32510 0.0 0.0 16 9012 - A 24 Feb 0:03 kproc
root 30446 0.0 0.0 16 9012 - A 24 Feb 0:02 kproc
root 582168 0.0 0.0 16 9004 - A 28 Feb 0:00 kproc
root 25284 0.0 0.0 16 9004 - A 24 Feb 0:00 kproc
root 25542 0.0 0.0 16 9004 - A 24 Feb 0:00 kproc
root 25800 0.0 0.0 16 9004 - A 24 Feb 0:00 kproc
root 25026 0.0 0.0 16 9004 - A 24 Feb 0:00 kproc
When I grep for defunct:
retail ps auwx Monitor on Mon 24 Apr 18:15:00 2006
rt07mszw 1228822 Z 0:00 <defunct>
rt05hdzw 925824 Z 0:00 <defunct>
rt0v9rzm 1064108 Z 0:00 <defunct>
rt0a5jzm 1990444 Z 0:00 <defunct>
rt07mszw 1772756 Z 0:00 <defunct>
rt07mszw 1733018 Z 0:00 <defunct>
rt06ggxp 1731806 Z 0:00 <defunct>
informix 246550 Z 0:00 <defunct>
rt07mszw 781804 Z 0:00 <defunct>
rt08cazm 807862 Z 0:00 <defunct>
informix 732496 Z 0:00 <defunct>
informix 671516 Z 0:00 <defunct>
retail ps auwx Monitor on Mon 24 Apr 18:30:01 2006
rt050azb 1280306 Z 0:00 <defunct>
informix 1502640 Z 0:00 <defunct>
rt0d5rws 1481808 Z 0:00 <defunct>
rt0j2czb 1410630 Z 0:00 <defunct>
rt0o5ayb 1410304 Z 0:00 <defunct>
rt0r5mza 1030858 Z 0:00 <defunct>
rt0o5ayb 1014478 Z 0:00 <defunct>
root 1914084 Z 0:00 <defunct>
root 1966324 Z 0:00 <defunct>
rt095req 1948512 Z 0:00 <defunct>
rt01mszm 1944508 Z 0:00 <defunct>
rt0d5rws 1682574 Z 0:00 <defunct>
root 455384 Z 0:00 <defunct>
informix 232872 Z 0:00 <defunct>
informix 732496 Z 0:00 <defunct>
informix 734412 Z 0:00 <defunct>
rt05adyk 551914 Z 0:00 <defunct>
rt0a2gzt 654196 Z 0:00 <defunct>
Now, these do disappear and reappear with different PIDs.
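One way to check whether the zombies really all come and go is to compare the snapshots programmatically. The sketch below assumes the monitoring file looks like the listing above (a "Monitor on" header line followed by `user pid Z time <defunct>` rows); the function names are hypothetical and the sample is a subset of the rows posted. Run over the two snapshots above, it would flag informix PID 732496 as present in both, which may mean its parent is not reaping it.

```python
from collections import Counter

HEADER_MARK = "Monitor on"  # text that introduces each 15-minute snapshot

# A subset of the rows from the listing above.
SAMPLE = """\
retail ps auwx Monitor on Mon 24 Apr 18:15:00 2006
rt07mszw 1228822 Z 0:00 <defunct>
informix 246550 Z 0:00 <defunct>
informix 732496 Z 0:00 <defunct>
retail ps auwx Monitor on Mon 24 Apr 18:30:01 2006
informix 1502640 Z 0:00 <defunct>
informix 732496 Z 0:00 <defunct>
root 455384 Z 0:00 <defunct>
"""


def parse_snapshots(text):
    """Map each snapshot timestamp to its list of (user, pid) zombies."""
    snapshots = {}
    current = None
    for line in text.splitlines():
        if HEADER_MARK in line:
            current = line.split(HEADER_MARK, 1)[1].strip()
            snapshots[current] = []
        elif "<defunct>" in line and current is not None:
            user, pid = line.split()[:2]
            snapshots[current].append((user, int(pid)))
    return snapshots


def users_per_snapshot(snapshots):
    """Count zombies per user within each snapshot."""
    return {stamp: Counter(user for user, _ in entries)
            for stamp, entries in snapshots.items()}


def persistent_zombies(snapshots):
    """Return (user, pid) pairs that show up in more than one snapshot."""
    seen = Counter()
    for entries in snapshots.values():
        seen.update(set(entries))
    return sorted(entry for entry, count in seen.items() if count > 1)


if __name__ == "__main__":
    snaps = parse_snapshots(SAMPLE)
    print(users_per_snapshot(snaps))
    print(persistent_zombies(snaps))
```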
If you are getting a lot of paging during this time, check whether your paging space is set up correctly as well.
A few questions you might bring up -
When was the last time the box was rebooted? If you have any memory leaks this will clean that up.
Has the number of apps increased on the box since it was bought? Does it need an actual memory upgrade?
Check the performance and tuning guide against what the vendor recommends.
I still recommend running filemon to see if you have a disk bottleneck. Your paging can increase if there is a bottleneck and writes are taking longer and longer to complete. If so, you would need to move your LVs around in order to improve performance.