How often should I monitor CPU and memory usage?

Hi all,

When you monitor CPU and memory usage, how often do you sample? Doing it too often or too rarely will both cause problems. Does anyone have hands-on experience?

In my case, the requirement says that when CPU usage is above X% or memory usage is above Y%, I should reject further requests, but it says nothing about how long the system must stay overloaded before I start rejecting them. The requirement does state, however, that for each accepted request I must send an ack back within Z seconds.

So my initial thought is to monitor CPU and memory usage from the time my app starts (until it exits), and if usage stays above the threshold for Z seconds, notify the request-processing part so that it rejects further requests.
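That idea can be sketched roughly like this; the threshold, the Z-second window, and the usage source are all placeholders for whatever values and sampling mechanism you end up with, not anything from the requirement:

```java
import java.util.function.DoubleSupplier;

// Sketch: trips an "overloaded" flag once usage has stayed above the
// threshold for a full window of Z seconds, and clears it otherwise.
// The request-processing thread only ever reads the volatile flag.
class OverloadGate {
    private final double threshold;      // e.g. X% CPU, as a fraction
    private final long windowMillis;     // Z seconds, in milliseconds
    private final DoubleSupplier usage;  // wherever the samples come from
    private long aboveSince = -1;        // -1 = currently below threshold
    private volatile boolean overloaded;

    OverloadGate(double threshold, long windowMillis, DoubleSupplier usage) {
        this.threshold = threshold;
        this.windowMillis = windowMillis;
        this.usage = usage;
    }

    // Call once per sampling interval from the monitoring thread.
    void sample(long nowMillis) {
        if (usage.getAsDouble() > threshold) {
            if (aboveSince < 0) aboveSince = nowMillis;
            overloaded = (nowMillis - aboveSince) >= windowMillis;
        } else {
            aboveSince = -1;
            overloaded = false;
        }
    }

    // The request-processing part checks this before accepting a request.
    boolean isOverloaded() { return overloaded; }
}
```

The flag only flips after a full window above the threshold, so a single spiky sample never causes rejections by itself.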

That seems like the simplest answer, but are there any other thoughts?

And how often should I monitor CPU/memory usage within those Z seconds?

Besides, my app is a Java app, but for the monitoring part I just want to write a script that calls commands like sar/top/vmstat. So how should these two parts communicate: a socket, or by polling a log file?
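One alternative that avoids both the socket and the log file: since the app is Java, it could launch the monitoring command itself and read stdout directly. A sketch, assuming `vmstat interval count` output; the column indices below are guesses and differ between AIX and other platforms, so check your vmstat header first:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class VmstatReader {
    // Pull one whitespace-separated field out of a vmstat data line.
    static double field(String line, int index) {
        return Double.parseDouble(line.trim().split("\\s+")[index]);
    }

    public static void main(String[] args) {
        try {
            // "vmstat 2 3" prints 3 samples 2 seconds apart, then exits,
            // so there is no long-lived script and no log file to poll.
            Process p = new ProcessBuilder("vmstat", "2", "3").start();
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(p.getInputStream()))) {
                String line;
                while ((line = r.readLine()) != null) {
                    // Data lines start with a digit; everything else is header.
                    if (!line.trim().matches("\\d.*")) continue;
                    // Assumed column layout with "id" (idle) at index 15;
                    // busy CPU is then 100 minus idle.
                    double busy = 100.0 - field(line, 15);
                    System.out.println("cpu busy% = " + busy);
                }
            }
        } catch (IOException e) {
            System.out.println("vmstat not available: " + e.getMessage());
        }
    }
}
```

This keeps everything in one process, which sidesteps the IPC question entirely; the script-plus-log-file route still works fine if you want the monitor to outlive the app.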

Thanks in advance!

---------- Post updated at 07:58 PM ---------- Previous update was at 02:47 AM ----------

I googled the topic for a while and only found one article discussing the sampling rate: InformIT: Windows 2000 Performance Tools: Leverage Native Tools for Performance Monitoring and Tuning > Performance Monitor

But it only discusses things like: the more often you sample, the more disk space the performance log files require. Still nothing about how the sampling rate affects the performance of the system itself.

Besides, how do I calculate CPU/memory usage? Should I average the sample data, or count the number of consecutive samples in which CPU/memory usage is above the threshold and set another threshold for that count?
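Both options can be sketched side by side; the window size and thresholds here are arbitrary placeholders, just to make the two decision rules concrete:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Two ways to turn raw samples into an overload decision.
class SampleEvaluator {
    // Option 1: moving average of the last n samples; compare the
    // returned average against the usage threshold.
    static double movingAverage(Deque<Double> window, double sample, int n) {
        window.addLast(sample);
        if (window.size() > n) window.removeFirst();
        return window.stream()
                     .mapToDouble(Double::doubleValue)
                     .average()
                     .orElse(0.0);
    }

    // Option 2: consecutive count; returns the updated run length of
    // samples above the threshold. Compare the result against a count
    // threshold (e.g. "3 samples in a row").
    static int consecutiveAbove(int runSoFar, double sample, double threshold) {
        return sample > threshold ? runSoFar + 1 : 0;
    }
}
```

The consecutive-count rule reacts faster to a sustained spike and resets instantly when load drops, while the moving average smooths noise but can lag; which fits better depends on how bursty your load is.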

Any idea ? Thanks!

Hi,

which OS do you run - I assume it's AIX? And why do you want to do this at all - are you running so low on resources that a simple command execution / a running process will put your system in trouble? Or do you expect your application to be so badly written that it will negatively impact the system for a certain amount of time?

If it's AIX, vmstat or sar or any command collecting one-off data will not impact your performance at all, while interactive monitoring tools like topas or nmon will. On Unix you should not average anything yourself, since resource usage is 'at a given moment in time' and changes about 1000 times per second anyway - most of these tools internally average over the interval between executions.

That said, where would you set your threshold anyway? AIX, using VMM and constantly re-nicing its processes, is absolutely capable of running at (and, depending on hardware/setup/virtualization, over) its CPU entitlement without any problems - and if you overutilize your memory for a short time and your system is properly tuned, this will not slow down your system either (an AIX box with proper sizing and 'enough' memory normally uses 70-80% of memory as computational and leaves you 20% headroom for peak times).

Not knowing your system / application at all, I would say 'it depends on how long you expect your threads to run and what they're doing overall' - on a DB box I would monitor the system at 1-2 second intervals and flag only when all resources are used for more than 3 intervals - but as stated, when I use virtualization and have a large shared pool where I can get 1000% CPU if I need it, I just don't have to monitor it at all. And if you have p6 systems and large shared memory pools too, I would not even do it for memory.

zxmaus

Thanks for the reply. My system is indeed AIX (a Power server). How did you know that? COOL!!

I am not sure exactly why my customer wants to monitor CPU/memory usage. Maybe they want their system to be "stable", because the requirement says that when CPU usage is above X% or memory usage is above Y%, I should reject further requests. (Y is actually 70!! Nice guess again :b:)

Or maybe they just want a general sense of what the system is doing.

To me, I just need to figure out two things:

  1. How long the system has to run above these thresholds before I can say it is indeed overloaded.
  2. What the appropriate sampling rate is.

Any further suggestion ?

Thanks!

Hi,

can you please post the output of lparstat -i ... this would help determine what kind of system you have, whether it's capped or uncapped, and how many resources are in the shared pool :slight_smile:

We monitor our systems for peak performance - 100% CPU usage for more than 1 min (4 measurement intervals, one every 20 seconds) on capped systems (LPARs that don't have a shared pool from which they can get more CPU if required) - and we have set up a custom alert that lets us know once computational memory crosses 80% or paging space crosses 5%; there we measure every 5 min.

Please be aware that the AVM value in the vmstat -I output is in 4k pages ... and computational memory usage is all that really counts on a box (that is sufficiently tuned). Whether your box is or isn't is something we cannot tell you without any outputs :slight_smile:
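For reference, the 4k-page arithmetic works out as below; the 8 GB total is just an example figure, not anything read from your box:

```java
// vmstat's avm value counts 4 KB (4096-byte) pages, so converting it
// to a computational-memory percentage of total RAM is one multiply
// and one divide.
public class AvmMath {
    static double computationalPercent(long avmPages, long totalRamBytes) {
        return avmPages * 4096.0 / totalRamBytes * 100.0;
    }
}
```

So an avm of 1,048,576 pages is 4 GiB, i.e. 50% of an 8 GiB box; the 80% alert above would fire around 1.68 million pages on that machine.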

Kind regards
zxmaus

Hey!

We use Zabbix to monitor our CPU, and the default interval is 10 seconds.

See attached.

Hi, my server is a Power 550 with 4 processor cores at 3.5 GHz and 8 GB of memory.

The app we're developing is a Java application. Will this information help determine the sampling rate and the peak-duration window?

Thanks!