SNMP responses failing under high system load

Greetings,

I've got a Zenoss v2.5 server monitoring a large video encoding farm. Needless to say, these systems are under high bandwidth and CPU utilization the majority of the time.

The problem is that these systems occasionally fail to respond to a standard SNMP request, which throws "SNMP agent down" errors in Zenoss and generates a lot of unnecessary alerts. The next poll then succeeds, and a clear message goes out as well (generating even more alerts).

Short of renicing the snmpd process to a higher priority so that it doesn't get starved by the video encoding, what would be the best way to handle this, whether by configuring Zenoss, SNMP, or the servers themselves? I don't see an obvious solution to this puzzle.

How to fix it depends on why it's not responding.

If UDP packets are actually lost due to network overload, I'm not sure you can fix that. Is it possible to get your monitoring system to retry SNMP at least once instead of sending a failure message?
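To make the retry idea concrete, here is a minimal sketch in Python. It is not Zenoss code: `poll_with_retry` and the `poll` callable are made-up names for illustration, and `poll` stands in for whatever actually sends the SNMP request and reports whether an answer came back. The point is only that a failure gets reported after every attempt has timed out, not after the first one.

```python
import time


def poll_with_retry(poll, retries=1, delay=0.0):
    """Call poll() up to retries + 1 times; return True on the first success.

    `poll` is a hypothetical callable that returns True when the agent
    answered and False when the request timed out.
    """
    for attempt in range(retries + 1):
        if poll():
            return True
        if attempt < retries:
            # Give a busy host a moment before asking again.
            time.sleep(delay)
    return False


if __name__ == "__main__":
    # Simulated agent that drops the first request and answers the second.
    answers = iter([False, True])
    if poll_with_retry(lambda: next(answers), retries=1):
        print("agent up")
    else:
        print("agent down")
```

If you poll by hand with the net-snmp command-line tools, the `-r` (retries) and `-t` (timeout in seconds) options do the same job, for example `snmpget -r 3 -t 5 ...`.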

If the SNMP process just isn't responding in time due to CPU overload, then nice-ing your video processes to reduce their priority will do the job. Reducing one thing's priority is a better idea than increasing another's, since a process can lower its own priority without root privileges, while raising priority requires root. Low-priority jobs still get 100% CPU when nothing else competes with them, so you shouldn't lose throughput on a system that doesn't have other intensive tasks.
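If the encoding jobs are started from a script, lowering their priority at launch could look roughly like the sketch below. It assumes a POSIX system and Python 3.7 or later; `run_niced` is a made-up helper name, and the child command here is only a stand-in that reports its own niceness, so substitute your real encoder command line. From a shell, `nice -n 10 <command>` does the same thing, and `renice` changes a job that is already running.

```python
import os
import subprocess
import sys


def run_niced(cmd, increment=10, **kwargs):
    """Run cmd with its niceness raised by `increment` (i.e. lower priority).

    Hypothetical helper: os.nice() is called in the child just before
    exec, so the parent's own priority is left untouched. Raising
    niceness needs no root privileges.
    """
    return subprocess.run(
        cmd,
        preexec_fn=lambda: os.nice(increment),
        check=True,
        **kwargs,
    )


if __name__ == "__main__":
    # Stand-in for the encoder: a child process that prints its niceness.
    result = run_niced(
        [sys.executable, "-c", "import os; print(os.nice(0))"],
        capture_output=True,
        text=True,
    )
    print("child niceness:", result.stdout.strip())
```

Niceness tops out at 19 on Linux, so an increment that would go past that is simply capped.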

If it's not responding in time due to disk thrashing, I'm less sure how to deal with that; the server literally can't respond in time since things need to be loaded from an already-occupied disk first...