Can HPOV monitor a server's hung state?

solaris_1977 · June 26, 2019, 2:52pm

Hi,
We have Solaris 10 running on VMware (x86), monitored by HP OpenView. Sometimes this server hangs, but because ping still works, HP OpenView does not alert that the server is down (it is actually unresponsive).
The first symptom we see is login failure. It asks for the user name, and after that either it never asks for the password or, if it does, it just waits there and never reaches a prompt.
I am not an expert in monitoring tools, so I am seeking suggestions here. Is there any capability in HP OpenView that can track and alert on this kind of hung state, such as a login failure?
Thanks

hicksd8 · June 26, 2019, 3:16pm

My guess is that this server is not in a hung state but is running dead slow. The fact that ping still works shows that the VMware guest OS is alive.

This guest OS is either out of memory or the whole platform is out of memory. Since you are not reporting that other guests show any signs of problems, I would ask whether a maximum real memory limit was put on this guest when it was configured. If so, as load increases on this guest, it could reach a level where it is allowed no more memory and so starts to page like mad. The symptoms of very slow (almost impossible) login are typical of resource limitation. If you were to wait a very long time, the login would probably complete eventually.

What OS is underneath VMware, booting the platform?

Could another guest OS on this platform suddenly be demanding a lot of resources, thereby starving the other guests?

Could you easily increase the overall RAM on the whole platform to test this theory out?

solaris_1977 · June 26, 2019, 3:40pm

Yes, we can assume that it was dead slow. I waited 20 minutes before I hit the reset button; since this is a critical application server, I could not wait longer. I checked the VMware console, and the memory graph was not showing peak utilization. But I have seen cases where a server froze due to a memory crunch and the VMware console did not show it, so your theory could be true in this case too.

The VMware platform is stable (over 300 VMs are running on it); no other VM has reported problems, and the platform still has a large amount of free memory. Memory is capped per VM. For example, this affected VM is allocated 8 GB.

As a preventive measure, I am looking for an alert if it happens again. One option is a small script running on another server that keeps logging in to the affected server and reports "Login OK". As soon as the login is delayed or does not respond within 10-20 seconds, it sends an email to the admins. But we want to do this for many servers, and that would become a messy solution. HP OpenView is handled by a different team, but they are not taking any initiative on this. So I am looking for a solution, if something can be done with this tool.
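The login-check idea described above could be sketched roughly as follows. This is only a minimal illustration, not an HP OpenView feature: it assumes key-based SSH from the monitoring host to each target and an OpenSSH client (the `BatchMode` and `ConnectTimeout` options), and the names `probe`, `login_command`, and `check_hosts` are made up for the example. Emailing the admins is left out; the caller decides what to do with any host that is not "OK".

```python
import subprocess


def probe(cmd, timeout_s=15.0):
    """Run cmd and classify the outcome.

    Returns 'OK' (exit status 0), 'FAILED' (non-zero exit or the command
    could not be started), or 'HUNG' (no completion within timeout_s).
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "HUNG"
    except OSError:
        return "FAILED"
    return "OK" if result.returncode == 0 else "FAILED"


def login_command(host):
    # Assumption: key-based login works, so BatchMode=yes can forbid any
    # interactive password prompt that would otherwise block the check.
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10", host, "true"]


def check_hosts(hosts, timeout_s=15.0, build_cmd=login_command):
    """Probe every host and return a {host: status} mapping."""
    return {host: probe(build_cmd(host), timeout_s) for host in hosts}
```

Because the host list is just an argument, the same script covers many servers from one place, which may make it less messy than one script per server. A status of "HUNG" corresponds to the symptom described here: the login starts but never reaches a prompt.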

Neo · June 26, 2019, 9:58pm

HPOV, like most centralized network management systems (NMS), relies primarily on polling to update its view of the managed systems.

This means you cannot get information by polling an unresponsive server.

However, most of these same systems can also be configured to receive traps sent by the monitored hosts.

This means your HPOV team needs to set up your system to send traps back to the management system before the system being monitored slows to a crawl (soft failure) and cannot respond to polling.

In other words, most novices set up network management systems like HPOV to use only polling; but experienced network management people will also set up traps to be sent back to the NMS for certain critical processes which need to alert the NMS prior to overall system "failure" (also meaning a soft failure, not only a hard failure).

I have extensive hands-on experience with NMS, including debugging HPOV when it was a black-and-white version decades ago. Any well-configured NMS will be set up both to poll (95%+) and to receive traps (<5%) for alerts.

Bottom Line: Set up traps for the critical processes to alert the NMS before the system soft-fails (slows to a crawl).
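As a rough illustration of the "trap before soft failure" idea, the sketch below fires once when a metric stays above a threshold for several consecutive samples, and re-arms after the metric has fully recovered. Everything named here is an assumption for the example: `TrapTrigger` is a made-up class, `send_trap` is a hypothetical callable standing in for whatever trap mechanism the HPOV team uses, and the metric could be something like the page scan rate (the `sr` column of `vmstat` on Solaris) with a threshold tuned per host.

```python
from collections import deque


class TrapTrigger:
    """Fire send_trap once when a metric stays above a threshold
    for `consecutive` samples in a row; re-arm after full recovery."""

    def __init__(self, threshold, consecutive, send_trap):
        self.threshold = threshold
        self.recent = deque(maxlen=consecutive)  # True = sample was above threshold
        self.send_trap = send_trap               # hypothetical trap sender
        self.fired = False

    def observe(self, value):
        """Record one sample; return True while the alert is active."""
        self.recent.append(value > self.threshold)
        sustained = len(self.recent) == self.recent.maxlen and all(self.recent)
        if sustained and not self.fired:
            self.send_trap(value)
            self.fired = True
        elif not any(self.recent):
            # Every remembered sample is back under the threshold: re-arm.
            self.fired = False
        return self.fired
```

Requiring several consecutive samples avoids a trap on a single spike, and firing only once per episode keeps the NMS from being flooded while the host is struggling.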

Hope this helps.

Cheers.

solaris_1977 · June 26, 2019, 10:27pm

This is good information to pass on to them.
But in this situation, when the system is dead slow on login (or completely dead on login), which processes can traps be set on? I am trying to understand the concept of setting up traps. Does a process log in to the server (with some default credentials) and come back with a response?
If we assume a soft failure of the NIS service (ypwhich/ypbind), that is not what is happening here. As soon as the server goes into the hung state, login does not proceed after the password is provided (or sometimes it does not ask for the password after the username is given).

Neo · June 26, 2019, 10:31pm

You have to monitor the system and send out the traps on the critical processes you want to monitor before the system soft-fails.

It's that simple.

solaris_1977 · June 26, 2019, 10:39pm

I got your point. But it is probably difficult to know which process will start failing before the server starts choking. I will look through the messages file to see whether it shows any failure message right before the server hangs.
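For going through the messages file, a small helper like the one below might narrow things down to the minutes before the hang. It is only a sketch under assumptions: it expects classic syslog timestamps at the start of each line (for example `Jun 26 14:20:00`), takes the year from the hang time because syslog lines carry none, and `lines_before` is a name invented for this example.

```python
from datetime import datetime, timedelta


def lines_before(lines, hang_time, minutes=30):
    """Return the log lines stamped within `minutes` before hang_time."""
    start = hang_time - timedelta(minutes=minutes)
    selected = []
    for line in lines:
        try:
            # The first 15 characters hold "Mon DD HH:MM:SS"; the year is
            # borrowed from hang_time since the log line does not include it.
            stamp = datetime.strptime(
                f"{hang_time.year} {line[:15]}", "%Y %b %d %H:%M:%S"
            )
        except ValueError:
            continue  # continuation line or unexpected format
        if start <= stamp <= hang_time:
            selected.append(line.rstrip("\n"))
    return selected
```

On Solaris the file would normally be `/var/adm/messages`, so one could pass `open("/var/adm/messages")` as `lines` together with the approximate time of the reset. Whatever shows up repeatedly in that window is a candidate for a trap condition.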

Neo · June 27, 2019, 12:51am

You must develop your own detection algorithms.

Nothing is difficult if you put your mind to it.

Your brain has roughly 86 billion neurons. That is on the order of the number of stars in our galaxy.

Algorithms are not difficult!

Just do it.