Scripts for monitoring an error code in real time

JOHNSON · October 6, 2003, 5:12am

Hi! I wish to extract one piece of information (an error code) from a machine that is running on the HP-UX platform. Can anyone help me?

I want to write a script that monitors the error code in real time and transfers the information out of the machine. The information is then processed and sent out to the user by SMS.

Perderabo · October 6, 2003, 9:52am

What error message? The best you can reasonably do is to check for the error, say, once every 5 minutes or so. You can check as often as you like within reason, but you don't want to overload the machine.

By sms, I assume that you want to send a very short text message to a particular email address?

JOHNSON · October 6, 2003, 11:01am

Hi Perderabo,
Many thanks for your quick reply. When the machine encounters an error, it needs some human assistance. Therefore I need to monitor the errors it generates continuously and in real time, because I want to notify the users so that they can address the error promptly and the downtime of the machine can be minimized.

How can I overload the machine by monitoring the machine error code?

The user will be notified by SMS using their email address/mobile phone.

Perderabo · October 6, 2003, 12:55pm

You still have not specified what errors you are looking to monitor. I'll guess that you want to monitor /var/adm/messages for anything.

From your thread in the Linux forum, I gather that you are writing in C++. You can use fstat() or stat() to see if the file has been changed. But you must repeatedly invoke fstat(). Let's say that you want your program to notice a new error message in less than a millisecond. Then 1000 times each second you must fstat() that file. That is quite a load for the box. Even HP's fastest cpu will spend a high percentage of its time doing that. Most realtime applications can tolerate a millisecond delay so I'm guessing that this would be enough. Now suppose that you can tolerate a one second delay between the arrival of the error message and your program noticing it. That would reduce the load on the box by a factor of 1000. At this point the cpu can do some other things. Most of us will actually tolerate 2 or 3 minutes in a situation like this. Some of us would even go as high as 5 minutes. But it's your box, you can loop as fast as you want.
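As a rough illustration of the polling trade-off described above, here is a minimal sketch in Python rather than C++. The function names, the polling interval, and the idea of comparing modification time and size are assumptions made for the example, not something taken from the original poster's program.

```python
import os
import time


def file_signature(path):
    """Return (mtime, size) from stat(); a change in either suggests new data."""
    st = os.stat(path)
    return (st.st_mtime, st.st_size)


def wait_for_change(path, interval_seconds, max_polls):
    """Poll path with stat() every interval_seconds, at most max_polls times.

    Returns True as soon as the signature differs from the one seen at
    the start, or False if max_polls polls pass without a change.
    """
    baseline = file_signature(path)
    for _ in range(max_polls):
        time.sleep(interval_seconds)
        if file_signature(path) != baseline:
            return True
    return False


def polls_per_day(interval_seconds):
    """Number of stat() calls per day for a given polling interval."""
    return 86400 / interval_seconds
```

With these definitions, polls_per_day(1) is 86400 stat() calls a day while polls_per_day(300) is 288, which is the kind of difference in load the interval choice makes.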

JOHNSON · October 7, 2003, 12:27am

How can I prevent the machine from overloading if I wish to monitor the status in real time?

Currently we are downloading the errorlog from the machine every 30 minutes. The errorlog is copied out of the machine over the network.

Perderabo · October 7, 2003, 12:35am

I would write a script to examine errlog and if it finds something, it will email someone. It would do this once, it would not loop. Then I would use cron to schedule it to run every five minutes.
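A run-once checker of that kind might look roughly like the sketch below. It is written in Python purely for illustration. The state file, the "ERROR" keyword, and the function names are all hypothetical, since the thread never says what the error lines look like.

```python
import os
import sys


def read_new_lines(log_path, state_path):
    """Return lines appended to log_path since the previous run.

    The byte offset reached last time is kept in state_path, so each
    invocation only looks at new material.
    """
    try:
        with open(state_path) as fh:
            offset = int(fh.read().strip() or 0)
    except (OSError, ValueError):
        offset = 0  # first run, or unreadable state: start from the top

    if os.path.getsize(log_path) < offset:
        offset = 0  # log was truncated or replaced; start again

    with open(log_path, "rb") as fh:
        fh.seek(offset)
        data = fh.read()
        new_offset = fh.tell()

    with open(state_path, "w") as fh:
        fh.write(str(new_offset))

    return data.decode("ascii", "replace").splitlines()


def run_once(log_path, state_path, keyword="ERROR"):
    """One pass: return new lines containing the (assumed) error marker."""
    return [line for line in read_new_lines(log_path, state_path)
            if keyword in line]


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (hypothetical): check_errlog.py LOGFILE STATEFILE
    for line in run_once(sys.argv[1], sys.argv[2]):
        print(line)
```

The script does one pass and exits, so the scheduling is left entirely to cron. On many systems cron mails any output of a job to the owner of the crontab, so printing the matching lines may already be enough to trigger a mail; check your own cron's documentation for that and for the exact schedule syntax it accepts.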

JOHNSON · October 7, 2003, 1:17am

The main idea is to notify the user when the machine encounters an error so that the user can address the problem promptly.

If I do not loop, the user will not know immediately that there is a problem with the machine.

Can you shed some light on this and guide me to a better solution?

If I am unable to add or change any scripts on the machine side, can I use another PC (an external source) to interrogate the machine status by monitoring the path where the error is kept, copying the error code, and sending the message to the user through SMS?

Perderabo · October 7, 2003, 10:02am

Note that I said "I would use cron to schedule it to run every five minutes". If an error occurs, cron will run the script no more than 5 minutes later. I think that an average wait of 2.5 minutes is acceptable.

What is the difference between a script running continuously, looping forever (but with a "sleep 300" statement in the loop), and cron running a script every 5 minutes? The only one I can think of is that if your script dies, you lose monitoring if it was looping but not if cron is running it.
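For comparison, the looping variant could be sketched like this in Python. The check function is a stand-in for whatever actually inspects the log, and the try/except is one assumed way of keeping a single failed pass from ending the monitoring, which is the weakness of a plain loop noted above.

```python
import time


def run_loop(check, interval_seconds, iterations):
    """Call check() every interval_seconds, iterations times.

    An exception raised by check() is recorded instead of ending the
    loop. Returns the list of exceptions that were caught.
    """
    failures = []
    for _ in range(iterations):
        try:
            check()
        except Exception as exc:
            failures.append(exc)
        time.sleep(interval_seconds)
    return failures
```

Even with this guard, nothing restarts the loop if the process itself is killed, whereas cron starts a fresh process at every interval.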

You could even have cron run the script once a minute. Will a one minute maximum wait (30 seconds average) really kill you?

When I send an sms message to my phone, it can take several minutes to arrive. How fast is sms for you? And how fast can you read the message and take corrective action?

It does not make sense to require your script to notice an error in a millisecond and then use sms to notify someone.

If your application is that critical, I suggest sending the error messages to a terminal. Arrange to have someone onsite 24 hours a day, 7 days a week staring intently at the terminal. When the error message occurs, the person can respond in seconds.

We actually do that, except that the person is monitoring about 30 terminals. He has procedures to handle some things, but most problems involve calling someone...and he must actually talk to someone, he does not simply send a page or email.

And even then, the scripts that send the error messages to most of those terminals do indeed involve a delay of a minute or two.

JOHNSON · October 7, 2003, 1:21pm

Many thanks for your advice. You definitely have a point.

Do you mean that I can have cron run the script every minute?
And will I actually be able to receive the error from the machine on my PC in less than a minute? The time for the information to be sent by SMS will depend on the transmission network itself, right?

If my SMS application is written in Visual Basic.net, do you think it is possible to integrate it with Unix/Linux?

Perderabo · October 7, 2003, 3:08pm

I'd be surprised if you can run visual basic on Unix. But you will need a Microsoft expert to be sure.

Yes, cron can run a script once a minute.

To unix, the sms is just a mail message. It will get passed from computer to computer until it reaches your phone company's computer. Only then does it become sms in a true sense. And mail is seldom treated as a high priority. Your local computer will try to contact your local mailhost. If your mailhost is too busy, it will reject the connection. Then your local computer will just queue the mail. Once an hour or once a half hour, your computer will try to send stuff in the queue. It typically will take days before it gives up and returns the mail to sender. You can adjust your own settings. But you cannot adjust everyone else's. Mail really should not be used for emergencies.
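To make the "sms is just a mail message" point concrete, here is a small Python sketch that builds a short mail and hands it to a mailhost. The addresses, the subject line, the 160-character cut-off, and the use of "localhost" as the mailhost are assumptions for illustration; the real gateway address would come from the phone company.

```python
import smtplib
from email.message import EmailMessage

SMS_LIMIT = 160  # assumed length limit for a single text message


def build_alert(sender, recipient, text):
    """Build a short plain-text mail aimed at a mail-to-SMS gateway."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "machine error"
    msg.set_content(text[:SMS_LIMIT])  # keep the body short enough for a phone
    return msg


def send_alert(msg, mailhost="localhost"):
    """Hand the message to the mailhost; onward delivery is out of our hands."""
    with smtplib.SMTP(mailhost) as smtp:
        smtp.send_message(msg)
```

Note that send_alert() returning without error only means the mailhost accepted the message. It says nothing about when, or whether, the text reaches the phone.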

When I use sms, it is for messages like "part 2 of the big rsync job is done". As long as the messages arrive, I know that my job is progressing. If a message is overdue, then I can check the status.