False alerts

Hi

I have written a script that sends email alerts when the load on my Linux server reaches its maximum.
I keep getting false emails even though the load is normal. It looks like the same email is generated again and again. The script is called from crontab.

I checked whether the temp file is present. It is not; it is cleaned up every time the script runs.

What could be the issue?

Please suggest.

Hello,

In order for anyone to have a chance of helping you diagnose this, you'll have to provide more information. Without actually seeing the script, it isn't really possible to say what the problem might be. If you could also provide the full 'crontab' entry that is used to run the script, that would be good too. Without these things, anything anyone says is just going to be pure guesswork, which is best avoided if possible.


<joke>
"If there's an error, it's probably on line 42." [(c) Hitchhiker's Guide to the Galaxy]
</joke>

#!/bin/sh
# Script to send email alerts to mail box if cpu is more than 90% utilization
LIMIT=90
ALERT="monitoringbox@abc.com"
TEMPFILE=/tmp/temp1
HOSTNAME=`hostname`
rm -f $TEMPFILE
CPU_LOAD=`sar -P ALL 10 1 |grep 'Average' |awk -F" " '{print 100.0 -$NF}' |cut -d \. -f1`
if [[ $CPU_LOAD -gt $LIMIT ]];
then
echo "CPU is high on $HOSTNAME " >> $TEMPFILE
fi
if [ -e $TEMPFILE ]
then
mail -s " CPU ALERT  " $ALERT < $TEMPFILE
fi
rm -f $TEMPFILE

The above script works as desired and is run from crontab every 5 minutes.
However, I also get false alerts, for example a "check CPU" alert when the CPU is at 1%.

My Linux team thinks it is a Nagios issue.

I don't have sar on the system I use, so it is hard to guess at what might be going on. Furthermore, since you don't preserve the value that is triggering your mail message, we have even less information.

You could reduce the load on your system a little bit, preserve the CPU load value, and get rid of the temp file by changing:

ALERT="monitoringbox@abc.com"
TEMPFILE=/tmp/temp1
HOSTNAME=`hostname`
rm -f $TEMPFILE
CPU_LOAD=`sar -P ALL 10 1 |grep 'Average' |awk -F" " '{print 100.0 -$NF}' |cut -d \. -f1`
if [[ $CPU_LOAD -gt $LIMIT ]];
then
echo "CPU is high on $HOSTNAME " >> $TEMPFILE
fi
if [ -e $TEMPFILE ]
then
mail -s " CPU ALERT  " $ALERT < $TEMPFILE
fi
rm -f $TEMPFILE

to:

ALERT="monitoringbox@abc.com"
HOSTNAME=`hostname`
CPU_LOAD=`sar -P ALL 10 1 | awk -F" " '/Average/{print int(100-$NF)}'`
if [[ $CPU_LOAD -gt $LIMIT ]]
then	echo "CPU is high ($CPU_LOAD) on $HOSTNAME" | mail -s " CPU ALERT" "$ALERT"
fi

Please try this and let us know what mail message you get when the CPU load is not high.

Thank you, Don.

Yes, the script I actually use does provide the CPU value, hostname, and date.
Sorry, I posted my locally saved code, an early draft that was missing the CPU usage value in the email.
The live script calculates and sends the current CPU value when the limit is breached.
I feel that a bad email is saved somewhere and gets triggered once in a while.
I checked /var/spool, root's mail, and other folders.
Is there any other place to check and fix?

OK. So we know that something isn't working correctly. And, we know that the code you showed us is not the code you're using. So, we know that we have absolutely no idea what is going on. Sorry, but given these conditions, I have absolutely no idea what, if anything, needs to be fixed nor where to look -- other than at your actual code.

Hi,

Also, if you could please supply the full crontab entry that's being used on the live server itself to run the script every five minutes, that would be good too (i.e. the entire line you see in the output of crontab -l on the live server that concerns the script in question).

One other aside: you mentioned way back in your second response that your "Linux team" thinks it's a Nagios issue. If this server is being monitored by Nagios, or if it can be monitored by Nagios, then there are nrpe or check_mk plugins that can monitor server load directly without you having to write a script of your own. If you don't know much about it, then basically nrpe and check_mk are pieces of software that can run on a server that's being monitored by Nagios to allow more complex checks than "is it up or down" to be carried out.

Whoever is responsible for the Nagios monitoring system at your site should be able to help you with that. Load monitoring is one of the most commonly-used Nagios plugins, so if you can run either nrpe or check_mk on this server that would definitely be the best way to go here, rather than rolling your own script.

#!/bin/sh
#description ...
THRESHOLD=90
ALERT="monitoringbox@abc.com"
TEMPFILE=/tmp/temp1
HOSTNAME=`hostname`
rm -f $TEMPFILE
CPU_LOAD=`sar -P ALL 10 1 |grep 'Average.*all' |awk -F" " '{print 100.0 -$NF}' |cut -d \. -f1`
if [[ $CPU_LOAD > $THRESHOLD ]];
then
echo "CPU notification on $HOSTNAME is ${CPU_LOAD}% " `date`  >> $TEMPFILE
fi
if [ -e $TEMPFILE ]
then
mail -s "Check CPU usage on $HOSTNAME `date`  " $ALERT < $TEMPFILE
fi

Hello,

On my own test system (which is running Ubuntu 16.04 LTS x86_64) this script does basically appear to work. At any rate, it certainly doesn't generate any false positives for me when run at the shell as a non-privileged user, and the values it's getting for load appear to be genuine and sensible.

So, if you could provide the full crontab entry from which the script is run, then it'll be possible to look at that as a source of the issues next.

Again though: if you do have a Nagios environment, using its own load monitoring plugins is a much, much better idea. Honestly, if you have a Nagios server and the ability to add your server to it, you're just re-inventing the wheel here for no real gain whatsoever by writing your own script.


Thank you, drysdalk.

Exactly, I wanted to make sure my thinking was correct:

my script is interfering with Nagios in some way, and that is causing the false alarms.

Maybe I need to take the script out of the cron job and run it from cron.hourly instead?

Would it be a good idea to add the script to cron.hourly and have it execute every 5 minutes?

Hi,

I doubt a script this simple could cause any problems for Nagios. What monitoring is already configured in Nagios ? Are these load alerts that you regard as false coming as e-mails from your script, or as alerts from Nagios ? And can you please provide the crontab entry so it can be ruled out as a cause ?

I'd suggest changing this:

if [[ $CPU_LOAD > $THRESHOLD ]];

to

if [ $CPU_LOAD -gt $THRESHOLD ];

Also, I'd debug what the value of CPU_LOAD actually is at the time the email is sent.
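One possible way to capture that value is to append a line to a log file on every run, so a suspicious email can later be matched against what was actually measured. This is only a sketch: log_sample and the LOGFILE path are hypothetical names, not part of the original script, and uname -n is used here in place of hostname.

```shell
#!/bin/sh
# Hypothetical debug helper: one log line per run of the monitoring script.
# LOGFILE is an assumed location; pick any file the cron user can write to.
LOGFILE="${LOGFILE:-/tmp/cpu_alert_debug.log}"

log_sample() {
    # $1 = measured CPU load, $2 = threshold in effect
    printf '%s host=%s load=%s threshold=%s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$(uname -n)" "$1" "$2" >> "$LOGFILE"
}

# In the real script this would be called right after CPU_LOAD is computed:
#   log_sample "$CPU_LOAD" "$THRESHOLD"
```

If an alert arrives while the log shows a low value for the same minute, the email did not come from this run of the script.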

Another question: you're doing this and then checking for the existence of the file later:

echo "CPU notification on $HOSTNAME is ${CPU_LOAD}% " `date`  >> $TEMPFILE

Sounds like after the FIRST triggered condition, the file will be APPENDED to and any subsequent run of the script will trigger an email.
Don't you want to remove the file AFTER the condition has been triggered?
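If a temp file is kept at all, a rough sketch like the following avoids any dependence on leftovers from an earlier run. It assumes mktemp is available; the cpu_alert.XXXXXX template, the ALERT_PENDING variable, and the placeholder CPU_LOAD value are made up for illustration, and the mail call is left out.

```shell
#!/bin/sh
# Sketch only: a private, per-run temp file that is removed when the script exits.
TEMPFILE=$(mktemp "${TMPDIR:-/tmp}/cpu_alert.XXXXXX") || exit 1
trap 'rm -f "$TEMPFILE"' EXIT

CPU_LOAD=1       # placeholder; the real script derives this from sar
THRESHOLD=90

if [ "$CPU_LOAD" -gt "$THRESHOLD" ]; then
    # '>' truncates the file, so nothing from an earlier run can survive
    echo "CPU notification: ${CPU_LOAD}%" > "$TEMPFILE"
fi

# -s is true only if the file exists and is not empty
if [ -s "$TEMPFILE" ]; then
    ALERT_PENDING=yes    # the real script would call mail here
else
    ALERT_PENDING=no
fi
```

Because mktemp creates the file up front, the later check uses -s (non-empty) rather than -e (exists).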

Hi drysdalk


Yes, Nagios is already monitoring the load. The crontab entries are:

0,15,30,45 * * * * /unixmon/servermon.p > /dev/null 2>&1
*/5 * * * *  /etc/applicationMonitoring.sh

---------- Post updated at 03:04 PM ---------- Previous update was at 02:59 PM ----------

I have placed the rm command at the top. Will it make a difference?

Hi,

In that case, if Nagios is already monitoring the load of your server...what is it you're hoping to achieve by running your own separate load monitoring script ? What does it do differently from what the Nagios check does, and would it not be possible to amend the Nagios check to do whatever you want so you only have one single check ?

Hi

We need to get email alerts and monitor our app without any delay :)

You don't need a temp file. How about this for the trailing portion of your script:

CPU_LOAD=`sar -P ALL 10 1 |grep 'Average.*all' |awk -F" " '{print 100.0 -$NF}' |cut -d \. -f1`
if [ $CPU_LOAD -gt $THRESHOLD ]; then
    echo "CPU notification on $HOSTNAME is ${CPU_LOAD}% " `date`  | mail -s "Check CPU usage on $HOSTNAME `date`  " $ALERT
fi

I can change and test :)

Hi,

(Edit: forgot to include the date in the mail)

Here's a version of your script that's as streamlined as I've been able to make it:

#!/bin/bash

hostname="`/bin/hostname`"
date="`/bin/date`"
load="`/usr/bin/sar -P ALL 10 1 | /usr/bin/awk '$1 == "Average:" && $2 == "all" {print 100-$NF}'`"

threshold="90.00"
recipient="unixforum@localhost"
subject="Load alert on host $hostname"
body="Load is $load, date is $date"

if [[ "$load" > "$threshold" ]]
then
        echo "$body" | /usr/bin/mail -s "$subject" "$recipient" >/dev/null 2>/dev/null
        exit 1
else
        exit 0
fi

Again in my own local tests this worked fine, but then so did your original. You may need to amend paths to things like sar, awk, etc (it's always a good idea to use fully-qualified paths in scripts that will be run via crontab).
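As an alternative to hard-coding every path, a script run from cron can set its own PATH and check up front that the tools it needs can be found. The sketch below assumes the usual system directories; require_cmds is a hypothetical helper name.

```shell
#!/bin/sh
# Assumed directories; adjust to wherever sar, awk and mail live on the host.
PATH=/usr/local/bin:/usr/bin:/bin${PATH:+:$PATH}
export PATH

# Hypothetical helper: report every command that cannot be found on PATH
# and return non-zero if at least one is missing.
require_cmds() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing command: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# In the monitoring script this might be used as:
#   require_cmds sar awk mail || exit 2
```

Failing early with a clear message is easier to spot in cron's mail than a pipeline that silently produces an empty value.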

Hope this helps.


Hi vgersh99,
Good catch on the -gt versus > test. Without that change, it could miss reporting that the CPU load was 100% (but it still shouldn't have caused any false high load reports). Note that the script we were shown in post #4 didn't have this bug.
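To illustrate the difference between the two tests, here is a minimal sketch. is_over is a hypothetical helper, not something from the scripts in this thread; it also rejects empty or non-numeric input, which is an extra assumption about what you would want when sar prints nothing.

```shell
#!/bin/sh
# Hypothetical helper: returns 0 (true) only when both arguments are plain
# non-negative integers and the first is numerically greater than the second.
is_over() {
    case "$1" in ''|*[!0-9]*) return 2 ;; esac   # empty or non-numeric load
    case "$2" in ''|*[!0-9]*) return 2 ;; esac   # empty or non-numeric threshold
    [ "$1" -gt "$2" ]
}

# Inside [[ ]], '>' compares strings character by character, so "100" sorts
# before "90" (because "1" < "9") and a 100% load would not raise an alert.
# -gt compares the numeric values instead:
#   is_over 100 90   -> true
#   is_over 9 90     -> false
#   is_over "" 90    -> status 2, e.g. when sar produced no output
```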

Note the rm -f $TEMPFILE line near the top of the script in post #9. I agree wholeheartedly that the temp file is not needed (and suggested removing it back in post #5 in this thread), but the temp file is removed before it is appended to in the code you're questioning, so that shouldn't have caused any false high load reports either (assuming the code shown to us in post #9 is the actual code being run).



Hi anil529,
Why don't you also make the changes I suggested in post #5 in this thread (where I also proposed getting rid of the temp file) and get rid of two unneeded processes that are adding unneeded load to the system you're trying to monitor?

If you decide to try drysdalk's suggestion instead, at least note that you must change the:
```text
if [[ "$load" > "$threshold" ]]
```

to:
```text
if [[ "$load" -gt "$threshold" ]]
```

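as noted above by vgersh99 to keep from missing reports if the load reaches 100%. Note that -gt only works on integers, so the awk command would also need to print an integer (for example int(100-$NF), as in post #5) and threshold would need to be 90 rather than 90.00.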
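If the decimal places in the load value are to be kept, one option is to let awk do the numeric comparison, since the shell's -gt accepts integers only. This is just a sketch; float_over is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: exit status 0 when $1 is numerically greater than $2.
# Adding 0 forces awk to treat both values as numbers rather than strings.
float_over() {
    awk -v load="$1" -v limit="$2" 'BEGIN { exit !(load + 0 > limit + 0) }'
}

# Possible use in the script, with load and threshold as set there:
#   if float_over "$load" "$threshold"; then ... send mail ... fi
```

This costs one extra awk process per run, so truncating to an integer inside the existing awk command is the lighter choice when decimals are not needed.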