Performance issue or something else?

Hi All,

I have the following script, which I use in Nagios to check the health of our applications. The problem is that after running for 2-3 hours, the curl part ($TOTAL) stops returning anything. The script runs fine from the command line, just not from Nagios.

This script is invoked about 17 times per second. I wonder if it's the system not being able to handle that many curls at the same time, though load and resource usage on the box look fine.
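
For what it's worth, something like this should reproduce the same parallel load outside of Nagios (check_url.sh and urls.txt are just placeholders for the script and one of the real URL files):

# fire 17 checks at once, roughly what Nagios does each second
for i in $(seq 1 17); do
    ./check_url.sh urls.txt &
done
wait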

Can someone help me work out whether this is a performance problem with the script or something else?

Thanks,
Jack

#!/bin/bash

# $1 is a file containing the URL to check
read URL < "$1"

# Fetch the URL (3s connect/total timeouts); perl checks the body for
# "alive" and pulls out the total_time line that curl appends via -w
TOTAL=$(curl -w '\ntotal_time=%{time_total}s' -s -m 3 --connect-timeout 3 "$URL" |
    perl -n0e '$s=/"alive"/?"OK":"ERROR";($t)=/(total_time.+)/;print "$s $t;0;0;0\n"')

echo "$(date): $TOTAL : $URL" >> /tmp/curl.log

STATUS=$(echo "$TOTAL" | awk '{print $1}')
PERF=$(echo "$TOTAL" | awk '{print $2}')

echo "$STATUS|$PERF"
echo "curl $URL"

# Map the status word to the Nagios exit code
case $STATUS in
    OK)   exit 0 ;;
    WARN) exit 1 ;;
    *)    exit 2 ;;   # ERROR, FATAL, or anything unexpected
esac

Here is the curl.log output from the point where it was still returning data and then stops returning anything:

Wed Oct 6 15:11:41 PDT 2010: OK total_time=0.015s;0;0;0 : http://URL
Wed Oct 6 15:11:41 PDT 2010: OK total_time=0.021s;0;0;0 : http://URL
Wed Oct 6 15:11:41 PDT 2010: OK total_time=0.016s;0;0;0 : http://URL
Wed Oct 6 15:11:41 PDT 2010: OK total_time=0.017s;0;0;0 : http://URL
Wed Oct 6 15:11:41 PDT 2010: OK total_time=0.024s;0;0;0 : http://URL
Wed Oct 6 15:11:41 PDT 2010: OK total_time=0.017s;0;0;0 : http://URL
Wed Oct 6 15:11:42 PDT 2010: : http://URL
Wed Oct 6 15:11:42 PDT 2010: : http://URL
Wed Oct 6 15:11:42 PDT 2010: : http://URL
Wed Oct 6 15:11:42 PDT 2010: : http://URL

This looks to me like something you could get from ping; curl is pretty heavy-duty if all you want is response times in milliseconds.
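
For example, a single ping already reports round-trip time in milliseconds (the host is a placeholder):

  ping -c 1 -W 3 app-server.example.com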

I am guessing you are trying to monitor network "health". There are excellent open source tools for that, like Nagios:

Nagios - The Industry Standard in IT Infrastructure Monitoring

To answer your question, it looks like the remote service is not responding, and you are not checking exit codes from curl.
For example, an HTTP failure can return exit code 22 (with the -f flag). curl has a lot of error codes, and they are very important. You need to un-one-liner your code: capture the exit code, redirect the output to a file, check and report errors first, and only then parse what is in the file. I cannot tell why things are failing at this point.

  curl -w '\ntotal_time=%{time_total}s' -s -m 3 --connect-timeout 3 "$URL" > somefile
  RC=$?
  if [ $RC -ne 0 ] ; then
     # play with error code reporting here, e.g.:
     echo "ERROR curl exit code $RC"
     exit 2
  fi
  # perl code here..... that reads somefile instead of stdin
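
If you want the error reporting to be more descriptive, a few of the exit codes you are most likely to hit here could be translated inside that if block, something like this (just a sketch; the full list is in the EXIT CODES section of the curl man page):

  case $RC in
      6)  MSG="could not resolve host" ;;
      7)  MSG="failed to connect to host" ;;
      28) MSG="operation timed out" ;;
      *)  MSG="curl exit code $RC" ;;
  esac
  echo "ERROR $MSG"
  exit 2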