Performance of log parsing shell script very slow

Hello,
I am an absolute newbie and whatever I've written in the shell script (below) has all been built with generous help from googling the net and this forum. Please forgive any schoolboy mistakes.

Now to the question: my input file looks like this -

2009:04:03 08:21:41:513,INFO ,servername,yyyy,undefined,,INFO,null,.....Out:xxService,Start - SvcName: OIS - EndUserName: - RequestId: null20090403082313
2009:04:03 08:21:41:775,INFO ,servername,yyyy,undefined,,INFO,null,.....Out:xxService,End - SvcName: OIS - EndUserName: - RequestId: null20090403082313 - StatusCode: 0 - StatusText: Success
2009:04:03 08:21:45:660,INFO ,servername,yyyy,undefined,,INFO,null,.....Out:xxService,Start - SvcName: VCC - EndUserName: - RequestId: 411111111111111120090403082318
2009:04:03 08:21:46:171,INFO ,servername,yyyy,undefined,,INFO,null,.....Out:xxService,End - SvcName: VCC - EndUserName: - RequestId: 411111111111111120090403082318 - StatusCode: 0 - StatusText: Success
2009:04:03 08:21:49:583,INFO ,servername,yyyy,undefined,,INFO,null,.....Out:xxService,Start - SvcName: CO - EndUserName: - RequestId: 20090403082321
2009:04:03 08:22:03:571,INFO ,servername,yyyy,undefined,,INFO,null,.....Out:xxService,End - SvcName: CO - EndUserName: - RequestId: 20090403082321 - StatusCode: 0 - StatusText: Success

From this I have to find out the response time (end time minus start time) for each request id for the SvcName CO. I have written this shell script (not finished though), but the performance is very slow: it takes about a minute to process a 100-line file. Can you please point me in the right direction to improve the performance? Although not the exact subject of this post, any pointers on calculating the difference between the start and end times would also be quite helpful.

#!/bin/bash
#script to extract resptime timings for CO calls from logfile
#////////////////////////////////////
#if no command line args
if [ $# -ne 1 ]; then
    echo "Oops......usage is wrong: $0 <tgtsrchfile>" 1>&2
    exit 2
fi
#assigning command line params to variables
srchFN=$1
#remove resptime.log and temp files if already present
checkfile="./resptime.log"
tempfile="./temp.log"
tempfile1="./temp1.log"
if [ -e "$checkfile" ]; then
    rm -f "$checkfile"
fi
if [ -e "$tempfile" ]; then
    rm -f "$tempfile"
fi
if [ -e "$tempfile1" ]; then
    rm -f "$tempfile1"
fi
#if target search file not present
if [ ! -r "$srchFN" ]; then
    echo "Target search file $srchFN not present"
    exit 2
fi
#grep for request id
grep 'Start.*CO' "$srchFN" | awk -F "RequestId: " '{print $2}' > temp.log
#for each request id get start time and end time and print into temp file
while read -r line; do
    #skip empty lines
    if [ -n "$line" ]; then
        sttime=`grep "Start.*CO.*$line" "$srchFN" | awk -F "," '{print $1}' | awk '{print $2}'`
        endtime=`grep "End.*CO.*$line" "$srchFN" | awk -F "," '{print $1}' | awk '{print $2}'`
        if [ -n "$sttime" -o -n "$endtime" ]; then
            echo "$line,$sttime,$endtime" >> temp1.log
        fi
    fi
done < temp.log
        
#/////////////////////////////////////

Thanks in advance

a minor point on this line -- the multi-character awk delimiter itself is fine, but the filename variable should be quoted:

grep 'Start.*CO' $srchFN | awk -F "RequestId: " '{print $2}'>temp.log

should be something like:

grep 'Start.*CO' "$srchFN" | awk -F "RequestId: " '{print $2}' > temp.log

it looks like the problem is that you're rereading the whole $srchFN twice for
every line of your temp.log, through those two grep commands.

You'll need to reformat your logic so that these extra greps are NOT even done.

Try to rethink things so that you know what you're looking for while only
reading the file once.
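To make that concrete, here is a rough single-pass sketch in awk, assuming the log layout shown above: Start lines store their timestamp in an array keyed on the RequestId, and each End line looks its partner up, so the log is read exactly once. Two hard-coded sample lines stand in for the real file.

```shell
# Single-pass sketch: no per-request greps, the whole log is read once.
# The two printf'd sample lines stand in for the real log file.
printf '%s\n' \
  '2009:04:03 08:21:49:583,INFO ,servername,yyyy,undefined,,INFO,null,.....Out:xxService,Start - SvcName: CO - EndUserName: - RequestId: 20090403082321' \
  '2009:04:03 08:22:03:571,INFO ,servername,yyyy,undefined,,INFO,null,.....Out:xxService,End - SvcName: CO - EndUserName: - RequestId: 20090403082321 - StatusCode: 0 - StatusText: Success' |
awk -F, '
/Start - SvcName: CO/ {
    split($0, a, "RequestId: ")        # a[2] is the request id
    start[a[2]] = $1                   # remember its start timestamp
}
/End - SvcName: CO/ {
    split($0, a, "RequestId: ")
    sub(/ - StatusCode.*/, "", a[2])   # strip the trailing status fields
    if (a[2] in start)
        print a[2] "," start[a[2]] "," $1
}'
```

With a real file you would drop the printf and give awk the filename as its last argument; turning the two timestamps into a difference is then a separate step.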

My input file has one line for the start of a request and one for the end of the request. And there are hundreds of unique requests. So I am struggling to think of how else I can get the start and end time with one grep, for each req id...

too many greps, too many awks, useless cats....

something like this? substitute 'd' for your log file:

cat d |
while read line ; do

#----------------------------------------------------------------------#
# Start time and service name.                                         #
#----------------------------------------------------------------------#
  if [[ $line = *Service,Start* ]] ; then
    echo $line |
      sed -e 's/^.*SvcName: //' -e 's/ - .*$//' |
      read service_name

    echo $line |
      sed -e 's/,INFO.*$//' |
      read start_time
  fi

#----------------------------------------------------------------------#
# End time.                                                            #
#----------------------------------------------------------------------#
  if [[ $line = *Service,End*SvcName*$service_name*EndUser* ]] ; then
    echo $line |
      sed -e 's/,INFO.*$//' |
      read end_time
    echo start-time $start_time end-time $end_time service-name $service_name
  fi
done

btw -- that'll really only work if each request has an end-time record immediately following it. if that's not the case, let me know.

there's another trick we'll have to use to preprocess the log file.

specifically, creating a key using the svcname and "Service,Start" "Service,End"
and sorting ASC,DESC based on that key.

Then processing the file again using my above script.
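A rough sketch of that key-and-sort preprocessing, assuming the log layout above: tag each line with a "RequestId marker" key, sort so every Start lands directly before its matching End, then strip the key off again. Four interleaved sample lines stand in for the real log file.

```shell
# Key-and-sort preprocessing sketch: afterwards every Start line is
# immediately followed by its matching End line, ready for the loop above.
printf '%s\n' \
  '2009:04:03 08:21:49:583,INFO,Out:xxService,Start - SvcName: CO - RequestId: 111' \
  '2009:04:03 08:21:50:100,INFO,Out:xxService,Start - SvcName: CO - RequestId: 222' \
  '2009:04:03 08:22:01:000,INFO,Out:xxService,End - SvcName: CO - RequestId: 222 - StatusCode: 0' \
  '2009:04:03 08:22:03:571,INFO,Out:xxService,End - SvcName: CO - RequestId: 111 - StatusCode: 0' |
awk -F 'RequestId: ' '{
    split($2, a, " ")                          # a[1] is the request id
    m = ($0 ~ /,Start - /) ? "Start" : "End"   # pairing marker
    print a[1], m, $0                          # key fields first
}' |
sort -k1,1 -k2,2r |    # id ascending, marker descending: Start before End
cut -d' ' -f3-         # drop the two key fields again
```
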

I would use Perl to calculate the date and time differences (using an external module Date::Manip):

#!/usr/bin/perl

use warnings;
use strict;

use Date::Manip;

my $infile = 'file';

open my $fh, '<', $infile or die "$infile: $!";

my ( $sdate, $edate, $err );

while (<$fh>) {
    if (/SvcName: CO/) {
        /Start -/ and ( $sdate = ( split ',' )[0] ) =~ s|:(.*?):|/$1/|;
        /End -/   and ( $edate = ( split ',' )[0] ) =~ s|:(.*?):|/$1/|;
    }
    if ( $sdate && $edate ) {
        print $sdate, " - ", $edate, "\n";
        print DateCalc( $sdate, $edate, \$err ), "\n";
        undef $sdate;
        undef $edate;    # reset so the next CO pair prints only once
    }
}

With your sample data it produces the following output:

% cat file
2009:04:03 08:21:41:513,INFO ,servername,yyyy,undefined,,INFO,null,.....Out:xxService,Start - SvcName: OIS - EndUserName: - RequestId: null20090403082313
2009:04:03 08:21:41:775,INFO ,servername,yyyy,undefined,,INFO,null,.....Out:xxService,End - SvcName: OIS - EndUserName: - RequestId: null20090403082313 - StatusCode: 0 - StatusText: Success
2009:04:03 08:21:45:660,INFO ,servername,yyyy,undefined,,INFO,null,.....Out:xxService,Start - SvcName: VCC - EndUserName: - RequestId: 411111111111111120090403082318
2009:04:03 08:21:46:171,INFO ,servername,yyyy,undefined,,INFO,null,.....Out:xxService,End - SvcName: VCC - EndUserName: - RequestId: 411111111111111120090403082318 - StatusCode: 0 - StatusText: Success
2009:04:03 08:21:49:583,INFO ,servername,yyyy,undefined,,INFO,null,.....Out:xxService,Start - SvcName: CO - EndUserName: - RequestId: 20090403082321
2009:04:03 08:22:03:571,INFO ,servername,yyyy,undefined,,INFO,null,.....Out:xxService,End - SvcName: CO - EndUserName: - RequestId: 20090403082321 - StatusCode: 0 - StatusText: Success
zsh-4.3.9[t]% ./s       
2009/04/03 08:21:49:583 - 2009/04/03 08:22:03:571
+0:0:0:0:0:0:14

Here is another way to solve your problem using ksh93 ....

#!/usr/bin/ksh93

LOGFILE=./logfile
SVCNAME="CO"

awk -v str="$SVCNAME" -F, '$10 ~ "SvcName: "str {
     split($10, arr1, " ")
     date1=$1
     getline
     date2=$1
     print arr1[4], arr1[9], date1, date2
     next
}' "$LOGFILE" |  while read svc id sdate stime fdate ftime
do
    sdate1=$(printf '%(%s.%3N)T' "${sdate//:/-} ${stime:0:8}.${stime:9}")
    fdate1=$(printf '%(%s.%3N)T' "${fdate//:/-} ${ftime:0:8}.${ftime:9}")

    diff=$(( fdate1 - sdate1 ))
    idiff=$(( int(diff) ))
    msecs=$(( 1000 * (diff - int(diff)) ))    # fractional part, in milliseconds
    secs=$(( idiff % 60 ))
    mins=$(( idiff % (60 * 60) / 60 ))
    hours=$(( idiff / (60 * 60) ))

    printf "%s %s - %s %s +%02d:%02d:%02d.%03d\n"  ${sdate//://} $stime ${fdate//://} $ftime $hours $mins $secs $msecs
done

exit 0

This is the output for your sample file:

2009/04/03 08:21:49:583 - 2009/04/03 08:22:03:571 +00:00:13.988
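If ksh93's printf '%(...)T' isn't available, the same end-minus-start arithmetic can be sketched in plain awk on the time-of-day fields alone. This only works while start and end fall on the same date; the two timestamps below are copied from the sample output above.

```shell
# Plain-awk sketch of the millisecond difference -- no ksh93 required.
# Only valid when the start and end timestamps share the same date.
awk 'BEGIN {
    split("08:21:49:583", a, ":")   # start time from the sample
    split("08:22:03:571", b, ":")   # end time from the sample
    ms = (b[1]-a[1])*3600000 + (b[2]-a[2])*60000 + (b[3]-a[3])*1000 + (b[4]-a[4])
    printf "+%02d:%02d:%02d.%03d\n", ms/3600000, ms%3600000/60000, ms%60000/1000, ms%1000
}'
```

Crossing midnight would need a real epoch conversion instead, e.g. gawk's mktime.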