Optimize awk code

SkySmart · February 5, 2016, 1:21pm

sample data.file:

0,mfrh_green_screen,1454687485,383934,/PROD/G/cicsmrch/sys/unikixmain.log,37M,mfrh_green_screen,28961345,0,382962--383934
0,mfrh_green_screen,1454687785,386190,/PROD/G/cicsmrch/sys/unikixmain.log,37M,mfrh_green_screen,29139568,0,383934--386190
0,mfrh_green_screen,1452858644,-684,/PROD/G/cicsmrch/sys/unikixmain.log,111M,mfrh_green_screen,732502,732502,,111849151,0,731818
0,mfrh_green_screen,1452858944,-888,/PROD/G/cicsmrch/sys/unikixmain.log,111M,mfrh_green_screen,732707,732707,,111918753,0,731819

Code i'm running against this file:

VALFOUND=1454687485
SEARCHPATT='Thu Feb 04'
awk "/,${VALFOUND},/,0" data.file | gawk -F, '{A=strftime("%a %b %d %T %Y,%s",$3);{Q=1};if((Q)&&(NF == 13)){split($4, B,"-");print B[2] "-" $3 "_0""-" $4"----"A} else if ((Q)&&(NF == 10)) {split($NF, B,"--");print B[2]-B[1] "-" $3 "_" $10"----"A}}' | egrep "${SEARCHPATT}" | awk -F"----" '{print $1}'

data.file is about 7MB in size and can grow much larger than that. When I run the above command on it, it takes about 6 seconds to complete. Is there any way to bring that number down?

RudiC · February 5, 2016, 1:55pm

I have to admit I can't follow the logic of your pipe. But I'm almost sure that all that (time-consuming) piping can be reduced to a single awk command.
You start listing lines at the epoch value 1454687485 and continue to end-of-file. Later you grep for Thu Feb 04. Why don't you operate on the lines with $3 between 1454626800 and 1454713199? That would save the first awk, the egrep, and, as the output of A is no longer needed, the last awk as well.
The (boolean) Q variable is redundant as well: it is set to 1 and never reset, so what is it for?
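
A minimal sketch of what such a single awk command might look like, assuming the start line can be recognised by comparing $3 with the start epoch (the original range pattern matched ",VALFOUND," anywhere in the line). The function name filter_day and its argument order are made up for illustration. The day bounds are simply the ones quoted above; they depend on the local timezone and would have to be recomputed for the day of interest.

```shell
# Sketch: one awk pass instead of awk | gawk | egrep | awk.
# Arguments (hypothetical interface): start epoch, lower bound, upper bound, file.
filter_day() {
  awk -F, -v start="$1" -v lo="$2" -v hi="$3" '
    $3 == start { seen = 1 }               # like /,VALFOUND,/,0: begin at the first matching line
    !seen || $3 < lo || $3 > hi { next }   # skip lines before the start or outside the bounds
    NF == 13 { split($4, B, "-");   print B[2] "-" $3 "_0-" $4 }
    NF == 10 { split($NF, B, "--"); print (B[2] - B[1]) "-" $3 "_" $10 }
  ' "$4"
}

# Usage: filter_day "$VALFOUND" 1454626800 1454713199 data.file
```

Since the time window is checked numerically, strftime is no longer needed and plain awk is enough.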

SkySmart · February 5, 2016, 10:48pm

Thanks RudiC. I took your suggestions into consideration and combined all those commands into one awk command. Thanks so much.

In doing the above, I discovered that the code I originally pasted in this thread is not the reason the script was slow. It is the for loop below, which takes at least 4 seconds to complete.

Can anyone help me optimize the code below?

Content of variable VALUESA:

VALUESA="1751-1451549113_0--1751
1445-1451549413_0--1445
1864-1451549713_0--1864
1410-1451550013_0--1410
655-1451550313_0--655
147-1451550613_0--147
209-1451550913_0--209
1472-1451551213_0--1472
1984-1451551513_0--1984
690-1451551813_0--690
652-1451552113_0--652
1161-1451552413_0--1161
1314-1451552713_0--1314
1030-1451553013_0--1030
428-1451553313_0--428
262-1451553613_0--262
95-1451553913_0--95"

The slow for loop:

                                        ZPROCC=$(
                                        for ALLF in $(echo ${VALUESA} | sort -r | xargs)
                                        do
                                                ALL=$(echo "${ALLF}" | gawk -F"-" '{print $1}') ; ZSCORE=$(gawk "BEGIN {if($STDEVIATE>0) {print (${ALL} - ${AVERAGE}) / ${STDEVIATE}} else {print 0}}")
                                                EPTIME=$(echo "${ALLF}" | gawk -F"-" '{print $2}' | awk -F"_" '{print $1}')
                                                FIXED=$(gawk -v c="perl -le 'print scalar(localtime("${EPTIME}"))'" 'BEGIN{c|getline; close(c); print $0;}')
                                                ACSCORE=$(echo ${FIXED} ${EPTIME} | gawk '{print "["$2"-"$3"-""("$4")""-"$5"]"}')
                                                echo "frq=${ALL},std=${ZSCORE},time=${ACSCORE},epoch=${EPTIME},avg=${AVERAGE}"
                                        done)

Scrutinizer · February 6, 2016, 3:07am

Hi, you did not specify the shell. Since you are using GNU utilities, I presumed it to be bash. This should be functionally equivalent, but a bit more efficient:

ZPROCC=$(
  while read ALLF 
  do
    IFS=_- read ALL EPTIME x <<< "$ALLF"
    ZSCORE=$(( STDEVIATE>0 ? ( ALL - AVERAGE ) / STDEVIATE : 0 ))
    read x mon day time year x <<< $(perl -le "print scalar(localtime($EPTIME))")
    ACSCORE="[$mon-$day-($time)-$year]"
    echo "frq=${ALL},std=${ZSCORE},time=${ACSCORE},epoch=${EPTIME},avg=${AVERAGE}"
  done <<< "$VALUESA"
)

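The output is left unsorted, because
$(echo ${VALUESA} | sort -r | xargs) yields the values in the same order as ${VALUESA}: the unquoted expansion joins everything onto one line, so sort -r sees a single line and has nothing to reorder.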
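
A small illustration of that point, using a made-up three-line variable VALUES rather than the real data. The second pipeline shows what would be needed if a reverse sort were actually wanted.

```shell
# Hypothetical sample data, one value per line.
VALUES="3-c
1-a
2-b"

# Unquoted expansion: echo prints everything on ONE line, so sort -r cannot reorder it.
unsorted=$(echo ${VALUES} | sort -r | xargs)

# Quoted expansion via printf keeps one value per line, so sort -r really sorts.
sorted=$(printf '%s\n' "$VALUES" | sort -r | xargs)

echo "$unsorted"   # 3-c 1-a 2-b
echo "$sorted"     # 3-c 2-b 1-a
```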

So, as is, it could be further reduced to:

ZPROCC=$(
  while IFS=_- read ALL EPTIME x
  do
    ZSCORE=$(( STDEVIATE>0 ? ( ALL - AVERAGE ) / STDEVIATE : 0 ))
    read x mon day time year x <<< $(perl -le "print scalar(localtime($EPTIME))")
    ACSCORE="[$mon-$day-($time)-$year]"
    echo "frq=${ALL},std=${ZSCORE},time=${ACSCORE},epoch=${EPTIME},avg=${AVERAGE}"
  done <<< "$VALUESA"
)

Which leaves one external call to perl per iteration. To eliminate that one as well, the whole loop would need to be replaced by, for example, a single awk or perl program.

I don't know where AVERAGE and STDEVIATE are determined. Is that in a similar loop? If so, I suspect similar gains could be made there.
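
The thread never shows how AVERAGE and STDEVIATE are computed, so the following is only a sketch under assumptions: that both are derived from the first field of each VALUESA line, and that the population standard deviation is what is wanted. The function name stats is made up. It does the work in one awk process instead of a shell loop.

```shell
# Sketch: print "mean stddev" for the first field of each input line.
# Assumption: population standard deviation, sqrt(E[x^2] - mean^2).
stats() {
  awk -F'[_-]' '
    NF { n++; sum += $1; sumsq += $1 * $1 }   # skip empty lines, accumulate sums
    END {
      if (n == 0) exit 1                      # no data: signal failure
      mean = sum / n
      var = sumsq / n - mean * mean
      if (var < 0) var = 0                    # guard against rounding just below zero
      printf "%.4f %.4f\n", mean, sqrt(var)
    }'
}

# Usage: set -- $(printf '%s\n' "$VALUESA" | stats); AVERAGE=$1; STDEVIATE=$2
```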

---edit---

This would be a gawk equivalent:

ZPROCC=$(
  gawk -F'[_-]' -v av="$AVERAGE" -v sd="$STDEVIATE" '
    {
      zscore=(sd>0) ? ($1-av)/sd : 0
      acscore=strftime("%b-%e-(%H:%M:%S)-%Y",$2)
      printf "frq=%s,std=%s,time=%s,epoch=%s,avg=%s\n", $1, zscore, acscore, $2, av
    }
  ' <<< "$VALUESA"
)

SkySmart · February 6, 2016, 9:43am

Thanks so much. Sorry for not specifying the shell. I intend to run this on a number of unix systems, some of which have old OSes: HP-UX, AIX, ubuntu, centos.

I'm afraid some of the bash commands won't work on the older systems.

The shell I'm using is "/bin/sh" on the older systems and "/bin/dash" on the newer ones, so I suppose your modifications would most likely only work on the newer systems.

Scrutinizer · February 6, 2016, 10:58am

You're welcome...

Alright, try:

ZPROCC=$(
  while IFS=_- read ALL EPTIME x
  do
    ZSCORE=$(( STDEVIATE>0 ? ( ALL - AVERAGE ) / STDEVIATE : 0 ))
    read x mon day time year x << EOF
      $(perl -le "print scalar(localtime($EPTIME))")
EOF
    ACSCORE="[$mon-$day-($time)-$year]"
    echo "frq=${ALL},std=${ZSCORE},time=${ACSCORE},epoch=${EPTIME},avg=${AVERAGE}"
  done << EOF
$VALUESA
EOF
)

An even faster solution in this case would be to do it all in perl.

SkySmart · February 6, 2016, 3:19pm

It seems this solution doesn't cope with numbers that contain decimals.

Error I received:

STDEVIATE>0 ? ( ALL - AVERAGE ) / STDEVIATE : 0 : 0403-057 Syntax error

Scrutinizer · February 6, 2016, 4:28pm

Oh yes, that's right: neither bash nor the POSIX shell supports floating-point arithmetic (for that you need ksh93 or zsh), so that takes another external program. Try:

ZPROCC=$(
  while IFS=_- read ALL EPTIME x
  do
    # "else" is a GNU bc extension, so use a plain "if" to stay portable to POSIX bc
    ZSCORE=$(echo "scale=4; z=0; if ($STDEVIATE>0) z=($ALL - $AVERAGE) / $STDEVIATE; z" | bc -l)
    read x mon day time year x << EOF
      $(perl -le "print scalar(localtime($EPTIME))")
EOF
    ACSCORE="[$mon-$day-($time)-$year]"
    echo "frq=${ALL},std=${ZSCORE},time=${ACSCORE},epoch=${EPTIME},avg=${AVERAGE}"
  done << EOF
$VALUESA
EOF
)

or

ZPROCC=$(
  awk -F'[_-]' -v av="$AVERAGE" -v sd="$STDEVIATE" '
    {
      zscore=(sd>0) ? ($1-av)/sd : 0
      cmd = "perl -le \"print scalar(localtime(" $2 "))\""
      cmd | getline acscore
      close(cmd)  # close the pipe after each line so open descriptors do not pile up
      printf "frq=%s,std=%s,time=%s,epoch=%s,avg=%s\n", $1, zscore, acscore, $2, av
    }
  ' << EOF
$VALUESA
EOF
)

--
Perhaps full perl would be best here...