Help speeding up my working script - how to make a script use more CPU

prvnrk · June 9, 2019, 1:00pm

Hello experts,

We have input files with 700K lines each (one is generated every hour), and we need to convert them as shown below and then move them to another directory.

Sample INPUT:-

[root@tst01 INPUT]#  cat test1
1559205600000,8474,NormalizedPortInfo,PctDiscards,0.0,Interface,BG-CTA-AX1.test.com,Vl111
1559205600000,8474,NormalizedPortInfo,HistoricalInterfaceSpeed,1000000000,Interface,BG-CTA-AX1.test.com,Vl111
1559205600000,8474,NormalizedPortInfo,SpeedIn,1000000000,Interface,BG-CTA-AX1.test.com,Vl111
1559205600000,8474,NormalizedPortInfo,FrameSizeIn,209.65929490852145,Interface,BG-CTA-AX1.test.com,Vl111
1559205600000,8474,NormalizedPortInfo,PctDiscardsIn,0.0,Interface,BG-CTA-AX1.test.com,Vl111
1559205600000,8474,NormalizedPortInfo,NonunicastIn,124,Interface,BG-CTA-AX1.test.com,Vl111
[root@tst01 INPUT]#

Sample output:-

[root@tst01 INPUT]#  cat ../OUTPUT/test1
TS;DURATION;SYSNM;DS_SYSNM;SYSTYPENM;OBJNM;SUBOBJNM;VALUE
2019-05-30 09:40:00;86400;BG-CTA-AX1.test.com;NormalizedPortInfo;PctDiscards;0.0;Interface;Vl111
2019-05-30 09:40:00;86400;BG-CTA-AX1.test.com;NormalizedPortInfo;HistoricalInterfaceSpeed;1000000000;Interface;Vl111
2019-05-30 09:40:00;86400;BG-CTA-AX1.test.com;NormalizedPortInfo;SpeedIn;1000000000;Interface;Vl111
2019-05-30 09:40:00;86400;BG-CTA-AX1.test.com;NormalizedPortInfo;FrameSizeIn;209.65929490852145;Interface;Vl111
2019-05-30 09:40:00;86400;BG-CTA-AX1.test.com;NormalizedPortInfo;PctDiscardsIn;0.0;Interface;Vl111
2019-05-30 09:40:00;86400;BG-CTA-AX1.test.com;NormalizedPortInfo;NonunicastIn;124;Interface;Vl111
[root@tst01 INPUT]#

I wrote a script that works: it converts the epoch time in the 1st column to a normal date, replaces the 2nd column with a fixed value (86400), and copies the remaining columns unchanged into different positions.
The problem is that my script processes only ~40 lines per second, so only about 144K lines are done in an hour, and we need to finish all 700K in under 1 hour. CPU usage is just 12% of 1 core, on a single CPU with 12 cores. How could I improve the script's speed, and how could I let it use all CPU cores for parallel processing?
THANKS

My working script, which processes only 40 lines per second:

cat /usr/local/bin/script.sh
#!/bin/bash
BASEDIR=/tmp/tsight
INPUTDIR=${BASEDIR}/INPUT
OUTPUTDIR=${BASEDIR}/OUTPUT
DONEDIR=${BASEDIR}/DONE
mkdir -p ${BASEDIR}/INPUT
mkdir -p ${BASEDIR}/OUTPUT
mkdir -p ${BASEDIR}/DONE
cd ${INPUTDIR}
for inp in *
do
tail -n +2 ${inp} >/tmp/tempp
\mv /tmp/tempp ${inp}
echo "TS;DURATION;SYSNM;DS_SYSNM;SYSTYPENM;OBJNM;SUBOBJNM;VALUE" >${OUTPUTDIR}/${inp}
cat ${inp} | while read line
do
TIMES=`echo "${line}" |awk -F, '{print $1}'`
DURA="86400"
INTMF=`echo "${line}" |awk -F, '{print $3}'`
METRIC=`echo "${line}" |awk -F, '{print $4}'`
VALU=`echo "${line}" |awk -F, '{print $5}'`
MFDISP=`echo "${line}" |awk -F, '{print $6}'`
DEVC=`echo "${line}" |awk -F, '{print $7}'`
CNAME=`echo "${line}" |awk -F, '{print $8}'`
#TIMES=`echo "${line}" |awk -F, '{print $1}'`
NON_MIL=`expr "${TIMES}" / 1000`
EPO2DT=`date -d @${NON_MIL} '+%Y-%m-%d %H:%M:%S'`
echo "${EPO2DT};${DURA};${DEVC};${INTMF};${METRIC};${VALU};${MFDISP};${CNAME}" >>${OUTPUTDIR}/${inp}
done
\mv ${BASEDIR}/INPUT/${inp} ${DONEDIR}
done

[root@tst01 INPUT]#

RudiC · June 9, 2019, 1:49pm

No surprise the execution of your script is a bit sluggish - you execute 16 processes per input line. As you are using awk anyhow, why not do the entire thing with it?

Peasant · June 9, 2019, 2:11pm

Try this; save it in a file named small.awk

BEGIN {
FS=","
OFS=";"
}
{
$1=strftime("%Y-%m-%d %H:%M:%S")
$2=86400
print $1,$2,$7,$3,$4,$5,$6,$NF
}

Run as :
awk -f small.awk test1 > ../OUTPUT/test1_done
See if this speeds up processing of one file.

Please specify the operating system in the future, when making such requests.
Given the date invocation in your script, I would guess Linux.

Hope that helps
Regards
Peasant.

prvnrk · June 9, 2019, 2:41pm

Thanks RudiC & Peasant.

@Peasant - your solution worked awesome, didn't even take 1 second to finish a file.

RudiC · June 9, 2019, 2:56pm

Three comments on Peasant's fine proposal:

Not all awk versions provide strftime(); gawk may be required.
Calling strftime() without a time stamp returns the system time; pass $1 to get the desired output. Since $1 is in milliseconds, the last three digits may need to be removed first.
A heading was required.
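The second point can be checked from the command line. A minimal sketch, assuming an awk that provides strftime() (gawk, or a recent mawk); the function name ms_to_date is only illustrative:

```shell
#!/bin/bash
# Sketch: convert epoch milliseconds to a readable date with awk's strftime().
# Assumes an awk that provides strftime() (gawk, or a recent mawk);
# ms_to_date is an illustrative name, not something from the thread.
ms_to_date() {
    # Dividing by 1000 drops the milliseconds; strftime() expects seconds.
    echo "$1" | awk '{ print strftime("%Y-%m-%d %H:%M:%S", $1 / 1000) }'
}

# Without a second argument strftime() would format the current time instead.
ms_to_date 1559205600000
```

The hour printed depends on the local time zone. With TZ=UTC the sample value gives 2019-05-30 08:40:00; the 09:40 and 10:40 seen in this thread presumably come from machines in different time zones.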

For awks without strftime(), try (reducing the process count as far as possible):

paste -d, <(date +"%Y-%m-%d %H:%M:%S" -f<(sed 's/^/@/; s/000,.*$//' file)) <(cut -d, -f2- file) | 
awk -F, -vOFS=";" '
BEGIN   {print "TS;DURATION;SYSNM;DS_SYSNM;SYSTYPENM;OBJNM;SUBOBJNM;VALUE"
        }
        {$2 = "86400" OFS $7
         $7 = $8; NF--
        }
1
'
TS;DURATION;SYSNM;DS_SYSNM;SYSTYPENM;OBJNM;SUBOBJNM;VALUE
2019-05-30 10:40:00;86400;BG-CTA-AX1.test.com;NormalizedPortInfo;PctDiscards;0.0;Interface;Vl111
2019-05-30 10:40:00;86400;BG-CTA-AX1.test.com;NormalizedPortInfo;HistoricalInterfaceSpeed;1000000000;Interface;Vl111
2019-05-30 10:40:00;86400;BG-CTA-AX1.test.com;NormalizedPortInfo;SpeedIn;1000000000;Interface;Vl111
2019-05-30 10:40:00;86400;BG-CTA-AX1.test.com;NormalizedPortInfo;FrameSizeIn;209.65929490852145;Interface;Vl111
2019-05-30 10:40:00;86400;BG-CTA-AX1.test.com;NormalizedPortInfo;PctDiscardsIn;0.0;Interface;Vl111
2019-05-30 10:40:00;86400;BG-CTA-AX1.test.com;NormalizedPortInfo;NonunicastIn;124;Interface;Vl111

prvnrk · June 9, 2019, 3:17pm

Thanks RudiC for your script. My server is running RHEL 6, so I guess its awk has that capability.

I marked this thread as "solved"

MadeInGermany · June 9, 2019, 4:04pm

Even the shell becomes faster if builtins are used: let the read command split the fields into variables, use $(( )) rather than expr, and write the output file in one stream.

# set constants before the loop
DURA="86400"
cd "$INPUTDIR" || exit
for inp in *
do
  # the following code block has redirected stdin and stdout
  {
  # delete header line#1 
  read x
  # write header
  echo "TS;DURATION;SYSNM;DS_SYSNM;SYSTYPENM;OBJNM;SUBOBJNM;VALUE"
  # IFS=, makes read split on commas; -r keeps backslashes literal
  while IFS=, read -r TIMES x INTMF METRIC VALU MFDISP DEVC CNAME x
  do
    NON_MIL=$(( TIMES / 1000 ))
    EPO2DT=`date -d @${NON_MIL} '+%Y-%m-%d %H:%M:%S'`
    echo "$EPO2DT;$DURA;$DEVC;$INTMF;$METRIC;$VALU;$MFDISP;$CNAME"
  done
  } <"$inp"  >"$OUTPUTDIR/$inp"
  # after the block the files are closed
  \mv "$inp" "$DONEDIR"
done
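
The loop above still forks one date process per input line. In the sample every line carries the same timestamp, so one further step, assuming consecutive lines often share a timestamp in the real files too, is to call date only when the value changes. A minimal sketch (the names convert_cached and last_ms are illustrative; GNU date is assumed):

```shell
#!/bin/bash
# Sketch: read comma-separated lines on stdin and write the semicolon format,
# forking `date` only when the timestamp differs from the previous line's.
# convert_cached and last_ms are illustrative names, not from the thread.
convert_cached() {
    local DURA=86400 last_ms="" EPO2DT=""
    local TIMES x INTMF METRIC VALU MFDISP DEVC CNAME
    while IFS=, read -r TIMES x INTMF METRIC VALU MFDISP DEVC CNAME x
    do
        if [ "$TIMES" != "$last_ms" ]; then
            # GNU date assumed; epoch milliseconds are reduced to seconds first
            EPO2DT=$(date -d "@$(( TIMES / 1000 ))" '+%Y-%m-%d %H:%M:%S')
            last_ms=$TIMES
        fi
        echo "$EPO2DT;$DURA;$DEVC;$INTMF;$METRIC;$VALU;$MFDISP;$CNAME"
    done
}
```

It reads stdin and writes stdout, so it could stand in for the while loop inside the redirected block, e.g. `convert_cached <"$inp"`. If timestamps change on nearly every line, the cache gains nothing and the awk solution remains the better choice.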

Peasant · June 10, 2019, 12:22am

Rudi is right; I missed the conversion and the header, so the first field ended up with the current system time instead of the converted value.

Please see correction :

BEGIN {
FS=","
OFS=";"
print "TS;DURATION;SYSNM;DS_SYSNM;SYSTYPENM;OBJNM;SUBOBJNM;VALUE"
}
{
ts=strftime("%Y-%m-%d %H:%M:%S",$1/1000)
$2=86400
print ts,$2,$7,$3,$4,$5,$6,$NF
}
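
On the original question of using more cores: a single awk process runs on one core, so one way to use several is to run one awk per input file, a few at a time. A minimal sketch, assuming GNU find and xargs and an awk with strftime(); the function name convert_all, the job count argument, and the AWK_PROG variable are illustrative, and the directory layout follows the original script:

```shell
#!/bin/bash
# Sketch: convert every file in $base/INPUT with one awk process per file,
# running up to $jobs of them in parallel. Assumes GNU find/xargs and an awk
# with strftime(). convert_all and AWK_PROG are illustrative names.
AWK_PROG='
BEGIN { FS = ","; OFS = ";"
        print "TS;DURATION;SYSNM;DS_SYSNM;SYSTYPENM;OBJNM;SUBOBJNM;VALUE" }
{ print strftime("%Y-%m-%d %H:%M:%S", $1 / 1000), 86400, $7, $3, $4, $5, $6, $NF }
'
export AWK_PROG   # the worker shells started by xargs read it from the environment

convert_all() {
    base=$1
    jobs=${2:-4}
    mkdir -p "$base/INPUT" "$base/OUTPUT" "$base/DONE"
    (
        cd "$base/INPUT" || exit 1
        # -P runs several workers at once; -n 1 hands each worker one file
        find . -maxdepth 1 -type f -print0 |
        xargs -0 -r -n 1 -P "$jobs" sh -c '
            f=${1#./}
            # move the input to DONE only if the conversion succeeded
            awk "$AWK_PROG" "$1" > "../OUTPUT/$f" && mv "$1" ../DONE/
        ' sh
    )
}
```

It would be called as, for example, `convert_all /tmp/tsight 12`. With only one new file per hour this helps mainly when a backlog of files has built up. If the real input files carry a header line (the original script drops the first line), a rule such as `NR == 1 { next }` would have to be added to the awk program.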

Regards
Peasant.