Performance issue to read line by line

balu1729 · May 4, 2016, 3:54pm

Hi all, we have a performance issue in Unix when reading a file line by line.
I am looking at processing all the records.

Description: Our script reads data from a flat file, picks up the first four characters of each line, sets variables according to that value, and appends the final output to another flat file, as shown below.

Concern: The script works fine, but reading line by line is slow. We are looking for something that reads all the lines at once, dynamically identifies the first four characters, sets the individual variables accordingly, and finally appends the value.

Please find the script and a sample input data file attached.

actual script:
---------------------------------------------------------------------------------------

#!/bin/ksh
#set -x


process_each_record()
{

  record=$1
  ###### extract first four characters  ##############
  record_type=`echo $record | sed 's/\(^....\).*/\1/'`

case $record_type  in

  1111)  a1=100; a2=0; a3=0; a4=0; a5=0; a6=0; a7=0;a8=0; a9=0;
	echo "${line_no}${a1}${a2}${a3}${a4}${a5}${a6}${a7}${a8}${a9}${record}" >> test1_all_data.log;
	;;
  1112) a2=` expr ${a2} + 1 `; a3=` expr ${a3} + 2 `;
	echo "${line_no}${a1}${a2}${a3}${a4}${a5}${a6}${a7}${a8}${a9}${record}" >> test1_all_data.log;
	;;
  1113) a7=` expr ${a7} + 1 `; a5=` expr ${a5} + 3 `;
	echo "${line_no}${a1}${a2}${a3}${a4}${a5}${a6}${a7}${a8}${a9}${record}" >> test1_all_data.log;
	;;
  1114) a4=` expr ${a4} + 3 `; a6=` expr ${a6} + 4 `; 
	echo "${line_no}${a1}${a2}${a3}${a4}${a5}${a6}${a7}${a8}${a9}${record}" >> test1_all_data.log;
	;;
  1115) a7=` expr ${a7} + 1 `; a9=` expr ${a9} + 3 `;
	echo "${line_no}${a1}${a2}${a3}${a4}${a5}${a6}${a7}${a8}${a9}${record}" >> test1_all_data.log;
	;;
  1116) a8=` expr ${a8} + 1 `; a5=` expr ${a5} + 1 `;
	echo "${line_no}${a1}${a2}${a3}${a4}${a5}${a6}${a7}${a8}${a9}${record}" >> test1_all_data.log;
	;;
  2221) a6=0; a7=0;a8=0; a9=0;
	echo "${line_no}${a1}${a2}${a3}${a4}${a5}${a6}${a7}${a8}${a9}${record}" >> test1_all_data.log;
	;;
  2222) a3=` expr ${a3} + 1 `; a7=` expr ${a7} + 3 `;
	echo "${line_no}${a1}${a2}${a3}${a4}${a5}${a6}${a7}${a8}${a9}${record}" >> test1_all_data.log;
	;;
  3333) a8=` expr ${a8} + 1 `; a9=` expr ${a9} + 5 `;
	echo "${line_no}${a1}${a2}${a3}${a4}${a5}${a6}${a7}${a8}${a9}${record}" >> test1_all_data.log;
	;;
  5555) a1=` expr ${a1} + 1 `; a2=` expr ${a2} + 3 `; a3=` expr ${a3} + 1 `; a4=` expr ${a4} + 1 `;
	echo "${line_no}${a1}${a2}${a3}${a4}${a5}${a6}${a7}${a8}${a9}${record}" >> test1_all_data.log;
	;;

     *) echo "${line_no}${a1}${a2}${a3}${a4}${a5}${a6}${a7}${a8}${a9}${record}" >> test1_all_data.log;
	;;

  esac


}



######## define variables #####
typeset -Z10 a1
typeset -Z7  a2
typeset -Z3  a3
typeset -Z6  a4
typeset -Z2  a5
typeset -Z7  a6
typeset -Z9  a7
typeset -Z5  a8
typeset -Z2  a8
typeset -Z4  a9
typeset -Z10  line_no

a1=0; a2=0; a3=0; a4=0; a5=0; a6=0; a7=0;a8=0; a9=0;line_no=0;


if [ -f test1_all_data.log ]
then
	rm test1_all_data.log
fi


cat test2.tlog | while read  line1
do
  line_no=` expr ${line_no} + 1 `
  process_each_record  ${line1}
done

---------------------------------------------------------------------------------------

Scrutinizer · May 4, 2016, 7:13pm

Your script is slowed down considerably by an average of 5 calls to external programs and subshells for each line of the input file, which adds up to 50,000 (!) calls for the 10,000-line input sample in post #1.

Instead of :

record_type=`echo $record | sed 's/\(^....\).*/\1/'`

try

record_type=${record%"${record#????}"}
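
To see why the nested expansion yields the first four characters, it may help to split it into two steps. This is only a sketch: the sample record and the intermediate variable `rest` are made up for illustration.

```shell
record='1112ABCDEF'          # hypothetical sample record

# Step 1: ${record#????} removes the shortest prefix matching four
# characters, leaving everything after the record type.
rest=${record#????}

# Step 2: ${record%"$rest"} removes that remainder from the end.
# The quotes make the shell treat $rest as literal text, not as a pattern.
record_type=${record%"$rest"}

echo "$record_type"          # prints 1112
```

Both expansions are done by the shell itself, so no external process is started.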

And for all the expr statements in the case statement, you can use ksh's arithmetic expansions:

For example instead of

a2=` expr ${a2} + 1 `; a3=` expr ${a3} + 2 `

try:

a2=$(( a2 + 1 )); a3=$(( a3 + 2))

and instead of

a7=` expr ${a7} + 1 `; a5=` expr ${a5} + 3 `

try:

a7=$((a7 + 1)); a5=$((a5 + 3))

and so on for all the other lines with ` expr ...` statements.
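
As a quick sanity check that the two forms compute the same value, here is a small sketch; the starting value and the names `old` and `new` are made up for the example.

```shell
a2=5                         # hypothetical starting value

old=`expr $a2 + 1`           # forks a subshell and runs the external expr
new=$((a2 + 1))              # evaluated inside the shell itself

echo "$old $new"             # prints 6 6
```

The result is the same; the difference is that the second form does not have to start a new process for every increment.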

--
Also instead of

cat test2.tlog | while read  line1
do
  line_no=` expr ${line_no} + 1 `
  process_each_record  ${line1}
done

You can try:

while read line1
do
  line_no=$((line_no + 1))
  process_each_record "${line1}"
done < test2.tlog

Note that there should be double quotes around $line1 to avoid unintended field splitting and wildcard expansions by the shell.
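
The effect of the quotes can be seen with a small sketch. `count_args` is a hypothetical helper that only reports how many arguments it received, and the record is made up.

```shell
# count_args is a hypothetical helper that reports how many arguments it got.
count_args() { echo "$#"; }

line1='1112 first  second'   # made-up record containing blanks

unquoted=$(count_args $line1)     # split on whitespace: three arguments
quoted=$(count_args "$line1")     # passed through as one argument

echo "unquoted=$unquoted quoted=$quoted"   # prints unquoted=3 quoted=1
```

With the unquoted call, `$1` inside the function holds only the first word of the record, so anything after the first blank would be lost from the output.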

If you do this throughout the script, it will result in zero calls to external programs or subshells, and this should bring a dramatic performance gain.

--
Yet another option, though that is a matter of taste, is to not cut off the first four characters, but instead use the case statement's pattern matching:

case $record in
  1111*) 
    foo
    ;;
  1112*)
    bar
    ;;
...

Don_Cragun · May 4, 2016, 10:51pm

If we take your original code (after converting the DOS <carriage-return><linefeed> character pair line terminators into UNIX <linefeed> (AKA <newline>) single character line terminators and modifying your sample input file (test2.txt) the same way) and we time running your script 7 times on a MacBook Pro built about 2 years ago with a 2.8GHz Intel Core i7 processor and a 1TB SSD running OS X El Capitan Version 10.11.4, the average time output looks like:

real	1m13.54s
user	0m25.80s
sys	0m45.76s

(i.e., 73.54 seconds).

If we modify your code using the suggestions Scrutinizer supplied (using a logical equivalent of:

record_type=${record%"${record#????}"}

to extract the first four characters of each record), drop the test for the existence of the output file, and redirect the output of the whole read loop (which opens and closes the output file once) instead of opening and closing the output file once for each line read from your input file (getting rid of 9,999 opens and closes when processing your sample input), and then time the following script:

#!/bin/ksh
#set -x

process_each_record() {
	###### extract first four characters  ##############
	case ${1%"${1#????}"} in
	(1111)	a1=100; a2=0; a3=0; a4=0; a5=0; a6=0; a7=0; a8=0; a9=0
		echo "$line_no$a1$a2$a3$a4$a5$a6$a7$a8$a9$1"
		;;
	(1112)	a2=$((a2 + 1)); a3=$((a3 + 2))
		echo "$line_no$a1$a2$a3$a4$a5$a6$a7$a8$a9$1"
		;;
	(1113)	a7=$((a7 + 1)); a5=$((a5 + 3))
		echo "$line_no$a1$a2$a3$a4$a5$a6$a7$a8$a9$1"
		;;
	(1114)	a4=$((a4 + 3)); a6=$((a6 + 4))
		echo "$line_no$a1$a2$a3$a4$a5$a6$a7$a8$a9$1"
		;;
	(1115)	a7=$((a7 + 1)); a9=$((a9 + 3))
		echo "$line_no$a1$a2$a3$a4$a5$a6$a7$a8$a9$1"
		;;
	(1116)	a8=$((a8 + 1)); a5=$((a5 + 1))
		echo "$line_no$a1$a2$a3$a4$a5$a6$a7$a8$a9$1"
		;;
	(2221)	a6=0; a7=0; a8=0; a9=0
		echo "$line_no$a1$a2$a3$a4$a5$a6$a7$a8$a9$1"
		;;
	(2222)	a3=$((a3 + 1)); a7=$((a7 + 3))
		echo "$line_no$a1$a2$a3$a4$a5$a6$a7$a8$a9$1"
		;;
	(3333)	a8=$((a8 + 1)); a9=$((a9 + 5))
		echo "$line_no$a1$a2$a3$a4$a5$a6$a7$a8$a9$1"
		;;
	(5555)	a1=$((a1 + 1)); a2=$((a2 + 3)); a3=$((a3 + 1)); a4=$((a4 + 1))
		echo "$line_no$a1$a2$a3$a4$a5$a6$a7$a8$a9$1"
		;;
	(*)	echo "$line_no$a1$a2$a3$a4$a5$a6$a7$a8$a9$1"
		;;
	esac
}

######## define variables #####
typeset -Z10	a1
typeset -Z7	a2
typeset -Z3	a3
typeset -Z6	a4
typeset -Z2	a5
typeset -Z7	a6
typeset -Z9	a7
typeset -Z5	a8
typeset -Z2	a8
typeset -Z4	a9
typeset -Z10	line_no

######## initialize variables #####
a1=0; a2=0; a3=0; a4=0; a5=0; a6=0; a7=0; a8=0; a9=0; line_no=0

######## loop through the input #####
while read line1
do	line_no=$((line_no + 1))
	process_each_record "$line1"
done < test2.txt > test1_all_data.log

we get average time output:

real	0m0.32s
user	0m0.29s
sys	0m0.02s
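
Part of that gain comes from where the output redirection sits. The following sketch contrasts the two placements on a tiny made-up loop; the temporary files and the record text are purely illustrative.

```shell
out_a=$(mktemp)              # temporary files, used only for this demo
out_b=$(mktemp)

# Variant A: '>>' inside the loop opens and closes the file on every pass.
i=0
while [ "$i" -lt 5 ]; do
  i=$((i + 1))
  echo "record $i" >> "$out_a"
done

# Variant B: redirecting the whole loop opens the file once.
i=0
while [ "$i" -lt 5 ]; do
  i=$((i + 1))
  echo "record $i"
done > "$out_b"

cmp -s "$out_a" "$out_b" && same=yes || same=no
echo "identical output: $same"
rm -f "$out_a" "$out_b"
```

Because `>` truncates the file when the loop starts, the separate test-and-rm step for an existing output file is no longer needed.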

You didn't say which version of the Korn shell you're using. The above code works with any Korn shell. If you have a 1993 or later version of ksh, you can change the line:

	case ${1%"${1#????}"} in

to:

	case ${1:0:4} in

and further reduce the average running time to:

real	0m0.17s
user	0m0.14s
sys	0m0.02s

That is better than a 99.75% reduction from your original script's running time.

If you are using a 1988 vintage ksh and don't have a /bin/ksh93 that you can use, we can still incorporate Scrutinizer's 2nd suggestion changing the above case statement to just:

	case $1 in

and change the patterns from the form:

	(1111)	assignments...

to:

	(1111*)	assignments...

and still reduce the average running time to:

real	0m0.28s
user	0m0.24s
sys	0m0.03s

which is still about a 99.62% reduction from your original script's running time and also works with any version of the Korn shell.
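
The percentages quoted here follow from the averaged real times. A quick check of the arithmetic, using awk only as a calculator (the helper name `reduction` is made up):

```shell
# reduction OLD NEW: percentage by which the running time dropped.
reduction() {
  awk -v old="$1" -v new="$2" 'BEGIN { printf "%.2f\n", (old - new) / old * 100 }'
}

ksh93_pct=$(reduction 73.54 0.17)    # ${1:0:4} version
ksh88_pct=$(reduction 73.54 0.28)    # case-pattern version

echo "$ksh93_pct $ksh88_pct"         # prints 99.77 99.62
```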

I hope this gives you some idea of how significant the improvement in running time can be when you get rid of unneeded invocations of external utilities and unneeded output file opens and closes.

balu1729 · May 5, 2016, 2:12am

Thanks Don- Will check the performance and update you.

I am also looking to write this as one-line code, something like below:

cat test2.tlog | awk 'BEGIN { record_type=${record%"${record#????}"}; 
if (record_type == 1111) {a1=100; a2=0; a3=0; a4=0; a5=0; a6=0; a7=0;a8=0; a9=0} 
if (record_type == 2222) {a2=$(( a2 + 1 )); a3=$(( a3 + 2))} 
...
...
else printf $line_no$a1....$9;
printf $line_no$a1....$9
}' >> test1_all_data.log

Kindly let me know if you have any suggestions on this.

---------- Post updated at 01:12 AM ---------- Previous update was at 12:41 AM ----------

Thanks a lot Don for your support...

My initial script took around 21 min 35 sec, with the new code it is taking 15 sec.

Scrutinizer · May 5, 2016, 2:43am

In addition, it will not matter much performance-wise, but the function could be further reduced to something like this, making it a bit easier to understand and thus more maintainable.

process_each_record() {
  ###### extract first four characters  ##############
  case $1 in
    (1111*)     a1=100; a2=0; a3=0; a4=0; a5=0; a6=0; a7=0; a8=0; a9=0   ;;
    (1112*)     a2=$((a2+1)); a3=$((a3+2))                               ;;
    (1113*)     a5=$((a5+3)); a7=$((a7+1))                               ;;
    (1114*)     a4=$((a4+3)); a6=$((a6+4))                               ;;
    (1115*)     a7=$((a7+1)); a9=$((a9+3))                               ;;
    (1116*)     a5=$((a5+1)); a8=$((a8+1))                               ;;
    (2221*)     a6=0;   a7=0; a8=0;   a9=0                               ;;
    (2222*)     a3=$((a3+1)); a7=$((a7+3))                               ;;
    (3333*)     a8=$((a8+1)); a9=$((a9+5))                               ;;
    (5555*)     a1=$((a1+1)); a2=$((a2+3)); a3=$((a3+1)); a4=$((a4+1))   ;;
  esac
  echo "${line_no}${a1}${a2}${a3}${a4}${a5}${a6}${a7}${a8}${a9}${1}"
}

RudiC · May 5, 2016, 3:00am

In principle, the idea of switching to awk for the entire processing is not a bad one, although you shouldn't expect another performance improvement as noticeable as the one gained before.
But you can't use shell syntax inside awk. E.g.

${record%"${record#????}"} ---> substr (record, 1, 4)
a2=$(( a2 + 1 ))           ---> a2+=1
$xyz                       ---> xyz (unless you want to access field xyz)

And, don't cat a file into awk's stdin; awk can open and read the file itself.
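
A minimal sketch of those translations in a working awk call. The input file, its three records, and the single counter are made up for the example and are not the OP's data.

```shell
sample=$(mktemp)             # throwaway input file for the demo
printf '1112AAA\n1112BBB\n9999CCC\n' > "$sample"

# awk opens the file named on its command line; no cat is needed.
result=$(awk '
  { rt = substr($0, 1, 4) }      # first four characters of the record
  rt == "1112" { a2 += 1 }       # awk variables take no $ prefix
  { print rt, a2 + 0 }           # a2 + 0 forces a numeric 0 before the first match
' "$sample")

echo "$result"
rm -f "$sample"
```

This prints `1112 1`, `1112 2` and `9999 2`: the counter carries over from record to record, just as the shell variables do in the ksh versions.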

Don_Cragun · May 5, 2016, 4:20am

I would have thought that by now you would know that the awk command language and the shell command language are not the same.

I could rewrite the 62 line, 1,595 character ksh93 script I suggested to instead be a ksh script invoking awk that would run about twice as fast as the ksh93 script, and still produce exactly the same output as the other three scripts.

But, I would never attempt to do that if I thought you were going to try to convert that readable, maintainable, understandable 42 line, 840 character script into an unreadable, unmaintainable, not understandable 1-liner. And, if I were to create such a script, it would not contain an unneeded use of cat that would only slow it down (just like the cat in your original script did).

---------------------------------

I'm glad to hear that one of the three scripts I suggested is working well for you.

RudiC · May 5, 2016, 3:44pm

Would this come close to what you need?

awk '
                {RT = substr ($1, 1, 4)}

RT == 1111      {a1 = 100; a2 = a3 = a4 = a5 = a6 = a7 = a8 = a9 = 0}
RT == 1112      {a2++   ; a3 += 2}
RT == 1113      {a5 += 3; a7++}
RT == 1114      {a4 += 3; a6 += 4}
RT == 1115      {a7++   ; a9 += 3}
RT == 1116      {a5++   ; a8 ++}
RT == 2221      {a6 = a7 = a8 = a9 = 0}
RT == 2222      {a3++   ; a7 += 3}
RT == 3333      {a8++   ; a9 += 5}
RT == 5555      {a1++   ; a2 += 3; a3++; a4++}

                {print NR  a1 a2 a3 a4 a5 a6 a7 a8 a9 $0}
' /tmp/test2.txt

Don_Cragun · May 5, 2016, 4:03pm

That is close to what I came up with, but you'll need to use printf with a format string producing fixed-width, leading-zero-filled formats for the NR and a1 through a9 fields instead of just using print. (awk doesn't have the ksh typeset flags to set output formats for values assigned to variables.)

RudiC · May 6, 2016, 3:11am

Thanks! As you can easily see, I'm not a ksh-er, as I didn't have a clue what the typeset -Z could possibly mean...
Still I was wondering if the increasing field size for e.g. NR was actually desired...

Well, use

                {printf "%010d%010d%07d%03d%06d%02d%07d%09d%02d%04d%s\n", NR, a1, a2, a3, a4, a5, a6, a7, a8, a9, $0}

, then, assuming the last of two entries for a8 in post#1 should count for the field size.

Don_Cragun · May 6, 2016, 4:33am

Hi RudiC,
Yes, the 2nd typeset for a8 overrides the 1st typeset for a8. I assume that the field widths were chosen such that there could never be a field overflow. (If there ever is an overflow, the string of decimal digits in the resulting output can't be deciphered since there are no field separators in the output; all of the fields are defined by the column positions they occupy in a line.)
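
For the printf-based versions, the overflow problem can be illustrated with two narrow fields. This is a sketch with made-up widths and values, using the shell's printf, which handles these format specifiers the same way awk's printf does.

```shell
# Fixed-width, zero-filled fields, as in the awk printf format above.
padded=$(printf '%03d%02d' 7 42)

# A value wider than its field is not truncated; the field simply grows,
# which shifts every later column.
overflow=$(printf '%03d%02d' 1234 42)

echo "$padded $overflow"     # prints 00742 123442
```

In the second result there is no way to tell from the digits alone where the first field ends and the second begins.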

The way I did it was similar to the way you did it, but using an if else tree instead of separate condition-action statements. It takes more space, but runs slightly faster. The patterns used are all mutually exclusive, so subsequent tests can be skipped once a match is found.

Hoping that the OP won't subvert this into a 1-liner, here is the way I did it:

#!/bin/ksh
awk '
{	rec_type = substr($0, 1, 4)
	if(rec_type == 1111) {
		a1 = 100
		a2 = a3 = a4 = a5 = a6 = a7 = a8 = a9 = 0
	} else if(rec_type == 1112) {
		a2++
		a3 += 2
	} else if(rec_type == 1113) {
		a7++
		a5 += 3
	} else if(rec_type == 1114) {
		a4 += 3
		a6 += 4
	} else if(rec_type == 1115) {
		a7++
		a9 += 3
	} else if(rec_type == 1116) {
		a8++
		a5++
	} else if(rec_type == 2221) {
		a6 = a7 = a8 = a9 = 0
	} else if(rec_type == 2222) {
		a3++
		a7 += 3
	} else if(rec_type == 3333) {
		a8++
		a9 += 5
	} else if(rec_type == 5555) {
		a1++
		a2 += 3
		a3++
		a4++
	}
	printf("%010d%010d%07d%03d%06d%02d%07d%09d%02d%04d%s\n",
		NR, a1, a2, a3, a4, a5, a6, a7, a8, a9, $0)
}' test2.txt > test1_all_data.log

bakunin · May 11, 2016, 8:48am

Just for the record: it is possible (in ksh) to create an "integer-environment" using double parentheses. Your line

a2=$(( a2 + 1 ))

could also be written this (C-like) way:

(( a2 += 1 ))

This, of course, changes nothing about the correctness of your observations.

I hope this helps.

bakunin