Performance issue - to read line by line

All, we have a performance issue reading a file line by line. The script is included below. It currently takes about 45 minutes to parse 512444 lines.

Could you please have a look at it and suggest how to improve the performance?

Thanks,
Balu

------------------- start of the code --------------------------

echo " start of the script:`date`"
process_each_record()
{
  record=$1
  #record_type=`echo $record | sed 's/\(^.....\).*/\1/'`
  record_type=${record%"${record#?????}"}

   case $record_type  in
      'FHEAD')  record=`echo ${record_type}${VER}${INPUT_FILE}`;
                ;;

      'THEAD')
                REGISTER=`echo $record | cut -c 16-20 |tr "?" " "| tr -d ' '`;
                TRAN_NO=`echo $record | cut -c 35-44  |tr "?" " "| tr -d ' '`;
                TRAN_HEAD_SEQ_NO=$(( TRAN_HEAD_SEQ_NO + 1 ));
                TRAN_TRAN_DISC_SEQ_NO=0
                TRAN_ITEM_SEQ_NO=0
                TRAN_ITEM_DISC_SEQ_NO=0
                TRAN_ITEM_TAX_SEQ_NO=0
                TRAN_TENDER_SEQ_NO=0
                TRAN_CUSTOMER_SEQ_NO=0
                ;;

      'IDISC')
                TRAN_ITEM_DISC_SEQ_NO=$(( TRAN_ITEM_DISC_SEQ_NO + 1 ));
                ;;
      'TITEM')
                TRAN_ITEM_SEQ_NO=$(( TRAN_ITEM_SEQ_NO + 1 ));
                TRAN_ITEM_DISC_SEQ_NO=0
                TRAN_ITEM_TAX_SEQ_NO=0
                ;;


      'IGTAX')
                TRAN_ITEM_TAX_SEQ_NO=$(( TRAN_ITEM_TAX_SEQ_NO + 1 ));
                ;;

      'TTEND')
                TRAN_TENDER_SEQ_NO=$(( TRAN_TENDER_SEQ_NO + 1 ));
                ;;

      'TCUST')
                TRAN_CUSTOMER_SEQ_NO=$(( TRAN_CUSTOMER_SEQ_NO + 1 ));
                ;;

      *)
                ;;

  esac
        echo "${LINE_NO}${TRANS_SOURCE}${STORE_DAY_SEQ_NO}${STORE}${BUSINESS_DATE}${TRAN_HEAD_SEQ_NO}${REGISTER}${TRAN_NO}${SALESPERSON}${TRAN_TRAN_DISC_SEQ_NO}${TRAN_ITEM_SEQ_NO}${TRAN_ITEM_DISC_SEQ_NO}${TRAN_ITEM_TAX_SEQ_NO}${TRAN_TENDER_SEQ_NO}${TRAN_CUSTOMER_SEQ_NO}${TRAN_SEQ_FUTURE_USE}${record}" >> ${Test_output_data}

}


########### define the variables to append to all the files ###
typeset -Z10 LINE_NO=0
export TRANS_SOURCE='C'
typeset -Z10 STORE_DAY_SEQ_NO=999901
typeset -Z4  STORE=9999
typeset -Z8  BUSINESS_DATE=20170314
typeset -Z10 TRAN_HEAD_SEQ_NO=0
typeset -Z10 REGISTER=0
typeset -Z10 TRAN_NO=0
typeset -Z11  SALESPERSON=0
typeset -Z3  TRAN_TRAN_DISC_SEQ_NO=0
typeset -Z4  TRAN_ITEM_SEQ_NO=0
typeset -Z4  TRAN_ITEM_DISC_SEQ_NO=0
typeset -Z4  TRAN_ITEM_TAX_SEQ_NO=0
typeset -Z4  TRAN_TENDER_SEQ_NO=0
typeset -Z3  TRAN_CUSTOMER_SEQ_NO=0
typeset -Z4  TRAN_SEQ_FUTURE_USE=0

export VER='CC'

export Test_output_data='test_output_data.log'
export INPUT_FILE='test_input_data.txt'


while read line1
do
  LINE_NO=$((LINE_NO + 1))
  process_each_record "${line1}"
done < ${INPUT_FILE}


echo " end of the script:`date`"

------------------- end of the code --------------------------

Hi, IMO this is the biggest culprit:

                REGISTER=`echo $record | cut -c 16-20 |tr "?" " "| tr -d ' '`;
                TRAN_NO=`echo $record | cut -c 35-44  |tr "?" " "| tr -d ' '`;

What is your OS and version, and what is your shell?

AIX nmrmsdbint01 1 7 00C801E74C00 (uname -a)
and korn shell

I second Scrutinizer: for those two lines, 2 * 3 * 73656 = 441936 processes have to be created, which is costly. Fortunately, those are the only lines running external programs; all the remaining calculations are done using shell internals. Recent shells can do "parameter expansions" like "substring expansion" and "pattern substitution", so presumably no external programs would be needed at all. I'm not sure why you translate ? to a space and then delete all spaces; tr can delete several characters in one go.
PLUS, the redirected output file is opened and closed 512444 times.
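
To illustrate the two points above (no external programs, and opening the output file only once), here is a minimal sketch. It assumes ksh93 or bash, because ksh88 has neither substring expansion nor pattern substitution, so check which ksh your AIX box actually runs. The temporary file names and the shortened sample record are assumptions made up for the demo, not part of the original script.

```shell
# Sketch only: needs ksh93 or bash for ${var:offset:length} and ${var//pattern/}.
# Hypothetical throwaway files for the demo.
in_file=${TMPDIR:-/tmp}/thead_demo_in.$$
out_file=${TMPDIR:-/tmp}/thead_demo_out.$$

# Assumed sample: a shortened THEAD record in the fixed-width layout.
printf '%s\n' 'THEAD00000000028050?201703130000000000000001????' > "$in_file"

while IFS= read -r record
do
    case $record in
        THEAD*)
            REGISTER=${record:15:5}        # columns 16-20 (offset is 0-based)
            REGISTER=${REGISTER//[? ]/}    # delete every '?' and blank, no tr
            TRAN_NO=${record:34:10}        # columns 35-44
            TRAN_NO=${TRAN_NO//[? ]/}
            ;;
    esac
    printf '%s|%s\n' "$REGISTER" "$TRAN_NO"
done < "$in_file" > "$out_file"            # output file is opened once for the whole loop

result=$(cat "$out_file")
printf '%s\n' "$result"
rm -f "$in_file" "$out_file"
```

Putting the redirection on `done` instead of on each `echo` means the file is opened once rather than once per record. Use `>>` there instead of `>` if the output must be appended to an existing file, as the original script does.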

And, all THEAD records seem to be identical?

To conclude: I think the shell is not the tool of choice when it comes to analysing large text files. Use tailored tools, awk or the like.

Hi, can you please help with writing the same as a single awk command?

Note: I just copied THEAD multiple times to set the input data.

It would help if you gave us some sample output to work on; I'm not going to run some script for 45 min to find out what the target would be.

Not sure I fully and correctly understood and interpreted your script, but you could try this and comment on it:

awk '
BEGIN           {TRANS_SOURCE = "C"
                 STORE_DAY_SEQ_NO = 999901
                 STORE = 9999
                 BUSINESS_DATE = 20170314
                 TRAN_HEAD_SEQ_NO = 0
                 REGISTER = 0
                 TRAN_NO = 0
                 SALESPERSON = 0
                 TRAN_TRAN_DISC_SEQ_NO = 0
                 TRAN_ITEM_SEQ_NO = 0
                 TRAN_ITEM_DISC_SEQ_NO = 0
                 TRAN_ITEM_TAX_SEQ_NO = 0
                 TRAN_TENDER_SEQ_NO = 0
                 TRAN_CUSTOMER_SEQ_NO = 0
                 TRAN_SEQ_FUTURE_USE = 0
                 VER = "CC"
                }


                {RECORD   = $0
                 TYPE     = substr ($0, 1, 5)
                 if (TYPE == "FHEAD")    RECORD   = TYPE VER FILENAME

                 if (TYPE == "THEAD")   {REGISTER = substr ($0, 16,  5); gsub (/[? ]/, "")
                                         TRAN_NO  = substr ($0, 35, 10); gsub (/[? ]/, "")
                                         TRAN_HEAD_SEQ_NO++
                                         TRAN_TRAN_DISC_SEQ_NO = 0
                                         TRAN_ITEM_SEQ_NO      = 0
                                         TRAN_ITEM_DISC_SEQ_NO = 0
                                         TRAN_ITEM_TAX_SEQ_NO  = 0
                                         TRAN_TENDER_SEQ_NO    = 0
                                         TRAN_CUSTOMER_SEQ_NO  = 0
                                        }

                 if (TYPE == "IDISC")    TRAN_ITEM_DISC_SEQ_NO++

                 if (TYPE == "TITEM")   {TRAN_ITEM_SEQ_NO++
                                         TRAN_ITEM_DISC_SEQ_NO = 0
                                         TRAN_ITEM_TAX_SEQ_NO  = 0
                                        }

                 if (TYPE == "IGTAX")    TRAN_ITEM_TAX_SEQ_NO++

                 if (TYPE == "TTEND")    TRAN_TENDER_SEQ_NO++

                 if (TYPE == "TCUST")    TRAN_CUSTOMER_SEQ_NO++

                 printf "%10d%1c%10d%4d%8d%10d%10d%10d%11d%3d%4d%4d%4d%4d%3d%4d%s\n",   NR, TRANS_SOURCE, STORE_DAY_SEQ_NO, STORE, BUSINESS_DATE, TRAN_HEAD_SEQ_NO, 
                                                                                        REGISTER, TRAN_NO, SALESPERSON, TRAN_TRAN_DISC_SEQ_NO, TRAN_ITEM_SEQ_NO, 
                                                                                        TRAN_ITEM_DISC_SEQ_NO, TRAN_ITEM_TAX_SEQ_NO, TRAN_TENDER_SEQ_NO, TRAN_CUSTOMER_SEQ_NO, 
                                                                                        TRAN_SEQ_FUTURE_USE, RECORD
                }
' file
         1C    999901999920170314         0         0         0          0  0   0   0   0   0  0   0FHEADCCfile
         2C    999901999920170314         1      8050         1          0  0   0   0   0   0  0   0THEAD00000000028050?201703130000000000000001????????????????????SALE??SEND??0000000000?????1???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????10055969-1??????????????????????????????????????????????????P00000000000000000000????P00000000000000000000P00000000000000000000??????????HANNAH?PAINE??????????????????FORT?WORTH?TX?76177?????????????????????????????????????????OMS3????????????1005596927.00????????????????????????08:30:5810055969-1??E4X001034989357????????????????????????????sdfds9fwfww????????sdfds9fwfww????????
         3C    999901999920170314         1      8050         1          0  0   0   0   0   0  1   0TCUST000000000353973005????????test?123?456????????????????????????????????????????????????????????????????????????????????????????????????????????????32A?dsfsfs?erewrw????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????sdfdsf?WEST????????????????????????????????????????????????????????????????????????????????????????????????????????????42423??????????????????????????SDFDSFS???????????WERWERW???????????h.SDFDSFS@GMAIL.com???????????????????????????????????????????????????????????????????????????????????????????????00001
.
.
.

Thanks RudiC, I guess the given code is working fine; it took less than a minute.

You GUESS? Is the output comparable or - preferred! - identical, or not? Does it satisfy the needs?


Thanks a lot RudiC, your code is working fine and it takes less than a minute to process the file (initially 45 min).

If you want leading zeros instead of leading spaces, like your typeset -Z does, change each %Nd to %0Nd in the printf format.

 printf "%010d%1c%010d%04d%08d%010d%010d%010d%011d%03d%04d%04d%04d%04d%03d%04d%s\n", ...
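
A quick way to see the difference between the two formats, using the STORE_DAY_SEQ_NO value from the script (the variable name `padded` is only for this demo):

```shell
# %10d pads with leading spaces, %010d pads with leading zeros.
padded=$(awk 'BEGIN { printf "%10d|%010d\n", 999901, 999901 }')
printf '%s\n' "$padded"
# prints:     999901|0000999901
```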

The code below is not removing "?" and " " chars.

                 if (TYPE == "THEAD")   {REGISTER = substr ($0, 16,  5); gsub (/[? ]/, "")
                                         TRAN_NO  = substr ($0, 35, 10); gsub (/[? ]/, "")

replace with:

                 if (TYPE == "THEAD")   {REGISTER = substr ($0, 16,  5); gsub (/[? ]/, "", REGISTER)
                                         TRAN_NO  = substr ($0, 35, 10); gsub (/[? ]/, "", TRAN_NO)
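
A small sketch of why the third argument matters: without a target, gsub works on $0, so REGISTER keeps its "?" and, because $0 gets shorter, the column offsets used by any later substr no longer line up. The sample record is shortened and the shell variable names are made up for the demo.

```shell
# Assumed sample: a shortened THEAD record in the fixed-width layout.
record='THEAD00000000028050?201703130000000000000001????'

# No target: gsub edits $0, REGISTER is left untouched.
without_target=$(printf '%s\n' "$record" |
    awk '{ REGISTER = substr($0, 16, 5); gsub(/[? ]/, ""); print REGISTER }')

# Target given: gsub edits REGISTER itself and leaves $0 alone.
with_target=$(printf '%s\n' "$record" |
    awk '{ REGISTER = substr($0, 16, 5); gsub(/[? ]/, "", REGISTER); print REGISTER }')

printf '%s\n%s\n' "$without_target" "$with_target"
# prints:
# 8050?
# 8050
```

With the %10d format the stray "?" is hidden by the numeric conversion, which is why the first version appeared to work.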

Thanks, Chubler_XL. I had this in mind when laying out the code, but forgot about it when actually writing it...