Alternative solution to nested loops in shell programming

Use and complete the template provided. The entire template must be completed. If you don't, your post may be deleted!

  1. The problem statement, all variables and given/known data:

Hi,

The problem statement is: I am trying to read a flat file line by line using a while loop. The flat file will contain 100k records, and each record will have 25 columns. While reading each line, I have to read some values from an array and create a map between the array values and the fields extracted from that line. I tried using a for loop inside the while loop, but that kills the performance. I would like to know of any alternative approach that avoids the nested loops. Any help would be greatly appreciated.

  2. Relevant commands, code, scripts, algorithms:

Command to run the script:

Create_Index.ksh <config_file> "ABC" 1

Indexfields_1 will contain the values separated by "," for which the mapping needs to be created.

E.g.: "A","B","C","D", ... (25 fields like that)

#!/usr/bin/ksh

if [[ $# != 3 ]];then
        echo "Incorrect No .of aurguments sent to script"
        echo "Usage: Create_Index.ksh <config_file_name><table_identifier><segment_number> "
        echo "Insufficient parameters to continue execution. Exiting the $(basename ${0}) script with 1 at $(date)"
        exit 1
fi


config_file=${1}

if [ -s ${config_file} ]
then
        . ${config_file}
else
        log "Config file not found"
fi


#-------------------------------------
# function to log message to log file
#-------------------------------------
function log
{
        msg="$1"

        echo "== $(date '+%m/%d/%Y %H:%M:%S')  :${msg}" >>${IndexCreation_DAILY_LOG}
}

#-------------------------------------
# function ends
#------------------------------------


base_dir="${BASE_DIR}"
afp_dir="${AFP_DIR}"
index_dir="${INDEX_DIR}/$2/$2$3"
log_dir="${LOG_DIR}/$2/$2$3"
trigger_dir="${TRIGGER_DIR}/$2/$2$3"
log_filename_suffix="${LOG_FILENAME_SUFFIX}"
output_file_path="${OUTPUT_FILE_PATH}/$2$3"
IndexCreation_DAILY_LOG=${log_dir}/${log_filename_suffix}.$(date +%m%d%y_%H%M%S)
metadata_file_name="${METADATA_FILENAME}"
trigger_file_prefix=`basename ${metadata_file_name%.dat}`
trigger_file_name="${trigger_file_prefix}.indexing"


if [[ ! -d "${log_dir}" ]];then
mkdir -p "${log_dir}"
fi

#rm -rf ${index_dir}/*

if [[ ! -d "${index_dir}" ]];then
mkdir -p "${index_dir}"
fi

if [[ ! -d "${afp_dir}" ]];then
mkdir -p "${afp_dir}"
fi

log "**********************************************************************************"
log "********Script**started**at***$(date '+%m/%d/%Y %H:%M:%S')************************"
log "**********************************************************************************"

rm -rf ${index_dir}/*

if [ $? != 0 ]
then
log "Unable to delete the old index files. Indexing failed, so creating failed trigger"
> ${trigger_dir}/${trigger_file_prefix}.indexfailed
exit 1

else
log "Successfully deleted the old index files from the directory ${index_dir}"
fi

identifier=$2
declare -i i=1
declare -i outfilecount=0

#Fetches the index values for the identifier passed in the argument
grep $identifier Indexfields_1 > tempfile1
indexfieldsnumber=`awk 'BEGIN {FS=","} ; END{print NF}' tempfile1`
log "fields to be present in undex file are $indexfieldsnumber"
cat tempfile1


#Populates the fetched index values from previous step in an array.
declare -i j=1
declare -i k=0
while [[ $j -le $indexfieldsnumber ]] ; do
indexfieldname=`cut -d "," -f${j} tempfile1`
array[${k}]="$indexfieldname"
j=$j+1
k=$k+1

done
#Finished populating the index fields values for an identifier in the array.

declare -i outfilecount=0
declare -i numberoflinesread=0
declare -i linenumber=0 #debug purpose

while read line #read the metadata file
do

record="$line"
#record=$(echo "${record}" | tr -d '[[:space:]]')

declare -i mdfieldcount=0
declare -i arrayfieldnum=0

for fieldposition in "${array[@]}" #read the field name
        do

      #  groupfieldvalue=`echo ${line} | cut -d , -f${mdfieldcount}`

        #echo "fieldposition is $fieldposition and value is $groupfieldvalue"


        if [[ ${fieldposition} != ${2} ]]
        then
        groupfieldvalue=`echo ${line} | cut -d , -f${mdfieldcount}`
        groupfieldvalue=$(echo "${groupfieldvalue}" | tr -d '[[:space:]]')

#       if [[ $? != 0 ]]
#       then
#       log "unable to find the group field value for ${fieldposition}"
#       mv ${trigger_file_name} ${trigger_file_prefix}.failed
#       fi

                if [[ ${fieldposition} != "${DOCUMENT_NAME}" && ${fieldposition} != "${DOCUMENT_OFFSET}" && ${fieldposition} != "${DOCUMENT_LENGTH}" && ${fieldposition} != "${COMP_OFFSET}" && ${fieldposition} != "${COMP_LENGTH}" ]]
                then
                        echo  "GROUP_FIELD_NAME:${fieldposition}" >> ${index_dir}/afp${i}.ind
                        echo  "GROUP_FIELD_VALUE:${groupfieldvalue}" >> ${index_dir}/afp${i}.ind
                fi
        fi

        if [[ ${fieldposition} == "${DOCUMENT_NAME}" ]]
        then
        docname=${groupfieldvalue}
        docname="$(echo "$docname" | tr -d ' ')"
        fi

        if [[ ${fieldposition} == "${DOCUMENT_OFFSET}" ]]
        then
        docoff=${groupfieldvalue}
        fi

        if [[ ${fieldposition} == "${DOCUMENT_LENGTH}" ]]
        then
        doclen=${groupfieldvalue}
        fi

        if [[ ${fieldposition} == "${COMP_LENGTH}" ]]
        then
        complength=${groupfieldvalue}
        fi

        if [[ ${fieldposition} == "${COMP_OFFSET}" ]]
        then
        compoffset=${groupfieldvalue}
        fi

        filename="Decomp_${docname}_${compoffset}_${complength}.out"
        indexfilename="Decomp_${docname}_${compoffset}_${complength}.ind"
        filename=$(echo "${filename}" | tr -d '[[:space:]]')
        indexfilename=$(echo "${indexfilename}" | tr -d '[[:space:]]')
        currentfilename=$filename

        if [[ $previousfilename != $currentfilename ]]
        then
        newcompoffset=true

        fi

        mdfieldcount=${mdfieldcount}+1 #Increment the metadata field count to fetch the next value from the metadata file

        done

        echo "GROUP_OFFSET:${docoff}" >> ${index_dir}/afp${i}.ind
        echo "GROUP_LENGTH:${doclen}" >> ${index_dir}/afp${i}.ind
        echo "GROUP_FILENAME:${output_file_path}/${filename}" >> ${index_dir}/afp${i}.ind


        #debug purpose only

        if [[ $linenumber == 5000 ]]; then

        i=i+1
        linenumber=0
        echo  "CODEPAGE:850" >> ${index_dir}/afp${i}.ind

        fi


        #debug purpose only

       echo "finished processing for $linenumber"
       linenumber=linenumber+1


done < ${metadata_file_name}

log "removing the temp file containing the indexed fields"
rm -rf tempfile
rm -rf  ${index_dir}/afp*.ind

mv "${trigger_dir}/${trigger_file_prefix}.indexinprogress" "${trigger_dir}/${trigger_file_prefix}.indexed"

log "*************************************************************************************************"
log "********Script***completed**at***$(date '+%m/%d/%Y %H:%M:%S')*************************************"
log "*************************************************************************************************"


  3. The attempts at a solution (include all code and scripts):

Included.

  4. Complete Name of School (University), City (State), Country, Name of Professor, and Course Number (Link to Course):
    Utkal University, IND.

Note: Without school/professor/course information, you will be banned if you post here! You must complete the entire template (not just parts of it).

Please provide the information for #4 above - THANK YOU.

PS: you invoke ksh but seem to have some bash code in your example. It will not run.

Jim,

The code runs, but the performance is slow. The for loop inside the while is causing the issue. It would be great if you could provide an alternative approach to avoid this nested loop.

True, the code is not ksh. I guess there is something like

$ ls -l /usr/bin/ksh
... -> /bin/bash

#4 the School/University is still missing!

Some data samples might help. Wouldn't a performance / time profile make sense?
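
For example (just a sketch, using the shell's built-in SECONDS counter, which exists in both ksh93 and bash), timing the script as a whole and the main loop separately would show where the time actually goes:

$ time ./Create_Index.ksh <config_file> "ABC" 1

and, inside the script, around the main read loop:

SECONDS=0
while read line
do
        : # existing loop body
done < ${metadata_file_name}
log "main read loop took ${SECONDS} seconds"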

It doesn't look to me like the loop itself is the problem. It's the creation of all those tiny files, all those external tr -d calls, and the >> re-opening the same file over and over and over.
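
To illustrate that point (a sketch only, not a drop-in replacement, since it ignores the switch to a new afp${i}.ind every 5000 lines): the whole loop can write through a single redirection instead of re-opening the index file on every echo, and the blanks can be stripped with a builtin expansion instead of an external tr:

while read line
do
        # builtin expansion strips blanks without forking tr for every field
        value=${line//[[:space:]]/}
        echo "GROUP_FIELD_VALUE:${value}"
done < ${metadata_file_name} > ${index_dir}/afp${i}.ind   # output file opened once for the whole loop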

Hi Corona688/Rudi C,

As per your suggestion, I have replaced the tr -d with sed to remove spaces, but I am still seeing the same performance. Could you please suggest some alternative solution to this problem?

Thanks

How about answering my questions?

I did not suggest that.

That's going to be the same or worse.

Try a shell builtin.

$ VAR="value with spaces"
$ echo ${VAR// }

valuewithspaces

$
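
Carrying that idea into the script above (a sketch only, assuming ksh93; bash would spell it read -a instead of read -A), each record can be split into an array once with read, so the inner for loop indexes that array instead of running cut and tr for every field:

while IFS=, read -A fields          # one split per record, no cut pipeline per field
do
        mdfieldcount=0
        for fieldposition in "${array[@]}"
        do
                groupfieldvalue=${fields[mdfieldcount]}
                groupfieldvalue=${groupfieldvalue//[[:space:]]/}   # builtin, replaces tr -d
                # ... same per-field logic as in the original script ...
                mdfieldcount=$((mdfieldcount+1))
        done
done < ${metadata_file_name}

That keeps the nested loop, but it removes every fork/exec from the loop body, which is usually where the time goes with 100k records.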