Alternative solution to nested loops in shell programming

Use and complete the template provided. The entire template must be completed. If you don't, your post may be deleted!

  1. The problem statement, all variables and given/known data:

Hi,

The problem statement is: I am trying to read a flat file line by line using a while loop. The flat file will contain 100k records, and each record will have 25 columns. While reading each line, I have to read some values from an array and create a map between the array values and the fields extracted from that line. I tried using a for loop inside the while loop, but that kills the performance. I would like to know of any alternative approach that avoids the nested loops. Any help would be greatly appreciated.

  2. Relevant commands, code, scripts, algorithms:

Command to run the script:

Create_Index.ksh <config_file> "ABC" 1

Indexfields_1 will contain the values separated by "," for which the mapping needs to be created.

E.g.: "A","B","C","D", ... (25 fields like that)

#!/usr/bin/ksh

if [[ $# != 3 ]];then
        echo "Incorrect No .of aurguments sent to script"
        echo "Usage: Create_Index.ksh <config_file_name><table_identifier><segment_number> "
        echo "Insufficient parameters to continue execution. Exiting the $(basename ${0}) script with 1 at $(date)"
        exit 1
fi


config_file=${1}

if [ -s ${config_file} ]
then
        . ${config_file}
else
        log "Config file not found"
fi


#-------------------------------------
# function to log message to log file
#-------------------------------------
function log
{
        msg="$1"

        echo "== $(date '+%m/%d/%Y %H:%M:%S')  :${msg}" >>${IndexCreation_DAILY_LOG}
}

#-------------------------------------
# function ends
#------------------------------------


base_dir="${BASE_DIR}"
afp_dir="${AFP_DIR}"
index_dir="${INDEX_DIR}/$2/$2$3"
log_dir="${LOG_DIR}/$2/$2$3"
trigger_dir="${TRIGGER_DIR}/$2/$2$3"
log_filename_suffix="${LOG_FILENAME_SUFFIX}"
output_file_path="${OUTPUT_FILE_PATH}/$2$3"
IndexCreation_DAILY_LOG=${log_dir}/${log_filename_suffix}.$(date +%m%d%y_%H%M%S)
metadata_file_name="${METADATA_FILENAME}"
trigger_file_prefix=`basename ${metadata_file_name%.dat}`
trigger_file_name="${trigger_file_prefix}.indexing"


if [[ ! -d "${log_dir}" ]];then
mkdir -p "${log_dir}"
fi

#rm -rf ${index_dir}/*

if [[ ! -d "${index_dir}" ]];then
mkdir -p "${index_dir}"
fi

if [[ ! -d "${afp_dir}" ]];then
mkdir -p "${afp_dir}"
fi

log "**********************************************************************************"
log "********Script**started**at***$(date '+%m/%d/%Y %H:%M:%S')************************"
log "**********************************************************************************"

rm -rf ${index_dir}/*

if [ $? != 0 ]
then
log "Unable to delete the old index files. Indexing failed, so creating failed trigger"
> ${trigger_dir}/${trigger_file_prefix}.indexfailed
exit 1

else
log "Successfully deleted the old index files from the directory ${index_dir}"
fi

identifier=$2
declare -i i=1
declare -i outfilecount=0

#Fetches the index values for the identifier passed in the argument
grep $identifier Indexfields_1 > tempfile1
indexfieldsnumber=`awk 'BEGIN {FS=","} ; END{print NF}' tempfile1`
log "fields to be present in undex file are $indexfieldsnumber"
cat tempfile1


#Populates the fetched index values from previous step in an array.
declare -i j=1
declare -i k=0
while [[ $j -le $indexfieldsnumber ]] ; do
indexfieldname=`cut -d "," -f${j} tempfile1`
array[${k}]="$indexfieldname"
j=$j+1
k=$k+1

done
#Finished populating the index fields values for an identifier in the array.

declare -i outfilecount=0
declare -i numberoflinesread=0
declare -i linenumber=0 #debug purpose

while read line #read the metadata file
do

record="$line"
#record=$(echo "${record}" | tr -d '[[:space:]]')

declare -i mdfieldcount=0
declare -i arrayfieldnum=0

for fieldposition in "${array[@]}" #read the field name
        do

      #  groupfieldvalue=`echo ${line} | cut -d , -f${mdfieldcount}`

        #echo "fieldposition is $fieldposition and value is $groupfieldvalue"


        if [[ ${fieldposition} != ${2} ]]
        then
        groupfieldvalue=`echo ${line} | cut -d , -f${mdfieldcount}`
        groupfieldvalue=$(echo "${groupfieldvalue}" | tr -d '[[:space:]]')

#       if [[ $? != 0 ]]
#       then
#       log "unable to find the group field value for ${fieldposition}"
#       mv ${trigger_file_name} ${trigger_file_prefix}.failed
#       fi

                if [[ ${fieldposition} != "${DOCUMENT_NAME}" && ${fieldposition} != "${DOCUMENT_OFFSET}" && ${fieldposition} != "${DOCUMENT_LENGTH}" && ${fieldposition} != "${COMP_OFFSET}" && ${fieldposition} != "${COMP_LENGTH}" ]]
                then
                        echo  "GROUP_FIELD_NAME:${fieldposition}" >> ${index_dir}/afp${i}.ind
                        echo  "GROUP_FIELD_VALUE:${groupfieldvalue}" >> ${index_dir}/afp${i}.ind
                fi
        fi

        if [[ ${fieldposition} == "${DOCUMENT_NAME}" ]]
        then
        docname=${groupfieldvalue}
        docname="$(echo "$docname" | tr -d ' ')"
        fi

        if [[ ${fieldposition} == "${DOCUMENT_OFFSET}" ]]
        then
        docoff=${groupfieldvalue}
        fi

        if [[ ${fieldposition} == "${DOCUMENT_LENGTH}" ]]
        then
        doclen=${groupfieldvalue}
        fi

        if [[ ${fieldposition} == "${COMP_LENGTH}" ]]
        then
        complength=${groupfieldvalue}
        fi

        if [[ ${fieldposition} == "${COMP_OFFSET}" ]]
        then
        compoffset=${groupfieldvalue}
        fi

        filename="Decomp_${docname}_${compoffset}_${complength}.out"
        indexfilename="Decomp_${docname}_${compoffset}_${complength}.ind"
        filename=$(echo "${filename}" | tr -d '[[:space:]]')
        indexfilename=$(echo "${indexfilename}" | tr -d '[[:space:]]')
        currentfilename=$filename

        if [[ $previousfilename != $currentfilename ]]
        then
        newcompoffset=true

        fi

        mdfieldcount=${mdfieldcount}+1 #Increment the metadata field count to fetch the next value from the metadata file

        done

        echo "GROUP_OFFSET:${docoff}" >> ${index_dir}/afp${i}.ind
        echo "GROUP_LENGTH:${doclen}" >> ${index_dir}/afp${i}.ind
        echo "GROUP_FILENAME:${output_file_path}/${filename}" >> ${index_dir}/afp${i}.ind


        #debug purpose only

        if [[ $linenumber == 5000 ]]; then

        i=i+1
        linenumber=0
        echo  "CODEPAGE:850" >> ${index_dir}/afp${i}.ind

        fi


        #debug purpose only

       echo "finished processing for $linenumber"
       linenumber=linenumber+1


done < ${metadata_file_name}

log "removing the temp file containing the indexed fields"
rm -rf tempfile
rm -rf  ${index_dir}/afp*.ind

mv "${trigger_dir}/${trigger_file_prefix}.indexinprogress" "${trigger_dir}/${trigger_file_prefix}.indexed"

log "*************************************************************************************************"
log "********Script***completed**at***$(date '+%m/%d/%Y %H:%M:%S')*************************************"
log "*************************************************************************************************"


  3. The attempts at a solution (include all code and scripts):

Included.

  4. Complete Name of School (University), City (State), Country, Name of Professor, and Course Number (Link to Course):
    Utkal University, IND.

Note: Without school/professor/course information, you will be banned if you post here! You must complete the entire template (not just parts of it).

Please provide the information for #4 above - THANK YOU.

PS: you invoke ksh but seem to have some bash code in your example. It will not run.

Jim,

The code runs, but the performance is slow. The for loop inside the while is causing the issue. It would be great if you could provide an alternative approach to avoid this nested loop.

True, the code is not ksh. I guess there is something like

$ ls -l /usr/bin/ksh
... -> /bin/bash

#4 the School/University is still missing!

Some data samples might help. Wouldn't a performance / time profile make sense?
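
For example (just a sketch, using the shell's built-in SECONDS counter, which exists in both ksh93 and bash), timing the script as a whole and the main loop separately would show where the time actually goes:

$ time ./Create_Index.ksh <config_file> "ABC" 1

and, inside the script, around the main read loop:

SECONDS=0
while read line
do
        : # existing loop body
done < ${metadata_file_name}
log "main read loop took ${SECONDS} seconds"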

It doesn't look to me like the loop itself is the problem. It's the creation of all those tiny files, all those external tr -d calls, and the >> re-opening the same file over and over and over.
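
To illustrate that point (a sketch only, not a drop-in replacement, since it ignores the switch to a new afp${i}.ind every 5000 lines): the whole loop can write through a single redirection instead of re-opening the index file on every echo, and the blanks can be stripped with a builtin expansion instead of an external tr:

while read line
do
        # builtin expansion strips blanks without forking tr for every field
        value=${line//[[:space:]]/}
        echo "GROUP_FIELD_VALUE:${value}"
done < ${metadata_file_name} > ${index_dir}/afp${i}.ind   # output file opened once for the whole loop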

Hi Corona688/Rudi C,

As per your suggestion, I have replaced the tr -d with sed to remove spaces, but I am still seeing the same performance. Could you please suggest some alternative solution to this problem?

Thanks

How about answering my questions?

I did not suggest that.

That's going to be the same or worse.

Try a shell builtin.

$ VAR="value with spaces"
$ echo ${VAR// }

valuewithspaces

$
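
Carrying that idea into the script above (a sketch only, assuming ksh93; bash would spell it read -a instead of read -A), each record can be split into an array once with read, so the inner for loop indexes that array instead of running cut and tr for every field:

while IFS=, read -A fields          # one split per record, no cut pipeline per field
do
        mdfieldcount=0
        for fieldposition in "${array[@]}"
        do
                groupfieldvalue=${fields[mdfieldcount]}
                groupfieldvalue=${groupfieldvalue//[[:space:]]/}   # builtin, replaces tr -d
                # ... same per-field logic as in the original script ...
                mdfieldcount=$((mdfieldcount+1))
        done
done < ${metadata_file_name}

That keeps the nested loop, but it removes every fork/exec from the loop body, which is usually where the time goes with 100k records.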