Help with Slow File Processing

Hello,

Hope you are doing fine. Let me describe the problem: I have a script that calls another script, K2Test.sh (created by another team). K2Test.sh takes a date as an argument and generates approximately 1365 files in the localcurves directory for the given date.

Out of these 1365 files I am only interested in 133, so I have created a list of the file names we need to process (ZEROCURVEFILES, below).

I loop through these 1365 files (`ls $localcurves`, below) and check whether each file name is in the 133-file list (ZEROCURVEFILES); if it is, I process the file by reading it line by line.

It seems to take too long just to process the 133 files. Am I using some inefficient code below? Is there a way to process them faster? Is it slow because I open and read the 133 files line by line?

I need to run this script for 400 days, which means I would be looping 400 * 1365 times in total, i.e. once per file per day, and for each day processing 133 files.

I would really appreciate any help to make it faster. Here is the code. I know it is a lot of code; please let me know if anything in the script stands out.

#!/bin/sh
#e.g. 20110627 (june 27 2011)
currdate=$1
#e.g. 20100310 (march 10 2010)
enddate=$2

#directory where 1365 files get generated
localcurves="/home/sratta/feds/localCurves/curves"
outputdir="/home/sratta/curves"
#output fileto be generated
OUTFILE="/home/sratta/ZeroCurves/BulkLoad.csv"
touch $OUTFILE

# List of 133 curve file names
ZEROCURVEFILES="saud1-monthlinmid \
saud6-monthlinmid \
.....
suvruvr_usdlinmid \
szarzar_usdlinmid "

#Loop until currdate is not equal to enddate (reverse loop)
while [ $currdate -ne $enddate ]
do

  #Call K2test.sh which generates 1365 files for a given date in $localcurves directory
 ./K2test.sh $currdate
 filesfound=0

#Loop through the 1365 files generated by K2test.sh in $localcurves directory
 for FILE in `ls $localcurves`
 do
  filesfound=1
  #Check if the filename is one of the 133 files we want?  If it is only then process otherwise ignore
  zerocurvefile=`echo $ZEROCURVEFILES | grep $FILE`

  # If file is in the list then process it
   if [ "$zerocurvefile" != "" ]
   then
    echo "Processing $LOWERCASEFILE.$currdate file"

  #THIS PROCESSING IS SLOW LINE BY LINE
   exec 3<&0
  #Open the file
   exec 0<"$localcurves/$FILE"
   cnt=0
   rowstoprocess=0
  #Read file line by line
   while read line
   do
    cnt=`expr $cnt + 1`
    # First line in file contains number of records to process
    if [ "$cnt" -eq "1" ]
    then
     numheadrecords=`echo $line | awk '{FS=""}{print $1}'`
     rowstoprocess=`expr $numheadrecords + 2`
     echo "Total Number of Rows in header for $LOWERCASEFILE.$currdate is: $numheadrecords"
    fi
    
    if [ "$cnt" -gt "1" ] && [ "$cnt" -lt "$rowstoprocess" ]
    then
     julianmdate=`echo $line | awk '{FS=" "}{print $1}'`
     rate=`echo $line | awk '{FS=" "}{print $2}'`
     mdate=`echo $line | awk '{FS=" "}{print $4}'`
     # extract certain columns and put the data into out file
     echo "$LOWERCASEFILE,$currdate,$julianmdate,$rate,$mdate" >> $OUTFILE
    fi
    
   # If we have processed number of records as in first line then break the loop
    if [ "$cnt" -eq "$rowstoprocess" ]
    then
     break
    fi
   done
   exec 0<&3
  fi
 done
 
#Subtract 1 day from currdate (reverse loop)
 currdate=`./shift_date $currdate -1`
done

What Operating System and version are you running?
What Shell is /bin/sh on your computer?
How many lines are processed from the 133 files? Is it definitely not the whole of each file?
Does the script work?

What are these lines for? Is there a local reason for these complex redirects?

There is great scope for efficiency gains in this script, but let's get a feel for the environment and the size of the data files first.

Hi methyl,

Thanks for looking at my post, I really appreciate it. I am new to Unix scripting, so I definitely need guidance. Please see my answers:

What Operating System and version are you running? It is Sun Solaris.

What Shell is /bin/sh on your computer? How do I tell? I just know I am using sh.

How many lines are processed from the 133 files? Is it definitely not the whole of each file? Each file has the number of records on its very first line; I read that and process that many rows. It can be anywhere from 10 to 200.

Does the script work? Yes, the script works, but each file takes approximately 4 seconds to process, and the 133 files take 523 seconds, which is almost 9 minutes for one day. And I have to process 400 days :frowning: which would take about 58 hours :frowning:

What are these lines for? Is there a local reason for these complex redirects? I copied them from a colleague, so if you think there is no reason for these redirections I would appreciate your guidance.

Just a snippet. If your shell accepts it, try changing all the single square brackets to double square brackets. For example:

while [ $currdate -ne $enddate ]
to
while [[ $currdate -ne $enddate ]]

Try this version of your script (not tested):

#!/bin/sh
#e.g. 20110627 (june 27 2011)
currdate=$1
#e.g. 20100310 (march 10 2010)
enddate=$2

#directory where 1365 files get generated
localcurves="/home/sratta/feds/localCurves/curves"
outputdir="/home/sratta/curves"
#output fileto be generated
OUTFILE="/home/sratta/ZeroCurves/BulkLoad.csv"
touch $OUTFILE

# List of 133 curve file names
ZEROCURVEFILES="saud1-monthlinmid \
saud6-monthlinmid \
.....
suvruvr_usdlinmid \
szarzar_usdlinmid "

#Loop until currdate is not equal to enddate (reverse loop)
while [ $currdate -ne $enddate ]
do

  #Call K2test.sh which generates 1365 files for a given date in $localcurves directory
 ./K2test.sh $currdate
 filesfound=0

#Loop through the 1365 files generated by K2test.sh in $localcurves directory
 for FILE in `cd $localcurves; ls $ZEROCURVEFILES 2>/dev/null`
 do
  filesfound=1
  echo "Processing $LOWERCASEFILE.$currdate file"

  awk '
    FNR==1 {
        numheadrecords = $1;
        rowstoprocess  = numheadrecords + 2;
        printf "Total Number of Rows in header for %s.%s is %s\n", LowFile, Date, numheadrecords;
        next;
    }
    FNR<rowstoprocess {
        julianmdate = $1;
        rate        = $2;
        mdate       = $4
        printf "%s,%s,%s,%s,%s\n", LowFile, Date, julianmdate, rate, mdate;
    }
  ' LowFile=$LOWERCASEFILE Date=$currdate $FILE
    
 done
 
#Subtract 1 day from currdate (reverse loop)
 currdate=`./shift_date $currdate -1`
done

Jean-Pierre.

michaelrozar17, I did put in double square brackets and I get a syntax error. What is this for? Do you want to know which shell it is?

Thanks Jean-Pierre, I will try it out and let you know.

---------- Post updated at 09:44 AM ---------- Previous update was at 09:38 AM ----------

Jean-Pierre, I am encountering a problem.

The 1365 files generated in the $localcurves directory have mixed-case names, e.g. sCADTierTwolinMid, but I need them in lower case.

As you can see, the list $ZEROCURVEFILES is all lower case, so when we do `ls $ZEROCURVEFILES` it will not find any of them. Is there a way to do a case-insensitive ls?

For those who are unaware of it ... on Solaris 10 and earlier, the default shell (/bin/sh) is the Bourne Shell.

Try and adapt this new version (not tested) :

#!/bin/sh
#e.g. 20110627 (june 27 2011)
currdate=$1
#e.g. 20100310 (march 10 2010)
enddate=$2

#directory where 1365 files get generated
localcurves="/home/sratta/feds/localCurves/curves"
outputdir="/home/sratta/curves"
#output fileto be generated
OUTFILE="/home/sratta/ZeroCurves/BulkLoad.csv"
touch $OUTFILE

# List of 133 curve file names
ZEROCURVEFILES="saud1-monthlinmid
saud6-monthlinmid
.....
suvruvr_usdlinmid
szarzar_usdlinmid"

echo "$ZEROCURVEFILES" > /tmp/zerocurvefiles.tmp

#Loop until currdate is not equal to enddate (reverse loop)
while [ $currdate -ne $enddate ]
do

  #Call K2test.sh which generates 1365 files for a given date in $localcurves directory
 ./K2test.sh $currdate
 filesfound=0

#Loop through the 1365 files generated by K2test.sh in $localcurves directory
 for FILE in `cd $localcurves; ls | /usr/xpg4/bin/grep -iwf /tmp/zerocurvefiles.tmp 2>/dev/null`
 do
  filesfound=1
  echo "Processing $LOWERCASEFILE.$currdate file"

  nawk '
    FNR==1 {
        numheadrecords = $1;
        rowstoprocess  = numheadrecords + 2;
        printf "Total Number of Rows in header for %s.%s is %s\n", LowFile, Date, numheadrecords;
        next;
    }
    FNR<rowstoprocess {
        julianmdate = $1;
        rate        = $2;
        mdate       = $4
        printf "%s,%s,%s,%s,%s\n", LowFile, Date, julianmdate, rate, mdate;
    }
  ' LowFile=$LOWERCASEFILE Date=$currdate $FILE
    
 done
 
#Subtract 1 day from currdate (reverse loop)
 currdate=`./shift_date $currdate -1`
done

Jean-Pierre.

(Late post - lost connection, may be out of context)

The version is in the output from the "uname -a" command. It should then be possible to look up whether your Solaris is an old one which has the old Bourne Shell for /bin/sh or a new one with the more modern Posix Shell.

1) The big inefficiency is using a Shell "read" to read records line-by-line from a data file, then using multiple "awk" runs to separate the fields.
I see now why you reassigned the channels: you were already using the Shell input channel to read the list of files.

I agree with the ideas behind "agiles" modifications.

2) As you have a list of required files, use that list.
I'd add a test to the script to check whether the file exists.
I see that "agiles" modification is ingenious because it allows for this by sending errors to /dev/null.

3) Invoke awk only once and use it to read the data from the files.
A lot of the inefficiency comes from the number of times the original script starts "awk" to process the same $line.

4) Hold the list of 133 files in a real file not an environment variable and use "while" rather than "for". Some Bourne shells will not let you have an environment variable that big.

5) Consider making a version of K2test.sh which only generates the relevant 133 files in /home/sratta/feds/localCurves/curves .

6) Noticed that the variable $LOWERCASEFILE is not set anywhere.

7) If you have a journalling filesystem it is inefficient to repeatedly create a batch of files then overwrite them with Shell. Depends whether K2test.sh removes old files before generating new files.
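To illustrate point 4, here is a minimal, self-contained sketch of driving the loop from a real list file with "while read" instead of "for" (the paths and file names here are made up for the demonstration; in the real script the list would hold the 133 curve names and $localcurves would be the K2test.sh output directory):

```shell
#!/bin/sh
# Self-contained demonstration: directory and names are made up
localcurves=`mktemp -d`
LISTFILE="$localcurves/zerocurvefiles.tmp"

# Pretend K2test.sh generated these files for today's date
touch "$localcurves/saud1-monthlinmid" "$localcurves/notwanted"

# The list of required names, one per line (saud6-monthlinmid
# is deliberately absent from the directory)
printf 'saud1-monthlinmid\nsaud6-monthlinmid\n' > "$LISTFILE"

# Loop over the wanted names only, instead of all generated files,
# skipping any name that was not generated for this date
while read FILE
do
  [ -f "$localcurves/$FILE" ] || continue
  echo "Processing $FILE"
done < "$LISTFILE"
```

This avoids holding the whole list in an environment variable (some old Bourne shells choke on very large ones) and never iterates over the ~1200 unwanted files at all.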

Hi Jean Pierre/Methyl,

Jean-Pierre, I see that when you are using awk you are using printf. I want the data elements to be sent to a file, not printed; how do I make the data elements go to $OUTFILE, where OUTFILE is a variable holding the name of the file?

I would appreciate your help.

Thanks
Regards

Check "agiles" next post, but I think this is enough for the redirect:
' LowFile=$LOWERCASEFILE Date=$currdate $FILE >> ${OUTFILE}
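In other words, the redirection goes after awk's closing quote and its trailing arguments, so the whole awk run appends to the file. A self-contained sketch with made-up data and names:

```shell
#!/bin/sh
# Self-contained sketch: data, file names and the File variable are made up
OUTFILE=`mktemp`
DATAFILE=`mktemp`
printf '40587 0.0525 x 20110627\n' > "$DATAFILE"

# The >> applies to awk as a whole, so every printf inside
# the awk script lands in $OUTFILE
awk '{ printf "%s,%s,%s,%s\n", File, $1, $2, $4 }' File=demo "$DATAFILE" >> "$OUTFILE"
```

Note that with this redirect any "Total Number of Rows ..." progress line printed inside the awk script would land in $OUTFILE too; piping such messages to `"cat 1>&2"` from within awk is one way to keep them on the terminal.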

However there are other problems:
e.g. there is no value in $LOWERCASEFILE.

It would be so much easier if $ZEROCURVEFILES was the name of a file containing a list of the required files with their correct names. This could be created from a "ls -1" report and deleting the ones you don't want. It could equally be created using a "here document" within the script.
Translating the mixed upper-and-lower filename to lower case is a trivial task for the unix "tr" command. Working from a lower case list is proving to be not trivial.
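For reference, a sketch of that tr translation, using a file name from this thread (the bracketed ranges also work in the old Bourne shell, as long as they are quoted so the shell does not expand them as globs):

```shell
#!/bin/sh
# Example name taken from an earlier post in this thread
FILE="sCADTierTwolinMid"

# Quote the ranges so the shell does not treat them as glob patterns
LOWERCASEFILE=`echo "$FILE" | tr '[A-Z]' '[a-z]'`
echo "$LOWERCASEFILE"
```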

Any chance you can let us know your version of Solaris?

Thanks Jean Pierre and Methyl,

The awk command you sent will only run for the number of records given in the header, correct? And for the lines following the first line, does it use space as the delimiter to extract fields? I don't see any mention of the separator being space.
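(For what it's worth, awk's default field separator, when FS is left untouched, is whitespace: each line is split on runs of spaces and tabs, which is why the scripts above need no explicit FS setting. A quick check with a made-up line in the same four-column shape:)

```shell
#!/bin/sh
# With FS at its default, awk splits on runs of blanks,
# so $1, $2 and $4 pick out the first, second and fourth columns
echo "40587  0.0525  x  20110627" | awk '{ print $1 "," $2 "," $4 }'
```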

I will modify $ZEROCURVEFILES so that it has the exact names. After doing that, how do I loop through only those files? Basically I have to somehow use the ls command, giving it this list, and then inside the loop I can change the file name to lower case using tr like you mentioned.

I will let you know the version of Solaris when I reach work. I appreciate your help.

Thanks
Regards