Comparison and For Loop Taking Too Long

hanie123 · July 7, 2009, 11:54am

I'd like to check about 10,000 pnt files, each containing a single record, in the /$ROOTDIR/scp/inbox/string1 directory against 39 bad pnt files in the /$ROOTDIR/output/tma/pnt/bad/string1 directory. The comparison is based on the fam_id column value, which is at positions 38 to 47 of the record. Here is an example of a record from the files in both directories:
PNT0220060503081122003700100000091049000005629001005146417001407712SFirstname Lastname
If the fam_id matches, move the current file from the /$ROOTDIR/scp/inbox/string1 directory into the /$ROOTDIR/output/tma/pnt/bad/string1 directory.
If not, continue with the normal process.
The code below works, but it takes more than 2 hours to complete the comparison. Please advise if there is a better way to rewrite or improve the comparison so that it runs faster. Thanks.

pntcnt1=`ls -l /$ROOTDIR/scp/inbox/string1 | grep 'PNT.*' | wc -l`
if [[ $pntcnt1 -gt 0 ]] then
 
for gfile in `ls -1 /$ROOTDIR/scp/inbox/string1/PNT.2*`
 do
   gline=`sed '1q' $gfile`
   x=`echo "$gline" | awk '{ print substr( $0, 38, 9 ) }'`
   for bfile in `ls -1 /$ROOTDIR/output/tma/pnt/bad/string1/PNT.2*`
    do
      bline=`sed '1q' $bfile`
      y=`echo "$bline" | awk '{ print substr( $0, 38, 9 ) }'`
      if [ "$x" -eq "$y" ]
      then
        echo "file moved $gfile"
        mv -f $gfile /$ROOTDIR/output/tma/pnt/bad/string1
        break
      fi
    done
 done
fi

otheus · July 10, 2009, 7:50am

There is room for improvement, but I'm not sure how much faster it will get. With this approach you still need a double loop. A different approach is described further below.

# pntcnt1=`ls -l /$ROOTDIR/scp/inbox/string1 | grep 'PNT.*' | wc -l`
# if [[ $pntcnt1 -gt 0 ]] then
## both replaced with a find piped into a while loop:
find /$ROOTDIR/scp/inbox/string1/ -name "*PNT.2*" -print |
while read gfile
 do
   # gline=`sed '1q' $gfile` # no longer needed here; awk does it all
   x=`awk 'FNR==1 { print substr( $0, 38, 9 ); exit }' "$gfile"`

   # for bfile in `ls -1 /$ROOTDIR/output/tma/pnt/bad/string1/PNT.2*`
   find /$ROOTDIR/output/tma/pnt/bad/string1/ -name "*PNT.2*" -print |
   while read bfile
    do
      # let awk do the string comparison; exit status 0 means a match,
      # and an empty file counts as no match
      if awk -v x="$x" 'FNR==1 { if (x == substr( $0, 38, 9 )) found = 1; exit } END { exit !found }' "$bfile"
      then
         echo "file moved $gfile"
         mv -f "$gfile" /$ROOTDIR/output/tma/pnt/bad/string1
         break
      fi
   done
done
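
As a quick sanity check of the offsets used in those awk calls, here is a minimal sketch that pulls the key out of the sample record from the question. The variable names (record, key_awk, key_cut) are only for illustration. Both awk and cut count columns from 1, so substr($0, 38, 9) and cut -c38-46 should select the same nine characters.

```shell
#!/bin/sh
# Sample record copied from the question; "record" is an illustrative name.
record='PNT0220060503081122003700100000091049000005629001005146417001407712SFirstname Lastname'

# awk: 9 characters starting at column 38 (awk counts from 1)
key_awk=$(printf '%s\n' "$record" | awk '{ print substr($0, 38, 9) }')

# cut: columns 38 through 46, which are the same 9 characters
key_cut=$(printf '%s\n' "$record" | cut -c38-46)

echo "awk: $key_awk"
echo "cut: $key_cut"
```

For this record both lines show 000005629. Note that this is nine characters (columns 38 through 46), one fewer than the "38 to 47" range mentioned in the question, so it is worth confirming which width the fam_id really has.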

The other method is memory-intensive: you go through the first directory and build up a lookup table of filename-string pairs; then you go through the second directory and compare each file's first row to your entries. It can be done in awk, but here's how to do it in perl:

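#!/usr/bin/perl -w
# usage: give the two directory names on the command line
$dir1 = shift; # the first dir name (the inbox)
$dir2 = shift; # the second dir name (the bad files)
die "Usage: $0 dir1 dir2\n" unless defined $dir1 && defined $dir2;

opendir(D1,$dir1) || die "Cannot open $dir1: $!";
opendir(D2,$dir2) || die "Cannot open $dir2: $!";

# read record snippets from dir1; several files may share one snippet
while ( defined($file1=readdir(D1)) ) {
   next unless $file1 =~ /PNT\.2/;
   unless ( open(FILE,"<","$dir1/$file1") ) { warn "Could not open $dir1/$file1, skipping: $!"; next; }
   $line=<FILE>;
   close FILE;
   next unless defined $line && length($line) >= 46; # skip empty or short files
   push @{ $X{ substr($line,37,9) } }, $file1;
}
closedir D1;

# compare to files in dir2
while ( defined($file2=readdir(D2)) ) {
   next unless $file2 =~ /PNT\.2/;
   unless ( open(FILE,"<","$dir2/$file2") ) { warn "Could not open $dir2/$file2, skipping: $!"; next; }
   $line=<FILE>;
   close FILE;
   next unless defined $line && length($line) >= 46;
   $y=substr($line,37,9);
   if (exists $X{ $y }) {
      print "mv -f $dir1/$_ $dir2\n" for @{ $X{$y} };
      delete $X{$y};
   }
}
closedir D2;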

That perl code is untested. It prints the mv commands rather than executing them. Once you have checked that the output is right, replace the last "print" with "system". File names containing spaces or other unusual characters might not work in this case. The substr...37 isn't a mistake: perl starts counting string positions at 0, while awk starts at 1.
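
The lookup-table method can also be written with awk, as mentioned above. What follows is a rough sketch rather than a tested solution: find_matches and the "--inbox--" marker line are made-up names for illustration. It loads the keys of the bad files into an awk array (only 39 entries), then reads the first line of each inbox file once and prints a mv command for every match, so every file is opened a single time by one awk process. Like the perl version, it assumes file names without newlines and may mishandle names containing spaces.

```shell
#!/bin/sh
# Sketch: print mv commands for inbox files whose key matches a bad file.
# find_matches is a hypothetical helper; $1 is the inbox dir, $2 the bad dir.
find_matches() {
    inbox=$1
    bad=$2
    # List the bad files first, then a marker line, then the inbox files,
    # and let a single awk process read the first line of each file.
    {
        find "$bad" -type f -name '*PNT.2*' -print
        echo '--inbox--'
        find "$inbox" -type f -name '*PNT.2*' -print
    } | awk -v dest="$bad" '
        $0 == "--inbox--" { in_inbox = 1; next }
        {
            file = $0
            # Read only the first record; skip empty or too-short files.
            if ((getline line < file) > 0 && length(line) >= 46) {
                key = substr(line, 38, 9)
                if (!in_inbox)
                    badkey[key] = 1              # remember keys of bad files
                else if (key in badkey)
                    print "mv -f " file " " dest # inbox file matches a bad key
            }
            close(file)
        }'
}
```

A possible way to use it: run find_matches "/$ROOTDIR/scp/inbox/string1" "/$ROOTDIR/output/tma/pnt/bad/string1", check the printed commands, and pipe the output to sh once it looks right.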