Intermittent "cp: cannot stat" error with nested loop

I have a bash script that has been running (on SUSE 9.3) dozens of times over the past couple of years without error. Recently it has been hitting intermittent "cp: cannot stat FILE: No such file or directory" errors.

The script has nested loops that continuously process files in a directory until the end condition is met. There are usually between 50 and 2000 files processed per execution. About 60% of the files processed hit a condition that requires the file to be copied to a new name so the new file can be processed in a future iteration. After the file is processed it is moved to a processed directory.

The error has never occurred more than once per execution. Most of the time I can just re-run the script with the same data set and it works fine.

The strange thing is, according to the output log, the file it is complaining about did exist in the source directory and was moved to the processed directory. Another strange thing is the location of the "cannot stat" error in the output log seems to be random. Sometimes it appears in the middle of the output for the next file or even four or five files later.

Below is a distilled version of the code.

  Function_A() {
      local ID=$1
      local SUB=$2

      if [ $CURRENT_COUNT -lt $MAX_COUNT ]
      then
          nohup ChildProcess $ID $SUB > "$LOG" 2>&1 &

          let CURRENT_COUNT++
          RC=0
      else
          echo "Cannot start another process right now"
          RC=1
      fi

      return $RC
  } # End Function_A
   
  ##########################
  # Main
   
  #
  # some unrelated detail here...
  #
   
  cd $DIR
   
  WaitingForChildrenToFinish=1
   
  while [ $WaitingForChildrenToFinish -eq 1 ]
  do
      # Process each file
      for FILE in `ls -tr ${MGR_ID}_*.msg 2>/dev/null`
      do
          echo -e "\nProcessing file: \"$FILE\""

          Function_A $MGR_ID $SUBJECT
          RC=$?

          if [ $RC -ne 0 ]
          then
              RETRY="$DIR/${MGR_ID}_${RETRY}_${SUBJECT}.msg"
              cp -p $FILE $RETRY
          fi

          # Finished processing this file so move it out
          mv $FILE $PROCESSED_DIR

          #
          # some unrelated detail here...
          #

          # End condition
          if [ $CO_END_FLAG -eq 1 ]
          then
              WaitingForChildrenToFinish=0
              break
          fi
      done # End for each file

      if [ $WaitingForChildrenToFinish -eq 1 ]
      then
          echo "No files to process so sleep..."
          sleep 5
      fi
  done # End WaitingForChildrenToFinish
   
   
Sample output looks like this:

  ...
   
  Processing file: "00wm4793_AAA_111.msg"
  Cannot start another process right now
   
  Processing file: "00wm4793_AAA_112.msg"
  cp: cannot stat `00wm4793_AAA_111.msg': No such file or directory
  Cannot start another process right now
   
  ...

Any help will be appreciated. Thanks.

Might be here:

  RETRY="$DIR/${MGR_ID}_${RETRY}_${SUBJECT}.msg"
  cp -p $FILE $RETRY

On the right-hand side of the assignment to RETRY, you have ${RETRY}. Perhaps this should be just the string RETRY? As written, each pass embeds the previous value in the new one, so the more times the assignment executes, the longer and more garbled the value of ${RETRY} becomes.
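To see the effect in isolation, here is a minimal sketch that runs the same assignment three times. The values of DIR, MGR_ID, SUBJECT and the starting value of RETRY are made up for illustration, and the names array exists only to record each result.

```shell
#!/bin/bash
# Hypothetical values, chosen only for illustration.
DIR=/tmp/demo
MGR_ID=00wm4793
SUBJECT=111
RETRY=5

names=()
for pass in 1 2 3
do
    # Same assignment as in the posted loop: the old value of RETRY
    # is embedded in the new one on every pass.
    RETRY="$DIR/${MGR_ID}_${RETRY}_${SUBJECT}.msg"
    names[${#names[@]}]="$RETRY"
    echo "pass $pass: $RETRY"
done
```

The first pass gives a sensible name (/tmp/demo/00wm4793_5_111.msg), but from the second pass on the value contains the whole previous path, slashes included, so it no longer names a file directly under $DIR.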

Sorry, in my effort to cut out extraneous details I messed up that line.

It should be:

  RETRY="$DIR/${MGR_ID}_${RETRY_WAIT}_${SUBJECT}.msg"
  cp -p $FILE $RETRY
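With that line corrected, the cause is still open. One way to narrow it down might be to wrap the copy in a helper that checks the source first and tags every message with a timestamp and the script's PID. This is only a sketch: log_msg and safe_copy are hypothetical helpers, not part of the original script, and it assumes the message comes from the cp in the main loop rather than from ChildProcess or a second copy of the script running against the same directory.

```shell
#!/bin/bash
# Sketch only: log_msg and safe_copy are hypothetical helpers.

# Prefix each message with a timestamp and the PID of the script, so that
# out-of-order lines in a shared log can be attributed to a process.
log_msg() {
    echo "$(date '+%H:%M:%S') [$$] $*"
}

# Copy SRC to DEST only if SRC still exists.
# Returns 1 (and logs to stderr) if SRC has already disappeared.
safe_copy() {
    local src=$1
    local dest=$2

    if [ ! -e "$src" ]
    then
        log_msg "source vanished before copy: $src" >&2
        return 1
    fi

    # Quoted so that unusual file names are passed through intact.
    cp -p "$src" "$dest"
}
```

In the loop this would replace the bare cp, for example safe_copy "$FILE" "$RETRY". If the tagged message never shows up while the plain "cp: cannot stat" line still does, that would suggest the failing cp is being run by some other process.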