Find Unread Files

Hi

I have a requirement to read only the unread files from a directory and load them into a database.

Scenario: I receive a bunch of files in my Unix directory every 15 minutes. My ETL process runs once a day, reads the files, and loads them into a db table. I cannot move these files to a different location after extraction because the source system FTPs the previous 15 days' worth of files, so I would receive the files again if a file is missing. So I have to keep the files for at least 15 days.

Could you please advise how we can write a script to read only the unread files?

There are two options.

  1. Rename the file after it is loaded to the db with a suffix, say .processed. Use this suffix to identify already-loaded files (see the sketch after this list).

  2. Create a list file containing the names of files that have been processed (i.e., loaded to the db). When you load a file to the db, make an entry in this list file with the name of the file loaded.
    Now, use this list file to identify read and unread files.
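
A minimal sketch of option 1, with load_to_db as a hypothetical stand-in for your loader:

for file in *.csv                # *.csv no longer matches renamed *.processed files
do
    load_to_db "$file" && mv "$file" "$file.processed"
done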

Thanks Krish

The 1st option may not be possible because if I rename the file, I will receive the same file again from the source, as they FTP 15 days' worth of files.

2nd option: I don't understand how we identify processed and not-processed files (a coding difficulty, as I am not very good at Unix).

I've got one approach:
First, I create a file (FILE.ALL) listing all the files.
Second, I find the difference between FILE.ALL and FILE.BACKUP and write it to FILE.LIST.
After processing all the files in FILE.LIST, I append the list from FILE.LIST to the FILE.BACKUP file.
Seems so far so good. Now I need to remove the filenames from FILE.BACKUP that are older than 15 days.

The filename has the date in it, e.g., Filename_093013_xxxx.csv.

Could someone advise whether this is a good approach? And how do we remove filenames from the FILE.BACKUP file by comparing dates?
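
A rough sketch of the pruning step, assuming the second underscore-separated field is always MMDDYY (e.g. 093013 = Sep 30, 2013) and that GNU date is available for the cutoff arithmetic:

#!/bin/ksh
# Drop FILE.BACKUP entries whose embedded MMDDYY date is older than 15 days.
cutoff=$(date -d '15 days ago' +%Y%m%d)
while read -r fname
do
    d=${fname#*_}; d=${d%%_*}                  # e.g. 093013
    mm=$(printf '%s' "$d" | cut -c1-2)
    dd=$(printf '%s' "$d" | cut -c3-4)
    yy=$(printf '%s' "$d" | cut -c5-6)
    # keep only names dated within the last 15 days
    [ "20$yy$mm$dd" -ge "$cutoff" ] && printf '%s\n' "$fname"
done < FILE.BACKUP > FILE.BACKUP.new
mv FILE.BACKUP.new FILE.BACKUP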

Empty the processed files (maybe back them up somewhere else first), so the names persist and keep ftp from copying them again.
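
A minimal sketch of that idea, with /backup/dir as a hypothetical backup location:

cp "$file" /backup/dir/ && : > "$file"   # keep the name, drop the contents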

Thanks Rudi,

I don't want to receive the files again; that is the reason I keep all the processed files in the same directory location. I do a housekeeping activity on files received more than 15 days ago.

If I understand you correctly, you have a Unix box to which another node is ftp'ing files regularly. You have a process on this Unix box which needs to read these files, but not the ones already processed. I assume that you can process these files in the chronological order in which they are received? If so, here's another option...

At the end of your processing job you put the command:

date > timestamp

to create a file called "timestamp" at the time the process runs.

At the start of the job you put:

find * -newer timestamp <other switches, whatever>

to only select files created (ftp'd onto the box) since the last run finished.

That way, all the historical files can be left in the directory and not be selected for processing.

The above assumes that I have completely understood you but, if not, do post back the issues.
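
A sketch of how the two pieces might fit together; ETL_PROCESS.sh is a stand-in name for the load step, and the *.csv pattern is an assumption:

#!/bin/ksh
cd /directory/with/files || exit 1
stamp=$HOME/timestamp
[ -f "$stamp" ] || touch -t 197001010000 "$stamp"   # first run: process everything
find . -type f -name '*.csv' -newer "$stamp" |
while read -r file
do
    ETL_PROCESS.sh "$file"
done
date > "$stamp"                  # mark the end of this run

Stamping after the loop means a file that arrives mid-run is older than the new stamp and could be missed; stamping before the find trades that for a small chance of processing a file twice.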


Each file has three timestamps:

[a] access (read the file's contents) - atime
[b] change the status (modify the file or its attributes) - ctime
[c] modify (change the file's contents) - mtime

If your files are read-only, you can compare the atime and the mtime.
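
A small sketch of that comparison, assuming GNU stat and a filesystem that still tracks atime (i.e., not mounted noatime):

for f in *.csv
do
    atime=$(stat -c %X "$f")     # last access, epoch seconds
    mtime=$(stat -c %Y "$f")     # last modification, epoch seconds
    if [ "$atime" -le "$mtime" ]
    then
        echo "unread: $f"        # never read since it was last written
    fi
done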

Here you go.

lastfilename=`cat "$HOME/lastfilename.txt"`

# list files newer than the last one processed
find * -newer "$lastfilename" > "$HOME/listoffilestoprocess"

while read -r line
do
    ETL_PROCESS.sh "$line"
    echo "$line" > "$HOME/lastfilename.txt"   # remember the last file processed
done < "$HOME/listoffilestoprocess"

rm -f "$HOME/listoffilestoprocess"

This is a good first cut, but there are a couple of problems here:

  1. If there are enough files in the directory, the expansion of * may overflow ARG_MAX limits on your system.
  2. The list returned by find will not be sorted by timestamp, so there is no guarantee that the last file processed by this script will be the newest file. If it isn't, the next time you run the script some files will be processed again.

I think the following script will get around those problems:

#!/bin/ksh
lastfile="$HOME/lastfilename.txt"
if [ -f "$lastfile" ]
then    read -r newest < "$lastfile"
else    newest=""
fi
ls -rt|( 
        if [ -n "$newest" ]
        then    # lastfile was not empty.  Skip over files older than the file
                # named in lastfile.
                while read -r file
                do      if [ "$file" = "$newest" ]
                        then    break
                        fi
                done
        fi
        # Process all files newer than the one previously listed in last file
        # (or all files in the directory if lastfile didn't exist or was empty).
        while read -r file
        do      # Process newer files in order from oldest to newest...
                ETL_PROCESS.sh "$file"
                # The script should abort here if ETL_PROCESS.sh failed...
                # Record the last file processed.
                printf "%s\n" "$file" > "$lastfile"
        done
)

But, if someone edits the last file processed in this directory after more files are added, this script (and the original script) will ignore the files added between the last run of the script and the time of that edit. If that is a concern, the following may be a safer approach:

#!/bin/ksh
processed="$HOME/processed.txt"
# If the list of already processed files does not exist, create an empty list.
if [ ! -f "$processed" ] 
then    touch "$processed"
fi
ls -rt | grep -vF -f "$processed" | while read -r file
# Process all files that haven't already been processed...
do      # Process newer files in order from oldest to newest...
        ETL_PROCESS.sh "$file"
        # This script should skip the next step if ETL_PROCESS.sh failed.
        # Add current file to the list of processed files.
        printf "%s\n" "$file" >> "$processed"
done

It keeps a list of files processed and skips any file in that list when the script is run again later. It doesn't care about timestamps other than the fact that it will hand ETL_PROCESS.sh unprocessed files in order from the oldest to the newest.

Note, however, that this script can fail if the directory containing files to be processed contains a file name that is a substring of another file's name. You haven't given us any indication of how files are named, so if this is a concern, the grep command in the pipeline in this script would have to be adjusted to account for the actual filenames you'll be using (one possible adjustment is sketched below). And, of course, the list of processed files should be edited to remove old files when they are removed from the directory.
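
One possible adjustment is grep's (POSIX) -x option, which makes each processed name match only a whole line, so a short name can no longer hide a longer name that contains it:

ls -rt | grep -vFx -f "$processed" | while read -r file
do      ETL_PROCESS.sh "$file"
        printf "%s\n" "$file" >> "$processed"
done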

Assuming that ETL_PROCESS.sh provides some indication that it successfully processed a file, all of these scripts should verify that a file was processed successfully before continuing with later files. The first two scripts should exit and not process any newer files until the problem is fixed or some files may never be processed. The last script above only needs to avoid adding the failed file to the list of processed files (unless ETL_PROCESS.sh has to process input files in the order in which they were received).

Both of these scripts were written and tested using ksh, but there is nothing here that is ksh specific as long as you're using a shell that recognizes basic POSIX shell syntax requirements (such as bash and ksh).

Hope this helps...

Don C. provided, IMO, the best answer. It requires no extra files. It also works when the read program has issues and fails. It keeps the filenames unchanged. The files are not deleted after 15 days; you have to script that as well.
I assume you use ksh -> #!/bin/ksh runs the script under ksh.

# "read_process" is your code or shell script to "read" the file
#    hopefully read_process returns failure when it fails
#!/bin/ksh
cd /directory/with/files
ls | while read fname   # get the name of every file in the directory
do
   if [ -s $fname ] ; then    # file has data in it?  it is not empty?
      read_process $fname   # not empty: run read_process
      if [ $? -eq ] ; then      # read_process ran ok?
         > $fname               # read_process worked make the file zero length (empty)
      fi
   fi
done

This script should be run once a week or maybe every day, as you decide. Do not change the 16 to a 15 or you will have problems - I'm not going into it fully, but -mtime counts 24-hour (86400-second) periods into the past, not calendar days. I assume you want an email and have email on your UNIX box.

#!/bin/ksh
# assume that a file could keep failing on the read_process, so we keep it
find /directory_path_to_files -type f -mtime +16 -size 0 -exec rm {} \;
find /directory_path_to_files -type f -mtime +16 -size +0c > t.lis
if [ -s t.lis ] ; then
   uuencode t.lis t.txt | /usr/bin/mailx -s 'you missed processing some files' you@yourcompany.com
fi