Help with Archiving multiple files based on name and date

shankar1dada · August 1, 2011, 3:11pm

Dear Gurus,

I am a novice in shell scripts. I have a requirement where I need to move files every day from Current Folder to Archive folder.

Daily I will be receiving 5 files in the folder - /opt/data/feeds/.
The feeds folder has two sub-folders - Current and Archive.

For example, on the first day I receive files with names like:

File1_extract_08012011
File2_extract_08012011
File3_extract_08012011
File4_extract_08012011
File5_extract_08012011

The last 8 characters are the mmddyyyy.

The very first time, the Current folder will be empty, so I just move the files from
/opt/data/feeds/ to /opt/data/feeds/Current.

Again the second day, I will be receiving the following files:

File1_extract_08022011
File2_extract_08022011
File3_extract_08022011
File4_extract_08022011
File5_extract_08022011

When I receive the files, I should move the existing files from Current to Archive folder and then place the new files in the "Current" folder.

Also there is another condition where say for example on the third day, I receive only 3 out of 5.

File1_extract_08032011
File3_extract_08032011
File5_extract_08032011

In this case, the Current folder should keep File2 and File4 (from 08022011) and File1, 3, 5 (from 08032011). File1, 3, 5 (of 08022011) should be moved to Archive. The Current folder should always have the most recent date for each file.

Also the archive folder should have only the last 6 days for each file.

So every day the current folder will have 5 files and Archive folder will have 30 files.

Please help me on this.

Thanks in advance.
Shankar

DGPickett · August 1, 2011, 4:40pm

Usually, the trick is to put the files all in the archive, and as each comes in, link it to current without a date in the name, after removing any link from before.
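
A minimal sketch of that link idea, assuming the names end in _mmddyyyy with no extension and that Archive and Current sit on one filesystem (hard links need that). The temporary directory and the install_feed name are made up for the demo, not part of the thread:

```shell
#!/bin/sh
# Sketch only: the directory layout and install_feed are assumptions.
set -eu

feeds=$(mktemp -d)            # stand-in for /opt/data/feeds
mkdir "$feeds/Archive" "$feeds/Current"

# install_feed FILE: store the dated file in Archive, then point a
# dateless name in Current at it, replacing any link from before.
install_feed() {
    dated=$(basename "$1")
    # Strip a trailing _mmddyyyy to get the stable name, e.g. File1_extract
    stable=$(printf '%s\n' "$dated" | sed 's/_[0-9]\{8\}$//')
    mv "$1" "$feeds/Archive/$dated"
    rm -f "$feeds/Current/$stable"
    ln "$feeds/Archive/$dated" "$feeds/Current/$stable"
}

echo "day 1" > "$feeds/File1_extract_08012011"
install_feed "$feeds/File1_extract_08012011"
echo "day 2" > "$feeds/File1_extract_08022011"
install_feed "$feeds/File1_extract_08022011"

cat "$feeds/Current/File1_extract"    # prints: day 2
ls "$feeds/Archive"                   # both dated files are still here
```

With this layout nothing ever has to be moved from Current to Archive; only the link changes.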

How do you know the file is fully written and can be used? Discovery seems silly, when the producer could just install it.

shankar1dada · August 1, 2011, 4:54pm

These files are actually report outputs with .CSV extension. They are moved to this folder via application.

I am using something like below:

CURR_DIRECTORY=/opt/data/feeds/Current
ARCH_DIRECTORY=/opt/data/feeds/Archive

ls -1 *.xlsx > all_archive_files.txt
archive_list=${CURR_DIRECTORY}/all_archive_files.txt
echo archive_list: $archive_list
for archive_file in `cat $archive_list`
do
echo archive_file:$archive_file
echo 
cd ${ARCH_DIRECTORY}

if [ -f $archive_file ]; then
echo "This filename [$archive_file] exists"
echo "Move Unsuccessful :-("
else
echo "The filename [$archive_file] does not exist"
mv -f ${CURR_DIRECTORY}/$archive_file "${ARCH_DIRECTORY}"
echo "Move Successful :-)"
fi 
done

The above script just checks whether the file in Current directory is different from Archive directory and then does the move.
Daily you get files, and the last 8 characters of the file name will be the current system date.
I am struggling to check the file names with the file saved date and then archiving for 6 days.

Any help with the script is highly appreciated.

Thanks

DGPickett · August 1, 2011, 5:07pm

Well, if you list file names stripped of date and detect duplicates, you know who needs moving down. Don't overwork making file lists, env var and pipes are fine:

cd ${CURR_DIRECTORY}
to_move=$(
 ls | sed 's/[0-9]*$//' | sort | uniq -d | while read p
  do
    ls -tr $p* | sed '$d'
  done
 )
if [ "$to_move" != "" ]
then
 mv $to_move ${ARCH_DIRECTORY}
fi
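
To see what the duplicate detection in that pipeline produces, the first stages can be run on a fixed list of names. This is only an illustration; the sample names are assumed to end in the date with no extension, and nothing is moved:

```shell
#!/bin/sh
# Sample names in the style of the thread; no files are touched.
names='File1_extract_08022011
File1_extract_08032011
File2_extract_08022011
File3_extract_08022011
File3_extract_08032011'

# Strip the trailing digits, then keep only the prefixes seen more than once.
dups=$(printf '%s\n' "$names" | sed 's/[0-9]*$//' | sort | uniq -d)
printf '%s\n' "$dups"
# prints:
# File1_extract_
# File3_extract_
```

Only File1 and File3 have two dates present, so only those prefixes need an older copy moved out.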

Shell_Life · August 1, 2011, 5:22pm

See if this works for you:

#!/usr/bin/ksh

mNew='/opt/data/feeds/'
mCurrent='/opt/data/feeds/Current/'
mArchive='/opt/data/feeds/Archive/'

#
# Removing 5 files from 5 days ago in Archive:
#
typeset -i mCnt=31
ls -1at ${mArchive}File?_extract_* | while read mFName; do
  mCnt=${mCnt}-1
  if [[ ${mCnt} -le 5 ]]; then
    echo "Now removing <$mFName>."
    rm -f ${mFName}
    if [[ ${mCnt} -eq 1 ]]; then
      break
    fi
  fi
done

#
# Moving existing Current files to Archive:
#
mv ${mCurrent}File?_extract_* ${mArchive}

#
# Moving new files to Current and
# make sure all 5 files are there:
#
mMMDDYYYY=$(date +"%m%d%Y")
mCnt=1
while [[ ${mCnt} -le 5 ]]; do
  mFName='File'${mCnt}'_extract_'${mMMDDYYYY}
  touch ${mCurrent}${mFName}
  mv ${mNew}${mFName} ${mCurrent}${mFName}
  mCnt=${mCnt}+1
done

shankar1dada · August 2, 2011, 10:48am

Hi Shell_life,

I am still not getting the script to work. The issue I notice is the file name.
The names I gave were only examples.
If the file name format changes then the script won't work.
The only unchanging part of the file name is that it always has the date (mmddyyyy) at the end. It can have any name/length as prefix.
For example: The files can be
File1_extract_08012011
Open_inv_08012011
BAXT_CONV_08012011

Please help.

DGPickett · August 2, 2011, 2:45pm

You can do it going through a sorted list with history in variables. If you hit a later file, you move the last file to archive. Here, * is a sorted list, and the ksh does it all but the mv internally, until 2040:

#!/usr/bin/ksh
 
cd $CURRENT_DIR
 
for file in *
do
 file_base=${file%_[01][0-9][0-3][0-9]20[0-3][0-9]}
 
 if [ "$file_base" = "$last_file_base" ]
 then
  mv "$last_file" "$ARCHIVE_DIR" # one at a time for simplicity
 fi
 
 last_file=$file  last_file_base=$file_base
done

Really, polling sucks! The creator app should do this, too, after a good file create.

shankar1dada · August 3, 2011, 10:55am

The code is still not working. Maybe I am not getting it to work as expected. I will try tweaking the code.

If you could shed some more light with comments, any new code would be helpful.

Thanks

DGPickett · August 3, 2011, 11:23am

I sense the requirements are still blurry. The files come into current, and then you want to move the prior same-prefix to archive, but if there is already that name in archive, what do you want to do?

Shell_Life · August 3, 2011, 11:32am

I provided a solution based on the original post and it does work as per original requirement.

Then the requirement was changed and it is still not clear.

For instance, are these the only files in the specified directories?

If so, then simply change any references from "File?_extract_" to "".

Or if there are other files there and they do not end with a date, then change the file name to "*[01][0-9][0-3][0-9]20[1-9][0-9]".
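
A quick way to check which names that wildcard would pick up is a case statement with the same pattern. This is a sketch; the ends_with_date name is made up, and the pattern assumes the date is the very end of the name, so a .csv suffix would need to be appended to it:

```shell
#!/bin/sh
# ends_with_date NAME: succeed if NAME ends in mmdd20yy per the pattern above.
ends_with_date() {
    case $1 in
        *[01][0-9][0-3][0-9]20[1-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

for n in File1_extract_08012011 Open_inv_08012011 all_archive_files.txt; do
    if ends_with_date "$n"; then
        echo "$n: dated"
    else
        echo "$n: skipped"
    fi
done
```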

shankar1dada · August 3, 2011, 11:40am

Dear GURUS,

First of all I am thankful for being patient with me. Let me explain in more detail:

Current will always have the filename with the latest "MMDDYYYY".
Archive will be having the last 6 MMDDYYYY for each of the filenames.

Whenever a file comes with a new MMDDYYYY then only you move the old file from CURRENT to ARCHIVE else you leave the old file in the CURRENT itself.

Say on August 1 is the first day. Both Current and Archive will be empty.
You are getting three files:
File1_extract_08012011.csv
BAXT_INV_08012011.csv
HMIA_CLS_08012011.csv

We will copy all the 3 file to Current folder.

Then next day Aug2, you are getting the following 3 files:
File1_extract_08022011.csv
BAXT_INV_08022011.csv
HMIA_CLS_08022011.csv

Now all three files with 08012011 will be moved to ARCHIVE and CURRENT will have the 08022011 files.

Now on Aug3, we get only two files:
File1_extract_08032011.csv
BAXT_INV_08032011.csv

Now we should move only File1_extract_08022011.csv and BAXT_INV_08022011.csv to ARCHIVE and keep HMIA_CLS_08022011.csv in the CURRENT folder. The latest 08032011 files will also be copied to CURRENT.

As of Aug3, ARCHIVE folder will have:

File1_extract_08012011.csv
BAXT_INV_08012011.csv
HMIA_CLS_08012011.csv
File1_extract_08022011.csv
BAXT_INV_08022011.csv

CURRENT folder will have:
File1_extract_08032011.csv
BAXT_INV_08032011.csv
HMIA_CLS_08022011.csv

This continues and at any point of time in future the ARCHIVE should have only the last 6 MMDDYYYY's.

Hope this gives a clear picture of what I am trying to do.

Thanks
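
The rules in this post can be sketched end to end roughly as follows. This is not any poster's script, just one way to read the requirement, written for plain sh. It assumes names of the form <prefix>_<mmddyyyy>.csv without whitespace, that new files land in the feed directory, and that all three directories share a filesystem; oldest_first, rotate and KEEP are made-up names:

```shell
#!/bin/sh
# Sketch under the assumptions stated above; a temp dir stands in for the feed dir.
set -eu

FEED_DIR=$(mktemp -d)         # stand-in for /opt/data/feeds
CURR_DIR=$FEED_DIR/Current
ARCH_DIR=$FEED_DIR/Archive
KEEP=6                        # archived copies to keep per prefix
mkdir "$CURR_DIR" "$ARCH_DIR"

# oldest_first DIR PREFIX: list PREFIX_mmddyyyy.csv in DIR, oldest file-name date first.
oldest_first() {
    ls "$1" | grep "^${2}_[0-9]\{8\}\.csv\$" |
        sed 's/.*_\([0-9]\{4\}\)\([0-9]\{4\}\)\.csv$/\2\1 &/' |
        sort | cut -d' ' -f2
}

# rotate: take each new file in FEED_DIR into Current, archiving what it replaces.
rotate() {
    for new in "$FEED_DIR"/*_[01][0-9][0-3][0-9][0-9][0-9][0-9][0-9].csv; do
        [ -f "$new" ] || continue                 # glob matched nothing
        name=$(basename "$new")
        prefix=${name%_????????.csv}
        # Move any older same-prefix file out of Current first.
        for old in $(oldest_first "$CURR_DIR" "$prefix"); do
            [ "$old" = "$name" ] || mv "$CURR_DIR/$old" "$ARCH_DIR/"
        done
        mv "$new" "$CURR_DIR/"
        # Trim the archive to the newest KEEP files for this prefix.
        extra=$(( $(oldest_first "$ARCH_DIR" "$prefix" | wc -l) - KEEP ))
        for old in $(oldest_first "$ARCH_DIR" "$prefix"); do
            [ "$extra" -gt 0 ] || break
            rm -f "$ARCH_DIR/$old"
            extra=$((extra - 1))
        done
    done
}

# Demo with the Aug 1-3 example: HMIA_CLS does not arrive on the third day.
for day in 08012011 08022011 08032011; do
    touch "$FEED_DIR/File1_extract_$day.csv" "$FEED_DIR/BAXT_INV_$day.csv"
    [ "$day" = 08032011 ] || touch "$FEED_DIR/HMIA_CLS_$day.csv"
    rotate
done
echo "Current:"; ls "$CURR_DIR"
echo "Archive:"; ls "$ARCH_DIR"
```

Ordering is taken from the date in the file name (rearranged to yyyymmdd), not from file time stamps, so it should survive a new year and a touched file.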

DGPickett · August 3, 2011, 11:59am

So, a file to be moved to archive does not need to worry about overwrite.

Do you want to integrate the archive 6 file limit at the same time?

#!/usr/bin/ksh
 
cd $CURRENT_DIR
 
for file in *
do
 file_base=${file%_[01][0-9][0-3][0-9]20[0-3][0-9]}
 
 if [ "$file_base" = "$last_file_base" ]
 then
  mv "$last_file" "$ARCHIVE_DIR" # one at a time for simplicity
  while (( 6 < $( ls $ARCHIVE_DIR/${file_base}_[01][0-9][0-3][0-9]20[0-3][0-9] 2>/dev/null | wc -l ) ))
  do
    ls -tr $ARCHIVE_DIR/${file_base}_[01][0-9][0-3][0-9]20[0-3][0-9] | read x
    rm -f $x
  done
 fi
 
 last_file=$file  last_file_base=$file_base
done

shankar1dada · August 3, 2011, 12:46pm

Yes Please that will be great.

Also, if possible, it would be informative for a beginner like me if you could include comments.

For example if you can add some comments here:

while (( 6 < $( ls $ARCHIVE_DIR/${file_base}_[01][0-9][0-3][0-9]20[0-3][0-9] 2>/dev/null | wc -l ) ))

Rather than blindly copying the code, I would like to learn and use it.

Thanks

DGPickett · August 3, 2011, 1:23pm

Well,

go to the current dir so names have no dir,
Get the sorted list of visible files and go through them one at a time in order as file,
Strip off the date portion at the end, with underscore, as file_base,
If the stored last file base is the same (blank first time, so never equal), then the last file is older and gets moved. Actually, it'd be nicer if the suffix was YYYY-MM-DD, as this fails at new year.

Try again:

#!/usr/bin/ksh
 
cd $CURRENT_DIR
 
ls *_[01][0-9][0-3][0-9]20[0-3][0-9] | sed '
  s/\(.*\)_\([01][0-9][0-3][0-9]\)20\([0-3][0-9]\)/\3\2 \1 &/
 ' | sort -k2,2 -k1,1 | while read xxkey file_base file
do
 if [ "$file_base" = "$last_file_base" ]
 then
  mv "$last_file" "$ARCHIVE_DIR" # one at a time for simplicity
  while (( 6 < $(
                     ls $ARCHIVE_DIR/${file_base}_[01][0-9][0-3][0-9]20[0-3][0-9] 2>/dev/null | wc -l
                     ) ))
  do
    ls -tr $ARCHIVE_DIR/${file_base}_[01][0-9][0-3][0-9]20[0-3][0-9] | read x
    rm -f $x
  done
 fi
 
 last_file=$file  last_file_base=$file_base
done

Go to the current dir so file names have no dir prefix,
list just the well-named files to the pipe,
prefix them with the key field YYMMDD on the pipe, sticking the base in the stream using sed for simplicity,
sort them by prefix and then by date key, pipe to pipe, so files with the same prefix sit together, oldest first,
'while read' puts the three fields into three variables for each line from the stdin pipe until EOF.
If same base as last file, the last file is older and must be moved.
If file is moved, in a subshell that captures stdout as a string $(...), list the archive dir for that base and date wild card suffix and count the lines of the list,
(( )) is ksh arithmetic mode, so you can say 6 < for testing the line count. I put the 6 first because a trailing > 6, being so much smaller than the $(...), gets lost at the end.
while 6 is less than that line count,
list by file mod time oldest first (is mod time a safe test, or do we need a key rearrange and sort like above?),
read the first name (real ksh runs the last stage of a pipe in the current shell, so x survives; in bash it would be lost),
remove that name,
end while 6 loop with done,
end if bases are same test with fi,
save file name and prefix in last_* variables for next pass,
end while read file loop with done
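
To show why the key rearrangement matters, here is a small illustration with two names on either side of a new year. The dates are made up for the demo, and the sed is a variant of the one above that emits only the YYMMDD key and the name:

```shell
#!/bin/sh
# Two made-up names: 2 January 2012 and 31 December 2011.
names='Open_inv_01022012
Open_inv_12312011'

# Plain sort compares mmddyyyy as text, so January 2012 lands before December 2011.
plain=$(printf '%s\n' "$names" | sort | head -n 1)

# Prefix each name with a YYMMDD key, sort on it, then drop the key again.
keyed=$(printf '%s\n' "$names" |
    sed 's/.*_\([01][0-9][0-3][0-9]\)20\([0-3][0-9]\)$/\2\1 &/' |
    sort | cut -d' ' -f2 | head -n 1)

echo "plain sort puts first: $plain"   # Open_inv_01022012
echo "keyed sort puts first: $keyed"   # Open_inv_12312011
```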

g.pi · August 3, 2011, 1:47pm

Shankar, this may be of help. You may need to tweak it a bit.

#!/bin/ksh

FEED_DIR=~/tmp
ARCH_DIR=~/tmp/arch
CURR_DIR=~/tmp/curr

for file in $(ls -1 $FEED_DIR/File?_extract*)
do
    file_pfx=$(basename ${file%%_[0-9][0-9]*})

    echo $file_pfx

    [ -e $CURR_DIR/${file_pfx}* ] && mv $CURR_DIR/${file_pfx}* $ARCH_DIR

    mv $FEED_DIR/${file_pfx}* $CURR_DIR
done

#   Remove files that are 6 days or older, from the archive directory.

find $ARCH_DIR -name 'File?_extract*' -atime +6 -exec rm {} \;

DGPickett · August 3, 2011, 2:01pm

Ingenious, but not robust or in requirement.

He said the app writes the new files to $CURRENT_DIR, and the app is not to be touched. So, at best you have to move the file to /tmp first. That is expensive for big files, as mv becomes cp (twice) for a change of file system, endangers the file if /tmp is too full or if an unplanned reboot clears /tmp, and will change the mod time stamp.

find -mtime runs on real time, so you may get more or less than 6 files, and on mod time, so if the mod times get messed up, may not remove the right files at the right time.

g.pi · August 3, 2011, 2:30pm

Good points, DGPickett. I may have misunderstood his requirements.

I am using -atime in the find command, with the assumption that once the file has been processed, it is not accessed again.

DGPickett · August 3, 2011, 3:05pm

Missed that, and never investigated the ramifications of atime. I think every open changes that. ctime might be before mtime, as it reflects non-time inode values. Time stamps are a fluffy thing, so using the file name date is safest, but drat that mdy date.

Staying on the fs with mv also means even if the file is open for write, it can move around without problems. After all, it is just an entry name and inode# in a dir node or two!
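
That point can be checked directly: within one filesystem, the inode number reported by ls -i should be the same before and after the mv. A throwaway sketch, with a temp directory standing in for the real folders:

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)             # Current and Archive share this filesystem
mkdir "$work/Current" "$work/Archive"
echo data > "$work/Current/File1_extract_08012011.csv"

# ls -i prints the inode number in the first column.
before=$(ls -i "$work/Current/File1_extract_08012011.csv" | awk '{print $1}')
mv "$work/Current/File1_extract_08012011.csv" "$work/Archive/"
after=$(ls -i "$work/Archive/File1_extract_08012011.csv" | awk '{print $1}')

echo "inode before: $before, after: $after"
```

If the two directories were on different filesystems, mv would have to copy the data and the inode would differ.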

shankar1dada · August 3, 2011, 4:44pm

DGPickett: When I execute the script it doesn't do anything or throw an error.

g.pi: When I execute your script, whatever is in CURRENT goes to ARCHIVE. But the new file from Source does not move to CURRENT.
Am I missing something?

for file in $(ls -1 $FEED_DIR/*.*)
do
    file_pfx=$(basename ${file%%_[0-9][0-9]*})

    echo $file_pfx

    [ -e $CURR_DIR/${file_pfx}* ] && mv $CURR_DIR/${file_pfx}* $ARCH_DIR

    mv ${file_pfx}*.* $CURR_DIR
done

#   Remove files that are 6 days or older, from the archive directory.

find $ARCH_DIR -name 'File?_extract*' -atime +6 -exec rm {} \;

DGPickett · August 3, 2011, 4:52pm

He assumed the feed dir was not the curr dir.

Forgot suffix! Fixed rm not to be mod time:

#!/usr/bin/ksh
 
cd $CURRENT_DIR
 
ls *_[01][0-9][0-3][0-9]20[0-3][0-9].csv | sed '
  s/\(.*\)_\([01][0-9][0-3][0-9]\)20\([0-3][0-9]\)\.csv/\3\2 \1 &/
 ' | sort -k2,2 -k1,1 | while read xxkey file_base file
do
 if [ "$file_base" = "$last_file_base" ]
 then
  mv "$last_file" "$ARCHIVE_DIR" # one at a time for simplicity
  while (( 6 < $(
               ls $ARCHIVE_DIR/${file_base}_[01][0-9][0-3][0-9]20[0-3][0-9].csv 2>/dev/null | wc -l
               ) ))
  do
    ls $ARCHIVE_DIR/${file_base}_[01][0-9][0-3][0-9]20[0-3][0-9].csv | sed '
      s/.*_\([01][0-9][0-3][0-9]\)20\([0-3][0-9]\)\.csv/\2\1 &/
     ' | sort | read y x
    rm -f $x
  done
 fi
 
 last_file=$file  last_file_base=$file_base
done

The beauty of scripting is you can try parts on your command line to see where it falls short or goes off track.
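
For instance, the pairing logic can be tried as a dry run on a fixed list of names, with echo in place of mv. This sketch is in plain sh and assumes the list is ordered by prefix first and date key second; the sample names come from the Aug 3 example:

```shell
#!/bin/sh
# Dry run: print what would be archived, touch nothing.
names='BAXT_INV_08022011.csv
BAXT_INV_08032011.csv
File1_extract_08022011.csv
File1_extract_08032011.csv
HMIA_CLS_08022011.csv'

plan=$(printf '%s\n' "$names" |
    sed 's/\(.*\)_\([01][0-9][0-3][0-9]\)20\([0-3][0-9]\)\.csv$/\3\2 \1 &/' |
    sort -k2,2 -k1,1 |
    {
        last_file= last_base=
        while read -r key base file; do
            # Same prefix as the previous line: the previous file is older.
            [ "$base" != "$last_base" ] || echo "would archive $last_file"
            last_file=$file last_base=$base
        done
    })
printf '%s\n' "$plan"
# prints:
# would archive BAXT_INV_08022011.csv
# would archive File1_extract_08022011.csv
```

Dropping stages off the end of the pipe one at a time shows what each stage hands to the next.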