Delete files based on specific MMDDYYYY pattern in filename

shankar1dada · August 8, 2011, 3:23pm

Hi Unix gurus,

I am trying to remove files based on the MMDDYYYY date in the filename, so that the directory always keeps only the 3 most recent files by that date. The "HHMM" part is just a dummy in this case. You won't have two files with different HHMM values on the same day.

For example, in a directory I have files like:

OPEN_INV_01012011_1345.xls
OPEN_INV_01022011_1230.xls
OPEN_INV_01032011_1145.xls
OPEN_INV_01042011_2456.xls
OPEN_INV_01012011_3456.txt
OPEN_INV_01022011_1134.txt
OPEN_INV_01032011_0812.txt
OPEN_INV_01042011_3467.txt

When I run the script, it should delete OPEN_INV_01012011_1345.xls and OPEN_INV_01012011_3456.txt.

Note that before the file extension, we always have "MMDDYYYY_HHMM".

This is the script I am using:

#!/usr/bin/ksh

archivedir=/opt/data/files/archive

typeset -i MAX_ARCHIVE_COUNT
typeset -i archive_file_count
typeset -i remove_archive_count

cd $archivedir
for archpref in $(ls | sed 's/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]\.[^.]*$//' | sort | uniq)
do
  archive_file_count=$(ls -1t ${archpref}* | wc -l)
  MAX_ARCHIVE_COUNT=3
  remove_archive_count=${archive_file_count}-${MAX_ARCHIVE_COUNT}
  if [ ${remove_archive_count} -gt 0 ]
  then
# List the files in date order (most recent first), suppress the first 3, and delete the rest
    rm $(ls -1rt | tail -${remove_archive_count})
  fi
done

Any help is greatly appreciated.

Thanks

g.pi · August 8, 2011, 5:26pm

Shankar, I think your problem is on this line:

for archpref in $(ls | sed 's/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]\.[^.]*$//' | sort | uniq)

Try running the commands inside the parentheses, from ls up to uniq, on the command line and see if they pick up any files. I believe the sed command should look like:

sed 's/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]_[0-9]*\.[^.]*$//'

Or, better still:

sed 's/[0-9]*_[0-9]*\.[a-z]*//'

shankar1dada · August 9, 2011, 11:24am

G.pi,

I tried all your suggestions and it is still not working. It treats .txt and .xls as the same set and removes 5 files, leaving a total of 3.

The idea is to leave 3 files for each set (.xls and .txt).

I think the issue is with the "sort and uniq".

Currently it shows archnt=10 and remove_count=7, which is wrong because it combines both extensions.
It should actually be archnt=4 and remove_count=1 for each set.
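
A small sketch of why the two extensions end up in one set, using two of the sample names from the first post. The function names merged_prefixes and prefix_ext_pairs are made up for illustration. The first strips the extension along with the date, so both names collapse to the same prefix; the second is one possible alternative that keeps the extension next to the prefix, assuming names always end in MMDDYYYY_HHMM.ext.

```sh
# Hypothetical helper: the pattern suggested above removes the extension too,
# so .xls and .txt names both reduce to the same prefix.
merged_prefixes() {
  sed 's/[0-9]*_[0-9]*\.[a-z]*$//' | sort -u
}

# Hypothetical helper: print "prefix extension" pairs instead, so each
# extension stays a separate set. Names that do not match are skipped.
prefix_ext_pairs() {
  sed -n 's/^\(.*\)[0-9]\{8\}_[0-9]\{4\}\.\([^.]*\)$/\1 \2/p' | sort -u
}

printf '%s\n' OPEN_INV_01012011_1345.xls OPEN_INV_01012011_3456.txt | merged_prefixes
# prints:
#   OPEN_INV_

printf '%s\n' OPEN_INV_01012011_1345.xls OPEN_INV_01012011_3456.txt | prefix_ext_pairs
# prints:
#   OPEN_INV_ txt
#   OPEN_INV_ xls
```

With only the prefix to go on, ${archpref}* matches every file in the directory, which is consistent with the counts being combined.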

If anyone can suggest some ideas or modify the sed command, that would be great.

Thanks in advance

g.pi · August 11, 2011, 3:25pm

Shankar,

I had overlooked your earlier response.

My earlier suggestion was based on your existing script. A quick (easy) solution would be to create 2 for loops, instead of the single one that you have now. The first one should only deal with .txt and the second, with .xls. So in the first for loop, your sed command would look something like this:

sed 's/[0-9]*_[0-9]*\.txt//'

Make sure your rm command only removes the .txt files. So your rm command should look like this:

rm $(ls -1t *.txt | tail -n ${remove_archive_count})

The second for loop should be identical, except, it will only deal with .xls.
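
A minimal sketch of the same idea, written as one function called once per extension instead of two copied loops. The name prune_by_ext and its arguments are made up for illustration. It assumes it is run from inside the archive directory, that file modification times follow the date in the name, and that filenames contain no newlines.

```sh
# Hypothetical helper: keep the newest $2 files (default 3) with extension $1
# in the current directory, judged by modification time, and remove the rest.
prune_by_ext() {
  ext=$1
  keep=${2:-3}
  # Count the files of this extension; no match gives a count of 0.
  count=$(ls -1 *."$ext" 2>/dev/null | wc -l)
  remove=$((count - keep))
  if [ "$remove" -gt 0 ]; then
    # ls -t lists newest first, so the last $remove lines are the oldest files.
    ls -1t *."$ext" | tail -n "$remove" | while IFS= read -r f; do
      rm -- "$f"
    done
  fi
}
```

From the archive directory this would be called as prune_by_ext txt and then prune_by_ext xls. Note that ls -t (newest first) is paired with tail here; with ls -rt the tail would pick the newest files instead.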

Hope that helps.

GP

anuragpgtgerman · August 11, 2011, 11:50pm

We can achieve this with the sort and grep commands.
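
One way this could look, as a sketch only: grep selects one prefix and extension, sed builds a sortable YYYYMMDD key from the MMDDYYYY stamp in the name, and sort orders by that key rather than by modification time. The function name list_stale and its arguments are made up for illustration. It assumes it is run from inside the archive directory, that the prefix contains no regular-expression metacharacters, that names contain no newlines, and that the keep count is at least 1.

```sh
# Hypothetical helper: print the files named <prefix>MMDDYYYY_HHMM.<ext> in the
# current directory that fall outside the newest $3, ordered by the date in
# the name. It only prints names; nothing is deleted here.
list_stale() {
  prefix=$1
  ext=$2
  keep=$3
  # 1. grep keeps only names of the expected shape for this prefix/extension.
  # 2. sed prepends a YYYYMMDD key (YYYY then MMDD) followed by a space.
  # 3. sort -r puts the newest date first; sed drops the first $keep lines.
  # 4. cut removes the key again, leaving only the file names.
  ls | grep -E "^${prefix}[0-9]{8}_[0-9]{4}\.${ext}\$" |
    sed 's/^\(.*_\)\([0-9]\{4\}\)\([0-9]\{4\}\)\(_[0-9]\{4\}\.[^.]*\)$/\3\2 &/' |
    sort -r |
    sed "1,${keep}d" |
    cut -d' ' -f2-
}
```

Once the printed list looks right, it can be fed to rm, for example: list_stale OPEN_INV_ xls 3 | while IFS= read -r f; do rm -- "$f"; done, and the same again for txt.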