With a Unix script, how to delete files older than 10 years

With a Unix script, how can I delete files older than 10 years, based on the timestamp in the filename?
The filenames look like this: 20130101_06_31_ABC.csv (YYYYMMDD_HH_MI_ABC.csv).

What is your OS?

uname -a
cat /etc/*release

What did you try?

Can you clarify the requirement a little, please?

Is it sufficient to delete files for specific years (e.g. 1970 to 2012 inclusive), or would you require, as of today (2023-Feb-20), to delete files up to 2013-Feb-19?

Is your requirement confined to specific directories only, or to a complete subtree of directories, or to a complete filesystem?

find . -type f -mtime +3666 -exec rm -f {} \;

This should work; adjust it if needed, as there might be minor syntax differences on your system.

Yes, using the file's mtime is much simpler than extracting the date from the name!

Hopefully still helpful after 8 months.

rm takes multiple arguments, so the + collects filenames and runs one (or a few) rm invocations with many arguments each, in contrast to \;, which runs a separate rm for every single file.

find . -type f -name "????????_??_??_*.csv" -mtime +3652 -exec rm -f {} +

Do you know the computer science concepts behind the difference between + and ;? In programming terms?

The + bundles the arguments for the command.
find collects the arguments (filenames) until the maximum argument-list length (ARG_MAX) for the command is reached.
The command is delayed, and finally run with the collected arguments.
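A quick sketch of the difference, using a throwaway directory and with echo standing in for rm so nothing is deleted:

```shell
# Contrast \; and + by counting command invocations.
dir=$(mktemp -d)
touch "$dir/a.csv" "$dir/b.csv" "$dir/c.csv"

# \; runs the command once per file: three echo invocations, three lines.
find "$dir" -type f -name '*.csv' -exec echo {} \; | wc -l

# + batches the filenames: one echo invocation, one line.
find "$dir" -type f -name '*.csv' -exec echo {} + | wc -l

rm -r "$dir"
```

The first wc -l reports 3, the second 1: same files, but + bundled them into a single invocation.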


Notably, find with + is aware of the limit on the number of args to the command (more precisely, of the combined length of all the arguments). As with xargs, this can result in multiple invocations of the command.

In this case with rm, there should be no side effects, but other commands may have a problem.

For example, consider this:

find . -type f -name '*.csv' -mtime +3652 -exec tar czf Archive.tgz {} '+'

If there are 6300 files, and the args list is limited to 2000 filenames, then tar runs four times; each run overwrites the Archive.tgz from the previous one, and the surviving .tgz will only contain the last 300 files.

Many commands have a work-around for this issue. For example, with tar you can have find create a file containing a list of null-terminated filenames using -fprint0 myList, and have tar read that list with tar --null -T myList. So the filenames no longer need to appear in the command-line as arguments.
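A minimal, self-contained sketch of that list-file approach (GNU find and GNU tar assumed; the demo directory and filenames here are made up, with the old file's mtime set artificially):

```shell
# Build a tiny demo tree: one file old enough to match, one too new.
d=$(mktemp -d); cd "$d"
touch -d '2000-01-01' old.csv
touch new.csv

# Step 1: find writes a NUL-terminated list; no filenames on any command line.
find . -type f -name '*.csv' -mtime +3652 -fprint0 myList

# Step 2: tar reads the list itself, so the number of files no longer matters.
tar --null -T myList -czf Archive.tgz
tar -tzf Archive.tgz    # lists ./old.csv only
```

However many thousands of files find collects, tar is invoked exactly once and the archive is written exactly once.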


The -T can take a dash for stdin:

find ... -print0 | tar -c -z -f /tmp/Archive.tgz --null -T -

GNU find/tar required.


Well, anyone who is looking to delete files older than 10 years can use the find command; the script below is an example:

#!/bin/bash

# Specify the directory where your files are located
directory="/path/to/your/files"

# Calculate the timestamp for files older than 10 years
timestamp=$(date -d "now - 10 years" +%Y%m%d_%H_%M)

# Use find to locate and delete files older than the specified timestamp
find "$directory" -type f -name '*_*_*_*.csv' | while read -r file; do
    # Extract the YYYYMMDD_HH_MM timestamp from the filename (not the full path)
    file_timestamp=$(basename "$file" | awk -F '[_.]' '{print $1"_"$2"_"$3}')
    
    # Compare timestamps and delete files older than 10 years
    if [ "$file_timestamp" -lt "$timestamp" ]; then
        echo "Deleting $file"
        rm "$file"
    fi
done

Thanks


@gulshan212
nitpicking here.
I think your timestamp and file_timestamp should be PURELY numeric to do the -lt NUMERIC comparison. Hence, I'd remove the underscores from both.
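A quick sketch of the problem, with made-up timestamp values:

```shell
a="20130101_06_31"; b="20150220_06_31"

# With underscores, [ -lt ] rejects the operands outright:
[ "$a" -lt "$b" ] 2>/dev/null || echo "non-numeric: -lt fails"

# Stripped to pure digits, the comparison works:
[ "${a//_/}" -lt "${b//_/}" ] && echo "numeric: a is older"
```

The first test exits with an "integer expression expected" error; the second compares 201301010631 against 201502200631 and succeeds.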


Or:
since the fields run from the biggest unit (YYYY) to the smallest (MM), an alphabetic comparison should work:
[[ "$file_timestamp" < "$timestamp" ]]
Can be further simplified:

timestamp=$(date -d "now - 10 years" +%Y%m%d_%H_%M)
find "$directory" -type f -name '*_*_*_*.csv' | while IFS= read -r file; do
    if [[ "${file##*/}" < "$timestamp" ]]; then
        ...
    fi
done
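A small demonstration of why the alphabetic comparison is safe (hypothetical filenames; the comparison is decided at the first differing character, so the _ABC.csv suffix never matters):

```shell
# YYYYMMDD_HH_MM runs from the biggest unit to the smallest, so the
# strings sort lexically in chronological order, underscores included.
cutoff="20150220_06_31"
[[ "20130101_06_31_ABC.csv" < "$cutoff" ]] && echo "older -- delete"
[[ "20990101_00_00_ABC.csv" < "$cutoff" ]] || echo "newer -- keep"
```

Both lines print: the 2013 name sorts before the cutoff, the 2099 name after it.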

We don't know how many files are in the directory subtree (either older than 10 years or not), but having ten years worth suggests there might be a considerable number.

I would not want to run a subshell and awk invocation for every filename, and a rm process for every older file. That could add up to thousands of processes.

I would skip the shell while read loop entirely. Pipe the complete output from find to a single awk process which is passed -v timestamp="$timestamp", have awk extract and compare the date from each filename, and pipe the list of older filenames into xargs rm so it can batch the files.

Polish by using the relevant -print0, RS=ORS="\0", and --null options so the commands handle whitespace in filenames safely.
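Putting those pieces together, a self-contained sketch (GNU find/awk/xargs assumed; the demo directory and filenames are made up for illustration):

```shell
#!/bin/bash
# Demo tree: one file old enough to delete by its name, one far too new.
directory=$(mktemp -d)
touch "$directory/20100101_06_31_ABC.csv" "$directory/20990101_06_31_ABC.csv"

timestamp=$(date -d "now - 10 years" +%Y%m%d_%H_%M)

# One find, one awk, batched rm via xargs -- no per-file subshells.
find "$directory" -type f -name '????????_??_??_*.csv' -print0 |
awk -v ts="$timestamp" 'BEGIN { RS = ORS = "\0" }
{
    n = split($0, parts, "/")          # basename is the last path component
    stamp = substr(parts[n], 1, 14)    # YYYYMMDD_HH_MM
    if (stamp < ts) print              # emit only the old files, NUL-terminated
}' |
xargs -0 --no-run-if-empty rm -f

ls "$directory"    # only the 2099 file survives
```

Whatever the number of files, this runs exactly one find, one awk, and as few rm invocations as the argument-length limit allows.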
