Delete all files with specific extension in directory tree

I'm sure this has been asked many times, but a search didn't turn up a definitive best method for this (if there ever is such a thing).

I have been using rsync to back up my main data directory, but I have accumulated a large number of older backups that I don't need. All of the files I don't need anymore have the extension .back, so I need to trawl through all of the folders and sub-folders and delete everything with the .back extension. I thought I would need to do some kind of recursive ls and pipe the results to rm, but I'm not sure what that would look like, so I did a search.

Many of the solutions I found use find and look like,

find . -name *.back -exec file {} \; -exec rm -i {} \;

or

find /path . -name '*.back' -type f -delete

or

find /path -iname "*.back" -type f -delete

As usual, there appear to be many ways of doing things and I have no basis on which to make a choice. These files are copies, so I could always rebuild the backup if there was a disaster with the cleaning, but that would take time and I try to avoid putting my foot in it to that extent.

Any suggestions?

LMHmedchem

Go to the main directory and run

find . -name "*.back" -type f -exec rm -f {} \;
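
If you want to double-check what will be matched before anything is removed, a non-destructive dry run (the same command with the -exec part swapped for -print) is one option:

find . -name "*.back" -type f -print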

Alright, I will give that a go. If you have a minute to answer, what is the difference between the method you posted and the other examples I gave in my original post?

LMHmedchem

IMO, the first command will not work because the output of the find command is not passed properly to the exec command. But the second and third should work for sure.
I wasn't sure about the -delete action. That is why I gave you the command I am familiar and experienced with.

Any better explanations from geeks are welcome. :slight_smile:

Oh dear, I'm classing myself as a geek. Well, if the name fits, ......

The way you have tried, the shell will expand *.back before trying to run the command. If you happen to have a file at the top level called this.back, then the command actually run will become:-

find . -name this.back -exec file {} \; -exec rm -i {} \;

so you will not actually match anything other than the file at the top level.

The others have errors of various kinds. What jaiseaugustine has suggested is the correct format for you: the quoting passes *.back as-is to the find command, where it can then be used for pattern matching.

If there are no matching files at the top level, you might get away with it depending on how your shell reacts, but if there is more than one file matching *.back, e.g. this.back & that.back, then you will probably get the error:-

find: There is a missing conjunction

because the shell will try to run:-

find . -name that.back this.back -exec rm {} ;
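
A quick way to see for yourself what the shell actually passes to find is to prefix the line with echo (assuming, as above, that this.back and that.back exist at the top level):

echo find . -name *.back -exec rm {} \;

which would print something like:

find . -name that.back this.back -exec rm {} ;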

I hope that this clarifies things a bit.

Robin
Liverpool/Blackburn
UK

Further, the -exec file {} \; -exec rm -i {} \; is tailored for interactive use,
while -type f -exec rm -f {} \; is tailored for scripts.

BTW, -delete is a newer option in recent find versions. It would be more efficient if thousands of files are deleted.

This is more or less always what I end up doing and I guess it is a reasonable way to proceed in most cases. I keep notes on what I have used for various situations, especially those methods that worked well.

The command,

find . -name "*.back" -type f -exec rm -f {} \;

worked well and cleared out about 50GB of older incremental versions. I ran this while I was out for a while, and I didn't run it under time, so I can't comment on how fast the method is compared to other possibilities. I generally presume that there is no fast script-based method to process a directory tree with 4+ million files.
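
For a future run, the elapsed time could be captured simply by prefixing the same command with time, purely as a sketch:

time find . -name "*.back" -type f -exec rm -f {} \;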

I also did a defrag/optimize (auslogics) and clean out of MFT records. All told it took almost 24 hours to run, but I find I need to keep these backups well maintained, or they eventually bork and you have to reformat and start again. It seems as if rsync tends to lead to very fragmented repositories. I have never quite understood why you get lots of fragmenting on a drive with 500GB of empty space.

Thanks for all the additional explanations. I do always try to understand what a script is doing and why you would choose one method over another. I think I need to read a bit about exec.

LMHmedchem

As MadeInGermany informs us, -delete is available in newer versions, so that will probably run better than -exec rm {} \; as the latter will spawn a new process for each file, and that in itself will take a small amount of time. Multiply that by perhaps 100,000 hits and it suddenly becomes a lot of time spent just creating a new process for each delete.
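
As a rough, purely illustrative comparison (the directories below are made-up paths, and each form would need its own copy of the data since both delete the same files):

# /backup/copy1 and /backup/copy2 are hypothetical paths used only for illustration
time find /backup/copy1 -name "*.back" -type f -exec rm -f {} \;   # spawns one rm process per file
time find /backup/copy2 -name "*.back" -type f -delete             # deletes within find itself, no extra processes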

To my embarrassment, most of my servers are rather behind the times (AIX 4.3.3 for some) so I've only got this flag in RHEL 6.3.

Interestingly in the RHEL man page, I found this:-

Perhaps that's a neater way (with examples) of putting what I was trying to say earlier.

Glad that we could collectively help, and I've learned something too :cool:

Robin

Hi.

You mention MFT -- is this filesystem NTFS? ... cheers, drl

Yes, this is Windows XP, but I do the heavy lifting with Cygwin. Nothing beats a Linux tool box for large-scale file operations. I suppose there may be a DOS equivalent, but I never bothered to learn DOS when I could use Cygwin and learn a real shell like bash instead.

LMHmedchem

Hi.

I generally advise people to use whatever seems best for them. However, we need to accept drawbacks along with the advantages.

For your rsync / fragmentation issue, I think it's the XP NTFS filesystem, not the application. Some links you may want to look over are:

Defragmentation - Wikipedia, the free encyclopedia

A forum discussion (mostly slanted toward Linux, against MS): Why does Windows suffer from disk fragmentation when Linux doesn't? [Archive] - Ubuntu Forums

A blog that is more MS-centric view: Rants and Raves Linux File System Fragmentation

File system fragmentation - Wikipedia, the free encyclopedia

And many many more from Google searches.

My own experience with rsync (via rsnapshot): my workstation backup is run 6 times per day. I have been running it for more than 1 year. The numbers from df and fsck are:

Filesystem            Size  Used Avail Use% Mounted on
/dev/sdc1             241G   35G  194G  16% /media/big_disk_1

big_disk_1: 1,285,708/16,007,168 files (0.3% non-contiguous), 8984671/64000944 blocks
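
For reference, output of that shape comes from commands along these lines (the device and mount point simply follow the listing above; fsck should ideally be run against an unmounted filesystem):

df -h /media/big_disk_1
fsck -fn /dev/sdc1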

So very low fragmentation for a disk that sees a fair amount of activity. This does not necessarily parallel your use, but I have never used a defragmentation code on any Linux filesystem. When I used W2K, I seemed to need a defragmentation run quite often.

Best wishes ... cheers, drl

That's a very inefficient method of deleting a large number of files with Cygwin. The fork required to delete each file performs very, very poorly. Either use -delete or the + version of -exec.

Regards,
Alister

So would this be the preferable version if there is the potential for a large number of files to be involved?

find . -name "*.back" -type f -delete

I'm not familiar with what the "+ version of -exec" would be.

What is the difference between using double and single quotes for the extension? I have seen both, meaning "*.back" or '*.back'?

LMHmedchem

To reduce the calls to programs (like rm), there are two implementations:

1. Unix, later defined by POSIX:
find . -name "*.back" -type f -exec rm -f {} +

I.e. you replace \; by + and the program (here rm) must accept multiple arguments.
This implementation seems difficult; I have met some buggy ones.
2. GNU find, by means of the xargs program that converts an input stream to multiple arguments:

find . -name "*.back" -type f -print0 | xargs -0 rm -f

Note the corresponding -print0 and -0; without them, filenames that contain space characters are handled incorrectly.
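
A made-up file name with a space in it shows the difference:

touch "old copy.back"                           # a made-up example file name
find . -name "*.back" -print | xargs rm -f      # xargs hands rm two words: ./old and copy.back
find . -name "*.back" -print0 | xargs -0 rm -f  # rm receives the single name ./old copy.back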
--
A demonstration of quoting types:

echo "this is $HOME"
echo 'this is $HOME'

Concerning *.back, there is no difference.
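
Either quoting style keeps the shell from expanding the pattern before find sees it, which is the point raised earlier in the thread; a quick illustration:

echo *.back     # the shell expands this to any matching names in the current directory
echo "*.back"   # prints *.back literally
echo '*.back'   # prints *.back literally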

Yes. I believe that is the optimal solution.

Regards,
Alister