I am using the script below to delete duplicate files, but it is not working for directories with more than 10k files: "Argument list too long" is reported for ls -t. I tried to replace ls -t with
Untested, but should be close to what you need. If the list of rm commands produced by the following looks correct, remove the echo and run it again to actually remove the files:
#!/bin/bash
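# ls -t lists the newest files first; awk splits each name on "_" and, for names
# ending in "xml", prints every occurrence after the first (newest) one per prefix in $1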
ls -t | awk -F_ '/xml$/ && ++dup[$1] >= 2' | while IFS= read -r i
do
    echo rm "$i"
done
One could also try:
#!/bin/bash
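# same idea, but dup[] only records that a prefix has been seen (no counter is kept),
# printing each later file whose prefix is already in the array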
ls -t | awk -F_ '/xml$/{if($1 in dup) print; else dup[$1]}' | while IFS= read -r i
do
    echo rm "$i"
done
which, with lots of files, consumes a little bit less memory.
Like Don Cragun stated, be careful before you run rm:
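# find prints "mtime path" for every .xml file under the current directory;
# sort -rg puts the newest first; sed strips the timestamp; awk reduces each
# path to a key (basename with the trailing _<digits> part removed) and prints
# every path after the first one seen for that key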
find . -type f \( -iname "*.xml" \) -printf '%T@ %p\n' |
sort -rg |
sed -r 's/[^ ]* //' |
awk '{w=$0; sub(".*/", "", w); sub("_[0-9_][0-9_]*.*", "", w);} a[w]++' | while IFS= read -r f
do
    echo "rm -f $f"
done > rm_file
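Before deleting anything, a quick way to look over what was generated could be, for example:
wc -l rm_file     # how many rm commands were generated
less rm_file      # page through them and spot-check the paths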
Verify the files that actually need to be deleted, then run:
sh rm_file
The post above was meant to identify the latest files (the ones you want to keep?) as shown, in case you just wanted to move or copy files to a new directory and keep all the data.
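For instance, a minimal sketch of a move variant that shelves the older duplicates instead of deleting them; the destination directory ./old_xml and the output file name mv_file are made up for the example, and only the echoed command changes:
#!/bin/bash
# Hypothetical variant: move older duplicates into ./old_xml rather than removing them
mkdir -p ./old_xml
find . -type f \( -iname "*.xml" \) -printf '%T@ %p\n' |
sort -rg |
sed -r 's/[^ ]* //' |
awk '{w=$0; sub(".*/", "", w); sub("_[0-9_][0-9_]*.*", "", w);} a[w]++' | while IFS= read -r f
do
    echo "mv $f ./old_xml/"
done > mv_file
Review mv_file the same way as rm_file, then run it with sh mv_file.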
Which one? One of the two in post #2 that only process xml files in the current directory using ls -t? Or the one in post #5 that uses find, sort, and sed to process all .xml files in the entire file hierarchy rooted in the current directory?