To delete duplicates using part of the file name

I am using the script below to delete duplicate files, but it fails for directories with more than 10k files: ls -t gives "Argument list too long". I tried replacing ls -t with

find . -type f \( -iname "*.xml" \) -printf '%T@ %p\n' | sort -rg | sed -r 's/[^ ]* //' | awk 'BEGIN{FS="_"}{if (++dup[$1] >= 2) print}'

but I am not getting the same output as with ls -t, and my logic is not working either.

#!/bin/bash
for i in `ls -t *xml|awk 'BEGIN{FS="_"}{if (++dup[$1] >= 2) print}'`;
do
rm $i 
done

The file names look like:

AECZ00205_010917_1506689024063.xml
AECZ00205_010917_1506689024064.xml
AECZ00205_010917_1506689024066.xml [Latest]
AECZ00207_010917_1506690865368.xml
AECZ00207_010917_1506690865369.xml
AECZ00207_010917_1506690865364.xml [Latest]
AECZ00209_010917_1506707811518.xml
AECZ00209_010917_1506707811519.xml
AECZ00209_010917_1506707811529.xml [Latest]

Untested, but should be close to what you need. If the list of rm commands produced by the following looks correct, remove the echo and run it again to actually remove the files:

#!/bin/bash
ls -t | awk -F_ '/xml$/ && ++dup[$1] >= 2' | while IFS= read -r i
do
	echo rm "$i"
done
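To illustrate what that awk filter selects, here it is fed the sample names by hand in newest-first order (the order shown is an assumption; ls -t sorts by modification time, not by the timestamp embedded in the name):

```shell
# awk splits each name on "_" and counts occurrences of $1 (the part
# before the first underscore). The first line per key -- the newest
# file -- is skipped; every later duplicate is printed.
printf '%s\n' \
    'AECZ00205_010917_1506689024066.xml' \
    'AECZ00207_010917_1506690865364.xml' \
    'AECZ00205_010917_1506689024064.xml' \
    'AECZ00205_010917_1506689024063.xml' \
    'AECZ00207_010917_1506690865369.xml' |
awk -F_ '/xml$/ && ++dup[$1] >= 2'
# prints the three older files:
# AECZ00205_010917_1506689024064.xml
# AECZ00205_010917_1506689024063.xml
# AECZ00207_010917_1506690865369.xml
```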

One could also try:

#!/bin/bash
ls -t | awk -F_ '/xml$/{if($1 in dup) print; else dup[$1]}' | while IFS= read -r i
do
	echo rm "$i"
done

which, with lots of files, consumes slightly less memory.

find . -type f \( -iname "*.xml" \) -printf '%T@ %p\n' | sort -rg | sed -r 's/[^ ]* //' | awk '{w=$0; sub(".*/", "", w); sub("_[0-9_][0-9_]*.*", "", w);} !a[w]++'

How to delete the duplicates using this?

Like Don Cragun stated, be careful before you run rm:

find . -type f \( -iname "*.xml" \) -printf '%T@ %p\n' |
   sort -rg |
   sed -r 's/[^ ]* //' |
   awk '{w=$0; sub(".*/", "", w); sub("_[0-9_][0-9_]*.*", "", w);} a[w]++' | while IFS= read -r f
   do
      echo "rm -f \"$f\""
   done > rm_file

Verify that the listed files actually need to be deleted, then run:

sh rm_file
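If any of the paths might contain spaces or other unusual characters, a NUL-delimited variant is safer than building an rm_file of shell commands. This is a sketch along the same lines, assuming GNU find and sort (for -printf and -z) and bash 4+ (for the associative array); as above, remove the echo only after verifying the output:

```shell
#!/bin/bash
# Keep the newest file per key (the basename up to the first "_"),
# echo an rm for every older duplicate.
declare -A seen
find . -type f -iname '*.xml' -printf '%T@\t%p\0' |   # mtime TAB path, NUL-terminated
sort -z -rg |                                         # newest first
while IFS=$'\t' read -r -d '' mtime path
do
    base=${path##*/}      # strip leading directories
    key=${base%%_*}       # part of the name before the first "_"
    if [[ -n ${seen[$key]} ]]; then
        echo rm -f "$path"
    fi
    seen[$key]=1
done
```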

The above post was meant to identify the latest files (the ones you want to keep?) as shown, in case you just wanted to move or copy files to a new directory and keep all of the data.


It is working perfectly fine. Can you explain what that awk does? I have to brief my coworker.

Which one? One of the two in post #2 that only processes xml files in the current directory using ls -t? Or the one in post #5 that uses find, sort, and sed to process all .xml files in the entire file hierarchy rooted in the current directory?
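In case it helps either way, here is the awk from post #5 with each step commented (same behavior, just annotated):

```shell
# Each input line is a path such as ./dir/AECZ00205_010917_1506689024063.xml,
# already sorted newest-first by the preceding sort -rg.
awk '{
    w = $0                          # copy the whole line (the path)
    sub(".*/", "", w)               # strip the directories: w is now the basename
    sub("_[0-9_][0-9_]*.*", "", w)  # strip from the first "_<digits>" on: w is the key, e.g. AECZ00205
}
!a[w]++                             # true only the first time a key is seen,
'                                   # so only the newest file per key is printed
```

With a[w]++ instead of !a[w]++ (as in the rm_file version above), the test flips: it prints every line after the first per key, i.e. the older duplicates to delete.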

How about using fdupes?

The fdupes utility (on systems that have it) looks for files with identical sizes and contents, not files with similar names.