I am using the script below to delete duplicate files, but it is not working for directories with more than 10k files: "Argument list too long" is reported for ls -t. I tried to replace ls -t with
Untested, but should be close to what you need. If the list of rm commands produced by the following looks correct, remove the echo and run it again to actually remove the files:
#!/bin/bash
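# ls -t lists the newest files first; awk splits each name on "_" and, for names
# ending in "xml", prints every occurrence after the first (newest) one per prefix in $1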
ls -t | awk -F_ '/xml$/ && ++dup[$1] >= 2' | while IFS= read -r i
do
    echo rm "$i"
done
One could also try:
#!/bin/bash
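# same idea, but dup[] only records that a prefix has been seen (no counter is kept),
# printing each later file whose prefix is already in the array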
ls -t | awk -F_ '/xml$/{if($1 in dup) print; else dup[$1]}' | while IFS= read -r i
do
    echo rm "$i"
done
which, with lots of files, consumes a little bit less memory.
Like Don Cragun stated, be careful before you run rm:
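# find prints "mtime path" for every .xml file under the current directory;
# sort -rg puts the newest first; sed strips the timestamp; awk reduces each
# path to a key (basename with the trailing _<digits> part removed) and prints
# every path after the first one seen for that key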
find . -type f \( -iname "*.xml" \) -printf '%T@ %p\n' |
sort -rg |
sed -r 's/[^ ]* //' |
awk '{w=$0; sub(".*/", "", w); sub("_[0-9_][0-9_]*.*", "", w);} a[w]++' | while IFS= read -r f
do
    echo "rm -f $f"
done > rm_file
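Before deleting anything, a quick way to look over what was generated could be, for example:
wc -l rm_file     # how many rm commands were generated
less rm_file      # page through them and spot-check the paths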
Verify the files that actually need to be deleted, then run:
sh rm_file
The post above was meant to identify the latest files (the ones you want to keep?) as shown, in case you just wanted to move or copy files to a new directory and keep all the data.
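For instance, a minimal sketch of a move variant that shelves the older duplicates instead of deleting them; the destination directory ./old_xml and the output file name mv_file are made up for the example, and only the echoed command changes:
#!/bin/bash
# Hypothetical variant: move older duplicates into ./old_xml rather than removing them
mkdir -p ./old_xml
find . -type f \( -iname "*.xml" \) -printf '%T@ %p\n' |
sort -rg |
sed -r 's/[^ ]* //' |
awk '{w=$0; sub(".*/", "", w); sub("_[0-9_][0-9_]*.*", "", w);} a[w]++' | while IFS= read -r f
do
    echo "mv $f ./old_xml/"
done > mv_file
Review mv_file the same way as rm_file, then run it with sh mv_file.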
Which one? One of the two in post #2 that only process xml files in the current directory using ls -t? Or the one in post #5 that uses find, sort, and sed to process all .xml files in the entire file hierarchy rooted in the current directory?