Finding duplicate files using the latest file in Linux

gold2k8 · November 18, 2017, 11:40pm

I tried the following code, but it doesn't achieve what I want.

    #!/usr/bin/bash
    find ./ -type f \( -iname "*.xml" \) | sort -n > fileList
    sed -i '/\.\/fileList/d' fileList
    NAMEOFTHISFILE=$(echo $0|sed -e 's/[]\/()$*.^|[]/\\&/g')
    sed -i "/$NAMEOFTHISFILE/d" fileList
    cp fileList auxFileList
    while read FILENAME
    do
        sed -i '1d' auxFileList
        #echo "Comparing $FILENAME with :"
        #Read the aux file and compare current file with every other element in the file
        while read COMPFILENAME
        do
            RETURN=$(diff $FILENAME $COMPFILENAME)
            if [ "$RETURN" == "" ]
            then
            cat $FILENAME | awk ' BEGIN { FS="_" } { printf( "%03d\n",$2) }' | sort | awk ' { printf( "data_%d_box\n", $1)  }'
             #echo "$FILENAME AND $COMPFILENAME are identical"
             #rm -r $FILENAME
            fi
            #echo "  $COMPFILENAME"
        done<auxFileList
    done<fileList
    rm fileList auxFileList &>/dev/null
    printf '\n\n'

This code initially selects all the files. I need to amend it so that only the most recently modified file for each filename pattern is picked, for example:

    File 1: AAA_555_0000 
    File 2: AAAA_123_123 
    File 3: AAAA_452_452 [latest]
    
    File 4: BBB_555_0000 
    File 5: BBB_555_555 
    File 6: BBB_999_999 [latest]
    
    File 7: CCC_555_0000 
    File 8: CCC_000_000 
    File 9: CCC_000_111 [latest]

The script has to pick the latest file for each filename pattern in the folder, compare the other files with that pattern against it, and delete the duplicates.

I'd appreciate it if you could help me with this logic.

Thanks much!

gold2k8 · November 19, 2017, 12:59am

I have a folder containing a series of files whose names follow patterns like the ones below.

    ./ARCZ00300_010117_1504690829222.xml
    ./ARCZ00300_010117_1507101655366.xml [latest]
    ./ARCZ00301_010117_1504691829478.xml
    ./ARCZ00301_010117_1507101655591.xml [latest]
    ./ARCZ00302_010117_1504691451495.xml
    ./ARCZ00302_010117_1507101656182.xml [latest]
    ./ARCZ00303_010117_1504691526615.xml
    ./ARCZ00303_010117_1507101657147.xml [latest]
    ./ARCZ00304_010117_1504691981689.xml
    ./ARCZ00304_010117_1507101657249.xml [latest]
    ./ARCZ00305_010117_1507101657610.xml
    ./ARCZ00306_010117_1507101658585.xml
    ./ARCZ00307_010117_1504691981668.xml
    ./ARCZ00307_010117_1507101658940.xml [latest]
    ./ARCZ00577_010117_1504692004529.xml
    ./ARCZ00580_010117_1504691562602.xml
    ./ARCZ00580_010117_1507101892930.xml [latest]

The script has to pick the latest file for each filename pattern in the folder, compare the other files with that pattern against it, and delete the duplicates.

I'd appreciate it if you could help me with this logic.

Thanks much!

Don_Cragun · November 19, 2017, 2:44am

Maybe this would come closer to what you want:

    #!/bin/bash
    ls -r *.xml | while read -r file
    do      if [ "$last" = "${file%%_*}" ]
            then    echo rm "$file"
            else    last=${file%%_*}
            fi
    done

If that gives you the list of files you want to remove, remove the echo in front of rm and run the script again.
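
Note that the script above selects files by name only and never looks at their contents. If older files should be removed only when they really are identical to the latest one, something along the following lines might work. This is only a sketch under a few assumptions: the function name list_stale_duplicates is made up for illustration, "latest" means the name that sorts highest within a prefix (the same rule as ls -r above, so the timestamps must have equal length), and no filename contains a newline. It is a dry run that only prints the rm commands.

```shell
#!/bin/bash
# Dry-run sketch (hypothetical helper, not part of the script above):
# for each prefix (the text before the first "_"), keep the file whose name
# sorts highest and print an rm command for every other file with that
# prefix whose contents are byte-identical to it.
list_stale_duplicates() {
    dir=${1:-.}
    last=
    newest=
    # Reverse sort puts the highest timestamp first within each prefix.
    printf '%s\n' "$dir"/*.xml | LC_ALL=C sort -r | while IFS= read -r file
    do
        [ -f "$file" ] || continue    # skip the literal pattern if nothing matched
        key=${file##*/}               # strip the directory part
        key=${key%%_*}                # keep the text before the first "_"
        if [ "$key" != "$last" ]
        then
            last=$key
            newest=$file              # first file seen for a prefix is the latest
        elif cmp -s "$newest" "$file"
        then
            echo rm "$file"           # identical to the latest: candidate for removal
        fi
    done
}
```

Calling list_stale_duplicates . in the folder holding the XML files should print one rm line per older file that matches the latest file of its prefix. Older files whose contents differ are left alone. As with the script above, remove the echo once the list looks right.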