Fastest way to delete duplicates from a large filelist.....

OK

I have two filelists......

The first is formatted like this....

/path/to/the/actual/file/location/filename.jpg

and has up to a million records

The second list shows just filename.jpg where there is more than one instance,

and has maybe up to 65,000 records

I want to copy the files (i.e. not retaining the full path) listed in the first filelist, as long as the filename does not appear in the second list.

At the moment I have a script that roughly does this....

FULLPATHFILENAME=`cat /FileWithPath.txt`
DUPLICATESLIST=`cat /DuplicateFiles.txt`

for REMOVEDUP in $FULLPATHFILENAME ; do
    for THISDUP in $DUPLICATESLIST ; do
        ISITADUP=`echo $REMOVEDUP | grep -v $THISDUP`
        echo "$ISITADUP" >> /ListWithoutDups.txt
    done
done

As you can see, the script pulls up a record from the "with the path" filelist and does an inverse grep to see if the filename is in the duplicate list; if it isn't, it outputs that filename with its path to /ListWithoutDups.txt. In the actual script it also does some copies and other actions on the file.

This is a pretty inefficient way of doing it IMHO as it has to pull in each record individually and then check to see if it's in the duplicates list (and that could mean 1m records * 60,000 duplicate checks).

Can anyone suggest a better/more efficient way to code this to achieve the same result?

Thanks

First off, it is wise to avoid 'loading' variables with cat -- in your case, with a million filenames/pathnames, you are likely to exceed the amount that can be stuffed into a variable. Something like this would allow you to do the same thing without issues:

while IFS= read -r filename
do
    echo "$filename"
done <file-list-file

That said, you are correct that your approach isn't efficient. I interpreted your requirements to be that you need a list of files from FileWithPath.txt that are NOT listed in the duplicate list file. If that is the case, this should work for you:

sed 's!.*/!!' FileWithPath.txt | sort -u >/tmp/f1   # strip pathname and sort, removing any dups
sort -u DuplicateFiles.txt >/tmp/f2                 # both files must be sorted for comm; remove dups just in case
comm -23 /tmp/f1 /tmp/f2 >ListWithoutDups.txt
rm /tmp/f1 /tmp/f2

I have always found the options to comm to be difficult to understand and have to read the man page nearly every time I use it. In this case, comm reads both files in parallel (thus they must be sorted) and keeps the records that are unique to the first file (not listed in the second file).
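In case a concrete example helps, here is a quick illustration with two made-up sorted files, a.txt (all filenames) and b.txt (the duplicates):

$ cat a.txt
apple
banana
cherry
$ cat b.txt
banana
$ comm -23 a.txt b.txt
apple
cherry

Only the lines unique to a.txt survive: -2 suppresses lines that appear only in the second file, and -3 suppresses lines common to both.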

Thanks for this but I'm not sure it'll work for me. My original example is the problem, sorry.

When you strip the pathname on the first line, doesn't that remove the location I need to get to the file that I want to perform the processes on?

Perhaps this may be of use:

$ cat filenames
no1
no2
$ cat paths
yes0
/path/to/file/yes1
/path/to/file/yes2
/path/to/file/no1
/path/to/file/no2
/path/to/file/no2/yes3
$ awk -F/ 'FNR==NR {fn[$0]; next} !($NF in fn)' filenames paths
yes0
/path/to/file/yes1
/path/to/file/yes2
/path/to/file/no2/yes3
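Spelled out with comments (my gloss of the one-liner above), the same awk reads:

awk -F/ '
    # While reading the first file (filenames), FNR==NR is true:
    # record each bare filename as a key in the array fn, then move on.
    FNR == NR { fn[$0]; next }

    # For the second file (paths), -F/ makes $NF the bare filename.
    # Print the whole line only if that filename was not in the first file.
    !($NF in fn)
' filenames paths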

Regards,
Alister


Alister

Thanks for this, it worked great and reduced the processing time from hours to minutes!!