Unique files in a given directory

I keep all my files on a NAS device and copy files from it to USB or local storage when needed. The bad part about this is that I often end up with the same file in numerous places. I'd like to write a script to check whether the files in a given directory already exist in another.

An example:

Say I have a directory called "Stuff" and another called "AllMyFiles". I want a script to check the directory "Stuff" and tell me which of its files already exist in "AllMyFiles". The way I currently do this is to use fdupes to create a list of all duplicate files across both directories, then use grep to spot which of those duplicates are in "Stuff". The drawback is that fdupes checks every file for duplicates, including the files within "AllMyFiles" themselves, so it takes a long time. Is there a clever way of avoiding this and checking only the files in "Stuff" to see if a duplicate exists for each in "AllMyFiles"?
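
For reference, the fdupes-and-grep approach described above amounts to something like this (just a sketch; the grep pattern depends on how your paths look, and dupes.txt is a made-up name):

# list every duplicate set across both trees, then pick out the entries under Stuff
fdupes -r Stuff AllMyFiles > dupes.txt
grep '^Stuff/' dupes.txt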

Keep a list of your file cksums, and use that to pick out the new files (this still cksums everything in Stuff every time):

#!/usr/bin/bash

# first time only # ( cd AllMyFiles ; find * -type f | xargs -n99 cksum > ~/AllMyFiles.ck )

# checksum everything under Stuff, one record per checksum, sorted on the checksum field
( cd Stuff ; find * -type f | xargs -n99 cksum | sort -u +0 -1 > ~/Stuff.ck )

# checksums that are in Stuff but not in AllMyFiles
comm -23 <( cut -d ' ' -f 1 ~/Stuff.ck ) <( cut -d ' ' -f 1 ~/AllMyFiles.ck | sort ) > ~/newStuff.ck

# pull the full records for those new checksums, copy each file across and extend the master list
join ~/newStuff.ck ~/Stuff.ck | while read ck len fn
do
 cp "Stuff/$fn" "AllMyFiles/$fn"
 echo "$ck $len $fn" >> ~/AllMyFiles.ck
done

Thanks for the great help, DGPickett. Could you please explain what some of the switches do and why they matter, for example -n99 in xargs and -u +0 -1 in sort? The reason I ask is that I rewrote this using parallel, and I'm wondering if my script will run into pitfalls that I've overlooked.

#!/bin/bash
# cdupes.sh - list files in directory $2 whose checksums already exist in directory $1
MasterL=Master.ck
CompareL=Compare.ck

# checksum both trees, running cksum jobs in parallel
find "$1" -type f | parallel cksum > "$MasterL"
find "$2" -type f | parallel cksum | sort > "$CompareL"

# join on the checksum field and print only the matching file names from $2
join "$CompareL" <(cut -d' ' -f1 "$MasterL" | sort -u) | cut -d' ' -f3-

xargs is a very nice way to get economy of scale in shell scripting, like calling grep once for every 99 files instead of once per file. -n99 does two things: it asks xargs to fit up to 99 arguments on each command line (really, commands are execvp()'d as arrays of pointers to arrays of characters, not as one string), and it also says not to run the command at all for empty input.
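
To see the batching effect, compare the xargs call with a one-process-per-file equivalent (same output, far more fork/exec overhead in the second form):

# one cksum invocation per (up to) 99 file names
find Stuff -type f | xargs -n99 cksum > Stuff.ck

# one cksum invocation per file
find Stuff -type f -exec cksum {} \; > Stuff.ck

# (with GNU find/xargs, -print0 and -0 make this safe for names containing spaces)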

Sort has old and new key syntaxes. These are old-style keys, zero-based and covering whole whitespace-separated fields, so sort -u +0 -1 means sort on the first field only and toss any later record that duplicates an earlier first field (the modern equivalent is sort -u -k1,1). If many files have the same checksum, they are probably identical, in fact probably empty!
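
A tiny worked example with made-up checksum records, using the modern spelling of the same key:

printf '111 10 a.txt\n111 10 copy_of_a.txt\n222 5 b.txt\n' | sort -u -k1,1
# one line per checksum survives, e.g.:
# 111 10 a.txt
# 222 5 b.txt
# old-style spelling of the same sort: sort -u +0 -1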

You can "man sort" and "man xargs" for this, or use the "Man Pages" link above, or google.

I make lists, like database tables. I can cut off the first, key field to make key lists, then run them through comm to find out which keys are in list 1 but not in list 2 nor in both. Then I can use that still-sorted key list with join to pull in the desired file names. "while read x y z" reads lines and splits the fields on $IFS (whitespace by default): x gets the first field, y the second and z the rest.
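
Strung together on toy data (keys1.txt, keys2.txt and records.txt are made-up names, and records.txt is assumed to be sorted on its first field):

# keys present in list 1 but absent from list 2 (comm needs both inputs sorted)
comm -23 <( sort keys1.txt ) <( sort keys2.txt ) > new_keys.txt

# pull the full records for those keys and split each line into variables
join new_keys.txt records.txt | while read key size name
do
 echo "new: $name ($size bytes, checksum $key)"
done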

GNU parallel is much like xargs, but on steroids. I am not sure how it distributes the lines and how it syncs them back to sequential order, in terms of cost, latency, disk space and such. I have several parallel tools, but xargs is good enough for many things. Since this feeds a sort, line buffering might be fine for many file descriptors writing one pipe, and who cares about order! I will look into it! One wonders if and how it buffers threads 2-n until 1 is done. Thanks!

Speedup: find all the files in Stuff, then use sort, cut and comm to work out which of them are new (not on the old Stuff list), cksum only those new files, and finally append their cksums to the Stuff list.
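
A rough sketch of that idea, reusing ~/Stuff.ck from the script above (file names sit in field 3 onward of each cksum record; names containing spaces would need more care):

cd Stuff

# names already covered by the old checksum list
cut -d ' ' -f 3- ~/Stuff.ck | sort > ~/known.names

# names currently on disk
find * -type f | sort > ~/current.names

# checksum only the genuinely new names and append them to the list
# (-r is a GNU xargs extension: run nothing if there is nothing new;
#  re-sort ~/Stuff.ck afterwards if later steps expect it sorted)
comm -13 ~/known.names ~/current.names | xargs -r -n99 cksum >> ~/Stuff.ck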

Alternatively you can also use finddup: Finddup - Find duplicate files by content, name

Find duplicate files by name:

./finddup -n

This displays the files in the current directory that share the same name.

Neat! Most files turn out to be different very quickly, which might save time in specialized code. Not re-examining old files is a competing tactic, so it depends on the numbers (and on whether files change under the same name).
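
That early-exit behaviour is what cmp already gives you for a pairwise check (a standalone illustration, not part of the scripts above; the file names are made up):

# cmp -s stops reading at the first differing byte and just sets the exit status
if cmp -s "Stuff/somefile" "AllMyFiles/somefile"
then
 echo "identical"
else
 echo "different"
fi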

The speedup idea above might also include files newer than the last checksum file, to pick up revisions, if they are not linked in.
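
With find that could look something like this (a sketch that uses ~/Stuff.ck's own timestamp as the cut-off; ~/revised.ck is a made-up scratch file):

cd Stuff

# re-checksum anything modified since the checksum list was last written,
# then fold the new records back into the list
find * -type f -newer ~/Stuff.ck | xargs -r -n99 cksum > ~/revised.ck
cat ~/revised.ck >> ~/Stuff.ck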

Thanks thegeek. For duplicate filenames I have a Perl script which I have been using for a long time. It's pretty fast and I can add my own criteria to the search.

But I might just start using finddup instead of fdupes and the Perl script for simpler searches.

After using the script above, to move the duplicates into a new directory tree with the same relative paths:

./cdupes.sh directory1 directory2 | parallel 'file={}; fpatha=${file#*./}; fpath=${fpatha%/*}; mkdir -p "./ToDelete/$fpath"; mv "$file" "./ToDelete/$fpatha"'