Shell script to sort duplicate files listed in a text file

I have many PDFs scattered across 4 machines. There is one location where I maintain my other PDFs. The issue is that the 4 machines may have duplicate PDFs among themselves, but I want just one copy of each so that they can be transferred to that one location.

What I have thought is:
1) I have designed a script that scans each of the 4 machines and prints the list of PDF files to a text file named list.txt (an example of such a command is shown below).
2) So now I have all the PDFs listed in the list.txt file.
3) I need a shell script that checks this list and groups the duplicate files, so that I know where they are located and can see them grouped together.
The list.txt contains the full path along with the file name, so I guess we only have to compare the trailing file name part ending in ".pdf".
Please help me do this.
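For reference, the listing step on each machine does something along the lines of the command below (the starting directory is just an example and will differ per machine):

find /home -type f -iname "*.pdf" > list.txt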

The list.txt, which is already generated, looks like this:

/home/santosh/z_literature/MIF_Oxime_ph4_JBC_May2007.pdf
/home/santosh/z_literature/J_immun_biochemOFmif.pdf
/home/santosh/z_literature/sak/san/06_JCTC_06_bome.pdf
/home/santosh/z_literature/sak/san/03_IEJMD_05_nkr1.pdf
/home/santosh/z_literature/sak/san/07_JCAMD_06_CoRIA.pdf
/home/santosh/z_literature/sak/san/DDP-IV-JMM2007.pdf

Copy them all to the single location; duplicates will be overwritten.

Then rm all the other PDFs.
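If all the paths in list.txt are reachable from the machine that holds the single location, the copy step is basically a loop like this (DEST is just a placeholder for your target directory):

DEST=/path/to/single/location          # placeholder, replace with your directory
while IFS= read -r f; do
    cp "$f" "$DEST/"                   # a later file with the same name overwrites the earlier copy
done < list.txt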

If you want to use a script to see which files are duplicated, take an input file like this:

cat abc.txt
/home/santosh/z_literature/MIF_Oxime_ph4_JBC_May2007.pdf
/home/santosh/z_literature/J_immun_biochemOFmif.pdf
/home/santosh/z_literature/sak/san/06_JCTC_06_bome.pdf
/home/santosh/z_literature/sak/san/03_IEJMD_05_nkr1.pdf
/home/santosh/z_literature/sak/san/07_JCAMD_06_CoRIA.pdf
/home/santosh/z_literature/sak/san/DDP-IV-JMM2007.pdf
/home/santosh/z_literature/sak/san/06_JCTC_06_bome.pdf
/home/santosh/y_literature/sak/san/06_JCTC_06_bome.pdf

and use inline Perl:

cat abc.txt |perl -e 'my %hash;while($full_filename = <>){ chomp ($full_filename);my @cols = split("/",$full_filename);push @{$hash{$cols[-1]}}, $full_filename;}print "-"x80,"\n";foreach my $fn (keys %hash){print "$fn\n";map {print "$_\n";} @{$hash{$fn}};print "-"x80,"\n";}'

Added output formatting for readability.

Replace abc.txt with whatever file you have.

HTH,
PL

Another approach:

awk -F"/" 'a[$NF]{print a[$NF];print $0;next}{a[$NF]=$0}' file

Thanks Franklin52, your script did the trick!!
Also thanks a lot to the others who helped; I really appreciate the time you spared for the script!!

Franklin52, could you explain your command line?

I don't understand what the value of the array a is.

Thanx

awk -F"/" 'a[$NF]{print a[$NF];print $0;next}{a[$NF]=$0}' file

Explanation:

{a[$NF]=$0}

The value of the array element is the current line; the index is the filename (the last field, $NF).

a[$NF]{print a[$NF];print $0;next}

If the filename ($NF) of a line is already an index defined in array a, print the saved line stored in a[$NF] and the current line, then skip to the next input line.
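The same one-liner written out with comments, functionally identical, just to make the flow easier to follow:

awk -F"/" '
a[$NF] {              # the filename (last field) has been seen before
    print a[$NF]      # the first path that used this filename
    print $0          # the current, duplicate path
    next              # skip the block below for this line
}
{ a[$NF] = $0 }       # remember the first line seen for each filename
' file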

I hope this helps.

Thanks for the explanation, I needed that. Also, what modification is needed to display the non-duplicate files as well, but only after all the duplicate ones are displayed?

The easiest way is to redirect the output of the command to a file, e.g. dup_files, and use grep to get the other files:

grep -v -f dup_files file
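Put together, something like this (dup_files is just the name used here for the intermediate file):

awk -F"/" 'a[$NF]{print a[$NF];print $0;next}{a[$NF]=$0}' file > dup_files
grep -v -f dup_files file

If the paths contain characters that grep treats as pattern metacharacters, grep -F -x -v -f dup_files file matches the lines literally and as whole lines, which is a bit safer.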

I now have a list of duplicate files, but the issue is that I need to eliminate only the ones that are actually the same, not the ones that are different but happen to have the same name.

For example, if the files are
david/project1/symbiosys.pdf
tom/project1/symbiosys.pdf

and both people are working on the same project, the PDFs may well be identical, but I need to be sure, maybe by an md5 checksum or something similar that can be computed. If the file size differs, I need to save both of them in 2 different folders to prevent them from overwriting each other.

Any suggestions or help with the shell script would be appreciated.

Not sure if this is what you want, but you can use ls -l to check the size of the files:

awk -F"/" 'a[$NF]{system("ls -l " a[$NF]);system("ls -l " $0);next}{a[$NF]=$0}' file

What I need is somewhat difficult to code, so I'll do this part manually. Also, thanks for the help!!!
I really appreciate the time!