Shell script to sort duplicate files listed in a text file

I have many PDFs scattered across 4 machines. There is one location where I maintain my other PDFs. The issue is that the 4 machines may have duplicate PDFs among themselves, but I want just one copy of each so that they can be transferred to that one location.

What I have thought is:
1) I have designed a script that scans each of the 4 machines and prints the list of PDF files to a text file named list.txt (an example of such a command is shown below).
2) So now I have all the PDFs listed in the list.txt file.
3) I need a shell script that checks this list and groups the duplicate files, so that I know where they are located and can see them grouped together.
The list.txt contains the full path along with the file name, so I guess we only have to compare the trailing file name part ending in ".pdf".
Please help me do this.
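For reference, the listing step on each machine does something along the lines of the command below (the starting directory is just an example and will differ per machine):

find /home -type f -iname "*.pdf" > list.txt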

The list.txt, which is already generated, looks like this:

/home/santosh/z_literature/MIF_Oxime_ph4_JBC_May2007.pdf
/home/santosh/z_literature/J_immun_biochemOFmif.pdf
/home/santosh/z_literature/sak/san/06_JCTC_06_bome.pdf
/home/santosh/z_literature/sak/san/03_IEJMD_05_nkr1.pdf
/home/santosh/z_literature/sak/san/07_JCAMD_06_CoRIA.pdf
/home/santosh/z_literature/sak/san/DDP-IV-JMM2007.pdf

Copy them all to the single location; duplicates will be overwritten.

Then rm all the other PDFs.
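If all the paths in list.txt are reachable from the machine that holds the single location, the copy step is basically a loop like this (DEST is just a placeholder for your target directory):

DEST=/path/to/single/location          # placeholder, replace with your directory
while IFS= read -r f; do
    cp "$f" "$DEST/"                   # a later file with the same name overwrites the earlier copy
done < list.txt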

If you want to use a script to see which files are duplicated, take an input file like this:

cat abc.txt
/home/santosh/z_literature/MIF_Oxime_ph4_JBC_May2007.pdf
/home/santosh/z_literature/J_immun_biochemOFmif.pdf
/home/santosh/z_literature/sak/san/06_JCTC_06_bome.pdf
/home/santosh/z_literature/sak/san/03_IEJMD_05_nkr1.pdf
/home/santosh/z_literature/sak/san/07_JCAMD_06_CoRIA.pdf
/home/santosh/z_literature/sak/san/DDP-IV-JMM2007.pdf
/home/santosh/z_literature/sak/san/06_JCTC_06_bome.pdf
/home/santosh/y_literature/sak/san/06_JCTC_06_bome.pdf

and use inline Perl:

cat abc.txt |perl -e 'my %hash;while($full_filename = <>){ chomp ($full_filename);my @cols = split("/",$full_filename);push @{$hash{$cols[-1]}}, $full_filename;}print "-"x80,"\n";foreach my $fn (keys %hash){print "$fn\n";map {print "$_\n";} @{$hash{$fn}};print "-"x80,"\n";}'

Added output formatting for readability.

Replace abc.txt with whatever file you have.

HTH,
PL

Another approach:

awk -F"/" 'a[$NF]{print a[$NF];print $0;next}{a[$NF]=$0}' file

Thanks Franklin52, your script did the trick!!
Also thanks a lot to the others who helped; I really appreciate the time you spared for the script!!

Franklin52, could you explain your command line?

I don't understand what the value of the array a is.

Thanx

awk -F"/" 'a[$NF]{print a[$NF];print $0;next}{a[$NF]=$0}' file

Explanation:

{a[$NF]=$0}

The value of the array element is the current line; the index is the filename (the last field, $NF).

a[$NF]{print a[$NF];print $0;next}

If the filename ($NF) of a line is already an index defined in array a, print the saved line stored in a[$NF] and the current line, then skip to the next input line.
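The same one-liner written out with comments, functionally identical, just to make the flow easier to follow:

awk -F"/" '
a[$NF] {              # the filename (last field) has been seen before
    print a[$NF]      # the first path that used this filename
    print $0          # the current, duplicate path
    next              # skip the block below for this line
}
{ a[$NF] = $0 }       # remember the first line seen for each filename
' file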

I hope this helps.

Thanks for the explanation, I needed that. Also, what modification is needed to display the non-duplicate files as well, but only after all the duplicate ones are displayed?

The easiest way is to redirect the output of the command to a file, e.g. dup_files, and use grep to get the other files:

grep -v -f dup_files file
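Put together, something like this (dup_files is just the name used here for the intermediate file):

awk -F"/" 'a[$NF]{print a[$NF];print $0;next}{a[$NF]=$0}' file > dup_files
grep -v -f dup_files file

If the paths contain characters that grep treats as pattern metacharacters, grep -F -x -v -f dup_files file matches the lines literally and as whole lines, which is a bit safer.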

I now have a list of duplicate files, but the issue is that I need to eliminate only the ones that are actually the same, not the ones that are different but happen to have the same name.

For example, if the files are
david/project1/symbiosys.pdf
tom/project1/symbiosys.pdf

and both people are working on the same project, the PDFs may well be identical, but I need to be sure, maybe by an md5 checksum or something similar that can be computed. If the file size differs, I need to save both of them in 2 different folders to prevent them from overwriting each other.

Any suggestions or help with the shell script would be appreciated.

Not sure if this is what you want, but you can use ls -l to check the size of the files:

awk -F"/" 'a[$NF]{system("ls -l " a[$NF]);system("ls -l " $0);next}{a[$NF]=$0}' file

What I need is somewhat difficult to code, so I'll do this part manually. Also, thanks for the help!!!
I really appreciate the time!