Find duplicates among 2 directories

I have 2 directories,

/media/andy/MAXTOR_SDB1/Ubuntu_Mate_18.04/
/media/andy/MAXTOR_SDB1/Linux_Files/.
 

I want to find which files are duplicates so I can delete them from one of those directories.

Hello drew77,

After 100+ posts on UNIX.com, we expect you to show us at least what you have tried in order to solve your own problem. It is always good to include your efforts in your questions, as we are all here to learn.

Kindly add your efforts in CODE TAGS and let us know.

Thanks,
R. Singh

And, pls add a definition of what makes a "duplicate" - a common file name? Common meta data as e.g. size, time stamps? Identical contents / check sum?

Yes, the same file name.

I want to put programs in their own directory while putting documents, and other changing files in another directory.

--- Post updated at 03:33 AM ---

R. Singh I did use code tags in my post.

Code:

/media/andy/MAXTOR_SDB1/Ubuntu_Mate_18.04/
diff /media/andy/MAXTOR_SDB1/Ubuntu_Mate_18.04 /media/andy/MAXTOR_SDB1/Linux_Files
Only in /media/andy/MAXTOR_SDB1/Linux_Files: Briggs_Stratton_Generator.zip
Only in /media/andy/MAXTOR_SDB1/Ubuntu_Mate_18.04: Brinkmann_8109415-W.zip
Only in /media/andy/MAXTOR_SDB1/Linux_Files: Brother_2240_Drivers.zip
 

I do not know what "Only in" means.

I have some html files in those directories.

My diff command showed this. (partial list)

Is there a way to not show the internals of those files?

sonType%3DcheckForloginAndRegister%26WT.z_eCTAid%3Dct1_eml_ViewPlan__ct1_eml_tra_eml_0day%26WT.z_edatesent%3D12082017&reasonCode=-1&appid=TRK_MC_CTA" ADD_DATE="1512750666" LAST_MODIFIED="1512750875" LAST_CHARSET="UTF-8">Log in | UPS Andy77586 Mar...7</A>
---
>         <DT><A HREF="https://beautifultaiwantea.com/collections/white-tea/products/silver-needle" ADD_DATE="1509652385" LAST_MODIFIED="1515190477" ICON_URI="https://beautifultaiwantea.com/favicon.ico" ICON="data:image/png;base64,R0lGODlhAQABAIABAAAAAP///yH5BAEAAAEALAAAAAABAAEAAAICTAEAOw==" LAST_CHARSET="UTF-8">Silver Needle White Tea | Beautiful Taiwan Tea Company</A>

There is a tool that can determine the identity of files using the md5 sum.

apt install fdupes

Look at the listing

fdupes dir1/ dir2/

Interactive mode with a choice to remove

fdupes -d dir1/ dir2/

The following is suitable for use in a script.
All duplicates of a file will be deleted except the first one in each set, which is kept (in sort order: by file name, then by directory name!).
A simple way to change which directory the kept file comes from is the -i option. It does not select the directory directly, but with the sort order reversed, the first file in each set may be the one in the folder you need.
Try
fdupes -i dir1/ dir2/
and then use
fdupes -Nd dir1/ dir2/
Before you delete anything, be sure to read the man page for the command and run some practice tests first.
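
To illustrate the checksum idea mentioned above, here is a rough sketch that groups files by md5 sum. It is not how fdupes is implemented, only the general principle. The function name list_dupes_by_md5 is made up for this example, and it assumes GNU md5sum and file names without newlines or backslashes.

```shell
# Hypothetical helper: print groups of files with identical md5 sums,
# one path per line, groups separated by an empty line.
list_dupes_by_md5() (
    # md5sum prints "<32 hex digits>  <path>"; sorting puts equal sums next to each other
    find "$@" -type f -exec md5sum {} + | LC_ALL=C sort | awk '
        { hash = substr($0, 1, 32); path = substr($0, 35) }
        hash == prev { if (!shown) print prevpath; print path; shown = 1 }
        hash != prev { if (shown) print ""; shown = 0 }
        { prev = hash; prevpath = path }
    '
)

# Usage (nothing is deleted, the groups are only listed):
# list_dupes_by_md5 dir1/ dir2/
```

Equal checksums make identical content very likely, but a byte-by-byte comparison (for example with cmp) is the safer final check before deleting anything.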

Sure that filenames are enough? Try

diff <(ls /media/andy/MAXTOR_SDB1/Ubuntu_Mate_18.04) <(ls /media/andy/MAXTOR_SDB1/Linux_Files)

given your system (which you fail to mention, btw) has a shell that provides "process substitution".
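
If a shared file name really is the only criterion, a plain loop might also do, and it works without process substitution. This is a minimal sketch: the function name common_names is invented for illustration, it only looks at the top level of each directory, and the * glob skips hidden files.

```shell
# Hypothetical helper: print every name that exists in both directories.
# Only names are compared, not sizes or contents.
common_names() (
    dir1=$1
    dir2=$2
    for f in "$dir1"/*; do
        name=${f##*/}                 # strip the directory part
        if [ -e "$dir2/$name" ]; then
            printf '%s\n' "$name"
        fi
    done
)

# Usage:
# common_names /media/andy/MAXTOR_SDB1/Ubuntu_Mate_18.04 /media/andy/MAXTOR_SDB1/Linux_Files
```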

diff is a utility that compares text files: you give it two text files and it tells you the differences between them. Until now I didn't know that the GNU version can compare directories too, but obviously it can. I have learned something new today.

To understand how diff works, let us suppose for the moment that it works on lines only (it doesn't). In principle there are three possibilities:

1) a line is present in both files
2) a line is present in file 1 (only) but not in file 2
3) a line is present in file 2 (only) but not in file 1

This is the situation you have here. Your output means the two directories will contain the same files once you:

1) copy Briggs_Stratton_Generator.zip from /media/andy/MAXTOR_SDB1/Linux_Files to /media/andy/MAXTOR_SDB1/Ubuntu_Mate_18.04
2) copy Brinkmann_8109415-W.zip from /media/andy/MAXTOR_SDB1/Ubuntu_Mate_18.04 to /media/andy/MAXTOR_SDB1/Linux_Files
3) copy Brother_2240_Drivers.zip also from /media/andy/MAXTOR_SDB1/Linux_Files to /media/andy/MAXTOR_SDB1/Ubuntu_Mate_18.04
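
The three cases can be spelled out for directory entries with a small loop. This is only a sketch under assumptions: classify_names is a made-up name, only the top level of each directory is examined, and hidden files are ignored by the * glob.

```shell
# Hypothetical helper: sort every name into "both", "only in dir1" or "only in dir2".
classify_names() (
    dir1=$1
    dir2=$2
    for f in "$dir1"/*; do
        [ -e "$f" ] || continue       # empty directory: the glob did not match
        name=${f##*/}
        if [ -e "$dir2/$name" ]; then
            printf 'both: %s\n' "$name"
        else
            printf 'only in %s: %s\n' "$dir1" "$name"
        fi
    done
    for f in "$dir2"/*; do
        [ -e "$f" ] || continue
        name=${f##*/}
        if [ ! -e "$dir1/$name" ]; then
            printf 'only in %s: %s\n' "$dir2" "$name"
        fi
    done
)
```

The "only in" lines correspond to what diff reports as "Only in ..." when it is given two directories.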

Notice, though, that two files which have the same name (all the others not mentioned in the output) are not necessarily the same: they could still differ in content. As a first step you would have to compare file sizes, and even if the sizes are the same the content might be different. You would have to use diff again (this time on the two individual files) to find out.
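
A possible way to automate that check for every same-named pair is sketched below. It swaps in cmp instead of diff, since cmp compares byte by byte and is equally happy with binary files. The function name same_name_report is hypothetical, and only regular files at the top level are considered.

```shell
# Hypothetical helper: for each file name present in both directories,
# report whether the two files have identical contents.
same_name_report() (
    dir1=$1
    dir2=$2
    for f in "$dir1"/*; do
        name=${f##*/}
        if [ -f "$f" ] && [ -f "$dir2/$name" ]; then
            # cmp -s is silent; its exit status is 0 only if the files are identical
            if cmp -s "$f" "$dir2/$name"; then
                printf 'identical: %s\n' "$name"
            else
                printf 'differs: %s\n' "$name"
            fi
        fi
    done
)
```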

I hope this helps.

bakunin

Isn't that pretty much clear, and more than easily verifiable?
I'd propose the interpretation that Brinkmann_8109415-W.zip is available in /media/andy/MAXTOR_SDB1/Ubuntu_Mate_18.04 AND NOT in /media/andy/MAXTOR_SDB1/Linux_Files

fdupes /media/andy/MAXTOR_SDB1/Linux_Files/ /media/andy/MAXTOR_SDB1/Ubuntu_Mate_18.04/


/media/andy/MAXTOR_SDB1/Ubuntu_Mate_18.04/Send_Email_Via_Command_Line.zip
/media/andy/MAXTOR_SDB1/Linux_Files/Send_Email_Via_Command_Line.zip

/media/andy/MAXTOR_SDB1/Ubuntu_Mate_18.04/My_Sounds.zip
/media/andy/MAXTOR_SDB1/Linux_Files/My_Sounds.zip

/media/andy/MAXTOR_SDB1/Linux_Files/Efax-gtk_Setup_IMPT.zip
/media/andy/MAXTOR_SDB1/Ubuntu_Mate_18.04/Efax-gtk_Setup_IMPT.zip

/media/andy/MAXTOR_SDB1/Linux_Files/SDB1_Maxtor_Drive
/media/andy/MAXTOR_SDB1/Linux_Files/MAXTOR_SDB1
/media/andy/MAXTOR_SDB1/Linux_Files/NEVER_DELETE_THIS_DIRECTORY
/media/andy/MAXTOR_SDB1/Ubuntu_Mate_18.04/2019-02-23_23:42
/media/andy/MAXTOR_SDB1/Ubuntu_Mate_18.04/2019-02-25_03:47

/media/andy/MAXTOR_SDB1/Linux_Files/multi-timer.zip
/media/andy/MAXTOR_SDB1/Ubuntu_Mate_18.04/multi-timer.zip
 

Is this a list of what files are identical, both in file name and content?

Yes. You can even add the -S (size) option to show the file sizes, so that there is no confusion.

--- Post updated at 15:05 ---

The diff utility should produce no output for such identical files:

diff \
/media/andy/MAXTOR_SDB1/Ubuntu_Mate_18.04/My_Sounds.zip \
/media/andy/MAXTOR_SDB1/Linux_Files/My_Sounds.zip

It doesn't REALLY seem so, does it? Just from looking at it and applying some logic and common sense, I'd say SDB1_Maxtor_Drive, MAXTOR_SDB1, and NEVER_DELETE_THIS_DIRECTORY are missing in /media/andy/MAXTOR_SDB1/Ubuntu_Mate_18.04/, and 2019-02-23_23:42 and 2019-02-25_03:47 are missing in /media/andy/MAXTOR_SDB1/Linux_Files/.
The other file names are paired and seem to exist in both directories, but cannot be considered equal based on the information known thus far.

To see which files would be deleted, use
fdupes -f
This displays the entire list without the first file in each set.
To remove the files in the displayed list, use
fdupes -Nd
But in your case I would use the interactive mode, at least until you get acquainted with the subtleties of this tool.
Good luck

--- Post updated at 15:28 ---

But that is not the case when you compare binary files. The fact is that the "fdupes" utility works with binary files too, while "diff" is a text tool.

Since my directories contain binaries, and diff only works with text files, it would not help me.

--- Post updated at 10:47 AM ---

The reason for my post is this.

I use Clonezilla to make images of my main drive to a 2nd older drive.

I also make images of my 2nd drive to my main drive.

That uses a lot more space than simply copying files to my main drive.

I will make a cp script for that.

rsync --progress -r -u /media/andy/MAXTOR_SDB1/Ubuntu_Mate_18.04/* /home/andy/Ubuntu_18.04_Programs/

If you want to synchronise two directories/filesystems in the way you have now described, then rsync is the way to go. rsync was built for exactly this purpose. You don't even need to check anything beforehand, because the tool will do that itself and simply do nothing if there is nothing to do (that is, if the directories are already in sync).
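
Before letting rsync copy anything, a dry run shows what it would do. The wrapper below is only a sketch: preview_sync is a made-up name, and it keeps the -r and -u options from the command above while adding -n (dry run) and -v (list files). A trailing slash on the source is used instead of /* so that hidden files are included too.

```shell
# Hypothetical wrapper: show what rsync would transfer, without copying anything.
preview_sync() {
    # -r: recurse into directories; -u: skip files that are newer on the destination
    # -n: dry run, nothing is written; -v: print the names that would be transferred
    rsync -r -u -n -v "$1"/ "$2"/
}

# Usage:
# preview_sync /media/andy/MAXTOR_SDB1/Ubuntu_Mate_18.04 /home/andy/Ubuntu_18.04_Programs
```

Once the list looks right, the same command without -n performs the actual copy.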

I hope this helps.

bakunin