Compare files in directories with md5sum

I don't know where to start. Comparing two files is easy. The problem is that I need to take the files in one directory and check whether each of them exists in another directory, and the file names are not the same. So I have to compare them with "md5sum" or something similar. How can I do this?

All of this in Python.

Thanks! :slight_smile:

I don't understand this sentence, but ignoring that, the logic seems to be like this for me:-

  • Find all the files in your first directory. For each one, get the checksum and write a line in a work-file.
  • Find all the files in your second directory. For each one, get the checksum and check if it matches one recorded in the work-file.

I could do this in a shell script, but I cannot assist with Python.
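As a rough illustration of the two steps above, here is a minimal shell sketch. It builds its own scratch directories, and the names dir1, dir2 and workfile are made up for the example, so adjust them to your real paths.

```shell
# Sketch only: dir1, dir2 and workfile are made-up names for illustration.
tmp=$(mktemp -d)
dir1="$tmp/first"; dir2="$tmp/second"; workfile="$tmp/workfile"
mkdir "$dir1" "$dir2"
printf 'hello\n' > "$dir1/a.txt"
printf 'hello\n' > "$dir2/renamed.txt"   # same content, different name
printf 'other\n' > "$dir2/new.txt"       # content not present in first

# Step 1: record the checksum of every file in the first directory.
find "$dir1" -type f -exec md5sum {} \; | cut -d' ' -f1 > "$workfile"

# Step 2: for each file in the second directory, look its checksum up.
matches=""
for f in "$dir2"/*; do
    sum=$(md5sum "$f" | cut -d' ' -f1)
    if grep -qxF "$sum" "$workfile"; then
        echo "$(basename "$f"): content already in first directory"
        matches="$matches$(basename "$f") "
    else
        echo "$(basename "$f"): not found in first directory"
    fi
done
rm -rf "$tmp"
```

Because only the checksum is looked up, a file is recognised even when it has been renamed.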

Robin

Sorry, my English is not good.

I can also use bash.

I have downloaded some files from a url.

Files are downloaded to a directory.

wget --mirror --no-check-certificate --no-directories --no-host-directories -l1 https://site.com/out/

rm -f index.html

Now I need to compare the files (they may be identical even though they do not have the same name).

This is where I'm lost; I do not know how to do it.

So, from my suggested logic above (if that's acceptable) :-

(cd 1st-directory && find . -type f -exec md5sum {} \;) > /tmp/1st_list
(cd 2nd-directory && find . -type f -exec md5sum {} \;) > /tmp/2nd_list

This will get you two files containing the file-names and the md5-checksums. You can then compare the files with diff but the output can be a bit messy. It's neater to run two commands. The following will get you files in the second list that do not match those in the first list:-

grep -vFf /tmp/1st_list /tmp/2nd_list

You can reverse this to get those in the first list that are not in the second (i.e. you might not have all the files):-

grep -vFf /tmp/2nd_list /tmp/1st_list

If the filenames are not important (but I rather think that they are) then you can get just the checksums like this:-

cut -d' ' -f1 /tmp/1st_list > /tmp/1st_md5_only
cut -d' ' -f1 /tmp/2nd_list > /tmp/2nd_md5_only

You can then show what files from your second directory are not in the first:-

grep -vFf /tmp/1st_md5_only /tmp/2nd_list

or reverse it to show what files from the first directory are missing from the second:-

grep -vFf /tmp/2nd_md5_only /tmp/1st_list
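To show what the checksum-only comparison gives you, here is a small worked sketch on made-up sample files in a scratch directory (the directory and file names are assumptions for the example, not your real ones). It assumes GNU md5sum, which separates the checksum and the name with two characters.

```shell
# Made-up sample data in a scratch directory (all names are illustrative).
tmp=$(mktemp -d)
mkdir "$tmp/1st" "$tmp/2nd"
printf 'report\n' > "$tmp/1st/A.TXT"
printf 'report\n' > "$tmp/2nd/B.TXT"    # same content as A.TXT, other name
printf 'fresh\n'  > "$tmp/2nd/C.TXT"    # content only in the second directory

(cd "$tmp/1st" && find . -type f -exec md5sum {} \;) > "$tmp/1st_list"
(cd "$tmp/2nd" && find . -type f -exec md5sum {} \;) > "$tmp/2nd_list"

# md5sum separates checksum and name with spaces, so split on a space.
cut -d' ' -f1 "$tmp/1st_list" > "$tmp/1st_md5_only"

# Lines of the second list whose checksum is unknown to the first directory;
# the final cut drops the checksum and keeps only the file name.
new_in_2nd=$(grep -vFf "$tmp/1st_md5_only" "$tmp/2nd_list" | cut -d' ' -f3-)
echo "$new_in_2nd"
rm -rf "$tmp"
```

B.TXT is not reported because its content matches A.TXT, even though the names differ.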


The name grep comes from the ed command g/re/p (Globally search for a Regular Expression and Print), so it's a way to select rows of data.

  • The -v flag means to negate the selection
  • The -F flag treats the patterns as Fixed strings; otherwise they are interpreted as regular expressions.
  • The -f flag reads the patterns to search for from the file named in the next item.
  • The last item is the file to scan.
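If it helps to see the flags in action, this little sketch runs grep on two made-up files (patterns and data are just illustrative names) with and without -F and -v.

```shell
tmp=$(mktemp -d)
printf 'a.b\n' > "$tmp/patterns"            # one pattern: the text a.b
printf 'a.b\naxb\nzzz\n' > "$tmp/data"

# Without -F the dot is a regex wildcard, so both a.b and axb are selected.
regex_hits=$(grep -f "$tmp/patterns" "$tmp/data" | tr '\n' ' ')
# With -F only lines containing the literal string a.b are selected.
fixed_hits=$(grep -Ff "$tmp/patterns" "$tmp/data" | tr '\n' ' ')
# Adding -v inverts the selection: every line that does NOT contain a.b.
inverted=$(grep -vFf "$tmp/patterns" "$tmp/data" | tr '\n' ' ')
echo "regex: $regex_hits| fixed: $fixed_hits| inverted: $inverted"
rm -rf "$tmp"
```

File names often contain dots, which is why -F is worth having when the pattern file holds md5sum output.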

I hope that this helps, but if you are still concerned, then let us know your results.

There is a good chance that you will have the same filename in the two lists with different checksums if they are downloaded at different times as fixes are released.
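If you want to spot that case, one possible approach (a sketch only, on made-up sample files, and assuming GNU md5sum output where the name starts at column 35) is to let awk remember the checksum per name from the first list and report names in the second list whose checksum differs.

```shell
tmp=$(mktemp -d)
mkdir "$tmp/1st" "$tmp/2nd"
printf 'version 1\n' > "$tmp/1st/DATA.TXT"
printf 'version 2\n' > "$tmp/2nd/DATA.TXT"   # same name, newer content
printf 'same\n' > "$tmp/1st/KEEP.TXT"
printf 'same\n' > "$tmp/2nd/KEEP.TXT"

(cd "$tmp/1st" && md5sum ./*) > "$tmp/1st_list"
(cd "$tmp/2nd" && md5sum ./*) > "$tmp/2nd_list"

# First pass (NR == FNR) remembers the checksum per name from the first list;
# second pass prints names found in both lists with a different checksum.
changed=$(awk '
    { name = substr($0, 35) }       # 32 hex digits + 2 separator characters
    NR == FNR { sum[name] = $1; next }
    (name in sum) && sum[name] != $1 { print name }
' "$tmp/1st_list" "$tmp/2nd_list")
echo "$changed"
rm -rf "$tmp"
```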

Robin

Thank you all for your support. In the end I solved it.

for a in *.TXT; do
   # RECIBIDOS must end with a slash for this path to be right
   if [ -f "$RECIBIDOS$a" ]
   then
     rm -f "$a"
   fi
done

cp *.TXT "$RECIBIDOS"

.......

And then it processes the files.

Thanks again for your help.

You are just going to delete files if the name matches here. I thought that you were not concerned about matching names, but wanted to find duplicate files, hence the md5sum tests.

Oh well, so long as you are happy and you have a working solution.

Robin

Honestly, it's a temporary solution. When I am able to compare the files using md5sum, I'll change the script.

Of course I prefer to use the md5sum method.

Gluing together what I posted before, then:-

(cd 1st-directory && find . -type f -exec md5sum {} \;) > /tmp/1st_list
(cd 2nd-directory && find . -type f -exec md5sum {} \;) > /tmp/2nd_list

grep -vFf /tmp/1st_list /tmp/2nd_list > /tmp/2nd_only_list
grep -vFf /tmp/2nd_list /tmp/1st_list > /tmp/1st_only_list

cut -d' ' -f1 /tmp/1st_list > /tmp/1st_md5_only
cut -d' ' -f1 /tmp/2nd_list > /tmp/2nd_md5_only

grep -vFf /tmp/1st_md5_only /tmp/2nd_list > /tmp/2nd_files_not_matched_md5
grep -vFf /tmp/2nd_md5_only /tmp/1st_list > /tmp/1st_files_not_matched_md5

This will generate a few interesting files. Alternately, you could try this:-

(cd 1st-directory && find . -type f -exec md5sum {} \; > /tmp/md5file)

(cd 2nd-directory && md5sum -c /tmp/md5file)

This compares files by name, and it will not pick up any new files in 2nd-directory, so you might need to run it both ways round.

The output should be one line per file, with a message of:-

  • OK for matching files.
  • FAILED open or read for a missing file.
  • FAILED for a difference in the files.
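To see the three messages together, here is a sketch on made-up sample files in a scratch directory (the file names are only for illustration, and the exact wording assumes GNU md5sum).

```shell
tmp=$(mktemp -d)
mkdir "$tmp/1st" "$tmp/2nd"
printf 'same\n' > "$tmp/1st/OK.TXT";      printf 'same\n' > "$tmp/2nd/OK.TXT"
printf 'old\n'  > "$tmp/1st/CHANGED.TXT"; printf 'new\n'  > "$tmp/2nd/CHANGED.TXT"
printf 'gone\n' > "$tmp/1st/MISSING.TXT"  # no counterpart in 2nd

(cd "$tmp/1st" && find . -type f -exec md5sum {} \;) > "$tmp/md5file"

# md5sum -c exits non-zero when anything fails; 2>/dev/null hides the
# extra warnings it prints on standard error.
report=$(cd "$tmp/2nd" && md5sum -c "$tmp/md5file" 2>/dev/null | sort)
echo "$report"
rm -rf "$tmp"
```

The exit status of md5sum -c is also useful in a script: it is zero only when every listed file checks out.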

Does this get you further on? Without the md5 tests, you might as well just be doing:-

cp *.TXT $RECIBIDOS
rm -f *.TXT
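With the md5 tests, a version of your loop might look something like the sketch below. It is only an illustration on made-up sample files: the scratch directories, the dl variable and the known_md5 work-file are my own assumptions, and RECIBIDOS is assumed to end with a slash and to already hold at least one .TXT file.

```shell
tmp=$(mktemp -d)
RECIBIDOS="$tmp/recibidos/"      # trailing slash, as your loop assumes
dl="$tmp/download"               # hypothetical download directory
mkdir "$RECIBIDOS" "$dl"
printf 'old data\n' > "${RECIBIDOS}JAN.TXT"
printf 'old data\n' > "$dl/COPY.TXT"   # same content as JAN.TXT, other name
printf 'new data\n' > "$dl/FEB.TXT"    # not received before

# Checksums of everything already received.
md5sum "$RECIBIDOS"*.TXT | cut -d' ' -f1 > "$tmp/known_md5"

for a in "$dl"/*.TXT; do
    sum=$(md5sum "$a" | cut -d' ' -f1)
    if grep -qxF "$sum" "$tmp/known_md5"; then
        rm -f "$a"                  # content already received under some name
    else
        cp "$a" "$RECIBIDOS"        # genuinely new file
    fi
done
remaining=$(ls "$dl")
received=$(ls "$RECIBIDOS" | tr '\n' ' ')
echo "left in download: $remaining; received: $received"
rm -rf "$tmp"
```

This way a download is only thrown away when its content has been received before, whatever name it arrived under.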

I hope that this helps,
Robin