Making a script to copy files not seen before (using md5sum)

Hello,

I would like to make a script that searches through a SRC folder and copies only files it's never seen before to a DEST folder.

SRC = /user/.phonesync/photos-backup
DST = /user/.phonesync/photos-new

So basically, I'd start with a:

md5sum /user/.phonesync/photos-backup/* > /user/.phonesync/photos-backup.md5

Then, I'd like to run some sort of "cp" or "rsync" from SRC to DEST after comparing the current contents of /photos-backup/ to the .md5 list. If a file is not in the MD5 list, then copy it to /photos-new/ and then add it to the photos-backup.md5. The md5 list will always get appended to and grow and grow, but I'll deal with that at some point by deleting it and generating a new md5 list.

You can check if the file name is in your 'md5' file and if not copy it:

for f in $SRC/*
do
  grep -q $f md5_files.txt
  if [[ $? -ne 0 ]]; then
    cp $SRC/$f $DST
  fi
done

So I'm thinking more as I go... I haven't tested this yet, but is this a valid modification?

SRC=/user/nick/.phonesync/photos-backup
DST=/user/nick/.phonesync/photos-new
MD5=/user/nick/.phonesync/photos-backup.md5

for f in $SRC/*
do
  grep -q $f $MD5
  if [[ $? -ne 0 ]]; then
    cp $SRC/$f $DST
  fi
done

Then the one thing I would want to add to this is to do an append of the md5sum of each file copied. So maybe having a "md5sum $f >>$MD5" before the close of the if statement. Does that sound like it would work? If so, how do I properly include that in the if? (Yes, I'm very new to this.)

Thanks!

You seem to have it:

for f in $SRC/*
do
  grep -q $f $MD5
  if [[ $? -ne 0 ]]; then
    cp $SRC/$f $DST
    md5sum $f >> $MD5
  fi
done

OK... I have to go home and test this, but if I wanted to be able to run a command like this...

./copy_new_photos.sh nick

and that would look for the user "nick" for new photos and then copy them, then I would script it like this...

#!/bin/bash

USER=$1
SRC=/user/$USER/.phonesync/photos-backup
DST=/user/$USER/.phonesync/photos-new
MD5=/user/$USER/.phonesync/photos-backup.md5

for f in $SRC/*
do
  grep -q $f $MD5
  if [[ $? -ne 0 ]]; then
    cp $SRC/$f $DST
    md5sum $f >> $MD5
  fi
done

Did I get it right?

Have you tried rsync?

rsync -a "${SRC}"/ "${DST}"/

As far as I can see, you only copy files whose names are not in the md5sum list. I don't see any actual md5sum check being done against the files, though.

Actually, you could just use rsync's --ignore-existing option if you want to skip files that already exist in the destination directory.

rsync -avv --ignore-existing src/ des/

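If you still want to keep track of that list and have the command do both the copying and the excluding, you could use rsync's --exclude-from={PATH_TO_YOUR_LIST} option (the list would then need to hold one file name or pattern per line rather than raw md5sum output).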

Thank you for the suggestions, but rsync is not my solution. I want to be able to delete all the files in the $DST folder and still not have files copied over again, even if $DST is empty and $SRC is full. Bottom line: if the files that are "currently" in $SRC have already been copied over to $DST once, then they don't get copied again (unless I intentionally restart by deleting the MD5 list).

I was pondering this code last night too, and I realized that it's just matching a filename against the MD5 list. It's almost like I need to do something like this...

#!/bin/bash

USER=$1
SRC=/user/$USER/.phonesync/photos-backup
DST=/user/$USER/.phonesync/photos-new
MD5=/user/$USER/.phonesync/photos-backup.md5

for f in $SRC/*
do
  FMD5=md5sum $f
  grep -q $FMD5 $MD5
  if [[ $? -ne 0 ]]; then
    cp $SRC/$f $DST
    md5sum $f >> $MD5
  fi
done

Unfortunately, I'm not where I can test this right now, but does that seem like it would work?
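
For comparison, here is a minimal sketch of matching on the checksum itself rather than on the file name. It assumes GNU md5sum is installed; the helper name seen_before and the throwaway files are made up for the illustration and are not part of the scripts in this thread.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the checksum of file $1 is already in list $2.
seen_before() {
  local file=$1 list=$2 sum
  # Feed the file on stdin so md5sum prints "-" instead of a path,
  # then keep only the first field (the hash itself).
  sum=$(md5sum < "$file" | cut -d' ' -f1)
  # -F: treat the hash as a fixed string, -q: no output, exit status only.
  grep -qF -- "$sum" "$list"
}

# Demo in a throwaway directory.
work=$(mktemp -d)
printf 'photo data\n' > "$work/a.jpg"
cp "$work/a.jpg" "$work/renamed.jpg"      # same content, different name
printf 'other data\n' > "$work/b.jpg"
md5sum "$work/a.jpg" > "$work/list.md5"   # record a.jpg only

seen_before "$work/renamed.jpg" "$work/list.md5" && echo "renamed.jpg: already seen"
seen_before "$work/b.jpg" "$work/list.md5" || echo "b.jpg: new"
```

Because only the hash is compared, a renamed copy of an already-recorded photo counts as seen, which may or may not be what you want.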

---------- Post updated at 10:34 AM ---------- Previous update was at 09:39 AM ----------

OK... I had a moment to test a script and this is what happened.

Here are the two directories I'm working with as a sample:

nick@server ~$ ls t* -a -l
test1:
total 12
drwxr-xr-x 2 nick nick 4096 Aug 30 10:12 .
drwxr-xr-x 7 nick nick 4096 Aug 30 10:20 ..
-rw-r--r-- 1 nick nick   51 Aug 30 10:28 testfile1.jpg

test2:
total 8
drwxr-xr-x 2 nick nick 4096 Aug 30 10:09 .
drwxr-xr-x 7 nick nick 4096 Aug 30 10:20 ..

Here is my code:

#!/bin/bash

# The source directory where the photo folder on the phone is mirrored to
SRC=test1

# The destination directory where we want to copy only new photos we have not copied before
DST=test2

# The MD5 list file that tracks which files we have copied before
MD5=test1.md5

# Check files against the MD5 list and then copy if not previously copied
# Then add the md5 for that file to the MD5 list
for f in $SRC/*
do
  FMD5=md5sum $f
  grep -q $FMD5 $MD5
  if [[ $? -ne 0 ]]; then
    cp $SRC/$f $DST
    md5sum $f >> $MD5
  fi
done

Here is the output:

nick@server ~$ ./copy.sh
./copy2.sh: line 16: test1/testfile1.jpg: Permission denied

Anyone tell me where I went wrong?

Try it using the full path in your variables:

SRC=/user/nick/test1
DST=/user/nick/test2

OK. Revised the script code.

#!/bin/bash

# The source directory where the photo folder on the phone is mirrored to
SRC=/hd1/home/nick/test1

# The destination directory where we want to copy only new photos we have not copied before
DST=/hd1/home/nick/test2

# The MD5 list file that tracks which files we have copied before
MD5=/hd1/home/nick/test1.md5

# Check files against the MD5 list and then copy if not previously copied
# Then add the md5 for that file to the MD5 list
for f in $SRC/*
do
  FMD5=md5sum $f
  grep -q $FMD5 $MD5
  if [[ $? -ne 0 ]]; then
    cp $SRC/$f $DST
    md5sum $f >> $MD5
  fi
done

Here is the actual location on the system:

nick@server /$ ls /hd1/home/nick/t* -l
-rw-r--r-- 1 nick nick    0 Aug 30 10:32 /hd1/home/nick/test1.md5

/hd1/home/nick/test1:
total 4
-rw-r--r-- 1 nick nick 51 Aug 30 10:28 testfile1.jpg

/hd1/home/nick/test2:
total 0

And the output:

nick@server ~$ ./copy2.sh
./copy2.sh: line 16: /hd1/home/nick/test1/testfile1.jpg: Permission denied

First off, I'm a little confused why you're getting an error after running copy.sh but the error points to another file (copy2.sh). Do you have 2 script files?
The error message you are getting points to line #16, but this is what I see in line #16:

FMD5=md5sum $f

Is that correct?
Also, is the script located in the same directory as the test1 and test2 directories?

---------- Post updated at 07:57 PM ---------- Previous update was at 07:52 PM ----------

Also, you could try turning on debugging mode with

set -x

right after #!/bin/bash
Like so

#!/bin/bash
set -x
[The rest of your script goes here]

And run it again.
This will tell you what happens with your script step by step as it is executed.
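
As a small illustration of the tracing described above, the sketch below switches it on around a single statement only. The variable names are invented for the example; set +x is the standard way to turn tracing back off.

```shell
#!/bin/bash
# Trace only the part under suspicion; "set +x" turns tracing off again.
greeting="hello"
set -x
target=$(printf '%s world' "$greeting")
set +x
echo "$target"
```

The trace lines go to standard error, each prefixed with "+", and show every command after variables have been expanded. Running an unmodified script as "bash -x script.sh" gives the same kind of trace for the whole script.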

OK. Here is a simpler view of the directories (I hope) so everyone knows which files are where:

nick@server ~$ cd ~
nick@server ~$ ls -R
.:
copy2.sh  test1  test1.md5  test2

./test1:
testfile1.jpg

./test2:
nick@server ~$ cd /hd1/home/nick
nick@server ~$ ls -R
.:
copy2.sh  test1  test1.md5  test2

./test1:
testfile1.jpg

./test2:

So you can see that my home directory is actually /hd1/home/nick
Inside my home folder there are two directories: test1, and test2
The "test1" folder has a file in it.
The "test2" folder is empty.
I have a "test1.md5" file in my home directory that is empty (created with "touch test1.md5" command as one of my troubleshooting attempts).

And I made the modification (adding set -x) to copy2.sh:

nick@server ~$ cat copy2.sh
#!/bin/bash
set -x

# The source directory where the photo folder on the phone is mirrored to
SRC=/hd1/home/nick/test1

# The destination directory where we want to copy only new photos we have not copied before
DST=/hd1/home/nick/test2

# The MD5 list file that tracks which files we have copied before
MD5=/hd1/home/nick/test1.md5

# Check files against the MD5 list and then copy if not previously copied
# Then add the md5 for that file to the MD5 list
for f in $SRC/*
do
  FMD5=md5sum $f
  grep -q $FMD5 $MD5
  if [[ $? -ne 0 ]]; then
    cp $SRC/$f $DST
    md5sum $f >> $MD5
  fi
done

And here is the new output.

nick@server ~$ ./copy2.sh
+ SRC=/hd1/home/nick/test1
+ DST=/hd1/home/nick/test2
+ MD5=/hd1/home/nick/test1.md5
+ for f in '$SRC/*'
+ FMD5=md5sum
+ /hd1/home/nick/test1/testfile1.jpg
./copy2.sh: line 17: /hd1/home/nick/test1/testfile1.jpg: Permission denied
+ grep -q /hd1/home/nick/test1.md5
^C
nick@server ~$

It stops there and I have to CTRL+C to stop it and get back to prompt.

Oh and the copy.sh vs copy2.sh issue, I don't know. Perhaps that was an errant keystroke on the delete key from me when typing up my post. Can't reproduce that.

There you go.
Refer to this line:

+ grep -q /hd1/home/nick/test1.md5

The fact that the script stops there (and the fact that the command is incomplete) tells us that the error is in the previous command.
Replace the following line:

FMD5=md5sum $f

With

FMD5=$(md5sum $f)

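The error is that FMD5=md5sum $f does not run md5sum at all. The shell reads it as a temporary assignment (FMD5=md5sum) followed by a command ($f), so it tries to execute the .jpg file itself, which is where "Permission denied" comes from. FMD5 is left empty, so the grep that follows gets only one argument, takes it as the pattern, and sits waiting for input on stdin. With $(...), the shell runs md5sum and substitutes its output into the assignment.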
Please let me know how it goes.
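
The difference between the two forms can be seen in a short sketch. It uses a throwaway file, and env stands in for "some command" so the temporary assignment becomes visible; neither is part of the original script.

```shell
#!/bin/bash
work=$(mktemp -d)
printf 'sample\n' > "$work/testfile1.jpg"

# Form 1: "NAME=value command" runs the command with NAME set only in
# that command's environment; NAME is not assigned in the current shell.
unset FMD5
FMD5=md5sum env | grep '^FMD5='        # prints FMD5=md5sum
echo "after form 1: FMD5='${FMD5-}'"   # FMD5 is still unset here

# Form 2: command substitution runs md5sum and captures its standard output.
FMD5=$(md5sum "$work/testfile1.jpg")
echo "after form 2: FMD5='$FMD5'"
```

In the failing script the "command" in form 1 was the .jpg path, so the shell tried to execute the photo.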

So I made that change. Here is the new script:

nick@server ~$ cat copy2.sh
#!/bin/bash
set -x

# The source directory where the photo folder on the phone is mirrored to
SRC=/hd1/home/nick/test1

# The destination directory where we want to copy only new photos we have not copied before
DST=/hd1/home/nick/test2

# The MD5 list file that tracks which files we have copied before
MD5=/hd1/home/nick/test1.md5

# Check files against the MD5 list and then copy if not previously copied
# Then add the md5 for that file to the MD5 list
for f in $SRC/*
do
  FMD5=$(md5sum $f)
  grep -q $FMD5 $MD5
  if [[ $? -ne 0 ]]; then
    cp $SRC/$f $DST
    md5sum $f >> $MD5
  fi
done

Here is the new output.

nick@server ~$ ./copy2.sh
+ SRC=/hd1/home/nick/test1
+ DST=/hd1/home/nick/test2
+ MD5=/hd1/home/nick/test1.md5
+ for f in '$SRC/*'
++ md5sum /hd1/home/nick/test1/testfile1.jpg
+ FMD5='56eb2d747f5e0e08d8e743c968fc53aa  /hd1/home/nick/test1/testfile1.jpg'
+ grep -q 56eb2d747f5e0e08d8e743c968fc53aa /hd1/home/nick/test1/testfile1.jpg /hd1/home/nick/test1.md5
+ [[ 1 -ne 0 ]]
+ cp /hd1/home/nick/test1//hd1/home/nick/test1/testfile1.jpg /hd1/home/nick/test2
cp: cannot stat `/hd1/home/nick/test1//hd1/home/nick/test1/testfile1.jpg': No such file or directory
+ md5sum /hd1/home/nick/test1/testfile1.jpg


nick@server ~$ ls t*
test1.md5

test1:
testfile1.jpg

test2:
nick@server ~$ ~

This will do the trick:

#!/bin/bash
set -x

# The source directory where the photo folder on the phone is mirrored to
SRC=test1

# The destination directory where we want to copy only new photos we have not copied before
DST=test2

# The MD5 list file that tracks which files we have copied before
MD5=/hd1/home/nick/test1.md5

# Check files against the MD5 list and then copy if not previously copied
# Then add the md5 for that file to the MD5 list
for f in $SRC/*
do
  FMD5=$(md5sum $f)
  grep -q $FMD5 $MD5
  if [[ $? -ne 0 ]]; then
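    cp $f $DST
    md5sum $f >> $MD5
  fi
done

Once the shell expansion is fixed, the remaining problem is the pathname: $f already begins with $SRC, so cp has to be given $f rather than $SRC/$f. I also reverted to relative pathnames for the source and the destination, so run the script from the directory that contains test1 and test2.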

OK... well there's only one hitch with that... I actually wanted to start my finalized script like this:

#!/bin/bash

# The username variable passed by command line to this script
USER=$1

# The source directory where the photo folder on the phone is mirrored to
SRC=/hd1/home/$USER/.phonesync/photos-backup

# The destination directory where we want to copy only new photos we have not copied before
DST=/hd1/home/$USER/.phonesync/photos-new

# The MD5 list file that tracks which files we have copied before
MD5=/hd1/home/$USER/.phonesync/photos-backup.md5

This way I could set up cron jobs to execute this one script multiple times but for different users.

---------- Post updated at 09:31 AM ---------- Previous update was at 09:23 AM ----------

So this is getting closer to the final script I would like to have:

#!/bin/bash
set -x

# The username variable passed by command line to this script
USER=$1

# The source directory where the photo folder on the phone is mirrored
SRC=/hd1/home/$USER/.phonesync/photos-backup

# The destination directory where we want to copy only new photos that we have never copied before
DST=/hd1/home/$USER/.phonesync/photos-new

# The MD5 list file that tracks the files that we have copied before out of photos-backup directory
MD5=/hd1/home/$USER/.phonesync/photos-backup.md5

# Check files against the MD5 list and then copy if not previously copied
# Then add the md5 for that file to the MD5 list
for f in $SRC/*
do
  FMD5=$(md5sum $f)
  grep -q $FMD5 $MD5
  if [[ $? -ne 0 ]]; then
    cp $SRC/$f $DST
    md5sum $f >> $MD5
  fi
done

# In case this script gets run as root, redo the file ownership so users can access their photos
chown -R $USER:$USER $DST

I can't test that at the moment but I will set up a real test with those directories later today and put up the result.

If you do this:

SRC=/hd1/home/$USER/.phonesync/photos-backup

Then

DST=/hd1/home/$USER/.phonesync/photos-new

And if your for loop is initialized like this:

for f in $SRC/*

The copy command cp $SRC/$f $DST will end up doing this:

cp /hd1/home/$USER/.phonesync/photos-backup//hd1/home/$USER/.phonesync/photos-backup/yourfile.jpg /hd1/home/$USER/.phonesync/photos-new
Which, as you can see, just doesn't make sense.

So here's what I would do:

cd $SRC
for f in *
do
  FMD5=$(md5sum $f)
  grep -q $FMD5 $MD5
  if [[ $? -ne 0 ]]; then
    cp $f $DST
    md5sum $f >> $MD5
  fi
done

As I said earlier, it's all a matter of mixed-up pathnames, but this should do the trick.
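
If changing directory is not convenient, an alternative is to keep the full path that the glob produces and strip the directory part only where the bare name is needed. This is just a sketch with throwaway directories; the variable "name" is introduced for the illustration.

```shell
#!/bin/bash
work=$(mktemp -d)
SRC=$work/photos-backup
DST=$work/photos-new
mkdir -p "$SRC" "$DST"
printf 'x\n' > "$SRC/yourfile.jpg"

for f in "$SRC"/*; do
  # $f is already "$SRC/yourfile.jpg", so prefixing $SRC again doubles the path.
  name=${f##*/}          # strip everything up to and including the last slash
  echo "glob gave: $f"
  echo "bare name: $name"
  cp -- "$f" "$DST/"     # copy using the path exactly as the glob produced it
done
```

Quoting "$SRC" and "$f" also keeps the loop working when a directory or file name contains spaces.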

Why not simply check the dest folder .....

SRC=/user/nick/.phonesync/photos-backup
DST=/user/nick/.phonesync/photos-new

for f in $SRC/*
do
  test -x $DST/$f
  if [[ $? -ne 0 ]]; then
    cp $SRC/$f $DST
  fi
done

The "why not" was explained in message #8 in this thread.

But even if this idea would work, the code presented above would not: $f expands to $SRC, a slash, and the name of a file in $SRC, so:

  test -x $DST/$f

tests the pathname $DST/$SRC/filename, which will never exist. (Note also that -x tests for execute permission; -e is the test for existence.)
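
For completeness, a destination check with the pathname problem removed might look like the sketch below. It uses throwaway directories and a hypothetical "copied" array to show what happened, and it still does not meet the requirement of never re-copying after $DST has been emptied.

```shell
#!/bin/bash
work=$(mktemp -d)
SRC=$work/src
DST=$work/dst
mkdir -p "$SRC" "$DST"
printf 'a\n' > "$SRC/one.jpg"
printf 'b\n' > "$SRC/two.jpg"
cp "$SRC/one.jpg" "$DST/"      # pretend one.jpg was copied earlier

copied=()
for f in "$SRC"/*; do
  name=${f##*/}                # bare file name without the $SRC prefix
  if [[ ! -e $DST/$name ]]; then
    cp -- "$f" "$DST/"
    copied+=("$name")
  fi
done
echo "copied: ${copied[*]}"    # prints: copied: two.jpg
```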

This is the final working script!

#!/bin/bash

# The username variable passed by command line to this script
USER=$1

# The source directory where the photo folder on the phone is mirrored to
SRC=/hd1/home/$USER/.phonesync/photos-backup

# The destination directory where we want to copy only new photos we have not copied before
DST=/hd1/home/$USER/.phonesync/photos-new

# The MD5 list file that tracks which files we have copied before
MD5=/hd1/home/$USER/.phonesync/photos-backup.md5

# Check files against the MD5 list and then copy if not previously copied
# Then add the md5 for that file to the MD5 list
cd $SRC
for f in *
do
  FMD5=$(md5sum $f)
  grep -q $FMD5 $MD5
  if [[ $? -ne 0 ]]; then
    cp $f $DST
    md5sum $f >> $MD5
  fi
done

# In case this script gets run as root, redo the file ownership so users can access their photos
chown -R $USER:$USER $DST

THANK YOU ALL! I've learned a little through this process. I wish I could say I understood it all. But the most useful thing I probably learned here was the "set -x" tool to help me see what a failing script is doing. Thank you everyone!
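
For anyone reusing this, here is one possible hardened variant of the loop, offered as a sketch rather than a replacement. It quotes every expansion so names with spaces survive, compares only the hash, and skips anything that is not a regular file. The function name copy_new and the demo directories are invented for the example, GNU md5sum is assumed, and the chown step is left out.

```shell
#!/bin/bash
# Hypothetical function wrapping the thread's final loop.
# Usage: copy_new SRC_DIR DST_DIR MD5_LIST
copy_new() {
  local src=$1 dst=$2 list=$3 f sum
  [[ -d $src && -d $dst ]] || { echo "missing directory" >&2; return 1; }
  touch -- "$list" || return 1       # make sure the list exists before grep reads it
  for f in "$src"/*; do
    [[ -f $f ]] || continue          # skip subdirectories and an unmatched glob
    sum=$(md5sum < "$f" | cut -d' ' -f1)
    if ! grep -qF -- "$sum" "$list"; then
      # Record the hash only if the copy succeeded.
      cp -- "$f" "$dst/" && printf '%s  %s\n' "$sum" "${f##*/}" >> "$list"
    fi
  done
}

# Demo: copy once, empty the destination, run again.
work=$(mktemp -d)
mkdir "$work/photos-backup" "$work/photos-new"
printf 'one\n' > "$work/photos-backup/my photo.jpg"

copy_new "$work/photos-backup" "$work/photos-new" "$work/photos-backup.md5"
ls "$work/photos-new"                # the photo was copied
rm -f "$work/photos-new"/*
copy_new "$work/photos-backup" "$work/photos-new" "$work/photos-backup.md5"
ls -A "$work/photos-new"             # prints nothing: the photo is not copied again
```

One more thing to consider: USER is normally set by the login environment, so a differently named variable for the command-line argument would avoid overwriting it.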