Alternative to cp command

Good Afternoon,

I'm backing up a folder from one NAS to another with a Unix script that uses cp. It's a lot of files and takes several days to complete. Most of the files don't change from week to week. Is there a command that would be quicker?

Also note, the backup needs to be ready to use in an instant - not in an archive or anything else that would need to be extracted first.

Hi,

Maybe you can take a look at this thread: https://www.unix.com/unix-for-dummies-questions-and-answers/128363-copy-files-parallel.html

Regards.

We're not going to improve "several days" into "instant" no matter the means. Further, any program which creates files largely does the same thing as cp.

Parallelizing is usually a non-starter. The bottleneck you're already hitting would only get worse.

Speeding it up, then, is a matter of improving or bypassing the connection to your NAS.

The holdup is very likely protocol latency multiplied by thousands of tiny files. If that's handled locally on your NAS, it will be much faster. You say "no archive", but that's still my answer. You don't have to wait for it to transfer, or even store it: after all, the whole point of a UNIX tarball is that you can extract it on the fly. You can transfer it over a network pipe of some sort and extract it while it's still being transferred.

Something like:

tar -C /path/to/localfolder -cf - . | ssh -T username@host 'cd /path/to/destination ; tar -tf -'

Change tar -tf to tar -xf once you've tested and seen that it does what you want.
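If the link itself turns out to be the bottleneck rather than per-file latency, compressing the stream in transit may help. A minimal variation on the above, assuming OpenSSH (its -C option enables transport compression):

tar -C /path/to/localfolder -cf - . | ssh -C -T username@host 'cd /path/to/destination ; tar -xf -'

Compression costs CPU on both ends, so test whether it actually helps on your hardware before relying on it.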

I think I get the question as you've described it quite clearly. Different people will have different solutions, but this is what I would do.

(Obviously, if the (original) copy takes several hours, users could be modifying files during that time, so you need some way to cope with that.)

Do the first copy using find piped to cpio and create a timestamp of the event:

# cd <source directory>
# date > timenow
# find . -depth -print | cpio -puvdm <destination directory>
# mv timenow timelastcopy

NOTE: The <destination directory> MUST already exist before the command is run otherwise it will fail, so create it manually if need be.
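For instance, you can create it (and any missing parent directories) with:

# mkdir -p <destination directory>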

After the first copy, select only files that have changed since the last copy by using the -newer switch on find:

# cd <source directory>
# date > timenow
# find . -newer timelastcopy -depth -print | cpio -puvdm <destination directory>
# rm timelastcopy
# mv timenow timelastcopy

Note that we create the timestamp (timenow) before we start to copy because users might modify files whilst the copy is executing.

This way files that have not changed since before the very start of the last copy will not be copied again. The incremental copies will therefore be much quicker than a full copy. If the job fails to complete then the timelastcopy will not get updated so these files will get selected again on the next run.
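If you drive this from a cron job or script, one way to make that behaviour explicit (a sketch, not necessarily how you would have to run it) is to rename the timestamp only when cpio exits successfully:

cd <source directory> || exit 1
date > timenow
if find . -newer timelastcopy -depth -print | cpio -puvdm <destination directory>
then
    mv timenow timelastcopy
fi

Since cpio is the last command in the pipeline, the if tests its exit status.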

Hope that helps and I hope I've explained it clearly enough. If not, post back your questions.

Considering you posted that most files remain unchanged, rsync will skip files that are identical on both sides, compared by timestamp and size or by checksum.

This will lower I/O on the disks and network, but increase CPU usage if checksums are used.
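As an illustration only (paths and host are placeholders; -a preserves permissions, ownership and timestamps, and the trailing slash on the source copies its contents rather than the directory itself):

rsync -av /path/to/source/ username@nas2:/path/to/destination/

Add --delete if files removed from the source should also disappear from the backup, and -c only if you want checksum comparison instead of the default timestamp-and-size check.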

What's a lot of files?

Regards
Peasant.

This is clearly NOT an alternative command, so it may not meet your needs.

What file system? ZFS, EXT4...?

Some file systems support snapshots, so you backup from the snapshot. If you create a snapshot at time T, then run your backup against the time T snap at T + 10 days, you still get what was there at time T. No corruption.

You can also clone a file system to a different name, filesysA -> filesysB, then backup filesysB at your leisure.
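For example, on ZFS (dataset names are placeholders):

# zfs snapshot pool/filesysA@monday
# zfs clone pool/filesysA@monday pool/filesysB

The snapshot is read-only and near-instant to create; the clone is writable and initially shares its blocks with the snapshot, so neither costs much space up front.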

You can get snapshot- and clone-capable filesystems for Linux and Solaris. I do not know about HP-UX or AIX.
