Best way to copy 4TB of data from one filesystem to another

User needs to copy a number of directories. Currently, they are using cp -pr which is taking way too long.

Wonder if it's worth tarring and compressing first?

Any recommendations for what would be quickest way?

Did you search the forums before posting?

For example:

The Fastest for copy huge data

The answer will always be 'It depends.'

  • What is the data like (how much can it compress)?
  • Are the file attributes set to compress them? This will cause delays in itself.
  • Is the target on the same server?
      • For a local copy, there could be disk contention.
      • What is your disk infrastructure like? Do you have multiple controllers, SAN disk (how many paths etc.) or simple disks with logical volume mirrors?
      • How much data are you copying?
      • How many files are there? A file creation takes several IO operations beyond the volume of data, so many small files can copy more slowly than a few bigger ones.
      • How much memory do you have? For a copy to the same server with lots of free memory, you might be able to cache the files to reduce IO contention with something like this:
        find /path/to/source -type f -exec cat {} \; > /dev/null
  • For a remote copy, what is your network like - and that could be endless...

What can you tell us about the server(s) and data? It will probably be trial and error.

If this is a regular process, consider rsync, which will only copy differences. It might still be slow if you have millions of small files, though. If there is lots of data change, perhaps you should consider snapshots, but then you haven't told us what OS you have, so I don't know if that's available to you in any useful way.
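As a starting point, a hedged sketch - the paths are placeholders, and the trailing slash on the source matters to rsync:

# -a preserves permissions, times and ownership; -H preserves hard links;
# --info=progress2 (rsync 3.1+) shows overall progress
rsync -aH --info=progress2 /path/to/source/ /path/to/target/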

Robin

First, a sanity check: 4 terabytes, divided by a good spinning-disk transfer rate of 100 megabytes per second, is roughly eleven hours at minimum. Easily 20 for more average disks. You'd need a fancy striped RAID array, or an implausibly large SSD, to beat that.
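For reference, the rough decimal arithmetic behind that estimate:

4 TB ≈ 4,000,000 MB; 4,000,000 MB ÷ 100 MB/s = 40,000 s ≈ 11 hours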

A great deal depends on your data of course, but knowing why trillions of teeny files are slower than fewer, larger files is not helpful in making the system access trillions of teeny files faster.

Generally speaking? cp does not have a "go faster" button, or else we'd be pushing it all the time anyway. cp is not slow, naive, or wasteful.

Do it on a block level if you can.

Use zfs send / receive over netcat (note that the data stream is not encrypted over the network).
It is really versatile: a poor man's enterprise-level filesystem replication, with features matching expensive storage arrays.
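A minimal sketch of that, assuming a source dataset tank/data, a destination pool pool2 on the receiving host, and port 9999 - all of these names are placeholders:

# on the receiving host (some netcat variants need "nc -l -p 9999" instead):
nc -l 9999 | zfs receive pool2/data

# on the sending host:
zfs snapshot tank/data@xfer1
zfs send tank/data@xfer1 | nc receiving-host 9999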

For local stuff, mirror the pool and use zpool split.
If you are using SVM, mirror, remove from mirror, use the copy.
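For the SVM variant, a rough sketch with hypothetical metadevice names (d10 being the existing mirror, d12 a new submirror on the target disk):

metattach d10 d12      # attach the new submirror; SVM resyncs it in the background
metastat d10           # wait until the resync is finished and the submirror shows Okay
metadetach d10 d12     # detach it - d12 now holds an independent copy of the data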

Using cp/rsync on a large number of files is bound to take longer on the same hardware.

Alternatively, you can use dd as well if you want to copy entire disks, but it's a crude one-time operation and requires long downtime on the source.
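If you do go that route, something like this - the device names are placeholders, so double-check them, keep both filesystems unmounted during the copy, and note that status=progress needs a reasonably recent GNU dd:

dd if=/dev/sdX of=/dev/sdY bs=64M status=progress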


Personally I prefer rsync to cp. But for 4TB, it probably doesn't make enough of a difference. Are both volumes on the same server? Are you using a SAN? You might find that SAN snapshot and replication is the best way to move the data. But that depends on your storage system. You might want to provide more information on the type of disk storage that you use and check with your SAN vendor, if you use a SAN, to see what they think.

This sounds like a good idea, worth trying. How about this to "blockify" the I/O:

cd /source ; tar cf - * | (cd /target ; tar xf - )

I hope this helps.

bakunin


:rolleyes:

This might be better, as it won't run into globbing issues if there are too many files, and it will copy hidden files/directories too:

cd /source ; tar cf - . | (cd /target ; tar xf - )

But it probably won't be any faster than cp anyway.

The real solution: buy faster disks.

True, but my point was rather to suggest the general idea than to provide a watertight solution. I could have written it this way: by creating an I/O stream (via tar) from the various files and using the restore part of tar instead of the single-threaded file I/O of cp, it might be - probably depending on the exact implementation of tar - a bit faster than using cp as well.

Yes, I could be mistaken, and I don't have any prior experience with copying this much data (the only times I had to move so much data I did it with SAN methods, foregoing the OS completely). It is just - IMHO - worth a try.

At any rate: you surely are correct that using different storage technologies - faster disks, tiered storage, etc., in other words changing the underlying physics - will have more effect than anything even the most clever OS trick can hope to achieve.

bakunin

Hi.

It made a difference in this informal test for a 2 GB file: Simple speed comparison between cp, mv, and rsync | RothWerx

That, however, was not zfs (probably).

Best wishes ... cheers, drl

The OP has not specified the filesystem / mirroring techniques / external disks he is using.
This information is essential, as is a more detailed description of the actual objective.

Most filesystems have atime on by default, which causes the system to update the atime property for each file accessed.
That will slow things down by causing more I/O.
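If that turns out to matter, access-time updates can usually be switched off for the duration of the copy; the exact command depends on OS and filesystem, and the mount point / dataset names below are placeholders:

mount -o remount,noatime /path/to/source     # typical Linux remount
zfs set atime=off tank/data                  # ZFS equivalent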

For SVM, it is just faster to use the mirror / break-mirror approach for the initial copy, followed by an incremental rsync if you need to traverse the data.
rsync is a great and versatile tool with many options, of which I especially like --link-dest when using hard links.
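A small sketch of that pattern with placeholder paths - files that are unchanged relative to the previous copy become hard links into it instead of being copied again:

rsync -aH --link-dest=/backups/previous /path/to/source/ /backups/current/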

For ZFS locally, I would attach a new disk to the current setup if possible (making a three-way mirror, for instance, out of a 1+1).
Why mirror? Well, ZFS will copy only the used data and, in turn, check your data for corruption in the process. Wonderful, isn't it :slight_smile:
After the mirror completes, you do a zpool split and the data is available for mounting on any other system (over FC, iSCSI, ...).
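A rough sketch of that, assuming a pool named tank currently mirrored on c0t0d0 and c0t1d0, with c0t2d0 as the disk to split off - all pool and device names are placeholders:

zpool attach tank c0t0d0 c0t2d0     # grow the mirror to three ways; ZFS resilvers only used blocks
zpool status tank                   # wait for the resilver to finish
zpool split tank tankcopy c0t2d0    # split the third disk off as a new pool named tankcopy
zpool import tankcopy               # import the copy here or, after moving the disk, on another host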