Speed problems with tar'ing a 500Gb directory on an eSATA drive

I'm trying to compress a directory structure on an external hard drive, connected by eSATA cable to my linux (Ubuntu 10.04) desktop. The total volume is 500Gb with half a million files, ranging from Kb to Mb in size. The drive is 2Tb, with 0.8Tb free space before compression.

running "tar -pcf directory.tar directory" worked for a previous, entirely analogous, 400Gb set of data in about 10 hours.
This time, the command has been running for 7 days, and the tar file is now only growing at 2 Gb/hour - estimated another 50+ days for completion.

I've run it twice now (the cable fell out the first time after two days) and the lack of results is reproducible. Deleting some of the other data from the external drive made no difference.

I'm about to try installing a large RAID0 array in the linux desktop (the current drive is almost full), doing a straight cp of the directory over to it, and repeating the tar locally.
But if anyone has any ideas why this process might be so painfully slow it would be appreciated!

Thanks.
Simon

I don't think that tar or cp are the right commands.

To make a straight copy to another mounted filesystem and preserve permissions:

cd /filesystem_to_copy
find . -xdev -print | cpio -pdumv /new_filesystem

Ps. I have never used tar to back up anything. It is sometimes useful for moving files to alien systems.

There is meaning to the -p switch to tar in this context.

Not being your preferred commands isn't what's making them slow, though. I sincerely doubt cpio is going to break the speed barrier here.

What bus speeds would you expect from your disks, omnisppot? Could you be having southbridge issues -- perhaps the bus is saturated?
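For what it's worth, a quick way to get a rough idea of what the drive itself can deliver, independent of tar, is something like the two commands below (the device name /dev/sdX is a placeholder for whatever the eSATA drive shows up as; iostat comes from the sysstat package):

sudo hdparm -t /dev/sdX
iostat -dx 5

The first gives a raw sequential read figure for the drive; the second reports per-device throughput and utilisation every 5 seconds while the tar is running, which should make it obvious whether the external drive or the bus is the bottleneck.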

I agree with Corona688 - probably a hardware problem.
However, I have seen a modern tar (i.e. one which can deal with files larger than 2Gb) crawl when it demands more memory.
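If you want to rule memory out, it should be enough to watch the tar process and swap activity while it runs; a rough sketch, assuming the usual procps tools are available:

top -p $(pgrep -ox tar)
vmstat 5

If tar's resident size stays small and the si/so (swap) columns in vmstat stay at zero, memory isn't the problem.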

[quote="methyl,post:2,topic:308151"]
I don't think that tar or cp are the right commands.

To make a straight copy to another mounted filesystem and preserve permissions:

cd /filesystem_to_copy
find . -xdev -print | cpio -pdumv /new_filesystem

Ps. I have never used tar to back up anything. It is sometimes useful for moving files to alien systems.
[/quote]

Thanks for the input, but the goal is to move the 500Gb of data from the external drive to an offsite compute cluster. I believe the only way I can do this is ftp, and ftp only supports moving single files, not directories. GUIs like Filezilla don't work, as they prompt for a new password every time the token-generated one expires.
I don't think it's possible to mount the external hard drive from a cluster that's behind a firewall - I can only connect to the cluster, not from it :(

---------- Post updated at 05:25 AM ---------- Previous update was at 05:20 AM ----------

Sorry, I don't know how to answer that question precisely! I do know that (if I had enough internal hard drive space) I could "cp -r" all the data down the SATA cable in a few hours without any issues, so it certainly seems that I/O on the external drive is the bottleneck with tar. Hopefully the problem is at the external disk end and not the mobo bus end, and hopefully (I'll find out next week) doing it on a 4-drive RAID0 will overcome it!

---------- Post updated at 05:28 AM ---------- Previous update was at 05:25 AM ----------

My linux box has 16Gb of RAM, but while doing this, system memory usage didn't exceed 3Gb (including the O/S and everything else).

Can you use walknet? i.e. take the external disc drive to the target computer.

Would you consider:-

# cd source_directory
# tar -cvf - . | rsh target_server "cd target_directory ; tar -xvf -"

I'm assuming it's rsh not remsh for your OS.

If the server is remote or the network is the bottleneck, you could consider:-

# cd source_directory
# tar -cvf - . | compress | rsh target_server "cd target_directory ; uncompress | tar -xvf -"

Of course, this latter option costs CPU and works best on multi-processor servers, so that the tar and the compress are not competing.
I've shovelled 200Gb between remote sites over a 2M link in a weekend with something like the above, although the syntax will need to be checked. I must have got pretty good compression, I suppose. I can't really test it at the moment.
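As an aside: if rsh isn't available but ssh is, the same pipeline should work with ssh in its place, and gzip -1 can stand in for compress to keep the CPU cost down. A rough, untested sketch along the same lines:

# cd source_directory
# tar -cvf - . | gzip -1 | ssh target_server "cd target_directory ; gunzip | tar -xvf -"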

You will need to ensure that the local server can remote-shell to the target. An entry in /.rhosts should suffice; if this seems a good plan but you can't get the remote shell working, let us know.

I hope that this helps
Robin
Liverpool/Blackburn
UK

You mean like...
(that xkcd comic that I can't link to as I haven't made 5 posts yet)

Actually, that's how I got the data onto the HD in the first place... taking it on a train and leaving it with someone else over a weekend. I'll discuss cutting out the middleman with them in future, but I suspect there will be issues with access control on our respective systems (i.e. I might have to get the train there to log in for him, or vice versa).

Thanks Rbatte1!

One slight modification:

cd source_directory
tar -cvf - . | rsh target_gateway "ssh -X target_server; cd target_directory ; tar -xvf -"

seems to be working :)
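In case that gateway hop ever gives trouble (the tar stream has to make it through both hops intact), one variant worth keeping in mind is to run the whole remote side as a single command through both hosts, using ssh for each hop. A sketch only, with the same placeholder names:

cd source_directory
tar -cvf - . | ssh target_gateway "ssh target_server 'cd target_directory ; tar -xvf -'"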

I would never have guessed that a dot and a hyphen would descend through all the subdirectories. How do you learn this type of syntax?

The hyphen has nothing to do with that. Neither does the dot, really. tar is simply recursive by design. Give it one folder and it'll archive all the contents. I don't think you can even turn that off.

There literally is a folder named . no matter where you are. It's just shorthand for "the current folder" and will work with any program that uses folders. Try ls . for yourself. There's also .., which means "one folder up from the current folder". These folder shortcuts are an extremely old feature and are found nearly everywhere.
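A quick way to see it, from any directory:

ls -a .
ls ..

The first lists everything in the current folder, including the . and .. entries themselves; the second lists the parent folder.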

The hyphen tells tar to write to standard output instead of to an actual file. It'd spew it straight to the terminal if you didn't redirect it. But since you put a pipe after it, it puts it straight into rsh.

On the other end, where you extract with tar, the same - is taken to mean "read from standard input". And what is standard input? If you didn't pipe anything into it, it'd be reading straight from your terminal, but because of the pipe, it's reading from the program before -- tar.

So in this manner you create a tarball, feed it over the network, and extract it on the other end.
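You can watch the whole mechanism work without any network involved; a small local sketch:

cd source_directory
tar -cf - . | tar -tvf -

The first tar writes the archive to standard output, the pipe carries it across, and the second tar reads it from standard input and just lists (-t) what it finds instead of extracting it.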

If rsync is available I would use it. I have in the past had a tar | rsh pipeline fail when dealing with several gigabytes' worth of data, but I've never had problems with rsync.
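For reference, a minimal rsync invocation for this sort of copy might look like the line below (user, host and paths are placeholders; -a preserves permissions and timestamps, --partial keeps partially-transferred files so an interrupted run can pick up roughly where it left off):

rsync -a --partial --progress source_directory/ user@target_server:/target_directory/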