Not sure if this is really in the right forum but here goes....
Looking for a way to extract individual files from a compressed tarball WITHOUT running tar -zxvf on the whole thing and then recompressing. Basically we need to be able to chunk out an individual file while the archive stays compressed.
For whatever reason, when we extract the single file and then recompress, it's destroying the file integrity. The files are so large that we can't just decompress the whole archive and then pick out pieces, so that's out.
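For what it's worth, here is a minimal sketch (all file and path names are hypothetical) of pulling a single member straight out of a .tar.gz. Note that tar still decompresses the stream internally to reach the member, but the archive on disk is never rewritten, so there is no re-tar/recompress step where integrity could be lost:

```shell
# Demo setup (hypothetical names): build a small compressed tarball.
mkdir -p demo
printf 'payload one\n' > demo/a.dat
printf 'payload two\n' > demo/b.dat
tar -czf big.tar.gz demo/a.dat demo/b.dat
rm -r demo

# List members to find the exact stored path:
tar -tzf big.tar.gz

# Extract a single member; the archive itself is untouched:
tar -xzf big.tar.gz demo/a.dat
cat demo/a.dat          # prints: payload one
```

This assumes a tar with gzip support (GNU or BSD tar); the member name must match the stored path exactly as shown by `tar -t`.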
We had looked at going in, carving out the bytes for one member, and writing them elsewhere, but from what I understand about how tar and gzip work, we'd get garbage: gzip compresses the tar stream as a whole rather than the individual members. Decompressing any piece would rely on the Huffman tables and back-references (the "rosetta stone") built up from everything earlier in the stream, right? Without that "rosetta stone", we'd not be able to accurately decompress the individual chunks, and a reverse algorithm wouldn't work either, because the original was encoded off patterns present in the whole file that may not be present in the individual piece.
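A quick way to confirm this intuition (names hypothetical): carve a slice of bytes out of the middle of a .tar.gz and try to decompress it on its own. It fails, because the slice has neither the gzip header nor the stream state that precedes it:

```shell
# Build a tiny compressed tarball.
printf 'some repetitive data data data\n' > member.txt
tar -czf whole.tar.gz member.txt

# Carve off everything from byte 100 onward and try to decompress it:
tail -c +100 whole.tar.gz > slice.bin
if gzip -dc slice.bin > /dev/null 2>&1; then
  echo "slice decompressed (unexpected)"
else
  echo "slice is garbage without the preceding stream state"
fi
```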
I'm a bit of a n00b, so I just need to check and make sure I've absorbed this all correctly. But, in case I've processed it all incorrectly and if there is a way or a script that can accomplish this, please point me in the right direction. Thanks.
Unfortunately, I don't get to determine the format. It's not a compressed tarball I made. It is ready-made and I have to make lemonade with it. Otherwise, I'd probably set something else up if it were up to me.
Ugh. Yeah, I was afraid that was the answer. I guess there's not much I can do on my end at this point but use process of elimination to figure out whether the corruption is creeping in on my side or in the others' work.
Instead of using the compressed tar file, uncompress and untar the entire archive once, then compress the individual files, then tar the individual compressed files. That would let you extract a file and uncompress only that file. It will also probably lower the risk of losing everything past a damaged spot in one large compressed stream. In fact, keeping a directory of the compressed individual files would allow "random access", because they would be available by filename.
The compression savings would probably differ from the original. Experimentation with a subset should allow you to estimate the difference.
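The repack described above might look something like this. It's a sketch with made-up names, and it assumes the archive holds a flat set of files (with subdirectories you'd want `gzip -r` or a `find`-based loop instead):

```shell
# Demo setup (hypothetical data): a compressed tarball of flat files.
mkdir -p src
printf 'alpha\n' > src/a.dat
printf 'beta\n'  > src/b.dat
tar -czf big.tar.gz -C src .
rm -r src

# One-time repack: a plain tar of individually gzipped files.
mkdir work
tar -xzf big.tar.gz -C work        # the single full extraction
gzip -9 work/*.dat                 # compress each file on its own
tar -cf big-indexed.tar -C work .  # uncompressed tar of .gz members
rm -r work

# Later: pull and decompress just one member, leaving the rest alone.
tar -xf big-indexed.tar ./a.dat.gz
gunzip a.dat.gz
cat a.dat                          # prints: alpha
```

Since the outer tar is uncompressed, tar can seek straight to the named member, and damage to one .gz member can't poison the others.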
What Operating System and version are you using?
How big is the largest archive (before and after compression)?
How big is the largest file (before and after compression)?
As others suggest, compressing an archive is foolish here, because you cannot extract an individual file without decompressing the stream at least up to that file.
Is fitting a lot more disc an option? In general there is little reason nowadays to compress files (disc space is cheap) unless you need to copy them across a network.