Extracting files from a .tar.gz file

Hi,

I receive a huge .tar.gz file that may reach gigabytes in size, and I try as much as I can to avoid disk space issues.

Is there a way to extract from such a big file without needing to decompress it first? I know the format of the filenames I need to extract from the tar file: they start with 'D' and end with 'T', e.g. D1fdS34T.

My big issue is the need to decompress first, as it eats up disk space!

Thanks in advance,
Eman El Badawy

You don't need to store the entire decompressed file on disk.

gunzip -c < file.tar.gz | tar -xf - filename1 filename2 filename3

Or on some systems, just

tar -zxf file.tar.gz filename1 filename2 filename3

As far as I know you can't extract all files beginning with a certain string though.

It has to scan the whole file to find the ones you want.

True, but you can make one pass to get the names of the files in the tar archive and save the names of the files you want using grep (or some other tool) and then extract the files you want on a second pass. The 1st pass would be something like:

tar -ztf file.tar.gz | grep -E '(^|/)D.*T(/|$)'

Again, this does not save the entire uncompressed archive to disk. And, of course, if you try this and find that it gives you the list of files you want, you could try:

tar -zxf file.tar.gz $(tar -ztf file.tar.gz | grep -E '(^|/)D.*T(/|$)')

as long as the list of filenames to be processed doesn't cause that command line to overflow your system's ARG_MAX limit.

It looks to me like your grep regular expression is underspecified, because D.*T is allowed to span pathname components. [^/] seems a better choice.
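
To see the difference, here is a small sketch that runs both expressions over a made-up list of member names (the names are hypothetical, chosen only to exercise the slash-crossing case):

```shell
# Hypothetical sample of archive member names, one per line.
names='D1fdS34T
dir/DabcT
D/T
Dir/fileT'

# Original pattern: .* may cross a slash, so all four names match.
loose=$(printf '%s\n' "$names" | grep -E '(^|/)D.*T(/|$)')

# Stricter pattern: [^/]* keeps the match inside one pathname component,
# so only the first two names match.
strict=$(printf '%s\n' "$names" | grep -E '(^|/)D[^/]*T(/|$)')

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```

Note that the trailing (/|$) still lets a directory named D...T match; drop the / alternative if only regular files are wanted.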

Regarding avoiding ARG_MAX, one filename per line can be read from a file, using GNU tar's -T or BSD tar's -I. Perhaps other implementations offer similar functionality.
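
A minimal sketch of the two-pass approach with a name list, assuming a tar that accepts both -z and -T (GNU tar and libarchive's bsdtar do; other implementations may not). The function name extract_dt is made up for illustration, and member names containing newlines or characters that tar treats specially in a list file are not handled:

```shell
# Hypothetical helper: extract members whose last pathname component
# starts with D and ends with T, without hitting ARG_MAX.
extract_dt() {
    archive=$1
    list=$(mktemp) || return 1

    # Pass 1: list member names and keep only the wanted ones.
    tar -ztf "$archive" | grep -E '(^|/)D[^/]*T$' > "$list"

    # Pass 2: extract only the listed names (skipped if nothing matched).
    status=0
    if [ -s "$list" ]; then
        tar -zxf "$archive" -T "$list" || status=$?
    fi

    rm -f "$list"
    return "$status"
}
```

Usage would be something like: extract_dt file.tar.gz. Only the short list of names is written to disk; the decompressed archive itself is streamed both times.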

To the OP:
When asking for help with tar, always specify the implementation you're using (or at least the operating system). Aside from core functionality, tar isn't well standardized.

If you're using GNU tar (untested):

tar xzf file.tar.gz --wildcards --no-anchored 'D*T'
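
Since that command is untested, one cautious way to try it is to list with the same options before extracting. This is only a sketch and assumes GNU tar; preview_then_extract is a made-up name:

```shell
# Hypothetical helper; assumes GNU tar (--wildcards and --no-anchored
# are GNU options and may not exist in other implementations).
preview_then_extract() {
    archive=$1
    pattern=$2

    # -t lists the members the pattern selects without writing them to disk.
    tar -tzf "$archive" --wildcards --no-anchored "$pattern" || return 1

    # Same selection, this time extracting into the current directory.
    tar -xzf "$archive" --wildcards --no-anchored "$pattern"
}
```

For example: preview_then_extract file.tar.gz 'D*T'. If the listing shows unwanted members, stop there and adjust the pattern.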

Regards,
Alister

What about:

pax -rzf ./bla.tar.gz 'D*T' 

The pattern, D*T , matches against the entire pathname, not just the basename. Also, / need not be matched explicitly, so the pattern can span multiple components. Whether this is a dealbreaker depends on information we do not have. It wouldn't be a problem if the archive is guaranteed to always be a simple, flat list of files.

Regards,
Alister

Ok, good point. So depending on the content we would need:

pax -rzf ./bla.tar.gz 'D*T'

or

pax -rzf ./bla.tar.gz '*/D*T'

No. */D*T doesn't work either; for example, it matches /D/T .

This problem cannot be solved with sh pattern matching notation, because its grammar is unable to express the concept of a string of arbitrary length where none of the characters is a forward slash (in POSIX basic regular expression grammar, [^/]* ).

If we knew the length of the basename, then matching could be accomplished with a pair of tedious patterns. Assuming a length of 4: */D[!/][!/]T and D[!/][!/]T .
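
These patterns can be checked without an archive, because the shell's case statement uses the same pattern matching notation and, like pax, lets * match a slash. A small sketch (the helper name matches is made up):

```shell
# matches PATTERN STRING: succeed if STRING matches the sh pattern.
matches() {
    # $1 is left unquoted inside case so that it is treated as a pattern.
    case $2 in
        $1) return 0 ;;
        *)  return 1 ;;
    esac
}

# * happily matches a slash, so the loose pattern accepts /D/T.
matches '*/D*T' '/D/T' && echo 'loose pattern matches /D/T'

# The fixed-length pair rejects it but accepts a 4-character basename.
matches '*/D[!/][!/]T' '/D/T' || echo 'fixed-length pattern rejects /D/T'
matches '*/D[!/][!/]T' 'dir/DabT' && echo 'fixed-length pattern matches dir/DabT'
matches 'D[!/][!/]T' 'DabT' && echo 'fixed-length pattern matches DabT'
```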

In my opinion, a case can be made for a command option extension which promotes the pattern operand to a POSIX BRE. While it would complicate the implementation and impede portability, it would provide useful, missing functionality -- unlike the indefensible -j/-z extensions which cannot offset those disadvantages since they provide absolutely no functionality not already achievable with sh pipeline composition.

Regards,
Alister