I used the gzip command to compress a huge tar file, but I saw that the compression % was more than 100%.
It might have inflated instead, probably because the tar file was already packed tightly.
So I thought of unzipping it. Since the reported compression was over 100%, I expected the uncompressed tar to come out smaller than the .tar.gz file. But to my surprise it was bigger. Does that mean gzip actually reduced the size even though it showed a compression % of more than 100%?
Here are the statistics:
After gzip:
a.tar.gz - 20,915,558,979
After gunzip:
a.tar - 22,213,027,840
Compression % = 175.7
(Sorry, I forgot to check the size of the original tar, I mean the tar before I gzipped it.)
There is of course some overhead, but very little since, as you noted, gzip is smart enough to fall back to storing data uncompressed when faced with a file that compresses badly. It's much better than some older compressors which, in the worst case, could double the size of a file.
I have no idea where the 175% comes from; it makes no apparent sense either way you consider it.
Compression is always performed, even if the compressed file is slightly larger than the original. The worst case expansion is a few bytes for the gzip file header, plus 5 bytes every 32K block, or an expansion ratio of 0.015% for large files. Note that the actual number of used disk blocks almost never increases. gzip preserves the mode, ownership and timestamps of files when compressing or decompressing.
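For what it's worth, you can watch that worst case happen by feeding gzip incompressible data (a sketch, assuming GNU tools; /dev/urandom supplies random bytes, and the exact sizes will differ slightly from run to run):

$ head -c 1000000 /dev/urandom > random.bin   # 1 MB of incompressible data
$ gzip -c random.bin | wc -c                  # slightly MORE than 1000000: header plus ~5 bytes per 32K block

The growth is a few hundred bytes, nowhere near 75% expansion.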
What you show does not indicate 175% of anything, and the documentation quoted above says 'no way' to expansion on that scale.
jim> bc -l
20915558979 / 22213027840
.94158973417106202123
So, I think:
Your problem is that whatever you used to do the % calculation overflowed 32-bit integers and gave garbage results. Those are 20 GB files, far past the ~2.1 GB limit of a signed 32-bit counter.
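If you want to see how a 32-bit counter mangles sizes like these, bash's 64-bit arithmetic lets you simulate the truncation with a mask (a sketch of the overflow theory only; I am not claiming this is how the 175.7 was actually produced):

$ echo $(( 22213027840 & 0xFFFFFFFF ))   # the tar size as a 32-bit counter sees it
738191360
$ echo $(( 20915558979 & 0xFFFFFFFF ))   # the .tar.gz size, likewise truncated
3735689795

Any percentage computed from numbers like those is garbage.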
In fact, you appear to have had about 6% compression.
That 6% was probably because there were a lot of executables/binary data files in the tar file. Those do not compress as well as text; "normal" compression on text is on the order of 70%.
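As a rough demonstration of the difference (a sketch; /usr/share/dict/words is just a convenient large text file and may live elsewhere on your system, or substitute any big log file):

$ wc -c < /usr/share/dict/words           # size of the plain text
$ gzip -c /usr/share/dict/words | wc -c   # typically around 30% of the original, i.e. ~70% compression
$ gzip -c /bin/ls | wc -c                 # an executable typically shrinks much less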
You may have a 32-bit build of gzip; the file command will show you whether the binary is 32-bit or 64-bit. I would guess 32-bit in your case.
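On a 32-bit build you would see something like this (hypothetical output; the fields after 'ELF' vary by platform, the part to look at is '32-bit' vs '64-bit'):

$ file $(which gzip)
/usr/bin/gzip: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked, stripped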
Not really... It's not supposed to know or care how long the file is, it just reads and writes and churns until the OS says 'ok, all done'. If a stream compressor can't handle input of arbitrary length, that's a bug.
So the code that broke down here technically had nothing to do with the compression. Fortunately.
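A quick way to convince yourself that gzip really is a pure stream filter (a sketch; /dev/zero just supplies an arbitrarily long input whose length gzip cannot know in advance):

$ head -c 100000000 /dev/zero | gzip | wc -c   # ~100 MB streamed in; only a tiny compressed count comes out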