Very weird wget/curl output - what should I do?

Hi,
I'm trying to write a script to download RedHat's errata digest.
It comes in .txt.gz format, and I can get it easily with Firefox.

HOWEVER: the output is VERY strange when downloading it in a script. It seems I'm getting a file of the same size - but partly text and partly binary! It contains the first message in the digest, and then garbled data of what I can only assume is the rest of the .gz file.
Here is the basic request (I removed the http prefix because I'm not allowed to post links in the forum):
[mod]When posting a command line, use CODE tags (http://www.unix.com/misc.php?do=bbcode#code), which allow you to post URLs as they aren't parsed.[/mod]
wget http://www.redhat.com/archives/enterprise-watch-list/2011-July.txt.gz

I think this is an attempt by Red Hat to block people who try to retrieve the errata by script... so I tried messing with the user-agent string. No luck; the output is the same. Here is an example of what I tried:

wget -U "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3" http://www.redhat.com/archives/enterprise-watch-list/2011-July.txt.gz

curl also gives incorrect output - only the text of the first message. It probably tosses out the garbled binary data.

curl --silent http://www.redhat.com/archives/enterprise-watch-list/2011-July.txt.gz
curl -A "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)" http://www.redhat.com/archives/enterprise-watch-list/2011-July.txt.gz

This is really annoying. Again, Firefox gets it OK as a .gz file. What should I do?

Thanks in advance....

I can confirm that this is happening at least, but remain as mystified as you. I also tried --referer in wget, to no avail.

I don't believe this is intentional. If they wanted to deny you the file, they'd just deny you the file, not find creative ways to botch its contents.

I too was able to replicate the observed behavior. Corona688 is correct in that it's not an attempt to deny access to the file. It's either Apache or wget being stupid. I cannot confirm which at the moment, since the wget header dump only included the server side of the conversation (@#$@@#?).

In any case, this is what's happening.

When Firefox requests the file, it indicates that it accepts gzip encoding. When wget or curl ask for it, they do not indicate this. In a bizarre attempt to be helpful, instead of sending you the compressed text file, or redirecting, or refusing to comply, the webserver sends you plain text.
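If you want to see that for yourself, something along these lines (just a sketch; the verbose output format varies a bit between curl versions) prints the request headers curl sends, with and without the encoding hint - they're the lines prefixed with '>':

curl -v -o /dev/null http://www.redhat.com/archives/enterprise-watch-list/2011-July.txt.gz 2>&1 | grep '^>'
curl -v -H 'Accept-Encoding: gzip' -o /dev/null http://www.redhat.com/archives/enterprise-watch-list/2011-July.txt.gz 2>&1 | grep '^>'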

That in itself seems foolish: depending on the client's headers, downloading the same .gz URL may or may not give you an actual gzip'd file. Meanwhile, the Content-Type header always indicates "application/x-gzip".

(We're just getting warmed up.)

The server response, in the Content-Length header, indicates that the data (you know, the gzip'd text which is actually gunzip'd text) that it's sending you is 13258 bytes long. In its infinite wisdom, their Apache decides to close the connection one byte short of the advertised size.
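If you want to check those headers on your end, wget can dump the server's response (the exact sizes will of course differ for other months' digests):

wget --server-response -O /dev/null http://www.redhat.com/archives/enterprise-watch-list/2011-July.txt.gz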

(Just when you think things couldn't get more messed up ...)

When wget reconnects to finish the transfer, their webserver begins sending at the byte offset requested, but in the original, gzip compressed data file ... and continues to send until the end of that compressed data. This is why you end up with a file of the same size as the real .gz, beginning with text and followed by "garbled data".
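You can reproduce that resumed request by hand with a ranged download; the 13257 offset below is just the one observed for this particular file:

curl -r 13257- -o tail-of-gz.bin http://www.redhat.com/archives/enterprise-watch-list/2011-July.txt.gz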

Using dd to skip the first 13257 bytes in the mangled file, I used cmp to compare the remaining bytes with their counterparts in the file downloaded from Firefox. They were identical.
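Roughly along these lines (the filenames are just placeholders for the wget download and the Firefox download; the offset is as above):

dd if=wget-download.txt.gz bs=1 skip=13257 of=wget-tail.bin
dd if=firefox-download.txt.gz bs=1 skip=13257 of=firefox-tail.bin
cmp wget-tail.bin firefox-tail.bin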

So, in the end, the transfer received is not the 13258 bytes advertised by the first server response, but the 86777-byte size of the gzip'd file, with the first 13257 bytes as uncompressed text and the remainder as gzip'd data.

Long story short: Tell Apache that you can handle gzip'd data. Using curl, the following option works around the problem:

-H 'Accept-Encoding: gzip'
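For example (saving under the original name; the server should then send the .gz file as-is):

curl -H 'Accept-Encoding: gzip' -o 2011-July.txt.gz http://www.redhat.com/archives/enterprise-watch-list/2011-July.txt.gz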

Regards,
Alister

---------- Post updated at 12:13 PM ---------- Previous update was at 11:52 AM ----------

Nah. curl is simply not retrying after the webserver closes the connection. Both curl and wget are sent plain text before the connection closes. Only wget reconnects and begins receiving gzip'd data.
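If you drop --silent (or add -S so errors still print), curl should at least complain that the connection closed early - on the versions I've used that's exit code 18, "transfer closed with outstanding read data remaining", though the exact wording may vary:

curl -sS -o 2011-July.txt.gz http://www.redhat.com/archives/enterprise-watch-list/2011-July.txt.gz
echo $?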

Regards and welcome to the forum,
Alister


Could this be server-side compression gone wrong? Many webservers support sending text as zipped data, but to do the reverse operation is just weird. It'd make sense for character encodings but not for a file on disk. You don't have to say you accept binary/unknown to download binary/unknown...

---------- Post updated at 10:22 AM ---------- Previous update was at 10:19 AM ----------

--header 'Accept-Encoding: gzip' works for wget too.
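In full, something like:

wget --header='Accept-Encoding: gzip' http://www.redhat.com/archives/enterprise-watch-list/2011-July.txt.gz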

And even that seemingly straightforward behavior can be a pain in the derrière: daniel.haxx.se HTTP transfer compression

Regards,
Alister

Hi,

This does indeed work. Your help is much appreciated... I was really stuck with this.
Keep up the good work!