Help with WGET and renaming downloaded files :(

Hi everybody, I would greatly appreciate some expertise in this matter. I am trying to find an efficient way to batch download files from a website and rename each file with the URL it originated from (from the CLI). (i.e. instead of xyz.zip, the output file would be http://www.abc.com/xyz.zip) A method using WGET is preferable but not absolutely necessary. I'm just starting to get comfortable with the command line, in part because of some of the great help I've gotten here before, so once again I'm back looking for some help from this excellent community :)

Thanks

Did you try the -O option in wget?

  -O   --output-document=FILE   write documents to FILE.

Thanks for your response :) Will -O replace all the downloaded files' filenames with their corresponding original URL addresses?

You can customize the output file name with the -O option, or assign a variable to it.
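As a quick sketch (the URL and file names here are only placeholders borrowed from the first post):

# name the downloaded file explicitly with -O
wget -O xyz.zip "http://www.abc.com/xyz.zip"

# or derive the name from a variable
url="http://www.abc.com/xyz.zip"
wget -O "$(basename "$url")" "$url"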

Okay, so far so good. Thanks for the tip. All I need to do now is figure out how to automatically replace the filenames with their respective URL addresses; is this somehow possible using grep or sed with the

wget -a logfile

switch enabled?

You cannot name a file "http://www.abc.com/xyz.zip". Forward slashes are one of two illegal characters in unix filenames (null byte being the other). The best you can do without some kind of translation is to mirror the hierarchy, with each slash-delimited component in the url, except the last, being a directory. Even so, the "//" cannot be handled without some special treatment as a "//" in a pathname is treated identically to "/".
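If a translated name is acceptable, one simple approach is to swap the slashes for some other character before saving. This is only a sketch of that idea, not something wget does for you:

url="http://www.abc.com/xyz.zip"
# turn the URL into a legal filename by replacing every "/" with "_"
safe_name=$(printf '%s' "$url" | tr '/' '_')
wget -O "$safe_name" "$url"
# result: a file named http:__www.abc.com_xyz.zip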

Thanks for your feedback, alister :) Maybe you can answer this one for me: how would I go about printing the URL of a particular photo onto that photo (i.e. a watermark) as soon as it's downloaded? I have already had limited success with the "convert" command using predefined text; however, I still haven't figured out the auto-URL-watermark capability that I'm after. Thanks again!

It would help if you shared the code that you're using, along with a description of how it fails and the desired result (which I assume is to have the URL watermarked on an image). Don't assume that we are familiar with the tools you are using. However, even without specific knowledge of the tools involved, if there is a shortcoming in your shell script, we may be able to assist.

Regards,
Alister

It always helps to read the manpage.

$ man wget

WGET(1)                            GNU Wget                            WGET(1)



NAME
       Wget - The non-interactive network downloader.

SYNOPSIS
       wget [option]... [URL]...

DESCRIPTION
       GNU Wget is a free utility for non-interactive download of files from
       the Web.  It supports HTTP, HTTPS, and FTP protocols, as well as
       retrieval through HTTP proxies.

...
       --force-directories
           The opposite of -nd---create a hierarchy of directories, even if
           one would not have been created otherwise.  E.g. wget -x
           http://fly.srk.fer.hr/robots.txt will save the downloaded file to
           fly.srk.fer.hr/robots.txt.
...
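In other words, with -x (--force-directories) the URL path becomes the directory layout on disk. A rough sketch, using the URL from the man page excerpt (urls.txt is just a placeholder name for whatever file holds your URL list):

# -x keeps the URL hierarchy: host/path/file
wget -x "http://fly.srk.fer.hr/robots.txt"
# saves to ./fly.srk.fer.hr/robots.txt

# the same works for a whole list of URLs
wget -x -i urls.txt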

If you want to watermark an image every time wget fetches it, you have to make a separate wget call per URL in a loop in a shell script. Then, every time wget successfully downloads an image, the script can call another tool to add the watermark to it. The remaining problem is how you will save the images to your local directories.
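Something along these lines, as a rough skeleton (urls.txt is just a placeholder for wherever your URL list lives, and the watermarking step is left to whatever tool you choose):

# one wget call per URL, so each downloaded file can be paired with its source
while read -r url; do
    wget "$url" || continue          # fetch one image at a time
    img=$(basename "$url")           # the file wget just saved
    # ...call your watermarking tool on "$img" here, e.g. convert/mogrify...
done < urls.txt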

With respect to watermarking, there are lots of tutorials on the web on how to do it like in these pages:
Resize and Watermark Images in Linux | SavvyAdmin.com
Batch Watermark Images in Linux | Tux Tweaks

Thanks for all your input, people :)

Here's what I have so far:

{This part queries Picasa for photos matching the user's search term, in this case "Apples". The photo URLs returned are then indexed in the file "picasalist"}

GET "http://picasaweb.google.com/data/feed/base/all?alt=rss&kind=photo&access=public&filter=1&q=Apples&hl=en_US" | sed 's/</\n</g' | grep media:content |sed 's/.*url='"'"'\([^'"'"']*\)'"'"'.*$/\1/' > picasalist; 

{This part downloads the listed images and watermarks them with pre-defined text}

wget -c -i picasalist; mogrify -font helvetica -pointsize 12 -gravity southwest -draw 'fill black text 1,1 "Apples" fill white text 2,0 "Apples"'  *.jpg

Now I just need to be able to string these together and add the URL-to-watermark feature.

I also have a piece of code that isolates the filename from the URL:

cat picasalist | rev | cut -d\/ -f 1 | rev
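For what it's worth, basename or a parameter expansion gives the same result per URL (just an alternative to the rev/cut pipeline):

url="http://www.abc.com/xyz.zip"
basename "$url"        # prints xyz.zip
echo "${url##*/}"      # same thing, pure shell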

Theoretically, one could compare the filename to the addresses in "picasalist", then pass the corresponding URL off to the mogrify command and presto, mission accomplished! :) I just wish my technical ability was on par with my aspirations, lol. I have to say though, the kind people on these forums have always helped me in the right direction.
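In case it helps, here is one way those pieces might fit together. This is only a sketch assembled from the commands already posted in this thread; the mogrify options are copied from the earlier example with the URL substituted for the fixed "Apples" text, so it will need testing:

# build the URL list exactly as before
GET "http://picasaweb.google.com/data/feed/base/all?alt=rss&kind=photo&access=public&filter=1&q=Apples&hl=en_US" \
    | sed 's/</\n</g' | grep media:content \
    | sed 's/.*url='"'"'\([^'"'"']*\)'"'"'.*$/\1/' > picasalist

# download each photo individually and stamp its own source URL onto it
while read -r url; do
    wget -c "$url" || continue
    img=$(basename "$url")
    mogrify -font helvetica -pointsize 12 -gravity southwest \
        -draw "fill black text 1,1 \"$url\" fill white text 2,0 \"$url\"" "$img"
done < picasalist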