Downloading JPGs from a gallery-type website

Can someone explain what this does step by step? I found this script on Stack Overflow and want to customize it for personal use to download JPG images from a website.

# get all pages 
curl 'http://domain.com/id/[1-151468]' -o '#1.html' 

# get all images 
grep -oh 'http://pics.domain.com/pics/original/.*jpg' *.html >urls.txt 

# download all images 
sort -u urls.txt | wget -i- 
  1. I think the first line downloads the pages of the domain with curl, but what does the '#1.html' mean?

  2. Why, in .*jpg, is the * after the '.'? And what is this line trying to do? I tried adapting it for a different website, but I get the error grep: *.html: No such file or directory even though the first command downloads the html files just fine.

  3. I think the third step just organizes the results, and then wget visits each jpg's URL and downloads it.

1) From man curl:

       -o/--output <file>
              Write output to <file> instead of stdout. If you are using {} or
              []  to  fetch  multiple documents, you can use '#' followed by a
              number in the <file> specifier. That variable will  be  replaced
              with the current string for the URL being fetched. ...

So it replaces #1 with the number of the page in question.
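
A smaller run makes the substitution easy to see (domain.com is just the placeholder host from the script above):

# fetch three pages; curl substitutes 1, 2, 3 for #1 in the output filename
curl 'http://domain.com/id/[1-3]' -o '#1.html'
# result: 1.html, 2.html and 3.html in the current directory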

2) Because it's a regex, not a glob. In a regex, * means "zero or more of the previous character", and . means "any character". So .*jpg means "any string ending in jpg".
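
A quick way to see the difference between the two (the photo_001.jpg URL is just made up to match the script's pattern):

# glob: the shell itself expands *.html into a list of matching filenames
ls *.html

# regex: . is "any character", * is "zero or more of the previous item",
# so .*jpg swallows everything up to and including the final "jpg"
echo 'http://pics.domain.com/pics/original/photo_001.jpg' |
grep -o 'http://pics.domain.com/pics/original/.*jpg'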

3) Yes, it sorts them so they download in order, though with a random pile of URLs the order hardly matters. The more useful part is the -u flag, which removes duplicate URLs before wget -i - reads the list from standard input and downloads each one.
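
A tiny example of what sort -u contributes (the a.example URLs are invented):

printf '%s\n' http://a.example/2.jpg http://a.example/1.jpg http://a.example/1.jpg | sort -u
# prints each URL once, in sorted order:
#   http://a.example/1.jpg
#   http://a.example/2.jpg
# piping that output into "wget -i -" downloads each file exactly once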

I am having the most trouble with step two I think.

I'm assuming the -oh is the two options -o and -h combined?

Step 1 downloads the files fine, but they show up in my directory as 1.html., 2.html., etc., with what looks like an extra dot right after the extension. I'm not sure if that's the problem, but Step 2 doesn't seem to be able to find any .html files,

and so Step 3 fails because there is no urls.txt. What could be the problem?
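
Is there something I should run to check what the files are actually called? Maybe something like:

ls -l                      # exact names as curl created them
ls *.html                  # if this also says "No such file or directory", the names really don't end in .html
printf '%s\n' * | cat -A   # cat -A (GNU) makes stray trailing characters visible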

I think I might have found a different problem actually.

Running this in the terminal works fine

wget -nd -H -p -A jpg,jpeg,png,gif -e robots=off www.url.example

but when I put this in my bash script and run it, I get awaiting response... 404 Not Found. For some reason a %0D gets appended to the end of the jpg URL, which I'm thinking makes wget request the wrong URL.

I've been trying a different approach from my earlier one, since I couldn't get that working. What could the problem be this time, so that I can automate the downloading?

%0D is a carriage return, which has probably been appended to your file by Windows. Did you edit this script on Windows and then transfer it to Unix?

Running dos2unix filename on the Unix side should remove these extra characters.
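
For example (myscript.sh is just a stand-in for whatever your script file is called):

# confirm the carriage returns are there; GNU cat -A shows each one as ^M
cat -A myscript.sh

# strip them in place with dos2unix
dos2unix myscript.sh

# or, if dos2unix isn't installed, delete every carriage return with tr
tr -d '\r' < myscript.sh > myscript.clean && mv myscript.clean myscript.sh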