Wget from multiple paths

If I have these wildcard paths to download from:

path1/*.txt
path2/*.txt
path3/*.txt
path4/*.txt
path5/*.txt

under a base URL such as this:

http://abc.com

Can wget be invoked in such a way that it fetches only those files and recreates the corresponding paths under a target folder, so the result looks like this:

/downloads/path1/*.txt
...
/downloads/path5/*.txt

I just need to download the above in that specific pattern and folder structure.

Cheers
DH

---------- Post updated at 10:05 AM ---------- Previous update was at 10:02 AM ----------

I pasted abc.com and the forum software interpreted it as a link (lol). It was just an example.

Try either:

Code:

 wget -i urls.txt 

wget will generate unique filenames for you, and it will also recreate the full directory paths if you add -x:

Code:

 wget -x -i urls.txt 

urls.txt should contain one full URL per line, including the http:// prefix and the path to the file.
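For example, a urls.txt like the sketch below (abc.com and the file names are placeholders taken from your question, not real files), combined with -x, -nH, and -P, would produce exactly the /downloads/pathX layout you described:

Code:

 $ cat urls.txt
 http://abc.com/path1/notes.txt
 http://abc.com/path2/readme.txt
 $ wget -x -nH -P /downloads -i urls.txt
 # -x: force directory creation, -nH: drop the abc.com/ host dir, -P: target folder
 # result: /downloads/path1/notes.txt and /downloads/path2/readme.txt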

Hopefully this helps :).

So I have to explicitly list all the URLs in urls.txt?

Thanks,
DH

wget can recursively pull files from a web page, provided that page contains links to other files; that is how the recursion works.

You can't enumerate all of a site's subpages from its home page if the home page doesn't link to them. There is no way for wget to discover every page in a domain on its own (a brute-force search is simply not practical).

If you do have links to other pages, then you can use

Code:

 wget --accept-regex urlregex 

to restrict which links are recursively pulled.

In your case, if you have one web page that links to, say, path1, path2, and so on, and each pathX page provides further links, you can do what you want with

Code:

 wget -r 
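As a concrete sketch of that idea (assuming abc.com serves browsable index pages for each pathX directory; the domain, the depth, and the loop are placeholders, and I've used wget's -A suffix filter instead of a regex since it's the simpler tool for a fixed extension):

Code:

 # fetch only the .txt files from each pathX directory into /downloads/pathX
 # -r -l 1    recurse one level: the directory listing plus the files it links to
 # -np        never ascend to the parent directory
 # -nH        drop the host name from the local path
 # -A '*.txt' keep only .txt files (index pages are fetched to find links, then deleted)
 for i in 1 2 3 4 5; do
     wget -r -l 1 -np -nH -P /downloads -A '*.txt' "http://abc.com/path$i/"
 done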

Yes, list each web address explicitly in urls.txt.

For example, a urls.txt containing the two lines below will download the two PDFs from the site indicated and put them both in one folder:

Code:

 http://www.genedx.com/wp-content/uploads/crm_docs/info_sheet_hedd.pdf 
 http://www.genedx.com/wp-content/uploads/crm_docs/info_sheet_vws.pdf 
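To run it, something like the sketch below (the downloads folder name is my assumption, not from the post) fetches both PDFs into a single folder:

Code:

 # download every URL listed in urls.txt into one folder named downloads
 wget -P downloads -i urls.txt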