Curl parallel download file list

Hello guys, this is my first post, so sorry if I've made a mess here =)

I'm using Ubuntu 14.04 LTS, 64-bit server version.

I have a list (url.list) with only URLs to download, one per line, that looks like this:

http://domain.com/teste.php?a=2&b=3&name=1
http://domain.com/teste.php?a=2&b=3&name=2
...
http://domain.com/teste.php?a=2&b=3&name=30000

As you can see, there are many lines in the file (in this case 30000). Because of that I'm using a trick to download many URLs simultaneously with this:

cat url.list | xargs -n 1 -P 10 <<MAGIC COMMAND THAT WILL SAVE ME>>

The problem is that I'd like to name each output file after the value of the name field (1.html, 2.html, ..., 30000.html, etc.), and use curl to limit each file to 50 KB. So the curl command should be something like:

curl -r 0-50000 -L $URL -o $filename.html -a $filename.log

How can I get this done? :confused:

I can extract the name with echo $URL | sed -n -e 's/^.*name=//p', but I don't know how to use this on the same line, capturing the output of the pipe into two variables ($URL and $filename).

I tried this with no success:

cat url.list | xargs -n 1 -P 10 | filename=$(sed -n -e 's/^.*name=//p') ; curl -r 0-50000 -L $URL -o $filename.html -a $filename.log

Thank you in advance,
tonispa

Did you try reading that file with the read builtin:

while IFS="?&=" read URL X X X X X FN REST; do echo $FN, $URL; done <url.list
1, http://domain.com/teste.php
2, http://domain.com/teste.php
, ...
30000, http://domain.com/teste.php

The Xes are dummy variables. Instead of the echo, put in your magic command. There have been threads on "parallel" execution with some tricks; use the search function in here.
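For example, a minimal sequential sketch of that idea could look like this (it assumes every line has the fixed a=2&b=3 query shown in the sample list, and reuses the curl options from the first post):

while IFS="?&=" read URL X X X X X FN REST; do
   # the split drops the query string from $URL, so re-append the fixed
   # a=2&b=3 part and name the output after the name value
   curl -r 0-50000 -L "$URL?a=2&b=3&name=$FN" -o "$FN.html"
done < url.list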

1 Like

Thank you so much for your help @RudiC, I'll try your tips tonight and post back here. I made a little progress with this code yesterday and figured out how to use xargs to "parallelize" the curl jobs:

xargs -n 1 -P 10 curl -s -r 0-50000 -O < url.list 

But the problem is that I can't rename the files the way I want. So what I did was cd into my destination directory and then run the code above. But I noticed that if different URLs produce the same filename, the first file is overwritten by the last one. Because of that, if I want to keep a single destination directory, being able to rename the output is mandatory.

How about something like this:

#!/bin/bash
fetch_url () {
   URL=$@
   filename=${URL##*=}

   curl -r 0-50000 -L "$URL" -o ${filename}.html -a ${filename}.log
}

export -f fetch_url

xargs -n 1 -P 10 fetch_url < url.list
2 Likes
perl -nle '/(\d+)$/ and print "$_ -o $1.html -a $1.log"' url.list | xargs -n 5 -P 10 curl -r 0-50000 -L
1 Like

Thank you for your reply! I couldn't make this work. I tried writing a script just for this function and calling it, and tried putting the command "inline" inside a screen session, but I always get the same error:

xargs: fetch_urlxargs: fetch_urlxargs: fetch_url: No such file or directory: No such file or directory

Do you know how to solve this?

export -f is a bash feature and I use it here to ensure the internal function fetch_url is exported to sub-shells. This is needed as xargs is an external command to the shell and runs the assembled commands in new processes.

I assumed, as you were using GNU xargs (the -P feature is a GNU extension), that you were also using the bash shell. I've updated my original post to specify the required shell, and this is all you may need to do to get your version working.
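Note that GNU xargs execs the command directly rather than through a shell, so the exported function is only visible if you have xargs start bash itself. A minimal sketch of that invocation (reusing the fetch_url function above) would be:

export -f fetch_url
xargs -n 1 -P 10 bash -c 'fetch_url "$@"' _ < url.list   # _ fills $0; each URL arrives as $1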

However, if you do not wish to use bash, you could put your function in an external script so that it can be called from xargs, for example:

$HOME/bin/fetch_url:

#!/bin/sh
URL=$@
filename=${URL##*=}
curl -r 0-50000 -L "$URL" -o ${filename}.html -a ${filename}.log

And from another script (or the command line) you can call this with:

xargs -n 1 -P 10 $HOME/bin/fetch_url < url.list
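Remember to make the helper script executable first, otherwise xargs will not be able to run it:

chmod +x $HOME/bin/fetch_url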