Multi HTML download

Hello,

I have a very large URL list and I want to download the URLs concurrently. aria2c is a very good tool for this (or concurrent curl commands), but the I/O load is very high and my server crashes under it. I want all of the downloaded HTML (the pages are very small) saved to a single text file. Is that possible? Thank you very much.

Aria2c command:

aria2c -i url.txt -j30

url.txt

http://www.domain.com/f34gf345g.html
http://www.domain.com/jyjk678.html
....
while read URL
do
    wget "$URL" >> download.txt # Downloading URL using wget & appending it to file: download.txt
done < urls_list.dat            # Reading from a file: urls_list.dat which has list of URLs
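
Since you also mentioned curl, a minimal sketch of the same loop with curl (assuming one URL per line in the list; -s hides the progress meter and -f skips pages that return an HTTP error):

while read URL
do
    curl -sf "$URL" >> download.txt # curl prints the page body to stdout by default
done < urls_list.dat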

Thanks, but that does not download concurrently. It is very slow for a huge URL list.

This downloads 50 URLs at a time; you can customize it as per your requirement:

seq=1
while read URL
do
   wget "$URL" >> download_${seq}.txt & 
   seq=$( expr $seq + 1 )
   mod=$( expr $seq % 50 )
   if [ $mod -eq 0 ]
   then
         wait   
   fi
done < urls_list.dat
wait
cat download_*.txt > consolidated.txt
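
If your xargs supports the -P option (both GNU and BSD xargs do), a rough sketch of the same idea without the manual counter, assuming one URL per line and no spaces in the URLs:

# Run at most 10 wget processes at a time; each writes the page body to the
# shared redirected output. Order and interleaving are not guaranteed, so this
# only suits small pages where that does not matter.
xargs -n 1 -P 10 wget -q -O - < urls_list.dat > consolidated.txt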

Good use of while read. You can redirect the entire loop instead of reopening download.txt 1000 times though:

while read line
do
        wget -q -O - "$line"   # everything printed inside the loop lands in one download.txt
done < urls_list.dat > download.txt

wget also has some features which make a loop unnecessary though :)

wget can read a list of URLs from a file with -i. The -nv option is also useful; it still prints completed files without printing all the complicated junk wget usually does.

wget -nv -i urls_list.dat > download.txt

This should be much faster than calling wget 1000 times, since it can re-use the same connection when it keeps connecting to the same site. Concurrency may not be necessary (and may not be desirable in many cases -- how fast is your connection?), but if it is, I'd split the list into parts and use wget -i on each part.

Thanks, it is very fast, but each file downloads separately to the HDD, which is a very high load for the server. I want to download everything, but into a single file only.

Since they're in the background, they have to be saved to independent files. It'd be almost impossible to guarantee the order of the output if they weren't.

I'd try splitting the file into many chunks for wget -i to handle independently. This will allow them to be concurrent without such an overwhelming number of files.

#!/bin/sh

# Divide the URL list among n processes, 10 by default
MAXPROC=${2:-10}
# Count lines first
LINES=$(wc -l < "$1")
# Lines per chunk, rounded up so we get at most MAXPROC chunks
LINES=$(( (LINES + MAXPROC - 1) / MAXPROC ))

# Split the list into chunks xaa, xab, ...
split -l "$LINES" < "$1"

# Loop over xaa, xab, ...
for FILE in x*
do
        # Download one set of files from $FILE into $FILE.out in background
        wget -nv -i "$FILE" -O - > "$FILE.out" 2> "$FILE.err" &
done

wait    # Wait for all processes to finish

# Assemble files in order
cat x*.out
cat x*.err >&2
# Remove temporary files
rm x*

Use it like

./multiget.sh filelist 5 2> errlog > output

for 5 simultaneous downloads.
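
A side note: if your split is GNU coreutils, it can divide the list into exactly N chunks by itself, which would replace the wc and arithmetic above. A minimal sketch, assuming GNU split:

# l/N splits into N line-aligned chunks without breaking any line (GNU coreutils only)
split -n l/"$MAXPROC" "$1"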

Corona688,

Thank you very much. This code keeps my server load manageable.