I have a huge list of URLs that I want to download concurrently. aria2c is a very good tool for this (or running curl commands concurrently), but my server crashes from the I/O load; it is very high. I want all the downloaded HTML (the pages are very small) saved into a single text file. Is that possible? Thank you very much.
while read -r URL
do
wget -q -O - "$URL" >> download.txt # Download the page to stdout (-O -) and append it to download.txt
done < urls_list.dat # Read from urls_list.dat, which holds the list of URLs
Good use of while read. You can redirect the entire loop instead of reopening download.txt 1000 times, though:
while read line
do
wget ...
done > download.txt
wget also has features that make a loop unnecessary, though.
wget can read a list of URLs from a file with -i. The -nv option is also useful: it still prints each completed file without all the complicated junk wget usually prints.
wget -nv -i urls_list.dat > download.txt
This should be much faster than calling wget 1000 times, since wget can reuse the same connection when it fetches repeatedly from the same site. Concurrency may not be necessary (and may not even be desirable in many cases: how fast is your connection?), but if it is, I'd split the list into parts and run wget -i on each part.
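To illustrate the splitting step on its own, here is a minimal offline sketch. The example URLs and the part_ prefix are made up, and the -n l/4 form (split into 4 chunks without breaking lines) is GNU split, not POSIX:

```shell
# Build a fake URL list (placeholder URLs, nothing is downloaded here)
seq 1 100 | sed 's|^|http://example.com/page|' > urls_list.dat
# Split it into 4 chunks without splitting any line (GNU split only)
split -n l/4 urls_list.dat part_
# The chunks concatenate back in glob (sorted) order
cat part_* > reassembled.dat
```

Each part_ file could then be handed to its own wget -i process.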
Since the downloads run in the background, they have to be saved to independent files; it would be almost impossible to guarantee the order of the output otherwise.
I'd try splitting the file into several chunks for wget -i to handle independently. That lets the downloads run concurrently without an overwhelming number of output files.
#!/bin/sh
# Usage: script urls_list.dat [processes]
# Split the URL list among n parallel wget processes, 10 by default
MAXPROC=${2:-10}
# Count the lines first
LINES=$(wc -l < "$1")
# Lines per chunk, rounded up so we never produce an extra chunk
# ("let" is bash, not POSIX sh; $(( )) works everywhere)
LINES=$(( (LINES + MAXPROC - 1) / MAXPROC ))
# Split the file into chunks xaa, xab, ...
split -l "$LINES" < "$1"
# Loop over xaa, xab, ... (x?? matches split's two-letter suffixes only)
for FILE in x??
do
# Download one set of URLs from $FILE into $FILE.out in the background
wget -nv -i "$FILE" -O - > "$FILE.out" 2> "$FILE.err" &
done
wait # Wait for all downloads to finish
# Assemble the output in order (the glob expands in sorted order)
cat x*.out
cat x*.err >&2
# Remove the temporary files
rm x*
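The ordering guarantee rests on split naming its chunks so that the x*.out glob expands back in the original order. Here is a minimal offline sketch of the same chunk / background / reassemble pattern, with tr standing in for wget and made-up file names, so it can be run without a network:

```shell
seq 1 100 > input.dat              # stand-in for the URL list
split -l 25 input.dat chunk_       # chunk_aa .. chunk_ad
for F in chunk_??                  # two-letter suffixes only, not *.out
do
tr -d '\r' < "$F" > "$F.out" &     # "download" one chunk in the background
done
wait
cat chunk_*.out > output.dat       # chunk_aa.out sorts first, so order holds
rm chunk_??                        # remove the temporary chunks
```

Because each background job writes only its own .out file, the final cat reproduces the input order exactly.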