Status code checker for 1300 URLs takes 15 mins to run

Hi all,

I wrote a shell script that checks the status code of a set of pages. For 1300 URLs, it takes 15 minutes to run. Can anyone please suggest how to make the script run faster, say about 5 minutes?

In my script I used a while loop and "wget --server-response".

Thank you!

Without knowing the script it would be hard to tell. And 1300 URLs in 15 minutes is already pretty fast to me (15 minutes * 60 / 1300 = ~0.7 seconds per URL, including starting the process and establishing the connection).

You could try to get (only) the HTTP status code with:

curl -o /dev/null --silent --write-out '%{http_code}' <yourURL>

However, I am not sure whether it will help you check 1300 URLs faster. You can give it a try.
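For example, a minimal loop around it could look like this (a sketch; urllist and codes_file are placeholder names, with one URL per line in the input):

#!/bin/bash
# Check each URL with curl and record "code URL", one per line.
while IFS= read -r url; do
    code=$(curl -o /dev/null --silent --write-out '%{http_code}' "$url")
    printf '%s %s\n' "$code" "$url"
done < urllist > codes_file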

Sure, thanks. I will try it.

Is there any way of running the script in parallel, i.e., executing the loop iterations in parallel?

Thanks

Yes. As for how, we'd have to know what you're doing with the status code afterwards.

The script just has to generate the status codes for the 1300 URLs. I am going to integrate it into a build process that takes 40 minutes to run. Once the build finishes, this status code checker is triggered, so getting the script down to about 5 minutes would be useful.

So if the script ran 10 instances in parallel, each taking a batch of the URLs at a time, it should finish within 5 minutes. Any ideas on how to do this in parallel?

Thanks!

Did you try with Perl?

 
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;    # also re-exports the HTTP::Status procedures (is_success etc.)

my @sites = qw(www.microsoft.com www.google.com www.yahoo.com);
foreach my $site (@sites) {
    my $filename = $site;                        # save each page under its hostname
    $site = 'http://' . $site;
    my $response = getstore($site, $filename);   # returns the HTTP status code
    if (is_success($response)) {                 # any 2xx code counts as up
        print "$site is up.\n";
    }
    else {
        print "There's some error on the site. Returned error code $response.\n";
    }
}
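(One caveat: getstore() downloads every page to disk just to get a code. If you only need the status, LWP::Simple's head() called in scalar context returns true on success without fetching the body.)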

There are a few ways to parallelize this, but for that we'd need more info. What system are you on? Is using GNU parallel an option? How do you get the input? What do you do with the status code once you've got it? What shell are you using?

After generating the status codes, the output file will be sent to some people.

I am running the Bash script on a Unix server, which I access using PuTTY.

GNU parallel is not installed on my Unix server.

Which is the better way to parallelize?

So the basic structure is

for each URL
  wget status code from URL >> codes_file

send codes_file through email

Does that sound about right?
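In plain bash, that structure would look something like this sketch (urllist, codes_file and the mail address are placeholder names):

#!/bin/bash
# Sequential baseline: one wget per URL, "code URL" appended per line.
while IFS= read -r url; do
    # --spider skips the download; the status line goes to stderr
    wget --spider --server-response "$url" 2>&1 |
        awk -v url="$url" '/^  HTTP\// { code = $2 } END { print code, url }'
done < urllist > codes_file

# send codes_file through email (assumes a mail/mailx command is available)
mail -s "URL status codes" someone@example.com < codes_file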

That is what I am doing now, using a while loop and wget. This sounds like it will take the same time as my script.

Any other ideas?

My response wasn't a suggestion on how to make it faster, but a question about how you're doing it now, so that we can get an idea of what might work and what side effects should be considered.

Since you have wget, you probably have Linux, and can make xargs do this:

xargs -d '\n' -P 4 --max-args=16 wget -nv --spider <urllist 2>responselist

This will run four simultaneous instances of wget. The --max-args option stops it from feeding too many URLs into one wget, so if one download hangs for a while, the other instances can take up most of the slack.

The --spider option tells it not to download the page, just check its existence, which should also speed things up.

The -nv option tells it to print successes and failures one per line. Note that wget writes these messages to stderr, which is why the output is captured with 2> rather than >.

Thanks! I tried GNU parallel, and it seems the script now completes in 4 mins!! I will try xargs as well.

Thanks!

Whether using parallel or xargs, I think --spider may help.
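For reference, a GNU parallel equivalent of the xargs line above would be something along these lines (a sketch; it assumes one URL per line in urllist):

parallel -j4 wget -nv --spider {} <urllist 2>responselist

parallel reads the URLs from stdin, substitutes each one for {}, and keeps up to four jobs running at once; as with xargs, wget writes its one-line results to stderr.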

Thanks!!

I used GNU parallel in my script. If I run the script from the command prompt with "sh code.sh", it executes as expected. But if I run it from crontab, it throws the error "parallel: command not found", i.e., it does not recognize the parallel command, even though I installed the GNU parallel package in the same path where my script is present.

Any idea on this?

That question is so common it's in our FAQ. cron runs jobs with a very minimal PATH compared to a user shell. You can either set your own PATH, source /etc/profile (". /etc/profile") to get a proper default PATH, or call parallel with its full path, i.e. /path/to/parallel.
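For example, you could set PATH at the top of the crontab (values are illustrative; point it at whatever directory parallel actually lives in, and use your real schedule):

PATH=/usr/local/bin:/usr/bin:/bin
0 9 * * * /path/to/code.sh

or export it inside code.sh itself before calling parallel:

export PATH="$PATH:/path/to/parallel-dir"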
