Status code checker for 1300 URLs takes 15 mins to run

Hi all,

I wrote a shell script that checks the status code of a set of pages. For 1300 URLs, it takes 15 minutes to run. Can anyone please suggest how to make the script run faster, say about 5 minutes?

In my script I used a while loop and "wget --server-response".

Thank you!

Without knowing the script it would be hard to tell. And 1300 URLs in 15 minutes is already pretty fast to me (15 minutes * 60 / 1300 = ~0.7 seconds per URL, including starting the process and establishing the connection).

You could try to get (only) the HTTP status code with:

curl -o /dev/null --silent --write-out '%{http_code}' <yourURL>

However, I am not sure whether it will help you check 1300 URLs faster. You can give it a try.
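For example, a minimal loop around it could look like this (a sketch; urllist and codes_file are placeholder names, with one URL per line in the input):

#!/bin/bash
# Check each URL with curl and record "code URL", one per line.
while IFS= read -r url; do
    code=$(curl -o /dev/null --silent --write-out '%{http_code}' "$url")
    printf '%s %s\n' "$code" "$url"
done < urllist > codes_file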

Sure, thanks. I will try it.

Is there any way of running the script in parallel, i.e., executing the loop iterations in parallel?

Thanks

Yes. As for how, we'd have to know what you're doing with the status code afterwards.

The script just has to generate the status codes for the 1300 URLs. I am going to integrate it into a build process that takes 40 minutes to run. Once the build finishes, this status code checker is triggered, so getting the script down to about 5 minutes would be useful.

So if the script ran 10 instances in parallel, each taking a batch of the URLs at a time, it should finish within 5 minutes. Any ideas on how to do this in parallel?

Thanks!

Did you try with Perl?

 
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;    # also re-exports the HTTP::Status procedures (is_success etc.)

my @sites = qw(www.microsoft.com www.google.com www.yahoo.com);
foreach my $site (@sites) {
    my $filename = $site;                        # save each page under its hostname
    $site = 'http://' . $site;
    my $response = getstore($site, $filename);   # returns the HTTP status code
    if (is_success($response)) {                 # any 2xx code counts as up
        print "$site is up.\n";
    }
    else {
        print "There's some error on the site. Returned error code $response.\n";
    }
}
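(One caveat: getstore() downloads every page to disk just to get a code. If you only need the status, LWP::Simple's head() called in scalar context returns true on success without fetching the body.)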

There are a few ways to parallelize this, but for that we'd need more info. What system are you on? Is using GNU parallel an option? How do you get the input? What do you do with the status code once you've got it? What shell are you using?

After generating the status codes, the output file will be sent to some people.

I am running the Bash script on a Unix server, which I access using PuTTY.

GNU parallel is not installed on my Unix server.

Which is the better way to parallelize?

So the basic structure is

for each URL
  wget status code from URL >> codes_file

send codes_file through email

Does that sound about right?
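In plain bash, that structure would look something like this sketch (urllist, codes_file and the mail address are placeholder names):

#!/bin/bash
# Sequential baseline: one wget per URL, "code URL" appended per line.
while IFS= read -r url; do
    # --spider skips the download; the status line goes to stderr
    wget --spider --server-response "$url" 2>&1 |
        awk -v url="$url" '/^  HTTP\// { code = $2 } END { print code, url }'
done < urllist > codes_file

# send codes_file through email (assumes a mail/mailx command is available)
mail -s "URL status codes" someone@example.com < codes_file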

That is what I am doing now, using a while loop and wget. This sounds like it will take the same time as my script.

Any other ideas?

My response wasn't a suggestion on how to make it faster, but a question about how you're doing it now, so that we can get an idea of what might work and what side effects should be considered.

Since you have wget, you probably have Linux, and can make xargs do this:

xargs -d '\n' -P 4 --max-args=16 wget -nv --spider <urllist 2>responselist

This will run four simultaneous instances of wget. The --max-args option stops it from feeding too many URLs into one wget, so if one download hangs for a while, the other instances can take up most of the slack.

The --spider option tells it not to download the page, just check its existence, which should also speed things up.

The -nv option tells it to print successes and failures one per line. Note that wget writes these messages to stderr, which is why the output is captured with 2> rather than >.

Thanks! I tried GNU parallel, and it seems the script now completes in 4 mins!! I will try xargs as well.

Thanks!

Whether using parallel or xargs, I think --spider may help.
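For reference, a GNU parallel equivalent of the xargs line above would be something along these lines (a sketch; it assumes one URL per line in urllist):

parallel -j4 wget -nv --spider {} <urllist 2>responselist

parallel reads the URLs from stdin, substitutes each one for {}, and keeps up to four jobs running at once; as with xargs, wget writes its one-line results to stderr.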

Thanks!!

I used GNU parallel in my script. If I run the script from the command prompt with "sh code.sh", it executes as expected. But if I run it from crontab, it throws the error "parallel: command not found", i.e., it does not recognize the parallel command, even though I installed the GNU parallel package in the same path where my script is present.

Any idea on this?

That question is so common it's in our FAQ. cron runs jobs with a very minimal PATH compared to a user shell. You can either set your own PATH, source /etc/profile (". /etc/profile") to get a proper default PATH, or call parallel with its full path, i.e. /path/to/parallel.
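For example, you could set PATH at the top of the crontab (values are illustrative; point it at whatever directory parallel actually lives in, and use your real schedule):

PATH=/usr/local/bin:/usr/bin:/bin
0 9 * * * /path/to/code.sh

or export it inside code.sh itself before calling parallel:

export PATH="$PATH:/path/to/parallel-dir"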
