Website crawler

Hi,

I want to build a crawler that searches for a keyword on certain websites.

This is what the website looks like:

website.com/xxxxAA11xxxx

I want the crawler to step through the letters alphanumerically on its own, and if a certain keyword is found, the URL should be logged.

But I have no idea how to do that, please help me out!

Thank you.

Change what letters, to what?

What's your system? What's your shell?

I use OS X and bash.

Change the characters from 0000 to 99ZZ: the first two should be every number from 00 to 99, and the last two every letter and every digit in every combination.
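To make the range concrete, here is a minimal sketch that prints every code in that pattern. The function name gen_codes is made up for illustration, and lowercase letters are an assumption (the range above is written with uppercase ZZ); adjust the character list if the site uses uppercase.

```shell
# Print every four-character code: two digits (00-99) followed by
# two characters drawn from 0-9 and a-z.
# ASSUMPTION: lowercase letters; gen_codes is a hypothetical helper name.
gen_codes() {
    chars="0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s t u v w x y z"
    n=0
    while [ "$n" -lt 100 ]; do
        for a in $chars; do
            for b in $chars; do
                # %02d pads the number to two digits: 0 becomes 00, 7 becomes 07
                printf '%02d%s%s\n' "$n" "$a" "$b"
            done
        done
        n=$((n + 1))
    done
}
```

Running gen_codes | wc -l should report 129600 lines (100 x 36 x 36), so that is the number of pages such a crawl would request.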

If you have a Mac, you'll have to use curl. It's actually fairly good at this: you can give it whole ranges of URLs to fetch, and when you pipe its output into something else, you can split on --_curl_-- to tell where each new page begins and ends. Unfortunately, it'll take [a-z] and [0-9], but not [a-z0-9] or [0123456789abcdefghijklmnopqrstuvwxyz], so you have to give it four blocks of stuff to fetch:

Something like:

BASE="http://website/xxxxx[00-99]"
TAIL="yyyyy"

# Fetch all pages with curl, feed them through awk, print the URL of every page containing 'searchstr'
curl "${BASE}[0-9][0-9]${TAIL}" "${BASE}[0-9][a-z]${TAIL}" "${BASE}[a-z][0-9]${TAIL}" "${BASE}[a-z][a-z]${TAIL}" 2>/dev/null |
        # Split on curl's header of --_curl_--.  $1 is the URL following it.
        # No ^ in RS: in gawk, ^ in RS only matches at the very start of the input.
        # A multi-character RS needs gawk or mawk; POSIX leaves it unspecified.
        awk -v RS="--_curl_--" -v FS="\n" '/searchstr/ { print $1 }'
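If the awk on your system does not handle a multi-character RS, the same idea can be sketched line by line instead. This is only an illustration: find_matches is a made-up name, and it assumes curl writes each separator as a line that starts with --_curl_-- with the URL right after it.

```shell
# Read curl's combined output on stdin and print the URL of every page
# that contains the keyword given as the first argument.
# ASSUMPTION: each page is preceded by a line "--_curl_--<URL>".
find_matches() {
    awk -v kw="$1" '
        # A separator line starts a new page; the URL follows the 10-character marker.
        index($0, "--_curl_--") == 1 { url = substr($0, 11); seen = 0; next }
        # Print each URL once, on the first line of the page that contains the keyword.
        url != "" && !seen && index($0, kw) { print url; seen = 1 }
    '
}
```

It would be used in place of the awk line, for example curl ... 2>/dev/null | find_matches searchstr >> matches.log, where matches.log is just an example file name. The keyword is matched as plain text here, not as a regular expression.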

This might be a foolish question, but how do I install curl? I downloaded it (curl-7.19.7), but the instructions that come with curl don't work. It says it can't find make, and the configuration step fails too. And then, how do I run your script? Is this all the code I need? Thank you!

I was nearly certain that Macs come with curl; check whether you already have it.
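A quick way to check from the shell, as a small sketch (have_cmd is just a made-up helper name; command -v is the standard way to ask the shell where a command lives):

```shell
# Succeeds if the named command can be found in PATH.
# have_cmd is a hypothetical helper name used only for this example.
have_cmd() { command -v "$1" >/dev/null 2>&1; }

if have_cmd curl; then
    echo "curl is installed at: $(command -v curl)"
else
    echo "curl was not found in PATH"
fi
```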

If you don't, it'd probably be much easier to install it through fink than to build it yourself by hand.

Thank you so much for your help, but that doesn't work either. I can't install it, and even the aid program is useless.

In what way does it not work?

I uploaded some screenshots, but they are awaiting approval. Say, you don't want to help me via instant message, do you? :wink:

No.

You could also just describe the problem...

OK. Well, it s***s because I can't copy/paste it!

A root directory /sw exists. Please bla....

And if I try to use the aid program, another error occurs, but I can't describe it; it says something weird.

The 'bla' is probably important.

So's the stuff you "can't describe"... if it's words, it's typable. :wall:

You might have already installed fink, though, if you already have /sw. Try running fink list in the shell; if it works, you have fink.