Read webpage from shell script

Hey experts,

I am trying to read a webpage and want to search it for a text pattern (case-insensitive). Please help me achieve this.

If the text is found, I should receive a mail; the address can be hardcoded in the script.

I am not a big fan of using Perl scripts.

Please help.

Machine: AIX

Thanks a lot in advance.

You could try:

found="`egrep -i regex_pattern index.html`"
[[ "_$found" != _ ]] && echo "Found it: $found" | mail -s subject me@myemail.dom

But first you should check that you can send mail from the command line:

echo blah | mail -s testSubject my@email
curl http://www.unix.com | awk '/Copyright/ {print "found the pattern!"}'

This searches the unix.com homepage for the pattern "Copyright"; if it is found, it prints "found the pattern!" once.
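To tie this back to the original goal (send a mail on a match), here is a minimal sketch; the pattern, subject, and address are placeholders, and the mail step is left as a comment so the matching can be tested on its own first:

```shell
#!/bin/sh
# Minimal sketch: report_match reads the page text on stdin and
# searches it case-insensitively. Subject/address are placeholders.
report_match() {
    pattern=$1
    matches=`grep -i "$pattern"`
    if [ -n "$matches" ]; then
        echo "Found it: $matches"   # pipe this into: mail -s subject me@myemail.dom
        return 0
    fi
    return 1
}

# Usage (once fetching works): curl -s http://www.unix.com | report_match Copyright
```

Separating the fetch from the match also makes the match part easy to test with a local file or a here-document before pointing it at a live URL.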

Hi Mirni,

Thanks for your quick response. I have checked, and I can send mail from my command line. My issue is that what I have is an entire link, e.g. The UNIX and Linux Forums - Learn UNIX and Linux from Experts, something like this. I tried replacing index.html in your code with my URL and it failed with an error: egrep: can't open the link specified.

Please advise.

Thanks in advance.

Look at the post above. Use curl:

curl http://www.unix.com | egrep "pattern"

Alternatively you can use 'wget' to download the HTML file and then search it... But curl sounds better.

Hi sk1418,

I tried the command you provided, but unfortunately I don't think it's installed on my machine.
Below is the error I received:

 
$ curl http://www.unix.com | awk '/Copyright/ {print "found the pattern!"}'
ksh: curl:  not found

Please advise.

Do you have wget installed? Try this:

 wget -O - http://www.unix.com | [awk or grep]

--- update ---
Well, since you only want to know whether the webpage contains the pattern, you could:

 wget -O - http://www.unix.com | grep -c "Copyright"

If the returned number is greater than 0, the pattern was found.
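An alternative sketch: instead of counting matches, you can test grep's exit status directly with -q (quiet), so there is no number to parse. The URL and mail details in the usage comment are placeholders:

```shell
#!/bin/sh
# has_pattern reads the page text on stdin and returns 0 on a
# case-insensitive match, non-zero otherwise.
has_pattern() {
    grep -qi "$1"
}

# Once a downloader is available, the whole check is one if-statement:
# if wget -q -O - http://www.unix.com | has_pattern Copyright; then
#     echo "pattern found" | mail -s alert me@myemail.dom
# fi
```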

Hard luck, my friend: not installed.

@sk1418:
The problem is that pages with active content won't get downloaded the way you'd want. E.g.

curl www.unix.com | grep Copyright

returns 3 lines, whereas

wget www.unix.com -O - | grep Copyright

returns none, due to some active content...

Well, since you have neither curl nor wget... do you have Python installed?

Save this into a .py file, e.g. t.py:

#!/usr/bin/python
# Python 2 syntax; on Python 3 use urllib.request.urlopen instead
import urllib
print urllib.urlopen("http://www.unix.com").read()

Then you could do something like this:

kent$ python t.py | grep -c "Copyright"
3

@mirni

wget www.unix.com -O - | grep Copyright

works here. I can see 3 red "Copyright" :smiley: I've aliased grep with color.

Hi Guys,

I do not have wget or lynx installed, and Python doesn't seem to work. Is there any way to do this in a shell script?

Your response is appreciated.

This is why it's always helpful to know what your system is.

Doesn't seem to work? What, exactly, does it do? It's "python", by the way; check your spelling in the script too.

No netspeak, please.

Some versions of ksh allow you to create a network socket but this would be a very hackish and breakage-prone way to do it. You should try and get the right tool for the job.

What is your system, anyway? Did you try curl?

Some shells support sockets.
With some versions of bash you could write something like this:

exec 3<>/dev/tcp/www.unix.com/80
printf 'GET / HTTP/1.1\r\nHost: www.unix.com\r\nConnection: close\r\n\r\n' >&3
grep Copyright <&3
exec 3>&-
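The same idea as a slightly more reusable sketch, assuming a bash built with /dev/tcp support (not every build has it). The body helper strips the HTTP response headers, which end at the first blank line; the host and pattern in the usage comment are placeholders:

```shell
#!/bin/bash
# fetch writes the raw HTTP response for a host's front page;
# body strips the response headers, which end at the first blank line.
fetch() {
    host=$1
    # open fd 3 as a TCP socket to port 80 (bash /dev/tcp extension)
    exec 3<>"/dev/tcp/$host/80"
    # HTTP requires CRLF (\r\n) line endings in the request
    printf 'GET / HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n' "$host" >&3
    cat <&3
    exec 3>&-
}

body() {
    # print everything after the first blank (or CR-only) line
    awk 'in_body { print } /^\r?$/ { in_body = 1 }'
}

# Usage: fetch www.unix.com | body | grep -i Copyright
```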


I agree, I would use Perl.

The OP seems to be using AIX.

I don't have experience with AIX, but isn't Python a standard package on AIX?
BTW, @bankimmehta, if you say "python doesn't seem to work", what is the output?

How about netcat?

printf 'GET / HTTP/1.1\r\nHost: www.unix.com\r\nConnection: close\r\n\r\n' | nc www.unix.com 80 | grep Copyright

The same problem as raw sockets in the shell -- there's more to HTTP than "GET /". What if the server sends the page in an unexpected character set, or split into several chunks? Utilities like wget, or a well-crafted Perl module, can be expected to handle that, but it would be a mountain of code to write and test for a one-off shell script.