script

mike171562 · July 15, 2007, 11:02am

Hello, I have been searching for a way to extract URLs from Google cache search results.

I have a file with a list of URLs like this:

http://64.233.167.104/search?q=cache:ts2G04wctD0J:www.worldwidewords.org/qa/qa-shi3.htm+"shit"&hl=en&ct=clnk&cd=12&gl=ca&ie=UTF-8

What I need to do is extract the actual URL, www.worldwidewords.org/qa/qa-shi3.htm, which lies between the last : and the +, and drop the Google cache prefix, so that I end up with a list of regular URLs. The list also contains normal URLs, which I would like to keep.

Any help would be appreciated.

mike171562 · July 15, 2007, 11:20am

This is the bash script I am using. It searches Google, strips everything but the links from the results, and writes them to a file:

#!/bin/bash
#
# google.sh
# ---------
#  Automatic Google search from the command line.
#
#    Syntax : $ google {search terms}
#
if [ -z "$1" ]
then
  # If no keyword is entered echo try again
  #
  echo "you didnt tell me what to search....try again"
else
  #url variable with the maximum search results (100) per page
  #
  url='http://google.ca/search?num=100&hl=en&safe=off&q='

  appended=0
  for searchTerm in "$@"
  do
    # Replace white spaces in the search terms
    #
    searchTerm=`echo $searchTerm | sed 's/ /%20/g'`

    url="$url%22$searchTerm%22"

    if [ $appended -lt `expr $# - 1` ]
    then
      url="$url"\+
    else
      url="$url"\&btnG\=Google\+Search\&meta\=
    fi

    let "appended+=1"
  done

  lynx -dump "$url" > googleresult1
  sed 's/http/\^http/g' googleresult1 | tr -s "^" "\n" | grep http | sed 's/\ .*//g' > googleresults2 # this pipeline extracts only the URLs; '>' avoids re-reading results left over from earlier runs
  rm googleresult1
  cat googleresults2
  sed -e '/google/d' googleresults2 >> urls.txt
fi

The sed command at the end removes the results with google in them, which are the links to the following pages of results.
I have tried sed -n '/:/,/+/p' urls.txt, but there are three colons in the cache URL and I need the text between the third : and the +.
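
The attempt above cannot work because `/:/,/+/` is a line-range address: it selects whole lines from one that contains `:` through one that contains `+`, and never cuts text out of a line. As a minimal sketch of one alternative, the text between the last colon and the first plus sign can be taken with shell parameter expansion alone. The function name `strip_cache` is hypothetical, and the sketch assumes the cached address itself contains no colon (no `http://` prefix and no port number).

```shell
#!/bin/bash
# strip_cache: hypothetical helper, not part of the original script.
# Prints the address embedded in a Google cache URL; any other
# line is printed unchanged.
strip_cache() {
  local line=$1
  case $line in
    *q=cache:*)
      line=${line##*:}   # drop everything up to and including the last ':'
      line=${line%%+*}   # drop the first '+' and everything after it
      ;;
  esac
  printf '%s\n' "$line"
}

# Example calls: one cache URL, one ordinary URL.
strip_cache 'http://64.233.167.104/search?q=cache:ts2G04wctD0J:www.worldwidewords.org/qa/qa-shi3.htm+%22shit%22&hl=en'
strip_cache 'http://www.example.com/page.html'
```

To process a file, the function could be called once per line, for example with `while IFS= read -r u; do strip_cache "$u"; done < urls.txt`.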

Franklin52 · July 15, 2007, 1:18pm

Try this:

line='http://64.233.167.104/search?q=cache:ts2G04wctD0J:www.worldwidewords.org/qa/qa-shi3.htm+%22shit%22&hl=en&ct=clnk&cd=12&gl=ca&ie=UTF-8'

echo $line|sed 's/\(.*\):www\(.*\)+\(.*\)/www\2/'

Regards

reborg · July 15, 2007, 2:43pm

Three absolute wildcards and no anchors? It may work, but personally I would not consider using that code. Also, why capture into buffers what you don't use?

echo $url_to_strip | awk -F'[:+]' '{print $4}'

If this does not give the correct result, use nawk instead of awk, as you may have an 'old awk' on some systems, e.g. Solaris.
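
To run the same field split over a whole list that mixes cache URLs with ordinary ones, something like the following sketch might do. It assumes cache entries can be recognised by the string `q=cache:`; the function name `clean_urls` and the sample input are illustrative only.

```shell
#!/bin/sh
# clean_urls: hypothetical filter that reads URLs on stdin.
# Lines containing 'q=cache:' are split on ':' and '+' and reduced
# to field 4; every other line passes through untouched.
clean_urls() {
  awk -F'[:+]' '/q=cache:/ { print $4; next }
                { print }'
}

# Example: one cache URL and one ordinary URL.
printf '%s\n' \
  'http://64.233.167.104/search?q=cache:ts2G04wctD0J:www.worldwidewords.org/qa/qa-shi3.htm+%22shit%22&hl=en' \
  'http://www.example.com/page.html' | clean_urls
```

Because field 4 is counted from the start of the line, this relies on the cache URL having exactly the layout shown in the thread.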

Franklin52 · July 15, 2007, 3:13pm

I may be wrong, since I don't know how the addresses are formatted, but this works only when there are exactly three colons up to the address.

Regards

reborg · July 15, 2007, 4:05pm

That is the format of a Google cache entry.

However, in the interest of being more flexible (note that the match is not a generic wildcard):

echo $url_to_strip | sed -n 's_.*:\(www[^+]*\)+.*_\1_p'
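
A sketch of how this could be applied to a whole file while leaving ordinary URLs alone: without `-n` and the `p` flag, sed prints lines that do not match unchanged. Anchoring on `q=cache:` plus the cache ID instead of on `www` is an assumption made here so that cached addresses not starting with `www` are handled too; the function name `strip_cache_file` is hypothetical.

```shell
#!/bin/sh
# strip_cache_file: hypothetical wrapper around a single sed substitution.
# Reads the named files, or stdin when no file is given.
#   .*q=cache:   everything up to and including 'q=cache:'
#   [^:]*:       the cache ID and the colon that follows it
#   \([^+]*\)    the cached address, captured up to the first '+'
#   +.*          the '+' and the rest of the query string
strip_cache_file() {
  sed 's_.*q=cache:[^:]*:\([^+]*\)+.*_\1_' "$@"
}

# Example: one cache URL and one ordinary URL.
printf '%s\n' \
  'http://64.233.167.104/search?q=cache:ts2G04wctD0J:www.worldwidewords.org/qa/qa-shi3.htm+%22shit%22&hl=en' \
  'http://www.example.com/page.html' | strip_cache_file
```

With a file, the call would look like `strip_cache_file urls.txt > clean_urls.txt`. A cache URL with no `+` after the address does not match and is left as it is.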

mike171562 · July 15, 2007, 7:00pm

Thanks for the replies, everyone. I know the above script is sloppy, but I'm not very good at scripting yet. I'm using it to list URLs containing certain words, to import into a web spider for my company web filter. I decided the easiest way would be to add an extra line that filters out lines with "cache" in them, something like:
sed -e '/cache/d' googleresults2 >> urls.txt
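
As a rough sketch, both deletions could be done in a single sed pass, with `sort -u` removing the duplicates that build up when results are appended with `>>` on every run. The function name `filter_urls` is hypothetical, and note the assumption that no wanted URL contains the word "google" or "cache", since such lines are dropped as well.

```shell
#!/bin/sh
# filter_urls: hypothetical helper combining both filters.
# Deletes lines containing 'google' or 'cache', then sorts the
# remainder and removes duplicate lines.
filter_urls() {
  sed -e '/google/d' -e '/cache/d' "$@" | sort -u
}

# Example input: a result-page link, a cache link, and a duplicate.
printf '%s\n' \
  'http://www.google.ca/search?q=x&start=100' \
  'http://64.233.167.104/search?q=cache:abc:www.a.org/+x' \
  'http://www.b.org/' \
  'http://www.b.org/' \
  'http://www.a.org/' | filter_urls
```

With the files from the script, the call would be `filter_urls googleresults2 > urls.txt`.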

Does anyone know of a good web spider script that is keyword based?