What I need to do is extract the actual URL (World Wide Words: Shit), which lies between the : and the + in a Google cache URL, and remove the Google cache URL from the list, so that I end up with a list of regular URLs. The list also contains normal URLs, which I would like to keep.
This is the bash script I am using. It searches Google, strips everything but the links from the results, and writes the resulting list of URLs to a file.
#!/bin/bash
#
# google.sh
# ---------
# Automatic Google search from the command line.
#
# Syntax : $ google {search terms}
#
if [ -z "$1" ]
then
# If no keyword is entered echo try again
#
echo "You didn't tell me what to search for... try again"
else
#url variable with the maximum search results (100) per page
#
url='http://google.ca/search?num=100&hl=en&safe=off&q='
appended=0
for searchTerm in "$@"
do
# Replace white spaces in the search terms
#
searchTerm=$(printf '%s' "$searchTerm" | sed 's/ /%20/g')
url="$url%22$searchTerm%22"
if [ $appended -lt `expr $# - 1` ]
then
url="$url"\+
else
url="$url"\&btnG\=Google\+Search\&meta\=
fi
let "appended+=1"
done
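# Quote the URL so the shell does not split it or treat "?" as a glob
lynx -dump "$url" >> googleresult1
# Extract only the URLs; overwrite googleresults2 so results from earlier runs are not repeated
sed 's/http/\^http/g' googleresult1 | tr -s "^" "\n" | grep http | sed 's/\ .*//g' > googleresults2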
rm googleresult1
cat googleresults2
sed -e '/google/d' googleresults2 >> urls.txt
fi
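The sed command at the end removes the results that contain "google", which are the links to the following pages of results.
I have tried this: sed -n '/:/,/+/p' url.txt, but there are three colons in the cache URL and I need the text between the third : and the +. That form also selects whole ranges of lines rather than part of a line, so it cannot cut text out of the middle of a URL.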
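For reference, here is a minimal sketch of how the text between the third : and the first + could be pulled out with a sed substitution instead of a line range. It assumes the cache links look roughly like the made-up sample below (scheme, then "cache:", then an ID, then the real address without its own http:// prefix), so the prefix is added back by hand. The function name uncache and both sample URLs are only for illustration. Lines without "cache:" in them pass through untouched.

```shell
#!/bin/bash
# Hypothetical helper: rewrite cache URLs to the page they point at and
# pass every other line through unchanged.
# Assumption: the cached address carries no scheme, so http:// is prepended.
uncache() {
    sed -E '/cache:/ s#^[^:]*:[^:]*:[^:]*:([^+]*)\+.*$#http://\1#'
}

# Made-up sample input: one cache link and one normal link.
printf '%s\n' \
    'http://10.0.0.1/search?q=cache:AbCdEf123:www.example.com/page.htm+word&hl=en' \
    'http://www.example.org/normal.html' | uncache
```

On a real list this would be run as something like uncache < googleresults2 >> urls.txt. If the real cache links have a different number of colons before the address, the pattern would need adjusting.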
Thanks for the replies, everyone. I know the above script is sloppy, but I'm not very good at scripting yet. I'm using it to list URLs containing certain words so I can import them into a web spider for my company's web filter. I decided the easiest way would be to add an extra line that filters out lines with "cache" in them, something like:
sed -e '/cache/d' googleresults2 >> urls.txt
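The two filters could probably also be folded into a single sed call with two -e expressions, so the file is only read once. A small sketch with made-up sample data (the function name filter_urls and the URLs are just for illustration):

```shell
#!/bin/bash
# Hypothetical helper: drop result-page links and cache links in one pass.
filter_urls() {
    sed -e '/google/d' -e '/cache/d' "$1"
}

# Made-up sample data written to a temporary file for illustration.
tmp=$(mktemp)
printf '%s\n' \
    'http://www.google.ca/search?q=foo&start=100' \
    'http://10.0.0.1/search?q=cache:AbC:www.example.com/+foo' \
    'http://www.example.org/page.html' > "$tmp"
filter_urls "$tmp"
rm -f "$tmp"
```

In the script itself that would replace both sed lines with filter_urls googleresults2 >> urls.txt. Note that this throws away any normal URL that happens to contain the word "cache" or "google" as well.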
Does anyone know of a good keyword-based web spider script?