html tags

dunryc · November 28, 2007, 6:29pm

Hi, I'm new to the forum, so hi everyone. Hope you are all well.

I am attempting to write a bash script at the moment. It's a scraper/grabber that uses wget to download web pages related to the user's query. That part is no problem. Once I have the page, I need to strip all the useless (to me) data out of the HTML source, i.e.:

As you can see from the above, the data I need to grab is between the new tags; these are always in the source, whatever the user's query. Can anyone help or point me in the right direction? Any help would be greatly appreciated. Thanks for listening, dunryc

porter · November 28, 2007, 6:39pm

Have you considered XMLStarlet, the command-line XML toolkit?

bakunin · November 28, 2007, 7:35pm

There are two different cases to consider: the starting and ending tags are either on the same line or on different lines.

Example

<new>This is the text to catch</new>

<new>
This is some text
to catch</new>

Both can be matched by simple regular expressions. Each regexp is followed by a sample input it handles:

sed -n 's/.*<new>\(.*\)<\/new>.*/\1/p'

blabla <new>text to match</new> blabla
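To try the same-line case without a downloaded page, here is a minimal sketch that feeds a made-up sample to the command above. The function name extract_inline and the sample lines are only for illustration; the sed expression is the one from this post.

```shell
# extract_inline is a hypothetical wrapper name; the sed expression is
# unchanged from the post. -n suppresses default output, and the p flag
# prints only lines where the substitution succeeded.
extract_inline() {
    sed -n 's/.*<new>\(.*\)<\/new>.*/\1/p'
}

# Made-up sample input: one line with the tags, one line without.
printf '%s\n' 'blabla <new>text to match</new> blabla' 'no tags here' | extract_inline
# prints: text to match
```

One thing to keep in mind: the leading .* is greedy, so if a line holds several <new>...</new> pairs, only the contents of the last pair on that line are printed.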

sed -n '/<new>/,/<\/new>/ {
               s/.*<new>//
               s/<\/new>.*//
               /^$/d
               p
               }'

blabla <new>text
to
match</new> blabla
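The multi-line case can be checked the same way. This is a rough sketch with made-up sample input; extract_block is a hypothetical wrapper name, and the sed script inside it is the one from this post.

```shell
# extract_block is a hypothetical wrapper name around the range command
# from the post.
extract_block() {
    sed -n '/<new>/,/<\/new>/ {
        s/.*<new>//
        s/<\/new>.*//
        /^$/d
        p
    }'
}

# Made-up sample: the tag pair spans three lines, followed by an unrelated line.
printf '%s\n' 'blabla <new>text' 'to' 'match</new> blabla' 'trailing line' | extract_block
# prints:
# text
# to
# match
```

A caveat with the range form: sed looks for the closing pattern only from the line after the one that opened the range, so a line that carries both <new> and </new> does not close the range by itself. If the pages mix both layouts, the two commands may need to be combined or run separately.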

bakunin

dunryc · November 29, 2007, 5:14pm

Thanks for the pointers, guys. I did have a look at XMLStarlet to grab the data and it works great, but I wanted to use tools that would be present in most distros. The commands that bakunin posted work great. Once again, thanks for the help.