sed/tr/grep help

flightskoo · April 17, 2012, 3:00am

So I have an HTML file with a bunch of words inside tags and I need to extract just the words, but I'm not sure what the best way to do this is. The format is as follows:

<tr>
    <td>word 1</td>
    <td>word 2</td>
</tr>

All I want to extract is 'word 2'. First I tried eliminating all the other HTML garbage with

egrep '<tr>|<td>' filename

but after that I really had no clue. I tried using sed to find all the <tr> tags and delete them, plus the following line, but there has to be a better way to do this.

The other question I have is: what command do you use to find a phrase and delete only that phrase? For example:

wordswordswo<b>rdswords</b>words...

How would one go about just deleting the bold tags? It's pretty simple to delete a line, but what about JUST the matched pattern?

One last request... instead of just giving me some code/commands, could you explain what is going on with the code? Regular expressions are new to me, as is shell scripting, and it's really, really confusing and frustrating. Any helpful websites describing how to do similar types of operations would be great, because frankly there are a lot of crappy ones out there on the web. Trust me, I've read about half of them. Thanks so much in advance.

asterisk-ix_use · April 17, 2012, 3:36am

Hey try these:

For Q1:

user1@linuxbox:/home/user1> cat data
<tr>
    <td>word 1</td>
    <td>word 2</td>
</tr>

user1@linuxbox:/home/user1> sed -n '/\/tr/{g;1!p;};h' data
    <td>word 2</td>
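
Since an explanation was asked for, here is a rough walk-through of that sed command, as a sketch rather than the only way to do it. It relies on sed's "hold space", a spare buffer that can remember the previous line. The second sed call that strips the tags is an addition of mine (not part of the command above), and the `sample` function and the variable names are made up for the example.

```shell
# Print the sample rows from the question (stands in for the "data" file).
sample() {
    printf '%s\n' '<tr>' '    <td>word 1</td>' '    <td>word 2</td>' '</tr>'
}

# The same sed program as above, piece by piece:
#   -n           print nothing unless a p command asks for it
#   /\/tr/{...}  on a line containing "/tr", run the commands in the braces
#   g            overwrite the current line with the line saved in the hold space
#   1!p          print it, except on line 1 (nothing has been saved yet)
#   h            on every line, save the current line into the hold space
# So when </tr> is reached, the hold space still contains the line before it.
last_cell=$(sample | sed -n '/\/tr/{g;1!p;};h')

# Extra step (an assumption about what is wanted): delete anything that looks
# like a tag, then delete the leading whitespace, leaving only the word.
word=$(printf '%s\n' "$last_cell" | sed -e 's/<[^>]*>//g' -e 's/^[[:space:]]*//')

printf '%s\n' "$word"
```

The pattern `<[^>]*>` reads as: a `<`, then any run of characters that are not `>`, then a `>`. Replacing that with nothing removes the tag and leaves the text around it alone.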

And for Q2:

echo "part1<b>part2</b>part3" | sed -n 's/\(.*\)<b>\(.*\)<\/b>\(.*\)/\1\2\3/p'
part1part2part3
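
A possibly simpler variant for Q2, offered as a sketch: sed's `s/PATTERN/REPLACEMENT/` only touches the text that matched, so an empty replacement deletes just the match and keeps the rest of the line. The `g` flag repeats this for every match on the line, which the capture-group version above does not do if a line has several bold sections. The variable names are made up for the example.

```shell
# The example line from the question.
line='wordswordswo<b>rdswords</b>words'

# Two substitutions, each with an empty replacement:
#   s/<b>//g     delete every opening bold tag
#   s/<\/b>//g   delete every closing bold tag (the / is escaped because
#                / is also the delimiter of the s command)
cleaned=$(printf '%s\n' "$line" | sed -e 's/<b>//g' -e 's/<\/b>//g')

printf '%s\n' "$cleaned"
```

Deleting a whole line uses a different command (`d`), which is why the two cases look so different in sed.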

Hope this helps!!

Scrutinizer · April 17, 2012, 4:59am

Try:

awk -F'<|>' 'NF>3{print $3}' infile
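
A short explanation of that one-liner, plus a variant: `-F'<|>'` tells awk to split each line wherever a `<` or a `>` occurs, so a line like `<td>word 1</td>` ends up with the text in field 3, and `NF>3` skips lines such as `<tr>` that have too few fields. As written it prints every cell, not only the second one. The sketch below counts cells per row and prints only the second; the two-row sample and the names `sample` and `cell` are made up for illustration.

```shell
# Two sample rows in the same layout as the question (made-up data).
sample() {
    printf '%s\n' \
        '<tr>' '    <td>word 1</td>' '    <td>word 2</td>' '</tr>' \
        '<tr>' '    <td>word 3</td>' '    <td>word 4</td>' '</tr>'
}

second_cells=$(sample | awk -F'<|>' '
    $2 == "tr" { cell = 0 }                      # a new row starts: reset the cell counter
    NF > 3     { cell++; if (cell == 2) print $3 }  # a cell line: print only the 2nd of the row
')

printf '%s\n' "$second_cells"
```

This assumes one tag pair per line, as in the sample. HTML that puts several cells on one line would need a different approach.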

pokerino · April 17, 2012, 5:09am

Hi,

try this: