Using sed to parse HTML

Hello,
I have an HTML file like this:

<html>
...
...
...
<table>
.......
......
</table>
<table name = "hi">
......
.....
...
</table>
<h1> Welcome </h1>
.......
......
</html>

I only need to extract the text between <table name = "hi"> and the corresponding </table>, and delete everything else. How do I do that?

I can find the <table name = "hi"> line and delete the lines before it, but I am not able to find the corresponding </table>, as there could be multiple </table> tags.

Please help.

Thanks,
Prasanna

Try:

sed '/<table name = "hi">/,/<\/table>/!d' infile

Thanks a lot.

Is the regex not greedy? Would it not match the last </table> that it sees, if there are other table tags below our </table>?

Thanks,
Prasanna

Might as well make it tab-separated text, too, so you can import it into Excel or Access. :)

Hi, no, it is not greedy: the range ends at the first line after the start that matches </table>.
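
To check this for yourself, here is a small sketch. The sample file is made up for illustration (a "hi" table followed by an unrelated table); only the sed command comes from the post above.

```shell
# Made-up sample: the "hi" table is followed by a second, unnamed table.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<table name = "hi">
row 1
</table>
<table>
row 2
</table>
EOF

# The range opens on the "hi" line and closes on the first later line
# that matches </table>; every line outside the range is deleted.
result=$(sed '/<table name = "hi">/,/<\/table>/!d' "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

Only the first three lines of the sample should come out; the second table is dropped because its opening line does not match the start pattern.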

Thanks a lot.

You can import HTML, but it is amazingly slow!

Beware: some old versions of Access do not handle CSV properly, so tab-separated text is the winner!

@Scrutinizer

What happens if the tables are nested, like this?

<table name = "hi">
......
<table name = "hi">
...
</table>
.....
...
</table>
.....

It would output:

<table name = "hi">
......
<table name = "hi">
...
</table>

so you would miss these lines ;)

.....
...
</table>
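
One possible way around the nesting problem, if you want to stay with line-oriented tools, is to count open tables in awk instead of using a range. This is only a sketch: extract_hi is a hypothetical helper name, and it assumes every <table ...> and </table> tag sits on its own line, with at most one of each per line, as in the samples in this thread. It is not a real HTML parser.

```shell
# extract_hi FILE: hypothetical helper, prints each "hi" table including
# any tables nested inside it. Assumes one table tag per line at most.
extract_hi() {
  awk '
    # Outside any "hi" table: wait for the opening tag.
    depth == 0 && /<table name = "hi">/ { depth = 1; print; next }
    # Inside a "hi" table: print the line and track the nesting depth.
    depth > 0 {
      print
      if (/<table/)    depth++   # a nested table opens
      if (/<\/table>/) depth--   # a table closes; 0 means the outer one is done
    }
  ' "$1"
}
```

Called as extract_hi infile, it should keep printing until the </table> that balances the first <table name = "hi">, rather than stopping at the first </table> it sees.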
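
Alternatively, using Ruby with the Hpricot HTML parser:

#!/usr/bin/env ruby  -Ku
require 'hpricot'

file = ARGV[0]
# Parse the whole file into an Hpricot document.
doc = File.open(file) { |f| Hpricot(f) }
# Print every <table> element whose name attribute is "hi".
(doc/"table").each do |x|
  print "====> #{x}\n" if x.get_attribute("name") == "hi"
end

$ cat file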
<html>
...
...
...
<table>
.......
......
</table>
<table name = "hi">
text inside hi
</table>
<h1> Welcome </h1>
.......
......
</html>
<table name = "hi">
some more text inside hi
</table>


$ ruby test.rb file
====> <table name="hi">
text inside hi
</table>
====> <table name="hi">
some more text inside hi
</table>

awk's and sed's range patterns also print every matching range, not just the first:

$ awk '/<table name = "hi">/,/<\/table>/' infile
<table name = "hi">
text inside hi
</table>
<table name = "hi">
some more text inside hi
</table>
$ sed -n  '/<table name = "hi">/,/<\/table>/p' infile
<table name = "hi">
text inside hi
</table>
<table name = "hi">
some more text inside hi
</table>
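
If only the first "hi" table is wanted, one option is to make sed quit as soon as the first range closes. A minimal sketch, where first_hi is a hypothetical helper name and the same one-tag-per-line layout as above is assumed:

```shell
# first_hi FILE: hypothetical helper, prints only the first "hi" range.
first_hi() {
  # Inside the range, print each line; quit on the closing tag so that
  # later ranges are never reached.
  sed -n '/<table name = "hi">/,/<\/table>/{
p
/<\/table>/q
}' "$1"
}
```

Like the plain range, this stops at the first </table>, so it does not handle nested tables.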