Parsing: How to go from HTML to CSV?

Dear all,

I have to parse a large number of HTML files, which I would like to convert to comma-separated values. The HTML files have the following structure:

<tag1> CATEGORY_1 <tag2><tag3> HEADER_1 <tag4>

<tag5> paragraph_1 <tag6>
<tag5> paragraph_2 <tag6>
<tag3>HEADER_2 <tag4><tag5> paragraph_1 <tag6>
<tag5> paragraph_2 <tag6>
<tag5> paragraph_3 <tag6>
<tag7>

<tag1> CATEGORY_2 <tag2><tag3> HEADER_3 <tag4>
<tag5> paragraph_1 <tag6>
<tag3>HEADER_4 <tag4><tag5> paragraph_1 <tag6>
<tag5> paragraph_2 <tag6>
<tag7>
.....

Each category contains a different number of headers and each header contains a different number of paragraphs.

I would like to transform the HTML into something looking like the following:

"CATEGORY_1" , "HEADER_1" , "paragraph_1"
"CATEGORY_1" , "HEADER_1" , "paragraph_2"
"CATEGORY_1" , "HEADER_2" , "paragraph_1"
"CATEGORY_1" , "HEADER_2" , "paragraph_2"
"CATEGORY_1" , "HEADER_2" , "paragraph_3"
"CATEGORY_2" , "HEADER_3" , "paragraph_1"
"CATEGORY_2" , "HEADER_4" , "paragraph_1"
"CATEGORY_2" , "HEADER_4" , "paragraph_2"
...

I am fairly new to shell scripting and have been playing around with awk, but I have not yet come up with a satisfactory solution, so any kind of help would be greatly appreciated.

Many thanks

Philipp

Sounds like a job for Perl and HTML::TreeBuilder
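
For what it is worth, the same idea (let a real HTML parser find the tags, then carry the current category and header along while emitting one row per paragraph) can be sketched in other languages too. Below is a rough sketch in Python rather than Perl, using only the standard library. The tag names h1, h2 and p are pure assumptions, since the post only shows placeholders (<tag1> ... <tag7>), and the names RowCollector and html_to_csv are made up for illustration. Treat it as a starting point, not a finished tool.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical tag names: the original post only shows placeholders
# (<tag1> ... <tag7>), so replace these with the tags in the real files.
CATEGORY_TAG = "h1"
HEADER_TAG = "h2"
PARAGRAPH_TAG = "p"


class RowCollector(HTMLParser):
    """Collect (category, header, paragraph) rows while reading the HTML."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.category = ""
        self.header = ""
        self._current = None  # tag whose text is currently being captured
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # Start capturing text when one of the three tags of interest opens.
        if tag in (CATEGORY_TAG, HEADER_TAG, PARAGRAPH_TAG):
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return
        # Collapse runs of whitespace and trim the captured text.
        text = " ".join("".join(self._buffer).split())
        if tag == CATEGORY_TAG:
            self.category = text
            self.header = ""  # a new category starts with no header yet
        elif tag == HEADER_TAG:
            self.header = text
        else:
            # A paragraph closed: emit one row with the current context.
            self.rows.append((self.category, self.header, text))
        self._current = None


def html_to_csv(html):
    """Return the CSV text for one HTML document, with every field quoted."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(parser.rows)
    return out.getvalue()
```

To process many files, one could loop over them, call html_to_csv on each file's contents, and append the results to a single output file. Using the csv module instead of joining strings by hand means that commas or quotes inside a paragraph get escaped properly.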

Hi,
thank you for the reply. I will try that; however, I am reluctant to use Perl since I hardly know how the bash shell works. Anyway, thanks a lot.

Phil