Parsing: How to go from HTML to CSV?

docdudetheman · March 26, 2009, 11:40am

Dear all,

I have to parse a large number of HTML files and would like to convert them into comma-separated values. The HTML files have the following structure:

<tag1> CATEGORY_1 <tag2><tag3> HEADER_1 <tag4>
<tag5> paragraph_1 <tag6>
<tag5> paragraph_2 <tag6>
<tag3>HEADER_2 <tag4><tag5> paragraph_1 <tag6>
<tag5> paragraph_2 <tag6>
<tag5> paragraph_3 <tag6>
<tag7>

<tag1> CATEGORY_2 <tag2><tag3> HEADER_3 <tag4>
<tag5> paragraph_1 <tag6>
<tag3>HEADER_4 <tag4><tag5> paragraph_1 <tag6>
<tag5> paragraph_2 <tag6>
<tag7>
.....

Each category contains a varying number of headers, and each header contains a varying number of paragraphs.

I would like to transform the HTML into something looking like the following:

"CATEGORY_1" , "HEADER_1" , "paragraph_1"
"CATEGORY_1" , "HEADER_1" , "paragraph_2"
"CATEGORY_1" , "HEADER_2" , "paragraph_1"
"CATEGORY_1" , "HEADER_2" , "paragraph_2"
"CATEGORY_1" , "HEADER_2" , "paragraph_3"
"CATEGORY_2" , "HEADER_3" , "paragraph_1"
"CATEGORY_2" , "HEADER_4" , "paragraph_1"
"CATEGORY_2" , "HEADER_4" , "paragraph_2"
...

I am fairly new to shell scripting and have been playing around with awk. I have not been able to come up with a satisfactory solution yet, so any kind of help would be greatly appreciated.

Many thanks

Philipp

pludi · March 26, 2009, 12:10pm

Sounds like a job for Perl and HTML::TreeBuilder
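
For readers who would rather not pick up Perl, the same idea (walk the document with a real HTML parser, remember the current category and header, and emit one row per paragraph) can be sketched in Python using only the standard library. This is a rough sketch, not a tested solution for the actual files: the tag names `h1`, `h2`, and `p` are assumptions standing in for the placeholder tags in the question, and `RowCollector` and `html_to_csv` are hypothetical names made up for this example.

```python
import csv
import io
from html.parser import HTMLParser

# Assumed tag names: the question only shows placeholders (tag1 ... tag7),
# so replace these with whatever the real files use.
CATEGORY_TAG = "h1"
HEADER_TAG = "h2"
PARAGRAPH_TAG = "p"


class RowCollector(HTMLParser):
    """Collect (category, header, paragraph) rows from a flat HTML document."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.category = ""
        self.header = ""
        self._capturing = None  # tag whose text is currently being collected
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in (CATEGORY_TAG, HEADER_TAG, PARAGRAPH_TAG):
            self._capturing = tag
            self._text = []

    def handle_data(self, data):
        if self._capturing is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != self._capturing:
            return
        # Collapse runs of whitespace and trim the ends.
        text = " ".join("".join(self._text).split())
        if tag == CATEGORY_TAG:
            self.category = text
        elif tag == HEADER_TAG:
            self.header = text
        else:
            # A paragraph closed: emit one row with the current context.
            self.rows.append((self.category, self.header, text))
        self._capturing = None


def html_to_csv(html):
    """Return the CSV text for one HTML document, every field quoted."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(parser.rows)
    return out.getvalue()


sample = """
<h1> CATEGORY_1 </h1><h2> HEADER_1 </h2>
<p> paragraph_1 </p>
<p> paragraph_2 </p>
<h2>HEADER_2 </h2><p> paragraph_1 </p>
<h1> CATEGORY_2 </h1><h2> HEADER_3 </h2>
<p> paragraph_1 </p>
"""
print(html_to_csv(sample), end="")
```

Each paragraph becomes one line such as `"CATEGORY_1","HEADER_1","paragraph_1"`. To process many files, one could loop over them and append each result to a single output file. If the real markup nests categories or headers inside other elements, the tag constants alone will probably not be enough and the state handling would need to be adapted.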

docdudetheman · March 26, 2009, 2:34pm

Hi,
thank you for the reply. I will try that; however, I am reluctant to use Perl since I hardly know how the bash shell works. Anyway, thanks a lot.

Phil