Dear all,
I have to parse a large number of HTML files and would like to transform them into comma-separated values. The HTML files have the following structure:
<tag1> CATEGORY_1 <tag2><tag3> HEADER_1 <tag4>
<tag5> paragraph_1 <tag6>
<tag5> paragraph_2 <tag6>
<tag3>HEADER_2 <tag4><tag5> paragraph_1 <tag6>
<tag5> paragraph_2 <tag6>
<tag5> paragraph_3 <tag6>
<tag7>
<tag1> CATEGORY_2 <tag2><tag3> HEADER_3 <tag4>
<tag5> paragraph_1 <tag6>
<tag3>HEADER_4 <tag4><tag5> paragraph_1 <tag6>
<tag5> paragraph_2 <tag6>
<tag7>
.....
Each category contains a different number of headers and each header contains a different number of paragraphs.
I would like to transform the HTML into something looking like the following:
"CATEGORY_1" , "HEADER_1" , "paragraph_1"
"CATEGORY_1" , "HEADER_1" , "paragraph_2"
"CATEGORY_1" , "HEADER_2" , "paragraph_1"
"CATEGORY_1" , "HEADER_2" , "paragraph_2"
"CATEGORY_1" , "HEADER_2" , "paragraph_3"
"CATEGORY_2" , "HEADER_3" , "paragraph_1"
"CATEGORY_2" , "HEADER_4" , "paragraph_1"
"CATEGORY_2" , "HEADER_4" , "paragraph_2"
...
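A minimal sketch of one way this mapping could be expressed in awk, given only as an illustration of the logic and not tested on the real files. It assumes the placeholder names tag1, tag3 and tag5 appear literally as the opening tags of categories, headers and paragraphs, that tags carry no attributes, and that "<" never occurs inside the text itself. The function name html_to_csv is hypothetical.

```shell
# Hypothetical helper: reads HTML from stdin or from the files given as arguments.
html_to_csv() {
    awk '
    # Split the input at every "<", so each record looks like "tagname>text".
    BEGIN { RS = "<" }
    {
        pos = index($0, ">")
        if (pos == 0) next                      # nothing before the first tag
        tag  = substr($0, 1, pos - 1)           # tag name (assumed to have no attributes)
        text = substr($0, pos + 1)              # text that follows the tag
        gsub(/[ \t\r\n]+/, " ", text)           # collapse line breaks and runs of blanks
        gsub(/^ | $/, "", text)                 # trim the remaining outer blank
        gsub(/"/, "\"\"", text)                 # double embedded quotes, CSV style
        if (tag == "tag1")      category = text # remember the current category
        else if (tag == "tag3") header = text   # remember the current header
        else if (tag == "tag5" && text != "")   # one output row per paragraph
            printf "\"%s\" , \"%s\" , \"%s\"\n", category, header, text
    }' "$@"
}
```

It could be called as `html_to_csv page.html > page.csv`, where page.html is a placeholder file name. Real HTML with attributes or nested markup would probably need a proper HTML parser instead of this record-splitting approach.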
I am fairly new to shell scripting and have been playing around with awk, but I have not yet been able to come up with a satisfactory solution, so any kind of help would be greatly appreciated.
Many thanks
Philipp