Parse data

saw7 · October 3, 2010, 4:13am

Hi,

I have a file p1.htm:

<div class="colorID2">
   
    aaaa aaaa aa <br/>
    bbbbbbbb   bbb<br/>


    <br/>cccc ccc ccc 
</div><div class="colorID1">
   
    dddd d ddddd<br/>
 

eeee eeee eeeeeeeeee<br/>
      fffff

<br/>g gg<br/>
</div>
<div ...

Desired output:

aaaa aaaa aa.bbbbbbbb   bbb.cccc ccc ccc.dddd d ddddd.eeee eeee eeeeeeeeee.fffff.g gg

My code:

awk -vRS="" '{gsub(/<br/>/,".",$0)}1' p1.htm

but it doesn't work.

Thanks

Scrutinizer · October 3, 2010, 4:31am

Try:

awk '{$1=$1;gsub(/<\/*div[^>]*>/,"");gsub(/ *(<br\/>)+ */,".")}1' RS= ORS= infile
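A commented sketch of how the one-liner above works, wrapped in a shell function. The name `flatten_br` and the sample input are illustrative and not from the thread, and `-v RS= -v ORS=` is used in place of the trailing assignments, which should behave the same way. Note that in the original attempt the unescaped `/` inside `/<br/>/` closes the regex literal early, which is why the tag is written as `<br\/>` here.

```shell
# flatten_br: hypothetical helper name; reads HTML-like text on stdin.
flatten_br() {
  awk -v RS= -v ORS= '{
    $1 = $1                      # rebuild the record: fields joined by single spaces
    gsub(/<\/*div[^>]*>/, "")    # drop opening and closing div tags
    gsub(/ *(<br\/>)+ */, ".")   # each run of <br/> plus surrounding spaces becomes one dot
  } 1'
}

# Illustrative input, not the file from the question
printf '<div class="x">\n  aa bb <br/>\n  cc<br/>dd\n</div>\n' | flatten_br
echo
```

With an empty `RS`, awk reads blank-line-separated paragraphs as records, and the empty `ORS` joins them without a separator. Because `$1=$1` squeezes every run of whitespace to one space, runs of spaces inside the text are not preserved, and a space can remain where a div tag was removed.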

saw7 · October 3, 2010, 5:17am

Thanks, Scrutinizer.

Scrutinizer, sorry, can you explain this to me:

/ *(<br\/>)+ */

How is it different from:

/<br\/>/

Scrutinizer · October 3, 2010, 5:31am

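It means zero or more spaces, followed by one or more occurrences of the string <br/>, followed by zero or more spaces. So a whole run of <br/> tags, together with the spaces around it, is replaced by a single dot, whereas /<br\/>/ matches only one tag at a time and leaves the spaces in place.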
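A small side-by-side sketch of the two patterns on a made-up input line (the function names `br_each` and `br_runs` are illustrative):

```shell
# One dot per <br/> tag; surrounding spaces are left alone.
br_each() { awk '{ gsub(/<br\/>/, "."); print }'; }

# One dot per run of <br/> tags, swallowing the spaces around the run.
br_runs() { awk '{ gsub(/ *(<br\/>)+ */, "."); print }'; }

printf 'aaa <br/><br/> bbb\n' | br_each
printf 'aaa <br/><br/> bbb\n' | br_runs
```

The first call prints `aaa .. bbb` and the second prints `aaa.bbb`.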

frans · October 3, 2010, 5:48am

sed 's/<[^<]*>//g' infile | tr '\n' ' '

This doesn't convert tabs or collapse multiple spaces, but the output can be read by the shell, awk, and so on.
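If the tabs and repeated spaces matter, one possible variant of the pipeline above is sketched below. The function name `strip_tags` and the two extra `tr` stages are additions for illustration, and the tag pattern is written as `<[^>]*>` rather than `<[^<]*>`. Like the original pipeline, it simply removes the <br/> tags rather than turning them into dots.

```shell
# strip_tags: hypothetical helper name; reads HTML-like text on stdin.
strip_tags() {
  sed 's/<[^>]*>//g' |   # delete every tag
    tr '\t\n' '  ' |     # turn tabs and newlines into spaces
    tr -s ' '            # squeeze runs of spaces down to one
}

# Illustrative input, not the file from the question
printf '<div>\n  aa\tbb <br/>\ncc\n</div>\n' | strip_tags
echo
```

A single leading and trailing space can remain where tags stood alone on a line.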

saw7 · October 3, 2010, 6:35am

Scrutinizer
frans