I need help parsing this XML file. Its paragraphs are broken across several lines and I would like to join each one into a single line.
I hope you can understand my explanation. Thanks for any help/direction.
The script could be in bash, awk, ruby, perl, whatever, please.
In the output I want:
The values with font=9 as the first line of each group.
The 1st value with font=8 that is immediately below a value with font=9 as the 2nd line of each group.
The 2nd value with font=8 that is below a value with font=9 as the first column of each group.
The values with font=10 as the second column of each group.
And finally, the values with font=8 that belong to the preceding font=10 value joined into a single line.
OS family
Unix pk1
1 1 originally meant to be a
1 2 convenient platform for programmers
Source model
Unix pk2
2 1 Historically closed-source
2 2 , while some Unix projects (including BSD family and Illumos) are open-source. Development started in 1969.
2 3 this is last paragraph.
<b>OS family </b>
Unix pk1
1
1
<b>OS family </b>
originally meant to be a
1
2
<b>Source model </b>
convenient platform
1
<b>Source model </b>
Historically
closed-source
2
<b>Source model </b>
, while some Unix
3
I see your script prints the desired output, but when I try it the output is different.
I get this output:
OS family
nix pk1
originally meant to be a
for programmersorm
Source model
nix pk2
closed-source
Development started in 1969.ly and Illumos)
last paragraph.
I was afraid of that when reading your system info. What awk version do you use? Are you sure you ran the script exactly as given? And with the data as given in the sample?
Yes. I ran it exactly as given, with the same input as pasted in the forum.
In Ubuntu system
$ awk -W version
GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4, GNU MP 6.1.0)
Copyright (C) 1989, 1991-2015 Free Software Foundation.
In Cygwin:
$ awk -W version
GNU Awk 4.1.4, API: 1.1 (GNU MPFR 3.1.5-p2, GNU MP 6.1.2)
Copyright (C) 1989, 1991-2016 Free Software Foundation.
$ awk '{match ($0, /font="[^"]*"/); LVL = substr ($0, RSTART+6, RLENGTH-7); gsub (/<[^>]*>/, _); print LVL, $0}' input.xml
9 OS family
8 Unix pk1
8 1
10 1
8 originally meant to be a
10 2
8
8 convenient platform
8 for programmers
9 Source model
8 Unix pk2
8 2
10 1
8 Historically
8 closed-source
10 2
8
8 , while some Unix
8 projects (including BSD family and Illumos)
8 are open-source.
8 Development started in 1969.
10 3
8 this is
8 last paragraph.
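For reference, the grouping rules from the first post can be applied to a font/text listing like the one above. This is only a minimal sketch, not the script that was posted in the thread: the function name group_fonts, the variable names, and the assumption that its input already has the form "font text" (one pair per line, as printed by the one-liner above) are all illustrative.

```shell
# Hypothetical helper: reads "font text" lines on stdin and prints the groups.
group_fonts() {
  awk '
    # Print the row collected so far, if there is one.
    function emit_row() { if (have_row) print row; have_row = 0 }
    {
      text = $0
      sub(/^[^ ]+ ?/, "", text)      # drop the leading font column
      gsub(/^ +| +$/, "", text)      # trim surrounding blanks
    }
    text == "" { next }              # skip empty fragments
    # font 9: close the previous row and start a new group
    $1 == 9  { emit_row(); print text; state = 1; next }
    # font 10: close the previous row and start a new one as "key number"
    $1 == 10 { emit_row(); row = key " " text; have_row = 1; next }
    $1 == 8 {
      if (state == 1)      { print text; state = 2 }   # 2nd line of the group
      else if (state == 2) { key = text; state = 3 }   # first column value
      else if (have_row)   { row = row " " text }      # paragraph fragment
    }
    END { emit_row() }
  '
}
```

It would be used by piping the output of the one-liner above into it, for example: awk '...' input.xml | group_fonts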
I discovered why the output differs. The file has CRLF line endings since I'm working on Windows. I changed the line endings to LF and the output is the same as yours.
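For anyone hitting the same problem: with CRLF input, every line still ends in a carriage return after awk has split on LF. When such text is joined and printed, the terminal jumps back to the start of the line at each carriage return, so later fragments overwrite earlier ones, which is consistent with the scrambled output shown earlier. A minimal sketch of removing the carriage returns inside awk instead of converting the file first (the function name strip_cr is only illustrative):

```shell
# Hypothetical helper: remove a trailing carriage return from every line.
# Reads the files given as arguments, or stdin when there are none.
strip_cr() {
  awk '{ sub(/\r$/, ""); print }' "$@"
}
```

It could be run as strip_cr input.xml > input_lf.xml before the real script, or the same sub(/\r$/, "") call could be placed as the first action of the script itself.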