Help joining separate lines into a single line from an XML file

Ophiuchus · February 23, 2018, 1:02am

Hi all,

I need help parsing this XML file. It has paragraphs broken across several lines, and I would like to join each paragraph into a single line.

I hope you can understand my explanation. Thanks for any help/direction.

The script could be in bash, awk, ruby, perl, or whatever.

In the output I want:
The values with font=9 as the first line of each group.
The 1st value with font=8 immediately below the font=9 value as the 2nd line of each group.
The 2nd value with font=8 below the font=9 value as the first column of each group.
The values with font=10 as the second column of each group.
And finally, the font=8 values that follow each font=10 value joined into a single line.

input

	<text top="333" left="98" width="93" height="16" font="9"><b>OS family </b></text>
	<text top="350" left="98" width="192" height="16" font="8">Unix pk1</text>
	<text top="368" left="98" width="12" height="16" font="8">1 </text>
	<text top="365" left="112" width="5" height="11" font="10">1</text>
	<text top="368" left="118" width="308" height="16" font="8"> originally meant to be a </text>
	<text top="365" left="427" width="5" height="11" font="10">2</text>
	<text top="368" left="433" width="4" height="16" font="8"> </text>
	<text top="385" left="98" width="339" height="16" font="8">convenient platform</text>
	<text top="402" left="98" width="339" height="16" font="8"> for programmers</text>
	
	<text top="333" left="98" width="93" height="16" font="9"><b>Source model </b></text>
	<text top="350" left="98" width="192" height="16" font="8">Unix pk2</text>
	<text top="368" left="98" width="12" height="16" font="8">2 </text>
	<text top="365" left="112" width="5" height="11" font="10">1</text>
	<text top="368" left="118" width="308" height="16" font="8">Historically </text>
	<text top="368" left="118" width="308" height="16" font="8">closed-source </text>
	<text top="365" left="427" width="5" height="11" font="10">2</text>
	<text top="368" left="433" width="4" height="16" font="8"> </text>
	<text top="385" left="98" width="339" height="16" font="8">, while some Unix</text>
	<text top="402" left="98" width="339" height="16" font="8"> projects (including BSD family and Illumos)</text>
	<text top="402" left="98" width="339" height="16" font="8"> are open-source.</text>
	<text top="402" left="98" width="339" height="16" font="8"> Development started in 1969.</text>
	<text top="365" left="427" width="5" height="11" font="10">3</text>
	<text top="402" left="98" width="339" height="16" font="8">this is</text>
	<text top="402" left="98" width="339" height="16" font="8"> last paragraph.</text>

desired output

	OS family
	Unix pk1
	1 1 originally meant to be a
	1 2 convenient platform for programmers
	
	Source model 
	Unix pk2
	2 1 Historically closed-source 
	2 2 , while some Unix projects (including BSD family and Illumos) are open-source. Development started in 1969.
	2 3 this is last paragraph.

Don_Cragun · February 23, 2018, 2:12am

As always, it helps if we know what operating system you're using and what you have tried to solve this problem on your own.

By listing bash along with awk, ruby, and perl, are you saying that bash is the shell that you use?

We are here to help you learn how to use the tools available on your system to do things like this; not to act as your unpaid programming staff.

Ophiuchus · February 24, 2018, 1:04am

Hello Don,

My apologies for any misunderstanding.

I'm using Cygwin on Windows and Ubuntu 16.04.2 LTS on Windows.

I think awk or ruby would be preferable, so that I can understand any direction someone could share with me.

The code I've been able to construct so far is in awk, but the output is far from my desired one.

awk '/font="9">/ {a = gensub(/(<.*\">)(.*)(<.*)/, "\\2", "g", $0 )}
     /font="8">/ {
     z++; 
     if(z==1){ b = gensub(/(<.*\">)(.*)(<.*)/, "\\2", "g", $0 )}
     if(z==2){ c = gensub(/(<.*\">)(.*)(<.*)/, "\\2", "g", $0 )}
     if(z>2 ){ d = d " " gensub(/(<.*\">)(.*)(<.*)/, "\\2", "g", $0 )}
     }
     /font="10">/{d = ""; e = gensub(/(<.*\">)(.*)(<.*)/, "\\2", "g", $0 )

     print a"\n"b"\n"c"\n"e,d; z=0}' input.xml

My current output is:

<b>OS family </b>
Unix pk1
1
1
<b>OS family </b>
 originally meant to be a
1
2
<b>Source model </b>

convenient platform
1
<b>Source model </b>
Historically
closed-source
2
<b>Source model </b>

, while some Unix
3

I hope someone could give some help on this.

Thanks in advance.

RudiC · February 24, 2018, 4:30am

Try

awk '
        {match ($0, /font="[^"]*"/)
         FNT = substr ($0, RSTART+6, RLENGTH-7)
         gsub (/^\t|<[^>]*>/, _)

         if (FNT ==  9) {LVL = 1
                         printf "%s%s" ORS, TRS, $0
                         TRS = ORS ORS
                        }
         if (FNT ==  8) {if (LVL == 3)   printf "%s ", $0
                         if (LVL == 2)  {LVL = 3
                                         GRP1 = $0
                                        }
                         if (LVL == 1)  {LVL = 2
                                         printf "%s", $0
                                        }
                         }
         if (FNT == 10) {GRP2 = $0
                         printf ORS "%s %s ", GRP1, GRP2
                        }
        }

END     {printf ORS
        }
' file
OS family 
Unix pk1
1  1  originally meant to be a  
1  2   convenient platform  for programmers 

Source model 
Unix pk2
2  1 Historically  closed-source  
2  2   , while some Unix  projects (including BSD family and Illumos)  are open-source.  Development started in 1969. 
2  3 this is  last paragraph.

EDIT: Looks like above can be simplified:

awk '
        {match ($0, /font="[^"]*"/)
         FNT = substr ($0, RSTART+6, RLENGTH-7)
         gsub (/^\t|<[^>]*>/, _)

         if (FNT ==  9) {LVL = 1
                         printf "%s%s" ORS, TRS, $0
                         TRS = ORS ORS
                        }
         if (FNT ==  8)  if (LVL++ == 2)        GRP1 = $0
                           else                 printf "%s ", $0
         if (FNT == 10)  printf ORS "%s %s ", GRP1, $0
        }

END     {printf ORS
        }
' file
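
The doubled spaces in the output above appear because the text values already carry their own leading or trailing blanks and printf adds one more. A possible follow-up, which is not part of the script as posted, is to pipe the result through a second awk that squeezes runs of spaces. This is only a sketch, and squeeze_spaces is a made-up helper name:

```shell
# squeeze_spaces: hypothetical helper name, not from the script above.
squeeze_spaces() {
    awk '{gsub (/ +/, " ")        # collapse each run of spaces into one
          gsub (/^ | $/, "")      # drop a single leading or trailing space
          print}' "$@"
}

# Demo on two lines shaped like the output above.
printf '%s\n' '1  1  originally meant to be a  ' \
              '1  2   convenient platform  for programmers ' | squeeze_spaces
```

Used as a filter, that would be: awk '...' file | squeeze_spaces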

Ophiuchus · February 24, 2018, 1:40pm

Hi RudiC,

Thanks for your help.

I see your script prints the desired output, but when I try it the output is different.

I get this output.

OS family
 nix pk1
  originally meant to be a
  for programmersorm

Source model
 nix pk2
 closed-source
  Development started in 1969.ly and Illumos)
  last paragraph.

RudiC · February 24, 2018, 1:55pm

I was afraid of that when reading your system info. What awk version do you use? Are you sure you ran the script exactly as given? And with the data as given in the sample?

Pls. post the output of

awk '{match ($0, /font="[^"]*"/); LVL = substr ($0, RSTART+6, RLENGTH-7); gsub (/<[^>]*>/, _); print LVL, $0}' file

Ophiuchus · February 24, 2018, 2:09pm

Hi RudiC,

Yes. I ran it exactly as given, with the same input as pasted in the forum.

In Ubuntu system

$ awk -W version
GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4, GNU MP 6.1.0)
Copyright (C) 1989, 1991-2015 Free Software Foundation.

In Cygwin:

$ awk -W version
GNU Awk 4.1.4, API: 1.1 (GNU MPFR 3.1.5-p2, GNU MP 6.1.2)
Copyright (C) 1989, 1991-2016 Free Software Foundation.

$ awk '{match ($0, /font="[^"]*"/); LVL = substr ($0, RSTART+6, RLENGTH-7); gsub (/<[^>]*>/, _); print LVL, $0}' input.xml
9       OS family
8       Unix pk1
8       1
10      1
8        originally meant to be a
10      2
8
8       convenient platform
8        for programmers

9       Source model
8       Unix pk2
8       2
10      1
8       Historically
8       closed-source
10      2
8
8       , while some Unix
8        projects (including BSD family and Illumos)
8        are open-source.
8        Development started in 1969.
10      3
8       this is
8        last paragraph.

RudiC · February 24, 2018, 2:24pm

Hmmm ... strange. Identical to what I get. Try either proposal in post #4 again after dropping the ^\t| from the gsub regex.

Ophiuchus · February 25, 2018, 12:55am

Hi RudiC,

Thank you.

I discovered why the output differs. The line ending of the file is CRLF since I'm working on Windows. I changed the line ending to LF and the output is the same as yours.
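
For anyone who hits the same symptom: with CRLF input the carriage return stays at the end of each value, so the terminal jumps back to column 1 and later text overwrites earlier text. Instead of converting the file first (for example with tr -d '\r' < input.xml), the CR can be stripped inside awk. Below is a minimal sketch, assuming the simplified script from post #4 with one added sub() line. The wrapper name join_fonts is made up for illustration:

```shell
# join_fonts is a hypothetical wrapper name. The awk body is the simplified
# script from this thread plus one added line that strips a trailing CR.
join_fonts() {
    awk '
        {sub (/\r$/, "")                 # added: drop the CR left by a CRLF line ending
         match ($0, /font="[^"]*"/)
         FNT = substr ($0, RSTART+6, RLENGTH-7)
         gsub (/^\t|<[^>]*>/, _)

         if (FNT ==  9) {LVL = 1
                         printf "%s%s" ORS, TRS, $0
                         TRS = ORS ORS
                        }
         if (FNT ==  8)  if (LVL++ == 2)        GRP1 = $0
                           else                 printf "%s ", $0
         if (FNT == 10)  printf ORS "%s %s ", GRP1, $0
        }

END     {printf ORS
        }
' "$@"
}

# Demo: two CRLF-terminated sample lines fed on stdin.
printf '%s\r\n' \
    '<text top="333" left="98" width="93" height="16" font="9"><b>OS family </b></text>' \
    '<text top="350" left="98" width="192" height="16" font="8">Unix pk1</text>' \
  | join_fonts
```

With a file it would be called as: join_fonts input.xml. The sub() does nothing on lines that end in plain LF, so the same script should work on both kinds of file.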