Parsing XML

Learned People,

Hello !

Till today, for the most part, all of the tricky questions/situations that I encountered were already posted by other folks and all I had to do was peruse through these one at a time and I could find some sort of an answer and all I had to do was add some minor tweaks and it would fly.

Today, I ran into this situation where in Im parsing XML data that is stored in flat files. I found a perl based answer on this forums that sounds like I can use it but there is no perl installation on the machine & what ever I can do, I tried it and it is not working. Also, this is not a well-formed xml and most of it is one continuos line of 2million+ characters and even when the file size runs into a gig or 1.2g, it only has 20,000 lines.

Here is an example below -

Sample input file ------

$head -1 ROAMERCDRFILE

MDN>9168761121</MDN>\n<HomeSId>NA</HomeSId>\n<ESN>NA</ESN>\n</HOMESID_Data>\n<HOMESID_Data>\n<roamer>CDMA BC6</roamer>\n<Subroamer>Mod</Subroamer>\n<Subsubroamer>Port A BC6 CH 25 Mid2 Power Phase Error Peak</Subsubroamer>\n<ServSid>CDMA BC6 Mod Port A BC6 CH 25 Mid2 Power Phase Error Peak</ServSid>\n<SPCS_Priority>1</SPCS_Priority>\n<Value>0</Value>\n<MSN>Degree</MSN>\n<HomeSId>NA</HomeSId>\n<ESN>NA</ESN>\n</HOMESID_Data>\n<HOMESID_Data>\n<roamer>CDMA BC6</roamer>\n<Subroamer>Mod</Subroamer>\n<Subsubroamer>Port A BC6 CH 25 Mid2 Power Phase Error RMS</Subsubroamer>\n<ServSid>CDMA BC6 Mod Port A BC6 CH 25 Mid2 Power Phase Error RMS</ServSid>\n<SPCS_Priority>1</SPCS_Priority>\n<Value>1.87</Value>\n<MSN>Degree</MSN>\n<HomeSId>NA</HomeSId>\n<ESN>NA</ESN>\n</HOMESID_Data>\n<HOMESID_Data>\n<roamer>CDMA BC6</roamer>\n<Subroamer>Mod</Subroamer>\n<Subsubroamer>Port A BC6 CH 25 Mid2 Power Freq Error</Subsubroamer>\n<ServSid>CDMA BC6 Mod Port A BC6 CH 25 Mid2 Power Freq Error</ServSid>\n<SPCS_Priority>1</SPCS_Priority>\n<Value>4.47</Value>\n<MSN>Hz</MSN>\n<HomeSId>150</HomeSId>\n<ESN>-150</ESN>\n</HOMESID_Data>\n<HOMESID_Data>\n<roamer>CDMA BC6</roamer>\n<Subroamer>Mod</Subroamer>\n<Subsubroamer>Port A BC6 CH 25 Mid2 Power Transmit Time Error</Subsubroamer>\n<ServSid>CDMA BC6 Mod Port A BC6 CH 25 Mid2 Power Transmit Time Error</ServSid>\n<SPCS_Priority>1</SPCS_Priority>\n<Value>0</Value>\n<MSN>us</MSN>\n<HomeSId>NA</HomeSId>\n<ESN>NA</ESN>\n</HOMESID_Data>\n<HOMESID_Data>\n<roamer>CDMA BC6</roamer>\n<Subroamer>Power</Subroamer>\n<Subsubroamer>Port A BC6 CH 25 Mid1 Power</Subsubroamer>\n<ServSid>CDMA BC6 Power Port A BC6 CH 25 Mid1Power</ServSid>\n<SPCS_Priority>1</SPCS_Priority>\n<Value>-10.35</Value>\n<MSN>dBm</MSN>\n<HomeSId>-7</HomeSId>\n<ESN>-13</ESN>\n</HOMESID_Data>\n<HOMESID_Data>\n<roamer>CDMA BC6</roamer>\n<Subroamer>Mod</Subroamer>\n<Subsubroamer>Port A BC6 CH 25 Mid1 Power Rho</Subsubroamer>\n<ServSid>CDMA BC6 Mod Port A BC6 CH 25 Mid1 Power Rho</ServSid>\n<SPCS_Priority>1</SPCS_Priority>\n<Value>99.86</Value>\n<MSN>N/A</MSN>\n<HomeSId>100</HomeSId>\n<ESN>98</ESN>\n

What Im looking for is -

9168761121
	NA
	Mod
	CDMA BC6
	0
	NA
            ....
            .....
            ...

This is what I stitched together from copying stuff around ....

awk 'BEGIN{FS="<|>"}
{print ESN, SubRoamer, Roamer, Value
ESN=""
SubRoamer =""          
Roamer =""
Value=""
} 
{for(i=1;i<=NF;i++) {if($i=="ESN"){ESN=$(i+1);continue}}}
{for(i=1;i<=NF;i++) {if($i=="SubRoamer"){SubRoamer =$(i+1); continue}}}
{for(i=1;i<=NF;i++) {if($i=="Roamer"){Roamer =$(i+1); continue}}}
{for(i=1;i<=NF;i++) {if($i=="Value"){Value =$(i+1); continue}}}
END {print ESN, SubRoamer, Roamer, Value}' ROAMERCDRFILE

& it print the following ...

   
98   99.86

which is wrong.

Also, I pieced together a very laborious looking thing as below -

cat ROAMERCDRFILE  | tr '>' '\n' | egrep 'ESN|Subroamer|roamer|Value' | sed 's/ESN//g' | sed 's/Subroamer//g' | sed 's/roamer//g' | sed 's/Value//g' | sed 's/<//g' |sed 's/>//g' | sed 's/\///g' | sed 's/\\//g'

& it is showing records as below -

n
NA
n
CDMA BC6
n
Mod
n
Port A BC6 CH 25 Mid2 Power Phase Error Peak
n
0
n
NA
n
CDMA BC6
n
Mod
n
Port A BC6 CH 25 Mid2 Power Phase Error RMS
n
1.87
n
NA
n

But then, that is taking lots and lots of time and so far, I havent seen result pouring out not even once.

Any and all help would be such a huge relief.

Please help !

regards,
Manohar

Is this gonna be work for you?

echo '.....' | awk 'BEGIN {FS=">";RS="<"}{$0=$2;sub(/\\/,"");print}' 

O/P:

9168761121
n
NA
n
NA
n
n
n
CDMA BC6
n
Mod
n
Port A BC6 CH 25 Mid2 Power Phase Error Peak
n
CDMA BC6 Mod Port A BC6 CH 25 Mid2 Power Phase Error Peak
n
1
n
0
n
Degree
n
NA
n
NA
n
n
n
CDMA BC6
n
Mod
n
Port A BC6 CH 25 Mid2 Power Phase Error RMS
n
CDMA BC6 Mod Port A BC6 CH 25 Mid2 Power Phase Error RMS
n
1
n
1.87
n
Degree
n
NA
n
NA
n
n
n
CDMA BC6
n
Mod
n
Port A BC6 CH 25 Mid2 Power Freq Error
n
CDMA BC6 Mod Port A BC6 CH 25 Mid2 Power Freq Error
n
1
n
4.47
n
Hz
n
150
n
-150
n
n
n
CDMA BC6
n
Mod
n
Port A BC6 CH 25 Mid2 Power Transmit Time Error
n
CDMA BC6 Mod Port A BC6 CH 25 Mid2 Power Transmit Time Error
n
1
n
0
n
us
n
NA
n
NA
n
n
n
CDMA BC6
n
Power
n
Port A BC6 CH 25 Mid1 Power
n
CDMA BC6 Power Port A BC6 CH 25 Mid1Power
n
1
n
-10.35
n
dBm
n
-7
n
-13
n
n
n
CDMA BC6
n
Mod
n
Port A BC6 CH 25 Mid1 Power Rho
n
CDMA BC6 Mod Port A BC6 CH 25 Mid1 Power Rho
n
1
n
99.86
n
N/A
n
100
n
98
n

anchal_khare,

Thanks a ton for the reply.

Yes, it is doing it but I had to obfuscate some of the underlying stuff in here for obvious reasons of confidentiality and hence its not fetching the records but it is showing some of the records in here that Im looking for.

So, can you please please explain to me as to what is going on in here, as I understand the field separators and the substitution component a little bit of it but why are you doing this - $0=$2 and what are we gaining out of this ? The reason as to why I ask you this is SO that I can totally tailor my solution instead of posting in here...

awk 'BEGIN {FS=">";RS="<"}{$0=$2;sub(/\\/,"");print}'

Thanks again !

regards,
Manohar.

RS means record separator (each line).
Setting it to "<" was actually means the input line is being splitted after each "<" character. So that you will get values like ..

Subroamer>Power
/Subroamer>\n
Subsubroamer>Port A BC6 CH 25 Mid1 Power
ServSid>CDMA BC6 Power Port A BC6 CH 25 Mid1Power

As you know, We are interested in the later part ( after the ">" character), so FS (field separator ">" is used and in this way fetching second field ($2).

$0=$2, is replacing everything in the line with $2 so that processing could be simpler.
You could skip that part. but you need to do something like this '..... {sub(/\\/,"",$2);print $2}' .

anchal_khare,

I got it the moment I posted reply to my question. That is so nice that you are helping me out here. Thank you and thank you.
I have a bigger question now. So, now this would just strip out the head & tail as well & display the data in its elemental nature...equivalent of just showing up the data all line by line.

So, how do I determine what header tag is related to what data element in here ? I cant just start at data element x & go about as these are optional fields in the db and as such, the cell site might not even post that value field in there. Is there any other way ? Im sorry to throw back another question right at you like this but I just dont know where else to go or what else should I do in here.

with DUE THANKS,

regards,
Anil

I worked on the output you have shown in your requirement (#1).
You didn't mention that you are interested in element names.

About element names, as you mentioned that its not a well formed XML string, therefore I am not sure about the correctness of any solution until you tell us some specific criteria.

With the given input , you can do something like..

echo '......' | awk 'BEGIN {FS=">";RS="<"} !/^\//{printf "%-20s%-20s\n", $1,$2}'


MDN                 9168761121          
HomeSId             NA                  
ESN                 NA                  
HOMESID_Data        \n                  
roamer              CDMA BC6            
Subroamer           Mod                 
Subsubroamer        Port A BC6 CH 25 Mid2 Power Phase Error Peak
ServSid             CDMA BC6 Mod Port A BC6 CH 25 Mid2 Power Phase Error Peak
SPCS_Priority       1                   
Value               0                   
MSN                 Degree              
HomeSId             NA                  
ESN                 NA                  
HOMESID_Data        \n                  
roamer              CDMA BC6            
Subroamer           Mod                 
Subsubroamer        Port A BC6 CH 25 Mid2 Power Phase Error RMS
ServSid             CDMA BC6 Mod Port A BC6 CH 25 Mid2 Power Phase Error RMS
SPCS_Priority       1                   
Value               1.87                
MSN                 Degree              
HomeSId             NA                  
ESN                 NA                  
HOMESID_Data        \n                  
roamer              CDMA BC6            
Subroamer           Mod                 
Subsubroamer        Port A BC6 CH 25 Mid2 Power Freq Error
ServSid             CDMA BC6 Mod Port A BC6 CH 25 Mid2 Power Freq Error
SPCS_Priority       1                   
Value               4.47                
MSN                 Hz                  
HomeSId             150                 
ESN                 -150                
HOMESID_Data        \n                  
roamer              CDMA BC6            
Subroamer           Mod                 
Subsubroamer        Port A BC6 CH 25 Mid2 Power Transmit Time Error
ServSid             CDMA BC6 Mod Port A BC6 CH 25 Mid2 Power Transmit Time Error
SPCS_Priority       1                   
Value               0                   
MSN                 us                  
HomeSId             NA                  
ESN                 NA                  
HOMESID_Data        \n                  
roamer              CDMA BC6            
Subroamer           Power               
Subsubroamer        Port A BC6 CH 25 Mid1 Power
ServSid             CDMA BC6 Power Port A BC6 CH 25 Mid1Power
SPCS_Priority       1                   
Value               -10.35              
MSN                 dBm                 
HomeSId             -7                  
ESN                 -13                 
HOMESID_Data        \n                  
roamer              CDMA BC6            
Subroamer           Mod                 
Subsubroamer        Port A BC6 CH 25 Mid1 Power Rho
ServSid             CDMA BC6 Mod Port A BC6 CH 25 Mid1 Power Rho
SPCS_Priority       1                   
Value               99.86               
MSN                 N/A                 
HomeSId             100                 
ESN                 98                   

Hope it helps.