Parsing XML in awk: OFS does not work as expected

martin.franek · December 31, 2010, 6:35am

Hi,

I am trying to parse a regular XML file in which I have to reduce the number of decimal places in some XML elements. I am using the following awk command to achieve that:

#!/bin/ksh

EDITCMD='BEGIN { FS = "[\<\>]"; OFS=FS }
{
    if ( $3 ~ "[0-9][0-9]*\\.[0-9][0-9]*" && length(substr($3,1+index($3,"."))) == 15 ) {
        PRE=substr($3,1,index($3,".")-1);
        POST=substr($3,1+index($3,"."),5);
        $3 = PRE "." POST
    }
    {
        print $0
    }
}'
nawk "$EDITCMD" /path/file.xml

The problem is that I cannot get OFS to print correctly on the lines where the transformation was applied. The output looks like this:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<Import xmlns:xsi="">
<INSTRUMENT>
<INSTRUMENT_CD>00036AAB1</INSTRUMENT_CD>
<BUNDLE_ID>48328</BUNDLE_ID>
<ACCRUAL_DT>5/8/2001</ACCRUAL_DT>
[<>]AMT_ISU[<>]125000000.00000[<>]/AMT_ISU[<>]
<ANNOUNCE_DT>5/1/2001</ANNOUNCE_DT>
<CD_INSTMT_TYPE>UNKNOWN</CD_INSTMT_TYPE>
<CHANGE_DT>5/7/2009 21:02:01.370</CHANGE_DT>
..
..

What am I doing wrong? The FS definition seems to be correct, since the transformation is applied to the correct fields, but why does OFS not hold the corresponding FS character when the line is printed? Escaping, double-escaping, or not escaping these characters in FS made no difference.

Thanks for your help,

Martin

Scrutinizer · December 31, 2010, 7:22am

Try: sub($3,PRE"."POST) instead of $3 = PRE "." POST and then you can leave out OFS=FS
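
Why this helps: sub() edits $0 directly, so awk never rebuilds the record from its fields and OFS is never consulted. One caveat is that sub($3, ...) treats the old field value as a regular expression and replaces its first match anywhere on the line. A minimal sketch of a literal-string alternative, assuming one <tag>value</tag> element per line as in the sample output (the function name truncate_decimals is hypothetical):

```shell
#!/bin/sh
# Hypothetical helper: cut 15-digit fractions down to 5 digits
# without assigning to any field, so $0 is never rebuilt with OFS.
truncate_decimals() {
    awk 'BEGIN { FS = "[<>]" }
    $3 ~ /^[0-9]+\.[0-9]+$/ && length(substr($3, index($3, ".") + 1)) == 15 {
        short = substr($3, 1, index($3, ".") + 5)
        # Locate the old value as a literal string, not as a regex.
        needle = ">" $3 "<"
        pos = index($0, needle)
        $0 = substr($0, 1, pos) short substr($0, pos + length(needle) - 1)
    }
    { print }'
}

printf '%s\n' '<AMT_ISU>125000000.123456789012345</AMT_ISU>' | truncate_decimals
# <AMT_ISU>125000000.12345</AMT_ISU>
```

Lines whose third field is not a number with exactly 15 fractional digits pass through unchanged.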

martin.franek · December 31, 2010, 7:43am

Thanks Scrutinizer, your advice works fine.

However, I would still be interested in how to use OFS properly when FS is a regular expression or a set of characters and I do not want to change the corresponding output separator; I just need to access and modify some of the fields.

Any other ideas?

Thanks & Regards

Scrutinizer · December 31, 2010, 7:55am

Contrary to FS, OFS is never interpreted as a regex; it is a plain string that is printed literally between fields. So IMO that would not be possible.
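
The effect can be reproduced in a few lines. Assigning to any field makes awk rebuild $0 by joining the fields with OFS, and OFS is used as a literal string, so the text [<>] ends up in the output. A small sketch, assuming a generic POSIX awk and a made-up sample line; the second function shows one possible workaround for this particular layout, where the fields are re-joined by hand with the delimiters that the regex FS consumed (both function names are hypothetical):

```shell
#!/bin/sh
sample='<AMT_ISU>125000000.123456789012345</AMT_ISU>'

# Reproduces the problem: OFS is copied verbatim when $0 is rebuilt.
rebuild_with_ofs() {
    awk 'BEGIN { FS = "[<>]"; OFS = FS } { $3 = "x"; print }'
}

# Workaround sketch for lines of the form <tag>number</tag>:
# modify the field, then print the pieces with explicit delimiters.
rebuild_by_hand() {
    awk 'BEGIN { FS = "[<>]" }
    NF == 5 && $3 ~ /^[0-9]+\.[0-9]+$/ {
        $3 = substr($3, 1, index($3, ".") + 5)
        print "<" $2 ">" $3 "<" $4 ">"
        next
    }
    { print }'
}

printf '%s\n' "$sample" | rebuild_with_ofs   # [<>]AMT_ISU[<>]x[<>]/AMT_ISU[<>]
printf '%s\n' "$sample" | rebuild_by_hand    # <AMT_ISU>125000000.12345</AMT_ISU>
```

The hand-built print only works because every separator position is known in advance; for arbitrary input the sub() approach is simpler.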

m.d.ludwig · January 1, 2011, 9:57am

As it does not look like you are validating tags, and you are shortening any number with 15 decimal places, maybe sed would be a "better" choice:

sed -e 's/>\([0-9][0-9]*\.[0-9][0-9][0-9][0-9][0-9]\)[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]</>\1</g' inputfile

(Yes, I know "-e" is not necessary, but I am one of those boring, make-it-obvious kind of people.)
This way, you don't have to worry whether the file being changed was formatted as shown above, or as:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<Import xmlns:xsi="">
<INSTRUMENT><INSTRUMENT_CD>00036AAB1</INSTRUMENT_CD><BUNDLE_ID>48328</BUNDLE_ID><ACCRUAL_DT>5/8/2001</ACCRUAL_DT><AMT_ISU>125000000.123456789012345</AMT_ISU><ANNOUNCE_DT>5/1/2001</ANNOUNCE_DT><CD_INSTMT_TYPE>UNKNOWN</CD_INSTMT_TYPE><CHANGE_DT>5/7/2009 21:02:01.370</CHANGE_DT>...
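
To see that the substitution does not depend on line layout, it can be run on a single-line fragment. The sketch below writes the same pattern more compactly with POSIX interval expressions (\{5\} and \{10\}) in place of the repeated [0-9]; the sample data and the function name shorten are made up for illustration:

```shell
#!/bin/sh
# Keep 5 fractional digits and drop the remaining 10, wherever
# ">number<" occurs on the line.
shorten() {
    sed -e 's/>\([0-9][0-9]*\.[0-9]\{5\}\)[0-9]\{10\}</>\1</g'
}

printf '%s\n' '<A>1.123456789012345</A><B>22.000000000000009</B><C>3.14</C>' | shorten
# <A>1.12345</A><B>22.00000</B><C>3.14</C>
```

Values with fewer or more than 15 fractional digits, such as 3.14 here, are left untouched.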

Now if you need to make sure the tags match, you can change the regex to:

s:<\([^>]*\)>\([0-9][0-9]*\.[0-9][0-9][0-9][0-9][0-9]\)[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]</\1>:<\1>\2</\1>:g

Or even list the specific tags you want to change:

s:<\(AMT_ISU\|anothertag\)>\([0-9][0-9]*\.[0-9][0-9][0-9][0-9][0-9]\)[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]</\1>:<\1>\2</\1>:g
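
Note that the \| alternation in the last command is a GNU sed extension. A hedged sketch of a more portable variant, assuming a POSIX sed: one -e expression per tag name, each requiring the matching closing tag. The PRICE tag, the sample line, and the function name shorten_tags are invented for illustration:

```shell
#!/bin/sh
# One substitution per tag name, so no \| alternation is needed.
shorten_tags() {
    # 5 kept fractional digits (captured) followed by 10 dropped ones.
    num='\([0-9][0-9]*\.[0-9]\{5\}\)[0-9]\{10\}'
    sed -e "s:<AMT_ISU>${num}</AMT_ISU>:<AMT_ISU>\1</AMT_ISU>:g" \
        -e "s:<PRICE>${num}</PRICE>:<PRICE>\1</PRICE>:g"
}

printf '%s\n' '<AMT_ISU>125000000.123456789012345</AMT_ISU><OTHER>1.123456789012345</OTHER>' | shorten_tags
# <AMT_ISU>125000000.12345</AMT_ISU><OTHER>1.123456789012345</OTHER>
```

Tags that are not listed, like OTHER in the sample, keep their full precision.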