How to find differences between 2 XML files?

Hello All,

The requirement is to compare 2 XML files and see if there are any differences, but from some of the providers we are receiving UTF-16 encoded XML files with no end-of-line characters, as shown below.

Excerpt of data file:

ÿþ<^@?^@x^@m^@l^@ ^@v^@e^@r^@s^@i^@o^@n^@=^@"^@1^@.^@0^@"^@ ^@e^@n^@c^@o^@d^@i^@n^@g^@=^@"^@U^@T^@F^@-^@1^@6^@"^@ ^@?^@>^@<^@P^@r^@o^@v^@i^@d^@e^@r^@ ^@x^^@l^@n^@s^@=^@"^@h^@t^@t^@p^@:^@/^@/^@w^@w^@w^@.^@f^@c^@a^@.^@g^@o^@v^@/^@F^@C^@S^@L^@o^@a^@n^@s^@"^@ ^@x^@m^@l^@n^@s^@:^@x^@s^@i^@=^@"^@h^@t^@t^@p^@:^@/^@/^@w^@w^@w^@.^@w^@3^@.^@o^@r^@g^@/^@2^@0^@0^@1^@/^@X^@M^@L^@S^@c^@h^@e^@m^@a^@-^@i^@n^@s^@t^@a^@n^@c^@e^@"^@

2014-03-31_17_V2.5.XML [readonly][noeol][converted] 2L, 18676154C

I used the iconv command to convert this file to UTF-8.
Now I can read the data in the XML file, but everything comes out as a single line.

wc -l 2014-03-31_17_V2.5.XML.utf8
	1 2014-03-31_17_V2.5.XML.utf8
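
For reference, the conversion step described above might look like the following sketch. It assumes the input starts with a byte-order mark (the ÿþ at the start of the excerpt suggests it does), so that iconv can detect the byte order. The function name to_utf8 and the .utf8 suffix are only illustrative.

```shell
# Hypothetical helper: convert a UTF-16 file (with BOM) to UTF-8,
# writing the result next to the input with a .utf8 suffix.
to_utf8() {
    iconv -f UTF-16 -t UTF-8 "$1" > "$1.utf8"
}

# Example (file name taken from the post above):
# to_utf8 2014-03-31_17_V2.5.XML
```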

How can I put an end-of-line character after each XML tag?

Once the XML tags in my data file are separated by end-of-line characters, I want to run diff on two XML files to find the differences. Please help.

Thank you.
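
The overall goal described in this post (put each tag on its own line, then diff) could be sketched roughly as below. This is only an outline under a few assumptions: both files have already been converted to UTF-8, mktemp is available, and the name xml_diff is made up for illustration. It splits after every ">" and does not try to keep an element and its text on one line.

```shell
# Hypothetical helper: split two one-line XML files after every ">"
# and compare the results line by line with diff.
xml_diff() {
    xd_a=$(mktemp) || return 2
    xd_b=$(mktemp) || return 2
    awk 'BEGIN { RS = ">"; ORS = ">\n" } { print }' "$1" > "$xd_a"
    awk 'BEGIN { RS = ">"; ORS = ">\n" } { print }' "$2" > "$xd_b"
    diff "$xd_a" "$xd_b"
    xd_status=$?              # 0 = identical, 1 = differences found
    rm -f "$xd_a" "$xd_b"
    return "$xd_status"
}

# Example: xml_diff old.xml.utf8 new.xml.utf8
```

The exit status follows diff, so the function can be used directly in an if statement.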

If you can live with a newline added after every tag, here is a simple way to do it:

awk '
BEGIN {	RS = ">"
	ORS = ">\n"
}
{	$1 = $1
	print
}' file.xml

This will add a newline after each greater than sign (which will also add the missing newline to the end of your file).

Change awk to /usr/xpg4/bin/awk if you're using a Solaris/SunOS system.

Sorry, I should have mentioned my OS: I am using Red Hat Linux. I tried executing your code but I am receiving an error message.

awk 'BEGIN {RS = ">" ORS = ">\n"}{$1 = $1 print}' 2014-03-31_17_V2.5.XML.utf8

awk: BEGIN {RS = ">" ORS = ">\n"}{$1 = $1 print}
awk:                     ^ syntax error
awk: BEGIN {RS = ">" ORS = ">\n"}{$1 = $1 print}
awk:                                      ^ syntax error

If you use the code exactly as I gave it to you, there will not be any syntax errors.
If you join lines of code without adding statement separators, you should expect syntax errors, or just bad results! Please try the code I suggested (keeping newlines where I had them) and let us know what happens.
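
To illustrate the point about statement separators: if the script has to go on a single line, each statement needs a semicolon where the newline used to be. A sketch of the same script in one-line form (the wrapper function name split_tags is only for illustration):

```shell
# Same awk program as the multi-line version, with semicolons
# separating the statements that used to sit on their own lines.
split_tags() {
    awk 'BEGIN { RS = ">"; ORS = ">\n" } { $1 = $1; print }' "$@"
}

# Example: split_tags 2014-03-31_17_V2.5.XML.utf8
```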

Below is an excerpt of the output from the awk script.

<?xml version="1.0" encoding="UTF-16" ?>
<Provider PROVIDER="17" SCHEMA_VERSION="2.5">
<Institution UNINUM="xxxx" EXTRACT_DATE="2013-12-31" CUSTOMER_ROW_COUNT="1577" LOAN_ROW_COUNT="3322" BOOK_VALUE_DOLLARS="720163381.46" BOOK_VALUE_COUNT="3115" PAST_DUE_AMOUNT_DOLLARS="3630254.00" PAST_DUE_AMOUNT_COUNT="23" ACCEPTABLE_VOL_DOLLARS="693647325.79" ACCEPTABLE_VOL_COUNT="3058" ACCRUED_INTEREST_DOLLARS="10221888.55" ACCRUED_INTEREST_COUNT="2877" DOUBTFUL_VOL_DOLLARS="2374.32" DOUBTFUL_VOL_COUNT="3" OAEM_VOL_DOLLARS="17835860.04" OAEM_VOL_COUNT="16" PRINCIPAL_BALANCE_DOLLARS="709941492.91" PRINCIPAL_BALANCE_COUNT="3095" SUBSTANDARD_VOL_DOLLARS="8677821.31" SUBSTANDARD_VOL_COUNT="38" PD_RATING_VALUES="20603" PD_RATING_COUNT="3322" BEGINNING_FARMER_FLAG_COUNT="636" SMALL_FARMER_FLAG_COUNT="1505" YOUNG_FARMER_FLAG_COUNT="580">
<Customer CIF="14338">
<BORROWER_NAME>
xxxxxxxxxxxx</BORROWER_NAME>
<FIPS_CODE>
49023</FIPS_CODE>
<RELATED_PARTY_LOAN_CODE>
0</RELATED_PARTY_LOAN_CODE>
<RELATIONSHIP_ESTABLISH_DATE>
1999-03-15</RELATIONSHIP_ESTABLISH_DATE>
<LAST_RISK_RATING_CHANGE_DATE>
2012-06-12</LAST_RISK_RATING_CHANGE_DATE>
<BALANCE_SHEET_DATE>
2012-11-30</BALANCE_SHEET_DATE>
<INCOME_STATEMENT_DATE>
2013-12-31</INCOME_STATEMENT_DATE>
<DEBT_REPAYMENT_COVERAGE_RATIO>
1.2083000000</DEBT_REPAYMENT_COVERAGE_RATIO>
<CURRENT_ASSETS>
2216074.00</CURRENT_ASSETS>
<CURRENT_LIABILITIES>
2036364.00</CURRENT_LIABILITIES>
<FARM_OPS_EXP>
3759280.00</FARM_OPS_EXP>
<GROSS_AG_INC>
3553127.00</GROSS_AG_INC>
<INT_EXP>
123050.00</INT_EXP>
<NON_CURR_ASSET>
14172350.00</NON_CURR_ASSET>
<NON_CURR_LIABILITIES>
2491022.00</NON_CURR_LIABILITIES>
<NET_AG_INC>
-206153.00</NET_AG_INC>
<NET_INC>
498043.00</NET_INC>
<NET_WORTH>
11861038.00</NET_WORTH>
<NONFARM_INC>
704196.00</NONFARM_INC>
<TOTAL_ASSETS>
16388424.00</TOTAL_ASSETS>
<TOTAL_LIABILITIES>
4527386.00</TOTAL_LIABILITIES>
<DEBT_SERVICE_REQUIREMENT>
514028.00</DEBT_SERVICE_REQUIREMENT>
<REPAYMENT_SOURCE>
1</REPAYMENT_SOURCE>
</Customer>
</Institution>
</Provider>
^M>

Is it possible to achieve the expected output below?

<?xml version="1.0" encoding="UTF-16" ?>
<Provider PROVIDER="17" SCHEMA_VERSION="2.5">
<Institution UNINUM="xxxx" EXTRACT_DATE="2013-12-31" CUSTOMER_ROW_COUNT="1577" LOAN_ROW_COUNT="3322" BOOK_VALUE_DOLLARS="720163381.46" BOOK_VALUE_COUNT="3115" PAST_DUE_AMOUNT_DOLLARS="3630254.00" PAST_DUE_AMOUNT_COUNT="23" ACCEPTABLE_VOL_DOLLARS="693647325.79" ACCEPTABLE_VOL_COUNT="3058" ACCRUED_INTEREST_DOLLARS="10221888.55" ACCRUED_INTEREST_COUNT="2877" DOUBTFUL_VOL_DOLLARS="2374.32" DOUBTFUL_VOL_COUNT="3" OAEM_VOL_DOLLARS="17835860.04" OAEM_VOL_COUNT="16" PRINCIPAL_BALANCE_DOLLARS="709941492.91" PRINCIPAL_BALANCE_COUNT="3095" SUBSTANDARD_VOL_DOLLARS="8677821.31" SUBSTANDARD_VOL_COUNT="38" PD_RATING_VALUES="20603" PD_RATING_COUNT="3322" BEGINNING_FARMER_FLAG_COUNT="636" SMALL_FARMER_FLAG_COUNT="1505" YOUNG_FARMER_FLAG_COUNT="580">
<Customer CIF="14338">
<BORROWER_NAME>xxxxxxxxxxxx</BORROWER_NAME>
<FIPS_CODE>49023</FIPS_CODE>
<RELATED_PARTY_LOAN_CODE>0</RELATED_PARTY_LOAN_CODE>
<RELATIONSHIP_ESTABLISH_DATE>1999-03-15</RELATIONSHIP_ESTABLISH_DATE>
<LAST_RISK_RATING_CHANGE_DATE>2012-06-12</LAST_RISK_RATING_CHANGE_DATE>
<BALANCE_SHEET_DATE>2012-11-30</BALANCE_SHEET_DATE>
<INCOME_STATEMENT_DATE>2013-12-31</INCOME_STATEMENT_DATE>
<DEBT_REPAYMENT_COVERAGE_RATIO>1.2083000000</DEBT_REPAYMENT_COVERAGE_RATIO>
<CURRENT_ASSETS>2216074.00</CURRENT_ASSETS>
<CURRENT_LIABILITIES>2036364.00</CURRENT_LIABILITIES>
<FARM_OPS_EXP>3759280.00</FARM_OPS_EXP>
<GROSS_AG_INC>3553127.00</GROSS_AG_INC>
<INT_EXP>123050.00</INT_EXP>
<NON_CURR_ASSET>14172350.00</NON_CURR_ASSET>
<NON_CURR_LIABILITIES>2491022.00</NON_CURR_LIABILITIES>
<NET_AG_INC>-206153.00</NET_AG_INC>
<NET_INC>498043.00</NET_INC>
<NET_WORTH>11861038.00</NET_WORTH>
<NONFARM_INC>704196.00</NONFARM_INC>
<TOTAL_ASSETS>16388424.00</TOTAL_ASSETS>
<TOTAL_LIABILITIES>4527386.00</TOTAL_LIABILITIES>
<DEBT_SERVICE_REQUIREMENT>514028.00</DEBT_SERVICE_REQUIREMENT>
<REPAYMENT_SOURCE>1</REPAYMENT_SOURCE>
</Customer>
</Institution>
</Provider>

Try this on your one line file, but your mileage may vary:

sed -r 's#(</[^>]*>)#\1\n#g' file

RudiC's suggestion may work well on a Linux system, but sed is only defined to work on text files. (Files with lines that average 9 MB per line and that do not end with a newline character are not text files.) The following should work as long as no line in your desired output file is longer than 2048 bytes:

awk '
BEGIN {	RS = ">"
	ORS = ">\n"
}
!/^</ {	out = out ">" $1
	next
}
{	print out
	out = $1
}
END {	print out
}' 2014-03-31_17_V2.5.XML.utf8
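
One way to check whether reformatted output stays within the 2048-byte limit mentioned above is to measure its longest line. A small sketch (the function name longest_line is hypothetical; LC_ALL=C is set so that length counts bytes, not multibyte characters):

```shell
# Hypothetical helper: print the length in bytes of the longest line
# read from the named files (or from standard input).
longest_line() {
    LC_ALL=C awk '{ n = length($0); if (n > max) max = n } END { print max + 0 }' "$@"
}

# Example: longest_line reformatted.xml
```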

I tried your code, but it is removing content, for example:
<?xml version="1.0" encoding="UTF-16" ?>
<Provider PROVIDER="17" SCHEMA_VERSION="2.5">
CIF="14338">
xxxxxxxxxxxx DAIRY FARMS INC</BORROWER_NAME>

>
<?xml>
<Provider>
<Institution>
<Customer>
<BORROWER_NAME>xxxxxxxxx>
<FIPS_CODE>49023</FIPS_CODE>
<RELATED_PARTY_LOAN_CODE>0</RELATED_PARTY_LOAN_CODE>
<RELATIONSHIP_ESTABLISH_DATE>1999-03-15</RELATIONSHIP_ESTABLISH_DATE>
<LAST_RISK_RATING_CHANGE_DATE>2012-06-12</LAST_RISK_RATING_CHANGE_DATE>
<BALANCE_SHEET_DATE>2012-11-30</BALANCE_SHEET_DATE>
<INCOME_STATEMENT_DATE>2013-12-31</INCOME_STATEMENT_DATE>
<DEBT_REPAYMENT_COVERAGE_RATIO>1.2083000000</DEBT_REPAYMENT_COVERAGE_RATIO>
<CURRENT_ASSETS>2216074.00</CURRENT_ASSETS>
<CURRENT_LIABILITIES>2036364.00</CURRENT_LIABILITIES>
<FARM_OPS_EXP>3759280.00</FARM_OPS_EXP>
<GROSS_AG_INC>3553127.00</GROSS_AG_INC>
<INT_EXP>123050.00</INT_EXP>
<NON_CURR_ASSET>14172350.00</NON_CURR_ASSET>
<NON_CURR_LIABILITIES>2491022.00</NON_CURR_LIABILITIES>
<NET_AG_INC>-206153.00</NET_AG_INC>
<NET_INC>498043.00</NET_INC>
<NET_WORTH>11861038.00</NET_WORTH>
<NONFARM_INC>704196.00</NONFARM_INC>
<TOTAL_ASSETS>16388424.00</TOTAL_ASSETS>
<TOTAL_LIABILITIES>4527386.00</TOTAL_LIABILITIES>
<DEBT_SERVICE_REQUIREMENT>514028.00</DEBT_SERVICE_REQUIREMENT>
<REPAYMENT_SOURCE>1</REPAYMENT_SOURCE>

What input is giving you that output?

Below is an excerpt of my input file; it is one single line because there are no EOL characters, as you know.

<?xml version="1.0" encoding="UTF-16" ?><Provider PROVIDER="x" SCHEMA_VERSION="2.5"><Institution UNINUM="xxxx" EXTRACT_DATE="2013-12-31" CUSTOMER_ROW_COUNT="1577" LOAN_ROW_COUNT="3322" BOOK_VALUE_DOLLARS="720163381.46" BOOK_VALUE_COUNT="3115" PAST_DUE_AMOUNT_DOLLARS="3630254.00" PAST_DUE_AMOUNT_COUNT="23" ACCEPTABLE_VOL_DOLLARS="693647325.79" ACCEPTABLE_VOL_COUNT="3058" ACCRUED_INTEREST_DOLLARS="10221888.55" ACCRUED_INTEREST_COUNT="2877" DOUBTFUL_VOL_DOLLARS="2374.32" DOUBTFUL_VOL_COUNT="3" OAEM_VOL_DOLLARS="17835860.04" OAEM_VOL_COUNT="16" PRINCIPAL_BALANCE_DOLLARS="709941492.91" PRINCIPAL_BALANCE_COUNT="3095" SUBSTANDARD_VOL_DOLLARS="8677821.31" SUBSTANDARD_VOL_COUNT="38" PD_RATING_VALUES="20603" PD_RATING_COUNT="3322" BEGINNING_FARMER_FLAG_COUNT="636" SMALL_FARMER_FLAG_COUNT="1505" YOUNG_FARMER_FLAG_COUNT="580"><Customer CIF="14338"><BORROWER_NAME>xxxxxxxxx DAIRY FARMS INC</BORROWER_NAME><FIPS_CODE>49023</FIPS_CODE><RELATED_PARTY_LOAN_CODE>0</RELATED_PARTY_LOAN_CODE><RELATIONSHIP_ESTABLISH_DATE>1999-03-15</RELATIONSHIP_ESTABLISH_DATE><LAST_RISK_RATING_CHANGE_DATE>2012-06-12</LAST_RISK_RATING_CHANGE_DATE><BALANCE_SHEET_DATE>2012-11-30</BALANCE_SHEET_DATE><INCOME_STATEMENT_DATE>2013-12-31</INCOME_STATEMENT_DATE><DEBT_REPAYMENT_COVERAGE_RATIO>1.2083000000</DEBT_REPAYMENT_COVERAGE_RATIO><CURRENT_ASSETS>2216074.00</CURRENT_ASSETS><CURRENT_LIABILITIES>2036364.00</CURRENT_LIABILITIES><FARM_OPS_EXP>3759280.00</FARM_OPS_EXP><GROSS_AG_INC>3553127.00</GROSS_AG_INC><INT_EXP>123050.00</INT_EXP><NON_CURR_ASSET>14172350.00</NON_CURR_ASSET><NON_CURR_LIABILITIES>2491022.00</NON_CURR_LIABILITIES><NET_AG_INC>-206153.00</NET_AG_INC><NET_INC>498043.00</NET_INC><NET_WORTH>11861038.00</NET_WORTH><NONFARM_INC>704196.00</NONFARM_INC><TOTAL_ASSETS>16388424.00</TOTAL_ASSETS><TOTAL_LIABILITIES>4527386.00</TOTAL_LIABILITIES><DEBT_SERVICE_REQUIREMENT>514028.00</DEBT_SERVICE_REQUIREMENT><REPAYMENT_SOURCE>1</REPAYMENT_SOURCE><Loan LOAN_NUMBER="3583040101"><BRANCH>RICHFIELD           

Sorry about that, try this:

awk '
BEGIN {	FS = RS = ">"
	ORS = ">\n"
}
!/^</ {	out = out ">" $1
	next
}
{	print out
	out = $1
}
END {	print out
}' 2014-03-31_17_V2.5.XML.utf8
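
Note that this script prints a line containing only ">" before the first tag, because the first print runs while out is still empty. A minimal variation that skips that first empty print might look like the sketch below. The function name one_element_per_line is made up for illustration, and the input is assumed to end right after the final ">" with no trailing carriage return or newline.

```shell
# Hypothetical variation: keep each element and its text on one line,
# without the leading ">" line.
one_element_per_line() {
    awk '
    BEGIN { FS = RS = ">"; ORS = ">\n" }
    # Text followed by a closing tag: append it to the pending opening tag.
    !/^</ { out = out ">" $1; next }
    # A new tag starts: flush the pending line, except before the first tag.
    { if (NR > 1) print out; out = $1 }
    # Flush whatever is still pending at end of input.
    END { if (out != "") print out }
    ' "$@"
}

# Example: one_element_per_line 2014-03-31_17_V2.5.XML.utf8
```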

It looks good now. Below is an excerpt from the XML file. Thanks a lot.

>
<?xml version="1.0" encoding="UTF-16" ?>
<Provider PROVIDER="xx" SCHEMA_VERSION="2.5">
<Institution UNINUM="xxxxxx" EXTRACT_DATE="2013-12-31" CUSTOMER_ROW_COUNT="1577" LOAN_ROW_COUNT="3322" BOOK_VALUE_DOLLARS="720163381.46" BOOK_VALUE_COUNT="3115" PAST_DUE_AMOUNT_DOLLARS="3630254.00" PAST_DUE_AMOUNT_COUNT="23" ACCEPTABLE_VOL_DOLLARS="693647325.79" ACCEPTABLE_VOL_COUNT="3058" ACCRUED_INTEREST_DOLLARS="10221888.55" ACCRUED_INTEREST_COUNT="2877" DOUBTFUL_VOL_DOLLARS="2374.32" DOUBTFUL_VOL_COUNT="3" OAEM_VOL_DOLLARS="17835860.04" OAEM_VOL_COUNT="16" PRINCIPAL_BALANCE_DOLLARS="709941492.91" PRINCIPAL_BALANCE_COUNT="3095" SUBSTANDARD_VOL_DOLLARS="8677821.31" SUBSTANDARD_VOL_COUNT="38" PD_RATING_VALUES="20603" PD_RATING_COUNT="3322" BEGINNING_FARMER_FLAG_COUNT="636" SMALL_FARMER_FLAG_COUNT="1505" YOUNG_FARMER_FLAG_COUNT="580">
<Customer CIF="xxx">
<BORROWER_NAME>xxxxx</BORROWER_NAME>
<FIPS_CODE>49023</FIPS_CODE>
<RELATED_PARTY_LOAN_CODE>0</RELATED_PARTY_LOAN_CODE>
<RELATIONSHIP_ESTABLISH_DATE>1999-03-15</RELATIONSHIP_ESTABLISH_DATE>
<LAST_RISK_RATING_CHANGE_DATE>2012-06-12</LAST_RISK_RATING_CHANGE_DATE>
<BALANCE_SHEET_DATE>2012-11-30</BALANCE_SHEET_DATE>
<INCOME_STATEMENT_DATE>2013-12-31</INCOME_STATEMENT_DATE>
<DEBT_REPAYMENT_COVERAGE_RATIO>1.2083000000</DEBT_REPAYMENT_COVERAGE_RATIO>
<CURRENT_ASSETS>2216074.00</CURRENT_ASSETS>
<CURRENT_LIABILITIES>2036364.00</CURRENT_LIABILITIES>
<FARM_OPS_EXP>3759280.00</FARM_OPS_EXP>
<GROSS_AG_INC>3553127.00</GROSS_AG_INC>
<INT_EXP>123050.00</INT_EXP>
<NON_CURR_ASSET>14172350.00</NON_CURR_ASSET>
<NON_CURR_LIABILITIES>2491022.00</NON_CURR_LIABILITIES>
<NET_AG_INC>-206153.00</NET_AG_INC>
<NET_INC>498043.00</NET_INC>
<NET_WORTH>11861038.00</NET_WORTH>
<NONFARM_INC>704196.00</NONFARM_INC>
<TOTAL_ASSETS>16388424.00</TOTAL_ASSETS>
<TOTAL_LIABILITIES>4527386.00</TOTAL_LIABILITIES>
<DEBT_SERVICE_REQUIREMENT>514028.00</DEBT_SERVICE_REQUIREMENT>
<REPAYMENT_SOURCE>1</REPAYMENT_SOURCE>
<Loan LOAN_NUMBER="3583040101">
<BRANCH>RICHFIELD                                         </BRANCH>
<INT_RATE_PRODUCT>6</INT_RATE_PRODUCT>
<LOAN_OFFICER>ROBERT WHEELER                                              </LOAN_OFFICER>
<YOUNG_FARMER_FLAG>0</YOUNG_FARMER_FLAG>
<LOSS_GIVEN_DEFAULT>A</LOSS_GIVEN_DEFAULT>
<TIL_FLAG>0</TIL_FLAG>
<PERFORMANCE_CLASS>1</PERFORMANCE_CLASS>
<BEGINNING_FARMER_FLAG>0</BEGINNING_FARMER_FLAG>
<LOAN_TYPE>1</LOAN_TYPE>
<SMALL_FARMER_FLAG>0</SMALL_FARMER_FLAG>
<ACCEPTABLE_VOL>836460.01</ACCEPTABLE_VOL>
<ACCRUED_INTEREST>0.00</ACCRUED_INTEREST>
<BOOK_VALUE>836460.01</BOOK_VALUE>
<BORROWER_CATEGORY>2</BORROWER_CATEGORY>
<BORROWER_ENTITY>3</BORROWER_ENTITY>
<COMMIT_CURRENT>836460.01</COMMIT_CURRENT>
<COMMIT_UNDISBURSED>0.00</COMMIT_UNDISBURSED>
<COST_OF_FUNDS>0.0150500000</COST_OF_FUNDS>
<DATE_ORIGINATED>2000-07-21</DATE_ORIGINATED>
<DOUBTFUL_VOL>0.00</DOUBTFUL_VOL>
<GOVT_GUARANTEE_AMT>0.00</GOVT_GUARANTEE_AMT>
<INT_RATE>0.0450000000</INT_RATE>
<LENDER>1</LENDER>