Ariean
September 16, 2014, 11:25am
1
Sample XML file:
<?xml version="1.0" encoding="UTF-16" ?>
<Provider PROVIDER="xx" SCHEMA_VERSION="2.5">
<Institution UNINUM="xxxx" EXTRACT_DATE="2013-12-31" CUSTOMER_ROW_COUNT="1577" LOAN_ROW_COUNT="3322" BOOK_VALUE_DOLLARS="720163381.46">
<Customer CIF="dww213">
<BORROWER_NAME>xxxxxxxxxxxx</BORROWER_NAME>
<REPAYMENT_SOURCE>1</REPAYMENT_SOURCE>
<Loan LOAN_NUMBER="xxx">
<YOUNG_FARMER_FLAG>0</YOUNG_FARMER_FLAG>
<LOSS_GIVEN_DEFAULT>A</LOSS_GIVEN_DEFAULT>
<COLL_TYPE>3</COLL_TYPE>
<STATUS_FLAG>1</STATUS_FLAG>
<UMBRELLA_NUMBER>0025101101</UMBRELLA_NUMBER>
</Loan>
</Customer>
<Customer CIF="z122321">
<BORROWER_NAME>xxxxxxxxxxxx</BORROWER_NAME>
<FIPS_CODE>xxxx</FIPS_CODE>
<NON_CURR_LIABILITIES>2491022.00</NON_CURR_LIABILITIES>
<REPAYMENT_SOURCE>1</REPAYMENT_SOURCE>
<Loan LOAN_NUMBER="xxx">
<YOUNG_FARMER_FLAG>0</YOUNG_FARMER_FLAG>
<LOSS_GIVEN_DEFAULT>A</LOSS_GIVEN_DEFAULT>
<COLL_TYPE>3</COLL_TYPE>
<STATUS_FLAG>1</STATUS_FLAG>
<UMBRELLA_NUMBER>0025101101</UMBRELLA_NUMBER>
</Loan>
<Loan LOAN_NUMBER="123">
<YOUNG_FARMER_FLAG>0</YOUNG_FARMER_FLAG>
<UMBRELLA_NUMBER>0025101101</UMBRELLA_NUMBER>
</Loan>
</Customer>
</Institution>
</Provider>
Hello All,
I am using Red Hat Linux and my requirement is to get only the unique tags. For example, for the above XML file I should get the following unique list of tags.
<Provider>
<Institution>
<Customer>
<BORROWER_NAME>
<REPAYMENT_SOURCE>
<FIPS_CODE>
<NON_CURR_LIABILITIES>
<Loan>
<YOUNG_FARMER_FLAG>
<LOSS_GIVEN_DEFAULT>
<COLL_TYPE>
<STATUS_FLAG>
<UMBRELLA_NUMBER>
After I get this list, I need to compare it against a predefined list of tags and raise an error or send an email if a tag is not in that list.
I can write a for loop to compare against the predefined list, but I am stuck on how to get the unique tags from the XML file. Can you please help?
Thank you
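The comparison step described in the question could be sketched as follows, assuming both lists are stored one tag per line. The function name `unknown_tags` and the file names are placeholders, not anything from the thread, and the email step is left as a comment because the mail setup is not known.

```shell
#!/bin/sh
# unknown_tags (hypothetical helper):
#   $1 = file with the predefined (allowed) tags, one per line
#   $2 = file with the tags found in the XML, one per line
# Prints every tag from $2 that is not listed in $1.
unknown_tags() {
    awk 'NR == FNR { allowed[$0]; next }   # first file: remember allowed tags
         !($0 in allowed)                  # second file: print anything not allowed
    ' "$1" "$2"
}

# Demo with throwaway files.
tmp=$(mktemp -d)
printf '%s\n' '<Provider>' '<Customer>' '<Loan>' > "$tmp/allowed.txt"
printf '%s\n' '<Provider>' '<Loan>' '<FIPS_CODE>' > "$tmp/found.txt"

bad=$(unknown_tags "$tmp/allowed.txt" "$tmp/found.txt")
if [ -n "$bad" ]; then
    printf 'Tags not in the predefined list:\n%s\n' "$bad" >&2
    # an email notification would go here
fi
rm -rf "$tmp"
```

One caveat of the NR == FNR idiom: if the allowed list is an empty file, the second file is treated as the first, so the predefined list should be checked for content beforehand.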
Yoda
September 16, 2014, 11:35am
2
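Try something like this:
awk -F'[<> ]' '
{
    sub(/\//, "", $2)                  # drop the "/" from a closing tag such as </Loan>
    if ($2 != "" && $2 !~ /^\?xml/)    # skip empty fields and the <?xml ...?> declaration
        A[$2]                          # referencing the element creates the key
}
END {
    for (k in A)                       # iteration order is unspecified
        print "<" k ">"
}
' file.xml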
Try
$ awk -F'[<> ]' '{ $1 = $1 }$2 !~ /^[[:punct:]]/ && !a[$2]++{print "<"$2">"}' file.xml
OR
$ awk -F'[<> ]' '{ $1 = $1 }$2 !~ /^[[:punct:]]/ && !($2 in a){print "<"$2">"; a[$2]}' file.xml
Ariean
September 16, 2014, 3:38pm
4
This works fine for the sample file I provided, which is indented properly, but if the file is not indented properly it prints only the tags below.
<Provider>
<>
Yoda
September 16, 2014, 4:11pm
5
If you have xmllint, use it to show the structure:
echo "du" | xmllint --shell file.xml
Pipe the output to an awk program to format it.
Then attach a sample of the actual XML file you want to process.
Hello Ariean,
The following may help.
awk '{
    match($0,/<\/.*>/);
    b=substr($0,RSTART,RLENGTH);
    if(b)
        {a[++i]=b}
}
END{
    {for(k in a)
        {c[a[k]]=k}
    }
    {for(u in c)
        {gsub(/\//,X,u);print u}
    }
}' Input_File
Output will be as follows.
<BORROWER_NAME>
<LOSS_GIVEN_DEFAULT>
<NON_CURR_LIABILITIES>
<Provider>
<REPAYMENT_SOURCE>
<YOUNG_FARMER_FLAG>
<Institution>
<UMBRELLA_NUMBER>
<FIPS_CODE>
<Customer>
<STATUS_FLAG>
<Loan>
<COLL_TYPE>
NOTE: This code has been tested only on the sample input.
Thanks,
R. Singh
Ariean
September 17, 2014, 1:32pm
8
Could you please explain a little of what your code is doing? I am new to awk. Many thanks.
Hello Ariean,
The following may help.
awk '{
    match($0,/<\/.*>/);            ##### Match a string that starts with </ and ends with > #####
    b=substr($0,RSTART,RLENGTH);   ##### Store the matched string in variable b #####
    if(b)                          ##### If variable b is not empty #####
        {a[++i]=b}                 ##### Append b to array a, indexed by an increasing counter i #####
}
END{
    {for(k in a)                   ##### Loop over the indices of array a #####
        {c[a[k]]=k}                ##### Use each value of a as an index of array c, which removes duplicates #####
    }
    {for(u in c)                   ##### Loop over the indices of array c, that is, the unique tags #####
        {gsub(/\//,X,u);print u}   ##### Remove the / from each tag and print it #####
    }
}' Input_File
Thanks,
R. Singh
Ariean
September 17, 2014, 2:35pm
10
Thank you. I just tested it against a 5.8 GB XML file and it has been running for the past hour. Is there any way to tune this? I appreciate your help.
If you have more than one tag per line, something like this may be more accurate:
awk -F '[> ]' '! /^[/?]/ && length($1) && !h[$1]++ {print RS $1 ">" }' RS=\< infile
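For readers new to awk, the record-separator approach can be spelled out as a short function. This is a sketch that assumes a POSIX awk; `unique_tags` is a hypothetical name, not something from the thread. Setting RS to `<` makes every tag start a new record, so several tags on one physical line are handled the same way as one tag per line.

```shell
#!/bin/sh
# unique_tags (hypothetical helper): reads XML from stdin or from the files
# given as arguments and prints each opening tag name once.
unique_tags() {
    awk -v RS='<' -F '[> ]' '
        /^[\/?]/    { next }                 # closing tag (</x>) or declaration (<?xml ...?>)
        $1 == ""    { next }                 # nothing before the first ">" or space
        !seen[$1]++ { print "<" $1 ">" }     # first time this name appears
    ' "$@"
}

# Two tags on one line are still seen as separate records.
printf '<a><b>1</b><b>2</b></a>\n' | unique_tags
```

The demo line prints `<a>` and then `<b>`. Because only the first occurrence of each name is kept in `seen`, memory use grows with the number of distinct tags rather than with the file size, which is probably why this style copes better with very large inputs than storing every match.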
Ariean
September 18, 2014, 4:13pm
12
Thanks, it ran quite fast. In the excerpt of my output file below, how do I remove the invalid entries (<> and <!--FILE>)? The first one appears to come from junk characters in the input file.
<>
<ACCEPTABLE_VOL>
<ACCRUED_INTEREST>
<APPRAISAL_DATE_RE>
<APPRAISED_VALUE_RE>
<FACILITY_DESC>
<FACILITY_GROSS_OUTSTANDING>
<FARM_OPS_EXP>
<FARM_PAYMENT_SUPPORT>
<!--FILE>
<FIPS_CODE>
<FUNDS_HELD_BAL>
.
..
...
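One way to drop entries such as `<>` and `<!--FILE>` is to accept a record only when it begins with something that looks like an element name. The sketch below assumes names start with a letter or underscore and continue with letters, digits, `_`, `.`, `:` or `-`, which is narrower than the full XML name rules; `element_names` is a hypothetical helper name.

```shell
#!/bin/sh
# element_names (hypothetical helper): prints each element name once and
# ignores comments, declarations, closing tags and empty names.
element_names() {
    awk -v RS='<' '
        # Keep only records that begin with a plausible element name.
        match($0, /^[A-Za-z_][A-Za-z0-9_.:-]*/) {
            name = substr($0, RSTART, RLENGTH)
            if (!seen[name]++) print "<" name ">"
        }
    ' "$@"
}

printf '<>junk<!--FILE x--><Loan A="1"><FIPS_CODE>1</FIPS_CODE></Loan>\n' | element_names
```

The demo line prints `<Loan>` and then `<FIPS_CODE>`. A tag that appears inside a comment body would still be reported, since this is pattern matching and not XML parsing. The sample declares encoding="UTF-16", so if the real file is UTF-16 the junk may be encoding bytes; converting first, for example with `iconv -f UTF-16 -t UTF-8 file.xml | element_names`, is an untested assumption worth trying.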