How to get distinct Tags from an XML file?

Ariean · September 16, 2014, 11:25am

Sample XML file:

<?xml version="1.0" encoding="UTF-16" ?>
<Provider PROVIDER="xx" SCHEMA_VERSION="2.5">
<Institution UNINUM="xxxx" EXTRACT_DATE="2013-12-31" CUSTOMER_ROW_COUNT="1577" LOAN_ROW_COUNT="3322" BOOK_VALUE_DOLLARS="720163381.46">
	<Customer CIF="dww213">
		<BORROWER_NAME>xxxxxxxxxxxx</BORROWER_NAME>
		<REPAYMENT_SOURCE>1</REPAYMENT_SOURCE>
			<Loan LOAN_NUMBER="xxx">
				<YOUNG_FARMER_FLAG>0</YOUNG_FARMER_FLAG>
				<LOSS_GIVEN_DEFAULT>A</LOSS_GIVEN_DEFAULT>
				<COLL_TYPE>3</COLL_TYPE>
				<STATUS_FLAG>1</STATUS_FLAG>
				<UMBRELLA_NUMBER>0025101101</UMBRELLA_NUMBER>
			</Loan>
	</Customer>
	<Customer CIF="z122321">
		<BORROWER_NAME>xxxxxxxxxxxx</BORROWER_NAME>
		<FIPS_CODE>xxxx</FIPS_CODE>
		<NON_CURR_LIABILITIES>2491022.00</NON_CURR_LIABILITIES>
		<REPAYMENT_SOURCE>1</REPAYMENT_SOURCE>
			<Loan LOAN_NUMBER="xxx">
				<YOUNG_FARMER_FLAG>0</YOUNG_FARMER_FLAG>
				<LOSS_GIVEN_DEFAULT>A</LOSS_GIVEN_DEFAULT>
				<COLL_TYPE>3</COLL_TYPE>
				<STATUS_FLAG>1</STATUS_FLAG>
				<UMBRELLA_NUMBER>0025101101</UMBRELLA_NUMBER>
			</Loan>
			<Loan LOAN_NUMBER="123">
				<YOUNG_FARMER_FLAG>0</YOUNG_FARMER_FLAG>
				<UMBRELLA_NUMBER>0025101101</UMBRELLA_NUMBER>
			</Loan>
	</Customer>
</Institution>
</Provider>

Hello All,
I am using Red Hat Linux and my requirement is to get only the unique tags. For example, for the above XML file I should get the following unique list of tags.

	<Provider>
	<Institution>
	<Customer>
	<BORROWER_NAME>
	<REPAYMENT_SOURCE>
	<FIPS_CODE>
	<NON_CURR_LIABILITIES>
	<Loan>
	<YOUNG_FARMER_FLAG>
	<LOSS_GIVEN_DEFAULT>
	<COLL_TYPE>
	<STATUS_FLAG>
	<UMBRELLA_NUMBER>

After I get this list I need to compare it against a predefined list of tags and raise an error/send an email if a tag is not in that list.
I can write a for loop to compare against the predefined list, but I am stuck on how to get the unique tags from the XML file. Can you please help?

Thank you
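
A minimal sketch of the comparison step described above, assuming the unique tags have already been written to a file, one per line, and the predefined tags sit in a second file in the same format. The function name check_tags, the file names, and the mail address are hypothetical.

```shell
# check_tags FOUND ALLOWED
# Prints every line of FOUND that does not appear in ALLOWED and
# returns non-zero if at least one such line exists.
# Both files hold one tag per line, e.g. <Customer>.
check_tags() {
    awk '
        # First file named on the command line: remember the allowed tags.
        FILENAME == ARGV[1] { ok[$0]; next }
        # Second file: report tags that were not in the allowed list.
        !($0 in ok)         { print; bad = 1 }
        END                 { exit bad }
    ' "$2" "$1"
}

# Hypothetical usage; the mail command and address are placeholders:
# if ! unknown=$(check_tags found_tags.txt allowed_tags.txt); then
#     printf '%s\n' "$unknown" | mail -s "Unexpected XML tags" someone@example.com
# fi
```

Doing the lookup in a single awk pass avoids a nested shell loop over both lists.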

Yoda · September 16, 2014, 11:35am

Try something like this:-

awk -F'[<> ]' '
        {
                sub(/\//,x,$2)
                if ( $2 !~ /xml/ )
                        A[$2]
        }
        END {
                for ( k in A )
                        print "<" k ">"
        }
' file.xml

Akshay_Hegde · September 16, 2014, 12:00pm

Try

$ awk -F'[<> ]' '{ $1 = $1 }$2 !~ /^[[:punct:]]/ && !a[$2]++{print "<"$2">"}' file.xml

OR

$ awk -F'[<> ]' '{ $1 = $1 }$2 !~ /^[[:punct:]]/ && !($2 in a){print "<"$2">"; a[$2]}' file.xml

Ariean · September 16, 2014, 3:38pm

For some reason it works fine for the sample file I provided, which is indented properly, but if the file is not indented properly it prints only the tags below.

<Provider>
<>

Yoda · September 16, 2014, 4:11pm

If you have xmllint, use it to show the structure:

echo "du" | xmllint --shell file.xml

Pipe the output to an awk program to format.
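
A rough sketch of such an awk filter. It assumes the du command prints one element name per line, indented by depth, and that the shell prompt "/ > " and a bare "/" for the document node may show up on some lines. The function name unique_tags_from_du is hypothetical.

```shell
# unique_tags_from_du: reads the output of the xmllint shell "du" command
# on stdin and prints each element name once, wrapped in < >.
unique_tags_from_du() {
    awk '{
        sub(/^\/ > /, "")             # drop the shell prompt, if present
        gsub(/^[ \t]+|[ \t]+$/, "")   # drop the depth indentation
        # skip blank lines and the "/" printed for the document node
        if ($0 != "" && $0 != "/" && !seen[$0]++)
            print "<" $0 ">"
    }'
}

# Usage with the command from the post above:
# echo "du" | xmllint --shell file.xml | unique_tags_from_du
```

The xmllint shell works on a parsed tree held in memory, so this route is likely better suited to small and medium files than to very large ones.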

shamrock · September 16, 2014, 4:50pm

Then attach a sample of the actual xml file you want to process...

RavinderSingh13 · September 17, 2014, 1:50am

Hello Ariean,

The following may help.

awk '{
match($0,/<\/.*>/); 
b=substr($0,RSTART,RLENGTH); 
 if(b)
    {a[++i]=b}
     } 
END{
  {for(k in a)
    {c[a[k]]=k}
 } 
 {for(u in c)
  {gsub(/\//,X,u);print u}
 }
   }' Input_File

Output will be as follows.

<BORROWER_NAME>
<LOSS_GIVEN_DEFAULT>
<NON_CURR_LIABILITIES>
<Provider>
<REPAYMENT_SOURCE>
<YOUNG_FARMER_FLAG>
<Institution>
<UMBRELLA_NUMBER>
<FIPS_CODE>
<Customer>
<STATUS_FLAG>
<Loan>
<COLL_TYPE>

NOTE: This code has been tested on the sample file.

Thanks,
R. Singh

Ariean · September 17, 2014, 1:32pm

Could you please explain a little of what your code is doing? I am new to awk. Many thanks.

RavinderSingh13 · September 17, 2014, 2:33pm

Hello Ariean,

Following may help.

awk '{
match($0,/<\/.*>/);                    ##### Match the text from the first </ to the last > on the line (a closing tag) #####
b=substr($0,RSTART,RLENGTH);           ##### Store the matched text in a variable named b #####
 if(b)                                 ##### If variable b is not empty #####
    {a[++i]=b}                         ##### Append b to array a, whose index is an increasing counter #####
     } 
END{
  {for(k in a)                         ##### Loop over the indexes of array a #####
    {c[a[k]]=k}                        ##### Store into array c, whose index is the value from a and whose value is the index from a, so duplicate tags collapse into one entry #####
 } 
 {for(u in c)                          ##### Loop over the indexes of array c, which are the unique tags #####
  {gsub(/\//,X,u);print u}             ##### Remove the / from the tag and print it #####
 }
   }' Input_File

Thanks,
R. Singh

Ariean · September 17, 2014, 2:35pm

Thank you. I just tested it against a 5.8 GB XML file and it has been running for the past hour. Is there any way to tune this? I appreciate your help.

Chubler_XL · September 17, 2014, 3:29pm

If you have more than 1 tag per line something like this may be more accurate:

awk -F '[> ]' '! /^[/?]/ && length($1) && !h[$1]++ {print RS $1 ">" }' RS=\< infile
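
For readers new to awk, the one-liner above can be spelled out roughly as follows. This is a sketch of the same idea, not a drop-in replacement: setting RS to < makes every record begin right after a <, so line breaks and indentation in the file no longer matter. The function name unique_tags is hypothetical.

```shell
# unique_tags FILE: print each distinct tag name once, in order of
# first appearance, however the XML is split across lines.
unique_tags() {
    awk -v RS='<' -F '[> ]' '
        $0 ~ "^[/?]"    { next }    # skip closing tags and the <?xml ...?> line
        length($1) == 0 { next }    # skip the empty record before the first <
        !seen[$1]++     { print "<" $1 ">" }   # first time this name is seen
    ' "$1"
}

# Usage:
# unique_tags infile
```

Each name is printed the first time it is seen and only a counter per distinct name is kept, so memory use depends on the number of distinct tags rather than on the size of the file.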

Ariean · September 18, 2014, 4:13pm

Thanks, it worked pretty fast. But in the excerpt of my output file below, how do I remove the <> and <!--FILE> entries? The first one seems to be caused by some junk characters in the input file.

<>
<ACCEPTABLE_VOL>
<ACCRUED_INTEREST>
<APPRAISAL_DATE_RE>
<APPRAISED_VALUE_RE>
<FACILITY_DESC>
<FACILITY_GROSS_OUTSTANDING>
<FARM_OPS_EXP>
<FARM_PAYMENT_SUPPORT>
<!--FILE>
<FIPS_CODE>
<FUNDS_HELD_BAL>
.
..
...
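
One possible way to drop such entries is to print only names that look like plain element names. The sketch below assumes tag names are ASCII letters, digits, underscore, dot, colon, or hyphen and start with a letter or underscore; the function name unique_element_names is hypothetical. It is still not a real XML parser, so element names that appear inside comments or CDATA sections would be counted as well.

```shell
# unique_element_names FILE: like the RS='<' approach above, but keeps
# only names that look like element names, which drops <>, <!--...>,
# <?xml ...?> and closing tags.
unique_element_names() {
    # LC_ALL=C makes awk work byte by byte, so stray non-ASCII bytes
    # simply fail the name test below.
    LC_ALL=C awk -v RS='<' -F '[> \t\r\n/]' '
        $1 ~ /^[A-Za-z_][A-Za-z0-9_.:-]*$/ && !seen[$1]++ { print "<" $1 ">" }
    ' "$1"
}

# Usage:
# unique_element_names infile
```

Adding / and the whitespace characters to the field separator means a closing tag yields an empty first field, and a name followed directly by a line break or by /> is still cut off cleanly.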