Extract TAG name and XPATH from XML file via shellscript

BithunC · August 21, 2012, 1:31am

Hi,
Here is a sample xml file and expected output.
I need to extract the element/tag name (not value) and xpath (sample output.txt).
But the main problem is I put here one simple xml file where I can clearly see the number of elements, but in real time I have a xml file which have over 500 elements, so is there any option to find the elements automatically and retrive the xpath.

sample.XML:

 
<?xml version = '1.0'?>
<ROWSET>
<ROW num="1">
<EMPNO>7369</EMPNO>
<ENAME>SMITH</ENAME>
<JOB>CLERK</JOB>
<MGR>7902</MGR>
<HIREDATE>12/17/1980 0:0:0</HIREDATE>
<SAL>800</SAL>
<DEPTNO>20</DEPTNO>
</ROW>
<ROW num="2">
<EMPNO>7499</EMPNO>
<ENAME>ALLEN</ENAME>
<JOB>SALESMAN</JOB>
<MGR>7698</MGR>
<HIREDATE>2/20/1981 0:0:0</HIREDATE>
<SAL>1600</SAL>
<COMM>300</COMM>
<DEPTNO>30</DEPTNO>
</ROW>
</ROWSET>

Output.txt :

 
ROW_NUM /ROWSET/ROW/@num
EMPNO /ROWSET/ROW/EMPNO
ENAME /ROWSET/ROW/ENAME
JOB /ROWSET/ROW/JOB
MGR /ROWSET/ROW/MGR
HIREDATE /ROWSET/ROW/HIREDATE
SAL /ROWSET/ROW/SAL
COMM /ROWSET/ROW/COMM
DEPTNO /ROWSET/ROW/DEPTNO

Note: If there is a attribute like "num" and the value is changing like 1, 2 or more than that.. is there any changes in output file?

Thnx,
Bithun

raj_saini20 · August 21, 2012, 2:52am

awk 'BEGIN{f=0}
(/^<ROW /){
		split($0,a,"\"");
		print "ROW_NUM/ROWSET/ROW/@"a[2];
		f=1;
		next
	}
(f==1){
	if(/^<\/ROW/)
		{
			f=0
		}
	else
		{
			split($0,a,"<|>");
			print a[2]"/ROWSET/ROW/"a[3]
		}
	}' sample.xml

output is

ROW_NUM/ROWSET/ROW/@1
EMPNO/ROWSET/ROW/7369
ENAME/ROWSET/ROW/SMITH
JOB/ROWSET/ROW/CLERK
MGR/ROWSET/ROW/7902
HIREDATE/ROWSET/ROW/12/17/1980 0:0:0
SAL/ROWSET/ROW/800
DEPTNO/ROWSET/ROW/20
ROW_NUM/ROWSET/ROW/@2
EMPNO/ROWSET/ROW/7499
ENAME/ROWSET/ROW/ALLEN
JOB/ROWSET/ROW/SALESMAN
MGR/ROWSET/ROW/7698
HIREDATE/ROWSET/ROW/2/20/1981 0:0:0
SAL/ROWSET/ROW/1600
COMM/ROWSET/ROW/300
DEPTNO/ROWSET/ROW/30

is this output you required?

BithunC · August 21, 2012, 3:01am

Thanx for the suggestion raj.
Appritiate your help.

But I don't need the element value, you can see my expected output there is a " " (space) between element name and xpath (so that i can differentiate with element name and xpath) but not the element value, otherwise its fine.
thanx again

raj_saini20 · August 21, 2012, 3:12am

awk 'BEGIN{f=0}
(/^<ROW /){
		split($0,a,"\"");
		print "ROW_NUM /ROWSET/ROW/@num";
		f=1;
		next
	}
(f==1){
	if(/^<\/ROW/)
		{
			f=0
		}
	else
		{
			split($0,a,"<|>");
			print a[2]" /ROWSET/ROW/"a[2]
		}
	}' sample.xml

output will be

ROW_NUM /ROWSET/ROW/@num
EMPNO /ROWSET/ROW/EMPNO
ENAME /ROWSET/ROW/ENAME
JOB /ROWSET/ROW/JOB
MGR /ROWSET/ROW/MGR
HIREDATE /ROWSET/ROW/HIREDATE
SAL /ROWSET/ROW/SAL
DEPTNO /ROWSET/ROW/DEPTNO
ROW_NUM /ROWSET/ROW/@num
EMPNO /ROWSET/ROW/EMPNO
ENAME /ROWSET/ROW/ENAME
JOB /ROWSET/ROW/JOB
MGR /ROWSET/ROW/MGR
HIREDATE /ROWSET/ROW/HIREDATE
SAL /ROWSET/ROW/SAL
COMM /ROWSET/ROW/COMM
DEPTNO /ROWSET/ROW/DEPTNO

BithunC · August 21, 2012, 3:24am

Ya raj ..thanks a lot ... now the output looks same as per i requierd.
just suggest me one thing .. should i save that code sample.ksh with starting a header line #!/usr/bin/ksh ??
if so then in the time of execution should i write like this :

ksh sample.ksh sample.xml

---------- Post updated at 12:54 PM ---------- Previous update was at 12:47 PM ----------

yes .. its working as i wrote starting with header line #!/usr/bin/ksh

ksh sample.ksh > sample.txt

Thanx Raj ...

raj_saini20 · August 21, 2012, 3:25am

No need to give file name at run time as it is mentioned in script it self

BithunC · August 21, 2012, 3:49am

Raj, you gave ROW/@num as hard-coded but if there more than attribute and some kind of parent-child element in this case how we can solve?

i am giving here one onother example :

sample2.xml

 
<FORMINFO FORMVERSION="12-2009" DOCID="CHS-101228-01340-2" FILENUM="776719" CASE_NO="10660990" FORMNUM="BPOCHASE" VERSION="3.6" VENDOR="UTLSValuations" MAINFORM="BPOCHASE">
<SUBJECT>
 <LOANNUM>100001416010918632</LOANNUM>
 <RELATIONSHIPNUM>UTLS Default Services</RELATIONSHIPNUM>
 <REONUM />
 <ADDR>
   <STREET>32416 N 44TH PL</STREET>
   <CITY>CAVE CREEK</CITY>
   <STATEPROV>AZ</STATEPROV>
   <ZIP>85331</ZIP>
 </ADDR>
 <COUNTY />
 <ASSESPARCEL>
   <NUM>211-34-168</NUM>
 </ASSESPARCEL>
 <BORROWER />
 <SOLDLISTED VALUE="YES" />
 <PRICEREDUCTION NUM="1">
   <DOM />
   <PRICE />
 </PRICEREDUCTION>
 <PRICEREDUCTION NUM="2">
   <DOM />
   <PRICE />
 </PRICEREDUCTION>
 <PRICEREDUCTION NUM="3">
   <DOM />
   <PRICE />
 </PRICEREDUCTION>
 <PROJ>
   <TYPE>SFRD</TYPE>
   <DESCRIPTION>SFRD</DESCRIPTION>
 </PROJ>
 <HOMEOWNERASSNFEE>300</HOMEOWNERASSNFEE>
 <HOMEOWNERASSNRESPONSE>YEARLY</HOMEOWNERASSNRESPONSE>
 <HOA>
   <FEECURRENT>YES</FEECURRENT>
   <DELINQUENCIES />
   <MAINTENANCE>INSURANCE</MAINTENANCE>
   <COMPANY />
   <PHONE>(623) 572-7579</PHONE>
   <LEGALISSUES />
 </HOA>
 <HOMEOWNERASSNDESC />
 <PROJECT>
   <NAME />
 </PROJECT>
 <CURRENTOCCUPANT>VACANT</CURRENTOCCUPANT>
 <CURRENTOWNER>MAINRESIDENCE</CURRENTOWNER>
</SUBJECT>
</FORMINFO>

output2.txt

 
FORMVERSION /FORMINFO/@FORMVERSION
DOCID /FORMINFO/@DOCID
FILENUM /FORMINFO/@FILENUM
CASE_NO /FORMINFO/@CASE_NO
FORMNUM /FORMINFO/@FORMNUM
VERSION /FORMINFO/@VERSION
VENDOR /FORMINFO/@VENDOR
MAINFORM /FORMINFO/@MAINFORM
LOANNUM /FORMINFO/SUBJECT/LOANNUM
RELATIONSHIPNUM /FORMINFO/SUBJECT/RELATIONSHIPNUM
REONUM /FORMINFO/SUBJECT/REONUM
CITY /FORMINFO/SUBJECT/ADDR/CITY
STATEPROV /FORMINFO/SUBJECT/ADDR/STATEPROV
STREET /FORMINFO/SUBJECT/ADDR/STREET
ZIP/FORMINFO/SUBJECT/ADDR/ZIP
....
....

I faced problem on more than one child tag of a single parent and more than one attribute.

Thanx in advance..

raj_saini20 · August 22, 2012, 3:38am

try this

awk 'BEGIN{f=0}
(/^<FORMINFO /){
		var="FORMINFO/";
		for(i=2;i<=NF;i++)
			{
				split($i,a,"=");
				print a[1]" /"var"@"a[1]
			}
		f=1;
		next
	}
(f==1){ n=split($0,a,"<|>");
	x=n-1;
	if(a[2] ~ /^\//)
		{
			
			if(a[2] ~ /FORMINFO/) f=0;
			sub(a[2],"",var);next
		};
	if(($0 !~ /\//))
		{	split(a[2],b,FS);
			var=substr(var,1)b[1]"/";next
		};
	split(a[2],b,FS);
	print b[1]" /" var b[1]
	
	}' sample.xml

Output is

FORMVERSION /FORMINFO/@FORMVERSION
DOCID /FORMINFO/@DOCID
FILENUM /FORMINFO/@FILENUM
CASE_NO /FORMINFO/@CASE_NO
FORMNUM /FORMINFO/@FORMNUM
VERSION /FORMINFO/@VERSION
VENDOR /FORMINFO/@VENDOR
MAINFORM /FORMINFO/@MAINFORM
LOANNUM /FORMINFO/SUBJECT/LOANNUM
RELATIONSHIPNUM /FORMINFO/SUBJECT/RELATIONSHIPNUM
REONUM /FORMINFO/SUBJECT/REONUM
STREET /FORMINFO/SUBJECT/ADDR/STREET
CITY /FORMINFO/SUBJECT/ADDR/CITY
STATEPROV /FORMINFO/SUBJECT/ADDR/STATEPROV
ZIP /FORMINFO/SUBJECT/ADDR/ZIP
COUNTY /FORMINFO/SUBJECT/COUNTY
NUM /FORMINFO/SUBJECT/ASSESPARCEL/NUM
BORROWER /FORMINFO/SUBJECT/BORROWER
SOLDLISTED /FORMINFO/SUBJECT/SOLDLISTED
DOM /FORMINFO/SUBJECT/PRICEREDUCTION/DOM
PRICE /FORMINFO/SUBJECT/PRICEREDUCTION/PRICE
DOM /FORMINFO/SUBJECT/PRICEREDUCTION/DOM
PRICE /FORMINFO/SUBJECT/PRICEREDUCTION/PRICE
DOM /FORMINFO/SUBJECT/PRICEREDUCTION/DOM
PRICE /FORMINFO/SUBJECT/PRICEREDUCTION/PRICE
TYPE /FORMINFO/SUBJECT/PROJ/TYPE
DESCRIPTION /FORMINFO/SUBJECT/PROJ/DESCRIPTION
HOMEOWNERASSNFEE /FORMINFO/SUBJECT/HOMEOWNERASSNFEE
HOMEOWNERASSNRESPONSE /FORMINFO/SUBJECT/HOMEOWNERASSNRESPONSE
FEECURRENT /FORMINFO/SUBJECT/HOA/FEECURRENT
DELINQUENCIES /FORMINFO/SUBJECT/HOA/DELINQUENCIES
MAINTENANCE /FORMINFO/SUBJECT/HOA/MAINTENANCE
COMPANY /FORMINFO/SUBJECT/HOA/COMPANY
PHONE /FORMINFO/SUBJECT/HOA/PHONE
LEGALISSUES /FORMINFO/SUBJECT/HOA/LEGALISSUES
HOMEOWNERASSNDESC /FORMINFO/SUBJECT/HOMEOWNERASSNDESC
NAME /FORMINFO/SUBJECT/PROJECT/NAME
CURRENTOCCUPANT /FORMINFO/SUBJECT/CURRENTOCCUPANT
CURRENTOWNER /FORMINFO/SUBJECT/CURRENTOWNER

Is this output you required

BithunC · August 22, 2012, 10:09am

yes raj .. thanx for ur extreme help ....
this is exactly what i need ...
this one is perfect for one xml file ..but in our scenario we have a share folder where almost 10000 xml will be stored every month.
there i can create one param file contains all xml file names with extension.
now my question is ... is it possible to pass that param file into ur above code which will give output as one .txt file contains all element name along with xpath as ur output was.
there might be 200 common elements in all xml files and rest of all are might be new....so we have have to make one superset of all elements and have to put down in one test file along with the xpath.

note : actually we are building sas mapping script to load all xml to oracle db, previously we did manual copy paset work for all xml and it took 10 days to cover almost all elements
kindly suggest, do we can load all xml element data to oracle db using shell script?

raj_saini20 · August 23, 2012, 5:28am

yes with script anything can be done

But provide example having your scenario

BithunC · August 23, 2012, 6:04am

Raj,

Here i attached 5 xml files for your reference. but in our share-path more than 10000 this type of xml file will be comming in every month.
we have to store all data into a oracle table (this can be more than one table if required).
at very begining we dont have any fixed oracle table, we have created three table as per elements we found in xml file.
Through SAS we successfully load all xml files into different oracle tables, but during that SAS script writing we have to write all elements name along with xpath manually and that was a truely boring work for all files.
by shell script we made one param file containing all names of xml files and passed that param file to our SAS code. SAS read all xml one by one and insert all data into oracle tables.
Now in coming week probably we will not be able to access SAS and we have to pull all xml data from that xml files in share-path into oracle table by one shell script.

now you suggest me is this possible or not.... check the xml files(attached) and give me some idea.

BithunC · August 24, 2012, 2:29am

anyone .... any helping hand... on post#11

BithunC · August 24, 2012, 2:42am

Raj,

atleast help me on your post#8
using your code i can get required output.
i have attached 5 xml files on post#11
i can use param file like fileNames.txt(attached here) and passing this file into your code how i can generate one output.txt which contains all elements and xpath(like your output) of all 5 files together.
Advance thanx for your kindful help.
--Bithun

raj_saini20 · August 24, 2012, 3:39am

filenames.txt contains those 5 xml file names.

what exactly you want to do

if you want to select only these file names for which output is required then try

cat filenames.txt | xargs -n 1 my_awk_cmd

BithunC · August 24, 2012, 3:47am

i just want to pass the fileNames.txt as a parameter file which contains 5 xml file-names.
just pass the file name one by one into your code and append the result in output.txt
otherwise you suggest how i can extract all elements and xpath of all xml files by one script.

---------- Post updated at 01:17 PM ---------- Previous update was at 01:16 PM ----------

in post#11 i attached 5 XML files for your reference..

raj_saini20 · August 24, 2012, 4:04am

use same i mentioned in my last post or use while loop

BithunC · August 24, 2012, 4:14am

u mentioned file name in ur code..but in case of more than one file what i wud do??

raj_saini20 · August 24, 2012, 5:26am

while read f_name 
do
awk 'BEGIN{f=0}
(/^<FORMINFO /){
		var="FORMINFO/";
		for(i=2;i<=NF;i++)
			{
				split($i,a,"=");
				print a[1]" /"var"@"a[1]
			}
		f=1;
		next
	}
(f==1){ n=split($0,a,"<|>");
	x=n-1;
	if(a[2] ~ /^\//)
		{
			
			if(a[2] ~ /FORMINFO/) f=0;
			sub(a[2],"",var);next
		};
	if(($0 !~ /\//))
		{	split(a[2],b,FS);
			var=substr(var,1)b[1]"/";next
		};
	split(a[2],b,FS);
	print b[1]" /" var b[1]
	
	}' $f_name
done < filename.txt

BithunC · September 5, 2012, 4:17am

Thank you Raj,
if you get time kindly run your above code with my 5 xml files (attached in post#11) and txt file as parameter (attached in post#13).
when i run txt file through your last code i got many types of output there, where some output is matching with my expected result and some is totally difference. i guess there need to be a small modification in your code to get the proper result from that 5 xml files.