Need an efficient way to search for a tag in an XML file having millions of rows

Hi,

I have an XML file with around 1 billion rows in it, and I am trying to find the number of times a particular tag occurs. The solution I am using works but takes a lot of time (~1 hr). Please help me with a more efficient way to do this.

Let's say the input file is:

<Root>
     <Person>
           <Name>John</Name>
     </Person>
</Root>

This <Name> block can be present multiple times, and I need to find the count quickly (efficiently).

Thanks.

Hi Sheel,

I'm curious about your solution: what is it?

I would use xpath or something similar.

Regards,
Birei.

I am using a simple awk statement:

 awk '/<Name>/' inputfile | wc -l 
 
grep -c "<Name>" xmlfile

File 'input' contains 1 million entries of this block:

<Root>
    <Person>
        <Name>John</Name>
    </Person>
</Root>

And here's an analysis:

[root@host dir]# time awk '/<Name>/' input | wc -l
1000000

real    0m7.802s
user    0m7.766s
sys     0m0.125s
[root@host dir]# time awk '/<Name>/ {i++} END {print i}' input
1000000

real    0m7.559s
user    0m7.485s
sys     0m0.074s
[root@host dir]# time grep -c "Name" input
1000000

real    0m0.158s
user    0m0.121s
sys     0m0.037s
[root@host dir]# time perl -ne '(/<Name>/) && $i++; END {print $i}' input
1000000
real    0m2.968s
user    0m2.928s
sys     0m0.040s
[root@host dir]# time sed -n '/<Name>/p' input | wc -l
1000000

real    0m3.716s
user    0m3.716s
sys     0m0.096s

Verdict: grep seems to be the quickest at this particular task among the utilities tried above. Crudely extrapolating to a file with 1 billion blocks (1000x the test input), it should take about 158 s, i.e. under 3 minutes.
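One caveat on the grep timing (an editorial aside, not from the original benchmark): grep -c counts matching *lines*, not matches, so on a file where several tags share one line it undercounts. Where grep supports -o (GNU and BSD grep do; POSIX does not require it), printing each match on its own line gives a true occurrence count:

```shell
# sample file with two <Name> tags on a single line
printf '<Root><Name>A</Name><Name>B</Name></Root>\n' > sample.xml

grep -c '<Name>' sample.xml          # prints 1: one matching line
grep -o '<Name>' sample.xml | wc -l  # counts 2: one output line per match
```

On AIX, grep may not support -o; POSIX awk's gsub() can serve the same counting purpose.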

I have tried all the options (grep, sed & awk), but none of them perform well when the file has 1 billion rows. There is one catch, though: the input XML file has all the tags in a single row, i.e. this single row becomes 1 billion rows after indentation, and that indentation is currently done manually. Can you help me with a command that indents the file first, so that the search command can then return results faster?

E.g., right now the input file is all on one line; I need a command to convert it into the indented, multi-line format.
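If only the count is needed, re-indenting the file first may be unnecessary. Here is a sketch using POSIX awk (an editorial suggestion, untested on AIX): gsub() returns the number of substitutions it performed, so it can count matches inside a single long record. Note that some older awk implementations limit record length, which matters for a one-line file this large.

```shell
# tiny one-line stand-in for the real single-row file
printf '<R><Name>a</Name><Name>b</Name><Name>c</Name></R>\n' > input

# gsub() returns how many replacements it made in the current record,
# so summing its return value counts occurrences without any splitting
awk '{ n += gsub(/<Name>/, "") } END { print n }' input   # prints 3
```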

Your one-line input file is not well formed.

For a well-formed XML file, it doesn't matter whether it is one line or multi-line; try xpath. Here is an example:

$ cat infile
<?xml version="1.0" encoding="UTF-8"?><Root><Person><Name>John</Name></Person><Person><Name>John</Name></Person></Root>
$ xpath infile 'count(//Name)'
Query didn't return a nodeset. Value: 2

Regards,
Birei

I am using AIX and don't have the option to use xpath.

I was trying this command to format the file:

 sed 's/\>\</\>\\n\</g' input 

But this gives the output:

<?xml version="1.0" encoding="UTF-8"?>\n<Root>\n<Person>\n<Name>John</Name>\n</Person>\n<Person>\n<Name>John</Name>\n</Person>\n</Root>

If you escape the '\n' in the replacement part (\>\\n\<), how do you expect to see a line break?!

$ sed 's/></>\n</g' input
<?xml version="1.0" encoding="UTF-8"?>
<Root>
<Person>
<Name>John</Name>
</Person>
<Person>
<Name>John</Name>
</Person>
</Root>

This is what I get from the command you suggested:

<?xml version="1.0" encoding="UTF-8"?>n<Root>n<Person>n<Name>John</Name>n</Person>n<Person>n<Name>John</Name>n</Person>n</Root>

See... this is what happens when you don't mention which OS and shell you're working on. The solution in post #9 was tried on RHEL, GNU bash, sed version 4.1.5.

I did say AIX (post #8); it seems you missed it, but that's OK. I've tried tr, sed, and awk, but none are working. Please see if you can get me a solution.


Got it:

 sed "s/></>\\`echo -e '\n\r'`</g" input

Thanks all for your efforts.
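For what it's worth, the backtick/echo workaround can be avoided: POSIX sed allows a backslash followed by a literal newline in the replacement text, which should behave the same on AIX sed (untested there):

```shell
# one-line sample; each >< boundary becomes > newline <
printf '<a><b></b></a>\n' > input

sed 's/></>\
</g' input
```

The backslash at the end of the first line of the s/// replacement escapes an embedded literal newline, which every POSIX-conforming sed must accept, unlike the \n escape that GNU sed adds.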

Ah, yes! My bad. Sorry, mate.

You need to use a SAX parser instead of DOM. Here is a Python implementation:

#! /usr/bin/python

# Stream the file through expat (SAX-style): only element events are
# handled, so memory use stays constant regardless of file size.
import xml.parsers.expat

count = 0

def start_element(name, attrs):
    global count
    if name == "Name":
        count += 1

p = xml.parsers.expat.ParserCreate()
p.StartElementHandler = start_element

# expat's ParseFile() wants a binary-mode file object
with open('infile.xml', 'rb') as f:
    p.ParseFile(f)

print(count)

Running on my little netbook:


$ wc -l infile.xml
1 infile.xml

$ time ./infile.py
100000

real    0m1.187s
user    0m1.168s
sys     0m0.016s

$ grep 'model name' /proc/cpuinfo
model name      : Intel(R) Atom(TM) CPU N270   @ 1.60GHz
model name      : Intel(R) Atom(TM) CPU N270   @ 1.60GHz

Please let us know how long it takes for 1 billion records.