I have an XML file with around 1 billion rows, and I am trying to find the number of times a particular tag occurs in it. The solution I am using works but takes a lot of time (~1 hr). Please help me find a more efficient way to do this.
[root@host dir]# time awk '/<Name>/' input | wc -l
1000000
real 0m7.802s
user 0m7.766s
sys 0m0.125s
[root@host dir]# time awk '/<Name>/ {i++} END {print i}' input
1000000
real 0m7.559s
user 0m7.485s
sys 0m0.074s
[root@host dir]# time grep -c "Name" input
1000000
real 0m0.158s
user 0m0.121s
sys 0m0.037s
[root@host dir]# time perl -ne '(/<Name>/) && $i++; END {print $i}' input
1000000
real 0m2.968s
user 0m2.928s
sys 0m0.040s
[root@host dir]# time sed -n '/<Name>/p' input | wc -l
1000000
real 0m3.716s
user 0m3.716s
sys 0m0.096s
Verdict: grep seems to be the quickest for this particular task among the utilities used above. Crudely extrapolating the results to a file with 1 billion blocks of entries, it should take about 158 s, or around 3 minutes.
I have tried all the options (grep, sed & awk), but none of them performs well when the file has 1 billion rows in it. There is one catch, though: the input XML file has all the tags in a single row, i.e. this single row becomes 1 billion rows after indentation.
This indentation is currently done manually. Can you guys help me with a command that indents the file first, so that the search command can then return results faster?
e.g. Right now the input file is:
I need a command to convert this file into the format below:
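One possible approach (a sketch assuming GNU sed and GNU grep; `infile.xml` stands in for your actual file name):

```shell
# Insert a newline between every adjacent pair of tags so each tag
# lands on its own line (GNU sed understands \n in the replacement):
sed 's/></>\n</g' infile.xml > indented.xml

# Alternatively, skip the reformatting entirely. grep -c counts matching
# *lines* (which would be just 1 here, since everything is on one row),
# but grep -o prints each match on its own line, so wc -l gives the
# occurrence count directly:
grep -o '<Name>' infile.xml | wc -l
```

The `grep -o` route avoids writing a second billion-row file just to count tags.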
See... this is what happens when you don't mention which OS and shell you're working on. The solution in post #9 was tested on RHEL with GNU bash and sed version 4.1.5.
You need to use a SAX-style (streaming) parser instead of DOM, so the whole file never has to be loaded into memory. Here is a Python implementation using expat:
#!/usr/bin/python
import xml.parsers.expat

count = 0

def start_element(name, attrs):
    # Called by expat for every opening tag; count the ones we care about.
    global count
    if name == "Name":
        count += 1

p = xml.parsers.expat.ParserCreate()
p.StartElementHandler = start_element
# ParseFile expects a binary file object.
with open('infile.xml', 'rb') as f:
    p.ParseFile(f)
print(count)
Running it on my little netbook:
$ wc -l infile.xml
1 infile.xml
$ time ./infile.py
100000
real 0m1.187s
user 0m1.168s
sys 0m0.016s
$ grep 'model name' /proc/cpuinfo
model name : Intel(R) Atom(TM) CPU N270 @ 1.60GHz
model name : Intel(R) Atom(TM) CPU N270 @ 1.60GHz
Please let us know how long it takes for 1 billion records.
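For completeness: if you only need a raw occurrence count and don't care about validating the XML structure, a plain byte search can beat even a streaming parser. A sketch (the `count_tag` helper is my own, not from this thread; it streams the file in chunks so the single enormous line never has to fit in memory):

```python
def count_tag(path, tag=b'<Name>', chunk_size=1 << 20):
    """Count occurrences of `tag` in the file at `path`, reading
    fixed-size chunks instead of whole lines."""
    count = 0
    carry = b''  # tail of the previous chunk, in case a tag straddles chunks
    with open(path, 'rb') as f:
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            data = carry + block
            count += data.count(tag)
            # The carry is shorter than the tag, so it can never hold a
            # complete match on its own -- nothing gets double-counted.
            carry = data[-(len(tag) - 1):]
    return count
```

Usage would be `print(count_tag('infile.xml'))`. Note this counts literal byte sequences, so it would also match the tag inside comments or CDATA, unlike the expat version above.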