To search for a particular tag in XML, collate all similar tag values, and display their count

srkmish · July 27, 2014, 12:29pm

Basically, I want to do the following. Suppose there is a tag called object1. I want to group the lines by the value of that tag, print each group under a heading for that value, and show the number of XML lines in the group. Please help.

 
File:
<xml><object1>house</object1><object2>child</object2>
<xml><object1>book</object1><object2>tree</object2>
<xml><object1>house</object1><object2>roof</object2>
 
o/p:
 
House: (Count - 2)
<xml><object1>house</object1><object2>child</object2>
<xml><object1>house</object1><object2>roof</object2>
 
Book: (Count - 1)
<xml><object1>book</object1><object2>tree</object2>

bakunin · July 27, 2014, 1:45pm

Your question makes a few assertions about your input i would like to verify before i start to suggest anything:

You imply that the values are not spanning several lines, which would be legal in XML:

<xml><object1>foo bar</object1>
<object1>foo
bar</object1></xml>

Basically the two lines would be equivalent in XML, but maybe (?) not in your requirement.

Furthermore, what about blanks and other whitespace? Is "foo bar" equivalent to "foo   bar" (several blanks)?

I hope this helps.

bakunin

Akshay_Hegde · July 27, 2014, 2:00pm

Try something like this

akshay@nio:/tmp$ cat file
<xml><object1>house</object1><object2>child</object2>
<xml><object1>book</object1><object2>tree</object2>
<xml><object1>house</object1><object2>roof</object2>

akshay@nio:/tmp$ awk -F'[><]' 'NF{ c[$5]++; d[$5] = d[$5] ? d[$5] ORS $0 : $0}END{for(i in d) print i ": (count - " c[i] ")" RS d[i] RS  }' file

book: (count - 1)
<xml><object1>book</object1><object2>tree</object2>

house: (count - 2)
<xml><object1>house</object1><object2>child</object2>
<xml><object1>house</object1><object2>roof</object2>

srkmish · July 28, 2014, 3:25am

Basically, the file will be a collection of a huge number of XML records, each on its own line, and a tag will never span multiple lines. I actually want a generic method: the command should scan for the "object1" tag, extract the value between <object1> and </object1>, and display all the XML lines containing that value along with their count.

Hey, this works perfectly. Thanks. However, can you suggest a generic method to do this? I want to search for the "object1" tag in the XML, take its value, and display all lines containing this value together with their count. Can you explain your command a bit so I can understand the code? I want to extend this command later so that I can search for other tag values and display content accordingly.

srkmish · July 29, 2014, 12:16pm

Hey guys, i would be really grateful if anyone can explain the code that akshay wrote.

awk -F'[><]' 'NF{ c[$5]++; d[$5] = d[$5] ? d[$5] ORS $0 : $0}END{for(i in d) print i ": (count - " c[i] ")" RS d[i] RS  }' file

Chubler_XL · July 29, 2014, 5:45pm

OK I'll give it a whirl:

Firstly I'll break it into multiple lines for ease of reading:

awk -F'[><]' '
NF {
  c[$5]++
  d[$5] = d[$5] ? d[$5] ORS $0 : $0
}
END{
   for(i in d)
      print i ": (count - " c[i] ")" RS d[i] RS
}' file

NF { examine lines that have 1 or more fields (ie non-blank lines).

-F'[><]' This argument sets awk's field separator to < or >. awk splits each line on these characters and assigns the pieces to $1 through $NF.

So for <xml><object1>house</object1><object2>child</object2>

we get:

$1 = ""
$2 = "xml"
$3 = ""
$4 = "object1"
$5 = "house"
$6 = "/object1"

c[$5]++ creates an associative array c[] with field #5 as the key and increments the value (c[house]=c[house]+1), so it's a count of the number of times each tag value appears.

d[$5] = d[$5] ? d[$5] ORS $0 : $0 if d[$5] is not null/blank then append ORS (output record separator, which is newline in this case) and the whole input line to it; otherwise assign the whole input line to it.

The END block goes through all the keys in the d[] array and prints each key with its count (c[i]), followed by all input lines that contain that key (d[i], the value of the d[] array element).
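
To see the field splitting for yourself, here is a small sketch that prints every field awk produces for the sample line. The variable names line and fields are made up for this illustration; nothing else is assumed beyond the -F'[><]' separator used above.

```shell
# One sample record from the thread.
line='<xml><object1>house</object1><object2>child</object2>'

# Split on < or > and print each field with its number.
fields=$(printf '%s\n' "$line" |
  awk -F'[><]' '{ for (i = 1; i <= NF; i++) printf "$%d = \"%s\"\n", i, $i }')

printf '%s\n' "$fields"
```

The first six lines of output should match the list above; the remaining fields cover the object2 tag in the same way.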

srkmish · July 30, 2014, 3:56am

Woah chubler, this is fantastic. Thanks. That cleared up things a lot for me.

But what if the <object1> tag value is not necessarily in the $5 position? How do I search for the tag then and add its value to the array?

srkmish · August 19, 2014, 9:44am

Anyone guys???

Chubler_XL · August 19, 2014, 4:13pm

Below should do what you asked for. However, beware: this isn't a full XML parser, and many things can still trip it up, for example:

<xml><object1>
value</object1></xml>

awk -F'[><]' '
NF {
  pos=0
  for(i=1;i<NF;i++) if($i=="object1") pos=i+1
  if(pos) {
      c[$pos]++
      d[$pos] = d[$pos] ? d[$pos] ORS $0 : $0
  }
}
END{
   for(i in d)
      print i ": (count - " c[i] ")" RS d[i] RS
}' infile
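
If you want to reuse this for other tags, one possible variation is to pass the tag name in with -v instead of hard-coding it. This is only a sketch: the variable name tag, the temporary sample file, and the choice to stop at the first matching tag on a line are my assumptions, not part of the script above.

```shell
# Sample input: object1 is not always in the same position.
infile=$(mktemp)
cat > "$infile" <<'EOF'
<xml><object1>house</object1><object2>child</object2>
<xml><object2>tree</object2><object1>book</object1>
<xml><object1>house</object1><object2>roof</object2>
EOF

# tag is a hypothetical variable name; change its value to group by another tag.
report=$(awk -F'[><]' -v tag="object1" '
NF {
  pos = 0
  # The tag name is a field of its own; its value is the next field.
  # Stop at the first match on the line.
  for (i = 1; i < NF; i++) if ($i == tag) { pos = i + 1; break }
  if (pos) {
    c[$pos]++
    d[$pos] = (d[$pos] != "") ? d[$pos] ORS $0 : $0
  }
}
END {
  # The order of keys in a for-in loop is unspecified in awk.
  for (k in d)
    print k ": (count - " c[k] ")" ORS d[k] ORS
}' "$infile")

printf '%s\n' "$report"
rm -f "$infile"
```

Running it with -v tag="object2" would group the same lines by the object2 value instead.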

srkmish · August 21, 2014, 3:58am

Yes, this is working fine. Thanks. I ran into another problem. Before the XML starts on each line, there is some string. I need to run this awk script only on those lines which have the string "Exception Found" before the XML. I guess I have to add && /Exception Found/, but I'm confused about where to add it in this awk script.
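
One place the condition could go is the pattern in front of the main block, so the block only runs for matching lines. A minimal sketch, assuming the prefix and the XML sit on the same line; the sample lines and the variable name filtered are invented for illustration. Since a line that matches /Exception Found/ is never blank, the pattern can replace NF rather than be combined with it (NF && /Exception Found/ would behave the same).

```shell
# Invented sample data: only two lines carry the "Exception Found" prefix.
infile=$(mktemp)
cat > "$infile" <<'EOF'
Exception Found <xml><object1>house</object1><object2>child</object2>
All good <xml><object1>book</object1><object2>tree</object2>
Exception Found <xml><object1>house</object1><object2>roof</object2>
EOF

filtered=$(awk -F'[><]' '
/Exception Found/ {
  # Same logic as before, but only for lines matching the pattern.
  pos = 0
  for (i = 1; i < NF; i++) if ($i == "object1") pos = i + 1
  if (pos) {
    c[$pos]++
    d[$pos] = d[$pos] ? d[$pos] ORS $0 : $0
  }
}
END {
  for (i in d)
    print i ": (count - " c[i] ")" RS d[i] RS
}' "$infile")

printf '%s\n' "$filtered"
rm -f "$infile"
```

Note that the regex matches "Exception Found" anywhere on the line; if it must appear before the XML, a stricter pattern such as /Exception Found.*<xml>/ might be closer to what you need.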