Extract a pattern from an XML file

ashokvpp · August 20, 2012, 3:37am

Hi,

I have the XML content below, all on a single line:

<lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int></lst><lst name="status"><lst name=""><str name="name"/><str name="instanceDir">/var/www/search/current/Search/solr/./</str><str name="dataDir">/data/www/search/shared/indexes/</str><date name="startTime">2012-08-09T13:48:35.584Z</date><long name="uptime">925837154</long><lst name="index"><int name="numDocs">205235</int><int name="maxDoc">205235</int><long name="version">1326998109779</long><bool name="optimized">true</bool><bool name="current">true</bool><bool name="hasDeletions">false</bool><str name="directory">org.apache.lucene.store.MMapDirectory:org.apache.lucene.store.MMapDirectory@/var/www/search/shared/indexes/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@a26ff14</str><date name="lastModified">2012-08-20T06:56:43Z</date></lst></lst></lst>

I want to extract whatever value is inside this element: <int name="numDocs">205235</int>

If you post a solution with awk or sed, please explain it, as that would be very helpful for beginners to learn and understand.

Thanks

elixir_sinari · August 20, 2012, 3:45am

sed -n '/.*<int name="numDocs">\([^<]*\)<.*/s//\1/p' file

With awk:

awk 'sub(/.*<int name="numDocs">/,""){print $0+0}' file

ashokvpp · August 20, 2012, 4:31am

Thank you very much.

Could you please explain the parts of each command?

Best
Ashok

elixir_sinari · August 20, 2012, 4:46am

sed

.* --> Any character any number of times
<int name="numDocs"> --> the required pattern, of course
\([^<]*\)< --> a tagged regular expression (TRE) that stores all the characters (except <) up to the first left chevron (<)
.* --> the remaining characters in the line.
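In a line matching this pattern, substitute the whole line (the empty // reuses the previously matched pattern) with the TRE (\1) and print it (p). The -n option suppresses sed's default output, so only the substituted lines are printed.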
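
A minimal sketch for trying this out without the full file. The sample line is shortened from the XML in the question, and the variable names (line, numdocs) are made up for illustration. The address and the empty // are folded into a single s command here, which should behave the same way:

```shell
# Hypothetical sample: a shortened version of the one-line XML from the question.
line='<lst name="index"><int name="numDocs">205235</int><int name="maxDoc">205235</int></lst>'

# -n turns off automatic printing; the trailing p prints only lines
# where the substitution actually happened.
numdocs=$(printf '%s\n' "$line" |
  sed -n 's/.*<int name="numDocs">\([^<]*\)<.*/\1/p')

echo "$numdocs"   # prints 205235
```

A line without a numDocs element produces no output at all, because the substitution fails and nothing is printed.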

---

awk

sub(/.*<int name="numDocs">/,"") --> in each line read, try to delete everything up to and including <int name="numDocs">.
If this substitution is successful, sub() returns 1 and the corresponding action is executed.
The action adds 0 to the remaining record/line. This forces awk to treat it as a number, so only the leading number (205235 here) is kept, and that number is printed.
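
To see what the +0 does, the same substitution can be run with and without it. This is only a sketch: the sample line is shortened from the question, and the variable names (line, rest, num) are made up for illustration:

```shell
# Hypothetical sample: a shortened version of the one-line XML from the question.
line='<int name="numDocs">205235</int><int name="maxDoc">205235</int>'

# Without +0: everything after the deleted prefix is printed as a string.
rest=$(printf '%s\n' "$line" |
  awk 'sub(/.*<int name="numDocs">/,""){print $0}')

# With +0: awk converts the string to a number, which uses only the
# leading digits and ignores the rest.
num=$(printf '%s\n' "$line" |
  awk 'sub(/.*<int name="numDocs">/,""){print $0+0}')

echo "$rest"   # prints 205235</int><int name="maxDoc">205235</int>
echo "$num"    # prints 205235
```

This is also why the trick only works for numeric values: a string with no leading number converts to 0.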

ashokvpp · August 20, 2012, 5:04am

Good explanation.

On the awk side, what if the value were text such as "PASS" instead of a number? For example, with

<str name="status">PASS</str>

awk 'sub(/.*<str name="status">/,""){print $0}'

Result: PASS</str>

Could you please advise.

elixir_sinari · August 20, 2012, 5:26am

I had assumed a number in the field. Nevertheless, try:

awk 'match($0,/<str name="status">[^<]+</){print substr($0,RSTART+19,RLENGTH-20)}' file

This uses the match() function to match the pattern in the input line. If no match is found, match() returns 0 and no further processing is done on the line. If multiple matches are possible, match() only matches the first one (too many matches :)) and sets the special variables RSTART and RLENGTH.
RSTART --> starting position in the line where the match was found.
RLENGTH --> length of the match made.
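Using the values of these 2 variables, we print the required substring: RSTART+19 skips the 19 characters of <str name="status">, and RLENGTH-20 drops those 19 characters plus the trailing <.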
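
The hard-coded 19 and 20 only fit this one tag. A possible variant, sketched here under the assumption that the opening tag contains no regex special characters, passes the tag in with -v and lets length() compute the offsets. The names tag, re, n, line and status are made up for illustration:

```shell
# Hypothetical sample line containing a text value instead of a number.
line='<lst name="x"><str name="status">PASS</str></lst>'

status=$(printf '%s\n' "$line" | awk -v tag='<str name="status">' '
  BEGIN {
    re = tag "[^<]+<"   # opening tag, then the value, then the next "<"
    n  = length(tag)    # 19 for this particular tag
  }
  # match() returns 0 (false) when the tag is absent, so the action is skipped.
  match($0, re) {
    # Skip the n characters of the tag; drop the tag and the trailing "<".
    print substr($0, RSTART + n, RLENGTH - n - 1)
  }')

echo "$status"   # prints PASS
```

Changing the value given to -v tag should be enough to pull out a different element, as long as the tag text stays free of characters such as . or [ that mean something in a regex.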