Hi,
I have the XML content below, all on a single line:
<lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int></lst><lst name="status"><lst name=""><str name="name"/><str name="instanceDir">/var/www/search/current/Search/solr/./</str><str name="dataDir">/data/www/search/shared/indexes/</str><date name="startTime">2012-08-09T13:48:35.584Z</date><long name="uptime">925837154</long><lst name="index"><int name="numDocs">205235</int><int name="maxDoc">205235</int><long name="version">1326998109779</long><bool name="optimized">true</bool><bool name="current">true</bool><bool name="hasDeletions">false</bool><str name="directory">org.apache.lucene.store.MMapDirectory:org.apache.lucene.store.MMapDirectory@/var/www/search/shared/indexes/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@a26ff14</str><date name="lastModified">2012-08-20T06:56:43Z</date></lst></lst></lst>
I want to extract whatever value is inside this element: <int name="numDocs">205235</int>
If you post an awk or sed solution, please explain it, as that would be very helpful for beginners to learn and understand.
Thanks
sed -n '/.*<int name="numDocs">\([^<]*\)<.*/s//\1/p' file
With awk:
awk 'sub(/.*<int name="numDocs">/,""){print $0+0}' file
Thank you very much.
Could you please explain the parts of each command?
Best
Ashok
sed
.*
--> any character, any number of times (including zero)
<int name="numDocs">
--> the required pattern, of course
\([^<]*\)<
--> a tagged regular expression (TRE) that stores all the characters (anything except <) up to the next left chevron (<)
.*
--> the remaining characters in the line.
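In a line matching this pattern, replace the whole line with the text captured by the TRE (\1) and print it (p). The empty pattern // in the s command reuses the most recently used regular expression, and -n suppresses the default output, so only lines where the substitution happened are printed.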
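To try the sed command without the full Solr output, here is a small sketch. The sample line is made up for illustration (a trimmed version of the XML in the question), and the variable names line and result are only placeholders.

```shell
# Shortened, made-up sample line modelled on the XML in the question.
line='<lst name="index"><int name="numDocs">205235</int><int name="maxDoc">205235</int></lst>'

# -n turns off automatic printing; the trailing p prints only the line
# on which the substitution succeeded. \1 is the text captured by \( \).
result=$(printf '%s\n' "$line" | sed -n '/.*<int name="numDocs">\([^<]*\)<.*/s//\1/p')

printf '%s\n' "$result"
```

A line that does not contain the numDocs element produces no output at all, because the address in front of the s command never matches.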
---
awk
sub(/.*<int name="numDocs">/,"")
--> in each line read, try to delete everything up to and including <int name="numDocs">.
If this substitution is successful, sub() returns 1 and the corresponding action is executed.
The action adds 0 to what remains of the record/line. This forces awk to treat it as a number, which keeps only the leading numeric part (205235) and drops the text after it; that number is then printed.
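A quick way to check both halves of that explanation (the numeric conversion and the "no match, no action" behaviour) is the sketch below. The two sample lines are made up, and num and none are just placeholder variable names.

```shell
# Made-up sample line containing the element we are after.
line='<int name="numDocs">205235</int><int name="maxDoc">205235</int>'

# After sub(), $0 is: 205235</int><int name="maxDoc">205235</int>
# Adding 0 converts it to a number, which keeps only the leading digits.
num=$(printf '%s\n' "$line" | awk 'sub(/.*<int name="numDocs">/,""){print $0+0}')

# A line without the pattern: sub() returns 0, so the action is skipped
# and nothing is printed.
none=$(printf '%s\n' '<int name="maxDoc">7</int>' | awk 'sub(/.*<int name="numDocs">/,""){print $0+0}')

printf 'num=%s none=%s\n' "$num" "$none"
```

Note that the +0 trick only makes sense for numeric values; a text value would be converted to 0.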
Good explanation.
On the awk side: what if the value contains characters, like "PASS", instead of a number? For example:
<str name="status">PASS</str>
awk 'sub(/.*<str name="status">/,""){print $0}'
Result: PASS</str>
Could you please advise.
I had assumed a number in the field. Nevertheless, try:
awk 'match($0,/<str name="status">[^<]+</){print substr($0,RSTART+19,RLENGTH-20)}' file
This uses the match() function to match the pattern in the input line. If no match is found, match() returns 0 and no further processing is done on the line. If multiple matches are possible, match() only matches the first one (too many matches :)) and sets the special variables RSTART and RLENGTH.
RSTART --> starting position in the line where the match was found.
RLENGTH --> length of the match made.
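Using the values of these 2 variables, we print the required substring. The opening tag <str name="status"> is 19 characters long, so the value starts at RSTART+19; the match also includes the closing <, so the length of the value is RLENGTH-20.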
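If you would rather not count characters by hand, one possible variation is to pass the opening tag in as an awk variable and let length() work out the offsets. This is only a sketch: the variable name tag and the sample line are assumptions for illustration, not part of the original command.

```shell
# Made-up sample line with a text value instead of a number.
line='<lst><str name="status">PASS</str><str name="other">x</str></lst>'

status=$(printf '%s\n' "$line" | awk -v tag='<str name="status">' '
  # Build the pattern from the tag: the tag, then one or more non-< characters, then <.
  match($0, tag "[^<]+<") {
    # Skip over the tag to reach the value; the value length is the match
    # length minus the tag and minus the single closing < character.
    print substr($0, RSTART + length(tag), RLENGTH - length(tag) - 1)
  }')

printf '%s\n' "$status"
```

This works here because the tag contains no characters that are special in a regular expression; a tag with characters such as . or [ would need escaping first.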