Cutting all xml tags out of a line

bathtime · February 22, 2018, 5:49am

I would like to search an XML file for a word (easily identified by its 'key' attribute), cut all of the tags out of the matching line (anything between a < and a >), and display the remaining text. I am running Debian and the mksh shell.

dictionary.sh:

#!/bin/sh

key='key="'$1'"><form'
tagIn='<'
tagOut='>'

awk -v vkey="$key" -v vtagIn="$tagIn" -vtagOut="$tagOut" \
'$0 ~ vkey {print "Found:     " vkey "\n"}       $0 ~ vtagIn ".*" vtagOut {sub(vtagOut ".*", ""); sub(".*" vtagIn, ""); printf $0;}' \
words.txt

words.txt

<entry id="n53" type="main" key="abolesco"><form opt="n"><orth extent="full" lang="la" opt="n">abolsc</orth></form><gramGrp opt="n"><itype opt="n"> olv, -, ere, </itype>incept. </gramGrp><sense id="n53.0" level="0" n="0" opt="n"><etym lang="la" opt="n">aboleo</etym>, <trans opt="n"><tr opt="n">to decay gradually, vanish, disappear, die out</tr></trans>: <foreign lang="la">nomen vetustate</foreign>, <usg opt="n">L.</usg>: <foreign lang="la">tanti gratia facti</foreign>, <usg opt="n">V.</usg> </sense></entry>
M
M
M
<entry id="n54" type="main" key="abolitio"><form opt="n"><orth extent="full" lang="la" opt="n">aboliti</orth></form><gramGrp opt="n"><itype opt="n"> nis, </itype><gen opt="n">f</gen> </gramGrp><sense id="n54.0" level="0" n="0" opt="n"><etym lang="la" opt="n">aboleo</etym>, <trans opt="n"><tr opt="n">an abolition</tr></trans>: <foreign lang="la">tributorum</foreign>, <usg opt="n">Ta.</usg>-<trans opt="n"><tr opt="n">An annulling</tr></trans>: <foreign lang="la">sententiae</foreign>, <usg opt="n">Ta.</usg> </sense></entry>

Run as:

$ ./dictionary.sh abolesco

The result is:

Found:     key="abolesco"><form

As you can see, the code finds the word, but I don't know how to remove the tags. I feel I'm quite off with this one.

I would like the result to be tagless and only for the word searched. Like this:

abolsc olv, -, ere, incept. aboleo to decay gradually, vanish, disappear, die out: nomen vetustate, L.: tanti gratia facti, V.

Even if someone could point me in the right direction that would be great.

RudiC · February 22, 2018, 6:27am

Try

awk -v vkey="$key" -v tagIn="$tagIn" -vtagOut="$tagOut" '
$0 ~ vkey       {print "Found:     " vkey "\n"
                 gsub (tagIn "[^" tagOut "]*" tagOut, "")
                 print
                }
' file
Found:     key="abolesco"><form

abolsc olv, -, ere, incept. aboleo, to decay gradually, vanish, disappear, die out: nomen vetustate, L.: tanti gratia facti, V.
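
For readers comparing this with the script in the question: the original appears to fail because `-vtagOut` defines an awk variable named `tagOut`, while the program text refers to `vtagOut`. That name is never set, so it is empty, and `sub(vtagOut ".*", "")` erases the whole line. The following is a minimal self-contained sketch of the same gsub idea, not a definitive implementation. The function name `strip_entry` and its argument order are made up for illustration, and it assumes each entry sits on a single line.

```shell
#!/bin/sh
# Sketch only: strip_entry is a hypothetical helper name, not from the thread.
# Assumes every <entry> occupies exactly one line of the file.
strip_entry() {
    # $1 = word to look up, $2 = dictionary file
    awk -v key="key=\"$1\"" '
        index($0, key) {            # literal substring test, so the word is not treated as a regex
            gsub(/<[^>]*>/, "")     # delete every run from "<" up to the next ">"
            print                   # print what is left of the line
        }
    ' "$2"
}
```

It could be called as `strip_entry abolesco words.txt`. Using `index()` instead of `$0 ~ vkey` is a design choice made here so that characters such as `.` in the search word are matched literally.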

drl · February 22, 2018, 9:55am

Hi.

In the thread "Extract a value from an xml file", post #11, there are examples of extraction using:

xml_grep /usr/bin/xml_grep version 0.9
xmlstarlet - ( /usr/bin/xmlstarlet, 2014-09-14 )
xmllint: using libxml version 20901
xml2 - ( /usr/bin/xml2, 2012-04-16 )

Best wishes ... cheers, drl

bathtime · February 22, 2018, 5:35pm

rudic:

Try

awk -v vkey="$key" -v tagIn="$tagIn" -vtagOut="$tagOut" '
$0 ~ vkey       {print "Found:     " vkey "\n"
   gsub (tagIn "[^" tagOut "]*" tagOut, "")
   print
   }
' file
Found:     key="abolesco"><form

abolsc olv, -, ere, incept. aboleo, to decay gradually, vanish, disappear, die out: nomen vetustate, L.: tanti gratia facti, V.

Thank you. This worked absolutely perfectly... until the XML I was using suddenly changed so that each entry was spread across several lines instead of sitting on one line, but I got that sorted out and all is fine.

Thanks. I'm thinking this will come in handy.
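
The thread does not show how the multi-line case was solved. One possible approach, sketched here under assumptions, is to collect the lines of an entry into a buffer and strip the tags once the closing `</entry>` is seen. The function name `strip_multiline_entry` is hypothetical, and the sketch assumes that `<entry ...>` and `</entry>` each appear at most once per line.

```shell
#!/bin/sh
# Sketch only: strip_multiline_entry is a hypothetical helper name.
# Assumes at most one <entry ...> and one </entry> per line.
strip_multiline_entry() {
    # $1 = word to look up, $2 = dictionary file
    awk -v key="key=\"$1\"" '
        /<entry[ >]/ { buf = ""; inside = 1 }            # a new entry starts: reset the buffer
        inside { buf = buf (buf == "" ? "" : " ") $0 }   # join the lines of the entry with spaces
        /<\/entry>/ {
            if (inside && index(buf, key)) {
                gsub(/<[^>]*>/, "", buf)                 # remove all tags from the joined entry
                gsub(/^ +| +$/, "", buf)                 # trim spaces left over at the ends
                print buf
            }
            inside = 0                                   # entry finished
        }
    ' "$2"
}
```

It could be called as `strip_multiline_entry abolesco words.txt`. Joining with a space keeps words on adjacent lines apart, at the cost of an occasional extra space where a line break fell between two tags.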