Extract a value from an xml file

I have this XML file format and all in one line:

Fri Dec 23 00:14:52 2016 Logged Message:689|<?xml version="1.0" encoding="UTF-8"?><PORT_RESPONSE><HEADER><ORIGINATOR>XMG</ORIGINATOR><DESTINAT
ION>ENSEMBLE</DESTINATION><MESSAGE_ID>NXT107349698</MESSAGE_ID><MSGTYPE>PRI</MSGTYPE><TIMESTAMP>12232016061452</TIMESTAMP></HEADER><ADMIN><WIC
IS_REL_NO>5.0.0</WICIS_REL_NO><NNSP>9664</NNSP><OLSP>6529</OLSP><ONSP>6529</ONSP><REQ_NO>6664016358514349</REQ_NO><VER_ID_REQ>00</VER_ID_REQ><
VER_ID_RESP>00</VER_ID_RESP><RT>C</RT><RESP_NO>652901635838480144</RESP_NO><CD_TSENT>122220160614</CD_TSENT><REP>Port Center</REP><TEL_NO_REP>
000-207-8009</TEL_NO_REP><CHC></CHC><DD_T>122320160909</DD_T><NPQTY>00001</NPQTY></ADMIN><LINE_DATA><PORTED_NUM>990-799-1234</PORTED_NUM></LIN
E_DATA></PORT_RESPONSE>

How can I write a script to extract only the value 990-799-1234 which is the value for PORTED_NUM and store in a field.

I am using solaris 10 Unix box.

WHERE to store that number? Try

awk  'match ($0, /<PORTED_NUM>[0-9-]*<\/PORTED_NUM>/) {print substr ($0, RSTART+12, RLENGTH-25)}' file
990-799-1234
1 Like

Getting the following errors:

awk: syntax error near line 1
awk: bailing out near line 1

Please, try the following:

perl -nle '/<PORTED_NUM>([\d-]*)</ and print $1' mrn6430.xml
1 Like

Hello mrn6430,

Kindly change awk to nawk or /usr/xpg4/bin/awk and then it should fly.

Thanks,
R. Singh

1 Like

Hi,

I have some package here on my debian box which is called xml-twig-tools containing a program called xml_grep(a perl script) . I'm sure there are a bunch of xml-tools out there. I think within Solaris it would require to download some Perl Modules from CPAN for you and getting the script from here: xml_grep - search.cpan.org

With this tool I can do the following:

 xml_grep //PORTED_NUM --text_only your-file.xml

Output:

990-799-1234

or with xmlstarlet(a lot faster than the perl script, there's a package available for solaris) some compiled binary:

xmlstarlet sel -t -v //PORTED_NUM your-file.xml

If this is only a one-shot thing, you'll be better of with above awk or perl tips, since you do not have to fiddle with any installation. If you have more often to deal with XML those xmltools will help be a lot easier to handle.

1 Like

Hi,

IMO, if it is xml file with proper xml syntax then xmllint is better choice.

Read man xmllint for detailed information.

echo "cat //*[local-name()='PORT_RESPONSE']/*[local-name()='LINE_DATA']/*[local-name()='PORTED_NUM']/text()" | xmllint --shell input.xml | sed -e '/^\//d'

Gives desired output:

I removed

due to xml parser error,

I tried below xml content in single line.( unlike post #1 thrown errors when checked for xml syntax using xmllint )

Simulated file content:

cat input.xml
<?xml version="1.0" encoding="UTF-8"?><PORT_RESPONSE><HEADER><ORIGINATOR>XMG</ORIGINATOR><DESTINATION>ENSEMBLE</DESTINATION><MESSAGE_ID>NXT107349698</MESSAGE_ID><MSGTYPE>PRI</MSGTYPE><TIMESTAMP>12232016061452</TIMESTAMP></HEADER><ADMIN><WICIS_REL_NO>5.0.0</WICIS_REL_NO><NNSP>9664</NNSP><OLSP>6529</OLSP><ONSP>6529</ONSP><REQ_NO>6664016358514349</REQ_NO><VER_ID_REQ>00</VER_ID_REQ><VER_ID_RESP>00</VER_ID_RESP><RT>C</RT><RESP_NO>652901635838480144</RESP_NO><CD_TSENT>122220160614</CD_TSENT><REP>Port Center</REP><TEL_NO_REP>000-207-8009</TEL_NO_REP><CHC></CHC><DD_T>122320160909</DD_T><NPQTY>00001</NPQTY></ADMIN><LINE_DATA><PORTED_NUM>990-799-1234</PORTED_NUM></LINE_DATA></PORT_RESPONSE>
2 Likes

c00l

That works too:

xmllint file.xml --xpath '//PORTED_NUM/text()'

It was there even without installing as it is part of the basic libxml2-utils package here. It too seems that xmllint has a more complete degree of conformance to the XPATH-Specification than the other 2 tools.

1 Like

Hi,

xmllint file.xml --xpath '//PORTED_NUM/text()'

More simple :b:

However when i run, i get

xmllint --version
xmllint: using libxml version 20706
$ cat /etc/issue.net 
Debian GNU/Linux 8

xmllint --version
xmllint: using libxml version 20901
   compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML \
   Legacy C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Expr \ 
   Schemas Schematron Modules Debug Zlib Lzma 

Seems to be not the version, but just the choice of features at compile time:

$ cat /etc/issue.net 
Debian GNU/Linux 3.1

$ xmllint --version
xmllint: using libxml version 20616
   compiled with: DTDValid FTP HTTP HTML C14N Catalog XPath XPointer XInclude Iconv Unicode Regexps Automata Schemas

Hmm. Wrong. --xpath juest does not work with the old version.

However this one even works within the ancient debian:

 xmllint --shell file.xml <<<'cat //PORTED_NUM'

Resulting in this:

/ >  -------
<PORTED_NUM>990-799-1234</PORTED_NUM>
/ >

Now strip of the unnecessary junk around the value:

xmllint --shell file.xml <<<'cat //PORTED_NUM' | grep -oE '[0-9-]{8,}'

...with the following being a bit more general(using positive lookahead and lookbehind to require > berfore the pattern and < after)...

xmllint --shell file.xml <<<'cat //PORTED_NUM' | perl -ne '/(?<=>)(.*)(?=<)/ and print($1)'

Hi.

Gathering these suggestions together with the call to the older version of xmllint (posted by stomp) and adding xml2 to process the modified XML (posted by greet_sed):

#!/usr/bin/env bash

# @(#) s1       Demonstrate string extraction from XML file.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
em() { pe "$*" >&2 ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C xml_grep xmlstarlet xmllint xml2

FILE=${1-data1}
E=expected-output.txt

pl " Input data file $FILE, $(wc -l <$FILE) lines:"
cat $FILE

pl " Expected output:"
cat $E

pl " Results, xml_grep:"
xml_grep //PORTED_NUM --text_only $FILE

pl " Results, xmlstarlet:"
xmlstarlet sel -t -v //PORTED_NUM $FILE
pe

pl " Results, xmllint:"
xmllint $FILE --xpath '//PORTED_NUM/text()'
pe

pl " Results, xmllint:"
xmllint --shell $FILE <<<'cat //PORTED_NUM' |
perl -ne '/(?<=>)(.*)(?=<)/ and print($1)'
pe

pl " Results, xml2:"
xml2 < data1 |
awk -F= '/PORTED_NUM/ { print $2 }'

exit 0

producing:

$ ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution        : Debian 8.6 (jessie) 
bash GNU bash 4.3.30
xml_grep /usr/bin/xml_grep version 0.9
xmlstarlet - ( /usr/bin/xmlstarlet, 2014-09-14 )
xmllint: using libxml version 20901
xml2 - ( /usr/bin/xml2, 2012-04-16 )

-----
 Input data file data1, 1 lines:
<?xml version="1.0" encoding="UTF-8"?><PORT_RESPONSE><HEADER><ORIGINATOR>XMG</ORIGINATOR><DESTINATION>ENSEMBLE</DESTINATION><MESSAGE_ID>NXT107349698</MESSAGE_ID><MSGTYPE>PRI</MSGTYPE><TIMESTAMP>12232016061452</TIMESTAMP></HEADER><ADMIN><WICIS_REL_NO>5.0.0</WICIS_REL_NO><NNSP>9664</NNSP><OLSP>6529</OLSP><ONSP>6529</ONSP><REQ_NO>6664016358514349</REQ_NO><VER_ID_REQ>00</VER_ID_REQ><VER_ID_RESP>00</VER_ID_RESP><RT>C</RT><RESP_NO>652901635838480144</RESP_NO><CD_TSENT>122220160614</CD_TSENT><REP>Port Center</REP><TEL_NO_REP>000-207-8009</TEL_NO_REP><CHC></CHC><DD_T>122320160909</DD_T><NPQTY>00001</NPQTY></ADMIN><LINE_DATA><PORTED_NUM>990-799-1234</PORTED_NUM></LINE_DATA></PORT_RESPONSE> 

-----
 Expected output:
990-799-1234

-----
 Results, xml_grep:
990-799-1234

-----
 Results, xmlstarlet:
990-799-1234

-----
 Results, xmllint:
990-799-1234

-----
 Results, xmllint:
990-799-1234

-----
 Results, xml2:
990-799-1234

Observations:
1) the code for the early xmllint and xml2 need additional work, perl , awk , grep , etc. to isolate the string of interest.

2) A few of the codes (the ones that have a pe afterwards, seem to omit the trailing newline -- not an error, just something to be noted.

Details for xml2:

xml2    convert xml documents in a flat format (man)
Path    : /usr/bin/xml2
Version : - ( /usr/bin/xml2, 2012-04-16 )
Type    : ELF 64-bit LSB executable, x86-64, version 1 (SYSV ...)
Repo    : Debian 8.6 (jessie) 

Best wishes ... cheers, drl

Another awk:

awk -v k=PORTED_NUM '$1==k{print $2}' RS=\< FS=\> file

--
On Solaris use /usr/xpg4/bin/awk rather than awk

All worked. Thanks to all

In this forum have been lot of questions parse xml using awk.

I have published some solution in one location. Sort toolset to parse xml using awk. Basic idea is to change xml more for awk friendly format.

Enjoy.

How can I use this same perl command but o display alpha values? for ex if I want:

display the value of MSGCODE in this xml:

Thu Jan 26 13:44:21 2017 Logged Message:473|<?xml version="1.0" encoding="UTF-8"?><GENERIC_NOTIFICATION><HEADER><ORIGINATOR>SMG</ORIGINATOR><DESTINATION>ENSEMBLE</DESTINATION><MESSAGE_ID>NXT106361086</MESSAGE_ID><MSGTYPE>NOT</MSGTYPE><TIMESTAMP>01262017184421</TIMESTAMP></HEADER><REQ_NO>7778017026777200</REQ_NO><VER_ID_REQ>00</VER_ID_REQ><MSGDATA><PORTED_NUM>512-202-1767</PORTED_NUM><MSGCODE>SOA1012I</MSGCODE><MSGTEXT>Port Request Authorized by Old Service Provider 6646</MSGTEXT></MSGDATA></GENERIC_NOTIFICATION>

I am certainly not a perl expert, but from what I know about REs, the following should do what you want:

perl -nle '/<MSGCODE>([^<]*)</ and print $1' file

which, if file contains the data above, produces the output:

SOA1012I

Is that what you're trying to do?

1 Like

Perfect. Thank you so much