How do I extract parameter value after name="value" accurately?

kchinnam · February 12, 2016, 2:17pm

How do I extract initialHeapSize and maximumHeapSize values accurately?

        <jvmEntries xmi:id="JavaVirtualMachine_1337159909831" verboseModeClass="false" verboseModeGarbageCollection="true" verboseModeJNI="false" initialHeapSize="256" maximumHeapSize="512" runHProf="false" hprofArguments="" debugMode="false" debugArgs="-Djava.compiler=NONE -Xdebug -Xnoagent -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=7777" enericJvmArguments="-Dawt.headless=true -Xjit:disableIdiomRecognition -Dsun.net.inetaddr.ttl=120">

I want the code to handle potential variations in the input.

1. initialHeapSize=256 maximumHeapSize=512
2. initialHeapSize='256' maximumHeapSize='512'
3. initialHeapSize = 256 maximumHeapSize = 512
etc..

Done right, this would be very useful for extracting parameter values from config files or from the output of ps -ef, etc.

I have a working version that uses awk and cut, but it can't handle any of the variations in value format mentioned above.

Don_Cragun · February 12, 2016, 3:30pm

Show us your working shell script using awk and cut and maybe we can help you extend its capabilities.

What operating system and shell are you using?

kchinnam · February 12, 2016, 4:51pm

Here is the bash and Linux version:

GNU bash, version 3.2.51(1)-release (x86_64-suse-linux-gnu)

This is working, but I am concerned that it can go wrong in many ways.

        MIN_HEAP=`cat server.xml | grep -i "<jvmEntries" | sed -e 's/"/ /g' | awk '{print $11}'`
        MAX_HEAP=`cat server.xml | grep -i "<jvmEntries" | sed -e 's/"/ /g' | awk '{print $13}'`

Don_Cragun · February 12, 2016, 8:31pm

In the first post in this thread, you showed three lines from an XML file containing a single <jvmEntries> tag with its options. But, the code you showed us in the third post in this thread only works if that entire tag is contained on a single line (not on three lines). Is that tag on three lines or on one line?

Will the strings initialHeapSize and maximumHeapSize immediately followed by zero or more spaces immediately followed by an equal sign ever appear in your XML file outside of the <jvmEntries> tag?

You are performing a case-insensitive search for the tag jvmEntries. Do we also need to search for initialHeapSize and maximumHeapSize without regard to case, or are those strings always presented exactly as shown?

kchinnam · February 12, 2016, 10:19pm

My input is an XML file and <jvmEntries> is on a single line.
initialHeapSize and maximumHeapSize appear only once in the XML file, inside the <jvmEntries> tag.
The case-insensitive search is there to make it more robust. At present the input uses the name="value" format without spaces, but as I explained, the input can change and still be valid. How can the script be smarter?

If you are familiar with the Java run-time arguments used to start application processes, they include min and max heap size parameters that are visible in ps -auxww output. The ability to extract parameter values with some tolerance would help a great deal.

Aia · February 12, 2016, 10:20pm

The following produces the same results that you posted for MIN_HEAP and MAX_HEAP:

MIN_HEAP=$(perl -ne 'print /initialHeapSize[ \x27"=]+(\d+)/' server.xml)

echo $MIN_HEAP
256

MAX_HEAP=$(perl -ne 'print /maximumHeapSize[ \x27"=]+(\d+)/' server.xml)

echo $MAX_HEAP
512

That accommodates any of the following:

(initial|maximum)HeapSize '512'
(initial|maximum)HeapSize 512
(initial|maximum)HeapSize='512'
(initial|maximum)HeapSize = '512'
(initial|maximum)HeapSize = "512'
(initial|maximum)HeapSize = "512"
(initial|maximum)HeapSize = '512"
(initial|maximum)HeapSize= "512"
(initial|maximum)HeapSize ="512"
Or any variation with single quotes.

kchinnam · February 12, 2016, 11:02pm

Aia's Perl solution is working great. Thanks.
I would like to know how this can be done using sed -e or GNU grep -P -o, which are more native to bash.

kshji · February 13, 2016, 2:35am

Here is a pure bash/ksh93 solution. It doesn't use any external commands like awk, perl, grep, sed, ...

If your bash is too old, update it. The parser in older bash versions (e.g. 4.1) has a bug: it can't correctly parse case syntax inside a command substitution.

while read line
do
        # grep using case
        case "$line" in
           \<jvmEntries*)  ;;
                *) continue ;;
        esac
        # sed using builtin properties
        line=${line// =/=}
        line=${line//= /=}

        # parse line elements to the array using delimiter $IFS
        elem=($line)

        # create var=value lines and source it
        . /dev/stdin <<< $(
                for e in ${elem[@]}
                do
                        # grep only xxx=xxx and xxx="xxx" values
                        case "$e" in
                                -*) continue ;;
                                *:*=*) continue ;; #
                                *=\"*\") ;; # set value
                                *=\"*) continue ;; #
                                *=*) ;; # set value ...
                                *) continue ;; # something else
                        esac
                        # this was interesting element, take it
                        echo "$e"
                done
                )

        # show the variables
        echo "initialHeapSize $initialHeapSize"
        echo "maximumHeapSize $maximumHeapSize"

done < some.xml

RudiC · February 13, 2016, 6:40am

Nice one! Alas it doesn't extract e.g. debugArgs or enericJvmArguments that have a list of space separated strings enclosed in double quotes.
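
One possible way around that limitation, still without external commands, is to match each attribute with bash's [[ =~ ]] operator instead of splitting the line on whitespace. This is only a sketch: attr_value is a made-up helper name, it assumes bash 3.2 or later, and it assumes the attribute name does not also occur inside some other quoted value.

```shell
#!/bin/bash
# attr_value NAME LINE  (hypothetical helper, not part of kshji's script)
# Prints the value of NAME from LINE; handles NAME="a b", NAME='a b' and NAME=bare,
# with optional spaces around the equal sign. Returns 1 if NAME is not found.
attr_value() {
    local name=$1 line=$2
    local sp='[[:space:]]'
    # One pattern per quoting style; group 2 captures the value.
    local dq="(^|$sp)$name$sp*=$sp*\"([^\"]*)\""
    local sq="(^|$sp)$name$sp*=$sp*'([^']*)'"
    local bare="(^|$sp)$name$sp*=$sp*([^[:space:]\"'>]+)"
    if [[ $line =~ $dq ]] || [[ $line =~ $sq ]] || [[ $line =~ $bare ]]; then
        printf '%s\n' "${BASH_REMATCH[2]}"
    else
        return 1
    fi
}
```

Called as attr_value debugArgs "$line" inside the while read loop, it should print the whole quoted list, spaces included.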

kchinnam · February 13, 2016, 8:19am

Thanks kshji, I will try that and see if there are any shortcomings.

Scrutinizer · February 13, 2016, 8:41am

With the GNU utilities that you have, you could use:

sed -r 's/.*initialHeapSize[^0-9]*=[^0-9]*([0-9]+).*/\1/'

or

grep -Eo 'initialHeapSize[^0-9]*=[^0-9]*[0-9]+' | sed 's/.*[^0-9]//'

or with grep's Perl-regex -P extension:

grep -Po 'initialHeapSize\D*=\D*\d+' | sed 's/.*[^0-9]//'
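
If your GNU grep is built with PCRE support, the trailing sed can probably be dropped by using \K, which discards everything matched so far from the output. A sketch under that assumption (heap_grep is a made-up helper name, and the first grep restricts the search to the <jvmEntries line):

```shell
#!/bin/bash
# heap_grep KEY FILE  (hypothetical helper)
# Prints the digits assigned to KEY on the <jvmEntries line of FILE.
# Assumes GNU grep with -P (PCRE) support.
heap_grep() {
    local key=$1 file=$2
    # \K drops the key, the "=", and the optional quote from what -o prints.
    grep -i '<jvmEntries' "$file" |
        grep -Pio "${key}"'\s*=\s*[\x27"]?\K[0-9]+'
}
```

For example, heap_grep initialHeapSize server.xml.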

kchinnam · February 13, 2016, 10:03pm

This is not working. It prints the whole XML file:

sed -r 's/.*initialHeapSize[^0-9]*=[^0-9]*([0-9]+).*/\1/' server.xml

The two solutions below work:

grep -Eo 'initialHeapSize[^0-9]*=[^0-9]*[0-9]+' server.xml | sed 's/.*[^0-9]//'
grep -Po 'initialHeapSize\D*=\D*\d+' server.xml | sed 's/.*[^0-9]//'

But I think initialHeapSize[^0-9]*= could match any non-digit text between the key and the = .
A better way is to say that there can be spaces between the key and the = , then zero or more spaces after the = , then an optional single or double quote followed by digits.

cat server.xml | grep '<jvmEntries' | grep -iEo 'initialHeapSize[[:space:]]*=[[:space:]]*[\x27"]?[0-9]+'  | sed 's/.*[^0-9]//'
256

If I could use the above regex in a single sed statement, that would be even easier to understand and maintain.
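
A single sed statement along those lines might look like the sketch below. It assumes GNU sed (for -r and the case-insensitive I flag); heap_sed is a made-up helper name. The -n option together with the p flag prints only lines where the substitution succeeded, which avoids printing the whole XML file.

```shell
#!/bin/bash
# heap_sed KEY FILE  (hypothetical helper)
# Prints the digits assigned to KEY on the <jvmEntries line of FILE.
# Assumes GNU sed: -r for extended regex, I for case-insensitive matching.
heap_sed() {
    local key=$1 file=$2
    # Address: only lines containing <jvmEntries (any case).
    # s///: replace the whole line with the captured digits, print only on success.
    sed -n -r "/<jvmEntries/I s/.*${key}[[:space:]]*=[[:space:]]*[\"']?([0-9]+).*/\1/Ip" "$file"
}
```

For example, MIN_HEAP=$(heap_sed initialHeapSize server.xml).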

I think this is also a great solution, if only I could push the grep condition into the perl regex. It is case-insensitive and tries to be precise about key=value matching, with a best guess for spaces, = , and quotes:

cat server.xml | grep -i '<jvmEntries' |  perl -ne 'print /initialheapSize[[:space:]]*=[[:space:]]*[\x27"]?(\d+)/i'
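
One way to fold the grep condition into perl is to test both patterns in the same statement. This is a sketch, not a tested drop-in: heap_perl is a made-up helper name, and the key is passed through a KEY environment variable that is assumed here purely for illustration.

```shell
#!/bin/bash
# heap_perl KEY FILE  (hypothetical helper)
# Prints the digits assigned to KEY on the <jvmEntries line of FILE.
heap_perl() {
    local key=$1 file=$2
    # First pattern replaces the separate grep; second captures the digits.
    # \Q...\E quotes the key so it is matched literally.
    KEY=$key perl -ne \
        'print "$1\n" if /<jvmEntries/i && /\Q$ENV{KEY}\E\s*=\s*[\x27"]?(\d+)/i' "$file"
}
```

For example, MAX_HEAP=$(heap_perl maximumHeapSize server.xml).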

Don_Cragun · February 14, 2016, 5:49am

You could also try this awk script. It can handle single-quoted strings, double-quoted strings, and unquoted strings terminated by a space or ">". It requires an equal-sign (with optional leading and trailing spaces) between keyword and its value. If the value is an empty string, it must be quoted; otherwise the value doesn't need to be quoted unless the value contains a space or ">". Single-quotes can be included in double-quoted strings and double-quotes can be included in single-quoted strings.

#!/bin/ksh
file="$1"
tag="$2"
shift 2
printf '%s\n' "$@" | awk -v tag="$tag" -v sq="'" -v dq='"' '
FNR == NR {
	# Get keyword list.
	list[++n] = $0
#printf("list[%d] set to %s\n", n, list[n])
}
$1 == "<" tag {
	# Look for the requested keywords in this tag...
	for(i = 1; i <= n; i++) {
		if(match($0, "[: ]" list[i] " *= *") <= 0) {
			# No match for this keyword.
			print "***No match"
			continue
		}
		val1 = RSTART + RLENGTH
		if((c1 = substr($0, val1, 1)) == dq || c1 == sq) {
			# We have a single-quoted string or double-quoted
			# string value.  Find the end of the string value.
			val_len = index(substr($0, val1 + 1), c1) - 1
			# Extract the string value.
			val = substr($0, val1 + 1, val_len)
		} else {# We have a space or ">" terminated value.
			# Find the end of the value.
			val_len = match(substr($0, val1), /[ >]/) - 1
			val = substr($0, val1, val_len)
		}
		print val
	}
}' - "$file" | (
	while [ $# -gt 0 ]
	do	read -r value
		printf 'tag %s keyword %s=%s\n' "$tag" "$1" "$value"
		shift
	done
)

Invoke it with the 1st operand being the name of the XML file to be processed, the 2nd operand being the tag on the line to be processed, and the remaining operands being the keywords on that line whose values are to be printed. One output line is printed for each requested keyword, in the same order as the keywords were given on the command line.

As always, if you want to try this script on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk . (Note that nawk will not correctly process this script.)

If you have a file named file.xml containing:

 <NotjvmEntries xmi:id="NotJavaVirtualMachine_1337159909831" verboseModeClass="true" verboseModeGarbageCollection="false" verboseModeJNI="true" initialHeapSize="2560" maximumHeapSize="5120" runHProf="true" hprofArguments="null" debugMode="true" debugArgs='-DDQ=" -Djava.compiler=NONE -Xdebug -Xnoagent -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=7777' enericJvmArguments="-DSq=' -Dawt.headless=true -Xjit:disableIdiomRecognition -Dsun.net.inetaddr.ttl=120">
  <jvmEntries xmi:id="JavaVirtualMachine_1337159909831" verboseModeClass="false" verboseModeGarbageCollection="true" verboseModeJNI="false" initialHeapSize="256" maximumHeapSize="512" runHProf="false" hprofArguments="" debugMode="false" debugArgs="-Djava.compiler=NONE -Xdebug -Xnoagent -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=7777" enericJvmArguments="-Dawt.headless=true -Xjit:disableIdiomRecognition -Dsun.net.inetaddr.ttl=120">
xmi:id="JavaVirtualMachine_1337159909831" verboseModeClass="false" verboseModeGarbageCollection="true" verboseModeJNI="false" initialHeapSize="256" maximumHeapSize="512" runHProf="false" hprofArguments="" debugMode="false" debugArgs="-Djava.compiler=NONE -Xdebug -Xnoagent -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=7777" enericJvmArguments="-Dawt.headless=true -Xjit:disableIdiomRecognition -Dsun.net.inetaddr.ttl=120"
 <test xmi:id=NotJavaVirtualMachine_1337159909831 verboseModeClass=true verboseModeGarbageCollection=false verboseModeJNI=true initialHeapSize=2560 maximumHeapSize=5120 runHProf=true hprofArguments=null debugMode=true debugArgs='-DDQ=" -Djava.compiler=NONE -Xdebug -Xnoagent -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=7777' genericJvmArguments="-DSq=' -Dawt.headless=true -Xjit:disableIdiomRecognition -Dsun.net.inetaddr.ttl=120">

and you have saved the above script as an executable script named tester , then the command:

tester file.xml test id verboseModeClass hprofArguments maximumHeapSize minimumHeapSize initialHeapSize debugArgs genericJvmArguments enericJvmArguments

produces the output:

tag test keyword id=NotJavaVirtualMachine_1337159909831
tag test keyword verboseModeClass=true
tag test keyword hprofArguments=null
tag test keyword maximumHeapSize=5120
tag test keyword minimumHeapSize=***No match
tag test keyword initialHeapSize=2560
tag test keyword debugArgs=-DDQ=" -Djava.compiler=NONE -Xdebug -Xnoagent -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=7777
tag test keyword genericJvmArguments=-DSq=' -Dawt.headless=true -Xjit:disableIdiomRecognition -Dsun.net.inetaddr.ttl=120
tag test keyword enericJvmArguments=***No match

and the command:

tester file.xml jvmEntries id verboseModeClass hprofArguments maximumHeapSize minimumHeapSize initialHeapSize debugArgs genericJvmArguments enericJvmArguments

produces the output:

tag jvmEntries keyword id=JavaVirtualMachine_1337159909831
tag jvmEntries keyword verboseModeClass=false
tag jvmEntries keyword hprofArguments=
tag jvmEntries keyword maximumHeapSize=512
tag jvmEntries keyword minimumHeapSize=***No match
tag jvmEntries keyword initialHeapSize=256
tag jvmEntries keyword debugArgs=-Djava.compiler=NONE -Xdebug -Xnoagent -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=7777
tag jvmEntries keyword genericJvmArguments=***No match
tag jvmEntries keyword enericJvmArguments=-Dawt.headless=true -Xjit:disableIdiomRecognition -Dsun.net.inetaddr.ttl=120

kchinnam · February 14, 2016, 2:48pm

Thanks, Don, for taking the time to hammer it out. This is certainly a comprehensive solution.