String parsing help across multiple UNIX platforms

harleyvrodred · June 8, 2017, 12:46pm

Need to parse XML like strings from a file.

Using `egrep -A 1 "Panel Temp" "$2" | tail -2` I get the following string:

<parameter name="Panel Temp" unit="0.1 C"> <value size="1" starttime="06-08-2017 09:36:56.968">95</value>

I want to output:

{"Panel Temp" 9.5 C}

The 9.5 C is the value of 9 * unit of 0.1 C

I need something that'll work from a shell script on old and new systems, particularly everything from IRIX 6.5 and Linux 2.4 through Linux 2.6+ kernel-based systems.
The Korn shell (ksh) is the default shell on IRIX and is probably the most limiting factor, so I need ksh features that'll work on both old and new systems.

Any help appreciated

bakunin · June 8, 2017, 3:13pm

As often stated: one can do that, but be aware that in a strict sense you can't do it with regular expressions. For instance, in XML it is not guaranteed that tags are on the same line. They could be on different lines, in a different order, etc. Therefore any regexp-based solution short of a real XML parser will make implicit assumptions about the format of the file, which may or may not be correct.
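To make the caveat concrete, here is a minimal sketch. The two fragments and the helper name count_same_line are made up for illustration; they carry the same XML data in two layouts, and a line-oriented search that expects the name and the value together only sees one of them.

```shell
#!/bin/sh
# Hypothetical fragments: the same XML data, laid out two different ways.
one_line='<parameter name="Panel Temp" unit="0.1 C"><value size="1">95</value></parameter>'
split_lines='<parameter name="Panel Temp" unit="0.1 C">
<value size="1">95</value>
</parameter>'

# Count the lines on which the name and the <value> tag appear together.
# "|| true" keeps a zero count from looking like a failure to the caller.
count_same_line() {
    printf '%s\n' "$1" | grep -c 'Panel Temp.*<value' || true
}

count_same_line "$one_line"      # prints 1
count_same_line "$split_lines"   # prints 0
```

Any line-based solution has to pick one layout and stick to it, which is why a few lines of the real input are needed before choosing an approach.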

harleyvrodred:

I get the following string:
<parameter name="Panel Temp" unit="0.1 C"> <value size="1" starttime="06-08-2017 09:36:56.968">95</value>
I want to output:
{"Panel Temp" 9.5 C}
The 9.5 C is the value of 9 * unit of 0.1 C

OK, the part "9 * 0.1" is perhaps a typo and should (i suppose) read "95 * 0.1". I also understand the format of the string you cut out, which contains the relevant tags. But the command producing this string looks fishy. Can you please post a part of your input file (a few lines surrounding the part you want to cut out and parse), because i think this cutting could be done more easily.

As for ksh being the limiting factor: i don't think so. ksh comes in two variants, ksh88 and ksh93, and ksh88 is a strict subset of what ksh93 can do. Find out which one IRIX uses, but i think writing for ksh88 should put you on the safe side.

Linux, though, is a different matter, depending on what distribution you use. SuSE 11, for instance, had a real AT&T ksh93 when you installed it and called ksh on the command line. SuSE 12, though, dropped the AT&T version from the repository completely, and if you invoke ksh you get "mksh" (an incompatible clone with most ksh features missing but some bash features included), but it won't tell you so. I learned that a few weeks ago, when i gave a script i wrote on RHEL (which again has a real ksh when you invoke ksh) to a colleague from the Linux team and we wondered why it didn't work. So be prepared for some surprises.
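One hedged way to find out what is really behind the name ksh is to look at the KSH_VERSION variable. The sketch below assumes that mksh and pdksh identify themselves there, that recent ksh93 builds report a string starting with "Version", and that ksh88 sets nothing at all; the function name ksh_flavor is made up, and the version strings in the comments are only examples of the general shape.

```shell
#!/bin/sh
# Classify a Korn-like shell from its KSH_VERSION string.
# The string is passed as an argument so the function can be tried anywhere.
ksh_flavor() {
    case $1 in
        *"MIRBSD KSH"*) echo mksh ;;      # e.g. "@(#)MIRBSD KSH R50 ..."
        *"PD KSH"*)     echo pdksh ;;     # e.g. "@(#)PD KSH v5.2.14 ..."
        Version*93*)    echo ksh93 ;;     # e.g. "Version AJM 93u+ 2012-08-01"
        "")             echo unknown ;;   # assumption: ksh88 leaves it unset
        *)              echo other ;;
    esac
}

# Report on the shell that is running this script.
ksh_flavor "${KSH_VERSION:-}"
```

Running it with each candidate shell (for example `ksh ./flavor.sh`, where flavor.sh is a hypothetical file name) shows which implementation answers to that name on a given machine.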

I hope this helps.

bakunin

drl · June 8, 2017, 3:31pm

Hi, harleyvrodred.

Welcome to the forum.

I'd be inclined to look at perl.

Best wishes ... cheers, drl

harleyvrodred · June 8, 2017, 4:23pm

<?xml version="1.0" encoding="UTF-8"?>                                                                                                                          
                                                                                                                                                                
<log cabinet="403541" MLN="9999" USN="0001538397" timeStamp="Mon May 22 21:02:53 2017" timeZone="MST-MDT">                                                  
   <message type="parametricLog" time="Mon May 22 21:02:53 2017" source="Unknown">                                                                              
      <text />                                                                                                                                                  
      <detail>                                                                                                                                                  
         <parametricdata>                                                                                                                                       
            <data>                                                                                                                                              
               <config>                                                                                                                                         
                  <property name="SW Rev">122.33_V01</property>                                                                                          
               </config>                                                                                                                                        
               <group name="Cooling Cabinet">                                                                                                                   
                  <parameter name="Coolant Temp"  unit="0.1 C">                                                                                                 
                     <value size="1"  starttime="06-08-2017 13:40:52.563">293</value>                                                                           
                  </parameter>                                                                                                                                  
                  <parameter name="Electronics PID Control Value"  unit="-">                                                                                    
                     <value size="1"  starttime="06-08-2017 13:41:42.612">1641</value>                                                                          
                  </parameter>                                                                                                                                  
                  <parameter name="Coolant Temp"  unit="0.1 C">                                                                                                 
                     <value size="1"  starttime="06-08-2017 13:41:52.525">304</value>                                                                           
                  </parameter>                                                                                                                                  
                  <parameter name="Actual Air Velocity"  unit="fpm">                                                                                            
                     <value size="1"  starttime="06-08-2017 13:41:57.728">877</value>                                                                           
                  </parameter>                                                                                                                                  
                  <parameter name="Electronics PID Control Value"  unit="-">                                                                                    
                     <value size="1"  starttime="06-08-2017 13:42:42.573">658</value>                                                                           
                  </parameter>                                                                                                                                  
                  <parameter name="Actual Air Velocity"  unit="fpm">                                                                                            
                     <value size="1"  starttime="06-08-2017 13:42:57.751">869</value>  
                  </parameter>  
                  <parameter name="Cabinet Temp"  unit="0.1 C">                                                                                                                  
                     <value size="1"  starttime="06-08-2017 13:42:59.751">310</value>                                                                                                                                                                                                                                                                                                                              
                  </parameter>                                                                                                                                  
                  <parameter name="Electronics PID Control Value"  unit="-">                                                                                    
                     <value size="1"  starttime="06-08-2017 13:43:42.555">1229</value>                                                                          
                  </parameter>                                                                                                                                  
                  <parameter name="Coolant Temp"  unit="0.1 C">                                                                                                 
                     <value size="1"  starttime="06-08-2017 13:44:02.457">293</value>                                                                           
                  </parameter>                                                                                                                                  
                  <parameter name="Electronics PID Control Value"  unit="-">                                                                                    
                     <value size="1"  starttime="06-08-2017 13:44:42.586">1545</value>                                                                          
                  </parameter>                                                                                                                                  
                  <parameter name="Coolant Temp"  unit="0.1 C">                                                                                                 
                     <value size="1"  starttime="06-08-2017 13:45:02.455">307</value>                                                                           
                  </parameter>                                                                                                                                  
                  <parameter name="Electronics PID Control Value"  unit="-">                                                                                    
                     <value size="1"  starttime="06-08-2017 13:45:42.621">819</value>                                                                           
                  </parameter>
                  <parameter name="Cabinet Temp"  unit="0.1 C">                                                                                                                  
                     <value size="1"  starttime="06-08-2017 13:46:29.751">360</value>                                                                                                                                                                                                                                                                                                                              
                  </parameter>                                                                                                                                                                                                                                                                    
                  <parameter name="Electronics PID Control Value"  unit="-">                                                                                    
                     <value size="1"  starttime="06-08-2017 13:46:42.593">1667</value>                                                                          
                  </parameter>                                                                                                                                  
               </group>                                                                                                                                         
            </data>                                                                                                                                             
         </parametricdata>                                                                                                                                      
      </detail>                                                                                                                                                 
   </message>                                                                                                                                                   
</log>

---------- Post updated at 04:23 PM ---------- Previous update was at 04:17 PM ----------

I'm looking to get the latest values for the parameters

RudiC · June 8, 2017, 4:55pm

A bit difficult to look for "Panel Temp" if there's none in the input file.
OK, looking for "Coolant" in lieu, try

awk '
match ($0, /"Coolant Temp"/)    {printf "{%s", substr ($0, RSTART, RLENGTH)
                                 gsub (/^.*=|"|> *$/, "")
                                 split ($0, UN)
                                 getline
                                 gsub (/<[^>]*>/, "")
                                 print " "UN[1] * $0, UN[2] "}" 
                                }
' file
{"Coolant Temp" 29.3 C}
{"Coolant Temp" 30.4 C}
{"Coolant Temp" 29.3 C}
{"Coolant Temp" 30.7 C}

harleyvrodred · June 8, 2017, 11:38pm

Apologize for the inconsistencies...

This does indeed find every instance and print them out. Nice job!

What I need is for it to find the last instance and return that last one in a string variable so I can pass it onto the next step.

Do you have such an eloquent example of how this might be accomplished as well?

Much appreciated!

Don_Cragun · June 9, 2017, 2:07am

Without knowing what shell you're using, we can only guess at how the output of a command should be assigned to a shell variable. And, without knowing what operating system you're using, we have to make several other wild assumptions that may not work in your environment, but the following seems to do what you seem to want on the MacOS system I'm using when using a Korn shell:

#!/bin/ksh
variable=$(awk -F'"' -v pattern='Coolant Temp' '
$2 == pattern {
	split($4, t, / /)
	getline
	split($0, m, /[<>]/)
	o = sprintf("{%s %.1f %s}", FS pattern FS, t[1] * m[3], t[2])
}
END {	print o
}' file)

printf 'Last value extracted from file was: %s\n' "$variable"

The command syntax in this script should work with other shells that perform basic command substitutions as required by the POSIX standards (such as bash). If you want to try this on a Solaris/SunOS system, you'll need to change awk to /usr/xpg4/bin/awk or nawk.

The above is a slight variation on the code RudiC suggested that only prints the last occurrence of the data pattern found in the input file, saves the result in a shell variable, and prints the contents of that shell variable before the script exits.

If a file named file contains the sample contents you provided in post #4 in this thread, the output produced by running the above script is:

Last value extracted from file was: {"Coolant Temp" 30.7 C}
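For readers following along, the field splitting that the script relies on can be checked in isolation. This is a small sketch using two lines copied from the sample in post #4; the variable names fields, raw and scaled are only for illustration.

```shell
#!/bin/sh
# With '"' as the field separator, $2 is the name and $4 is the unit.
param_line='                  <parameter name="Coolant Temp"  unit="0.1 C">'
fields=$(printf '%s\n' "$param_line" |
    awk -F'"' '{ printf "name=[%s] unit=[%s]", $2, $4 }')
echo "$fields"    # name=[Coolant Temp] unit=[0.1 C]

# Splitting the next line on < and > leaves the raw reading in m[3]:
# m[1] is the leading blanks, m[2] is the opening tag's contents.
value_line='                     <value size="1"  starttime="06-08-2017 13:45:02.455">307</value>'
raw=$(printf '%s\n' "$value_line" |
    awk '{ split($0, m, /[<>]/); print m[3] }')
echo "$raw"       # 307

# The scale factor from the unit times the raw reading, to one decimal place.
scaled=$(awk 'BEGIN { printf "%.1f", 0.1 * 307 }')
echo "$scaled"    # 30.7
```

Both splits depend on the attribute order and on the value sitting on the line right after its parameter tag, as it does in the posted sample.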

RudiC · June 9, 2017, 4:13am

Small correction to Don Cragun's fine proposal: in the sprintf command, replace "pat" by "pattern" to get the desired result.

Use this for the last occurrence of the pattern in file:

awk '
match ($0, /"Coolant Temp"/)    {TMP = substr ($0, RSTART, RLENGTH)
                                 gsub (/^.*=|"|> *$/, "")
                                 split ($0, UN)
                                 getline
                                 gsub (/<[^>]*>/, "")
                                 OUT = sprintf ("{%s %.1f %s}", TMP, UN[1] * $0, UN[2]) 
                                }
END                             {print OUT
                                }
' file

Don_Cragun · June 9, 2017, 4:38am

Hi RudiC,
I should know better than to try to make code easier to read after it has been tested.

My post #7 has now been updated as you suggested.

Thanks,
Don

harleyvrodred · June 9, 2017, 10:18pm

Nice examples. I took this and made a command I could put into the clipboard and paste into a telnet window. Because I want to select-cut the result into the clipboard I'm wrapping the output onto as few lines as possible.

sh << 'PASTE'
xml_token () {
 # $1 = parameter name to look for, $2 = XML file to search
 variable=$(awk -F'"' -v pattern="$1" '
   $2 == pattern {
    split($4, t, / /)
    getline
    split($0, m, /[<>]/)
    o = sprintf("{%s %.1f %s}", FS pattern FS, t[1] * m[3], t[2])
   }
  END {	print o
   }' "$2")
 # Flush the buffer when the next token would overflow the terminal width
 if [ $(( ${#buff} + ${#variable} )) -ge $twidth ]; then
  echo $buff; buff=$variable
   else
  buff=$buff$variable
   fi
}
twidth=`tput cols`
xml_token 'Coolant Temp' ~me/cooling.xml
xml_token 'Cabinet Temp' ~me/cooling.xml
echo $buff
PASTE

The Output using the example file is:

{"Coolant Temp" 30.7 C}{"Cabinet Temp" 36.0 C}

I was thinking about the possibility of making the search pattern string contain a list, as in:

xml_token '/Coolant Temp/ || /Cabinet Temp/' ~me/cooling.xml

I could make the list of search patterns as many as I needed, certainly more than 2. More like a dozen or so search strings (or patterns as they are here). My attempts have failed, ideas?
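A likely reason such attempts fail is that a value passed with -v arrives as a plain string and is compared with ==, so it is never parsed as awk code. The sketch below illustrates this with two lines from the sample; count_hits is a made-up helper name.

```shell
#!/bin/sh
# Count the lines whose name attribute equals the given string exactly.
count_hits() {
    printf '%s\n' \
        '<parameter name="Coolant Temp"  unit="0.1 C">' \
        '<parameter name="Cabinet Temp"  unit="0.1 C">' |
    awk -F'"' -v pattern="$1" '$2 == pattern { n++ } END { print n + 0 }'
}

count_hits 'Coolant Temp'                       # prints 1
count_hits '/Coolant Temp/ || /Cabinet Temp/'   # prints 0
```

The second call finds nothing because no name attribute is literally equal to that whole string. Searching for several names therefore needs either a list that the awk program splits itself, or a regular expression match instead of ==.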

Don_Cragun · June 9, 2017, 11:22pm

I don't understand. Are you saying that you want to search for a bunch of patterns and return the last element in the XML file that matched any one of those patterns? So, if we look at the last couple of entries in your sample:

                  <parameter name="Cabinet Temp"  unit="0.1 C">                                                                                                                  
                     <value size="1"  starttime="06-08-2017 13:46:29.751">360</value>                                                                                                                                                                                                                                                                                                                              
                  </parameter>                                                                                                                                                                                                                                                                    
                  <parameter name="Electronics PID Control Value"  unit="-">                                                                                    
                     <value size="1"  starttime="06-08-2017 13:46:42.593">1667</value> 
                  </parameter>

and you decide to search for Cabinet Temp and Electronics PID Control Value, you want the output to be:

{"Electronics PID Control Value" 0 }

Note that the 0 comes from multiplying 1667 by the string "-", and the space at the end is there because there is no second value after a space in unit="-" like there was in your earlier example with unit="0.1 C".
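Both effects are easy to reproduce on their own. This is a minimal sketch; the variable names product and parts are only for illustration.

```shell
#!/bin/sh
# A string that does not start with a number counts as 0 in arithmetic.
product=$(awk 'BEGIN { print 1667 * "-" }')
echo "$product"    # 0

# Splitting "-" on a space yields one element, so t[2] is empty.
parts=$(awk 'BEGIN { n = split("-", t, / /); printf "%d [%s] [%s]", n, t[1], t[2] }')
echo "$parts"      # 1 [-] []
```

So any unit that does not start with a scale factor needs separate handling if the raw reading is to be kept.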

If we had a clear definition of what you're trying to do, we might be able to help you get there. But, with your current description, I'm not able to guess at the output you hope to achieve.

harleyvrodred · June 9, 2017, 11:44pm

In the case you mentioned, I'd prefer to see

{"Electronics PID Control Value" 1667 -}

I tried adding a few more entries to the example, namely Cabinet Temp - would also have the same fields as Coolant Temp. The real-life file has many Temp related entries that I'm interested in. I could use what is here now and use the xml_token function individually with multiple calls for each one, getting the last entry each. I was wondering if there would be a way to use a search string with multiple entries at the same time so it would return the last entry of each.

I know the subset that I'm interested in.

Another approach might be to take each <parameter> name and return the last instance of each. This help?

I'm data mining. Pulling data from site machines for analysis. There is too much to pull it all. I'm just choosing and picking

Don_Cragun · June 10, 2017, 12:14am

We are very glad you know the subset of entries you're interested in. But my crystal ball isn't showing me what is inside your head. What would help would be for us to know the subset of entries you're interested in. Or, if you want the code to determine which entries it should extract, explain to us how you would determine that an entry is interesting by describing what you see on the first line of that <parameter>...</parameter> XML tag that makes it interesting. (Is it that the last word inside the 1st pair of double-quotes is Temp? Is it that the string between the 2nd pair of double-quotes is 0.1 C? If it isn't one of these, what is it?)

After you describe the logic that determines which entries are interesting, please show us the exact output that you want your script to produce given the sample you provided in post #4 in this thread. Or post new data in a new post (with CODE tags) and show us the exact output (also in CODE tags) you're hoping to produce from that input with the list of interesting tag values or the logic that you described to determine which tags are interesting.

Aia · June 10, 2017, 1:22am

harleyvrodred:

In the case you mentioned, I'd prefer to see
{"Electronics PID Control Value" 1667 -}
I tried adding a few more entries to the example, namely Cabinet Temp - would also have the same fields as Coolant Temp. The real-life file has many Temp related entries that I'm interested in. I could use what is here now and use the xml_token function individually with multiple calls for each one, getting the last entry each. I was wondering if there would be a way to use a search string with multiple entries at the same time so it would return the last entry of each.

I know the subset that I'm interested in.

Another approach might be to take each <parameter> name and return the last instance of each. This help?

I'm data mining. Pulling data from site machines for analysis. There is too much to pull it all. I'm just choosing and picking

Run as perl example.pl harleyvrodred.example

my %tmp;
my %parameters;
while(<>){
  if(/<parameter name/../<\/parameter/){
    /(name)="(\w[^"]+)/ and $tmp{$1} = $2;
    /(unit)="([^"]+)/ and $tmp{$1} = $2;
    /(value)[^>]+>(\d+)</ and $tmp{$1} = $2;

    if(/<\/parameter/) {
      $parameters{$tmp{'name'}} = {'unit' => $tmp{'unit'}, 'value' => $tmp{'value'}};
      undef %tmp;
    }
  }
}
for my $entry (keys %parameters) {
  my @param = @{${parameters}{$entry}}{qw(value unit)};
  print qq/{"$entry" @param}\n/;
}

Output:

{"Electronics PID Control Value" 1667 -}
{"Coolant Temp" 307 0.1 C}
{"Cabinet Temp" 360 0.1 C}
{"Actual Air Velocity" 869 fpm}

RudiC · June 10, 2017, 2:06am

Wildly guessing on what you might want, how about

awk '
NR == 1         {for (n=split(PL, PA, "|"); n>0; n--) PAT[PA[n]]
                }
                {for (p in PAT) if (match ($0, p))      {TMP = substr ($0, RSTART, RLENGTH)
                                                         gsub (/^.*=|"|> *$/, "")
                                                         n = split ($0, UN)
                                                         UN[0] = 1 
                                                         getline
                                                         gsub (/<[^>]*>/, "")
                                                         OUT[p] = sprintf ("{\"%s\" %.1f %s}", TMP, UN[n-1] * $0, UN[n]) 
                                                        }
                }
END             {for (p in PAT) print OUT[p]
                }
' PL="Coolant Temp|Actual Air Velocity|Cabinet Temp" file
{"Actual Air Velocity" 869.0 fpm}
{"Coolant Temp" 30.7 C}
{"Cabinet Temp" 36.0 C}

Don_Cragun · June 10, 2017, 3:49am

rudic:

Wildly guessing on what you might want, how about

awk '
NR == 1         {for (n=split(PL, PA, "|"); n>0; n--) PAT[PA[n]]
   }
   {for (p in PAT) if (match ($0, p))      {TMP = substr ($0, RSTART, RLENGTH)
   gsub (/^.*=|"|> *$/, "")
   n = split ($0, UN)
   UN[0] = 1 
   getline
   gsub (/<[^>]*>/, "")
   OUT[p] = sprintf ("{\"%s\" %.1f %s}", TMP, UN[n-1] * $0, UN[n]) 
   }
   }
END             {for (p in PAT) print OUT[p]
   }
' PL="Coolant Temp|Actual Air Velocity|Cabinet Temp" file
{"Actual Air Velocity" 869.0 fpm}
{"Coolant Temp" 30.7 C}
{"Cabinet Temp" 36.0 C}

If there are a lot of strings in PL and one or more of those strings might not appear in all files that will be processed, you might want to change the END clause to:

END             {for (p in OUT) print OUT[p]
                }

to avoid printing empty lines. Or, change:

NR == 1         {for (n=split(PL, PA, "|"); n>0; n--) PAT[PA[n]]
                }

to something like:

NR == 1         {for (n=split(PL, PA, "|"); n>0; n--)   {PAT[PA[n]]
                                                         OUT[PA[n]] = sprintf("{\"%s\" Not Found}", PA[n])
                                                        }
                }

to print an indication of which strings were not found in the file being processed. But, of course, I still think we need a clear specification of the desired behavior from harleyvrodred.

harleyvrodred · June 10, 2017, 11:44am

I was looking to reduce the script as much as possible

I'm looking at your comments now and trying just that, thanks

Okay, I cannot use sub(), match(), and gsub() as they are not supported on all the platforms. That breaks the original requirement.
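Given that restriction, one possible direction is a sketch that sticks to split(), index(), substr() and sprintf(), and that also avoids getline and -v, on the assumption that an awk without sub(), match() and gsub() lacks those too. The names last_values and PL, the temporary file, and the rule of printing the raw reading when the unit has no scale factor are all choices made for this illustration. It also assumes that name is the first attribute and unit the second, as in the posted sample.

```shell
#!/bin/sh
# Print the last reading of each wanted parameter, in the order requested.
# $1 = parameter names separated by "|", $2 = XML file
last_values() {
    awk -F'"' '
    NR == 1 { n = split(PL, names, "|")
              for (i = 1; i <= n; i++) want[names[i]] = 1 }
    /<parameter / { name = $2; unit = $4 }      # remember the current tag
    /<value / && want[name] == 1 {
        v  = index($0, "<value");  rest = substr($0, v)
        gt = index(rest, ">");     rest = substr(rest, gt + 1)
        lt = index(rest, "<");     raw  = substr(rest, 1, lt - 1)
        sp = index(unit, " ")      # "0.1 C" has a scale factor, "-" has none
        if (sp > 0)
            out[name] = sprintf("{\"%s\" %.1f %s}", name,
                                substr(unit, 1, sp - 1) * raw, substr(unit, sp + 1))
        else
            out[name] = sprintf("{\"%s\" %s %s}", name, raw, unit)
    }
    END { for (i = 1; i <= n; i++)
              if (out[names[i]] != "") print out[names[i]] }
    ' PL="$1" "$2"
}

# A cut-down copy of the sample from post #4, written to a scratch file.
sample=${TMPDIR:-/tmp}/cooling_sample_$$.xml
cat > "$sample" << 'EOF'
<group name="Cooling Cabinet">
   <parameter name="Coolant Temp"  unit="0.1 C">
      <value size="1"  starttime="06-08-2017 13:44:02.457">293</value>
   </parameter>
   <parameter name="Coolant Temp"  unit="0.1 C">
      <value size="1"  starttime="06-08-2017 13:45:02.455">307</value>
   </parameter>
   <parameter name="Cabinet Temp"  unit="0.1 C">
      <value size="1"  starttime="06-08-2017 13:46:29.751">360</value>
   </parameter>
   <parameter name="Electronics PID Control Value"  unit="-">
      <value size="1"  starttime="06-08-2017 13:46:42.593">1667</value>
   </parameter>
</group>
EOF

result=`last_values 'Coolant Temp|Cabinet Temp|Electronics PID Control Value' "$sample"`
rm -f "$sample"
echo "$result"
```

With this sample the script prints {"Coolant Temp" 30.7 C}, {"Cabinet Temp" 36.0 C} and {"Electronics PID Control Value" 1667 -}, one per line. The list is passed as an operand assignment (PL=...) rather than with -v, which is why it is split on the first input line instead of in a BEGIN block.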