awk multiline matching

I have a file that looks something like this with lots of text before and after.

Distance method: Sum of squared size difference (RST)
</data> <pairwiseDifferenceMatrix time="02/08/11 at 13:08:27">

                       1          2
            1  448.82151  507.94231
            2   56.51684  454.02943
</pairwiseDifferenceMatrix> <data>

I want to extract the diagonal values 448.82 and 454.03. I was trying to first get the lines the values were on, but could only get values from the search line. Is the white space messing things up, or am I not specifying the field separator correctly? Here is the script I am using.

awk ' BEGIN {FS="\n"} /<pairwiseDifferenceMatrix/{print $3, $4}' inputfile.txt >> outputfile.txt

Any advice would be greatly appreciated.
Thank you

Can you explain a bit more?

If you have these tags in your file

<pairwiseDifferenceMatrix time="02/08/11 at 13:08:27">

 1 2
 1 448.82151 507.94231
 2 56.51684 454.02943
 </pairwiseDifferenceMatrix> <data>

are you always going to be looking for the first number on the line beginning with "1", and the second number on the line beginning with "2"?

Yes, but there are other matrices in the file with a similar format. I need to pull the data out of this one, which has the unique phrase "<pairwiseDifferenceMatrix" preceding it. The time stamp will also change across the different files I'll be using this script on.

Do it in small steps. Test the input and output of each step and learn incrementally:

  1. Find all your chunks and learn their structure:
awk '/<pairwise.../, /<\/pairwise.../' INPUTFILE
  2. It looks like you want to process only lines with 3 fields, so pipe the output to the next awk:
awk 'NF == 3'

Maybe that's not enough and you want something like

awk 'NF == 3 && $1 ~ /[0-9]/ && $2 ~ /^[0-9.]+$/ && $3 ~ ...'
  3. You should get an even(!) number of lines. You can check by piping to "wc -l".
  4. You can squeeze each pair of lines into one with:
sed 'N; s/\n/ /'
  5. And cut them with
cut -d' ' -f2,6
  6. And then you can format your numbers with printf:
xargs printf "%.2f %.2f\n"

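The final result (the $1 = $1 assignment makes awk rebuild each row with single spaces, so that cut counts fields correctly even when the columns are padded):

awk '/<pairwise/, /<\/pairwise/' INPUTFILE | 
awk 'NF == 3 { $1 = $1; print }' | 
sed 'N; s/\n/ /' | 
cut -d' ' -f2,6  | 
xargs printf "%.2f %.2f\n"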

Once more: you can change, tune, and test each step separately.
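As a rough sketch of testing the steps separately, the script below runs the pipeline against a small sample file. The sample contents, the decoy <otherMatrix> block and the variable names are made up for illustration. The filter stage also reassigns $1 so that awk rebuilds each row with single spaces, on the assumption that the real columns are padded as in the question.

```shell
#!/bin/sh
# Hypothetical sample: a decoy matrix plus the target block, with padded columns.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<otherMatrix>
            1    1.00000    2.00000
            2    3.00000    4.00000
</otherMatrix>
</data> <pairwiseDifferenceMatrix time="02/08/11 at 13:08:27">

                       1          2
            1  448.82151  507.94231
            2   56.51684  454.02943
</pairwiseDifferenceMatrix> <data>
EOF

# Steps 1-2: isolate the block and keep the 3-field rows.
# $1 = $1 makes awk rebuild each row with single spaces between fields.
rows=$(awk '/<pairwise/, /<\/pairwise/' "$sample" |
       awk 'NF == 3 { $1 = $1; print }')

# Step 3: the row count should be even before the rows are paired up.
count=$(printf '%s\n' "$rows" | wc -l | tr -d ' ')
echo "rows: $count"

# Steps 4-6: join each pair of rows, cut the diagonal, round to two decimals.
result=$(printf '%s\n' "$rows" |
         sed 'N; s/\n/ /' |
         cut -d' ' -f2,6 |
         xargs printf "%.2f %.2f\n")
echo "$result"

rm -f "$sample"
```

With this sample the script should report 2 rows and then print 448.82 454.03. The decoy rows are ignored because they sit outside the range selected in step 1.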

Try this awk script...

awk '
   /pairwiseDifferenceMatrix/     {f=1}
   /^<\/pairwiseDifferenceMatrix/ {f=0}
   f && NF==3 {printf("%s ",$2);getline;print $3}
' file

I was wondering if you would be able to break that awk command apart with comments, if you wouldn't mind? I'm trying to understand what the purpose of {f=1} and {f=0} is.

This looks like something I may be able to use at some point but I apologize, I don't understand what parts of it are doing :o

awk scripts are made up of pattern/action pairs which are executed on every line that awk reads.
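This line-at-a-time behaviour is probably also what tripped up the FS="\n" attempt in the original question: each record is still a single line, so a newline field separator leaves it with exactly one field, and $3 and $4 are empty. A minimal sketch, using a made-up three-line input where TAG stands in for the real tag line:

```shell
# Hypothetical three-line input; only the first line matches the pattern.
input=$(printf 'TAG\n1 448.82151 507.94231\n2 56.51684 454.02943\n')

# With FS="\n" each record is still one line, so it has exactly one field.
nf_newline=$(printf '%s\n' "$input" |
             awk 'BEGIN { FS = "\n" } /TAG/ { print NF }')
echo "fields on the matched line with FS=newline: $nf_newline"

# With the default FS a data row splits on runs of blanks, padding included.
second=$(printf '%s\n' "$input" | awk 'NR == 2 { print $2 }')
echo "second field of the first data row with default FS: $second"
```

The first command reports 1 field and the second prints 448.82151, which suggests that the white space itself is harmless and that the field separator was the issue.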

/pairwiseDifferenceMatrix/ tells awk that whenever it sees that pattern on a line, the action should be to enable a flag variable f: setting f to 1 is kind of like saying START.

/^<\/pairwiseDifferenceMatrix/ tells awk that whenever it sees that pattern on a line, the action should be to disable the flag variable f: setting it to 0 is kind of like saying STOP.

f && NF==3 tells awk that if "f" is non-zero and NF (number of fields) equals 3, it should print the 2nd field of the current line, followed by getting the next line and printing its third field.
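To watch the flag change, it can help to replace the printing rule with a trace. The sketch below keeps the two flag rules as posted and adds a hypothetical third rule that prints the flag in front of every line; the six input lines are a made-up miniature of the real file.

```shell
# Hypothetical miniature of the input file, one argument per line.
trace=$(printf '%s\n' \
    'text before' \
    '</data> <pairwiseDifferenceMatrix time="02/08/11 at 13:08:27">' \
    '            1  448.82151  507.94231' \
    '            2   56.51684  454.02943' \
    '</pairwiseDifferenceMatrix> <data>' \
    'text after' |
awk '
   /pairwiseDifferenceMatrix/     {f=1}
   /^<\/pairwiseDifferenceMatrix/ {f=0}
   {print "f=" (f+0) ": " $0}    # show the flag as it stands after both rules
')
echo "$trace"
```

The opening tag line and the two data rows are shown with f=1, everything else with f=0. Note that the closing tag line matches both patterns, so the order of the rules matters: the STOP rule runs second and leaves the flag at 0.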

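awk '
   /pairwiseDifferenceMatrix/     {f=1}   # START: tag seen, raise the flag
   /^<\/pairwiseDifferenceMatrix/ {f=0}   # STOP: closing tag at line start, lower the flag
   # flag up and 3 fields: print field 2, read the next row, print its field 3
   f && NF==3 {printf("%s ",$2);getline;print $3}
' file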
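For comparison, here is a variant that avoids getline. It is not the script posted above. It counts the rows inside the block and prints field row+1 of each, rounded to two decimals, so it should also cope with a larger square matrix. It assumes that the first non-blank line inside the block is the column-label row and that the rows appear in order. The names inside, row and seen_header, the decoy <otherMatrix> block and the sample file are all made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample: a decoy matrix followed by the target block.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<otherMatrix>
            1    1.00000    2.00000
            2    3.00000    4.00000
</otherMatrix>
</data> <pairwiseDifferenceMatrix time="02/08/11 at 13:08:27">

                       1          2
            1  448.82151  507.94231
            2   56.51684  454.02943
</pairwiseDifferenceMatrix> <data>
EOF

diagonal=$(awk '
   # Closing tag: finish the output line and leave the block.
   /<\/pairwiseDifferenceMatrix/ { if (inside) printf "\n"; inside = 0 }
   # Non-blank line inside the block.
   inside && NF {
       if (!seen_header) { seen_header = 1; next }   # skip the column labels
       row++
       # Row n holds its diagonal value in field n + 1.
       printf "%s%.2f", (row > 1 ? " " : ""), $(row + 1)
   }
   # Opening tag: enter the block. This rule is last so the tag line
   # itself is never treated as data.
   /<pairwiseDifferenceMatrix/ { inside = 1; row = 0; seen_header = 0 }
' "$sample")
echo "$diagonal"

rm -f "$sample"
```

On this sample it prints 448.82 454.03. The rounding comes from the %.2f format, which the getline version would need as well to get two-decimal output.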

Thank you so much Shamrock *nods* Believe it or not I have an issue today where this will probably come in handy! Very exciting. :)