awk multiline matching

mgray · August 29, 2011, 7:21pm

I have a file that looks something like this with lots of text before and after.

Distance method: Sum of squared size difference (RST)
</data> <pairwiseDifferenceMatrix time="02/08/11 at 13:08:27">

                       1          2
            1  448.82151  507.94231
            2   56.51684  454.02943
</pairwiseDifferenceMatrix> <data>

I want to extract the diagonal values 448.82 and 454.03. As a first step I tried to print the lines the values are on, but I could only get values from the line that matches the search pattern. Is the white space messing things up, or am I not specifying the field separator correctly? Here is the script I am using.

awk ' BEGIN {FS="\n"} /<pairwiseDifferenceMatrix/{print $3, $4}' inputfile.txt >> outputfile.txt

Any advice would be greatly appreciated.
Thank you

declanryan · August 29, 2011, 8:42pm

Can you explain a bit more?

If you have these tags in your file:

<pairwiseDifferenceMatrix time="02/08/11 at 13:08:27">

 1 2
 1 448.82151 507.94231
 2 56.51684 454.02943
 </pairwiseDifferenceMatrix> <data>

are you always going to be looking for the first number on the line beginning with "1", and the second number on the line beginning with "2"?

mgray · August 29, 2011, 11:21pm

Yes, but there are other matrices in the file with a similar format. I need to pull the data out of this one, which has the unique phrase "<pairwiseDifferenceMatrix" preceding it. The time stamp will also change across the different files I'll be using this script on.

yazu · August 30, 2011, 12:14am

Do it in small steps. Test the input and output of each step and learn incrementally:

Find all your chunks and learn their structure:

awk '/<pairwise.../, /<\/pairwise.../' INPUTFILE

It looks like you want to process only the lines with 3 fields, so pipe the output to the next awk:

awk 'NF == 3'

Maybe it's not enough and you want something like

awk 'NF == 3 && $1 ~ /[0-9]/'

or

... && $2 ~ /^[0-9.]+$/ && $3 ~ ...

You should get an even(!) number of lines. You can check this by piping to "wc -l".
You can then join each pair of lines with:

sed 'N; s/\n/ /'

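And cut them (cut treats every single space as a delimiter, so this assumes the joined line has exactly one space between fields and none at the start):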

cut -d' ' -f2,6

And then you can format your numbers with printf:

xargs printf "%.2f %.2f\n"

The final result:

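# $1 = $1 makes awk rebuild each line with single spaces, so cut sees clean fields
awk '/<pairwise/, /<\/pairwise/' INPUTFILE |
awk 'NF == 3 { $1 = $1; print }' |
sed 'N; s/\n/ /' |
cut -d' ' -f2,6 |
xargs printf "%.2f %.2f\n"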

And once more: you can change, tune, and test each step separately.
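
One way to test the steps end to end is to run them against a small throwaway file. The following is only a sketch: the file name sample.txt and the decoy <otherMatrix> block (with made-up values) are hypothetical, added to check that only the wanted matrix is picked up. The extra $1 = $1 is also an addition here; it collapses the runs of spaces in the original data so that the single-space delimiter given to cut lines up.

```shell
# Hypothetical test file: a decoy matrix followed by the matrix from the question.
cat > sample.txt <<'EOF'
<otherMatrix>
            1    1.00000    2.00000
            2    3.00000    4.00000
</otherMatrix>
Distance method: Sum of squared size difference (RST)
</data> <pairwiseDifferenceMatrix time="02/08/11 at 13:08:27">

                       1          2
            1  448.82151  507.94231
            2   56.51684  454.02943
</pairwiseDifferenceMatrix> <data>
EOF

result=$(
  # 1. keep only the lines between the opening and closing tags
  awk '/<pairwiseDifferenceMatrix/, /<\/pairwiseDifferenceMatrix/' sample.txt |
  # 2. keep the 3-field data rows; $1 = $1 rebuilds them with single spaces
  awk 'NF == 3 { $1 = $1; print }' |
  # 3. join each pair of rows into one line
  sed 'N; s/\n/ /' |
  # 4. pick row 1 / column 1 and row 2 / column 2
  cut -d' ' -f2,6 |
  # 5. round to two decimals
  xargs printf "%.2f %.2f\n"
)
echo "$result"   # prints: 448.82 454.03
```

Dropping stages from the end of the pipeline one at a time shows what each step hands to the next.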

shamrock · August 30, 2011, 12:59pm

Try this awk script...

awk '
   /pairwiseDifferenceMatrix/     {f=1}
   /^<\/pairwiseDifferenceMatrix/ {f=0}
   f && NF==3 {printf("%s ",$2);getline;print $3}
' file

jtollefson · August 30, 2011, 3:44pm

Shamrock,
I was wondering if you would be able to break that awk command apart with comments, if you wouldn't mind? I'm trying to understand what the purpose of {f=1} and {f=0} is.

This looks like something I may be able to use at some point but I apologize, I don't understand what parts of it are doing :o

shamrock · August 30, 2011, 4:27pm

An awk script is made up of pattern/action pairs. awk tests each pattern against every line it reads and runs the action whenever the pattern matches.
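
This line-at-a-time model is probably also why the attempt in the original question printed nothing useful. With the default record separator awk hands the script one line at a time, so a line never contains a newline for FS="\n" to split on, and $3 and $4 are empty. A minimal sketch with made-up three-line input (the words "tag", "row one" and "row two" are placeholders):

```shell
# Each line is its own record, so with FS="\n" the whole line is $1 and NF is 1.
result=$(printf 'tag\nrow one\nrow two\n' |
  awk 'BEGIN { FS = "\n" } /tag/ { print NF ": [" $3 "]" }')
echo "$result"   # prints: 1: []
```

The rows after the matching line are separate records, which is why the script below uses a flag to remember that the tag has been seen.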

/pairwiseDifferenceMatrix/ tells awk that whenever it sees that pattern on a line the action should be to enable a flag variable f...set f to 1...so this is kind of like saying START.

/^<\/pairwiseDifferenceMatrix/ tells awk that whenever it sees that pattern on a line the action should be to disable the flag variable f...set it to zero...so this is kind of like saying STOP.

f && NF==3 tells awk that if "f" is non-zero and NF (number of fields) equals 3...it should print the 2nd field of the current line...followed by getting the next line and printing its third field.

awk '
   /pairwiseDifferenceMatrix/     {f=1}
   /^<\/pairwiseDifferenceMatrix/ {f=0}
   f && NF==3 {printf("%s ",$2);getline;print $3}
' file
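
The same flag idea can be written without getline, which may be easier to extend to matrices larger than 2x2. This is only a sketch under assumptions the thread does not confirm: the first non-blank line after the opening tag is the column header, and every data row starts with its row number i, so the diagonal value sits in field i+1. The file name sample.txt and the variable names header and sep are hypothetical.

```shell
# Sample input copied from the question.
cat > sample.txt <<'EOF'
Distance method: Sum of squared size difference (RST)
</data> <pairwiseDifferenceMatrix time="02/08/11 at 13:08:27">

                       1          2
            1  448.82151  507.94231
            2   56.51684  454.02943
</pairwiseDifferenceMatrix> <data>
EOF

result=$(awk '
  # Opening tag: raise the flag and expect a header line next.
  /<pairwiseDifferenceMatrix/   { f = 1; header = 1; next }
  # Closing tag: lower the flag.
  /<\/pairwiseDifferenceMatrix/ { f = 0 }
  # Assumption: the first non-blank line inside the block is the column header.
  f && NF && header             { header = 0; next }
  # Assumption: $1 is the row number i, so the diagonal entry is field i+1.
  f && NF                       { printf "%s%.2f", sep, $($1 + 1); sep = " " }
  END                           { print "" }
' sample.txt)
echo "$result"   # prints: 448.82 454.03
```

Because the field index comes from the row label rather than from a fixed getline step, a 3x3 matrix laid out the same way should yield three values with no change to the script.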

jtollefson · August 31, 2011, 2:05pm

Thank you so much Shamrock *nods* Believe it or not I have an issue today where this will probably come in handy! Very exciting.