[awk] line by line processing the same file

origamisven · October 2, 2012, 6:20am

Hey, I'm not too good at this, so I only managed a clumsy and SLOW solution to my problem, and it needs a drastic speed-up. Any ideas on how to write the following in awk only?

What the code is supposed to do:
For every line, read the column values $6, $7 and $8 (the x, y, z coordinates) and compute the distance to the same columns of every other line in the same file. If the distance is non-zero and within the cutoff, write the two atom types and the distance out to a file.

CODE:

while read line; do
    XI=$(echo $line | awk '{print $6}')
    YI=$(echo $line | awk '{print $7}')
    ZI=$(echo $line | awk '{print $8}')
    ATOM_TYPE=$(echo $line | awk '{print $3}')
    awk -v xi="$XI" -v yi="$YI" -v zi="$ZI" -v atom="$ATOM_TYPE" -v cutoff="$DISTCUT" '{dist=sqrt(( xi- $6)^2 + ( yi- $7)^2 + ( zi- $8)^2); if (dist <= cutoff && dist != '0') print atom, $3, dist}' sub_oxy_high >> oxy_dist_all
done < sub_oxy_high

INPUT:

ATOM   5202   C3  TB   347      47.749   6.795 193.827
ATOM   5203   C4  TB   347      46.729   7.915 193.597
ATOM   5204   O5  TB   347      47.109   9.075 193.407
ATOM   5205   O6  TB   347      45.329   7.594 193.517
...

OUTPUT:

C3 C4 9.999
C3 O5 9.999
C3 O6 9.999
...

elixir_sinari · October 2, 2012, 6:40am

And what's the value of DISTCUT for the output posted?

Try:

awk '{atom[NR]=$3;xi[NR]=$6;yi[NR]=$7;zi[NR]=$8}
END{
for(i=1;i<=NR;i++)
 for(j=1;j<=NR;j++)
 {
  if(j==i) continue
  dist=sqrt((xi[i]-xi[j])^2 + (yi[i]-yi[j])^2 + (zi[i]-zi[j])^2)
  if(dist!=0 && dist<=cutoff)
   print atom[i],atom[j],dist
 }
}' cutoff="$DISTCUT" sub_oxy_high > oxy_dist_all

RudiC · October 2, 2012, 6:55am

awk     '{for (i=3;i<=NF;i++) TMP[NR,i]=$i}
         END {for (i=1;i<=NR;i++)
                {for (j=NR;j>i;j--)
                  {dist = sqrt  ( (TMP[i,6]-TMP[j,6])^2 + (TMP[i,7]-TMP[j,7])^2 + (TMP[i,8]-TMP[j,8])^2 );
                   if (dist != 0 && dist <= co)  print TMP[i,3],TMP[j,3],dist
                  }
                }
             }
        ' co="$DISTCUT" sub_oxy_high

With the data from your example:

C3 O6 2.56728
C3 O5 2.40508
C3 C4 1.53222
C4 O6 1.43856
C4 O5 1.23535
O5 O6 2.31816

@elixir_sinari: too fast for me! But - you're outputting each pair of atoms twice; not sure if that's desired...

elixir_sinari · October 2, 2012, 7:02am

Is it? But then, that's a "faithful" conversion of that loop to an awk script.

C3 C4 1.53222
C3 O5 2.40508
C3 O6 2.56728
C4 C3 1.53222
C4 O5 1.23535
C4 O6 1.43856
O5 C3 2.40508
O5 C4 1.23535
O5 O6 2.31816
O6 C3 2.56728
O6 C4 1.43856
O6 O5 2.31816

is the output for the sample.

RudiC · October 2, 2012, 7:05am

Yes: e.g.

C3 C4 1.53222
C4 C3 1.53222

But maybe that's desired?

elixir_sinari · October 2, 2012, 7:20am

If it is not desired, a slight tweak will do the trick.

awk '{atom[NR]=$3;xi[NR]=$6;yi[NR]=$7;zi[NR]=$8}
END{
for(i=1;i<=NR;i++)
 for(j=i+1;j<=NR;j++)
 {
  dist=sqrt((xi[i]-xi[j])^2 + (yi[i]-yi[j])^2 + (zi[i]-zi[j])^2)
  if(dist!=0 && dist<=cutoff)
   print atom[i],atom[j],dist
 }
}' cutoff="$DISTCUT" sub_oxy_high > oxy_dist_all

origamisven · October 2, 2012, 8:03am

You guys are awesome, thanks all around... Double entries were not desired; I just left the issue out because I didn't want to cause confusion.

DISTCUT=3.5, by the way: a geometric hydrogen-bonding criterion in angstroms...
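
For anyone who wants to try the each-pair-once version end to end with that cutoff, here is a minimal sketch. It assumes the four sample lines from the first post stand in for sub_oxy_high (the real file is longer). The awk program is wrapped in a shell function called pair_distances, a name made up for this example and not taken from the thread. The array subscripts are written out in full and the cutoff is passed with -v.

```shell
# pair_distances: hypothetical helper (name invented for this sketch).
# $1 = distance cutoff; atom records are read from standard input.
pair_distances() {
    awk -v cutoff="$1" '
    { atom[NR] = $3; x[NR] = $6; y[NR] = $7; z[NR] = $8 }   # remember every line
    END {
        for (i = 1; i <= NR; i++)
            for (j = i + 1; j <= NR; j++) {                  # j > i: each pair once
                dist = sqrt((x[i]-x[j])^2 + (y[i]-y[j])^2 + (z[i]-z[j])^2)
                if (dist != 0 && dist <= cutoff)
                    print atom[i], atom[j], dist
            }
    }'
}

# The four sample lines from the first post (the real file is longer).
sample='ATOM   5202   C3  TB   347      47.749   6.795 193.827
ATOM   5203   C4  TB   347      46.729   7.915 193.597
ATOM   5204   O5  TB   347      47.109   9.075 193.407
ATOM   5205   O6  TB   347      45.329   7.594 193.517'

DISTCUT=3.5   # cutoff in angstroms, as stated above
result=$(printf '%s\n' "$sample" | pair_distances "$DISTCUT")
printf '%s\n' "$result"
```

On the real data the call would be something like pair_distances "$DISTCUT" < sub_oxy_high > oxy_dist_all. With the four sample lines every pair is closer than 3.5, so this should print the same six pairs RudiC posted, just in ascending i, j order.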

This forum is so good