Selecting and deleting specific lines with a condition

vjramana · September 8, 2011, 9:07am

I have a set of data as below:

HBOND SUMMARY
output to file HB_lowLyo_D_lipid_A_water_001_064.tbl,
data was sorted, intra-residue interactions are NOT included,
Distance cutoff is 4.00 angstroms, angle cutoff is 120.00 degrees
Hydrogen bond information dumped for occupancies > 0.00

DONOR ACCEPTORH ACCEPTOR
atom# res@atom atom# res@atom atom# res@atom %occupied distance angle
| 4645 58@O12 | 23489 1174@H1 23488 1174@O | 22.79 2.945 ( 0.28) 26.79 (14.41)
| 4645 58@O12 | 23490 1174@H2 23488 1174@O | 22.49 2.965 ( 0.31) 28.01 (14.47)
| 2701 34@O12 | 23333 1122@H1 23332 1122@O | 20.60 2.965 ( 0.23) 30.07 (14.18)
| 2701 34@O12 | 23334 1122@H2 23332 1122@O | 19.74 2.963 ( 0.23) 31.43 (13.88)
| 271 4@O12 | 23334 1122@H2 23332 1122@O | 19.70 2.825 ( 0.19) 21.92 (12.15)
| 271 4@O12 | 23333 1122@H1 23332 1122@O | 19.55 2.826 ( 0.19) 22.22 (12.71)
| 4655 58@O16 | 21156 396@H2 21154 396@O | 19.43 2.933 ( 0.22) 31.95 (15.18)
| 4658 58@O15 | 21156 396@H2 21154 396@O | 18.96 3.163 ( 0.27) 37.03 (14.63)
| 4310 54@O26 | 23202 1078@H2 23200 1078@O | 18.73 2.821 ( 0.24) 25.87 (13.92)
| 4655 58@O16 | 21155 396@H1 21154 396@O | 18.63 2.917 ( 0.22) 31.91 (15.00)
| 1820 23@O16 | 21167 400@H1 21166 400@O | 18.14 2.910 ( 0.22) 27.20 (13.87)
| 1820 23@O16 | 21168 400@H2 21166 400@O | 17.96 2.907 ( 0.21) 26.69 (13.86)
| 3845 48@O16 | 23454 1162@H2 23452 1162@O | 17.68 2.991 ( 0.31) 28.45 (14.88)
| 4658 58@O15 | 21155 396@H1 21154 396@O | 17.31 3.177 ( 0.27) 38.82 (14.69)
| 3845 48@O16 | 23453 1162@H1 23452 1162@O | 17.29 3.016 ( 0.32) 28.84 (14.57)
| 1489 19@O13 | 23201 1078@H1 23200 1078@O | 16.66 2.884 ( 0.23) 31.39 (15.56)
| 3824 48@O26 | 21099 377@H2 21097 377@O | 15.44 2.992 ( 0.30) 30.78 (15.01)
| 4253 53@O15 | 23454 1162@H2 23452 1162@O | 14.98 2.961 ( 0.27) 33.71 (15.09)
| 1459 19@O22 | 23201 1078@H1 23200 1078@O | 14.84 3.012 ( 0.33) 35.08 (16.12)
| 1081 14@O12 | 21173 402@H1 21172 402@O | 14.76 2.937 ( 0.24) 27.54 (14.26)
| 4253 53@O15 | 23453 1162@H1 23452 1162@O | 14.63 2.955 ( 0.25) 33.68 (15.11)
| 1081 14@O12 | 21174 402@H2 21172 402@O | 14.41 2.944 ( 0.25) 28.34 (14.35)
| 3824 48@O26 | 21098 377@H1 21097 377@O | 13.70 3.002 ( 0.30) 31.00 (15.21)
| 3845 48@O16 | 21156 396@H2 21154 396@O | 13.06 2.934 ( 0.26) 27.71 (14.05)
.
.
.
few thousand lines

The first field, $1, is "|".
The 3rd field ($3) and the 6th field ($6) in my data file each start with a molecule (residue) number, and the molecules are arranged as below:

   1    2   3   4   5   6   7       8

   9    10  11  12  13  14  15      16
  17    18  19  20  21  22  23      24
  25    26  27  28  29  30  31      32
  33    34  35  36  37  38  39      40
  41    42  43  44  45  46  47      48
  49    50  51  52  53  54  55      56

  57    58  59  60  61  62  63      64

Any pair made from the above numbers corresponds to the pair of numbers in the 3rd and 6th fields of a line in the data file.

What I want is to select the lines whose pair is made only of numbers that lie on the outermost lines of the above number-molecule arrangement.

In short, ANY PAIR made only of the numbers

 (1 2 3 4 5 6 7 8   57 58 59 60 61 62 63 64   9 17 25 33 41 49 57   8 16 24 32 40 48 56 64)

in other words

1 , 2
1 , 3
1 , 4
.
.
1 , 57
1 , 58
1 , 59
.
.
.
2, 1
2, 3
2, 4
2, 5
.
.
.
2, 57
2, 58
2, 59
.
.
.

need to be deleted from the data file.

To achieve this I have tried to write the awk script below, which as a first test should print the lines that I want to delete. But even at this stage it fails to select those lines.

 #!/usr/bin/awk -f

 BEGIN  {
   i=0
   for (n=1; n<=8; n++) set[i++] = n;
   for (n=57; n<=64; n++) set[i++] = n;
   for (n=9; n<=49; n+=8) {set[i++] = n; set[i++] = n+7};
    }


 ($1== "|") {
     split($3, res1, "@"); split($6, res2, "@"); #print res1[1], res2[1]

     if ( (res1[1] in set) == (res2[1] in set) ); 

     {
       print;
      }

 }

Can I get any help with this?

Thanks in advance

DGPickett · September 8, 2011, 9:43am

Generally, by "delete" one means making a new file without those lines. I am a bit confused in trying to see the objective. It may be a multi-pass project: collect information, rearrange it to decide what to do, and then apply those results to the original. Where did it get hard?

yazu · September 8, 2011, 9:47am

I'm afraid it's hard to understand your problem. Could you show the expected or deleted rows (and please, one more time, why?) for this test set:

awk '/^\|/ {print $3, $6}' INPUTFILE
58@O12 1174@H1
58@O12 1174@H2
34@O12 1122@H1
34@O12 1122@H2
4@O12 1122@H2
4@O12 1122@H1
58@O16 396@H2
58@O15 396@H2
54@O26 1078@H2
58@O16 396@H1
23@O16 400@H1
23@O16 400@H2
48@O16 1162@H2
58@O15 396@H1
48@O16 1162@H1
19@O13 1078@H1
48@O26 377@H2
53@O15 1162@H2
19@O22 1078@H1
14@O12 402@H1
53@O15 1162@H1
14@O12 402@H2
48@O26 377@H1
48@O16 396@H2

vjramana · September 8, 2011, 1:08pm

Here is another set of data, for clarity.

DONOR ACCEPTORH ACCEPTOR
atom# res@atom atom# res@atom atom# res@atom %occupied distance angle
| 4726 59@O12 | 1487 19@H12 1486 19@O12 | 85.66 2.819 ( 0.18) 21.85 (12.11)
| 1499 19@O15 | 1730 24@H12 1729 24@O12 | 83.15 3.190 ( 0.31) 22.36 (12.73)
| 1216 16@O22 | 1460 17@H22 1459 17@O22 | 75.74 2.757 ( 0.14) 24.55 (13.66)
| 4232 53@O25 | 4143 52@H24 4142 52@O24 | 74.35 2.916 ( 0.25) 28.27 (13.26)
| 3683 46@O16 | 4163 52@H13 4162 52@O13 | 73.78 2.963 ( 0.29) 23.65 (14.14)
| 4162 52@O13 | 4079 51@H12 4078 51@O12 | 73.68 2.841 ( 0.19) 21.25 (11.87)
| 3764 47@O16 | 3825 48@H26 3824 48@O26 | 70.52 2.973 ( 0.28) 26.88 (13.14)
| 193 3@O13 | 353 5@H12 352 5@O12 | 67.49 2.780 ( 0.17) 17.85 (10.90)
| 3035 38@O16 | 3350 42@H12 3349 42@O12 | 67.19 2.790 ( 0.16) 18.72 (10.47)
| 686 9@O16 | 893 12@H22 892 12@O22 | 66.87 2.905 ( 0.22) 26.53 (10.90)
| 1478 19@O25 | 1703 22@H22 1702 22@O22 | 64.37 2.864 ( 0.21) 31.87 (14.12)
| 3521 44@O16 | 747 10@H26 746 10@O26 | 63.71 2.941 ( 0.27) 26.82 (13.51)
| 1313 17@O26 | 1217 16@H22 1216 16@O22 | 63.09 2.807 ( 0.16) 22.23 (11.92)
| 4159 52@O12 | 3684 46@H16 3683 46@O16 | 62.43 2.900 ( 0.22) 35.69 (12.23)
| 4331 54@O16 | 1490 19@H13 1489 19@O13 | 61.80 2.989 ( 0.29) 26.58 (14.32)
| 3440 43@O16 | 3906 49@H26 3905 49@O26 | 60.17 2.964 ( 0.28) 28.61 (13.24)
| 1334 17@O16 | 1247 16@H13 1246 16@O13 | 59.31 2.828 ( 0.18) 25.35 (12.61)
| 1729 22@O12 | 1557 20@H26 1556 20@O26 | 58.11 3.036 ( 0.27) 32.81 (11.84)
| 4151 52@O25 | 4484 56@H12 4483 56@O12 | 57.67 2.917 ( 0.32) 27.71 (15.02)
| 1502 19@O11 | 1730 22@H12 1729 22@O12 | 57.53 3.184 ( 0.26) 41.62 (13.24)
| 3014 38@O26 | 3353 42@H13 3352 42@O13 | 57.42 2.884 ( 0.24) 22.59 (12.87)
| 3524 44@O15 | 3917 49@H12 3916 49@O12 | 57.35 3.227 ( 0.35) 25.52 (13.61)
| 2390 30@O15 | 2756 35@H22 2755 35@O22 | 57.28 3.074 ( 0.33) 31.27 (14.44)
| 1739 22@O16 | 5115 64@H24 5114 64@O24 | 56.78 2.876 ( 0.28) 20.94 (13.42)
| 4574 57@O16 | 5061 63@H16 5060 63@O16 | 56.57 2.956 ( 0.25) 30.52 (14.00)
| 2846 36@O24 | 3566 45@H22 3565 45@O22 | 55.92 2.880 ( 0.24) 22.85 (12.39)
| 605 8@O16 | 839 11@H12 838 11@O12 | 55.67 2.894 ( 0.24) 25.45 (13.25)

In the first line, the residue number in field 3 ($3) is 59 and the residue number in field 6 ($6) is 19. According to the number-molecule arrangement, 59 is in the outermost line but 19 is not. So this line should NOT be deleted.

In the second line, the number in field 3 ($3) is 19 and the number in field 6 ($6) is 24. Number 19 is not in the outermost line but 24 is. This line should also not be deleted, since the two numbers are NOT both in the outermost lines.

In the third line, the number in field 3 ($3) is 16 and the number in field 6 ($6) is 17. Since both numbers in this pair belong to the outermost numbers, this line should be deleted.

So a line should be deleted only when both of its numbers pass the test of being in the outermost lines. This is what I need to achieve, and my code simply does not work as I wanted.

Thanks in advance.

DGPickett · September 9, 2011, 4:28pm

So, this is a negative join. Seems like his approach should be good: you need to save the outer numbers in an array, and then, as you go through the lines, look them up and decide whether you want to copy each line. You could use while read in ksh/bash and put @ in IFS to split that field into two. You could also decide mathematically whether a number is in the left or right column: (( (N%8) < 2 )).

What about when field 6 and 8 do not match? No different?
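
A minimal sketch of that read-loop idea, written for any POSIX-style shell (it should also run under ksh and bash). The function names is_outer and filter_hbond are made up for illustration, and the 8x8 layout numbered 1..64 is assumed from the original post. Numbers above 64 (the water residues) are treated as not outer.

```shell
# Sketch only: is_outer and filter_hbond are hypothetical helper names.

# Succeeds when $1 is on the edge of the 8x8 layout numbered 1..64.
is_outer() {
    [ "$1" -ge 1 ] && [ "$1" -le 64 ] || return 1
    [ "$1" -le 8 ] || [ "$1" -ge 57 ] || [ $(( $1 % 8 )) -lt 2 ]
}

# Copies stdin to stdout, skipping data lines whose two residues are both outer.
filter_hbond() {
    while IFS= read -r line; do
        case $line in
        "|"*)
            # Split on blanks and "@": | atom# res atom | atom# res atom ...
            IFS=' @' read -r bar1 atom1 res1 name1 bar2 atom2 res2 rest <<EOF
$line
EOF
            if is_outer "$res1" && is_outer "$res2"; then
                continue    # both residues are outer: drop the line
            fi
            ;;
        esac
        printf '%s\n' "$line"
    done
}
```

It would be used as a filter, for example filter_hbond < input.tbl > output.tbl (both file names are placeholders). For a few thousand lines a shell loop is fine, though awk will be faster.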

rwuerth · September 9, 2011, 5:31pm

Seems to me that you could use modulus to simplify the tests.

Your number x (assumed to be less than or equal to 64?)

if x % 8 = 0 it's in the right hand column
if x % 8 = 1 it's in the left hand column

then you just have the ranges

2<=x<=7

and

58<=x<=63
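
As a quick check of the modulus idea, here is a small sketch that lists every number from 1 to 64 the test accepts. The names list_outer and is_outer are invented for this example, and the 1..64 bound is the assumption stated above.

```shell
# Sketch only: list_outer is a hypothetical name; numbers are assumed to be 1..64.
list_outer() {
    awk '
    # Returns 1 when x lies on the edge of the 8x8 layout, else 0.
    function is_outer(x) {
        if (x < 1 || x > 64) return 0
        return (x % 8 == 0 || x % 8 == 1 || x <= 8 || x >= 57)
    }
    BEGIN {
        sep = ""
        for (n = 1; n <= 64; n++)
            if (is_outer(n)) { printf "%s%d", sep, n; sep = " " }
        print ""
    }'
}
```

Calling list_outer should print the same 28 numbers as the list in the first post, which makes it easy to compare the two by eye.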

---------- Post updated at 05:31 PM ---------- Previous update was at 05:03 PM ----------

Ahhh...just realized, this test is wrong!

if ( (res1[1] in set) == (res2[1] in set) );

You can't test for the value of the array element this way, only that the subscript exists!
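
A tiny demonstration of the difference, as a sketch (demo_in, byindex and bynumber are names made up for this example): filling the array with set[i++] stores 57 as a value under subscript 0, so "57 in" that array is false, while an array keyed by the number itself answers the question directly.

```shell
# Sketch only: demo_in is a hypothetical name.
demo_in() {
    awk 'BEGIN {
        i = 0
        byindex[i++] = 57      # subscript 0 holds the value 57
        byindex[i++] = 58      # subscript 1 holds the value 58
        bynumber[57] = 1       # subscript 57 itself
        bynumber[58] = 1       # subscript 58 itself
        print "57 in byindex: " ((57 in byindex) ? "yes" : "no")
        print "1 in byindex: " ((1 in byindex) ? "yes" : "no")
        print "57 in bynumber: " ((57 in bynumber) ? "yes" : "no")
    }'
}
```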

You could set up your array differently: instead of using set[i++], why not
index the array by the number itself and use set[n]? Then your test should work, as you would have elements as follows:

set[1] through set[8]
set[9], set[16]
set[17], set[24]
set[25], set[32]
set[33], set[40]
set[41], set[48]
set[49], set[56]
set[57] through set[64]

Two other things I noticed: remove the semicolon after the test,
and change "==" to "&&":

if ( (res1[1] in set) && (res2[1] in set) )

This is the biggest problem your script had other than trying to
use "in" to test the set values instead of the subscripts.
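
To see both effects in isolation, here is a short sketch (demo_if is a made-up name, and the array s is only there for the demonstration). A ";" right after the if condition is an empty statement, so the block that follows runs unconditionally; and "==" is true whenever the two tests agree, including when both fail.

```shell
# Sketch only: demo_if is a hypothetical name.
demo_if() {
    awk 'BEGIN {
        s[1] = 1                       # 10 and 11 are not subscripts of s
        if ((10 in s) && (11 in s));   # ";" is an empty statement: the if ends here
        { print "after if with semicolon: printed regardless" }
        if ((10 in s) && (11 in s))
        { print "after if without semicolon: not reached" }
        print "==: " (((10 in s) == (11 in s)) ? "true" : "false")
        print "&&: " (((10 in s) && (11 in s)) ? "true" : "false")
    }'
}
```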

This code worked for me.

#!/usr/bin/awk -f

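 BEGIN  {
   # Index the array by the residue number itself so that "in" can test it.
   for (n=1; n<=8; n++) set[n] = n;       # top row: 1..8
   for (n=9; n<=49; n+=8) {
     set[n] = n                           # left column: 9, 17, ... 49
     set[n+7] = n+7                       # right column: 16, 24, ... 56
   };
   for (n=57; n<=64; n++) set[n] = n;     # bottom row: 57..64
 }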

 ($1 == "|") {
     split($3, res1, "@"); split($6, res2, "@");
     if ( (res1[1] in set) && (res2[1] in set) ) # <--- no ';' here!
     {
       print;
     }

 }

My test file, udc1.txt, was your first and second examples put together in that order.

vjramana · September 11, 2011, 9:54pm

Dear sir,

Thanks so much for your kind reply. The code now works perfectly for my needs. But I would like to ask something related. At the end of the code I wrote "print" so that I could check whether the code selects exactly the lines I don't want. Now, if I want to delete those selected lines, what command should I use?

durden_tyler · September 12, 2011, 12:47am

Using Perl -

$
$
$ cat f8
DONOR ACCEPTORH ACCEPTOR
atom# res@atom atom# res@atom atom# res@atom %occupied distance angle
| 4726 59@O12 | 1487 19@H12 1486 19@O12 | 85.66 2.819 ( 0.18) 21.85 (12.11)
| 1499 19@O15 | 1730 24@H12 1729 24@O12 | 83.15 3.190 ( 0.31) 22.36 (12.73)
| 1216 16@O22 | 1460 17@H22 1459 17@O22 | 75.74 2.757 ( 0.14) 24.55 (13.66)
| 4232 53@O25 | 4143 52@H24 4142 52@O24 | 74.35 2.916 ( 0.25) 28.27 (13.26)
| 3683 46@O16 | 4163 52@H13 4162 52@O13 | 73.78 2.963 ( 0.29) 23.65 (14.14)
| 4162 52@O13 | 4079 51@H12 4078 51@O12 | 73.68 2.841 ( 0.19) 21.25 (11.87)
| 3764 47@O16 | 3825 48@H26 3824 48@O26 | 70.52 2.973 ( 0.28) 26.88 (13.14)
| 193 3@O13 | 353 5@H12 352 5@O12 | 67.49 2.780 ( 0.17) 17.85 (10.90)
| 3035 38@O16 | 3350 42@H12 3349 42@O12 | 67.19 2.790 ( 0.16) 18.72 (10.47)
| 686 9@O16 | 893 12@H22 892 12@O22 | 66.87 2.905 ( 0.22) 26.53 (10.90)
| 1478 19@O25 | 1703 22@H22 1702 22@O22 | 64.37 2.864 ( 0.21) 31.87 (14.12)
| 3521 44@O16 | 747 10@H26 746 10@O26 | 63.71 2.941 ( 0.27) 26.82 (13.51)
| 1313 17@O26 | 1217 16@H22 1216 16@O22 | 63.09 2.807 ( 0.16) 22.23 (11.92)
| 4159 52@O12 | 3684 46@H16 3683 46@O16 | 62.43 2.900 ( 0.22) 35.69 (12.23)
| 4331 54@O16 | 1490 19@H13 1489 19@O13 | 61.80 2.989 ( 0.29) 26.58 (14.32)
| 3440 43@O16 | 3906 49@H26 3905 49@O26 | 60.17 2.964 ( 0.28) 28.61 (13.24)
| 1334 17@O16 | 1247 16@H13 1246 16@O13 | 59.31 2.828 ( 0.18) 25.35 (12.61)
| 1729 22@O12 | 1557 20@H26 1556 20@O26 | 58.11 3.036 ( 0.27) 32.81 (11.84)
| 4151 52@O25 | 4484 56@H12 4483 56@O12 | 57.67 2.917 ( 0.32) 27.71 (15.02)
| 1502 19@O11 | 1730 22@H12 1729 22@O12 | 57.53 3.184 ( 0.26) 41.62 (13.24)
| 3014 38@O26 | 3353 42@H13 3352 42@O13 | 57.42 2.884 ( 0.24) 22.59 (12.87)
| 3524 44@O15 | 3917 49@H12 3916 49@O12 | 57.35 3.227 ( 0.35) 25.52 (13.61)
| 2390 30@O15 | 2756 35@H22 2755 35@O22 | 57.28 3.074 ( 0.33) 31.27 (14.44)
| 1739 22@O16 | 5115 64@H24 5114 64@O24 | 56.78 2.876 ( 0.28) 20.94 (13.42)
| 4574 57@O16 | 5061 63@H16 5060 63@O16 | 56.57 2.956 ( 0.25) 30.52 (14.00)
| 2846 36@O24 | 3566 45@H22 3565 45@O22 | 55.92 2.880 ( 0.24) 22.85 (12.39)
| 605 8@O16 | 839 11@H12 838 11@O12 | 55.67 2.894 ( 0.24) 25.45 (13.25)
$
$
$ perl -lne 'BEGIN {$x{$_} = 1 for grep {$_%8 == 0 or $_%8 == 1 or $_ <= 8 or $_ >= 57} (1..64)}
             print if /^\|.*?(\d+)\@.*?(\d+)\@/ and $x{$1} and $x{$2}
            ' f8
| 1216 16@O22 | 1460 17@H22 1459 17@O22 | 75.74 2.757 ( 0.14) 24.55 (13.66)
| 193 3@O13 | 353 5@H12 352 5@O12 | 67.49 2.780 ( 0.17) 17.85 (10.90)
| 1313 17@O26 | 1217 16@H22 1216 16@O22 | 63.09 2.807 ( 0.16) 22.23 (11.92)
| 1334 17@O16 | 1247 16@H13 1246 16@O13 | 59.31 2.828 ( 0.18) 25.35 (12.61)
| 4574 57@O16 | 5061 63@H16 5060 63@O16 | 56.57 2.956 ( 0.25) 30.52 (14.00)
$
$
$

tyler_durden

DGPickett · September 12, 2011, 9:29am

You can use sort and comm to find the lines that are only in one file, only in the other, or in both, which divides your set.
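
A rough sketch of that approach, assuming the unwanted lines have already been written to a second file (for instance by the selecting awk script). The name remove_selected and both arguments are hypothetical. Note that the result comes out in sorted order, not in the original order of the table.

```shell
# Sketch only: remove_selected is a hypothetical name.
# $1 = original file, $2 = file holding the lines to drop.
remove_selected() {
    tmp_all=$(mktemp) || return 1
    tmp_sel=$(mktemp) || { rm -f "$tmp_all"; return 1; }
    LC_ALL=C sort "$1" > "$tmp_all"
    LC_ALL=C sort "$2" > "$tmp_sel"
    # -23 suppresses columns 2 and 3, leaving lines found only in the first file
    LC_ALL=C comm -23 "$tmp_all" "$tmp_sel"
    rm -f "$tmp_all" "$tmp_sel"
}
```

If the original line order matters, filtering directly in awk is probably the simpler route.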

rwuerth · September 13, 2011, 7:45am

Simply negate the whole test, so that a data line is printed unless both of
its residue numbers are in the "set" array, and print all lines that don't
start with "|" as below, assuming you want to see all the headers and other
non-data information. Note that the negation has to wrap the combined test:
negating each half and joining them with && would also drop lines where only
one of the two numbers is in the set.

#!/usr/bin/awk -f
 
 BEGIN  {
   # Only the subscripts matter: "in" tests whether a subscript exists.
   for (n=1; n<=8; n++) set[n] = n;       # top row
   for (n=9; n<=49; n+=8) {
     set[n] = n                           # left column
     set[n+7] = n+7                       # right column
   }
   for (n=57; n<=64; n++) set[n] = n;     # bottom row
 }
 
 ($1 == "|") {
     split($3, res1, "@"); split($6, res2, "@");
     if ( ! ((res1[1] in set) && (res2[1] in set)) ) #<--- negate the whole test
     {
       print;
     }
 
  }
 
  ($1 != "|") { print; }            # <--- prints headers and other lines
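
Since awk writes to standard output rather than editing the file, "deleting" in practice means redirecting the filtered output to a new file. Below is a minimal sketch of that step, assuming the rule stated earlier in the thread (drop a data line only when both residue numbers are in the outer set). The function name drop_outer_pairs and its two file arguments are hypothetical.

```shell
# Sketch only: drop_outer_pairs is a hypothetical name.
# $1 = input table, $2 = new file to write.
drop_outer_pairs() {
    awk '
    BEGIN {
        for (n = 1;  n <= 8;  n++)    set[n] = 1                      # top row
        for (n = 9;  n <= 49; n += 8) { set[n] = 1; set[n + 7] = 1 }  # side columns
        for (n = 57; n <= 64; n++)    set[n] = 1                      # bottom row
    }
    $1 == "|" {
        split($3, r1, "@"); split($6, r2, "@")
        if ((r1[1] in set) && (r2[1] in set)) next   # both outer: drop the line
    }
    { print }                                        # everything else is kept
    ' "$1" > "$2"
}
```

For example, drop_outer_pairs input.tbl filtered.tbl (placeholder names) leaves the original untouched. Once the new file looks right it can be moved over the original.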