Selecting and deleting specific lines with a condition

I have a set of data as below:

The first field, $1, is "|".
The 3rd field ($3) and the 6th field ($6) in my data file contain the "number-molecule". These numbers are arranged as below:

   1    2   3   4   5   6   7       8

   9    10  11  12  13  14  15      16
  17    18  19  20  21  22  23      24
  25    26  27  28  29  30  31      32
  33    34  35  36  37  38  39      40
  41    42  43  44  45  46  47      48
  49    50  51  52  53  54  55      56

  57    58  59  60  61  62  63      64 

Any pair made from the numbers above can occur as the pair in the 3rd and 6th fields of a line in the data file.

What I want is to select the pairs in the data file that are made only of numbers on the outermost rows and columns of the number-molecule arrangement above.

In short, ANY PAIR made only of the numbers

 (1 2 3 4 5 6 7 8   57 58 59 60 61 62 63 64   9 17 25 33 41 49 57   8 16 24 32 40 48 56 64)

in other words

1 , 2
1 , 3
1 , 4
.
.
1 , 57
1 , 58
1 , 59
.
.
.
2, 1
2, 3
2, 4
2, 5
.
.
.
2, 57
2, 58
2, 59
.
.
.

must be deleted from the data file.

To achieve this, I tried the awk script below as a test: it should print the lines that I want to delete. But even at this stage it fails to select those pairs.

 #!/usr/bin/awk -f

 BEGIN  {
   i=0
   for (n=1; n<=8; n++) set[i++] = n;
   for (n=57; n<=64; n++) set[i++] = n;
   for (n=9; n<=49; n+=8) {set[i++] = n; set[i++] = n+7};
    }


 ($1== "|") {
     split($3, res1, "@"); split($6, res2, "@"); #print res1[1], res2[1]

     if ( (res1[1] in set) == (res2[1] in set) ); 

     {
       print;
      }

 }

Can I get any help with this?

Thanks in advance


Generally, "delete" means making a new file without the unwanted lines. I am a bit confused about the objective. It may be a multi-pass project: collect information, rearrange it to decide what to do, and then apply those results to the original. Where did it get hard?

I'm afraid it's hard to understand your problem. Could you show which rows you expect to keep or delete (and, once more, why) in this test set:

awk '/^\|/ {print $3, $6}' INPUTFILE
58@O12 1174@H1
58@O12 1174@H2
34@O12 1122@H1
34@O12 1122@H2
4@O12 1122@H2
4@O12 1122@H1
58@O16 396@H2
58@O15 396@H2
54@O26 1078@H2
58@O16 396@H1
23@O16 400@H1
23@O16 400@H2
48@O16 1162@H2
58@O15 396@H1
48@O16 1162@H1
19@O13 1078@H1
48@O26 377@H2
53@O15 1162@H2
19@O22 1078@H1
14@O12 402@H1
53@O15 1162@H1
14@O12 402@H2
48@O26 377@H1
48@O16 396@H2

I shall give another set of data for clarity.

In the first line, the residue number in field 3 ($3) is 59 and the residue number in field 6 ($6) is 19. According to the number-molecule arrangement, 59 is on the outermost lines and 19 is not. So this line should NOT be deleted.

In the second line, the number in field 3 ($3) is 19 and the number in field 6 ($6) is 24. Number 19 is not on the outermost lines but 24 is. This line should not be deleted either, since NOT both numbers are on the outermost lines.

In the third line, the number in field 3 ($3) is 16 and the number in field 6 ($6) is 17. Since both numbers in this pair are on the outermost lines, this line should be deleted.

So a line should be deleted only when both of its numbers are on the outermost lines. This is what I need to achieve, and my code simply does not work as I wanted.

Thanks in advance.

So, this is a negative join. Seems like his approach should be good: save the outer numbers in an array, and then as you go through the lines, look them up and decide whether to copy the line. You could use while read in ksh/bash and put @ in IFS to split that field into two. You could also decide mathematically whether a number is in an outer column: (( (N%8) < 2 )).
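A rough sketch of the while read idea, assuming the layout of the sample lines in this thread (fields separated by single spaces, residue number in the 3rd and 6th fields, written as res@atom). The function name split_pairs and the variable names are made up for illustration.

```sh
# Hypothetical helper: print the two residue numbers of every data line on stdin.
split_pairs() {
  # With "@" in IFS, "59@O12" is read as two words: "59" and "O12".
  while IFS=' @' read -r bar atom1 res1 name1 bar2 atom2 res2 rest; do
    [ "$bar" = "|" ] || continue      # skip headers and other non-data lines
    printf '%s %s\n' "$res1" "$res2"
  done
}

# Usage: split_pairs < INPUTFILE
```

The two numbers it prints per line could then be looked up in the array of outer numbers, or tested arithmetically.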

What about when field 6 and 8 do not match? No different?

Seems to me that you could use modulus to simplify the tests.

Your number x (assumed to be less than or equal to 64?)

if x % 8 = 0 it's in the right hand column
if x % 8 = 1 it's in the left hand column

then you just have the ranges

2<=x<=7

and

58<=x<=63
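The modulus tests above could be sketched in awk roughly like this. It assumes the 8x8 arrangement numbered 1 to 64 from the question; the names outer_numbers and is_outer are made up for illustration.

```sh
# Hypothetical helper: list every number from 1 to 64 that passes the modulus tests.
outer_numbers() {
  awk '
    # Returns 1 when x is on the border of the 8x8 grid, 0 otherwise.
    # Order of the tests: right column, left column, top row, bottom row.
    function is_outer(x) {
      return (x % 8 == 0) || (x % 8 == 1) || (x >= 2 && x <= 7) || (x >= 58 && x <= 63)
    }
    BEGIN {
      for (x = 1; x <= 64; x++)
        if (is_outer(x)) { out = out sep x; sep = " " }
      print out
    }'
}

outer_numbers
```

Running it prints the 28 border numbers on one line, which is a quick way to compare the tests against the list in the question.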


Ahhh...just realized, this test is wrong!

if ( (res1[1] in set) == (res2[1] in set) ); 

You can't test for the value of the array element this way, only that the subscript exists!

You could fill your array differently: instead of using set[i++], why not use the number itself as the subscript, set[n]? Then your test should work, as you would have the following elements:

set[1] through set[8]
set[9], set[16]
set[17], set[24]
set[25], set[32]
set[33], set[40]
set[41], set[48]
set[49], set[56]
set[57] through set[64]

Two other things I noticed: remove the semicolon after the test, and change "==" to "&&":

if ( (res1[1] in set) && (res2[1] in set) )

The stray semicolon was the biggest problem your script had, other than trying to
use "in" to test the stored values instead of the subscripts. It ends the if statement right there, so the block that follows runs for every "|" line.
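To make the subscript point concrete, here is a small sketch. The array names by_count and by_value are invented for the demonstration; only the numbers 57 to 64 are stored, to keep it short.

```sh
# Hypothetical demo: "in" looks at subscripts, never at stored values.
in_demo() {
  awk 'BEGIN {
    # Original style: the values 57..64 are stored under subscripts 0..7.
    i = 0
    for (n = 57; n <= 64; n++) by_count[i++] = n
    # Suggested style: each number is its own subscript.
    for (n = 57; n <= 64; n++) by_value[n] = n

    print "57 in by_count:", ((57 in by_count) ? "yes" : "no")
    print "57 in by_value:", ((57 in by_value) ? "yes" : "no")
    print "3 in by_count:", ((3 in by_count) ? "yes" : "no")
  }'
}

in_demo
```

With the counter as subscript, 57 is reported as missing although it was stored, and 3 is reported as present only because 3 happens to be one of the counter values.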

This code worked for me.

#!/usr/bin/awk -f

 BEGIN  {
   i=0
   for (n=1; n<=8; n++) set[n] = n;
   for (n=9; n<=49; n+=8) {
     set[n] = n 
     set[n+7] = n+7 
   };
   for (n=57; n<=64; n++) set[n] = n;
 }

 ($1 == "|") {
     split($3, res1, "@"); split($6, res2, "@");
     if ( (res1[1] in set) && (res2[1] in set) ) # <--- no ';' here!
     {
       print;
     }

 }

udc1.txt was your first and second examples put together, in that order.


Dear sir,

Thanks so much for your kind reply. The code now works perfectly for my needs. But I want to ask one more related thing. I put "print" at the end of the code so that I could check whether it selects exactly the lines I don't want. Now, if I want to delete those selected lines, what command should I use?

Using Perl -

$
$
$ cat f8
DONOR ACCEPTORH ACCEPTOR
atom# res@atom atom# res@atom atom# res@atom %occupied distance angle
| 4726 59@O12 | 1487 19@H12 1486 19@O12 | 85.66 2.819 ( 0.18) 21.85 (12.11)
| 1499 19@O15 | 1730 24@H12 1729 24@O12 | 83.15 3.190 ( 0.31) 22.36 (12.73)
| 1216 16@O22 | 1460 17@H22 1459 17@O22 | 75.74 2.757 ( 0.14) 24.55 (13.66)
| 4232 53@O25 | 4143 52@H24 4142 52@O24 | 74.35 2.916 ( 0.25) 28.27 (13.26)
| 3683 46@O16 | 4163 52@H13 4162 52@O13 | 73.78 2.963 ( 0.29) 23.65 (14.14)
| 4162 52@O13 | 4079 51@H12 4078 51@O12 | 73.68 2.841 ( 0.19) 21.25 (11.87)
| 3764 47@O16 | 3825 48@H26 3824 48@O26 | 70.52 2.973 ( 0.28) 26.88 (13.14)
| 193 3@O13 | 353 5@H12 352 5@O12 | 67.49 2.780 ( 0.17) 17.85 (10.90)
| 3035 38@O16 | 3350 42@H12 3349 42@O12 | 67.19 2.790 ( 0.16) 18.72 (10.47)
| 686 9@O16 | 893 12@H22 892 12@O22 | 66.87 2.905 ( 0.22) 26.53 (10.90)
| 1478 19@O25 | 1703 22@H22 1702 22@O22 | 64.37 2.864 ( 0.21) 31.87 (14.12)
| 3521 44@O16 | 747 10@H26 746 10@O26 | 63.71 2.941 ( 0.27) 26.82 (13.51)
| 1313 17@O26 | 1217 16@H22 1216 16@O22 | 63.09 2.807 ( 0.16) 22.23 (11.92)
| 4159 52@O12 | 3684 46@H16 3683 46@O16 | 62.43 2.900 ( 0.22) 35.69 (12.23)
| 4331 54@O16 | 1490 19@H13 1489 19@O13 | 61.80 2.989 ( 0.29) 26.58 (14.32)
| 3440 43@O16 | 3906 49@H26 3905 49@O26 | 60.17 2.964 ( 0.28) 28.61 (13.24)
| 1334 17@O16 | 1247 16@H13 1246 16@O13 | 59.31 2.828 ( 0.18) 25.35 (12.61)
| 1729 22@O12 | 1557 20@H26 1556 20@O26 | 58.11 3.036 ( 0.27) 32.81 (11.84)
| 4151 52@O25 | 4484 56@H12 4483 56@O12 | 57.67 2.917 ( 0.32) 27.71 (15.02)
| 1502 19@O11 | 1730 22@H12 1729 22@O12 | 57.53 3.184 ( 0.26) 41.62 (13.24)
| 3014 38@O26 | 3353 42@H13 3352 42@O13 | 57.42 2.884 ( 0.24) 22.59 (12.87)
| 3524 44@O15 | 3917 49@H12 3916 49@O12 | 57.35 3.227 ( 0.35) 25.52 (13.61)
| 2390 30@O15 | 2756 35@H22 2755 35@O22 | 57.28 3.074 ( 0.33) 31.27 (14.44)
| 1739 22@O16 | 5115 64@H24 5114 64@O24 | 56.78 2.876 ( 0.28) 20.94 (13.42)
| 4574 57@O16 | 5061 63@H16 5060 63@O16 | 56.57 2.956 ( 0.25) 30.52 (14.00)
| 2846 36@O24 | 3566 45@H22 3565 45@O22 | 55.92 2.880 ( 0.24) 22.85 (12.39)
| 605 8@O16 | 839 11@H12 838 11@O12 | 55.67 2.894 ( 0.24) 25.45 (13.25)
$
$
$ perl -lne 'BEGIN {$x{$_} = 1 for grep {$_%8 == 0 or $_%8 == 1 or $_ <= 8 or $_ >= 57} (1..64)}
             print if /^\|.*?(\d+)\@.*?(\d+)\@/ and $x{$1} and $x{$2}
            ' f8
| 1216 16@O22 | 1460 17@H22 1459 17@O22 | 75.74 2.757 ( 0.14) 24.55 (13.66)
| 193 3@O13 | 353 5@H12 352 5@O12 | 67.49 2.780 ( 0.17) 17.85 (10.90)
| 1313 17@O26 | 1217 16@H22 1216 16@O22 | 63.09 2.807 ( 0.16) 22.23 (11.92)
| 1334 17@O16 | 1247 16@H13 1246 16@O13 | 59.31 2.828 ( 0.18) 25.35 (12.61)
| 4574 57@O16 | 5061 63@H16 5060 63@O16 | 56.57 2.956 ( 0.25) 30.52 (14.00)
$
$
$

tyler_durden


You can use sort and comm to find the lines that are in one file only, in the other only, or in both, which divides your set.
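A minimal sketch of the sort and comm route, assuming the unwanted lines have already been saved to a second file (for example, the output of a script that prints them). The function name remaining_lines and both file arguments are hypothetical.

```sh
# Hypothetical helper: print the lines of file $1 that do not appear in file $2.
remaining_lines() {
  tmp=$(mktemp -d) || return 1
  # comm needs both inputs sorted the same way, so fix the locale.
  LC_ALL=C sort "$1" > "$tmp/all"
  LC_ALL=C sort "$2" > "$tmp/unwanted"
  # -2 drops lines found only in the second file, -3 drops lines found in both.
  LC_ALL=C comm -23 "$tmp/all" "$tmp/unwanted"
  rm -rf "$tmp"
}

# Usage: remaining_lines datafile unwanted_lines > kept_lines
```

Note that the result comes out in sorted order, so the original line order of the data file is lost.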


Simply negate the test so that a data line is printed only when it is not one of the pairs
selected by the "set" array, and print all lines that don't start with "|" as below,
assuming you want to keep all the headers and other non-data information.

#!/usr/bin/awk -f
 
 BEGIN  {
   i=0                   # <--- Not needed anymore. Delete
   for (n=1; n<=8; n++) set[n] = n;
   for (n=9; n<=49; n+=8) {
     set[n] = n 
     set[n+7] = n+7 
   };
   for (n=57; n<=64; n++) set[n] = n;
 }
 
 ($1 == "|") {
     split($3, res1, "@"); split($6, res2, "@");
     # Keep the line unless BOTH numbers are in set
     if ( ! ((res1[1] in set) && (res2[1] in set)) ) #<--- negate the whole test
     {
       print;
     }
 
  }
 
  ($1 != "|") { print; }            # <--- prints headers and other lines
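awk does not change the data file itself; it only writes the kept lines to standard output. A minimal sketch of the usual pattern follows: filter into a temporary file, then copy the result back over the original. The function name delete_outer_pairs is made up, and the field layout is assumed to match the samples in this thread.

```sh
# Hypothetical helper: remove, in place, every data line whose two residue
# numbers are both outer numbers. Takes the data file name as its argument.
delete_outer_pairs() {
  file=$1
  tmp=$(mktemp) || return 1
  awk '
    BEGIN {
      for (n = 1; n <= 8; n++)   set[n] = 1
      for (n = 57; n <= 64; n++) set[n] = 1
      for (n = 9; n <= 49; n += 8) { set[n] = 1; set[n + 7] = 1 }
    }
    $1 == "|" {
      split($3, res1, "@"); split($6, res2, "@")
      if ((res1[1] in set) && (res2[1] in set)) next   # both outer: drop the line
    }
    { print }                                          # everything else is kept
  ' "$file" > "$tmp" && cat "$tmp" > "$file"
  status=$?
  rm -f "$tmp"
  return $status
}

# Usage: delete_outer_pairs f8
```

Working on a copy of the data first is the safer choice, since the original file is overwritten.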