Removing rows from a file

I have a file like the one below and want to use awk to solve this problem. The record separator is ">". I want to look at each record (section) enclosed between the ">" lines and find the row whose 2nd and 3rd columns are both 0, such as

10 0  0 

I need to take the first number, which in this case is 10. Then I take the first number in each row of the section between the ">" lines and check whether its difference from 10 is greater than 40. If it is greater, the row is removed.

For example we do something like this

10-10
13-10
16-10
19-10
22-10
25-10
28-10
31-10
34-10
37-10

If the value is greater than 40, we remove the row.

>
10 0 0
13 5.92346 5.92346
16 10.3106 10.3106
19 13.9672 13.9672
22 16.9838 16.9838
25 19.4407 19.4407
28 21.4705 21.4705
31 23.1547 23.1547
34 24.6813 24.6813
37 26.0695 26.0695
>
40 27.3611 27.3611
43 28.631 28.631
46 29.8366 29.8366
49 30.9858 30.9858
52 32.0934 32.0934
55 33.1458 33.1458
58 34.1637 34.1637
61 35.1297 35.1297
64 36.0253 36.0253
67 36.9248 36.9248
70 37.8001 37.8001
>

Some better input test data would be nice.

Personally, I see no relationship between your initial paragraph (10 0 0), your test input data, and your output.

Otherwise, this sounds like a fairly straightforward awk script.

I didn't get your question. You said that your field separator is ">", but in your sample it appears as a row of its own. Can you explain it in a better way?

I have this file:

>
10 0 0
13 5.92346 5.92346
16 10.3106 10.3106
19 13.9672 13.9672
22 16.9838 16.9838
25 19.4407 19.4407
28 21.4705 21.4705
31 23.1547 23.1547
34 24.6813 24.6813
37 26.0695 26.0695
>
40 27.3611 27.3611
43 28.631 28.631
46 29.8366 29.8366
49 0 0
52 32.0934 32.0934
55 33.1458 33.1458
58 34.1637 34.1637
61 35.1297 35.1297
64 36.0253 36.0253
67 36.9248 36.9248
70 37.8001 37.8001
100 37.8001 37.8001
>

From each section within the ">" signs, I take each row and find the one having 0 as the second and third number. I take its first number. For example, in the first section it's 10, because we find 10 0 0.

Then we take each row and subtract 10 from its first number. Then we check whether the result is greater than 40. If it is greater than 40, we remove the row.

Hope this describes things better.

Output would be

>
10 0 0
13 5.92346 5.92346
16 10.3106 10.3106
19 13.9672 13.9672
22 16.9838 16.9838
25 19.4407 19.4407
28 21.4705 21.4705
31 23.1547 23.1547
34 24.6813 24.6813
37 26.0695 26.0695
>
40 27.3611 27.3611
43 28.631 28.631
46 29.8366 29.8366
49 0 0
52 32.0934 32.0934
55 33.1458 33.1458
58 34.1637 34.1637
61 35.1297 35.1297
64 36.0253 36.0253
67 36.9248 36.9248
70 37.8001 37.8001
>

The row

100 37.8001 37.8001

has been removed because in the second section 100 - 49 > 40.
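
The rule described above can be sketched as a two-pass awk program. This is only a minimal sketch under a few assumptions: the file name sample.txt and its shortened data are made up for illustration, the file is named twice on the command line so that awk reads it twice, and a row whose difference is exactly 40 is kept, following the "greater than 40" wording.

```shell
# Hypothetical sample file; the name sample.txt is only for illustration.
cat > sample.txt <<'EOF'
>
10 0 0
37 26.0695 26.0695
52 32.0934 32.0934
>
40 27.3611 27.3611
49 0 0
70 37.8001 37.8001
100 37.8001 37.8001
>
EOF

# The file is named twice so awk reads it in two passes.
result=$(awk -v max=40 '
  # Pass 1: count sections and remember each section reference value.
  NR == FNR {
    if ($1 == ">") sec++
    else if ($2 == 0 && $3 == 0) ref[sec] = $1
    next
  }
  # Pass 2: ">" lines are printed as-is and advance the section counter.
  $1 == ">" { cur++; print; next }
  # Keep a row unless its first column exceeds the reference by more than max.
  $1 - ref[cur] <= max
' sample.txt sample.txt)
printf '%s\n' "$result"
```

With this sample, 52 is dropped from the first section (52 - 10 = 42) and 100 from the second (100 - 49 = 51). Rows that come before the reference row give a negative difference and are kept.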

awk 'BEGIN { ARGV[ARGC++] = ARGV[ARGC-1] }
FNR == NR { 
  />/ && fnr[FNR] = ++sec 
  $2 + $3 || idx[sec] = $1
  next
  }
FNR in fnr { v = idx[fnr[FNR]] }
$1 - v < max' max=40 infile
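
As far as I can tell, the BEGIN block in the command above appends a copy of the last command-line argument to ARGV, which makes awk read infile twice: once to collect the reference values and once to filter. The effect can be checked in isolation with a small sketch. The file name twice.txt is hypothetical, and the increment is written as a separate statement here so that the result does not depend on the order in which the two sides of the assignment are evaluated.

```shell
# Hypothetical two-line file, used only for this illustration.
printf 'x\ny\n' > twice.txt

# Append a copy of the last argument to ARGV, then count the records read.
count=$(awk 'BEGIN { ARGV[ARGC] = ARGV[ARGC-1]; ARGC++ } END { print NR }' twice.txt)
echo "$count"
```

The two-line file is read twice, so NR ends up at 4. Naming the file twice on the command line has the same effect.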

I corrected some really stupid typos in the code, sorry for the previous version :)

>
10 0 0
13 5.92346 5.92346
16 10.3106 10.3106
19 13.9672 13.9672
22 16.9838 16.9838
25 19.4407 19.4407
28 21.4705 21.4705
31 23.1547 23.1547
34 24.6813 24.6813
37 26.0695 26.0695
40 27.3611 27.3611
43 28.631 28.631
46 29.8366 29.8366
49 30.9858 30.9858
52 32.0934 32.0934
55 33.1458 33.1458
58 34.1637 34.1637
61 35.1297 35.1297
64 36.0253 36.0253
67 36.9248 36.9248
70 37.8001 37.8001
73 38.6296 38.6296
76 39.4503 39.4503
79 40.2424 40.2424
82 40.997 40.997
85 41.7681 41.7681
88 42.5001 42.5001
91 43.2316 43.2316
94 43.9289 43.9289
97 44.6221 44.6221
100 45.3015 45.3015
103 45.9617 45.9617
106 46.6138 46.6138
109 47.2457 47.2457
112 47.8904 47.8904
115 48.5016 48.5016
118 49.1305 49.1305
121 49.7498 49.7498
124 50.3272 50.3272
127 50.8841 50.8841
130 51.472 51.472
133 52.0619 52.0619
136 52.6079 52.6079
139 53.1586 53.1586
142 53.7149 53.7149
145 54.2602 54.2602
148 54.7771 54.7771
151 55.3154 55.3154
154 55.8316 55.8316
157 56.366 56.366
160 56.8704 56.8704
163 57.358 57.358
166 57.8577 57.8577
169 58.338 58.338
172 58.8308 58.8308
175 59.308 59.308
178 59.7918 59.7918
181 60.2547 60.2547
184 60.7199 60.7199
187 61.1781 61.1781
190 61.643 61.643
193 62.1091 62.1091
196 62.5579 62.5579
199 62.9957 62.9957
>

I tried it on the file above, but it has not solved the problem. Rows such as 196-10 and 199-10 are all greater than 40, but they still show up in the output file.

First of all, a few minutes ago I fixed some errors and updated the post.
Second: am I missing something, or should you remove all the records after this one:

49 30.9858 30.9858

52 - 10 > 40 ...

Yes, I should remove them, since they all fail the condition.

So, what do you get with the latest version of the code?
This one:

awk 'BEGIN { ARGV[ARGC++] = ARGV[ARGC-1] }
FNR == NR { 
  />/ && fnr[FNR] = ++sec 
  $2 + $3 || idx[sec] = $1
  next
  }
FNR in fnr { v = idx[fnr[FNR]] }
$1 - v < max' max=40 infile

If input is this

>
10 0 0
13 5.92346 5.92346
16 10.3106 10.3106
19 13.9672 13.9672
22 16.9838 16.9838
25 19.4407 19.4407
28 21.4705 21.4705
31 23.1547 23.1547
34 24.6813 24.6813
37 26.0695 26.0695
40 27.3611 27.3611
43 28.631 28.631
46 29.8366 29.8366
49 30.9858 30.9858
52 32.0934 32.0934
55 33.1458 33.1458
58 34.1637 34.1637
61 35.1297 35.1297
64 36.0253 36.0253
67 36.9248 36.9248
70 37.8001 37.8001
73 38.6296 38.6296
76 39.4503 39.4503
79 40.2424 40.2424
82 40.997 40.997
85 41.7681 41.7681
88 42.5001 42.5001
91 43.2316 43.2316
94 43.9289 43.9289
97 44.6221 44.6221
100 45.3015 45.3015
103 45.9617 45.9617
106 46.6138 46.6138
109 47.2457 47.2457
112 47.8904 47.8904
115 48.5016 48.5016
118 49.1305 49.1305
121 49.7498 49.7498
124 50.3272 50.3272
127 50.8841 50.8841
130 51.472 51.472
133 52.0619 52.0619
136 52.6079 52.6079
139 53.1586 53.1586
142 53.7149 53.7149
145 54.2602 54.2602
148 54.7771 54.7771
151 55.3154 55.3154
154 55.8316 55.8316
157 56.366 56.366
160 56.8704 56.8704
163 57.358 57.358
166 57.8577 57.8577
169 58.338 58.338
172 58.8308 58.8308
175 59.308 59.308
178 59.7918 59.7918
181 60.2547 60.2547
184 60.7199 60.7199
187 61.1781 61.1781
190 61.643 61.643
193 62.1091 62.1091
196 62.5579 62.5579
199 62.9957 62.9957
>

Result should be this

>
10 0 0
13 5.92346 5.92346
16 10.3106 10.3106
19 13.9672 13.9672
22 16.9838 16.9838
25 19.4407 19.4407
28 21.4705 21.4705
31 23.1547 23.1547
34 24.6813 24.6813
37 26.0695 26.0695
40 27.3611 27.3611
43 28.631 28.631
46 29.8366 29.8366
49 30.9858 30.9858
>

And another question: is the record that contains the value used in the comparison (10 in the last example) in an unknown position? Otherwise the solution will be less noisy :)

I get this:

% cat infile 
>
10 0 0
13 5.92346 5.92346
16 10.3106 10.3106
19 13.9672 13.9672
22 16.9838 16.9838
25 19.4407 19.4407
28 21.4705 21.4705
31 23.1547 23.1547
34 24.6813 24.6813
37 26.0695 26.0695
40 27.3611 27.3611
43 28.631 28.631
46 29.8366 29.8366
49 30.9858 30.9858
52 32.0934 32.0934
55 33.1458 33.1458
58 34.1637 34.1637
61 35.1297 35.1297
64 36.0253 36.0253
67 36.9248 36.9248
70 37.8001 37.8001
73 38.6296 38.6296
76 39.4503 39.4503
79 40.2424 40.2424
82 40.997 40.997
85 41.7681 41.7681
88 42.5001 42.5001
91 43.2316 43.2316
94 43.9289 43.9289
97 44.6221 44.6221
100 45.3015 45.3015
103 45.9617 45.9617
106 46.6138 46.6138
109 47.2457 47.2457
112 47.8904 47.8904
115 48.5016 48.5016
118 49.1305 49.1305
121 49.7498 49.7498
124 50.3272 50.3272
127 50.8841 50.8841
130 51.472 51.472
133 52.0619 52.0619
136 52.6079 52.6079
139 53.1586 53.1586
142 53.7149 53.7149
145 54.2602 54.2602
148 54.7771 54.7771
151 55.3154 55.3154
154 55.8316 55.8316
157 56.366 56.366
160 56.8704 56.8704
163 57.358 57.358
166 57.8577 57.8577
169 58.338 58.338
172 58.8308 58.8308
175 59.308 59.308
178 59.7918 59.7918
181 60.2547 60.2547
184 60.7199 60.7199
187 61.1781 61.1781
190 61.643 61.643
193 62.1091 62.1091
196 62.5579 62.5579
199 62.9957 62.9957
>
% awk 'BEGIN { ARGV[ARGC++] = ARGV[ARGC-1] }
FNR == NR {
  />/ && fnr[FNR] = ++sec
  $2 + $3 || idx[sec] = $1
  next
  }
FNR in fnr { v = idx[fnr[FNR]] }
$1 - v < max' max=40 infile
>
10 0 0
13 5.92346 5.92346
16 10.3106 10.3106
19 13.9672 13.9672
22 16.9838 16.9838
25 19.4407 19.4407
28 21.4705 21.4705
31 23.1547 23.1547
34 24.6813 24.6813
37 26.0695 26.0695
40 27.3611 27.3611
43 28.631 28.631
46 29.8366 29.8366
49 30.9858 30.9858
>

Note that you should use gawk, nawk or /usr/xpg4/bin/awk on Solaris, instead of the old, broken awk (/usr/bin/awk).

Yes, it's an unknown position.

>
10 0 0
13 5.92346 5.92346
16 10.3106 10.3106
19 13.9672 13.9672
22 16.9838 16.9838
25 19.4407 19.4407
28 21.4705 21.4705
31 23.1547 23.1547
34 24.6813 24.6813
37 26.0695 26.0695
40 27.3611 27.3611
43 28.631 28.631
46 29.8366 29.8366
49 30.9858 30.9858
52 32.0934 32.0934
55 33.1458 33.1458
58 34.1637 34.1637
61 35.1297 35.1297
64 36.0253 36.0253
67 36.9248 36.9248
70 37.8001 37.8001
73 38.6296 38.6296
76 39.4503 39.4503
79 40.2424 40.2424
82 40.997 40.997
85 41.7681 41.7681
88 42.5001 42.5001
91 43.2316 43.2316
94 43.9289 43.9289
97 44.6221 44.6221
100 45.3015 45.3015
103 45.9617 45.9617
106 46.6138 46.6138
109 47.2457 47.2457
112 47.8904 47.8904
115 48.5016 48.5016
118 49.1305 49.1305
121 49.7498 49.7498
124 50.3272 50.3272
127 50.8841 50.8841
130 51.472 51.472
133 52.0619 52.0619
136 52.6079 52.6079
139 53.1586 53.1586
142 53.7149 53.7149
145 54.2602 54.2602
148 54.7771 54.7771
151 55.3154 55.3154
154 55.8316 55.8316
157 56.366 56.366
160 56.8704 56.8704
163 57.358 57.358
166 57.8577 57.8577
169 58.338 58.338
172 58.8308 58.8308
175 59.308 59.308
178 59.7918 59.7918
181 60.2547 60.2547
184 60.7199 60.7199
187 61.1781 61.1781
190 61.643 61.643
193 62.1091 62.1091
196 62.5579 62.5579
199 62.9957 62.9957
>
10 5.92346 5.92346
13 0 0
16 5.92346 5.92346
19 10.3106 10.3106
22 13.9672 13.9672
25 16.9838 16.9838
28 19.4407 19.4407
31 21.4705 21.4705
34 23.1547 23.1547
37 24.6813 24.6813
40 26.0695 26.0695
43 27.3611 27.3611
46 28.631 28.631
49 29.8366 29.8366
52 30.9858 30.9858
55 32.0934 32.0934
58 33.1458 33.1458
61 34.1637 34.1637
64 35.1297 35.1297
67 36.0253 36.0253
70 36.9248 36.9248
73 37.8001 37.8001
76 38.6295 38.6295
79 39.4503 39.4503
82 40.2424 40.2424
85 40.9969 40.9969
88 41.7681 41.7681
91 42.5001 42.5001
94 43.2316 43.2316
97 43.9289 43.9289
100 44.6221 44.6221
103 45.3016 45.3016
106 45.9617 45.9617
109 46.6137 46.6137
112 47.2457 47.2457
115 47.8904 47.8904
118 48.5016 48.5016
121 49.1304 49.1304
124 49.7498 49.7498
127 50.3272 50.3272
130 50.8841 50.8841
133 51.472 51.472
136 52.0619 52.0619
139 52.608 52.608
142 53.1586 53.1586
145 53.7149 53.7149
148 54.2603 54.2603
151 54.7771 54.7771
154 55.3154 55.3154
157 55.8315 55.8315
160 56.3661 56.3661
163 56.8704 56.8704
166 57.358 57.358
169 57.8597 57.8597
172 58.3415 58.3415
175 58.8302 58.8302
178 59.3081 59.3081
181 59.7919 59.7919
184 60.2545 60.2545
187 60.7185 60.7185
190 61.1985 61.1985
193 61.6302 61.6302
196 62.1146 62.1146
199 62.5397 62.5397
>
10 10.3106 10.3106
13 5.92346 5.92346
16 0 0
19 5.92346 5.92346
22 10.3106 10.3106
25 13.9672 13.9672
28 16.9838 16.9838
31 19.4407 19.4407
34 21.4705 21.4705
37 23.1547 23.1547
40 24.6814 24.6814
43 26.0695 26.0695
46 27.3611 27.3611
49 28.631 28.631
52 29.8366 29.8366
55 30.9858 30.9858
58 32.0934 32.0934
61 33.1458 33.1458
64 34.1637 34.1637
67 35.1297 35.1297
70 36.0253 36.0253
73 36.9248 36.9248
76 37.8001 37.8001
79 38.6295 38.6295
82 39.4503 39.4503
85 40.2424 40.2424
88 40.9969 40.9969
91 41.7681 41.7681
94 42.5001 42.5001
97 43.2316 43.2316
100 43.9289 43.9289
103 44.6221 44.6221
106 45.3016 45.3016
109 45.9617 45.9617
112 46.6137 46.6137
115 47.2457 47.2457
118 47.8903 47.8903
121 48.5016 48.5016
124 49.1304 49.1304
127 49.7498 49.7498
130 50.3272 50.3272
133 50.8841 50.8841
136 51.472 51.472
139 52.0618 52.0618
142 52.608 52.608
145 53.1586 53.1586
148 53.715 53.715
151 54.2604 54.2604
154 54.777 54.777
157 55.3153 55.3153
160 55.8315 55.8315
163 56.3661 56.3661
166 56.8704 56.8704
169 57.358 57.358
172 57.8598 57.8598
175 58.3418 58.3418
178 58.8302 58.8302
181 59.3081 59.3081
184 59.7919 59.7919
187 60.2544 60.2544
190 60.7184 60.7184
193 61.1985 61.1985
196 61.6303 61.6303
199 62.1147 62.1147
>
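
With the 0 0 row at an arbitrary position in each section, as in the sample above, the reference value is only known once the whole section has been seen. One alternative to reading the file twice is to buffer each section and filter it when the next ">" line arrives. The following is a rough sketch of that idea, not the command posted in this thread. The function name filter_sections is made up, a section without a 0 0 row is compared against 0, and a difference of exactly 40 is kept.

```shell
filter_sections() {
  awk -v max=40 '
    # Print the buffered rows of the finished section that pass the test.
    function emit(   i) {
      for (i = 1; i <= n; i++)
        if (first[i] - ref <= max) print row[i]
      n = 0; ref = 0
    }
    # A ">" line closes the current section and is printed unchanged.
    $1 == ">" { emit(); print; next }
    # Buffer every data row together with its first column.
    { n++; row[n] = $0; first[n] = $1 }
    # The row with 0 in columns 2 and 3 supplies the reference value.
    $2 == 0 && $3 == 0 { ref = $1 }
    # Handle a last section that is not closed by a ">" line.
    END { emit() }
  ' "$@"
}

# Small made-up section with the reference row in second position.
result=$(printf '%s\n' '>' '10 5 5' '13 0 0' '52 9 9' '60 9 9' '>' | filter_sections)
printf '%s\n' "$result"
```

Here the reference is 13, so 60 is dropped (60 - 13 = 47) while 52 is kept (52 - 13 = 39). The trade-off is memory: a whole section is held in arrays before anything is printed.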

OK,
so what is the output you're getting with my AWK command?

I put the code in a file, awk2.scr, and ran the command

awk -f awk2.scr test > test.2

awk2.scr consists of

max=40
FNR == NR { 
  />/ && fnr[FNR] = ++sec 
  $2 + $3 || idx[sec] = $1
  next
  }
FNR in fnr { v = idx[fnr[FNR]] }
$1 - v < max

I get the same as the original file. No row gets removed.

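Well,
that's to be expected when you modify the code I posted like this :) A bare max=40 at the top level is a pattern that is always true, so awk prints every line. And without the BEGIN block the file is read only once, so every line takes the FNR == NR branch and the filter is never reached.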
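
To see why a stray top-level max=40 line matters, here is a minimal sketch with made-up three-line input. In awk, an expression on its own is a pattern with the default print action, and the value of an assignment is the value assigned.

```shell
# A bare assignment is a pattern; 40 is non-zero, so every line is printed.
always=$(printf '%s\n' a b c | awk 'max = 40')
# Inside BEGIN the assignment runs once and prints nothing by itself.
never=$(printf '%s\n' a b c | awk 'BEGIN { max = 40 }')
printf 'pattern: [%s]\nBEGIN: [%s]\n' "$always" "$never"
```

The first command echoes all three lines and the second prints nothing, which is why the assignment belongs in the BEGIN block (or on the command line as max=40).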

The awk2.scr should consist of:

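BEGIN { 
  # Queue the input file a second time so it is read in two passes
  ARGV[ARGC++] = ARGV[ARGC-1] 
  max = 40
  }
# First pass: number the sections and record each section's reference value
FNR == NR { 
  />/ && fnr[FNR] = ++sec 
  $2 + $3 || idx[sec] = $1
  next
  }
# Second pass: at each ">" line, switch to that section's reference value
FNR in fnr { v = idx[fnr[FNR]] }
# Print only the rows whose difference from the reference is below max
$1 - v < max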

Works great now.