awk-filter record by another file

biomed · June 20, 2012, 4:55am

I have file1

and file2

3983    4981
5843    7501
9169    11160
12222   12776
14276   15016
17390   19207
20065   20781
21922   22746
23512   24480
25457   26044
27418   30078
30656   32185
33362   33610
34289   34639
36834   37322
38330   39691
40664   42940
45072   45596
48065   48874
49576   50022
53338   55938
58650   59420
60581   62711
63709   64716
65602   65925
67187   68425
73410   74783
75569   76438
78806   79312
79687   80358
80927   82090
82426   82869
85172   86095
87726   88358

The output file should be

4672
22631
45324

I need to keep only the values in file1 that fall between $1 and $2 of some line in file2.
Could anyone help? awk may be a good way to filter them...

elixir_sinari · June 20, 2012, 5:26am

Please post the output required.

Does this meet your requirement?

awk 'NR==FNR{a[$1]=$2;next} {
 for(i in a)
 {
  if($0 >= i && $0 <= a[i])
  {
   print
   break
  }
 }
}' file2 file1

biomed · June 20, 2012, 5:41am

Yes, your code meets my requirements, thanks a lot...

sdf · June 20, 2012, 6:06am

awk 'NR==FNR{a[++i]=$1;next}{for(x=1;x<=i;x++) if( a[x] >= $1 && a[x] <= $2) print a[x]}' file1 file2 >outfile

---------- Post updated at 12:06 PM ---------- Previous update was at 11:50 AM ----------

elixir_sinari:

Please post the output required.

Does this meet your requirement?
awk 'NR==FNR{a[$1]=$2;next} {
 for(i in a)
 {
  if($0 >= i && $0 <= a[i])
  {
   print
   break
  }
 }
}' file2 file1

Are you sure about your code? The two numbers 3049 and 3138 should not be in outfile.

outfile

elixir_sinari · June 20, 2012, 6:12am

Pretty sure that those 2 numbers don't turn up in the output.

sdf · June 20, 2012, 6:15am

Odd. I checked your code on gawk 3 and 4, and they do appear.

elixir_sinari · June 20, 2012, 6:29am

Then try this:

gawk 'NR==FNR{a[$1]=$2;next} {
 for(i in a)
 {
  if(int($0) >= int(i) && int($0) <= int(a[i]))
  {
   print
   break
  }
 }
}' file2 file1

Or this to force a numeric comparison...

gawk 'NR==FNR{a[$1]=$2;next} {
 for(i in a)
 {
  if(0+$0 >= 0+i && 0+$0 <= 0+a[i])
  {
   print
   break
  }
 }
}' file2 file1

The latter is safer because it does not truncate floating-point values.
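
To illustrate the difference, here is a small sketch with made-up values (the bounds 10.6 and 10.9 and the input 10.2 are assumptions, not data from this thread). With int() everything is truncated to 10, so a value outside the range is accepted; with 0+ it is rejected.

```shell
# Hypothetical range 10.6..10.9 and a value 10.2 that lies outside it.
result=$(echo 10.2 | awk -v lo=10.6 -v hi=10.9 '{
    # int() truncates all three operands to 10, so the test passes.
    if (int($0) >= int(lo) && int($0) <= int(hi)) print "int():", $0, "accepted"
    # Adding 0 forces a numeric comparison but keeps the fraction.
    if (0+$0 >= 0+lo && 0+$0 <= 0+hi)             print "0+:", $0, "accepted"
}')
echo "$result"
```

Only the int() line should print here, which is the false match that the 0+ form avoids.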

drl · June 20, 2012, 6:50am

Hi.

I agree with sdf, the extra 2 lines appear. The alternate statements:

if($0 >= i+0 && $0 <= a[i]) # SUCCEEDS!
if($0 >= int(i) && $0 <= a[i]) # SUCCEEDS!

will both work. Of the two, I think the i+0 is a bit tricky, especially for people who don't know awk well, so I would choose int(i) and add a comment why it is required. Or perhaps add the int() to all 4 as elixir_sinari wrote and omit any confusing explanation. Good point about floats, however. I suppose one could precondition all the data to truncate to integers, especially in this case ... cheers, drl
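
A likely reason only the subscript needs the conversion: in awk, array subscripts are always strings, so the loop variable of a for-in loop is a string and a comparison against it is done character by character. The sketch below is an isolated test, not part of the original code; it uses 3049 and the 12222 lower bound from file2.

```shell
result=$(printf '3049\n' | awk '{
    a["12222"] = 12776
    for (i in a) {
        # i is a string subscript: "3049" >= "12222" is true because "3" sorts after "1".
        print "string compare:", ($0 >= i)
        # i+0 is a number: 3049 >= 12222 is false.
        print "numeric compare:", ($0 >= i+0)
    }
}')
echo "$result"
```

The string comparison yields 1 and the numeric one yields 0, which would explain how 3049 slipped past the lower-bound test.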

sdf · June 20, 2012, 6:57am

The numeric comparisons work!
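
For reference, a minimal end-to-end sketch of the numeric approach. The sample file names and the contents of file1.sample are assumptions (file1 was not shown in the thread; the five values are the ones mentioned above), and file2.sample is a four-line subset of file2. It stores the bounds in numerically indexed arrays, which is a variation on the code above that avoids comparing against string subscripts at all.

```shell
# Assumed sample data: a subset of file2 and the values discussed in the thread.
printf '3983 4981\n12222 12776\n21922 22746\n45072 45596\n' > file2.sample
printf '3049\n3138\n4672\n22631\n45324\n' > file1.sample

awk 'NR==FNR { lo[++n] = $1; hi[n] = $2; next }   # first file: remember each range
{
    for (k = 1; k <= n; k++)
        # +0 forces numeric comparison on both sides
        if ($1+0 >= lo[k]+0 && $1+0 <= hi[k]+0) { print; break }
}' file2.sample file1.sample > outfile.sample

cat outfile.sample
```

With this sample data, 3049 and 3138 are dropped and 4672, 22631 and 45324 are kept.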