I am trying to use awk to find all the $3 values in file2 that fall between $2 and $3 in file1. If a $3 value in file2 falls between those two file1 fields, it is printed along with the $6 value from file1. Both file1 and file2 are tab-delimited, as is the desired output. If there is nothing to print, the next line is processed. The awk below currently just prints all of file1, whether or not the values are found. Thank you :).
Both commands run great. My actual dataset is ~960,000 lines (26 MB). Is there a more efficient way to search this file? The two file formats are as posted; they are just quite large. Thank you :).
Below I will post code for a more robust solution that should not slow down as the files grow. If you can depend on file1, the filter range file, being sorted with no overlaps, then you might be able to adjust the awk program to look only at the filter records near the key in field 3 of file2. Another simple solution may be to import the data into a relational database and query it with SQL. If you have access to Perl and CPAN and can install the Perl module Tree::Range::RB then ...
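#!/bin/bash
perl -Mstrict -MTree::Range::RB -wane'
    our $rat;
    BEGIN {
        # Range tree keyed by number; built once before any input is read.
        $rat = Tree::Range::RB->new({ "cmp" => sub { $_[0] <=> $_[1] } });
    }
    if (@ARGV) {    # file2 is still waiting in @ARGV, so this line is from file1
        # Map the range in fields 2 and 3 to the value in field 6.
        $rat->range_set($F[1], $F[2], $F[5]);
    }
    else {          # second file
        # Use defined() so that a stored value of 0 still counts as a match.
        if (defined(my $v = $rat->get_range($F[2]))) {
            chomp;
            print "$_\t$v\n";
        }
    }
' file1 file2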
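If installing a CPAN module is not an option, the sorted, non-overlapping case mentioned above can be sketched in plain awk with a binary search. This is only a rough sketch under assumptions that are mine, not the original poster's: file1 is sorted ascending by $2, its ranges do not overlap, neither file has a header line, and both range ends count as "between". The function name range_lookup is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: range_lookup RANGE_FILE QUERY_FILE
# Assumes RANGE_FILE is sorted ascending by field 2 with non-overlapping ranges.
range_lookup() {
    awk -F'\t' -v OFS='\t' '
        # First file: remember each range and the value from field 6.
        NR == FNR { n++; lo[n] = $2 + 0; hi[n] = $3 + 0; val[n] = $6; next }
        # Second file: binary-search for the last range that starts
        # at or before the key in field 3.
        {
            key = $3 + 0
            a = 1; b = n; hit = 0
            while (a <= b) {
                m = int((a + b) / 2)
                if (lo[m] <= key) { hit = m; a = m + 1 }
                else              { b = m - 1 }
            }
            # Print only when the key is also within the upper bound (inclusive).
            if (hit && key <= hi[hit]) print $0, val[hit]
        }
    ' "$1" "$2"
}
```

It would be called as `range_lookup file1 file2 > output`. Each file2 line then costs about log2(N) comparisons against the N ranges instead of a scan of all of file1, which is where the speed-up on a large file would come from.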