awk to print line is values between two fields in separate file

cmccabe · December 14, 2016, 9:38am

I am trying to use awk to find all the $3 values in file2 that are between $2 and $3 in file1 . If a value in $3 of file2 is between the file1 fields then it is printed along with the $6 value in file1 . Both file1 and file2 are tab-delimited as well as the desired output . If there is nothing to print then the next line is processed. The awk below currently just prints all of file1 , no matter if the values are found. Thank you :).

file1 tab-delimited

chr1	948953	948956	chr1:948953-948956	.	ISG15
chr1	949363	949858	chr1:949363-949858	.	ISG15
chr1	955542	955763	chr1:955542-955763	.	AGRN
chr1	957570	957852	chr1:957570-957852	.	AGRN
chr1	976034	976270	chr1:976034-976270	.	AGRN

file2 tab-delimited

rs13303106	1	891945	GG
rs28415373	1	893981	CC
rs13303010	1	894573	AA
rs6696281	1	903104	CC
rs28391282	1	904165	GG
rs4511111	1	949375	GG
rs6657048	1	957640	CC
rs2710888	1	959842	CT
rs3128126	1	962210	AG
rs2710875	1	977780	CT

desired output tab-delimited

rs4511111	1	949375	GG     ISG15

awk

awk -F'\t' -v OFS='\t' '                   
    NR == FNR {min[$1]=$2; max[$1]=$3; Gene[$6]=$NF; next}
    {                
        for (id in min) 
            if (min[id] < $3 && $3 < max[id]) {
                print $0, id, Gene[id]
                break              
            }
    }                                     
' file1 file2

RavinderSingh13 · December 14, 2016, 10:12am

Hello cmccabe,

I am not sure completely about your requirement but could you please try following and let me know if this helps.

awk 'FNR==NR{A[$3]=$0;next} {for(i in A){if(i>$2 && i<$3){print A FS $NF}}}'  Input_file2   Input_file1

Output will be as follows.

rs4511111       1       949375  GG ISG15
rs6657048       1       957640  CC AGRN

If you have any other requirements then please do let us know with more details.

NOTE: So above code will search each line of Input_file1 with each line of Input_file2.

Thanks,
R. Singh

ronaldxs · December 14, 2016, 10:46am

It looks to me like your program can be fixed with just small changes to one line. I get the same output as R. Singh

change

NR == FNR {min[$1]=$2; max[$1]=$3; Gene[$6]=$NF; next}

to

NR == FNR {min[NR]=$2; max[NR]=$3; Gene[NR]=$NF; next}

Now each record in the filter file has a separate entry in the min, max, and Gene arrays.

cmccabe · December 14, 2016, 5:47pm

Both commands run great, my actual dataset is ~960,000 lines or 26 MB. Is there a more efficient way to search this file? The two file formats are as posted, they are just quite large. Thank you :).

ronaldxs · December 14, 2016, 7:22pm

I will post code for a more robust and sophisticated solution that should avoid slow down with size below. If you can depend on file1, the filter range file, being sorted with no overlaps than you might be able to adjust the awk program to only look at filter records near the file2, field 3, key. Another simple solution may be to import the data into a relational database and query with SQL. If you have access to Perl and CPAN and can install the Perl module Tree::Range::RB then ...

#!/bin/bash

perl -Mstrict -MTree::Range::RB -wane'
    our $rat;
    BEGIN {
        $rat = Tree::Range::RB->new({ "cmp" => sub { $_[0] <=> $_[1] }});
    }
    if (@ARGV) { # first - filter file
        $rat->range_set($F[1], $F[2], $F[5])
    }
    else { # second file
        if (my $v = $rat->get_range($F[2])) {
            chomp;
            print "$_\t$v\n";
        }
    }
' file1 file2