Filter records based on 2nd file

Hello,

I want to keep only those records of a file that fall within a range listed in a second file. First, the chr number (2nd col of the 1st file, 1st col of the 2nd file) has to match. Then, if the 3rd col of the first file falls within any of the ranges given by the 2nd and 3rd cols of the second file, that record goes to the output.
All files are sorted from low to high.

File to be filtered looks like

9927    chr1    83      T       C
9927    chr1    92      A       C
9927    chr1    97      A       C
9927    chr2    262     C       G
9927    chr2    292     C       G
9927    chr2    367     C       G

Range file looks like

chr1    46    84
chr1    95    227
chr2    261  326

Filtered output

9927    chr1    83      T       C
9927    chr1    97      A       C
9927    chr2    262     C       G
9927    chr2    292     C       G

I have 758 files to filter. I think I can use a loop like the following
once I have the magic_script that goes inside it.

for file in *; do magic_script "$file" range_file > "${file}_filtered"; done

Hi,
Try this:

$ cat chr1.txt
9927    chr1    83      T       C
9927    chr1    92      A       C
9927    chr1    97      A       C
9927    chr2    262     C       G
9927    chr2    292     C       G
9927    chr2    367     C       G
$ cat chr2.txt
chr1    46    84
chr1    95    227
chr2    261  326
$ sed 's/  / /g' <(awk '{printf("xxxx %s %s\nyyyy %s %s\n",$1,$2,$1,$3)}' chr2.txt) chr1.txt | sort -k2 -n -k3 | sed -n '/xxxx/,/yyyy/{/xxxx\|yyyy/!p;}'
9927  chr1  83   T    C
9927  chr1  97   A    C
9927  chr2  262   C    G
9927  chr2  292   C    G

Regards.

Edit - Never mind, don't pay attention to this stupid question.

Is it acceptable to use a range file like this?

Here is an awk based approach that might work:

awk '
        NR == FNR {
                A[$1] = A[$1] ? A[$1] "," $2 "," $3 : $2 "," $3
                next
        }
        A[$2] {
                n = split ( A[$2], R, "," )
                for ( i = 1; i <= n; i += 2 )
                {
                        if ( $3 >= R[i] && $3 <= R[i+1] )
                        {
                                if ( ! ( R[$0] ) )
                                {
                                        print $0
                                        R[$0] = $0
                                }
                        }
                }
        }
' OFS='\t' rangefile file
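
To run a filter like this over many files, the loop from the question could be wrapped around a small shell function. This is only a sketch: the function name filter_by_range, the sample*.txt file names and the scratch directory are made up for the demo, and the awk inside is a simplified linear scan rather than the exact code above. It treats both range ends as inclusive, which is an assumption.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): print the records of data file $2
# whose chromosome (col 2) and position (col 3) fall inside a range from file $1.
filter_by_range() {
    awk '
        NR == FNR { n++; chr[n] = $1; lo[n] = $2; hi[n] = $3; next }   # load ranges
        {
            for (i = 1; i <= n; i++)
                if ($2 == chr[i] && $3 + 0 >= lo[i] + 0 && $3 + 0 <= hi[i] + 0) {
                    print      # print each record once, even if several ranges match
                    break
                }
        }
    ' "$1" "$2"
}

# Demo in a scratch directory, using the sample data from the question.
workdir=$(mktemp -d)
printf 'chr1 46 84\nchr1 95 227\nchr2 261 326\n' > "$workdir/range_file"
printf '9927 chr1 83 T C\n9927 chr1 92 A C\n9927 chr1 97 A C\n9927 chr2 262 C G\n9927 chr2 292 C G\n9927 chr2 367 C G\n' > "$workdir/sample1.txt"

# One output file per input file; quoting keeps odd file names intact.
for file in "$workdir"/sample*.txt; do
    filter_by_range "$workdir/range_file" "$file" > "${file}_filtered"
done
cat "$workdir/sample1.txt_filtered"
```

Note that a bare "for file in *" would also pick up range_file and any *_filtered output left over from an earlier run, so a narrower glob or a separate output directory is probably safer.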

You could also try this awk code:

awk 'NR==FNR{A[++i,1]=$1;A[i,2]=$2;A[i,3]=$3;next}
{j=0;while(j++<i)if(($2==A[j,1])&&($3>=A[j,2])&&($3<=A[j,3]))print}
' filterfile datafile
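
Since the question says all files are sorted from low to high, the scan over the ranges could probably be avoided for large inputs. The sketch below keeps one pointer per chromosome that only moves forward. It assumes that, within each chromosome, the ranges are sorted by start and the records by position, and that range ends are inclusive. The file names and scratch directory are made up for the demo.

```shell
#!/bin/sh
# Demo data from the question, written to a scratch directory (hypothetical names).
workdir=$(mktemp -d)
printf 'chr1 46 84\nchr1 95 227\nchr2 261 326\n' > "$workdir/range_file"
printf '9927 chr1 83 T C\n9927 chr1 92 A C\n9927 chr1 97 A C\n9927 chr2 262 C G\n9927 chr2 292 C G\n9927 chr2 367 C G\n' > "$workdir/data.txt"

awk '
    # First file: store the ranges per chromosome, in file order.
    NR == FNR { k = ++n[$1]; lo[$1, k] = $2; hi[$1, k] = $3; next }
    {
        c = $2; pos = $3 + 0
        if (!(c in p)) p[c] = 1            # pointer to the current range of this chromosome
        # Ranges ending before pos cannot match this or any later (larger) position.
        while (p[c] <= n[c] + 0 && hi[c, p[c]] + 0 < pos) p[c]++
        # The current range ends at or after pos; it matches if it also starts at or before pos.
        if (p[c] <= n[c] + 0 && lo[c, p[c]] + 0 <= pos) print
    }
' "$workdir/range_file" "$workdir/data.txt" > "$workdir/filtered.txt"
cat "$workdir/filtered.txt"
```

Each range is skipped at most once per data file, so the work grows with the number of records plus the number of ranges instead of their product. If the sort assumption does not hold, the linear scans above are the safer choice.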