awk to print out lines that do not fall within a range in a file

cmccabe · June 1, 2017, 12:15pm

In the awk below I am trying to print out those lines in file2 that are not between $2 and $3 in file1. Both files are
tab-delimited and I think it's close, but currently it is printing out the matches. The ---- markers are not part of the files; they just show which lines match or fall into
the range and don't need to be printed. Thank you :).

file1

chr1 948953 948956 chr1:948953-948956 . ISG15
chr1 949363 949858 chr1:949363-949858 . ISG15
chr19 42373737 42373856 chr19:42373737-42373856 . RPS19

file2

chr1 948796 949006 chr1:948796-949006 . ISG15                  ---- line1 of file1
chr1 949313 949969 chr1:949313-949969 . ISG15                  ---- line2 of file1
chr19 42363937 42364409 chr19:42363937-42364409 . RPS19
chr19 42364286 42364565 chr19:42364286-42364565 . RPS19
chr19 42364465 42364614 chr19:42364465-42364614 . RPS19
chr19 42364794 42364965 chr19:42364794-42364965 . RPS19
chr19 42365130 42365331 chr19:42365130-42365331 . RPS19
chr19 42373050 42373334 chr19:42373050-42373334 . RPS19
chr19 42373718 42373873 chr19:42373718-42373873 . RPS19    ---- line3 of file1
chr19 42375368 42375534 chr19:42375368-42375534 . RPS19

awk

awk '
    NR==FNR{for(i=$2;i<=$3;++i) d[$1,i] = $6; next}
    d[$1,$2]{print $0}' file1 file2

current output

chr1 948953 948956 chr1:948953-948956 . ISG15
chr1 949363 949858 chr1:949363-949858 . ISG15
chr19 42373737 42373856 chr19:42373737-42373856 . RPS19

desired output

chr19 42363937 42364409 chr19:42363937-42364409 . RPS19
chr19 42364286 42364565 chr19:42364286-42364565 . RPS19
chr19 42364465 42364614 chr19:42364465-42364614 . RPS19
chr19 42364794 42364965 chr19:42364794-42364965 . RPS19
chr19 42365130 42365331 chr19:42365130-42365331 . RPS19
chr19 42373050 42373334 chr19:42373050-42373334 . RPS19
chr19 42375368 42375534 chr19:42375368-42375534 . RPS19

Don_Cragun · June 1, 2017, 5:56pm

Your requirements aren't clear. Are you trying to:

print all lines where $1, $2 in file2 does appear in the range $1, [$2-$3] in file1 (which is what your code is currently doing),
print all lines where $1, $2 in file2 does NOT appear in the range $1, [$2-$3] in file1,
print all lines where no element in the range $1, [$2-$3] in file2 appears in the range $1, [$2-$3] in file1, or
print all lines where at least one element in the range $1, [$2-$3] in file2 does not appear in the range $1, [$2-$3] in file1?

cmccabe · June 1, 2017, 6:03pm

The above is what I am trying to do, as each element is treated as a pair. That is, each $2 is combined with a $3. Basically, the opposite of my code. I can seem to print the lines in the range, but not the lines not in the range. Thank you :).

Don_Cragun · June 1, 2017, 6:26pm

You have confused the matter more. You are not looking at $3 in file2 so it can't possibly affect the output produced by your script. If you just want to reverse the output produced by your script change it to:

awk '
    NR==FNR{for(i=$2;i<=$3;++i) d[$1,i] = $6; next}
    !d[$1,$2]{print $0}' file1 file2

or, using the default action when a condition is met:

awk '
    NR==FNR{for(i=$2;i<=$3;++i) d[$1,i] = $6; next}
    !d[$1,$2]' file1 file2

or to take less space:

awk '
    NR==FNR{for(i=$2;i<=$3;++i) d[$1,i]; next}
    !(($1,$2) in d)' file1 file2
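
Note that all three variants above test only $2 from file2 against the positions covered by file1. With the sample data as posted, no $2 in file2 falls inside a file1 range, so they would print every line of file2. If what is actually wanted is the third interpretation listed earlier (skip any file2 line whose [$2-$3] range overlaps a file1 range with the same $1), a minimal sketch could look like the following. It assumes tab-separated input and numeric $2 <= $3 on every line. The array names n, lo and hi and the scratch directory are purely illustrative, and the sample files are recreated in abridged form only so the snippet runs on its own:

```shell
# Recreate abridged, tab-separated copies of the sample files in a scratch directory.
tmp=$(mktemp -d)
printf 'chr1\t948953\t948956\tchr1:948953-948956\t.\tISG15\n' >  "$tmp/file1"
printf 'chr19\t42373737\t42373856\tchr19:42373737-42373856\t.\tRPS19\n' >> "$tmp/file1"
printf 'chr1\t948796\t949006\tchr1:948796-949006\t.\tISG15\n' >  "$tmp/file2"
printf 'chr19\t42373050\t42373334\tchr19:42373050-42373334\t.\tRPS19\n' >> "$tmp/file2"
printf 'chr19\t42373718\t42373873\tchr19:42373718-42373873\t.\tRPS19\n' >> "$tmp/file2"
printf 'chr19\t42375368\t42375534\tchr19:42375368-42375534\t.\tRPS19\n' >> "$tmp/file2"

result=$(awk -F'\t' '
    # file1: remember each interval, grouped by the name in $1
    NR == FNR { k = ++n[$1]; lo[$1, k] = $2 + 0; hi[$1, k] = $3 + 0; next }
    # file2: skip the line as soon as it overlaps a stored interval
    {
        for (i = 1; i <= n[$1]; i++)
            if ($2 + 0 <= hi[$1, i] && $3 + 0 >= lo[$1, i]) next
        print
    }' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$result"
rm -rf "$tmp"
```

On this abridged input only the two chr19 lines starting at 42373050 and 42375368 are printed. Storing one pair of endpoints per file1 line also avoids creating an array element for every position in every range, at the cost of a loop over the intervals sharing the same $1.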

cmccabe · June 1, 2017, 6:40pm

Sorry for the typo. How does the shorter awk that takes less space work? Thank you :).

Don_Cragun · June 1, 2017, 6:50pm

When reading the 1st input file, it creates empty array elements instead of assigning values to them (so you don't need space to store the strings you were assigning to those elements). When reading the 2nd input file, it checks whether an element with the given index has been created, instead of checking whether the array element with that index has been assigned a non-empty string or non-zero value.
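
The difference can be seen in a small standalone test. This is only an illustration and not part of the script above; the keys a, b and c and the stored value are arbitrary:

```shell
demo=$(awk 'BEGIN {
    d["a"]              # bare reference: creates the element with an empty value
    d["b"] = "ISG15"    # element that also stores a string
    # "in" only asks whether the index exists; it never creates an element
    print "a in d:", (("a" in d) ? "yes" : "no")
    print "c in d:", (("c" in d) ? "yes" : "no")
    # testing the value treats an empty element as false ...
    print "d[a] is true:", (d["a"] ? "yes" : "no")
    print "d[b] is true:", (d["b"] ? "yes" : "no")
    # ... and merely referencing d["c"] creates it as a side effect
    if (!d["c"]) print "c in d now:", (("c" in d) ? "yes" : "no")
}')
printf '%s\n' "$demo"
```

This prints yes, no, no, yes, yes for the five checks. The last line shows a side effect of the value test: looking up d[$1,$2] for a missing index adds that index to the array, whereas the `in` form leaves the array unchanged.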

cmccabe · June 2, 2017, 8:27am

Thank you very much :).