This is a snippet of my data. I want to find out whether the range defined by columns 5 and 6 is a subset of a range defined by columns 2 and 3. Columns 2 and 3 contain more data than columns 5 and 6, so a script has to scan all of columns 2 and 3 for every range defined by columns 5 and 6. How do I do this? Is there an awk script for it? I am sorry if I did not follow the forum's rules; this is my first time using it.
You need to compare $5 and $6 against all the ranges in $2 and $3, so those ranges have to be in memory first. One way to do that is to read the input file twice: the first time to store $2 and $3 in memory, the second time to compare $5 and $6 against the stored ranges.
A simple first approach, assuming that $1 is always the same value, could look something like this:
awk '
  NR==FNR {                              # When reading the file for the first time
    R[$2 FS $3]                          # Store the range $2 to $3 as a key of array R, using the standard field separator
    next
  }
  {                                      # When reading the file for the second time
    for (i in R) {                       # For every line, for every stored range
      split(i, F)                        # Split the stored range into minimum and maximum using the standard field separator
      if (F[1] <= $5 && $6 <= F[2])      # F[1] holds the minimum, F[2] the maximum; true if both $5 and $6 lie inside
        print $0, "range " $5 "-" $6 " inside " F[1] "-" F[2]   # print the result
    }
  }
' infile infile                          # read the file twice
With your data this should produce:
Ca21chr2_C_albicans_SC5314 618550 627903 Ca21chr2_C_albicans_SC5314 7510 8043 range 7510-8043 inside 5286-50509
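If $1 is not always the same value, one possible variant is to store $1 together with each range and only compare when it equals $4. This is a minimal sketch under that assumption, not something tested against your real file; the rows and the names chrA and chrB are made up for illustration:

```shell
# Hypothetical sample data: name start end name start end
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
chrA 100 500 chrA 150 200
chrB 100 500 chrA 600 700
chrA 550 900 chrB 120 480
EOF

out=$(awk '
  NR==FNR {                    # first pass: store every $2-$3 range together with its $1 name
    R[$1 FS $2 FS $3]
    next
  }
  {                            # second pass: test $5-$6 against each stored range
    for (k in R) {
      split(k, F)              # F[1] = name, F[2] = minimum, F[3] = maximum
      if (F[1] == $4 && F[2] <= $5 && $6 <= F[3])
        print $4, $5 "-" $6, "inside", F[2] "-" F[3]
    }
  }
' "$tmp" "$tmp")
rm -f "$tmp"
printf '%s\n' "$out"
```

Each sample row falls inside exactly one stored range with the same name, so this prints one line per row. With real data a row can match several ranges, and the order in which for (k in R) visits them is not specified.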
Thank you so much for your solution. But I used the following code:
awk '
  NR==FNR {                              # When reading the file for the first time
    R[$2 FS $3]                          # Store the range $2 to $3 as a key of array R, using the standard field separator
    next
  }
  {                                      # When reading the file for the second time
    for (i in R) {                       # For every line, for every stored range
      split(i, F)                        # Split the stored range into minimum and maximum using the standard field separator
      if (F[1] <= $5 && $6 <= F[2])      # F[1] holds the minimum, F[2] the maximum; true if both $5 and $6 lie inside
        print $0, "range " $5 "-" $6 " inside " F[1] "-" F[2]   # print the result
    }
  }
' h1.txt h1.txt
But it returns the following, not what you wrote above:
You are right: that "feature" should not be taken for granted, although both my Linux and FreeBSD versions have it. Still, your caveat should be kept in mind.