Cannot subset ranges from another range set

cryptodice · January 3, 2020, 4:22am

Ca21chr2_C_albicans_SC5314	2159343	2228327	Ca21chr2_C_albicans_SC5314	636587	638608
Ca21chr2_C_albicans_SC5314	5286	50509	Ca21chr2_C_albicans_SC5314	634021	636276
Ca21chr2_C_albicans_SC5314	1886545	1900975	Ca21chr2_C_albicans_SC5314	610758	613544
Ca21chr2_C_albicans_SC5314	1919115	1930649	Ca21chr2_C_albicans_SC5314	606248	608308
Ca21chr2_C_albicans_SC5314	590278	603163	Ca21chr2_C_albicans_SC5314	1554724	1556511
Ca21chr2_C_albicans_SC5314	267403	279993	Ca21chr2_C_albicans_SC5314	1547799	1548998
Ca21chr2_C_albicans_SC5314	1611869	1622753	Ca21chr2_C_albicans_SC5314	1519257	1520960
Ca21chr2_C_albicans_SC5314	1479229	1490747	Ca21chr2_C_albicans_SC5314	1514712	1516178
Ca21chr2_C_albicans_SC5314	157814	166956	Ca21chr2_C_albicans_SC5314	897896	900774
Ca21chr2_C_albicans_SC5314	2148223	2149627	Ca21chr2_C_albicans_SC5314	890821	892818
Ca21chr2_C_albicans_SC5314	1041578	1051493	Ca21chr2_C_albicans_SC5314	588237	589598
Ca21chr2_C_albicans_SC5314	736894	745664	Ca21chr2_C_albicans_SC5314	557079	558713
Ca21chr2_C_albicans_SC5314	618550	627903	Ca21chr2_C_albicans_SC5314	7510	8043
Ca21chr2_C_albicans_SC5314	1116919	1125425	Ca21chr2_C_albicans_SC5314	922654	924717
Ca21chr2_C_albicans_SC5314	1262940	1271939	Ca21chr2_C_albicans_SC5314	1778986	1779687
Ca21chr2_C_albicans_SC5314	288630	296284	Ca21chr2_C_albicans_SC5314	795730	798201
Ca21chr2_C_albicans_SC5314	1250513	1258731	Ca21chr2_C_albicans_SC5314	766651	768309
Ca21chr2_C_albicans_SC5314	1499806	1508334	Ca21chr2_C_albicans_SC5314	763501	765159
Ca21chr2_C_albicans_SC5314	98269	105803	Ca21chr2_C_albicans_SC5314	758203	758733
Ca21chr2_C_albicans_SC5314	1604362	1611315	Ca21chr2_C_albicans_SC5314	700893	702539

This is a snippet of my data. I want to find out whether the range defined by columns 5 and 6 is a subset of any range defined by columns 2 and 3. The ranges in columns 2 and 3 are longer than the ranges in columns 5 and 6. For every range in columns 5 and 6, the script has to scan all of the ranges in columns 2 and 3. How do I do this? Is there an awk script for it? I am sorry if I did not follow the forum's rules; this is my first time using it.

Scrutinizer · January 3, 2020, 5:17am

You need to compare $5 and $6 with all the ranges in $2, $3, so those ranges have to be in memory first. One way to do that is to read the input file twice: the first time to store $2, $3 in memory, the second time to compare $5 and $6 with the stored ranges.

A simple first approach, assuming that $1 is always the same value, could look something like this:

awk '
  NR==FNR {                                                    # When reading the file for the first time
    R[$2 FS $3]                                                # Store the ranges $2 to $3 in array R, using the standard field separator
    next
  }
  {                                                            # When reading the file for the second time
    for(i in R) {                                              # For every line, for every range
      split(i,F)                                               # Split the stored range in minimum and maximum using the standard field separator
      if(F[1]<=$5 && $6<=F[2])                                 # F[1] holds the minimum, F[2] the maximum; true if $5-$6 lies inside that range (assumes $5<=$6)
        print $0, "range " $5 "-" $6 " inside " F[1] "-" F[2]  # print the result
    }
  }
' infile infile                                                # read the file twice

With your data this should produce:

Ca21chr2_C_albicans_SC5314	618550	627903	Ca21chr2_C_albicans_SC5314	7510	8043 range 7510-8043 inside 5286-50509
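
If $1 is not always the same value, the stored ranges could also carry their sequence name, so that $5-$6 is only tested against ranges whose name matches $4. The following is a minimal sketch of that idea, not part of the solution above: the names chrA and chrB and all coordinates are made up, and it assumes tab-separated input with no tabs inside the names.

```shell
# Hypothetical sample with two sequence names (chrA and chrB are made-up labels).
tmp=$(mktemp)
printf 'chrA\t100\t200\tchrA\t500\t520\n' >  "$tmp"
printf 'chrA\t500\t600\tchrB\t120\t150\n' >> "$tmp"
printf 'chrB\t100\t200\tchrA\t510\t590\n' >> "$tmp"

result=$(awk -F'\t' '
  NR==FNR {                       # first pass: store every $2-$3 range together with its name in $1
    R[$1 FS $2 FS $3]
    next
  }
  {                               # second pass: test $5-$6 only against ranges named like $4
    for (k in R) {
      split(k, F, FS)             # F[1]=name, F[2]=minimum, F[3]=maximum
      if (F[1] == $4 && F[2]+0 <= $5+0 && $6+0 <= F[3]+0)
        print $4 ": " $5 "-" $6 " inside " F[2] "-" F[3]
    }
  }
' "$tmp" "$tmp")
rm -f "$tmp"
printf '%s\n' "$result"
```

In this sample the second line (chrB 120-150) is reported only against the chrB range 100-200, even though a chrA range with the same coordinates exists.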

nezabudka · January 3, 2020, 5:39am

Hi
I'm a little confused. Is that necessary?
cat file

Ca21chr2_C_albicans_SC5314	3	10	Ca21chr2_C_albicans_SC5314	4	7
Ca21chr2_C_albicans_SC5314	3	10	Ca21chr2_C_albicans_SC5314	1	7
Ca21chr2_C_albicans_SC5314	3	10	Ca21chr2_C_albicans_SC5314	4	13
Ca21chr2_C_albicans_SC5314	3	10	Ca21chr2_C_albicans_SC5314	0	17
Ca21chr2_C_albicans_SC5314	3	10	Ca21chr2_C_albicans_SC5314	0	3
Ca21chr2_C_albicans_SC5314	3	10	Ca21chr2_C_albicans_SC5314	10	17
Ca21chr2_C_albicans_SC5314	3	10	Ca21chr2_C_albicans_SC5314	0	2
Ca21chr2_C_albicans_SC5314	3	10	Ca21chr2_C_albicans_SC5314	11	17

awk '
($5 >= $2 && $5 <= $3) ||
($6 <= $3 && $6 >= $2) ||
($5 < $2 && $6 > $3)    {print $0 RS ($2>$5?$2:$5) FS ($3>$6?$6:$3)}
' file

Ca21chr2_C_albicans_SC5314	3	10	Ca21chr2_C_albicans_SC5314	4	7
4 7
Ca21chr2_C_albicans_SC5314	3	10	Ca21chr2_C_albicans_SC5314	1	7
3 7
Ca21chr2_C_albicans_SC5314	3	10	Ca21chr2_C_albicans_SC5314	4	13
4 10
Ca21chr2_C_albicans_SC5314	3	10	Ca21chr2_C_albicans_SC5314	0	17
3 10
Ca21chr2_C_albicans_SC5314	3	10	Ca21chr2_C_albicans_SC5314	0	3
3 3
Ca21chr2_C_albicans_SC5314	3	10	Ca21chr2_C_albicans_SC5314	10	17
10 10

cryptodice · January 3, 2020, 7:38am

Thank you so much for your solution. But I used the following code:

awk '
  NR==FNR {                                                    # When reading the file for the first time
    R[$2 FS $3]                                                # Store the ranges $2 to $3 in array R, using the standard field separator
    next
  }
  {                                                            # When reading the file for the second time
    for(i in R) {                                              # For every line, for every range
      split(i,F)                                               # Split the stored range in minimum and maximum using the standard field separator
      if(F[1]<=$5 && $5<=F[2])                                 # F[1] will contain the minimum, F[2] the maximum, so if $5 , $6 are inside it.
        print $0, "range " $5 "-" $6 " inside " F[1] "-" F[2]  # print the result
    }
  }
' h1.txt h1.txt

But it is returning me the following, NOT what you wrote above:

inside 5286-50509s_SC5314      618550  627903  Ca21chr2_C_albicans_SC5314      7510    8043

Scrutinizer · January 3, 2020, 3:16pm

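Try converting the input file to UNIX format first.
It appears to be in Windows format: every line ends with a carriage return, which sends the cursor back to the start of the line, so the appended text overwrites the beginning of the output.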

tr -d '\r' <file >newfile

Or you can maybe use

dos2unix file

if available on your OS.
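
To check whether a file really has Windows line endings before and after converting it, the lines ending in a carriage return can be counted. This is a small sketch under assumptions: the one-line sample file is made up, and the temporary file names are only for illustration.

```shell
tmp=$(mktemp)
printf 'chr\t3\t10\tchr\t4\t7\r\n' > "$tmp"        # made-up row with a Windows line ending

# Count lines that end in a carriage return (0 means the file is already in UNIX format).
before=$(awk '/\r$/ { n++ } END { print n+0 }' "$tmp")

# Strip the carriage returns with tr, as suggested above, and count again.
tr -d '\r' < "$tmp" > "$tmp.unix"
after=$(awk '/\r$/ { n++ } END { print n+0 }' "$tmp.unix")

echo "lines ending in CR: before=$before after=$after"
rm -f "$tmp" "$tmp.unix"
```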

MadeInGermany · January 3, 2020, 4:12pm

In awk you can strip the \r as follows

awk '
{ sub(/\r$/, "") }
...
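
As an illustration of how that sub() call might sit in front of a two-pass range check, here is a hedged sketch. It is not the elided script above: the two sample rows are taken from the data in the first post with the first and fourth columns shortened to "chr", and Windows line endings are added on purpose.

```shell
tmp=$(mktemp)
printf 'chr\t5286\t50509\tchr\t634021\t636276\r\n' >  "$tmp"
printf 'chr\t618550\t627903\tchr\t7510\t8043\r\n'  >> "$tmp"

result=$(awk '
  { sub(/\r$/, "") }              # drop a trailing carriage return; awk then re-splits the fields
  NR==FNR { R[$2 FS $3]; next }   # first pass: store the $2-$3 ranges
  {
    for (i in R) {                # second pass: test $5-$6 against every stored range
      split(i, F)
      if (F[1] <= $5 && $6 <= F[2])
        print $0, "range " $5 "-" $6 " inside " F[1] "-" F[2]
    }
  }
' "$tmp" "$tmp")
rm -f "$tmp"
printf '%s\n' "$result"
```

Because the rule without a pattern comes first, the carriage return is removed on both passes before any field is used.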

RudiC · January 3, 2020, 4:33pm

Similar approach to Scrutinizer's, but opens/reads file but once and keeps data in memory:

awk '
        {LN[NR]   = $0
         MIN2[NR] = $2
         MAX3[NR] = $3
         MIN5[NR] = $5
         MAX6[NR] = $6
        }
END     {for (i=1; i<=NR; i++)
           for (j=1; j<=NR; j++)
             if ((MIN5[i] >= MIN2[j]) && (MAX6[i] <= MAX3[j]))
               print LN[i], "range", MIN5[i], "-", MAX6[i], "within", MIN2[j], "-", MAX3[j], "boundaries."
        }
' file
Ca21chr2_C_albicans_SC5314    618550    627903    Ca21chr2_C_albicans_SC5314    7510    8043 range 7510 - 8043 within 5286 - 50509 boundaries.

vgersh99 · January 3, 2020, 4:46pm

Be careful: some awks don't have NR available in the END block. To be safe, copy NR to a variable such as nr in your "main" block and use that variable in the END block.
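
A minimal sketch of that pattern, on made-up four-column input (outer range in $1-$2, inner range in $3-$4) rather than the real file; the array names lo, hi, a and b are only for illustration:

```shell
result=$(printf '3 10 4 7\n20 30 1 7\n' | awk '
  { lo[NR] = $1 + 0; hi[NR] = $2 + 0    # outer range of this line
    a[NR]  = $3 + 0; b[NR]  = $4 + 0    # inner range of this line
    nr = NR }                           # nr is our own copy of the record count
  END {
    for (i = 1; i <= nr; i++)           # the loop bounds use nr, not NR
      for (j = 1; j <= nr; j++)
        if (a[i] >= lo[j] && b[i] <= hi[j])
          print "line " i ": " a[i] "-" b[i] " within " lo[j] "-" hi[j]
  }
')
printf '%s\n' "$result"
```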

RudiC · January 3, 2020, 4:59pm

You are right, that "feature" should not be taken for granted, but both my Linux and FreeBSD versions have it. Still, your caveat should be kept in mind.

cryptodice · January 3, 2020, 9:43pm

Thank you so much. It worked, from converting the file format to RudiC's new approach.