awk to capture lines that meet either condition

cmccabe · August 2, 2017, 8:16am

I am trying to understand and modify an awk script written by @Scrutinizer.

The awk below filters a tab-delimited file of 30,000 lines. What I am having trouble with is adding a condition for SVTYPE=CNV so that the line is printed only if the 0.95 value in the CI= field (the number after ,0.95:) is < 1.9.

The other condition works perfectly and I added comments as to what I think is happening in each step. Thank you :).

file

chr16	68771250	CDH1	G	<CNV>	100.0	PASS	HS;FR=.;PRECISE=FALSE;SVTYPE=CNV;END=68867430;LEN=96180;NUMTILES=39;SD=0.47;CDF_MAPD=0.01:0.962265,0.025:0.985543,0.05:1.006017,0.1:1.030158,0.2:1.060175,0.25:1.071807,0.5:1.12,0.75:1.17036,0.8:1.183201,0.9:1.217678,0.95:1.246897,0.975:1.272801,0.99:1.303591;REF_CN=2;CI=0.05:0.895574,0.95:1.16322;RAW_CN=1.12;FUNC=[{'gene':'CDH1'}]	GT:GQ:CN	./.:0:1.02
chr15	90631824	IDH2	G	<CNV>	100.0	PASS	FR=.;PRECISE=FALSE;SVTYPE=CNV;END=90631954;LEN=130;NUMTILES=1;SD;CDF_MAPD=0.01:0.647181,0.025:0.751369,0.05:0.85432,0.1:0.99068,0.2:1.185313,0.25:1.268903,0.5:1.67,0.75:2.197882,0.8:2.352881,0.9:2.815138,0.95:3.264469,0.975:3.711758,0.99:4.309304;REF_CN=2;CI=0.05:0.727022,0.95:3.40497;RAW_CN=1.67;FUNC=[{'gene':'IDH2'}]	GT:GQ:CN	./.:0:1.63

awk

awk -F'[\t;]' '              # define FS as tab and ;
   {
     split(x,V)
     for(i=1; i<=NF; i++) { # create loop i (which is each line) and iterate through
       split($i,F,/=/)      # each portion (in green) of the line with the pattern = read into array F splitting using FS
       V[F[1]]=F[2]         # set each split in array F equal to array V (defined below)
     }
   }
   (V["SVTYPE"]=="CNV"    && V["CI"]+0 < 1.9) ||     # define V for CNV   - not sure if the entire CI is being used or maybe splitting on the , would work better
   (V["SVTYPE"]=="Fusion" && V["READ_COUNT"]+0 > 10) # define V for Fusion
' file > out

desired output - only this line has a CI=0.95 value < 1.9

chr16	68771250	CDH1	G	<CNV>	100.0	PASS	HS;FR=.;PRECISE=FALSE;SVTYPE=CNV;END=68867430;LEN=96180;NUMTILES=39;SD=0.47;CDF_MAPD=0.01:0.962265,0.025:0.985543,0.05:1.006017,0.1:1.030158,0.2:1.060175,0.25:1.071807,0.5:1.12,0.75:1.17036,0.8:1.183201,0.9:1.217678,0.95:1.246897,0.975:1.272801,0.99:1.303591;REF_CN=2;CI=0.05:0.895574,0.95:1.16322;RAW_CN=1.12;FUNC=[{'gene':'CDH1'}]	GT:GQ:CN	./.:0:1.02

Corona688 · August 2, 2017, 12:14pm

By adding print V["CI"] to the code I discovered that it thinks CI is 0.05:0.895574,0.95:1.16322

Which makes sense, as the input is only split on equals.
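It also explains why the CNV test passed for every line: when awk converts a string to a number with +0, it reads only the leading numeric prefix and stops at the first ":". A minimal sketch of that behaviour (the variable names ci and num are just for illustration):

```shell
ci='0.05:0.895574,0.95:1.16322'
# awk turns a string into a number by reading its leading numeric prefix,
# so everything from the first ":" onward is ignored
num=$(awk -v ci="$ci" 'BEGIN { print ci + 0 }')
echo "$num"
```

So the comparison was effectively 0.05 < 1.9, which holds regardless of the 0.95 value.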

So I've added code to split the CI value specially:

awk -F'[\t;]' '              # define FS as tab and ;
   {
     split(x,V)
     for(i=1; i<=NF; i++) { # loop over every field of the current line
       split($i,F,/=/)      # split the field on = into F: key in F[1], value in F[2]
       V[F[1]]=F[2]         # store the value in V under its key
     }

     split(V["CI"], A, ":"); # A[1]=0.05, A[2]=0.895574,0.95, A[3]=1.16322
     V["CI"]=A[3]; # V["CI"]=1.16322, the value for 0.95
   }
   (V["SVTYPE"]=="CNV"    && V["CI"]+0 < 1.9) ||     # CNV: print if the 0.95 CI value is below 1.9
   (V["SVTYPE"]=="Fusion" && V["READ_COUNT"]+0 > 10) # define V for Fusion
' inputfile
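If the order of the entries inside CI is not guaranteed, looking the value up by its label may be safer than relying on its position after the split. A rough sketch, assuming CI is always a comma-separated list of label:value pairs; ci_value is a hypothetical helper name, not part of the script above:

```shell
# hypothetical helper: print the CI value stored under a given label
ci_value() {
  awk -v ci="$1" -v want="$2" 'BEGIN {
    n = split(ci, pair, ",")        # pair[1]="0.05:0.895574", pair[2]="0.95:1.16322"
    for (i = 1; i <= n; i++) {
      split(pair[i], kv, ":")       # kv[1] is the label, kv[2] is the value
      if (kv[1] == want) { print kv[2]; exit }
    }
  }'
}

ci_value '0.05:0.895574,0.95:1.16322' 0.95
ci_value '0.05:0.727022,0.95:3.40497' 0.95
```

With the two sample lines, the first call gives a value below 1.9 and the second a value above it, which matches the desired output. The same two-step split could be moved into the main script in place of the single split on ":".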

cmccabe · August 3, 2017, 7:48am

Thank you very much :).