awk to skip lines, find text, and add text based on a number

I am trying to use awk to pass over every line that starts with ## or #, and then check each line after those for STB=. If that value is greater than or equal to 0.8, the text "STRAND BIAS" should be written at the end of the line, otherwise "GOOD".

The attached file has 4 entries.

awk tried:

 awk NR > "##"' "#" -F"STB=" '{print $NF}' file 

desired output:

##
##
##
....
....
....
#CHROM    POS    ID    REF    ALT    QUAL    FILTER
..... GOOD
..... GOOD
..... GOOD
..... STRAND BIAS 

file:

##fileformat=VCFv4.1
##FILTER=<ID=NOCALL,Description="Generic filter. Filtering details stored in FR info tag.">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele frequency based on Flow Evaluator observation counts">
##FORMAT=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observation count">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=FAO,Number=A,Type=Integer,Description="Flow Evaluator Alternate allele observation count">
##FORMAT=<ID=FDP,Number=1,Type=Integer,Description="Flow Evaluator Read Depth">
##FORMAT=<ID=FRO,Number=1,Type=Integer,Description="Flow Evaluator Reference allele observation count">
##FORMAT=<ID=FSAF,Number=A,Type=Integer,Description="Flow Evaluator Alternate allele observations on the forward strand">
##FORMAT=<ID=FSAR,Number=A,Type=Integer,Description="Flow Evaluator Alternate allele observations on the reverse strand">
##FORMAT=<ID=FSRF,Number=1,Type=Integer,Description="Flow Evaluator reference observations on the forward strand">
##FORMAT=<ID=FSRR,Number=1,Type=Integer,Description="Flow Evaluator reference observations on the reverse strand">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality, the Phred-scaled marginal (or unconditional) probability of the called genotype">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count">
##FORMAT=<ID=SAF,Number=A,Type=Integer,Description="Alternate allele observations on the forward strand">
##FORMAT=<ID=SAR,Number=A,Type=Integer,Description="Alternate allele observations on the reverse strand">
##FORMAT=<ID=SRF,Number=1,Type=Integer,Description="Number of reference observations on the forward strand">
##FORMAT=<ID=SRR,Number=1,Type=Integer,Description="Number of reference observations on the reverse strand">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency based on Flow Evaluator observation counts">
##INFO=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observations">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth at the locus">
##INFO=<ID=FAO,Number=A,Type=Integer,Description="Flow Evaluator Alternate allele observations">
##INFO=<ID=FDP,Number=1,Type=Integer,Description="Flow Evaluator read depth at the locus">
##INFO=<ID=FR,Number=.,Type=String,Description="Reason why the variant was filtered.">
##INFO=<ID=FRO,Number=1,Type=Integer,Description="Flow Evaluator Reference allele observations">
##INFO=<ID=FSAF,Number=A,Type=Integer,Description="Flow Evaluator Alternate allele observations on the forward strand">
##INFO=<ID=FSAR,Number=A,Type=Integer,Description="Flow Evaluator Alternate allele observations on the reverse strand">
##INFO=<ID=FSRF,Number=1,Type=Integer,Description="Flow Evaluator Reference observations on the forward strand">
##INFO=<ID=FSRR,Number=1,Type=Integer,Description="Flow Evaluator Reference observations on the reverse strand">
##INFO=<ID=FWDB,Number=A,Type=Float,Description="Forward strand bias in prediction.">
##INFO=<ID=FXX,Number=1,Type=Float,Description="Flow Evaluator failed read ratio">
##INFO=<ID=HRUN,Number=A,Type=Integer,Description="Run length: the number of consecutive repeats of the alternate allele in the reference genome">
##INFO=<ID=HS,Number=0,Type=Flag,Description="Indicate it is at a hot spot">
##INFO=<ID=LEN,Number=A,Type=Integer,Description="allele length">
##INFO=<ID=MLLD,Number=A,Type=Float,Description="Mean log-likelihood delta per read.">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
##INFO=<ID=PB,Number=A,Type=Float,Description="Bias of relative variant position in reference reads versus variant reads. Equals Mann-Whitney U rho statistic P(Y>X)+0.5P(Y=X)">
##INFO=<ID=PBP,Number=A,Type=Float,Description="Pval of relative variant position in reference reads versus variant reads.  Related to GATK ReadPosRankSumTest">
##INFO=<ID=QD,Number=1,Type=Float,Description="QualityByDepth as 4*QUAL/FDP (analogous to GATK)">
##INFO=<ID=RBI,Number=A,Type=Float,Description="Distance of bias parameters from zero.">
##INFO=<ID=REFB,Number=A,Type=Float,Description="Reference Hypothesis bias in prediction.">
##INFO=<ID=REVB,Number=A,Type=Float,Description="Reverse strand bias in prediction.">
##INFO=<ID=RO,Number=1,Type=Integer,Description="Reference allele observations">
##INFO=<ID=SAF,Number=A,Type=Integer,Description="Alternate allele observations on the forward strand">
##INFO=<ID=SAR,Number=A,Type=Integer,Description="Alternate allele observations on the reverse strand">
##INFO=<ID=SRF,Number=1,Type=Integer,Description="Number of reference observations on the forward strand">
##INFO=<ID=SRR,Number=1,Type=Integer,Description="Number of reference observations on the reverse strand">
##INFO=<ID=SSEN,Number=A,Type=Float,Description="Strand-specific-error prediction on negative strand.">
##INFO=<ID=SSEP,Number=A,Type=Float,Description="Strand-specific-error prediction on positive strand.">
##INFO=<ID=SSSB,Number=A,Type=Float,Description="Strand-specific strand bias for allele.">
##INFO=<ID=STB,Number=A,Type=Float,Description="Strand bias in variant relative to reference.">
##INFO=<ID=STBP,Number=A,Type=Float,Description="Pval of Strand bias in variant relative to reference.">
##INFO=<ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
##INFO=<ID=VARB,Number=A,Type=Float,Description="Variant Hypothesis bias in prediction.">
##LeftAlignVariants="analysis_type=LeftAlignVariants bypassFlowAlign=true kmer_len=19 min_var_count=5 short_suffix_match=5 min_indel_size=4 max_hp_length=8 min_var_freq=0.15 min_var_score=10.0 relative_strand_bias=0.8 output_mnv=0 sse_hp_size=0 sse_report_file= target_size=1.0 pref_kmer_max=3 pref_kmer_min=0 pref_delta_max=2 pref_delta_min=0 suff_kmer_max=3 suff_kmer_min=0 suff_delta_max=2 suff_delta_min=0 motif_min_ppv=0.2 generate_flow_position=0 analyze_missmatches=0 sse_rate=0.07 input_file=[] read_buffer_size=null phone_home=STANDARD gatk_key=null read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL reference_sequence=/results/referenceLibrary/tmap-f3/hg19/hg19.fasta rodBind=[] nonDeterministicRandomSeed=false downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=1000 baq=OFF baqGapOpenPenalty=40.0 performanceLog=null useOriginalQualities=false BQSR=null defaultBaseQualities=-1 validation_strictness=SILENT unsafe=null num_threads=1 combined_sample_name= num_cpu_threads=null num_io_threads=null num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false logging_level=INFO log_to_file=null help=false variant=(RodBinding name=variant source=/results/analysis/output/Home/Auto_user_Proton-32-Lurie_Inh_Disease_151029_79_081/plugin_out/variantCaller_out.125/IonXpress_005/small_variants.sorted.vcf) out=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub NO_HEADER=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub sites_only=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub filter_mismatching_base_and_quals=false"
##basecallerVersion="4.6-11/0c0ef91"
##contig=<ID=chr1,length=249250621,assembly=hg19>
##contig=<ID=chr10,length=135534747,assembly=hg19>
##contig=<ID=chr11,length=135006516,assembly=hg19>
##contig=<ID=chr12,length=133851895,assembly=hg19>
##contig=<ID=chr13,length=115169878,assembly=hg19>
##contig=<ID=chr14,length=107349540,assembly=hg19>
##contig=<ID=chr15,length=102531392,assembly=hg19>
##contig=<ID=chr16,length=90354753,assembly=hg19>
##contig=<ID=chr17,length=81195210,assembly=hg19>
##contig=<ID=chr18,length=78077248,assembly=hg19>
##contig=<ID=chr19,length=59128983,assembly=hg19>
##contig=<ID=chr2,length=243199373,assembly=hg19>
##contig=<ID=chr20,length=63025520,assembly=hg19>
##contig=<ID=chr21,length=48129895,assembly=hg19>
##contig=<ID=chr22,length=51304566,assembly=hg19>
##contig=<ID=chr3,length=198022430,assembly=hg19>
##contig=<ID=chr4,length=191154276,assembly=hg19>
##contig=<ID=chr5,length=180915260,assembly=hg19>
##contig=<ID=chr6,length=171115067,assembly=hg19>
##contig=<ID=chr7,length=159138663,assembly=hg19>
##contig=<ID=chr8,length=146364022,assembly=hg19>
##contig=<ID=chr9,length=141213431,assembly=hg19>
##contig=<ID=chrM,length=16569,assembly=hg19>
##contig=<ID=chrX,length=155270560,assembly=hg19>
##contig=<ID=chrY,length=59373566,assembly=hg19>
##fileDate=20151029
##fileUTCtime=2015-10-29T22:48:03
##parametersDetails="germline_low_stringency_proton, TS version: 4.6"
##parametersName="Generic - Proton - Germ Line - Low Stringency"
##phasing=none
##reference=/results/referenceLibrary/tmap-f3/hg19/hg19.fasta
##reference=file:///results/referenceLibrary/tmap-f3/hg19/hg19.fasta
##source="tvc 4.6-11 (0c0ef91) - Torrent Variant Caller"
##tmapVersion="4.6.11 (0c0ef91) (201506161725)"
##INFO=<ID=OID,Number=.,Type=String,Description="List of original Hotspot IDs">
##INFO=<ID=OPOS,Number=.,Type=Integer,Description="List of original allele positions">
##INFO=<ID=OREF,Number=.,Type=String,Description="List of original reference bases">
##INFO=<ID=OALT,Number=.,Type=String,Description="List of original variant bases">
##INFO=<ID=OMAPALT,Number=.,Type=String,Description="Maps OID,OPOS,OREF,OALT entries to specific ALT alleles">
##deamination_metric=0.23163526491
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    NA12878chr1    977330    .    T    C    519.68    PASS    F=1;AO=55;DP=55;FAO=55;FDP=55;FR=.;FRO=0;FSAF=37;FSAR=18;FSRF=0;FSRR=0;FWDB=-0.0448496;FXX=0;HRUN=1;LEN=1;MLLD=88.0543;PB=0.5;PBP=1;QD=37.7947;RBI=0.0449422;REFB=0;REVB=0.0028832;RO=0;SAF=37;SAR=18;SRF=0;SRR=0;SSEN=0;SSEP=0;SSSB=3.85968e-08;STB=0.5;STBP=1;TYPE=snp;VARB=-0.00027863;OID=.;OPOS=977330;OREF=T;OALT=C;OMAPALT=C    GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR    1/1:25:55:55:0:0:55:55:1:18:37:0:0:18:37:0:0
chr1    981931    .    A    G    1169.7    PASS    AF=0.984375;AO=125;DP=131;FAO=126;FDP=128;FR=.;FRO=2;FSAF=67;FSAR=59;FSRF=2;FSRR=0;FWDB=-0.000335669;FXX=0.022899;HRUN=1;LEN=1;MLLD=58.6432;PB=0.5;PBP=1;QD=36.5532;RBI=0.0593811;REFB=-0.0178713;REVB=-0.0593801;RO=2;SAF=66;SAR=59;SRF=2;SRR=0;SSEN=0;SSEP=0;SSSB=-0.0141789;STB=0.507352;STBP=0.247;TYPE=snp;VARB=-0.000577363;OID=.;OPOS=981931;OREF=A;OALT=G;OMAPALT=G    GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR    1/1:41:131:128:2:2:125:126:0.984375:59:66:2:0:59:67:2:0
chr1    982994    .    T    C    3016.4    PASS    AF=1;AO=317;DP=318;FAO=317;FDP=317;FR=.;FRO=0;FSAF=114;FSAR=203;FSRF=0;FSRR=0;FWDB=-0.0880862;FXX=0.00314456;HRUN=4;LEN=1;MLLD=48.0245;PB=0.5;PBP=1;QD=38.0619;RBI=0.13654;REFB=0.0494458;REVB=-0.104326;RO=1;SAF=114;SAR=203;SRF=0;SRR=1;SSEN=0;SSEP=0;SSSB=0.00234416;STB=0.5;STBP=1;TYPE=snp;VARB=-0.000568883;OID=.;OPOS=982994;OREF=T;OALT=C;OMAPALT=C    GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR    1/1:99:318:317:1:0:317:317:1:203:114:0:1:203:114:0:0
chr1    981931    .    A    C    1169.7    PASS    AF=0.984375;AO=125;DP=131;FAO=126;FDP=128;FR=.;FRO=2;FSAF=67;FSAR=59;FSRF=2;FSRR=0;FWDB=-0.000335669;FXX=0.022899;HRUN=1;LEN=1;MLLD=58.6432;PB=0.5;PBP=1;QD=36.5532;RBI=0.0593811;REFB=-0.0178713;REVB=-0.0593801;RO=2;SAF=66;SAR=59;SRF=2;SRR=0;SSEN=0;SSEP=0;SSSB=-0.0141789;STB=0.507352;STBP=0.247;TYPE=snp;VARB=-0.000577363;OID=.;OPOS=981931;OREF=A;OALT=G;OMAPALT=G    GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR    1/1:41:131:128:2:2:125:126:0.984375:59:66:2:0:59:67:2:0
chr1    982994    .    -    C    3016.4    PASS    AF=1;AO=317;DP=21;FAO=317;FDP=20;FR=.;FRO=0;FSAF=114;FSAR=203;FSRF=0;FSRR=0;FWDB=-0.0880862;FXX=0.00314456;HRUN=4;LEN=1;MLLD=48.0245;PB=0.5;PBP=1;QD=38.0619;RBI=0.13654;REFB=0.0494458;REVB=-0.104326;RO=1;SAF=114;SAR=203;SRF=0;SRR=1;SSEN=0;SSEP=0;SSSB=0.00234416;STB=0.9;STBP=1;TYPE=snp;VARB=-0.000568883;OID=.;OPOS=982994;OREF=T;OALT=C;OMAPALT=C    GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR    1/1:99:318:317:1:0:317:317:1:203:114:0:1:203:114:0:0 

Perhaps Perl?

perl -ple '/^[^#].*STB=(\d+\.\d+);/ and $_.=$1 >= 0.8?" STRAND BIAS":" GOOD"'

Standard awk does not have the nice ( ) capture groups that perl has, so match() together with RSTART, RLENGTH and substr() has to extract the value:

awk -v M=";STB=" '/^[^#]/ && match($0,M"[^;]*") {LM=length(M); print $0, (substr($0,RSTART+LM,RLENGTH-LM)+0>=0.8 ? "STRAND BIAS" : "GOOD")}' awk_test.txt

---------- Post updated at 02:51 PM ---------- Previous update was at 02:38 PM ----------

Using ";STB=" as the field separator (-F) simplifies it a bit:

awk -F ";STB=" '/^[^#]/ && match($2,"[^;]*") {print $0, (substr($2,RSTART,RLENGTH)+0>=0.8 ? "STRAND BIAS" : "GOOD")}' awk_test.txt

---------- Post updated at 03:01 PM ---------- Previous update was at 02:51 PM ----------

In case you want to keep the comments:

awk -F ";STB=" '/^#/ {print; next} match($2,"[^;]*") {print $0, (substr($2,RSTART,RLENGTH)+0>=0.8 ? "STRAND BIAS" : "GOOD")}' awk_test.txt
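One detail worth knowing when a substr() result is compared with a number: substr() returns a string, and in awk implementations that follow the POSIX comparison rules a string compared with a numeric constant is compared as text. For values such as 0.5 or 0.9 the text order happens to agree with the numeric order, but a value written in exponent form (the file already contains SSSB=3.85968e-08) would not. Adding 0 forces a numeric comparison. A minimal sketch, assuming a made-up one-record input whose STB value is in exponent form:

```shell
# Hypothetical one-record input; the STB value is deliberately in exponent form.
result=$(printf 'chr1\t100\tAF=1;FDP=40;STB=1e-05;STBP=1\n' |
  awk -F ';STB=' 'match($2, "[^;]*") {
      stb = substr($2, RSTART, RLENGTH) + 0   # +0 converts the string to a number
      print (stb >= 0.8 ? "STRAND BIAS" : "GOOD")
  }')
echo "$result"
```

Because of the + 0, the tiny value 1e-05 is treated as a number well below 0.8 and the record is labelled GOOD.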

In addition to capturing the STB= value, how can I also capture the FDP= value, so that it is printed, followed by the word "reads", next to the text "STRAND BIAS" or "GOOD"? Thank you :).

perl -ple '/^[^#].*FDP=(\d+);*STB=(\d+\.\d+);/ and $1_= <30 $_.=$2 >= 0.8?" STRAND BIAS":" GOOD""$1 "reads""' 

desired output:

##
##
##
....
....
....
#CHROM    POS    ID    REF    ALT    QUAL    FILTER
..... GOOD  128 reads
..... GOOD  317 reads
..... GOOD  128 reads
..... STRAND BIAS  20 reads

Please try:

perl -ple '/^[^#].*FDP=(\d+);.*STB=(\d+\.\d+);/ and $_.=($2 >= 0.8?" STRAND BIAS ":" GOOD ").$1." reads"'
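For anyone who prefers to stay in awk, the same idea can be sketched there too. This is only an illustration under some assumptions: info_value is a hypothetical helper written for this example (not an awk built-in), the sample records are shortened stand-ins for awk_test.txt, and a line without an STB tag is labelled GOOD because an empty string plus 0 is 0.

```shell
# Made-up, shortened records standing in for awk_test.txt.
sample='##fileformat=VCFv4.1
#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 981931 . A G 1169.7 PASS AF=0.984375;FDP=128;STB=0.507352;STBP=0.247
chr1 982994 . - C 3016.4 PASS AF=1;FDP=20;STB=0.9;STBP=1'

annotated=$(printf '%s\n' "$sample" | awk '
# info_value (hypothetical helper): text between ";KEY=" and the next ";"
function info_value(line, key,    tag) {
    tag = ";" key "="
    if (!match(line, tag "[^;]*")) return ""
    return substr(line, RSTART + length(tag), RLENGTH - length(tag))
}
/^#/ { print; next }    # keep ## and # lines unchanged
{
    stb = info_value($0, "STB") + 0    # +0 forces a numeric comparison
    fdp = info_value($0, "FDP")
    print $0, (stb >= 0.8 ? "STRAND BIAS" : "GOOD"), fdp, "reads"
}')
printf '%s\n' "$annotated"
```

Looking each tag up by name means the order of FDP= and STB= inside the INFO column does not matter, and further tags can be added with one more info_value call.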

Sometimes a multi-liner is easier to understand and extend:

perl -ple '                                       
/^#/ and next;
/;STB=([^;]+)/ and $_.=($1 >= 0.8 ? " STRAND BIAS " : " GOOD ");
/;FDP=([^;]+)/ and $_.=$1;
' awk_test.txt

The text captured by ( ) is available as $1 .
$_ is the input line. /string/ is short for $_ =~ m/string/
The .= operator appends a string to $_ . It's short for $_ = $_ . string
The perl -p option loops over the input and prints $_ at the end of each cycle. (The -n option only loops, without printing.)
In loop mode the next statement jumps to the next cycle. (Like in awk, which is always in loop mode.) With -p the line is still printed before the next cycle starts, so the # lines pass through unchanged.


Thank you both very much :)