Find a matched pattern and perform comparison on numbers next to it

bunny_merah19 · October 1, 2019, 1:20am

Hi,

I have been trying to extract rows in which the pattern "cov" is followed by a value greater than 3. The "cov" pattern may appear in either $3 or $4 (using ";" as the field separator). Below is an example:

input file

ENST00000652609.1|ENSG00000230590.10|OTTHUMG00000021850.6|OTTHUMT00000503925.1|FTX-232|FTX|2334|	StringTie	exon	385	622	1000	.	.	gene_id "SRR5206792.443"; transcript_id "SRR5206792.443.1"; exon_number "1"; cov "2.749580";
lnc-NAA50-3:1	StringTie	transcript	1	859	1000	.	.	gene_id "SRR5206792.445"; transcript_id "SRR5206792.445.1"; cov "25.269176"; FPKM "81.295151"; TPM "72.137390";
lnc-NAA50-3:1	StringTie	exon	1	859	1000	.	.	gene_id "SRR5206792.445"; transcript_id "SRR5206792.445.1"; exon_number "1"; cov "25.269176";
lnc-DCAF1-1:1	StringTie	transcript	1	2446	1000	.	.	gene_id "SRR5206792.446"; transcript_id "SRR5206792.446.1"; cov "19.228128"; FPKM "61.860096"; TPM "54.891655";
lnc-DCAF1-1:1	StringTie	exon	1	2446	1000	.	.	gene_id "SRR5206792.446"; transcript_id "SRR5206792.446.1"; exon_number "1"; cov "19.228128";

From the sample file, the "cov" value for the first row is less than 3. Therefore, it should be excluded and the output should be like below.

lnc-NAA50-3:1	StringTie	transcript	1	859	1000	.	.	gene_id "SRR5206792.445"; transcript_id "SRR5206792.445.1"; cov "25.269176"; FPKM "81.295151"; TPM "72.137390";
lnc-NAA50-3:1	StringTie	exon	1	859	1000	.	.	gene_id "SRR5206792.445"; transcript_id "SRR5206792.445.1"; exon_number "1"; cov "25.269176";
lnc-DCAF1-1:1	StringTie	transcript	1	2446	1000	.	.	gene_id "SRR5206792.446"; transcript_id "SRR5206792.446.1"; cov "19.228128"; FPKM "61.860096"; TPM "54.891655";
lnc-DCAF1-1:1	StringTie	exon	1	2446	1000	.	.	gene_id "SRR5206792.446"; transcript_id "SRR5206792.446.1"; exon_number "1"; cov "19.228128";

I know how to search for the pattern, but I do not know how to test whether the value is > 3. Below is one of my attempts:

awk 'BEGIN{FS=";"] $0~/cov/ && $3 || $4 >3 {print}' input file

I tried a couple of times to combine the comparison with pattern matching, but failed. I would appreciate your help and advice. Thanks.

RudiC · October 1, 2019, 4:26am

Try (untested)

awk -F";" 'match ($0, /cov[ ".0-9]*;/) {split (substr ($0, RSTART, RLENGTH), T, "\""); if (T[2] <= 3) next} 1' file
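A commented version of the same idea may make it easier to adapt. This is only a sketch: the shortened sample records and the `filter_cov` function name are made up for illustration, and the threshold is passed in as the awk variable `min` rather than hard-coded.

```shell
# Hypothetical helper: keep only lines whose cov value exceeds the threshold "$1".
filter_cov() {
  awk -v min="$1" '
    # match() sets RSTART and RLENGTH to the position and length of the hit
    match($0, /cov "[0-9.]+"/) {
      # Split the matched text, e.g. cov "2.749580", on double quotes;
      # T[2] is then the number between the quotes
      split(substr($0, RSTART, RLENGTH), T, "\"")
      # Adding 0 forces a numeric rather than a string comparison
      if (T[2] + 0 <= min) next
    }
    1   # print every line that was not skipped
  '
}

# Shortened sample records (illustrative only)
out=$(printf '%s\n' \
  'a exon gene_id "g1"; exon_number "1"; cov "2.749580";' \
  'b transcript gene_id "g2"; cov "25.269176"; FPKM "81.295151";' \
  | filter_cov 3)
printf '%s\n' "$out"
```

Like the one-liner above, this lets lines without any cov attribute pass through unchanged, which may or may not be what you want.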

bunny_merah19 · October 1, 2019, 8:02pm

Thanks so much. It works perfectly!

I am looking into your code.

{split (substr ($0, RSTART, RLENGTH), T, "\""); if (T[2] <= 3) next}

This is very good to know, as I have much more data with similar conditions to work on. Again, thanks so much. I really appreciate it.

Chubler_XL · October 2, 2019, 11:33pm

With a slight change it should also work nicely for string data values:

$ echo ' wrongkey  "some data"; key "more data";' | 
     awk 'match($0, /[ ;]key +"[^"]*" *;/) {split(substr($0,RSTART,RLENGTH), T, "\""); print T[2]}'
more data
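If the key name changes from case to case, the regular expression can be built from a variable. The following is a sketch under assumptions: `get_attr` is a hypothetical helper name, the key is assumed to contain no regex metacharacters, and the value is assumed to be enclosed in double quotes.

```shell
# Hypothetical helper: print the quoted value of attribute "$1" from each input line.
get_attr() {
  awk -v key="$1" '
    # Build the pattern once: the key must start the line or follow a
    # space, tab or semicolon, so that "wrongkey" does not match "key"
    BEGIN { re = "(^|[ ;\t])" key " +\"[^\"]*\"" }
    match($0, re) {
      split(substr($0, RSTART, RLENGTH), T, "\"")
      print T[2]   # the text between the first pair of double quotes
    }
  '
}

val=$(printf '%s\n' ' wrongkey  "some data"; key "more data";' | get_attr key)
printf '%s\n' "$val"
```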

bunny_merah19 · October 2, 2019, 11:48pm

This is great! thanks so much

bunny_merah19 · December 2, 2019, 2:32am

Hi,

There is a slight change to the output that I need to generate. Let's say I have the input data below:

SIN3A-2:2    StringTie    transcript    15    2652    1000    +    .    gene_id "R1792.2978"; transcript_id "R1792.2978.1"; cov "2.846695"; FPKM "9.158292"; TPM "8.126626";
SIN3A-2:2    StringTie    exon    15    536    1000    +    .    gene_id "R1792.2978"; transcript_id "R1792.2978.1"; exon_number "1"; cov "1.019540";
SIN3A-2:2    StringTie    exon    725    1045    1000    +    .    gene_id "R1792.2978"; transcript_id "R1792.2978.1"; exon_number "2"; cov "2.834891";
SIN3A-2:2    StringTie    exon    1268    1509    1000    +    .    gene_id "R1792.2978"; transcript_id "R1792.2978.1"; exon_number "3"; cov "5.954821";
SIN3A-2:2    StringTie    exon    1867    1990    1000    +    .    gene_id "R1792.2978"; transcript_id "R1792.2978.1"; exon_number "4"; cov "3.971774";
SIN3A-2:2    StringTie    exon    2344    2465    1000    +    .    gene_id "R1792.2978"; transcript_id "R1792.2978.1"; exon_number "5"; cov "3.590164";
SIN3A-2:2    StringTie    exon    2567    2652    1000    +    .    gene_id "R1792.2978"; transcript_id "R1792.2978.1"; exon_number "6"; cov "2.558140";
SIN3A-2:2    StringTie    transcript    3744    4813    1000    +    .    gene_id "R1792.2979"; transcript_id "R1792.2979.1"; cov "6.767245"; FPKM "21.771355"; TPM "19.318848";
SIN3A-2:2    StringTie    exon    3744    3851    1000    +    .    gene_id "R1792.2979"; transcript_id "R1792.2979.1"; exon_number "1"; cov "12.069445";
SIN3A-2:2    StringTie    exon    3937    4093    1000    +    .    gene_id "R1792.2979"; transcript_id "R1792.2979.1"; exon_number "2"; cov "13.160297";
SIN3A-2:2    StringTie    exon    4211    4813    1000    +    .    gene_id "R1792.2979"; transcript_id "R1792.2979.1"; exon_number "3"; cov "4.153071";
SIN3A-2:5    StringTie    transcript    6    818    1000    +    .    gene_id "R1792.2981"; transcript_id "R1792.2981.1"; cov "5.941011"; FPKM "19.113222"; TPM "16.960150";
SIN3A-2:5    StringTie    exon    6    315    1000    +    .    gene_id "R1792.2981"; transcript_id "R1792.2981.1"; exon_number "1"; cov "5.615038";
SIN3A-2:5    StringTie    exon    510    607    1000    +    .    gene_id "R1792.2981"; transcript_id "R1792.2981.1"; exon_number "2"; cov "7.288415";
SIN3A-2:5    StringTie    exon    782    818    1000    +    .    gene_id "R1792.2981"; transcript_id "R1792.2981.1"; exon_number "3"; cov "5.103339";

The first line with "transcript" in $3 has a "cov" value less than 3. Therefore, the lines following it (with "exon" in $3) need to be removed as well, even though some of them have a cov greater than 3.

The output file should be like below:

SIN3A-2:2    StringTie    transcript    3744    4813    1000    +    .    gene_id "R1792.2979"; transcript_id "R1792.2979.1"; cov "6.767245"; FPKM "21.771355"; TPM "19.318848";
SIN3A-2:2    StringTie    exon    3744    3851    1000    +    .    gene_id "R1792.2979"; transcript_id "R1792.2979.1"; exon_number "1"; cov "12.069445";
SIN3A-2:2    StringTie    exon    3937    4093    1000    +    .    gene_id "R1792.2979"; transcript_id "R1792.2979.1"; exon_number "2"; cov "13.160297";
SIN3A-2:2    StringTie    exon    4211    4813    1000    +    .    gene_id "R1792.2979"; transcript_id "R1792.2979.1"; exon_number "3"; cov "4.153071";
SIN3A-2:5    StringTie    transcript    6    818    1000    +    .    gene_id "R1792.2981"; transcript_id "R1792.2981.1"; cov "5.941011"; FPKM "19.113222"; TPM "16.960150";
SIN3A-2:5    StringTie    exon    6    315    1000    +    .    gene_id "R1792.2981"; transcript_id "R1792.2981.1"; exon_number "1"; cov "5.615038";
SIN3A-2:5    StringTie    exon    510    607    1000    +    .    gene_id "R1792.2981"; transcript_id "R1792.2981.1"; exon_number "2"; cov "7.288415";
SIN3A-2:5    StringTie    exon    782    818    1000    +    .    gene_id "R1792.2981"; transcript_id "R1792.2981.1"; exon_number "3"; cov "5.103339";

I tried to adapt the code given earlier, but the output file still retains the lines with "exon". Below is one of my attempts:

awk -F"[\t;]" '$3 ~/transcript/ {if(match ($0,/cov[ ".0-9]*;/)) {split (substr ($0, RSTART, RLENGTH), T, "\""); if (T[2] <= 3) next}     
        SRC=$1
        OUT=""
        }                
$1==SRC {OUT= OUT ORS $0} 

{print}'  inputfile > outputfile

Can anyone please help and tell me what I did wrong? Thanks.

RudiC · December 2, 2019, 5:46am

Try

awk  '
$3 == "transcript"      {match ($0, /cov[ ".0-9]*;/)
                         split (substr ($0, RSTART, RLENGTH), T, "\"")
                         PR =  (T[2] > 3)
                        }
PR
' file
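The key idea is a flag that is updated only on "transcript" lines and then also governs the "exon" lines that follow. A commented sketch of that approach is shown here; the `filter_blocks` name, the `min` variable and the shortened sample records are assumptions made for illustration.

```shell
# Hypothetical helper: keep a transcript and its exons only if the transcript cov exceeds "$1".
filter_blocks() {
  awk -v min="$1" '
    $3 == "transcript" {
      keep = 0                               # default: drop this block
      if (match($0, /cov "[0-9.]+"/)) {
        split(substr($0, RSTART, RLENGTH), T, "\"")
        keep = (T[2] + 0 > min)              # decided once per transcript
      }
    }
    keep                                     # true: print the current line
  '
}

# Shortened sample records (illustrative only)
out=$(printf '%s\n' \
  'x S transcript 15 2652 . + . gene_id "g1"; cov "2.846695";' \
  'x S exon 1268 1509 . + . gene_id "g1"; exon_number "3"; cov "5.954821";' \
  'x S transcript 3744 4813 . + . gene_id "g2"; cov "6.767245";' \
  'x S exon 4211 4813 . + . gene_id "g2"; exon_number "3"; cov "4.153071";' \
  | filter_blocks 3)
printf '%s\n' "$out"
```

Because the flag is left untouched on exon lines, an exon with a high cov under a low-coverage transcript is still dropped, and an exon with a low cov under a retained transcript is still printed.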

bunny_merah19 · December 2, 2019, 8:35am

Hi RudiC,

It works great. Thanks so much.