Hi,
I have been trying to extract rows that contain the pattern "cov" where the value next to it is > 3. The 'cov' pattern may appear in either $3 or $4 (when using ";" as the field separator). Below is an example:
Input file:
ENST00000652609.1|ENSG00000230590.10|OTTHUMG00000021850.6|OTTHUMT00000503925.1|FTX-232|FTX|2334| StringTie exon 385 622 1000 . . gene_id "SRR5206792.443"; transcript_id "SRR5206792.443.1"; exon_number "1"; cov "2.749580";
lnc-NAA50-3:1 StringTie transcript 1 859 1000 . . gene_id "SRR5206792.445"; transcript_id "SRR5206792.445.1"; cov "25.269176"; FPKM "81.295151"; TPM "72.137390";
lnc-NAA50-3:1 StringTie exon 1 859 1000 . . gene_id "SRR5206792.445"; transcript_id "SRR5206792.445.1"; exon_number "1"; cov "25.269176";
lnc-DCAF1-1:1 StringTie transcript 1 2446 1000 . . gene_id "SRR5206792.446"; transcript_id "SRR5206792.446.1"; cov "19.228128"; FPKM "61.860096"; TPM "54.891655";
lnc-DCAF1-1:1 StringTie exon 1 2446 1000 . . gene_id "SRR5206792.446"; transcript_id "SRR5206792.446.1"; exon_number "1"; cov "19.228128";
In the sample file, the "cov" value in the first row is less than 3, so that row should be excluded and the output should look like this:
lnc-NAA50-3:1 StringTie transcript 1 859 1000 . . gene_id "SRR5206792.445"; transcript_id "SRR5206792.445.1"; cov "25.269176"; FPKM "81.295151"; TPM "72.137390";
lnc-NAA50-3:1 StringTie exon 1 859 1000 . . gene_id "SRR5206792.445"; transcript_id "SRR5206792.445.1"; exon_number "1"; cov "25.269176";
lnc-DCAF1-1:1 StringTie transcript 1 2446 1000 . . gene_id "SRR5206792.446"; transcript_id "SRR5206792.446.1"; cov "19.228128"; FPKM "61.860096"; TPM "54.891655";
lnc-DCAF1-1:1 StringTie exon 1 2446 1000 . . gene_id "SRR5206792.446"; transcript_id "SRR5206792.446.1"; exon_number "1"; cov "19.228128";
I know how to search for the pattern but do not know how to test whether the value is > 3. Below is one of my attempts:
awk 'BEGIN{FS=";"] $0~/cov/ && $3 || $4 >3 {print}' input file
I tried a couple of times to combine the comparison with the pattern matching but failed. I would appreciate your help and advice. Thanks.
RudiC
October 1, 2019, 4:26am
Try (untested)
awk -F";" 'match ($0, /cov[ ".0-9]*;/) {split (substr ($0, RSTART, RLENGTH), T, "\""); if (T[2] <= 3) next} 1' file
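To unpack the one-liner: match() sets RSTART and RLENGTH to the position and length of the cov "..."; attribute, substr() cuts that text out, and split() on the double quote leaves the bare number in T[2]. Below is a commented sketch of the same logic wrapped in a shell function. The function name filter_cov and the min variable are illustrative assumptions, not part of the original answer; the -F";" option is dropped because the script never uses the fields.

```shell
# filter_cov MIN [FILE...]
# Print lines whose cov value is greater than MIN.
# Lines without a cov attribute are passed through unchanged.
filter_cov() {
    min=$1; shift
    awk -v min="$min" '
        # Locate the attribute, e.g.  cov "25.269176";
        match($0, /cov[ ".0-9]*;/) {
            # Cut out the matched text and split it on double quotes,
            # so that T[2] holds the bare number.
            split(substr($0, RSTART, RLENGTH), T, "\"")
            # Adding 0 forces a numeric (not string) comparison.
            if (T[2] + 0 <= min + 0) next
        }
        1   # print every line that was not skipped
    ' "$@"
}
```

Usage would be along the lines of: filter_cov 3 file > filtered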
Thanks so much. It works perfectly!
I am studying your code, in particular this part:
{split (substr ($0, RSTART, RLENGTH), T, "\""); if (T[2] <= 3) next}
This is very good to know, as I have much more data with similar conditions to work on. Again, thanks so much. I really appreciate it.
With a slight change it should also work nicely for string data values:
$ echo ' wrongkey "some data"; key "more data";' |
awk 'match($0, /[ ;]key +"[^"]*" *;/) {split(substr($0,RSTART,RLENGTH), T, "\""); print T[2]}'
more data
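The same idea can be generalised so that the key is passed in rather than hard-coded. The sketch below builds the regular expression as a string at startup; the function name get_attr is hypothetical, and it assumes the key contains no regex metacharacters or backslashes.

```shell
# get_attr KEY [FILE...]
# For each line that has the attribute  KEY "value";  print the value.
get_attr() {
    key=$1; shift
    awk -v key="$key" '
        BEGIN {
            # Require a blank or ";" before the key so that e.g.
            # "wrongkey" does not match when looking for "key".
            re = "[ ;]" key " +\"[^\"]*\" *;"
        }
        match($0, re) {
            split(substr($0, RSTART, RLENGTH), T, "\"")
            print T[2]
        }
    ' "$@"
}
```

For example, piping the echo line above into get_attr key prints the value of key only, not that of wrongkey.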
This is great! Thanks so much.
Hi,
There is a slight change to the output that I need to generate. Let's say I have the input data below:
SIN3A-2:2 StringTie transcript 15 2652 1000 + . gene_id "R1792.2978"; transcript_id "R1792.2978.1"; cov "2.846695"; FPKM "9.158292"; TPM "8.126626";
SIN3A-2:2 StringTie exon 15 536 1000 + . gene_id "R1792.2978"; transcript_id "R1792.2978.1"; exon_number "1"; cov "1.019540";
SIN3A-2:2 StringTie exon 725 1045 1000 + . gene_id "R1792.2978"; transcript_id "R1792.2978.1"; exon_number "2"; cov "2.834891";
SIN3A-2:2 StringTie exon 1268 1509 1000 + . gene_id "R1792.2978"; transcript_id "R1792.2978.1"; exon_number "3"; cov "5.954821";
SIN3A-2:2 StringTie exon 1867 1990 1000 + . gene_id "R1792.2978"; transcript_id "R1792.2978.1"; exon_number "4"; cov "3.971774";
SIN3A-2:2 StringTie exon 2344 2465 1000 + . gene_id "R1792.2978"; transcript_id "R1792.2978.1"; exon_number "5"; cov "3.590164";
SIN3A-2:2 StringTie exon 2567 2652 1000 + . gene_id "R1792.2978"; transcript_id "R1792.2978.1"; exon_number "6"; cov "2.558140";
SIN3A-2:2 StringTie transcript 3744 4813 1000 + . gene_id "R1792.2979"; transcript_id "R1792.2979.1"; cov "6.767245"; FPKM "21.771355"; TPM "19.318848";
SIN3A-2:2 StringTie exon 3744 3851 1000 + . gene_id "R1792.2979"; transcript_id "R1792.2979.1"; exon_number "1"; cov "12.069445";
SIN3A-2:2 StringTie exon 3937 4093 1000 + . gene_id "R1792.2979"; transcript_id "R1792.2979.1"; exon_number "2"; cov "13.160297";
SIN3A-2:2 StringTie exon 4211 4813 1000 + . gene_id "R1792.2979"; transcript_id "R1792.2979.1"; exon_number "3"; cov "4.153071";
SIN3A-2:5 StringTie transcript 6 818 1000 + . gene_id "R1792.2981"; transcript_id "R1792.2981.1"; cov "5.941011"; FPKM "19.113222"; TPM "16.960150";
SIN3A-2:5 StringTie exon 6 315 1000 + . gene_id "R1792.2981"; transcript_id "R1792.2981.1"; exon_number "1"; cov "5.615038";
SIN3A-2:5 StringTie exon 510 607 1000 + . gene_id "R1792.2981"; transcript_id "R1792.2981.1"; exon_number "2"; cov "7.288415";
SIN3A-2:5 StringTie exon 782 818 1000 + . gene_id "R1792.2981"; transcript_id "R1792.2981.1"; exon_number "3"; cov "5.103339";
The first line containing "transcript" in $3 has a "cov" value less than 3. Therefore, the lines following it (with "exon" in $3) need to be removed as well, even though some of them have a cov value greater than 3.
The output file should be like below:
SIN3A-2:2 StringTie transcript 3744 4813 1000 + . gene_id "R1792.2979"; transcript_id "R1792.2979.1"; cov "6.767245"; FPKM "21.771355"; TPM "19.318848";
SIN3A-2:2 StringTie exon 3744 3851 1000 + . gene_id "R1792.2979"; transcript_id "R1792.2979.1"; exon_number "1"; cov "12.069445";
SIN3A-2:2 StringTie exon 3937 4093 1000 + . gene_id "R1792.2979"; transcript_id "R1792.2979.1"; exon_number "2"; cov "13.160297";
SIN3A-2:2 StringTie exon 4211 4813 1000 + . gene_id "R1792.2979"; transcript_id "R1792.2979.1"; exon_number "3"; cov "4.153071";
SIN3A-2:5 StringTie transcript 6 818 1000 + . gene_id "R1792.2981"; transcript_id "R1792.2981.1"; cov "5.941011"; FPKM "19.113222"; TPM "16.960150";
SIN3A-2:5 StringTie exon 6 315 1000 + . gene_id "R1792.2981"; transcript_id "R1792.2981.1"; exon_number "1"; cov "5.615038";
SIN3A-2:5 StringTie exon 510 607 1000 + . gene_id "R1792.2981"; transcript_id "R1792.2981.1"; exon_number "2"; cov "7.288415";
SIN3A-2:5 StringTie exon 782 818 1000 + . gene_id "R1792.2981"; transcript_id "R1792.2981.1"; exon_number "3"; cov "5.103339";
I tried to adapt the code given earlier, but the output file still retains the lines with "exon". Below is one of my attempts:
awk -F"[\t;]" '$3 ~/transcript/ {if(match ($0,/cov[ ".0-9]*;/)) {split (substr ($0, RSTART, RLENGTH), T, "\""); if (T[2] <= 3) next}
SRC=$1
OUT=""
}
$1==SRC {OUT= OUT ORS $0}
{print}' inputfile > outputfile
Can anyone please help and tell me what I did wrong? Thanks.
RudiC
December 2, 2019, 5:46am
Try
awk '
$3 == "transcript" {match ($0, /cov[ ".0-9]*;/)
split (substr ($0, RSTART, RLENGTH), T, "\"")
PR = (T[2] > 3)
}
PR
' file
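The script above works as a state flag: each "transcript" line recomputes PR, and the bare PR pattern then prints the transcript line and every following line only while the flag is true, so the exon lines inherit the decision made for their transcript. A commented sketch of that logic follows. The function name filter_by_transcript_cov and the variables keep and min are illustrative assumptions; unlike the original, it explicitly drops a transcript group whose transcript line has no cov attribute.

```shell
# filter_by_transcript_cov MIN [FILE...]
# Keep a transcript line and the lines that follow it only if the
# transcript's own cov value is greater than MIN.
filter_by_transcript_cov() {
    min=$1; shift
    awk -v min="$min" '
        # A new transcript starts: decide whether to keep its group.
        $3 == "transcript" {
            keep = 0
            if (match($0, /cov[ ".0-9]*;/)) {
                split(substr($0, RSTART, RLENGTH), T, "\"")
                keep = (T[2] + 0 > min + 0)
            }
        }
        # Pattern without action: print the line while keep is true.
        keep
    ' "$@"
}
```

Note that the exon lines are never tested on their own cov value; only the most recent transcript line decides.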
Hi RudiC,
It works great. Thanks so much.