Filtering based on column values

Hi there,

I am trying to filter a big file with several columns using values on a column with values like (AC=5;AN=10;SF=341,377,517,643,662;VRT=1). I wont to filter the data based on SF= values that are (bigger than 400)

Chr	Start	End	REF	ALT	GENE	EFFECT	Frequency	Existing	INFO	Number	Zygous
chr18	81447	81447	T	C	NON17	nonsynonymous SNV	0.003056	rs11	AC=5;AN=10;SF=341,377,517,643,662;VRT=1	5	10
chr19	12886	12886	G	C	NON17	nonsynonymous SNV	0.003056	rs14	AC=5;AN=10;SF=348,398,400,741,778;VRT=1	5	10
chr8	84550	84550	G	A	TSPO	nonsynonymous SNV	0.003056	rs18	AC=5;AN=10;SF=606,704,723,736,763;VRT=1	5	10
chr8	82784	82784	C	T	TSPO	nonsynonymous SNV	0.003056	rs19	AC=5;AN=10;SF=372,478,684,708,799;VRT=1	5	10

For example

Chr	Start	End	REF	ALT	GENE	EFFECT	Frequency	Existing	INFO	Number	Zygous
chr8	84550	84550	G	A	TSPO	nonsynonymous SNV	0.003056	rs18	AC=5;AN=10;SF=606,704,723,736,763;VRT=1	5	10

I would appreciate a help as I am doing that manually

I guess you're talking neither of the 8th nor the INFO column, but of col 10, headed "Number Zygous", in a <TAB> separated file. If the file structure is as given, i.e. col 10 is the only one with = , ; , and , in it, try

awk 'FNR == 1 || $6 > 400' FS="[,;=]" file
Chr	Start	End	REF	ALT	GENE	EFFECT	Frequency	Existing	INFO	Number	Zygous
chr8	84550	84550	G	A	TSPO	nonsynonymous SNV	0.003056	rs18	AC=5;AN=10;SF=606,704,723,736,763;VRT=1	5	10

Yes its the 10th column, corrected the typo, however, the header was INFO. Will give it a try

---------- Post updated at 08:00 AM ---------- Previous update was at 07:33 AM ----------

It does filtering but i am focusing on SF= values. I have rows like these
AC=997;AN=1038;SF=0,1;VRT=1

I should be the SF=vales that are bigger than 400. is there a way to make it specific

As long as SF is the third in that column, it will work. It does with your new example. Where and how does it fail?

I don't think its filtering the right way. This is some of the output. it did filter rows but not sure what it did filter. e.g AC=65;AN=92;SF=0,1;VRT=1
it should be above 400

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT
chr1	65872	.	T	G	2480.51	.	AC=65;AN=92;SF=0,1;VRT=1	GT:GQ:DP:AD:PL
chr1	65893	.	G	A	433.77	.	AC=3;AN=4;SF=0,1;VRT=1	GT:GQ:PL:AD:DP
chr1	65898	.	T	C	2055.11	.	AC=7;AN=10;SF=0,1;VRT=1	GT:DP:AD:PL:GQ
chr1	69511	rs75062661	A	G	822333	.	AC=894;AN=934;SF=0,1;VRT=1	GT:GQ:DP:AD:PL
chr1	69552	rs55874132	G	C	73.21	.	AC=20;AN=22;SF=0,1;VRT=1	GT:PL:AD:DP:GQ
chr1	721450	rs2977675	G	A	18243.4	.	AC=95;AN=188;SF=0,1;VRT=1	GT:PL:AD:DP:GQ

That looks a bit different from what you posted as a sample in post#1. But, none of those new lines will come through:

awk 'FNR == 1 || $6 > 400' FS="[,;=]" file
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT

yes it is different, does that change the filtering command ?

---------- Post updated at 08:55 AM ---------- Previous update was at 08:42 AM ----------

I tried to change the column number, none of them give the right output. I am guessing it doesn't have an affect on SF= values

awk 'FNR == 1 || $6 > 400' FS="[,;=]" file
awk 'FNR == 1 || $7 > 400' FS="[,;=]" file
awk 'FNR == 1 || $8 > 400' FS="[,;=]" file

Would this work?

perl -nle '/SF=(\d{3})/; print if $1 > 400 or $. == 1' daashti.example

Try

awk '{match ($0, ";SF=[^,]*,"); VAL = substr($0, RSTART+4, RLENGTH-4)} FNR == 1 || VAL > 400' file
awk '{match ($0, ";SF=[^,]*,"); VAL = substr($0, RSTART+4, RLENGTH-4)} FNR == 1 || VAL > 400' file

This didn't give the desired outcome.

and

perl -nle '/SF=(\d{3})/; print if $1 > 400 or $. == 1' daashti.example

gave me only the header

Hi daashti,
If you got only the header, using the Perl option, it means your input is not as advertised.
The command would had picked any pattern SF= followed by 3 digits and evaluated the digits as a number. If greater that 400 it would had output the whole line.

May I ask if you could post the result of running this command from the original input?

perl -nle '/(SF=\d+)/ and print $1' real.file

When I run that with the posted example, it returns the following:

SF=341
SF=348
SF=606
SF=372

Thanks for your reply, real file is very similar.
I do need the a whole line/row when filtering not only SF=values alone i.e SF values

I tried this

perl -nle '/(SF=\d+) and print $1' real.file

and got this

Search pattern not terminated at -e line 1.

not very familiar with perl.

Add the closing pattern search slash (highlighted)

I will clean the mess and put the example in a better format.

Data

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT
chr1	65872	.	T	G	2480.51	.	AC=65;AN=92;SF=0,1;VRT=1	GT:GQ:DP:AD:PL
chr1	65893	.	G	A	433.77	.	AC=3;AN=4;SF=0,1;VRT=1	GT:GQ:PL:AD:DP
chr1	65898	.	T	C	2055.11	.	AC=7;AN=10;SF=0,1;VRT=1	GT:DP:AD:PL:GQ
chr1	69511	rs75062661	A	G	822333	.	AC=894;AN=934;SF=0,1;VRT=1	GT:GQ:DP:AD:PL
chr19	9088447	9088447	T	C	2420.41	.	AC=5;AN=10;SF=441,477,517,643,662;VRT=1	GT:GQ:DP:AD:PL
chr19	9072886	9072886	G	C	245.11	.	AC=5;AN=10;SF=448,498,500,741,778;VRT=1	GT:GQ:PL:AD:DP
chr7	149484550	149484550	G	A	305.11	.	AC=5;AN=10;SF=606,704,723,736,763;VRT=1	GT:DP:AD:PL:GQ
chr7	149482784	149482784	C	T	822	.	AC=5;AN=10;SF=472,478,684,708,799;VRT=1	GT:GQ:DP:AD:PL

After filtering

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT
chr19	9088447	9088447	T	C	2420.41	.	AC=5;AN=10;SF=441,477,517,643,662;VRT=1	GT:GQ:DP:AD:PL
chr19	9072886	9072886	G	C	245.11	.	AC=5;AN=10;SF=448,498,500,741,778;VRT=1	GT:GQ:PL:AD:DP
chr7	149484550	149484550	G	A	305.11	.	AC=5;AN=10;SF=606,704,723,736,763;VRT=1	GT:DP:AD:PL:GQ
chr7	149482784	149482784	C	T	822	.	AC=5;AN=10;SF=472,478,684,708,799;VRT=1	GT:GQ:DP:AD:PL

The original Perl command gives me the same result you want when run against your latest example.

perl -nle '/SF=(\d{3})/; print if $1 > 400 or $. == 1' real.file
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT
chr19   9088447 9088447 T       C       2420.41 .       AC=5;AN=10;SF=441,477,517,643,662;VRT=1 GT:GQ:DP:AD:PL
chr19   9072886 9072886 G       C       245.11  .       AC=5;AN=10;SF=448,498,500,741,778;VRT=1 GT:GQ:PL:AD:DP
chr7    149484550       149484550       G       A       305.11  .       AC=5;AN=10;SF=606,704,723,736,763;VRT=1 GT:DP:AD:PL:GQ
chr7    149482784       149482784       C       T       822     .       AC=5;AN=10;SF=472,478,684,708,799;VRT=1 GT:GQ:DP:AD:PL

If you only want the number as you mentioned in post #12, then:

perl -nle '/SF=(\d+)/; print $1 if $1 > 400' real.file
441
448
606
472

SF=are not only 3 digits
its random and it goes from 0-818
e.g
SF=0,10,35,55,300,455,654
SF=754,789,790,800

Sure, but any value greater than 400 would be always 3 digits (or more) and if the maximum number is 818 a three digit is sufficient.
A variation, if the maximum number could be arbitrary.

perl -nle '/SF=(\d+)/; print if $1 > 400 or $. == 1' real.file

Question:

Are you trying to filter if ANY of the numbers in the string SF=0,10,35,55,300,455,654 is greater than 400?

1 Like

yes i want lines/rows with SF=values larger than 400

So this brings us back to the question that should have been raised right in the beginning: What OS, shell, and awk version are you running?
Because, taking your last example from post#14, I see

awk '{match ($0, ";SF=[^,]*,"); VAL = substr($0, RSTART+4, RLENGTH-4)} FNR == 1 || VAL > 400' file 
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT
chr19	9088447	9088447	T	C	2420.41	.	AC=5;AN=10;SF=441,477,517,643,662;VRT=1	GT:GQ:DP:AD:PL
chr19	9072886	9072886	G	C	245.11	.	AC=5;AN=10;SF=448,498,500,741,778;VRT=1	GT:GQ:PL:AD:DP
chr7	149484550	149484550	G	A	305.11	.	AC=5;AN=10;SF=606,704,723,736,763;VRT=1	GT:DP:AD:PL:GQ
chr7	149482784	149482784	C	T	822	.	AC=5;AN=10;SF=472,478,684,708,799;VRT=1	GT:GQ:DP:AD:PL
  • EXACTLY what you required!

(Aside - as did awk 'FNR == 1 || $6 > 400' FS="[,;=]" file yield...)
(Next aside: as did Aia's perl code do!)

Hi RudiC

thanks for input RudiC,

i used

awk '{match ($0, ";SF=[^,]*,"); VAL = substr($0, RSTART+4, RLENGTH-4)} FNR == 1 || VAL > 400' file 

i got no output at all apart from header

My OS is Ubuntu 3.4.0+ #1 PREEMPT Thu Aug 1 17:06:05 CST 2013 x86_64 x86_64 x86_64 GNU/Linux
Awk version
GNU Awk 4.0.1
Shell
bash 4.3.11(1)-release