I am trying to filter a big file with several columns using values on a column with values like (AC=5;AN=10;SF=341,377,517,643,662;VRT=1). I wont to filter the data based on SF= values that are (bigger than 400)
Chr Start End REF ALT GENE EFFECT Frequency Existing INFO Number Zygous
chr18 81447 81447 T C NON17 nonsynonymous SNV 0.003056 rs11 AC=5;AN=10;SF=341,377,517,643,662;VRT=1 5 10
chr19 12886 12886 G C NON17 nonsynonymous SNV 0.003056 rs14 AC=5;AN=10;SF=348,398,400,741,778;VRT=1 5 10
chr8 84550 84550 G A TSPO nonsynonymous SNV 0.003056 rs18 AC=5;AN=10;SF=606,704,723,736,763;VRT=1 5 10
chr8 82784 82784 C T TSPO nonsynonymous SNV 0.003056 rs19 AC=5;AN=10;SF=372,478,684,708,799;VRT=1 5 10
For example
Chr Start End REF ALT GENE EFFECT Frequency Existing INFO Number Zygous
chr8 84550 84550 G A TSPO nonsynonymous SNV 0.003056 rs18 AC=5;AN=10;SF=606,704,723,736,763;VRT=1 5 10
I would appreciate a help as I am doing that manually
I guess you're talking neither of the 8th nor the INFO column, but of col 10, headed "Number Zygous", in a <TAB> separated file. If the file structure is as given, i.e. col 10 is the only one with = , ; , and , in it, try
awk 'FNR == 1 || $6 > 400' FS="[,;=]" file
Chr Start End REF ALT GENE EFFECT Frequency Existing INFO Number Zygous
chr8 84550 84550 G A TSPO nonsynonymous SNV 0.003056 rs18 AC=5;AN=10;SF=606,704,723,736,763;VRT=1 5 10
I don't think its filtering the right way. This is some of the output. it did filter rows but not sure what it did filter. e.g AC=65;AN=92;SF=0,1;VRT=1
it should be above 400
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
chr1 65872 . T G 2480.51 . AC=65;AN=92;SF=0,1;VRT=1 GT:GQ:DP:AD:PL
chr1 65893 . G A 433.77 . AC=3;AN=4;SF=0,1;VRT=1 GT:GQ:PL:AD:DP
chr1 65898 . T C 2055.11 . AC=7;AN=10;SF=0,1;VRT=1 GT:DP:AD:PL:GQ
chr1 69511 rs75062661 A G 822333 . AC=894;AN=934;SF=0,1;VRT=1 GT:GQ:DP:AD:PL
chr1 69552 rs55874132 G C 73.21 . AC=20;AN=22;SF=0,1;VRT=1 GT:PL:AD:DP:GQ
chr1 721450 rs2977675 G A 18243.4 . AC=95;AN=188;SF=0,1;VRT=1 GT:PL:AD:DP:GQ
Hi daashti,
If you got only the header, using the Perl option, it means your input is not as advertised.
The command would had picked any pattern SF= followed by 3 digits and evaluated the digits as a number. If greater that 400 it would had output the whole line.
May I ask if you could post the result of running this command from the original input?
perl -nle '/(SF=\d+)/ and print $1' real.file
When I run that with the posted example, it returns the following:
I will clean the mess and put the example in a better format.
Data
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
chr1 65872 . T G 2480.51 . AC=65;AN=92;SF=0,1;VRT=1 GT:GQ:DP:AD:PL
chr1 65893 . G A 433.77 . AC=3;AN=4;SF=0,1;VRT=1 GT:GQ:PL:AD:DP
chr1 65898 . T C 2055.11 . AC=7;AN=10;SF=0,1;VRT=1 GT:DP:AD:PL:GQ
chr1 69511 rs75062661 A G 822333 . AC=894;AN=934;SF=0,1;VRT=1 GT:GQ:DP:AD:PL
chr19 9088447 9088447 T C 2420.41 . AC=5;AN=10;SF=441,477,517,643,662;VRT=1 GT:GQ:DP:AD:PL
chr19 9072886 9072886 G C 245.11 . AC=5;AN=10;SF=448,498,500,741,778;VRT=1 GT:GQ:PL:AD:DP
chr7 149484550 149484550 G A 305.11 . AC=5;AN=10;SF=606,704,723,736,763;VRT=1 GT:DP:AD:PL:GQ
chr7 149482784 149482784 C T 822 . AC=5;AN=10;SF=472,478,684,708,799;VRT=1 GT:GQ:DP:AD:PL
After filtering
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
chr19 9088447 9088447 T C 2420.41 . AC=5;AN=10;SF=441,477,517,643,662;VRT=1 GT:GQ:DP:AD:PL
chr19 9072886 9072886 G C 245.11 . AC=5;AN=10;SF=448,498,500,741,778;VRT=1 GT:GQ:PL:AD:DP
chr7 149484550 149484550 G A 305.11 . AC=5;AN=10;SF=606,704,723,736,763;VRT=1 GT:DP:AD:PL:GQ
chr7 149482784 149482784 C T 822 . AC=5;AN=10;SF=472,478,684,708,799;VRT=1 GT:GQ:DP:AD:PL
The original Perl command gives me the same result you want when run against your latest example.
perl -nle '/SF=(\d{3})/; print if $1 > 400 or $. == 1' real.file
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
chr19 9088447 9088447 T C 2420.41 . AC=5;AN=10;SF=441,477,517,643,662;VRT=1 GT:GQ:DP:AD:PL
chr19 9072886 9072886 G C 245.11 . AC=5;AN=10;SF=448,498,500,741,778;VRT=1 GT:GQ:PL:AD:DP
chr7 149484550 149484550 G A 305.11 . AC=5;AN=10;SF=606,704,723,736,763;VRT=1 GT:DP:AD:PL:GQ
chr7 149482784 149482784 C T 822 . AC=5;AN=10;SF=472,478,684,708,799;VRT=1 GT:GQ:DP:AD:PL
If you only want the number as you mentioned in post #12, then:
Sure, but any value greater than 400 would be always 3 digits (or more) and if the maximum number is 818 a three digit is sufficient.
A variation, if the maximum number could be arbitrary.
perl -nle '/SF=(\d+)/; print if $1 > 400 or $. == 1' real.file
Question:
Are you trying to filter if ANY of the numbers in the string SF=0,10,35,55,300,455,654 is greater than 400?
So this brings us back to the question that should have been raised right in the beginning: What OS, shell, and awk version are you running?
Because, taking your last example from post#14, I see
awk '{match ($0, ";SF=[^,]*,"); VAL = substr($0, RSTART+4, RLENGTH-4)} FNR == 1 || VAL > 400' file
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
chr19 9088447 9088447 T C 2420.41 . AC=5;AN=10;SF=441,477,517,643,662;VRT=1 GT:GQ:DP:AD:PL
chr19 9072886 9072886 G C 245.11 . AC=5;AN=10;SF=448,498,500,741,778;VRT=1 GT:GQ:PL:AD:DP
chr7 149484550 149484550 G A 305.11 . AC=5;AN=10;SF=606,704,723,736,763;VRT=1 GT:DP:AD:PL:GQ
chr7 149482784 149482784 C T 822 . AC=5;AN=10;SF=472,478,684,708,799;VRT=1 GT:GQ:DP:AD:PL
EXACTLY what you required!
(Aside - as did awk 'FNR == 1 || $6 > 400' FS="[,;=]" file yield...)
(Next aside: as did Aia's perl code do!)