I want to remove from my table every row in which the count of the minor (less frequent) value is less than 30% of the count of the major (most frequent) value. Each column, starting from column 2, can hold A, T, G, C or N. Each row has at least 2 and at most 4 distinct values (out of A, T, G, C).
N denotes a missing value and must be ignored when counting.
These are the rules for filtering.
Consider the row which has a count of 4 for Ts and 1 for As (starting col 2). S10_14113025 T T T A T
If the count of the minor repeating value is less than 30% of the major repeating value, delete the row.
So count(A)/count(T) = 1/4 = 25% < 30%, and this row should be removed.
Consider the row with 2 Ts, and 1 A. S10_14113025 T N N A T
Ignoring the Ns, the minor frequency is
count(A)/count(T) = 1/2 = 50% > 30%, so this row should NOT be removed.
Consider a row with more than 2 distinct values (3 in this case: G, C and A). S10_14113072 G C A G N
This row should NOT be removed; nothing needs to be calculated.
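To check my own understanding of the three rules, here is a rough sketch that only annotates each row with a verdict instead of filtering it. The function name classify_rows and the output layout are made up for illustration. It also assumes that any row without exactly two distinct non-N values is kept, which the rules only state explicitly for rows with more than two.

```sh
# Hypothetical helper: print ID, verdict and minor/major percentage per row.
classify_rows() {
  awk '
  {
    split("", cnt)                  # empty the per-row counts
    kinds = 0; lo = 0; hi = 0
    for (i = 2; i <= NF; i++) {
      if ($i == "N") continue       # N is a missing value, skip it
      if (!($i in cnt)) kinds++     # first time this value is seen
      cnt[$i]++
    }
    # Assumption: only rows with exactly two distinct values are tested.
    if (kinds != 2) { print $1, "keep (" kinds " distinct values)"; next }
    for (v in cnt) {
      if (cnt[v] > hi) hi = cnt[v]
      if (lo == 0 || cnt[v] < lo) lo = cnt[v]
    }
    pct = 100 * lo / hi
    # Integer comparison avoids floating point surprises at exactly 30%.
    print $1, (lo * 100 < hi * 30 ? "remove" : "keep"), pct "%"
  }'
}
```

Piping the three example rows above through classify_rows should report remove, keep and keep, in that order.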
Input
S10_14113025 T T T A T T
S10_14113072 A C C A A A
S10_14113073 G C G G C N
S10_14113079 G C C C N N
S10_14113080 G C C C N A
S10_14113027 T T N A N N
Desired output
S10_14113072 A C C A A A
S10_14113073 G C G G C N
S10_14113080 G C C C N A
S10_14113027 T T N A N N
We only need to filter biallelic rows (exactly two distinct values, excluding N).
Rows with more than two distinct values are biologically significant and must not be filtered out.
You're counting Ns as legitimate values, but they should be skipped.
There's no provision for printing lines with more than 2 unique field values.
'delete a' doesn't work in every awk. Some awks only allow deleting individual array elements, not a whole array in one statement. The portable trick is to call split("", a), which empties the array in one step.
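A small sketch of the split trick, in case it helps anyone reading along. The function name distinct_per_row is hypothetical. It counts the distinct field values on each row and relies on split("", seen) to empty the array before every record, since split always clears its target array before filling it.

```sh
# Hypothetical helper: print the ID and the number of distinct values per row.
distinct_per_row() {
  awk '
  {
    split("", seen)                 # portable replacement for: delete seen
    for (i = 2; i <= NF; i++) seen[$i]++
    n = 0
    for (k in seen) n++             # count the keys left in the array
    print $1, n
  }'
}
```

Without the split call the counts from earlier rows would leak into later ones, so a row holding only G would report more than one distinct value.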
There is one more filter that needs to be considered.
I want to treat any character other than A, C, G, T or N as a missing value, the same as N.
In the example below, the character Y is treated as a missing value because it
does not belong to the set {A,C,G,T,N}.
For example the row
S10_14113072 A C C A A Y
needs to be treated the same as
S10_14113072 A C C A A N
But if such a row appears in the output it should appear as
S10_14113072 A C C A A Y
I have made a small modification:
if ($i != "[ACGTN]") { $i = "N" }
It doesn't seem to work. Please help me with the correct code.
function initVars()
{
  split("",n)
  split("",a)
  c=0
}
{
  for(i=2;i<=NF;i++)
    if ($i != "[ACGTN]") { $i = "N" }
    if ($i != "N") {
      if (!($i in a))
        n[++c]=$i
      a[$i]++
    }
  if (c>2) { initVars(); print;next }
  div=a[n[1]]/a[n[2]]
  div=(div>1)?1/div:div
  if ( div*100 > 30)
    print
  initVars()
}
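A possible fix, offered as a sketch rather than the definitive answer. Three things go wrong in the script above. First, != compares each field with the literal string "[ACGTN]", so every field differs and gets overwritten; a pattern test needs ~ or !~ with a regex. Second, the for loop has no braces, so only the first if belongs to it and the counting block runs once after the loop ends. Third, assigning to $i rewrites the record, so a Y would be printed as N. The version below skips unknown characters without touching the fields. The function name filter_biallelic is hypothetical, and it assumes that rows without exactly two distinct bases are printed as they are.

```sh
# Hypothetical wrapper: keep a row unless it is biallelic with minor/major < 30%.
filter_biallelic() {
  awk '
  {
    split("", cnt); kinds = 0
    for (i = 2; i <= NF; i++) {
      # Regex match, not string comparison: anything other than
      # A, C, G or T (N, Y, ...) counts as missing. The field is
      # left untouched, so the row prints exactly as it came in.
      if ($i !~ /^[ACGT]$/) continue
      if (!($i in cnt)) name[++kinds] = $i
      cnt[$i]++
    }
    # Assumption: rows that are not biallelic are never filtered.
    if (kinds != 2) { print; next }
    a = cnt[name[1]]; b = cnt[name[2]]
    lo = (a < b) ? a : b
    hi = (a < b) ? b : a
    # Keep the row when minor/major is at least 30% (integer comparison).
    if (lo * 100 >= hi * 30) print
  }'
}
```

Note that this keeps a row sitting at exactly 30%, because the rule only removes rows below 30%; the script above would drop it since it tests for strictly greater than 30.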