Filtering my major and minor values

I want to remove from my table all rows whose minor repeating count is less than 30% of the major repeating count. Each column (starting from column 2) can hold one of the values A, T, G, C or N. Each row has at least 2 and at most 4 distinct repeating values (out of A, T, G, C).
N is considered a missing value and shouldn't be counted.

These are the rules for filtering.

Consider the row which has a count of 4 for Ts and 1 for As (starting col 2).
S10_14113025 T T T A T
If the count of the minor repeating value is less than 30% of the major repeating value, delete the row.

So count(A)/count(T)=1/4=25% < 30%...this row should be removed.

Consider the row with 2 Ts, and 1 A.
S10_14113025 T N N A T
Ignoring the Ns, the minor frequency is
count(A)/count(T)=1/2=50% > 30% ....this row should NOT be removed.

Consider the row with more than 2 values (3 in this case as in G,C,A).
S10_14113072 G C A G N
This row should NOT be removed; nothing needs to be calculated.

Input

S10_14113025        T    T    T    A    T    T
S10_14113072        A    C    C    A    A    A
S10_14113073        G    C    G    G    C    N
S10_14113079        G    C    C    C    N    N
S10_14113080        G    C    C    C    N    A
S10_14113027        T    T    N    A    N    N

Desired output

S10_14113072        A    C    C    A    A    A
S10_14113073        G    C    G    G    C    N
S10_14113080        G    C    C    C    N    A
S10_14113027        T    T    N    A    N    N

Given your explanation, I don't understand the LAST example. Why "nothing needs to be calculated"?
Are you only considering As and Ts?

We only need to filter rows that are biallelic (exactly two distinct values, excluding N).
Rows with more than two values are biologically significant and can't be filtered out.

OK, I almost got it, but...
why exactly did this line NOT make it into the output?

S10_14113079        G    C    C    C    N    N

it's a bit verbose, but can be used as a start.
awk -f newbie.awk myInputFile
newbie.awk:

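# reset the per-row state; split("", arr) empties an array portably
function initVars()
{
  split("",n)
  split("",a)
  c=0
}

{
  # count each non-missing value; n[] remembers the order of first appearance
  for(i=2;i<=NF;i++)
    if ($i != "N") {
     if (!($i in a))
       n[++c]=$i
     a[$i]++
    }

  # more than two distinct values: keep the row without any calculation
  if (c>2) { initVars(); print;next }

  # ratio of the two counts, inverted if needed so that it is minor/major
  div=a[n[1]]/a[n[2]]
  div=(div>1)?1/div:div
  # only rows below 30% are dropped, so exactly 30% is kept
  if ( div*100 >= 30)
     print
  initVars()
}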

I'm sorry, that line should be included... my bad.

then try the suggestion

try also:

awk '
{
  delete a; mx=0; mn=100;
  for (i=2; i<=NF; i++) a[$i]++;
  for (i in a) {
    if (a[i] > mx ) mx=a[i];
    if (a[i] < mn ) mn=a[i];
  }
}
(mn / mx) * 100 > 30
' input

good idea, but.... a couple of missing points:

  1. you're counting Ns as legit fields while they should be skipped
  2. there's no provision to print lines with more than 2 unique field values
  3. 'delete a' isn't portable: some older awks can only delete individual array elements, not a whole array at once. The usual trick is to use 'split' to empty the array in one step, e.g. split("", a).
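
Putting those three points together, the one-liner could look roughly like the sketch below. Treat it as a starting point only: filter_biallelic is a made-up wrapper name, passing through rows with fewer than two distinct values is my assumption (the thread says such rows don't occur), and I compare integers (mn*100 >= 30*mx) so that a ratio of exactly 30% is kept, which is how I read the rule.

```shell
# filter_biallelic: hypothetical wrapper name, not from the thread
filter_biallelic() {
  awk '
  {
    split("", a)                  # portable way to empty the array
    c = 0; mx = 0; mn = 0
    for (i = 2; i <= NF; i++)
      if ($i != "N") a[$i]++      # point 1: skip the missing value N
    for (k in a) {
      c++
      if (a[k] > mx) mx = a[k]
      if (mn == 0 || a[k] < mn) mn = a[k]
    }
    # point 2: rows that are not biallelic are printed as they are
    # (assumption: this includes rows with fewer than two distinct values)
    if (c != 2 || mn * 100 >= 30 * mx) print
  }' "$@"
}
```

Usage: filter_biallelic myInputFile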

Hi,

There is one more filter that needs to be considered.
I want to treat any character other than A,C,G,T,N as a missing value N.
In the example below, the character Y is treated as a missing value N since it
does not belong to the set {A,C,G,T,N}.

For example, the row

S10_14113072        A    C    C    A    A    Y

needs to be treated the same as

S10_14113072        A    C    C    A    A    N

But if such a row appears in the output, it should appear unchanged, as

S10_14113072        A    C    C    A    A    Y

---------- Post updated at 06:02 PM ---------- Previous update was at 05:57 PM ----------

I have made a small modification

if ($i != "[ACGTN]") { $i = "N" }

It doesn't seem to work. Please help me with the correct code.

function initVars()
{
  split("",n)
  split("",a)
  c=0
}

{
  for(i=2;i<=NF;i++)
    if ($i != "[ACGTN]") { $i = "N" } 
    if ($i != "N") {
     if (!($i in a))
       n[++c]=$i
     a[$i]++
    }

  if (c>2) { initVars(); print;next }

  div=a[n[1]]/a[n[2]]
  div=(div>1)?1/div:div
  if ( div*100 > 30)
     print
  initVars()
}

Change

  for(i=2;i<=NF;i++)
    if ($i != "[ACGTN]") { $i = "N" } 
    if ($i != "N") {
     if (!($i in a))
       n[++c]=$i
     a[$i]++
    }

to

  for(i=2;i<=NF;i++)
    if ($i ~ /[ACGT]/) {
     if (!($i in a))
       n[++c]=$i
     a[$i]++
    }
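
For what it's worth, the attempt failed for two reasons. First, != compares the field against the literal string "[ACGTN]", so the test is true for every one-character field and each of them gets overwritten with N; the brackets only act as a character class in a regular-expression match with ~. Second, without braces the for loop only covers the first if. A small sketch of the difference between the two tests (classify is a made-up helper name, and the anchored /^[ACGT]$/ is my stricter variant of the pattern above, not something from the thread):

```shell
# classify: hypothetical helper that prints, for each input value,
# the result of the string comparison and of the regex match
classify() {
  awk '{
    literal = ($1 != "[ACGTN]")   # plain string comparison with a 7-character string
    regex   = ($1 ~ /^[ACGT]$/)   # true only for a single A, C, G or T
    print $1, literal, regex
  }'
}

printf 'A\nY\nN\n' | classify
# A 1 1
# Y 1 0
# N 1 0
```

The string comparison is 1 (true) for every value, while the regex match is only true for A.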
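
awk '{
    l_max=l_min=cnt=0
    delete _
    # count each value in columns 2..NF, skipping the missing value N
    for(i=2;i<=NF;i++)
        if($i!="N")
            _[$i]++
    # find the largest and smallest counts among the values seen
    for(i in _){
        cnt++
        if(l_max==0 || _[i]>=l_max)
            l_max=_[i]
        if(l_min==0 || _[i]<=l_min)
            l_min=_[i]
    }
    # rows with more than two distinct values are always kept
    if(cnt>2 || (cnt==2 && l_min/l_max>=0.3))
        print $0
}' yourfile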

change

if ($i != "N") {

to

if ($i != "N" && $i ~ /[ACGT]/) {
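
With that change applied, the whole of newbie.awk might end up something like the sketch below. filter_rows is just a wrapper name I made up; the anchored pattern /^[ACGT]$/, the pass-through for rows with fewer than two distinct values and the integer comparison are my assumptions, not something stated in the thread. Since $i is never assigned, a row containing Y is printed exactly as it came in.

```shell
# filter_rows: hypothetical wrapper around a variant of newbie.awk
filter_rows() {
  awk '
  function initVars() { split("", n); split("", a); c = 0 }
  {
    initVars()
    for (i = 2; i <= NF; i++)
      if ($i ~ /^[ACGT]$/) {        # anything else (N, Y, ...) counts as missing
        if (!($i in a)) n[++c] = $i
        a[$i]++
      }
    # assumption: only rows with exactly two distinct values are filtered
    if (c != 2) { print; next }
    hi = a[n[1]]; lo = a[n[2]]
    if (lo > hi) { t = hi; hi = lo; lo = t }   # make hi the major count
    if (lo * 100 >= 30 * hi) print             # keep when minor/major >= 30%
  }' "$@"
}
```

Usage: filter_rows myInputFile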