Filtering my major and minor values

newbie83 · January 3, 2013, 4:33pm

I want to remove from my table all rows whose minor repeating count is less than 30% of the major repeating count. The values in each column (starting at column 2) can be A, T, G, C or N. Each row has at least 2 and at most 4 distinct repeating values (out of A, T, G, C).
N is considered a missing value and should be ignored.

These are the rules for filtering.

Consider the row which has a count of 4 for Ts and 1 for As (starting col 2).
S10_14113025 T T T A T
If the count of the minor repeating value is less than 30% of the major repeating value, delete the row.

So count(A)/count(T)=1/4=25% < 30%...this row should be removed.

Consider the row with 2 Ts, and 1 A.
S10_14113025 T N N A T
Ignoring the Ns, the minor frequency is
count(A)/count(T)=1/2=50% > 30% ....this row should NOT be removed.

Consider the row with more than 2 values (3 in this case as in G,C,A).
S10_14113072 G C A G N
This row should NOT be removed; nothing needs to be calculated.
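The three rules above can be checked by hand with a small sketch like the one below. It assumes whitespace-separated columns with the row ID in column 1. The function name minor_ratio is hypothetical and used only for illustration, and passing through rows with fewer than two distinct values is an assumption, since the rules do not cover that case.

```shell
# Hypothetical helper: print the minor/major counts and the decision for each row.
minor_ratio() {
  awk '
  {
    split("", cnt); kinds = 0; major = 0; minor = 0
    for (i = 2; i <= NF; i++) {
      if ($i == "N") continue            # N is a missing value
      if (!($i in cnt)) kinds++          # first time this value is seen
      cnt[$i]++
    }
    if (kinds != 2) {                    # only rows with exactly two values are tested
      print $1, "kept (not biallelic)"
      next
    }
    for (k in cnt) {
      if (cnt[k] > major) major = cnt[k]
      if (minor == 0 || cnt[k] < minor) minor = cnt[k]
    }
    pct = 100 * minor / major
    print $1, minor "/" major, (pct < 30 ? "removed" : "kept")
  }'
}

minor_ratio <<'EOF'
S10_14113025 T T T A T
S10_14113025 T N N A T
S10_14113072 G C A G N
EOF
# prints:
# S10_14113025 1/4 removed
# S10_14113025 1/2 kept
# S10_14113072 kept (not biallelic)
```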

Input

S10_14113025        T    T    T    A    T    T
S10_14113072        A    C    C    A    A    A
S10_14113073        G    C    G    G    C    N
S10_14113079        G    C    C    C    N    N
S10_14113080        G    C    C    C    N    A
S10_14113027        T    T    N    A    N    N

Desired output

S10_14113072        A    C    C    A    A    A
S10_14113073        G    C    G    G    C    N
S10_14113080        G    C    C    C    N    A
S10_14113027        T    T    N    A    N    N

vgersh99 · January 3, 2013, 4:49pm

given your explanation - I don't understand the LAST example. Why "nothing needs to be calculated"?
Are you only considering As and Ts?

newbie83 · January 3, 2013, 5:05pm

We only need to filter rows having biallelic nature (exactly two values excluding N).
The rows with more than two values are biologically significant and can't be filtered out.

vgersh99 · January 3, 2013, 5:12pm

ok, I almost got it, but....
why exactly did this line NOT make it into the output?

S10_14113079        G    C    C    C    N    N

vgersh99 · January 3, 2013, 5:14pm

it's a bit verbose, but can be used as a start.
awk -f newbie.awk myInputFile
newbie.awk:

function initVars()
{
  split("",n)
  split("",a)
  c=0
}

{
  for(i=2;i<=NF;i++)
    if ($i != "N") {
     if (!($i in a))
       n[++c]=$i
     a[$i]++
    }

  if (c>2) { initVars(); print;next }

  div=a[n[1]]/a[n[2]]
  div=(div>1)?1/div:div
  if ( div*100 > 30)
     print
  initVars()
}

newbie83 · January 3, 2013, 5:14pm

I'm sorry, that line should be included... my bad.

vgersh99 · January 3, 2013, 5:18pm

then try the suggestion

rdrtx1 · January 3, 2013, 5:32pm

try also:

awk '
{
  delete a; mx=0; mn=100;
  for (i=2; i<=NF; i++) a[$i]++;
  for (i in a) {
    if (a[i] > mx ) mx=a[i];
    if (a[i] < mn ) mn=a[i];
  }
}
(mn / mx) * 100 > 30
' input

vgersh99 · January 3, 2013, 5:53pm

good idea, but.... a couple of missing points:

you're counting Ns as legit fields while they should be skipped
there's no provision to print lines with more than 2 unique field values
'delete a' doesn't work on all awk-s. Most awk's don't allow deleting the whole array that way, but only individual array entries. Therefore, the trick is to use 'split' to null-out the array in one step.
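As a rough sketch, the min/max approach with those three points addressed might look like the following. The function name filter_minmax is hypothetical. Passing through every row that does not have exactly two distinct non-N values (which covers the "more than 2" case) is an assumption based on the statement earlier in the thread that only biallelic rows are filtered.

```shell
# Sketch: min/max counting that skips N, passes non-biallelic rows through,
# and clears the array with split("", a) instead of "delete a".
filter_minmax() {
  awk '
  {
    split("", a); kinds = 0; mx = 0; mn = 0
    for (i = 2; i <= NF; i++) {
      if ($i == "N") continue            # skip missing values
      if (!($i in a)) kinds++            # count distinct non-N values
      a[$i]++
    }
    if (kinds != 2) { print; next }      # assumption: only biallelic rows are filtered
    for (k in a) {
      if (a[k] > mx) mx = a[k]
      if (mn == 0 || a[k] < mn) mn = a[k]
    }
    if (mn / mx * 100 > 30) print        # keep when minor is above 30% of major
  }'
}

filter_minmax <<'EOF'
S10_14113025        T    T    T    A    T    T
S10_14113072        A    C    C    A    A    A
S10_14113073        G    C    G    G    C    N
S10_14113079        G    C    C    C    N    N
S10_14113080        G    C    C    C    N    A
S10_14113027        T    T    N    A    N    N
EOF
```

On the sample input this drops only S10_14113025 (1/5 = 20%) and prints the other five rows unchanged.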

newbie83 · January 4, 2013, 5:02pm

Hi,

There is one more filter that needs to be considered.
I want to treat any character other than A,C,G,T,N as a missing value N.
In the example, the character Y is treated as if it was a missing value N since it
does not belong to the subset {A,C,G,T,N}.

For example the row

S10_14113072        A    C    C    A    A    Y

needs to be treated the same as

S10_14113072        A    C    C    A    A    N

But if such a row appears in the output it should appear as

S10_14113072        A    C    C    A    A    Y

---------- Post updated at 06:02 PM ---------- Previous update was at 05:57 PM ----------

I have made a small modification:

if ($i != "[ACGTN]") { $i = "N" }

It doesn't seem to work. Could you please help me with the correct code?

function initVars()
{
  split("",n)
  split("",a)
  c=0
}

{
  for(i=2;i<=NF;i++)
    if ($i != "[ACGTN]") { $i = "N" } 
    if ($i != "N") {
     if (!($i in a))
       n[++c]=$i
     a[$i]++
    }

  if (c>2) { initVars(); print;next }

  div=a[n[1]]/a[n[2]]
  div=(div>1)?1/div:div
  if ( div*100 > 30)
     print
  initVars()
}

binlib · January 4, 2013, 9:11pm

Change

  for(i=2;i<=NF;i++)
    if ($i != "[ACGTN]") { $i = "N" } 
    if ($i != "N") {
     if (!($i in a))
       n[++c]=$i
     a[$i]++
    }

to

  for(i=2;i<=NF;i++)
    if ($i ~ /[ACGT]/) {
     if (!($i in a))
       n[++c]=$i
     a[$i]++
    }
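Some background on why the original attempt fails: $i != "[ACGTN]" is a plain string comparison against the literal text [ACGTN], so it is true for every single-character field, and a for loop without braces repeats only the first statement that follows it. Assigning to $i also rewrites the row, which conflicts with the requirement that Y still appears in the output. A minimal sketch of the whole script with the regex match in place is below. The function name filter_acgt is hypothetical, anchoring the pattern with ^ and $ is an extra precaution not in the post above, and passing through rows that do not have exactly two distinct values is an assumption.

```shell
# Sketch: count only fields that are exactly A, C, G or T; anything else
# (N, Y, ...) is ignored for counting but left untouched in the output.
filter_acgt() {
  awk '
  {
    split("", a); split("", n); c = 0
    for (i = 2; i <= NF; i++) {          # braces keep the whole body inside the loop
      if ($i ~ /^[ACGT]$/) {             # regex match, anchored to the whole field
        if (!($i in a)) n[++c] = $i
        a[$i]++
      }
    }
    if (c != 2) { print; next }          # assumption: only biallelic rows are filtered
    div = a[n[1]] / a[n[2]]
    if (div > 1) div = 1 / div           # always minor/major
    if (div * 100 > 30) print            # $0 was never modified, so Y is printed as is
  }'
}

printf '%s\n' 'S10_14113072 A C C A A Y' 'S10_14113025 T T T A T Y' | filter_acgt
# prints: S10_14113072 A C C A A Y
```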

summer_cherry · January 5, 2013, 1:47am

awk '{
    l_max=l_min=hasN=cnt=0
    delete _
    for(i=2;i<=NF;i++){
        if($i=="N")
            hasN=1
        _[$i]++
    }
    for(i in _){
        cnt++
        if(l_max==0 || _[i]>=l_max)
            l_max=_[i]
        if(l_min==0 || _[i]<=l_min)
            l_min=_[i]
    }
    if((cnt==2 && hasN==0) || (cnt==3 && hasN==1)){
        per=l_min/l_max
        if(per>=0.3)
            print $0
    }
}' yourfile

vgersh99 · January 5, 2013, 10:09am

change

if ($i != "N") {

to

if ($i != "N" && $i ~ /[ACGT]/) {