How to print values that are greater than 0.1 in at least 80% of the samples?

input

        sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
v1    0.2     0.1     0.1    0       1       2       3       4       9       10
v2    0       0       0.01    0       0       0       0       0       0       0
v3    0       0       0       0       0       0       0       0       0       0
v4    0.2     0       0       0       0       0       0       0       0       0
v5    0.1     0.1     0.2     0.2     10      2       3       5       6       7

output

sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
v5    0.1     0.1     0.2     0.2     10      2       3       5       6       7

Tried
I tried something that prints lines containing any value greater than 0.1:

 awk '{for (i=2;i<=NF;i++) if ($i>0.1) {print $0;next} }'
awk '{ T=0 ; for (i=2;i<=NF;i++) if ($i>0.1) {T++ }  if(T > (NF * 0.8)) print}'

@corona688: It is just printing the header. I think something is wrong?

Try

awk '{L=1; for (i=2;i<=NF;i++) L=L*($i>=0.1)} L' file4
        sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
v5    0.1     0.1     0.2     0.2     10      2       3       5       6       7

Please be aware that the header is printed "by sheer coincidence". To make it print safely, add something like NR==1; or so...
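
To illustrate the "coincidence": header fields such as sample1 do not look like numbers, so awk compares them as strings, and "sample1" happens to sort after "0.1". A minimal sketch that prints the header explicitly instead, assuming a single header line and keeping the "every value >= 0.1" test from the command above (the function name all_ge is made up for this example):

```shell
# all_ge: hypothetical helper; prints the header, then rows whose every value is >= 0.1
all_ge() {
    awk 'NR == 1 { print; next }                          # header: print it and skip the test
         { L = 1                                          # assume the row passes
           for (i = 2; i <= NF; i++) L = L * ($i >= 0.1)  # any failing field zeroes L
         }
         L                                                # non-zero L prints the row
        ' "$@"
}
```

Called as all_ge file4, this should no longer depend on how the header text happens to compare with 0.1.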

@Corona688: I'm not sure I get your logic. Could you explain?

Hello quincyjones,

Could you please try the following and let me know if this helps (a small addition to Corona's code).

awk 'NR==1{print; next} {T=0; for (i=2;i<=NF;i++) if ($i>0.1) {T++}; if (T >= ((NF-1) * 0.8)) print}'   Input_file

Output will be as follows.

        sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
v5    0.1     0.1     0.2     0.2     10      2       3       5       6       7

Thanks,
R. Singh

Rats! I didn't read the title ... be back soon ...

---------- Post updated at 17:06 ---------- Previous update was at 16:57 ----------

... adapting Corona688's proposal slightly (as there are 10 data fields but 11 fields in total, and the request was "at least 80%"):

awk     '{T=0; for (i=2;i<=NF;i++) T+=($i>0.1)}
         T >= ((NF-1) * 0.8)
        ' file
        sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
v5    0.1     0.1     0.2     0.2     10      2       3       5       6       7

Thank you all. It's working great. However, could someone please explain the logic behind it?

NF is the number of fields.

If T is greater than 80% of NF, print.
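
To spell the logic out a bit more: a comparison like ($i>0.1) evaluates to 1 or 0, so adding it up counts the fields above the threshold, and a pattern without an action prints the line when it is true. Below is a commented sketch along the lines of the adapted version above. The wrapper name at_least and the variables thr and frac are my own additions, and it assumes one header line and a row label in field 1:

```shell
# at_least: hypothetical wrapper; usage: at_least FILE
at_least() {
    awk -v thr=0.1 -v frac=0.8 '
        NR == 1 { print; next }              # keep the header line as is
        {
            T = 0                            # reset the counter for every row
            for (i = 2; i <= NF; i++)        # field 1 is the row label, so start at 2
                T += ($i > thr)              # a comparison yields 1 (true) or 0 (false)
        }
        T >= (NF - 1) * frac                 # pattern without action: print when true
    ' "$@"
}
```

With 10 samples, (NF - 1) * frac is 8, so a row needs at least 8 values above 0.1 to be printed.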

Is it possible to extend the same code to calculate the 80% in each group separately, as in the following?

Input

        group1  group1  group1  group1  group1  group1  group1  group1  group1  group1  group2  group2  group2  group2  group2  group2  group2  group2  group2  group2
        sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10        sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
v1    0.2     0.1     0.1    0       1       2       3       4       9       10 0.2     0.1     0.1    0       1       2       3       4       9       10
v2    0       0       0.01    0       0       0       0       0       0       0 0       0       0.01    0       0       0       0       0       0       0
v3    0       0       0       0       0       0       0       0       0       0 0       0       0       0       0       0       0       0       0       0
v4    0.2     0       0       0       0       0       0       0       0       0 0.1     0.1     0.2     0.2     10      2       3       5       6       7
v5    0.1     0.1     0.2     0.2     10      2       3       5       6       7 0.2     0       0       0       0       0       0       0       0       0

output

         group1  group1  group1  group1  group1  group1  group1  group1  group1  group1  group2  group2  group2  group2  group2  group2  group2  group2  group2  group2
        sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10        sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
v4    0.2     0       0       0       0       0       0       0       0       0 0.1     0.1     0.2     0.2     10      2       3       5       6       7
v5    0.1     0.1     0.2     0.2     10      2       3       5       6       7 0.2     0       0       0       0       0       0       0       0       0

That certainly is possible.
Why do lines v2, v3, v4 show up in your sample output?
Are there always two groups? Of identical length?
What is the exact condition for when to print and when not?

Oops, I corrected it now. The values should be greater than 0.1 in at least 80% of the samples in at least one group. E.g. v4 satisfies this condition in group2 and v5 in group1.

Hello quincyjones,

I think the output should be v1. The following may help you with this; please let me know if it helps.

 awk '{for(i=2;i<=11;i++){if($i > .1 && $(i+10) > .1){T=1}};if(T){print $0;T=""}}'  Input_file

Output will be as follows.

        group1  group1  group1  group1  group1  group1  group1  group1  group1  group1  group2  group2  group2  group2  group2  group2  group2  group2  group2  group2
        sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10        sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
v1    0.2     0.1     0.1    0       1       2       3       4       9       10 0.2     0.1     0.1    0       1       2       3       4       9       10

EDIT: Sorry, there was a typo here; I have changed the output now.

Thanks,
R. Singh

It is v4 and v5. v1 has three samples with values <= 0.1 in each of group1 and group2, so it doesn't satisfy the condition "greater than 0.1 in at least 80% of the samples in a specific group". Hope that is clear.

You didn't answer my second & third question.

Sorry for not being so clear. Thanks.

This is for exactly the sample you posted - two groups of 10 members each:

awk     '       {G1=G2=0
                 for (i=2;i<=11;i++) {G1+=($i>0.1); G2+=($(i+10)>0.1)}
                }
         G1 >= 8 || G2 >= 8
        ' file
        group1  group1  group1  group1  group1  group1  group1  group1  group1  group1  group2  group2  group2  group2  group2  group2  group2  group2  group2  group2
        sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10        sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
v4    0.2     0       0       0       0       0       0       0       0       0 0.1     0.1     0.2     0.2     10      2       3       5       6       7
v5    0.1     0.1     0.2     0.2     10      2       3       5       6       7 0.2     0       0       0       0       0       0       0       0       0

No flexibility at all for changing group sizes or group count; there must be two groups of exactly 10 members each.

Thank you RudiC, I misunderstood the requirement. I thought we needed to compare the groups (which is correct), but I didn't get the 80% concept; I thought the user was asking that the line be printed if any group is above 80%.

Thanks,
R. Singh

So I think it doesn't work with multiple groups with different sample sizes?

ex:

        g1      g1      g1      g1      g1      g2      g2      g2      g2      g2      g3      g3      g3      g3      g3      g3      g3      g3      g3      g3
        s1      s2      s3      s4      s5      s1      s2      s3      s4      s5      s1      s2      s3      s4      s5      s6      s7      s8      s9      s10
v1      0       0.1     0.1     0.1     0.1     0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
v2      0.1     0.1     0.1     0.1     0       0       0       0       0       0       0       0       1       2       3       4       5       6       6       6
v3      0       0       0       0       0       0       0       0       0       0       0       0       0       1       0       1       0       0       0       0
v4      1       0       0       0       0       0       0       0       0       1       1       1       1       1       0       0       0       0       0       0
v5      0.2     0.2     0.2     0.2     0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0

output

        g1      g1      g1      g1      g1      g2      g2      g2      g2      g2      g3      g3      g3      g3      g3      g3      g3      g3      g3      g3
        s1      s2      s3      s4      s5      s1      s2      s3      s4      s5      s1      s2      s3      s4      s5      s6      s7      s8      s9      s10
v2      0.1     0.1     0.1     0.1     0       0       0       0       0       0       0       0       1       2       3       4       5       6       6       6
v5      0.2     0.2     0.2     0.2     0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0

ps:

Well, try this. It was developed for your former sample, but it seems to work with the current one:

awk     'NR==1  {for (i=1; i<=NF; i++) GRCNT[$i]++
#                                                               for (i in GRCNT) print i, GRCNT[i]
                }

                {COL=2
                 for (gc in GRCNT)      {TOT[gc]=0
                                         STP=COL+GRCNT[gc]
                                         for (;COL<STP;COL++) TOT[gc]+=($COL>0.1)
                                        }

                 for (gc in TOT)        {#                      print gc, GRCNT[gc], TOT[gc]
                                         if (TOT[gc] >= GRCNT[gc] * 0.8) {print; break}
                                        }
                }
        ' file
        g1      g1      g1      g1      g1      g2      g2      g2      g2      g2      g3      g3      g3      g3      g3      g3      g3      g3      g3      g3
        s1      s2      s3      s4      s5      s1      s2      s3      s4      s5      s1      s2      s3      s4      s5      s6      s7      s8      s9      s10
v2      0.1     0.1     0.1     0.1     0       0       0       0       0       0       0       0       1       2       3       4       5       6       6       6
v5      0.2     0.2     0.2     0.2     0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0

The two commented out print statements are for debugging if you need some insight into the script's internal operation...
It still needs groups to be in adjacent columns and the groups to start in col 2.

There seems to be a bug in the script. For example, it couldn't print v4 (which satisfies the condition in group g2) and v5 (which satisfies the condition in group gn).

input

        g1      g1      g1      g1      g1      g1      g1      g1      g1      g1      g2      g2      g2      g2      g2      g2      g2      g2      g2      g2      g2      g2      g2      g2  g2       g2      g2      g2      g2      g2      gn      gn      gn      gn      gn
        t1      t2      t3      t4      t5      t6      t7      t8      t9      t10     t1      t2      t3      t4      t5      t6      t7      t8      t9      t10     t11     t12     t13     t14     t15      t16     t17     t18     t19     t20     t1      t2      t3      t4      t5
v1    0       0       0       0       0       0       0       0       0       0.1     0.1     0.1     0.1     0.1     0       0       0       0       0       0       0       0       0       0   0     0       0       0       0       0       0       0       0       0       0
v2    0.2     0.1     0.2     0.2     0.2     2       2       2       2       2       0       0       0       0       0       0       0       0       0       0       0       0       0       0   0     0       0       0       0       0       0       0       0       0       0
v3    0       0       0       0       0       0       0       0       0       0       1       2       3       2       2       2       2       2       2       2       2       2       2       2   2     0       0       0       0       0       0       0       0       0       0
v4    0       0       0       0       0       0       0       0       0       0       0.2     0.2     2       2       2       2       2       2       2       2       2       2       2       2   2     2       2       2       2       2       0       0       0       0       0
v5    0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0   0     0       0       0       0       0       1       1       1       1       1

Output

        g1      g1      g1      g1      g1      g1      g1      g1      g1      g1      g2      g2      g2      g2      g2      g2      g2      g2      g2      g2      g2      g2      g2      g2  g2       g2      g2      g2      g2      g2      gn      gn      gn      gn      gn
        t1      t2      t3      t4      t5      t6      t7      t8      t9      t10     t1      t2      t3      t4      t5      t6      t7      t8      t9      t10     t11     t12     t13     t14     t15      t16     t17     t18     t19     t20     t1      t2      t3      t4      t5
v2      0.2     0.1     0.2     0.2     0.2     2       2       2       2       2       0       0       0       0       0       0       0       0       0       0       0       0       0       0   0     0       0       0       0       0       0       0       0       0       0

Output should be

        g1      g1      g1      g1      g1      g1      g1       g1      g1      g1      g2      g2      g2      g2      g2      g2       g2      g2      g2      g2      g2      g2      g2      g2  g2        g2      g2      g2      g2      g2      gn      gn      gn      gn       gn
        t1      t2      t3      t4      t5      t6      t7       t8      t9      t10     t1      t2      t3      t4      t5      t6       t7      t8      t9      t10     t11     t12     t13     t14     t15       t16     t17     t18     t19     t20     t1      t2      t3      t4       t5
v2     0.2     0.1     0.2     0.2     0.2     2       2       2       2        2       0       0       0       0       0       0       0       0        0       0       0       0       0       0   0     0       0       0        0       0       0       0       0       0       0
v4    0       0        0       0       0       0       0       0       0       0       0.2      0.2     2       2       2       2       2       2       2       2        2       2       2       2   2     2       2       2       2       2        0       0       0       0       0
v5    0       0       0        0       0       0       0       0       0       0       0       0        0       0       0       0       0       0       0       0       0        0       0       0   0     0       0       0       0       0       1        1       1       1       1
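
One likely explanation for the missing rows: awk does not guarantee any particular order for "for (gc in GRCNT)", so the column ranges can end up assigned to the wrong groups (for instance the first 5 columns counted as gn). A sketch that avoids this by remembering the group of every column from the first header line is shown below. The names by_group, grp, size and hits are my own, and it assumes exactly two header lines and that the header has no label field, so header field i belongs to data field i+1, as in the samples above:

```shell
# by_group: hypothetical helper; usage: by_group FILE
by_group() {
    awk -v thr=0.1 -v frac=0.8 '
        NR == 1 {                            # first header line holds the group names
            for (i = 1; i <= NF; i++) {
                grp[i + 1] = $i              # no label field in the header: data column is i+1
                size[$i]++                   # number of samples per group
            }
            print; next
        }
        NR == 2 { print; next }              # second header line: sample names
        {
            split("", hits)                  # clear the per-row counters
            for (c = 2; c <= NF; c++)
                hits[grp[c]] += ($c > thr)   # count values above the threshold per group
            for (g in size)                  # order does not matter: any group may qualify
                if (hits[g] >= size[g] * frac) { print; next }
        }
    ' "$@"
}
```

Because each column carries its own group name, the groups no longer need to sit in adjacent columns. Checked by hand against the last sample, this should print v2 (9 of 10 in g1), v4 (20 of 20 in g2) and v5 (5 of 5 in gn).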