How to print values that are greater than 0.1 in at least 80% of the samples?

quincyjones · February 17, 2015, 10:02am

input

        sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
v1    0.2     0.1     0.1    0       1       2       3       4       9       10
v2    0       0       0.01    0       0       0       0       0       0       0
v3    0       0       0       0       0       0       0       0       0       0
v4    0.2     0       0       0       0       0       0       0       0       0
v5    0.1     0.1     0.2     0.2     10      2       3       5       6       7

output

sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
v5    0.1     0.1     0.2     0.2     10      2       3       5       6       7

Tried
I tried something that prints any line containing at least one value greater than 0.1:

 awk '{for (i=2;i<=NF;i++) if ($i>0.1) {print $0;next} }'

Corona688 · February 17, 2015, 10:42am

awk '{ T=0 ; for (i=2;i<=NF;i++) if ($i>0.1) {T++ }  if(T > (NF * 0.8)) print}'

quincyjones · February 17, 2015, 10:49am

@Corona688: it is just printing the header. I think something is wrong?

RudiC · February 17, 2015, 10:55am

Try

awk '{L=1; for (i=2;i<=NF;i++) L=L*($i>=0.1)} L' file4
        sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
v5    0.1     0.1     0.2     0.2     10      2       3       5       6       7

Please be aware that the header is printed "by sheer coincidence". To make it print safely, add something like NR==1; or so...
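
A minimal sketch of the explicit header handling suggested above. The header probably slips through because its fields are not numbers, so `$i>=0.1` falls back to a string comparison that happens to be true; that explanation is an assumption, not something stated in the thread. The function name `all_above` and the temporary file are hypothetical, and the sample data is copied from the first post.

```shell
# Sample input from the first post, written to a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
        sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
v1    0.2     0.1     0.1    0       1       2       3       4       9       10
v2    0       0       0.01    0       0       0       0       0       0       0
v3    0       0       0       0       0       0       0       0       0       0
v4    0.2     0       0       0       0       0       0       0       0       0
v5    0.1     0.1     0.2     0.2     10      2       3       5       6       7
EOF

all_above() {
    awk 'NR == 1 { print; next }      # header: printed unconditionally, never tested
         { L = 1
           # L stays 1 only if every data field is >= 0.1
           for (i = 2; i <= NF; i++) L = L * ($i >= 0.1)
         }
         L' "$1"
}

all_above "$input"
```

With the sample above this prints the header and the v5 line, the same result as before, but without relying on how the header happens to compare.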

@Corona688: I'm not sure I get your logic. Could you explain?

RavinderSingh13 · February 17, 2015, 10:55am

Hello quincyjones,

Could you please try the following and let me know if this helps (a small addition to Corona688's code).

awk '{if(NR==1){print $0} else {T=0; for (i=2;i<=NF;i++) if ($i>0.1) {T++ }  if(T >= ((NF-1) * 0.8)) {print}}}'   Input_file

Output will be as follows.

        sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
v5    0.1     0.1     0.2     0.2     10      2       3       5       6       7

Thanks,
R. Singh

RudiC · February 17, 2015, 11:06am

Rats! I didn't read the title ... be back soon ...

---------- Post updated at 17:06 ---------- Previous update was at 16:57 ----------

... adapting Corona688's proposal slightly (as there are 10 data fields but 11 in total; and the request was "at least 80%"):

awk     '{T=0; for (i=2;i<=NF;i++) T+=($i>0.1)}
         T >= ((NF-1) * 0.8)
        ' file
        sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
v5    0.1     0.1     0.2     0.2     10      2       3       5       6       7

quincyjones · February 17, 2015, 11:15am

Thank you all. It's working great. However, could someone please explain the logic behind

.

Corona688 · February 17, 2015, 12:23pm

NF is the number of fields.

If T is greater than 80% of NF, print.
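
To make the counting visible, here is a small sketch that prints, for each data row, the count T and the threshold it is compared against. It follows RudiC's adaptation (NF-1 data fields, "at least" 80%) rather than comparing against NF itself. The function name `explain_rows` and the temporary file are hypothetical; the sample data is from the first post.

```shell
# Sample input from the first post, written to a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
        sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
v1    0.2     0.1     0.1    0       1       2       3       4       9       10
v2    0       0       0.01    0       0       0       0       0       0       0
v3    0       0       0       0       0       0       0       0       0       0
v4    0.2     0       0       0       0       0       0       0       0       0
v5    0.1     0.1     0.2     0.2     10      2       3       5       6       7
EOF

explain_rows() {
    awk 'NR > 1 {
        T = 0                              # reset the counter for every row
        for (i = 2; i <= NF; i++)          # field 1 is the row label, so start at 2
            if ($i > 0.1) T++
        n = NF - 1                         # number of data fields on this row
        printf "%s: %d of %d values > 0.1, need %.1f -> %s\n",
               $1, T, n, n * 0.8, (T >= n * 0.8 ? "print" : "skip")
    }' "$1"
}

explain_rows "$input"
```

For v1 this reports 7 of 10 (skip) and for v5 it reports 8 of 10 (print), which is why only v5 survives the filter.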

quincyjones · February 23, 2015, 5:43am

Is it possible to extend the same code so that the 80% is calculated in each group separately, like the following?

Input

        group1  group1  group1  group1  group1  group1  group1  group1  group1  group1  group2  group2  group2  group2  group2  group2  group2  group2  group2  group2
        sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10        sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
v1    0.2     0.1     0.1    0       1       2       3       4       9       10 0.2     0.1     0.1    0       1       2       3       4       9       10
v2    0       0       0.01    0       0       0       0       0       0       0 0       0       0.01    0       0       0       0       0       0       0
v3    0       0       0       0       0       0       0       0       0       0 0       0       0       0       0       0       0       0       0       0
v4    0.2     0       0       0       0       0       0       0       0       0 0.1     0.1     0.2     0.2     10      2       3       5       6       7
v5    0.1     0.1     0.2     0.2     10      2       3       5       6       7 0.2     0       0       0       0       0       0       0       0       0

output

         group1  group1  group1  group1  group1  group1  group1  group1  group1  group1  group2  group2  group2  group2  group2  group2  group2  group2  group2  group2
        sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10        sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
v4    0.2     0       0       0       0       0       0       0       0       0 0.1     0.1     0.2     0.2     10      2       3       5       6       7
v5    0.1     0.1     0.2     0.2     10      2       3       5       6       7 0.2     0       0       0       0       0       0       0       0       0

RudiC · February 23, 2015, 8:21am

That certainly is possible.
Why do lines v2, v3, v4 show up in your sample output?
Are there always two groups? Of identical length?
What would be the exact condition for when to print and when not?

quincyjones · February 23, 2015, 8:27am

Oops, I have corrected it now. A row should be printed if its values are greater than 0.1 in at least 80% of the samples in at least one group. E.g. v4 satisfies this condition in group2 and v5 in group1.

RavinderSingh13 · February 23, 2015, 8:40am

Hello quincyjones,

I think the output should be v1. The following may help with that; please let me know if it does.

 awk '{for(i=2;i<=11;i++){if($i > .1 && $(i+10) > .1){T=1}};if(T){print $0;T=""}}'  Input_file

Output will be as follows.

        group1  group1  group1  group1  group1  group1  group1  group1  group1  group1  group2  group2  group2  group2  group2  group2  group2  group2  group2  group2
        sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10        sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
v1    0.2     0.1     0.1    0       1       2       3       4       9       10 0.2     0.1     0.1    0       1       2       3       4       9       10

EDIT: Sorry, there was a typo here; I have changed the output now.

Thanks,
R. Singh

quincyjones · February 23, 2015, 8:45am

It is v4 and v5, because in v1 three samples in each of group1 and group2 have values <= 0.1 (so it doesn't satisfy the condition "greater than 0.1 in at least 80% of the samples in a specific group"). Hope that is clear.

RudiC · February 23, 2015, 8:50am

You didn't answer my second & third question.

quincyjones · February 23, 2015, 8:54am

Sorry for not being so clear. Thanks.

RudiC · February 23, 2015, 8:56am

This is for exactly the sample you posted - two groups of 10 members each:

awk     '       {G1=G2=0
                 for (i=2;i<=11;i++) {G1+=($i>0.1); G2+=($(i+10)>0.1)}
                }
         G1 >= 8 || G2 >= 8
        ' file
        group1  group1  group1  group1  group1  group1  group1  group1  group1  group1  group2  group2  group2  group2  group2  group2  group2  group2  group2  group2
        sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10        sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
v4    0.2     0       0       0       0       0       0       0       0       0 0.1     0.1     0.2     0.2     10      2       3       5       6       7
v5    0.1     0.1     0.2     0.2     10      2       3       5       6       7 0.2     0       0       0       0       0       0       0       0       0

NO flexibility at all for changing group sizes or group count; count must be 10 each.

RavinderSingh13 · February 23, 2015, 9:04am

Thank you RudiC. I misunderstood the requirement: I thought we needed to compare the groups (which is correct), but I didn't get the 80% concept. I thought the user was asking to print the line if any group is above 80%.

Thanks,
R. Singh

quincyjones · February 23, 2015, 9:05am

So I think it doesn't work with multiple groups with different sample sizes?

ex:

        g1      g1      g1      g1      g1      g2      g2      g2      g2      g2      g3      g3      g3      g3      g3      g3      g3      g3      g3      g3
        s1      s2      s3      s4      s5      s1      s2      s3      s4      s5      s1      s2      s3      s4      s5      s6      s7      s8      s9      s10
v1      0       0.1     0.1     0.1     0.1     0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
v2      0.1     0.1     0.1     0.1     0       0       0       0       0       0       0       0       1       2       3       4       5       6       6       6
v3      0       0       0       0       0       0       0       0       0       0       0       0       0       1       0       1       0       0       0       0
v4      1       0       0       0       0       0       0       0       0       1       1       1       1       1       0       0       0       0       0       0
v5      0.2     0.2     0.2     0.2     0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0

output

        g1      g1      g1      g1      g1      g2      g2      g2      g2      g2      g3      g3      g3      g3      g3      g3      g3      g3      g3      g3
        s1      s2      s3      s4      s5      s1      s2      s3      s4      s5      s1      s2      s3      s4      s5      s6      s7      s8      s9      s10
v2      0.1     0.1     0.1     0.1     0       0       0       0       0       0       0       0       1       2       3       4       5       6       6       6
v5      0.2     0.2     0.2     0.2     0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0

ps:

RudiC · February 23, 2015, 9:16am

Well, try this. It was developed for your former sample, but it seems to work with the current one:

awk     'NR==1  {for (i=1; i<=NF; i++) GRCNT[$i]++
#                                                               for (i in GRCNT) print i, GRCNT[i]
                }

                {COL=2
                 for (gc in GRCNT)      {TOT[gc]=0
                                         STP=COL+GRCNT[gc]
                                         for (;COL<STP;COL++) TOT[gc]+=($COL>0.1)
                                        }

                 for (gc in TOT)        {#                      print gc, GRCNT[gc], TOT[gc]
                                         if (TOT[gc] >= GRCNT[gc] * 0.8) {print; break}
                                        }
                }
        ' file
        g1      g1      g1      g1      g1      g2      g2      g2      g2      g2      g3      g3      g3      g3      g3      g3      g3      g3      g3      g3
        s1      s2      s3      s4      s5      s1      s2      s3      s4      s5      s1      s2      s3      s4      s5      s6      s7      s8      s9      s10
v2      0.1     0.1     0.1     0.1     0       0       0       0       0       0       0       0       1       2       3       4       5       6       6       6
v5      0.2     0.2     0.2     0.2     0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0

The two commented out print statements are for debugging if you need some insight into the script's internal operation...
It still needs groups to be in adjacent columns and the groups to start in col 2.

quincyjones · February 24, 2015, 5:04am

There seems to be a bug in the script. For example, it doesn't print v4 (which satisfies the condition in group g2) or v5 (which satisfies the condition in group gn).

input

        g1      g1      g1      g1      g1      g1      g1      g1      g1      g1      g2      g2      g2      g2      g2      g2      g2      g2      g2      g2      g2      g2      g2      g2  g2       g2      g2      g2      g2      g2      gn      gn      gn      gn      gn
        t1      t2      t3      t4      t5      t6      t7      t8      t9      t10     t1      t2      t3      t4      t5      t6      t7      t8      t9      t10     t11     t12     t13     t14     t15      t16     t17     t18     t19     t20     t1      t2      t3      t4      t5
v1    0       0       0       0       0       0       0       0       0       0.1     0.1     0.1     0.1     0.1     0       0       0       0       0       0       0       0       0       0   0     0       0       0       0       0       0       0       0       0       0
v2    0.2     0.1     0.2     0.2     0.2     2       2       2       2       2       0       0       0       0       0       0       0       0       0       0       0       0       0       0   0     0       0       0       0       0       0       0       0       0       0
v3    0       0       0       0       0       0       0       0       0       0       1       2       3       2       2       2       2       2       2       2       2       2       2       2   2     0       0       0       0       0       0       0       0       0       0
v4    0       0       0       0       0       0       0       0       0       0       0.2     0.2     2       2       2       2       2       2       2       2       2       2       2       2   2     2       2       2       2       2       0       0       0       0       0
v5    0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0   0     0       0       0       0       0       1       1       1       1       1

Actual output

        g1      g1      g1      g1      g1      g1      g1      g1      g1      g1      g2      g2      g2      g2      g2      g2      g2      g2      g2      g2      g2      g2      g2      g2  g2       g2      g2      g2      g2      g2      gn      gn      gn      gn      gn
        t1      t2      t3      t4      t5      t6      t7      t8      t9      t10     t1      t2      t3      t4      t5      t6      t7      t8      t9      t10     t11     t12     t13     t14     t15      t16     t17     t18     t19     t20     t1      t2      t3      t4      t5
v2      0.2     0.1     0.2     0.2     0.2     2       2       2       2       2       0       0       0       0       0       0       0       0       0       0       0       0       0       0   0     0       0       0       0       0       0       0       0       0       0

Output should be

        g1      g1      g1      g1      g1      g1      g1       g1      g1      g1      g2      g2      g2      g2      g2      g2       g2      g2      g2      g2      g2      g2      g2      g2  g2        g2      g2      g2      g2      g2      gn      gn      gn      gn       gn
        t1      t2      t3      t4      t5      t6      t7       t8      t9      t10     t1      t2      t3      t4      t5      t6       t7      t8      t9      t10     t11     t12     t13     t14     t15       t16     t17     t18     t19     t20     t1      t2      t3      t4       t5
v2     0.2     0.1     0.2     0.2     0.2     2       2       2       2        2       0       0       0       0       0       0       0       0        0       0       0       0       0       0   0     0       0       0        0       0       0       0       0       0       0
v4    0       0        0       0       0       0       0       0       0       0       0.2      0.2     2       2       2       2       2       2       2       2        2       2       2       2   2     2       2       2       2       2        0       0       0       0       0
v5    0       0       0        0       0       0       0       0       0       0       0       0        0       0       0       0       0       0       0       0       0        0       0       0   0     0       0       0       0       0       1        1       1       1       1
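
A likely cause of the missing rows, offered as a guess rather than a confirmed diagnosis: awk does not guarantee the order in which `for (gc in GRCNT)` visits the group names, so the column ranges can end up assigned to the wrong groups. The sketch below sidesteps that by recording, from the first header line, which group each column belongs to, so groups no longer need to be walked in order. Assumptions not taken from the thread: exactly two header lines, no row label on the first header line, the hypothetical names `filter_groups`, `thr` and `pct`, and a small made-up input with unequal group sizes.

```shell
# Made-up sample with three groups of different sizes (5, 2 and 3 columns).
input=$(mktemp)
cat > "$input" <<'EOF'
      g1   g1   g1   g1   g1   g2   g2   gn   gn   gn
      t1   t2   t3   t4   t5   t1   t2   t1   t2   t3
v1    0    0    0    0    0    0.1  0    0    0    0
v2    1    1    1    1    0    0    0    0    0    0
v3    0    0    0    0    0    1    0    1    1    0
v4    0    0    0    0    0    2    2    0    0    0
v5    0    0    0    0    0    0    0    1    1    1
EOF

filter_groups() {
    awk -v thr=0.1 -v pct=80 '
    NR == 1 {                          # group header: one group name per data column
        for (i = 1; i <= NF; i++) {
            grp[i + 1] = $i            # data rows start with a label, so shift by one
            size[$i]++                 # number of columns in each group
        }
        print; next
    }
    NR == 2 { print; next }            # sample header: printed as-is
    {
        for (g in size) hits[g] = 0    # reset the per-group counters for this row
        for (i = 2; i <= NF; i++)
            if ($i + 0 > thr) hits[grp[i]]++
        # integer arithmetic avoids floating-point surprises at exactly 80%
        for (g in size)
            if (hits[g] * 100 >= size[g] * pct) { print; next }
    }' "$1"
}

filter_groups "$input"
```

On the made-up input this prints both header lines followed by v2 (4 of 5 in g1), v4 (2 of 2 in g2) and v5 (3 of 3 in gn). Because each column carries its own group name, the groups would not even need to sit in adjacent columns.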