input
sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
v1 0.2 0.1 0.1 0 1 2 3 4 9 10
v2 0 0 0.01 0 0 0 0 0 0 0
v3 0 0 0 0 0 0 0 0 0 0
v4 0.2 0 0 0 0 0 0 0 0 0
v5 0.1 0.1 0.2 0.2 10 2 3 5 6 7
output
sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
v5 0.1 0.1 0.2 0.2 10 2 3 5 6 7
Tried
I tried something that prints lines containing values greater than 0.1:
awk '{for (i=2;i<=NF;i++) if ($i>0.1) {print $0;next} }'
awk '{ T=0 ; for (i=2;i<=NF;i++) if ($i>0.1) {T++ } if(T > (NF * 0.8)) print}'
@Corona688: it is only printing the header. I think something is wrong?
RudiC
February 17, 2015, 10:55am
4
Try
awk '{L=1; for (i=2;i<=NF;i++) L=L*($i>=0.1)} L' file4
sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
v5 0.1 0.1 0.2 0.2 10 2 3 5 6 7
Please be aware that the header is printed "by sheer coincidence". To make it print safely, add something like NR==1; or so...
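To illustrate the remark about the header: its fields are not numbers, so awk compares them with 0.1 as strings, and they merely happen to sort higher. A minimal sketch that prints line 1 unconditionally and applies the test to data lines only is shown below. The function name `all_ge` and the three-column sample are made up for illustration; the test itself is the "all values >= 0.1" check from the one-liner above.

```shell
# Hypothetical helper: print the header, then only rows whose values are all >= 0.1.
all_ge() {
  awk '
    NR == 1 { print; next }        # header: always print, skip the numeric test
    {
      L = 1
      for (i = 2; i <= NF; i++)    # field 1 is the row label
        L = L * ($i >= 0.1)        # drops to 0 as soon as one value is below 0.1
    }
    L                              # non-zero pattern, default action: print
  ' "$@"
}

# Illustrative input, not taken from the thread.
printf '%s\n' 's1 s2 s3' 'v1 0.2 0.1 0' 'v5 0.1 0.2 10' | all_ge
```

With this input, the header and the v5 row are printed; v1 is dropped because of its 0.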
@Corona688: I'm not sure I get your logic. Could you explain?
Hello quincyjones,
Could you please try the following and let me know if this helps. (A little addition to Corona's code.)
awk 'NR==1 {print; next} {T=0; for (i=2;i<=NF;i++) if ($i>0.1) T++; if (T >= (NF-1) * 0.8) print}' Input_file
Output will be as follows.
sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
v5 0.1 0.1 0.2 0.2 10 2 3 5 6 7
Thanks,
R. Singh
RudiC
February 17, 2015, 11:06am
6
Rats! I didn't read the title ... be back soon ...
---------- Post updated at 17:06 ---------- Previous update was at 16:57 ----------
... adapting Corona688's proposal slightly (as there are 10 data fields but 11 in total; and the request was "at least 80%"):
awk '{T=0; for (i=2;i<=NF;i++) T+=($i>0.1)}
T >= ((NF-1) * 0.8)
' file
sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
v5 0.1 0.1 0.2 0.2 10 2 3 5 6 7
Thank you all, it's working great. However, could someone please explain the logic behind it?
NF is the number of fields.
If T is greater than 80% of NF, print.
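To spell the logic out, here is an annotated sketch of the counting approach. It follows RudiC's variant above (NF-1 data fields, "at least 80%") and is not the exact one-liner being asked about. The function name `keep80` is made up for illustration; the sample data is the input from the first post.

```shell
# Hypothetical helper: keep rows where at least 80% of the values exceed 0.1.
keep80() {
  awk '
    NR == 1 { print; next }        # header line: print as is
    {
      T = 0                        # reset the counter for every row
      for (i = 2; i <= NF; i++)    # NF = number of fields; field 1 is the label
        T += ($i > 0.1)            # a comparison yields 1 or 0, so T counts the hits
    }
    T >= (NF - 1) * 0.8            # NF-1 data fields; print if 80% or more are hits
  ' "$@"
}

keep80 <<'EOF'
sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
v1 0.2 0.1 0.1 0 1 2 3 4 9 10
v2 0 0 0.01 0 0 0 0 0 0 0
v3 0 0 0 0 0 0 0 0 0 0
v4 0.2 0 0 0 0 0 0 0 0 0
v5 0.1 0.1 0.2 0.2 10 2 3 5 6 7
EOF
```

v1 has 7 values above 0.1 and v5 has 8, so with 10 data fields only v5 reaches the threshold of 8.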
Is it possible to extend the same code (ravindersingh13's, above) so that the 80% is calculated in each group separately, as in the following?
Input
group1 group1 group1 group1 group1 group1 group1 group1 group1 group1 group2 group2 group2 group2 group2 group2 group2 group2 group2 group2
sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10 sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
v1 0.2 0.1 0.1 0 1 2 3 4 9 10 0.2 0.1 0.1 0 1 2 3 4 9 10
v2 0 0 0.01 0 0 0 0 0 0 0 0 0 0.01 0 0 0 0 0 0 0
v3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
v4 0.2 0 0 0 0 0 0 0 0 0 0.1 0.1 0.2 0.2 10 2 3 5 6 7
v5 0.1 0.1 0.2 0.2 10 2 3 5 6 7 0.2 0 0 0 0 0 0 0 0 0
output
group1 group1 group1 group1 group1 group1 group1 group1 group1 group1 group2 group2 group2 group2 group2 group2 group2 group2 group2 group2
sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10 sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
v4 0.2 0 0 0 0 0 0 0 0 0 0.1 0.1 0.2 0.2 10 2 3 5 6 7
v5 0.1 0.1 0.2 0.2 10 2 3 5 6 7 0.2 0 0 0 0 0 0 0 0 0
RudiC
February 23, 2015, 8:21am
10
That certainly is possible.
Why do lines v2, v3, v4 show up in your sample output?
Are there always two groups? Of identical length?
What be the exact condition for when to print and when not?
Oops, I have corrected it now. So the values should be greater than 0.1 in at least 80% of the samples in at least one group. For example, v4 satisfies this condition in group2 and v5 in group1.
Hello quincyjones,
I think the output should be v1. The following may help; please let me know if it does.
awk '{for(i=2;i<=11;i++){if($i > .1 && $(i+10) > .1){T=1}};if(T){print $0;T=""}}' Input_file
Output will be as follows.
group1 group1 group1 group1 group1 group1 group1 group1 group1 group1 group2 group2 group2 group2 group2 group2 group2 group2 group2 group2
sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10 sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
v1 0.2 0.1 0.1 0 1 2 3 4 9 10 0.2 0.1 0.1 0 1 2 3 4 9 10
EDIT: Sorry, there was a typo here; I have changed the output now.
Thanks,
R. Singh
It is v4 and v5. v1 has three samples in each group with values <= 0.1, so it doesn't satisfy the condition "greater than 0.1 in at least 80% of the samples in a specific group". Hope that is clear.
RudiC
February 23, 2015, 8:50am
14
You didn't answer my second & third question.
Why do lines v2, v3, v4 show up in your sample output?
It is only V4 and V5. Now I corrected it.
Are there always two groups? Of identical length?
No. Some of the groups could have different number of samples.
What be the exact condition for when to print and when not?
A row should be printed if its values are greater than 0.1 in at least 80% of the samples in at least one of the groups. Rows that do not satisfy this should be ignored.
Sorry for not being so clear. Thanks.
RudiC
February 23, 2015, 8:56am
16
This is for exactly the sample you posted - two groups of 10 members each:
awk ' {G1=G2=0
for (i=2;i<=11;i++) {G1+=($i>0.1); G2+=($(i+10)>0.1)}
}
G1 >= 8 || G2 >= 8
' file
group1 group1 group1 group1 group1 group1 group1 group1 group1 group1 group2 group2 group2 group2 group2 group2 group2 group2 group2 group2
sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10 sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
v4 0.2 0 0 0 0 0 0 0 0 0 0.1 0.1 0.2 0.2 10 2 3 5 6 7
v5 0.1 0.1 0.2 0.2 10 2 3 5 6 7 0.2 0 0 0 0 0 0 0 0 0
No flexibility at all for changing group sizes or group count; there must be exactly two groups of 10 each.
Thank you RudiC, I misunderstood the requirement. I thought we needed to compare the groups (which is correct), but I didn't get the 80% concept; I thought the user was asking to print the line if any group is above 80%.
Thanks,
R. Singh
So I think it doesn't work with multiple groups with different sample sizes?
ex:
g1 g1 g1 g1 g1 g2 g2 g2 g2 g2 g3 g3 g3 g3 g3 g3 g3 g3 g3 g3
s1 s2 s3 s4 s5 s1 s2 s3 s4 s5 s1 s2 s3 s4 s5 s6 s7 s8 s9 s10
v1 0 0.1 0.1 0.1 0.1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
v2 0.1 0.1 0.1 0.1 0 0 0 0 0 0 0 0 1 2 3 4 5 6 6 6
v3 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0
v4 1 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0
v5 0.2 0.2 0.2 0.2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
output
g1 g1 g1 g1 g1 g2 g2 g2 g2 g2 g3 g3 g3 g3 g3 g3 g3 g3 g3 g3
s1 s2 s3 s4 s5 s1 s2 s3 s4 s5 s1 s2 s3 s4 s5 s6 s7 s8 s9 s10
v2 0.1 0.1 0.1 0.1 0 0 0 0 0 0 0 0 1 2 3 4 5 6 6 6
v5 0.2 0.2 0.2 0.2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
RudiC
February 23, 2015, 9:16am
19
Well, try this. It was developed for your former sample, but it seems to work with the current one too:
awk 'NR==1 {for (i=1; i<=NF; i++) GRCNT[$i]++
# for (i in GRCNT) print i, GRCNT[i]
}
{COL=2
for (gc in GRCNT) {TOT[gc]=0
STP=COL+GRCNT[gc]
for (;COL<STP;COL++) TOT[gc]+=($COL>0.1)
}
for (gc in TOT) {# print gc, GRCNT[gc], TOT[gc]
if (TOT[gc] >= GRCNT[gc] * 0.8) {print; break}
}
}
' file
g1 g1 g1 g1 g1 g2 g2 g2 g2 g2 g3 g3 g3 g3 g3 g3 g3 g3 g3 g3
s1 s2 s3 s4 s5 s1 s2 s3 s4 s5 s1 s2 s3 s4 s5 s6 s7 s8 s9 s10
v2 0.1 0.1 0.1 0.1 0 0 0 0 0 0 0 0 1 2 3 4 5 6 6 6
v5 0.2 0.2 0.2 0.2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
The two commented out print statements are for debugging if you need some insight into the script's internal operation...
It still needs groups to be in adjacent columns and the groups to start in col 2.
It seems there is a bug in the script. For example, it doesn't print v4 (which satisfies the condition in group g2) or v5 (which satisfies the condition in group gn).
input
g1 g1 g1 g1 g1 g1 g1 g1 g1 g1 g2 g2 g2 g2 g2 g2 g2 g2 g2 g2 g2 g2 g2 g2 g2 g2 g2 g2 g2 g2 gn gn gn gn gn
t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15 t16 t17 t18 t19 t20 t1 t2 t3 t4 t5
v1 0 0 0 0 0 0 0 0 0 0.1 0.1 0.1 0.1 0.1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
v2 0.2 0.1 0.2 0.2 0.2 2 2 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
v3 0 0 0 0 0 0 0 0 0 0 1 2 3 2 2 2 2 2 2 2 2 2 2 2 2 0 0 0 0 0 0 0 0 0 0
v4 0 0 0 0 0 0 0 0 0 0 0.2 0.2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 0 0 0 0
v5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1
Output
g1 g1 g1 g1 g1 g1 g1 g1 g1 g1 g2 g2 g2 g2 g2 g2 g2 g2 g2 g2 g2 g2 g2 g2 g2 g2 g2 g2 g2 g2 gn gn gn gn gn
t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15 t16 t17 t18 t19 t20 t1 t2 t3 t4 t5
v2 0.2 0.1 0.2 0.2 0.2 2 2 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Output should be
g1 g1 g1 g1 g1 g1 g1 g1 g1 g1 g2 g2 g2 g2 g2 g2 g2 g2 g2 g2 g2 g2 g2 g2 g2 g2 g2 g2 g2 g2 gn gn gn gn gn
t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15 t16 t17 t18 t19 t20 t1 t2 t3 t4 t5
v2 0.2 0.1 0.2 0.2 0.2 2 2 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
v4 0 0 0 0 0 0 0 0 0 0 0.2 0.2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 0 0 0 0
v5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1
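A possible explanation, offered as a guess and not verified against the script above: awk does not guarantee the order in which `for (gc in GRCNT)` visits the group names, so the column ranges can end up paired with the wrong groups (for instance if `gn` happens to be visited first). The sketch below sidesteps this by remembering, from the header, which group each data column belongs to, which also drops the need for groups to sit in adjacent columns. The function name `group80`, the array name `GRP`, the assumption that lines 1 and 2 are both headers, and the small sample are all made up for illustration.

```shell
# Hypothetical helper: keep rows that exceed 0.1 in at least 80% of the
# samples of at least one group. Assumes line 1 = group names, line 2 = sample names.
group80() {
  awk '
    NR == 1 {                          # group header: one name per data column
      for (i = 1; i <= NF; i++) {
        GRP[i + 1] = $i                # header field i labels data column i+1
        GRCNT[$i]++                    # number of samples in that group
      }
      print; next
    }
    NR == 2 { print; next }            # sample-name header
    {
      for (g in GRCNT) TOT[g] = 0      # reset per-group hit counts for this row
      for (c = 2; c <= NF; c++)
        TOT[GRP[c]] += ($c > 0.1)      # credit the hit to the group owning column c
      for (g in GRCNT)                 # visiting order is irrelevant here
        if (TOT[g] * 10 >= GRCNT[g] * 8) { print; break }   # hits >= 80%, in integers
    }
  ' "$@"
}

# Illustrative input, not taken from the thread: two groups of five samples.
printf '%s\n' \
  'g1 g1 g1 g1 g1 g2 g2 g2 g2 g2' \
  's1 s2 s3 s4 s5 s1 s2 s3 s4 s5' \
  'v1 0 0.1 0.1 0.1 0.1 0 0 0 0 0' \
  'v2 0.2 0.2 0.2 0.2 0 0 0 0 0 0' \
  'v3 0 0 0 0 0 1 1 1 1 0' \
  'v4 1 1 1 0 0 1 1 1 0 0' | group80
```

On this small input the two header lines, v2 (4 of 5 in g1) and v3 (4 of 5 in g2) are printed. The same function can be pointed at a file, e.g. `group80 file`.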