sum of a column and selecting lines with value above threshold

Hi again,
I need to further process the results of a previous manipulation.
I have a file with three columns
e.g.

AAA5 0.00175 1.97996e-06
AAA5 0.01334 2.14159e-05
AAA5 0.01340 4.12155e-05
AAA5 0.01496 1.10312e-05
AAA5 0.51401 0.0175308
BB0 0.00204 2.8825e-07
BB0 0.01569 7.94746e-07
BB0 0.01578 1.51949e-06
BB0 0.01766 3.91196e-07
BB0 2.18630 3.60312e-10
BB1 1.20490 6.06204e-09
CCC5 0.00226 5.31546e-10

What I would like to do is to sum the total of the 3rd column.
then, select (print) only those line whose ratio 3rd column/total is bigger than a certain value (stored in a variable).

I tried awk, but I still don't get the way it works.

Thanks in advance,

#! /usr/bin/perl -w
use strict;

my ($line, $sum, $crtn_value, $ratio) = (0, 0, 1e-4, 0);
my (@x);

open INPUT, "< input.txt";
for $line (<INPUT>) {
    @x = split /\s+/, $line;
    $sum = $sum + $x[2];
}
close INPUT;

open INPUT, "< input.txt";
for $line (<INPUT>) {
    @x = split /\s+/, $line;
    $ratio = $x[2] / $sum;
    ($ratio > $crtn_value) && print "$line";
}
close INPUT;
1 Like

Do you think that could this be done with AWK (maybe I gain some insight how it works)?

Try this...

awk '{sum+=$3;a[$0]=$3}END{for(i in a){if(a/sum > val){print i}}}' val="0.016" input_file

--ahamed

1 Like

Hi ahamed,
indeed it works thank you.
It is not clear in which order the results are printed.
Is there a way to print them in the same order as they appear in the input file?

sort the output if that helps i.e. awk '{...}' input_file | sort
or
Try this...

awk '{sum+=$3;a[++j]=$0}END{for(i=1;i<=j;i++){split(a,arr);if(arr[3]/sum > val){print a}}}' val="0.016" input_file

--ahamed

1 Like

Hi this solution works better than the previous, which didn't work all the time.
e.g.
file1

eu154     1.51000000        1.70513841e+10
eu154     1.52200000        2.07052522e+09
eu154     1.52200000        2.07052522e+09
eu154     1.53140000        2.07052522e+10
eu154     1.53780000        1.70513841e+11
eu154     1.55400000        4.87182404e+09
eu154     1.59300000        3.53207243e+12
eu154     1.59650000        6.27264499e+12
awk '{sum+=$3;a[$0]=$3}END{for(i in a){if(a/sum > val){print i}}}' val=0.0 file3

was giving out

eu154     1.53140000        2.07052522e+10
eu154     1.59300000        3.53207243e+12
eu154     1.52200000        2.07052522e+09
eu154     1.59650000        6.27264499e+12
eu154     1.55400000        4.87182404e+09
eu154     1.51000000        1.70513841e+10
eu154     1.53780000        1.70513841e+11

So it was missing one of the 2 identical lines

Still I have to sort the file with sort afterwards, but the command

awk '{sum+=$3;a[++j]=$0}END{for(i=1;i<=j;i++){split(a,arr);if(arr[3]/sum > val){print a}}}' val="0.0" file3

gives

eu154     1.51000000        1.70513841e+10
eu154     1.52200000        2.07052522e+09
eu154     1.52200000        2.07052522e+09
eu154     1.53140000        2.07052522e+10
eu154     1.53780000        1.70513841e+11
eu154     1.55400000        4.87182404e+09
eu154     1.59300000        3.53207243e+12
eu154     1.59650000        6.27264499e+12

as expected