Average across rows with a condition

jacobs.smith · September 17, 2014, 3:05pm

Hi Friends,

My input file

Gene1 10 20 0
Gene2 5 0 15
Gene3 10 10 10
Gene4 5 0 0

If there is a zero for any gene in any column, I don't want that column to be considered which reduces the denominator value during average.

Here is my output

Gene1 10 20 0 10
Gene2 5 0 15 10
Gene3 10 10 10 10
Gene4 5 0 0 5

The last column is my average. I didn't try anything for this solution because I didn't know where to start. Sorry.

junior-helper · September 17, 2014, 4:00pm

This approach utilizes bash arrays:

while read gname val1 val2 val3
 do
  arr=()
  [ $val1 -ne 0 ] && arr=("${arr[@]}" "$val1")
  [ $val2 -ne 0 ] && arr=("${arr[@]}" "$val2")
  [ $val3 -ne 0 ] && arr=("${arr[@]}" "$val3")
  nbrofvals=${#arr[@]}
  tot=0
    for arrval in "${arr[@]}"
     do
      tot=$((tot + arrval))
    done
  avg=$((tot/nbrofvals))
echo "$gname $val1 $val2 $val3 $avg"
done <input

jacobs.smith · September 17, 2014, 4:37pm

junior-helper:

This approach utilizes bash arrays:

while read gname val1 val2 val3
 do
  arr=()
  [ $val1 -ne 0 ] && arr=("${arr[@]}" "$val1")
  [ $val2 -ne 0 ] && arr=("${arr[@]}" "$val2")
  [ $val3 -ne 0 ] && arr=("${arr[@]}" "$val3")
  nbrofvals=${#arr[@]}
  tot=0
   for arrval in "${arr[@]}"
   do
   tot=$((tot + arrval))
   done
  avg=$((tot/nbrofvals))
echo "$gname $val1 $val2 $val3 $avg"
done <input

Thanks for your time. But I see the following errors

line 4: [: 1415.37: integer expression expected
line 5: [: 0.05248: integer expression expected
line 6: [: 2.20281: integer expression expected
line 13: tot/nbrofvals: division by 0 (error token is "s")

FYI, the zeros in the input files can even occur in second column which is your val1. Thanks for your time.

---------- Post updated at 04:37 PM ---------- Previous update was at 04:06 PM ----------

I tried all possible zeros occurrence like this

gene1	1	2	3
gene2	0	0	3
gene3	0	9	0
gene4	8	0	0
gene5	5	5	0
gene6	0	6	6
gene7	9	9	0

And then wrote if else statements for all combinations

awk '{if($2!=0 && $3!=0 && $4!=0) {v=($2+$3+$4)/3; print $0"\t"v} else if($2==0 && $3==0 && $4!=0) {v=($2+$3+$4)/1; print $0"\t"v} else if($2==0 && $3!=0 && $4==0) {v=($2+$3+$4)/1; print $0"\t"v} else if($2!=0 && $3==0 && $4==0) {v=($2+$3+$4)/1; print $0"\t"v} else if($2!=0 && $3!=0 && $4==0) {v=($2+$3+$4)/2; print $0"\t"v} else if($2==0 && $3!=0 && $4!=0) {v=($2+$3+$4)/2; print $0"\t"v} else if($2!=0 && $3!=0 && $4==0) {v=($2+$3+$4)/2; print $0"\t"v}}' input

output

gene1	1	2	3	2
gene2	0	0	3	3
gene3	0	9	0	9
gene4	8	0	0	8
gene5	5	5	0	5
gene6	0	6	6	6
gene7	9	9	0	9

junior-helper · September 17, 2014, 4:38pm

You're welcome.

OK, obviously you want to calculate average of floating point numbers, right?
I saw your input sample with whole numbers only and scripted something for that.

>the zeros in the input files can even occur in second column which is your val1
If you think of gname as val0, then val1 *is* the second column, thus the script would also handle a zero in the second column.
However, right now, it would fail if all three columns (col2, 3 and 4) were all zeros.

I suggest that you provide another few "real world" input lines and the expected output.
I'm not sure I can fix my code though :rolleyes:, but it would be very useful for others.

Yoda · September 17, 2014, 4:38pm

Try this:-

awk '
        {
                for ( i = 2; i <= NF; i++ )
                {
                        V[$1] = $0
                        A[$1] += $i
                        T[$1] += ( $i == 0 ? 0 : 1 )
                }
        }
        END {
                for ( k in A )
                        print V[k], T[k] ? A[k] / T[k] : 0
        }
' file | sort

RavinderSingh13 · September 18, 2014, 3:18am

Posted by jacobs.smith:

Hello jacobs.smith,

Following may help also.

 awk '{for(i=2;i<=NF;i++){if($i){A[NR]=A[NR]?A[NR]+$i:$i;B[NR]=B[NR]?B[NR]+1:1;}}} {p=f=NR;{for(r=NR;r<=NR;r++){c[p--]=A[r]/B[r]}}{for(t=f;t<=NR;t++){$0=$0 OFS c[t];print $0}}}'  Input_file

Output will be as follows.

Gene1 10 20 0 15
Gene2 5 0 15 10
Gene3 10 10 10 10
Gene4 5 0 0 5

EDIT: Adding a clear view of code for same.

awk '{
for(i=2;i<=NF;i++)
        {if($i)
                {A[NR]=A[NR]?A[NR]+$i:$i;B[NR]=B[NR]?B[NR]+1:1;}
        }
     }
{p=f=NR;
        {for(r=NR;r<=NR;r++)
                            {c[p--]=A[r]/B[r]}
        }
        {for(t=f;t<=NR;t++)
                            {$0=$0 OFS c[t];print $0}
        }
}' Input_file

Thanks,
R. Singh