Help needed!

jackken007 · January 24, 2013, 11:50pm

Hi,

I need to calculate average of a column based on row values of another file.

File 1:

ID     start    end   
1      2         5
1      8         10
2      3         5
2      7         11
...

File 2:

ID     position    y
1       1            0
1       2            2
1       3            2
1       4            0
1       5            1
1       6            0
1       7            0
1       8            1
1       9            1
1       10          1
1       11          1
1       12          2
...

output:

ID    start    end     mean.y
1      2        5         1.2
1      8        10       1
...

Basically, the added column "mean.y" is average "y" value of row 2 to 5 in file 2, i.e., 1.2=(2+2+0+1)/4.

Thanks a lot!

Yoda · January 25, 2013, 12:02am

Here is one way of doing it:

while read id st en
do
  awk -v I=$id -v S=$st -v E=$en 'BEGIN{c=0;m=0;}$1==I&&NR>=S&&NR<=E{++c;m+=$NF;}END{if(c!=0) printf "%d %d %d %.1f\n",I,S,E,m/c;}' file2
done < file1

jackken007 · January 25, 2013, 12:25am

Thank you so much!

There is a little bit problem. Could you please take a look?

 while read id start end
 do
 awk -v I=$id -v S=$start -v E=$end 'BEGIN{c=0; m=0;}$1==I&&NR>=S&&NR<=E{++c;m+=$NF;}END{if(c!=0) printf "%d %d %d %.1f\n",I,S,E,m/c;}' file2.txt
 done < file1.txt

awk: cmd. line:1:
' in expression1: ^ invalid char '
1 2 5 1.0
1 8 10 0.7

##############
Can you please check why the average of rows 2 to 5 is 1.0, rather than 1.2?

Thank you very much!

Yoda · January 25, 2013, 12:39am

Oh I missed to skip the headers in your files. Try this modified code:-

while read id st en
do
[[ $id =~ "^[0-9]+$" ]] && awk -v I=$id -v S=$st -v E=$en 'BEGIN{c=0;m=0;}$1==I&&NR>=S+1&&NR<=E+1{++c;m+=$NF;}END{if(c!=0) printf "%d %d %d %.1f\n",I,S,E,m/c;}' file2
done < file1

jackken007 · January 25, 2013, 1:06am

Thank you so much! bipinajith.

It works great!

However, It's totally my fault that I did not explain it correctly in my post.

Actully, what I really wanted was to average rows that have "pos" values (in file2) between "start" and "end".

For example, file 1 is still

id start end 
1 2 5
1 8 10
2 3 5
2 7 11

file 2

output:

id start end mean.y
1 2 5 0.75
1 8 10 1.0
2 3 5
2 7 11

where 0.75=(2+1)/4 <- because start=2 and end=5, so only rows with "pos" = 3 and 5 are averaged.

Thank you so much!

---------- Post updated at 01:06 AM ---------- Previous update was at 12:57 AM ----------

Hi, bipinajith.

I think your code works great! And IT IS what I wanted! Please ignore the last post!

Thanks a lot!