Mean of the specific columns

repinementer · September 28, 2009, 12:05am

I have an input file in which some rows share the same values in the 1st, 2nd and 3rd columns, while the 4th and 5th columns differ. For each group of rows with identical values in columns 1 to 3, I would like to print the mean of the 4th column along with all the values from the 5th column.

input

NM_0    1.22    CR5    0.4    n_21663
NM_0    1.22    CR5    0.1    n_1664
NM_0    1.22    CR5    0.6    n_21665
NM_11    1.36    AK09   0.9    n_19168
NM_11    1.36    AK09    -0.02    n_19169

output

NM_0    1.22    CR5    0.366  n_21663  n_1664  n_21665  
NM_11    1.36    AK09   0.44  n_19168  n_19169

Thanks in advance

danmero · September 28, 2009, 12:17am

Can you explain how you get 0.366 and 0.44 based on your input data?

repinementer:

NM_0    1.22    CR5    0.4    n_21663
NM_0    1.22    CR5    0.1    n_1664
NM_0    1.22    CR5    0.6    n_21665
NM_11    1.36    AK09   0.9    n_19168
NM_11    1.36    AK09    -0.02    n_19169

vidyadhar85 · September 28, 2009, 12:23am

He has calculated the average of the 4th column within each group:

(0.4+0.1+0.6)/3=0.3666
(0.9-0.02)/2=0.44
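
The arithmetic can be checked directly with awk. This is only a quick sketch using the figures quoted above. Note that the first mean is 0.3666..., which rounds to 0.367 at three decimals, so a `%.3f` format prints 0.367 rather than the 0.366 shown in the sample output.

```shell
awk 'BEGIN {
    printf "%.4f\n", (0.4 + 0.1 + 0.6) / 3   # NM_0 group: three rows
    printf "%.2f\n", (0.9 - 0.02) / 2        # NM_11 group: two rows
}'
```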

repinementer · September 28, 2009, 12:34am

Yes, Mr. Vidya is right.

ripat · September 28, 2009, 1:31am

Hi,

Something like this?

awk '{
    ind=sprintf("%s %s %s",$1,$2,$3)
    t[ind]+=$4
    n[ind]++
    s[ind]=s[ind] " " $5
}
END{
    for(i in t) printf "%s %.3f %s\n",i,t[i]/n[i],s[i]
}' file

repinementer · September 28, 2009, 2:01am

Thanks ripat. That is exactly what I want.

---------- Post updated at 10:01 PM ---------- Previous update was at 09:42 PM ----------

Is it possible to change the output like this?
Thanks

output

NM_0    1.22    CR5    0.366  n_21663  0.4  n_1664  0.1  n_21665  0.64
NM_11    1.36    AK09   0.44  n_19168  0.9 n_19169 -0.02

ripat · September 28, 2009, 2:15am

Simply change this line:

    s[ind]=s[ind] " " $5 " " $4

rdcwayx · September 28, 2009, 2:15am

NM_0    1.22    CR5    0.366  n_21663  0.4  n_1664  0.1  n_21665  0.64
NM_11    1.36    AK09   0.44  n_19168  0.9 n_19169 -0.02

How do you get 0.64?

If you just want to append columns 5 and 4 together, you can use the code below:

awk '{
    ind=sprintf("%s %s %s",$1,$2,$3)
    t[ind]+=$4
    n[ind]++
    s[ind]=s[ind] " " $5 " " $4
}
END{
    for(i in t) printf "%s %.3f %s\n",i,t[i]/n[i],s[i]
}' file

repinementer · September 28, 2009, 3:20am

Working great, a really innovative awk command, Mr. Ripat. Thank you.

---------- Post updated at 11:20 PM ---------- Previous update was at 11:02 PM ----------

Hey guys, if you don't mind, could you please explain the code?
Thanks

ripat · September 28, 2009, 3:42am

awk '{
    # build an index (array key) by concatenating fields 1 to 3
    ind=sprintf("%s %s %s",$1,$2,$3)

    # running total of $4 for all lines that share the same $1,$2,$3
    t[ind]+=$4

    # number of lines seen so far that share the same $1,$2,$3
    n[ind]++

    # append fields 5 and 4 to the string collected for the same $1,$2,$3
    s[ind]=s[ind] " " $5 " " $4

    # alternative that avoids the extra leading space; use it INSTEAD of the
    # line above (running both would append every pair twice)
    # s[ind]=sprintf("%s%s%s %s", s[ind], s[ind]?" ":"", $5, $4)
}

# when all lines have been processed, traverse the arrays by key and print
# the key, the average (total / count) and the collected string
END{
    for(i in t) printf "%s %.3f %s\n",i,t[i]/n[i],s[i]
}'
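
One caveat with the script above: `for (i in t)` visits the keys in an unspecified order, so the groups may not come out in the order they appear in the input. A minimal sketch of an order-preserving variant follows. The wrapper name `group_mean` and the helper arrays `order` and `count` are hypothetical names introduced for illustration, not part of the original answer.

```shell
# Hypothetical helper: reads whitespace-separated rows from files or stdin.
group_mean() {
    awk '{
        key = $1 " " $2 " " $3
        # remember each key the first time it is seen, in input order
        if (!(key in sum)) order[++count] = key
        sum[key] += $4               # running total of column 4
        n[key]++                     # number of rows in this group
        ids[key] = ids[key] " " $5   # collected column 5 values
    }
    END {
        # walk the keys in first-seen order instead of for (key in sum)
        for (j = 1; j <= count; j++) {
            key = order[j]
            printf "%s %.3f%s\n", key, sum[key] / n[key], ids[key]
        }
    }' "$@"
}

# Usage with the sample rows from the question
printf '%s\n' \
    'NM_0 1.22 CR5 0.4 n_21663' \
    'NM_0 1.22 CR5 0.1 n_1664' \
    'NM_0 1.22 CR5 0.6 n_21665' \
    'NM_11 1.36 AK09 0.9 n_19168' \
    'NM_11 1.36 AK09 -0.02 n_19169' | group_mean
```

With this input the NM_0 group is printed before the NM_11 group, matching the order of the input file. The extra array costs one entry per distinct key, which should be negligible for files of this kind.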

summer_cherry · September 28, 2009, 5:58am

awk '{
key=sprintf("%s %s %s",$1,$2,$3)
arr[key]+=$4
brr[key]++
crr[key]=sprintf("%s %s",crr[key],$5)
}
END{
for (i in arr)
 print i" "arr[i]/brr[i]" "crr[i]
}'