I have the following file:
299899 chrX_299716_300082 196 78.2903 299991 chrX_299982_300000 18.2538 Tajd:0.745591 FayWu:-0.245701 T2:1.45
299899 chrX_299716_300082 196 78.2903 299991 chrX_299982_300000 18.2538 Tajd:0.745591 FayWu:-0.245701 T2:0.283
311027 chrX_310892_311162 300 91.6452 311022 chrX_311013_311031 14.9526 Tajd:0.640409 FayWu:-0.278087 T2:0.283
311027 chrX_310892_311162 300 91.6452 311022 chrX_311013_311031 14.9526 Tajd:0.640409 FayWu:-0.278087 T2:-0.324
388608 chrX_388393_388823 562 50.619 388603 chrX_388594_388612 18.4584 Tajd:0.342217 FayWu:-0.742664 T2:-0.421
688781 chrX_688561_689002 552 -0 688817 chrX_688808_688826 10.6874 Tajd:0.302043 FayWu:-1.079566 T2:0.803
688781 chrX_688561_689002 552 -0 688817 chrX_688808_688826 10.6874 Tajd:0.302043 FayWu:-1.079566 T2:-1.233
1220600 chrX_1220404_1220797 510 -0 1220617 chrX_1220608_1220626 16.7085 Tajd:0.391032 FayWu:-0.421912 T2:1.093
There are a lot of lines that are identical except for the last field (T2:#). I'm looking for a way to combine these lines so that the T2 values are averaged. For this excerpt I would like to get something like:
299899 chrX_299716_300082 196 78.2903 299991 chrX_299982_300000 18.2538 Tajd:0.745591 FayWu:-0.245701 T2:0.8665
311027 chrX_310892_311162 300 91.6452 311022 chrX_311013_311031 14.9526 Tajd:0.640409 FayWu:-0.278087 T2:-0.0205
388608 chrX_388393_388823 562 50.619 388603 chrX_388594_388612 18.4584 Tajd:0.342217 FayWu:-0.742664 T2:-0.421
688781 chrX_688561_689002 552 -0 688817 chrX_688808_688826 10.6874 Tajd:0.302043 FayWu:-1.079566 T2:-0.215
1220600 chrX_1220404_1220797 510 -0 1220617 chrX_1220608_1220626 16.7085 Tajd:0.391032 FayWu:-0.421912 T2:1.093
The file is sorted, so all matching lines are consecutive. The closest I have gotten is:
more input.file | awk '{split($10,a,":");avt2[$1]+=a[2];c[$1]++}END{for(i in avt2) print $0,avt2/c}' > output.file
but this produces nothing useful: in the END block $0 still holds only the last input line, and avt2/c tries to divide the arrays themselves instead of indexing them with i, which awk rejects as using an array in a scalar context.
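Since the file is sorted, a single streaming pass over consecutive groups avoids the END-block problem entirely. Here is a minimal sketch, assuming (as in the excerpt) that fields 1-9 identify a group and field 10 is always `T2:<number>`; `average_t2` is just an illustrative wrapper name:

```shell
#!/bin/sh
# Average the T2 field over consecutive lines whose first nine fields match.
# Assumes the input is sorted, so duplicate lines are always adjacent.
average_t2() {
  awk '{
    key = $1
    for (i = 2; i <= 9; i++) key = key OFS $i   # fields 1-9 identify a group
    split($10, a, ":")                          # a[2] holds the numeric T2 value
    if (key != prev) {
      if (NR > 1) print prev, "T2:" sum / n     # flush the finished group
      prev = key; sum = 0; n = 0
    }
    sum += a[2]; n++
  }
  END { if (NR > 0) print prev, "T2:" sum / n } # flush the final group
  ' "$1"
}
```

Usage would then be `average_t2 input.file > output.file`. Because groups are contiguous, this needs no arrays and prints each averaged line as soon as its group ends; awk's default output format (`OFMT`, `%.6g`) is enough precision for the values shown above.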
Thanks a lot for any help,
Jonas