Average of columns with values of other column with same name

isildur1234 · December 13, 2012, 2:36am

I have a lot of input files that have the following form:

Sample	Cq	Sample	Cq	Sample	Cq	Sample	Cq	Sample	Cq	
1WBIN	23.45	1WBIN	23.45	1CVSIN	23.96	1CVSIN	23.14	S1	31.37	
1WBIN	23.53	1WBIN	23.53	1CVSIN	23.81	1CVSIN	23.24	S1	31.49	
1WBIN	24.55	1WBIN	24.55	1CVSIN	23.86	1CVSIN	23.24	S1	31.74	
1CVSIN	23.62	1CVSIN	23.62	1CVSIP	22.12	1CVSIP	21.53	S10	31.13	
1CVSIN	23.46	1CVSIN	21.74	1CVSIP	22.24	1CVSIP	21.40	S10	31.10	
ICVSIN	21.74	1CVSIN	23.33	1CVSIP	22.22	1CVSIP	21.36	S10	31.29	
...

I need to read each column, get the row values from the column next to it that have same name in the first column and compute the average of their values and output in a new file. They are usually triplicates of the same name, i need to get all three of its values and compute the average of these . The output should look like this in a new file:

Sample	Cq	Sample	Cq	Sample	Cq	Sample	Cq	Sample	Cq	
1WBIN	23.843	1WBIN	23.84	1CVSIN	23.87	1CVSIN	23.20	S1	31.53	
1CVSIN	22.94	1CVSIN	22.89	1CVSIP	22.19	1CVSIP	21.43	S10	31.17

I print only one of the triplicates and next to it the average of the three values that each of the triplicates had.
Is it possible to be done in awk? I am trying with perl.

With PERL:
I thought of converting the columns into rows, read each line and create a hash having as KEY each Sample name and adding the value of each triplicate every time i find the same name to its VALUE. At the end I will divide the value/3 and output the key->value in the output file.
To keep the records organised and in order I will read the input file again and print key->value from the hash for each line.

Is there an easier way to do this? Because with hashes I will lose the ordering of the columns in the file and it will get a bit messy I'm afraid.

Thank you for any input.

pamu · December 13, 2012, 4:22am

Try....

awk 'function print_data(){
        for(i=1;i<=NF;i++){if(i%2){k=k?k"\t"P[s,i]:P[s,i]}else{p=sprintf("%.02f", P[s,i]/A);k=k"\t"p}}print k;k="";
}
NR==1
{A[$1]++;for(i=1;i<=NF;i++){if(i%2){P[$1,i]=$i}else{P[$1,i]+=$i}}}
NR>1{if(s!=$1 && s){print_data()};s=$1}END{print_data()}' file

Sample  Cq      Sample  Cq      Sample  Cq      Sample  Cq      Sample  Cq
1WBIN   23.84   1WBIN   23.84   1CVSIN  23.88   1CVSIN  23.21   S1      31.53
1CVSIN  22.94   1CVSIN  22.90   1CVSIP  22.19   1CVSIP  21.43   S10     31.17

isildur1234 · December 13, 2012, 6:05am

Thank you so much pamu.
It works just great with this file.
Is there any clear reason that you can see that it does not work properly with columns that have spaces in their names?
for example if i try this input:

Sample	Cq	Sample	Cq	Sample	Cq	Sample	Cq	Sample	Cq
1 WB IN	23.45	1 WB IN	23.45	1 CVS IN	23.96	1 CVS IN	23.14	S1	31.37
1 WB IN	23.53	1 WB IN	23.53	1 CVS IN	23.81	1 CVS IN	23.24	S1	31.49
1 WB IP	24.55	1 WB IN	24.55	1 CVS IN	23.86	1 CVS IN	23.24	S1	31.74
1 CVS IN	23.62	1 CVS IN	23.62	1 CVS IP	22.12	1 CVS IP	38.3	S10	31.13
1 CVS IN	23.46	I CVS IN	21.74	1 CVS IP	22.24	1 CVS IP	21.4	S10	31.1
1 CVS IN	21.74	1 CVS IN	23.33	1 CVS IP	22.22	1 CVS IP	21.36	S10	31.29

it only outputs the first line.

Many thanks again.

pamu · December 13, 2012, 6:39am

I think you have filed separator as tab then try

awk -F "\t" 'function print_data(){
        for(i=1;i<=NF;i++){if(i%2){k=k?k"\t"P[s,i]:P[s,i]}else{p=sprintf("%.02f", P[s,i]/A);k=k"\t"p}}print k;k="";
}
NR==1
{A[$1]++;for(i=1;i<=NF;i++){if(i%2){P[$1,i]=$i}else{P[$1,i]+=$i}}}
NR>1{if(s!=$1 && s){print_data()};s=$1}END{print_data()}' file

Sample  Cq      Sample  Cq      Sample  Cq      Sample  Cq      Sample  Cq
1 WB IN 23.49   1 WB IN 23.49   1 CVS IN        23.88   1 CVS IN        23.19   S1      31.43
1 WB IP 24.55   1 WB IN 24.55   1 CVS IN        23.86   1 CVS IN        23.24   S1      31.74
1 CVS IN        22.94   1 CVS IN        22.90   1 CVS IP        22.19   1 CVS IP        27.02   S10     31.17