Average of a columns from three files

nans · August 31, 2017, 9:33am

hello, I have three files in the following order

==> File1 <==
1    20977000    20977000    A    C    1.00    0,15    15    45
1    115829313    115829313    G    A    0.500    6,7    13    99

==> File2 <==
1    20977000    20977000    A    C    1.00    0,13    13    39
1    115829313    115829313    G    A    0.500    7,7    14    99

==> File3 <==
1    20977000    20977000    A    C    1.00    0,16    16    48
1    115829313    115829313    G    A    0.500    7,10    17    99

I need to take the average of the 8th column from each file and output the average to the first file. So the results should look like

==> Results <==
1    20977000    20977000    A    C    1.00    0,15    15    45 14.6667
1    115829313    115829313    G    A    0.500    6,7    13    99 14.6667

I am able to calculate the average with this

awk '{a[FNR]=a[FNR]+$8;b[FNR]++;}END{for(i in a){print i,a/b}}' *.txt

but not able to write it into the first file. Any suggestions ?

RavinderSingh13 · August 31, 2017, 9:50am

Hello nans,

Could you please try following and let me know if this helps you.

awk 'FNR==NR && FNR==1{first_file=FILENAME} FNR==1 && val{close(val)} FNR==1{val=FILENAME} {a[FNR]+=$8;b[FNR]++;c[FNR]=$0}  END{for(i in a){print c,a/b > first_file}}' *.txt

Let me know how it goes then. Also I would say you could test this command in few test files, once you are happy with it then you could give it a shot for real files.

Thanks,
R. Singh

RudiC · August 31, 2017, 2:33pm

Try also (for the fun of it)

awk '
FNR == 1                {FC++
                        }
 NR == 1                {AC = ARGC
                         ARGV[ARGC++] = FILENAME
                        }
FNR == NR               {next
                        }
                        {SUM[FNR] += $8
                        }
 FC == AC               {print $0, SUM[FNR] / (AC-1)
                        } 

' file[123]
1    20977000    20977000    A    C    1.00    0,15    15    45 14.6667
1    115829313    115829313    G    A    0.500    6,7    13    99 14.6667

nans · September 4, 2017, 9:34am

Both the solutions worked perfect! Thank you

nans · September 7, 2017, 8:40am

Revisiting this post. Can we condition the command to say,
go through first two columns in each of the three files and if it matches (assuming the files are not in the same order) or even present in a single file (out of three), print the average of the 8th columns in the first file ?

==> File1 <==
1    115829313    115829313    G    A    0.500    6,7    13    99
1    20977000    20977000    A    C    1.00    0,17    17   45

==> File2 <==
1    20977000    20977000    A    C    1.00    0,15    15    45
1    115829313    115829313    G    A    0.500    7,7    14    99

==> File3 <==
1    115829313    115829313    G    A    0.500    7,10    17    99
1    209897000    209897000    T    C    1.00    0,16    16    48

==> Results <==
1    115829313    115829313    G    A    0.500    6,7    13    99   14.66
1    20977000    20977000    A    C    1.00    0,17    17   45   16
1    209897000    209897000    T    C    1.00    0,16    16    48   16

Thank you

RavinderSingh13 · September 7, 2017, 2:43pm

nans:

Revisiting this post. Can we condition the command to say,
go through first two columns in each of the three files and if it matches (assuming the files are not in the same order) or even present in a single file (out of three), print the average of the 8th columns in the first file ?
==> File1 <==
1    115829313    115829313    G    A    0.500    6,7    13    99
1    20977000    20977000    A    C    1.00    0,17    17   45

==> File2 <==
1    20977000    20977000    A    C    1.00    0,15    15    45
1    115829313    115829313    G    A    0.500    7,7    14    99

==> File3 <==
1    115829313    115829313    G    A    0.500    7,10    17    99
1    209897000    209897000    T    C    1.00    0,16    16    48
==> Results <==
1    115829313    115829313    G    A    0.500    6,7    13    99   14.66
1    20977000    20977000    A    C    1.00    0,17    17   45   16
1    209897000    209897000    T    C    1.00    0,16    16    48   16
Thank you

Hello nans,

Could you please try following and let me know if this helps you.

awk '
FNR==NR && FNR==1{
   first_file=FILENAME
}
FNR==1 && val{
   close(val)
}
FNR==NR{
   val=FILENAME;
   c[$2,$3]=$0;
   d[$2,$3]++;
   a[$2,$3]+=$8;
   b[$2,$3]++;
   next
}
!b[$2,$3]++{
   c[$2,$3]=$0
}
{
   d[$2,$3]++;
   a[$2,$3]+=$8;
   val=FILENAME
}
END{
   for(i in c){
       print c,(a/d)
}
}
' SUBSEP=" " File1 File2  File3

Above will only print the output on terminal, if you want to put the output into first Input_file1 named File1 then you could change print c,(a/d) to print c,(a/d) > first_file .

Thanks,
R. Singh