Average values of duplicate rows

Sanchari · June 2, 2014, 10:24am

I have this file input.txt. I want to take average column-wise for the rows having duplicate gene names.

Gene Sample_1 Sample_2 Sample_3
gene_A 2 4 5
gene_B 1 2 3
gene_A 0 5 7
gene_B 4 5 6
gene_A 11 12 13
gene_C 2 3 4

Desired output:

gene_A 4.3 7 8.3
gene_B 2.5 3.5 4.5
gene_C 2 3 4

Thanks in advance

Akshay_Hegde · June 2, 2014, 10:53am

This is by reading same file twice, you can also process this in END block

$ cat file
Gene Sample_1 Sample_2 Sample_3
gene_A 2 4 5
gene_B 1 2 3
gene_A 0 5 7
gene_B 4 5 6
gene_A 11 12 13
gene_C 2 3 4

awk '   NR==1{
		print
		next
             }
 	FNR==NR \
	     {
		  for(i=2;i<=NF;i++)
		  {
		 	A[$1,i]+=$i
		 	C[$1,i]++ 
		  } next
             }
       !x[$1]++ && FNR>1 \
             {
		for(i=2;i<=NF;i++)
		printf "%s%s",(i==2?"" : OFS),A[$1,i]/C[$1,i];
		printf RS
	     }
    ' OFS="\t" file file

Resulting

Gene Sample_1 Sample_2 Sample_3
4.33333	7	8.33333
2.5	3.5	4.5
2	3	4

---------- Post updated at 09:22 PM ---------- Previous update was at 09:13 PM ----------

This is processing in END block reading file once

 awk '  NR==1{
		print
		next
             }
 	FNR==NR \
	     {
		  for(i=2;i<=NF;i++)
		  {
		 	A[$1,i]+=$i
		 	C[$1,i]++ 
		  } next
             }
         END {
		for( i in A)
		{
			split(i,X,SUBSEP)
			if(!(X[1] in x))
			{
				printf X[1] OFS
				for(j=2;j<=NF;j++)
				{
					printf "%s%s",j==2?"":OFS,A[X[1],j]/C[X[1],j]
				}
				printf RS
				x[X[1]]
			}
		}
	     }
       ' OFS="\t" file

---------- Post updated at 09:23 PM ---------- Previous update was at 09:22 PM ----------

If you don't care order use this..

rdrtx1 · June 2, 2014, 2:46pm

try also:

awk '
NR>1{l[$1]=$1; c[$1]++;
   for (i=2; i<=NF; i++) a[$1,i]+=$i;
}
END {
   for (g in l) {
      printf g " ";
      for (i=2; i<=NF; i++) printf ("%.1f ", (a[g,i]/c[g]));
      print "";
   }
}
' infile

MadeInGermany · June 3, 2014, 6:20am

l[$1]=$1 is a useless value; l[$1] alone defines the key (no value).
Or store the length, i.e. allow an individual length for each gene type:

awk '
NR>1 {
   L[$1]=NF; c[$1]++
   for (i=2; i<=NF; i++) a[$1,i]+=$i
}
END {
   for (g in L) {
      printf "%s", g
      for (i=2; i<=L[g]; i++) printf " %s", (a[g,i]/c[g])
      print ""
   }
}
' infile

NB: the %s format allows any cast from a number to a string; awk indeed seems to handle printf "%s\n", number like print number .

Sanchari · July 9, 2014, 4:04am

Hi, I was using your second program, wanted to know how to run this program using a script? Just saving in a .sh file would work ?

SriniShoo · July 9, 2014, 4:28am

yes, it would. Try to pass the file name correctly and handle it in code

RavinderSingh13 · August 23, 2014, 3:46am

Hello,

Following may also help in same.

awk 'NR==FNR && NR>1{a[$1]+=$2;b[$1]++;c[$1]+=$3;d[$1]+=$3;e[$1]+=$4;next} ($1 in a){ {if(s[$1] == ""){{f=a[$1]/b[$1]; g=c[$1]/b[$1]; h=d[$1]/b[$1]; i=e[$1]/b[$1];s[$1]=1}; {print $1 OFS f OFS g OFS i}}}}'  OFS="\t" filename filename

Output will be as follows.

gene_A  4.33333 7       8.33333
gene_B  2.5     3.5     4.5
gene_C  2       3       4

Thanks,
R. Singh