I have this file input.txt. I want to take average column-wise for the rows having duplicate gene names.
Gene Sample_1 Sample_2 Sample_3
gene_A 2 4 5
gene_B 1 2 3
gene_A 0 5 7
gene_B 4 5 6
gene_A 11 12 13
gene_C 2 3 4
Desired output:
gene_A 4.3 7 8.3
gene_B 2.5 3.5 4.5
gene_C 2 3 4
Thanks in advance
This is by reading same file twice, you can also process this in END
block
$ cat file
Gene Sample_1 Sample_2 Sample_3
gene_A 2 4 5
gene_B 1 2 3
gene_A 0 5 7
gene_B 4 5 6
gene_A 11 12 13
gene_C 2 3 4
awk ' NR==1{
print
next
}
FNR==NR \
{
for(i=2;i<=NF;i++)
{
A[$1,i]+=$i
C[$1,i]++
} next
}
!x[$1]++ && FNR>1 \
{
for(i=2;i<=NF;i++)
printf "%s%s",(i==2?"" : OFS),A[$1,i]/C[$1,i];
printf RS
}
' OFS="\t" file file
Resulting
Gene Sample_1 Sample_2 Sample_3
4.33333 7 8.33333
2.5 3.5 4.5
2 3 4
---------- Post updated at 09:22 PM ---------- Previous update was at 09:13 PM ----------
This is processing in END
block reading file once
awk ' NR==1{
print
next
}
FNR==NR \
{
for(i=2;i<=NF;i++)
{
A[$1,i]+=$i
C[$1,i]++
} next
}
END {
for( i in A)
{
split(i,X,SUBSEP)
if(!(X[1] in x))
{
printf X[1] OFS
for(j=2;j<=NF;j++)
{
printf "%s%s",j==2?"":OFS,A[X[1],j]/C[X[1],j]
}
printf RS
x[X[1]]
}
}
}
' OFS="\t" file
---------- Post updated at 09:23 PM ---------- Previous update was at 09:22 PM ----------
If you don't care order use this..
1 Like
try also:
awk '
NR>1{l[$1]=$1; c[$1]++;
for (i=2; i<=NF; i++) a[$1,i]+=$i;
}
END {
for (g in l) {
printf g " ";
for (i=2; i<=NF; i++) printf ("%.1f ", (a[g,i]/c[g]));
print "";
}
}
' infile
1 Like
l[$1]=$1
is a useless value; l[$1]
alone defines the key (no value).
Or store the length, i.e. allow an individual length for each gene type:
awk '
NR>1 {
L[$1]=NF; c[$1]++
for (i=2; i<=NF; i++) a[$1,i]+=$i
}
END {
for (g in L) {
printf "%s", g
for (i=2; i<=L[g]; i++) printf " %s", (a[g,i]/c[g])
print ""
}
}
' infile
NB: the %s
format allows any cast from a number to a string; awk indeed seems to handle printf "%s\n", number
like print number
.
1 Like
Hi, I was using your second program, wanted to know how to run this program using a script? Just saving in a .sh file would work ?
yes, it would. Try to pass the file name correctly and handle it in code
1 Like
Hello,
Following may also help in same.
awk 'NR==FNR && NR>1{a[$1]+=$2;b[$1]++;c[$1]+=$3;d[$1]+=$3;e[$1]+=$4;next} ($1 in a){ {if(s[$1] == ""){{f=a[$1]/b[$1]; g=c[$1]/b[$1]; h=d[$1]/b[$1]; i=e[$1]/b[$1];s[$1]=1}; {print $1 OFS f OFS g OFS i}}}}' OFS="\t" filename filename
Output will be as follows.
gene_A 4.33333 7 8.33333
gene_B 2.5 3.5 4.5
gene_C 2 3 4
Thanks,
R. Singh