a a 1
a b 2
a c 4
b a 2
b b 1
b c 7
c a 4
c b 7
c c 1
Now the desired output would be:
a b c
a 1 2 4
b 2 1 7
c 4 7 1
My data set is similar to it but I have 25 variables(in this example, I have 3 variables a, b and c). Basically output would be a similarity matrix with two same halves. Thanks for the reply.
awk '
{a[$1,$2]=$3; b[$1];}
END{
for(i in b) #print header
printf("\t%s",i);
printf("\n");
for(i in b) { #each row
printf("%s\t",i);
for(j in b) #each column
printf("%d\t", a[i,j]);
printf("\n");
}
}' SUBSEP="\t" input.txt
Not that the rows/columns are not sorted.
But honestly, for this kind of data-mangling job, you are better off using some other data processing tool. E.g. with GNU R and "reshape" package it's piece of cake:
d = read.table("mydata.txt")
library(reshape)
cast(d, V1 ~ V2 )
As always, if you're using a Solaris/SunOS system, use /usr/xpg4/bin/awk or nawk instead of awk .
With the input file given in the 1st message in this thread, it produces the output:
$ awk '{A[$1,$2]=$3;B[$1]++;C[$2]++}END{for(i in C){s=s?s"\t"i:"\t"i}
print s;
for(i in B){s=i;
for(j in C){s=s"\t"A[j,i]}
print s}}' file
a b c
a 1 2 4
b 2 1 7
c 4 7 1
When I tried your script it printed 0s and 1s instead of column 3 values. Anyway I followed your suggestion to use R and it worked fine for me. Thanks.