compare columns from seven files and print the output

smriti_shridhar · June 2, 2008, 7:37am

Hi guys,
I need some help to come out with a solution . I have seven such files but I am showing only three for convenience.

filea
a5 20
a8 16

fileb
a3 42
a7 14

filec
a5 23
a3 07

The output file shoud contain the data in table form showing first field of each file with their second field(score) in each file.

ID filea fileb filec
a5 20 00 23
a8 16 00 00
a3 00 42 07
a7 00 14 00

Your help is highly appretiated.

-Smriti

era · June 3, 2008, 2:27am

Perhaps the simple solution would be to write a simple script to canonicalize each of those files, so they all have the same labels. Then it's easy to e.g. paste them all side by side, and cut the columns you actually want.

jaduks · June 3, 2008, 7:40am

Not sure if this is the expected output.

$ cat filea
a5 20
a8 16

$ cat fileb
a3 42
a7 14

$ cat filec
a5 23
a3 07

$ cat filea fileb filec > filex

$ cat filex
a5 20
a8 16
a3 42
a7 14
a5 23
a3 07

$ awk '
!arr[$1] {arr[$1] = $0; next}
{arr[$1] = arr[$1] " " $2}
END {for(i in arr) {print arr}}
' filex

a3 42 07
a5 20 23
a7 14
a8 16

or 

$ awk '{Arr[$1]=sprintf("%s %s",Arr[$1],$2)} END {for ( i in Arr) {printf("%s %s\n",i,Arr)}}' filex
a3  42 07
a5  20 23
a7  14
a8  16

//Jadu

smriti_shridhar · June 5, 2008, 12:28am

Thanks era,

I suppose I am not getting what u want to convey. Plz make it more clear and I want to repeat that its important for me to know that second field i.e. the scores are coming from which file in the final output, that's why I want a -- or 00 showing absence of score from a particular file if the ID is repeated.

smriti_shridhar · June 5, 2008, 12:35am

jaduks:

Not sure if this is the expected output.

$ cat filea
a5 20
a8 16

$ cat fileb
a3 42
a7 14

$ cat filec
a5 23
a3 07

$ cat filea fileb filec > filex

$ cat filex
a5 20
a8 16
a3 42
a7 14
a5 23
a3 07

$ awk '
!arr[$1] {arr[$1] = $0; next}
{arr[$1] = arr[$1] " " $2}
END {for(i in arr) {print arr}}
' filex

a3 42 07
a5 20 23
a7 14
a8 16

or 

$ awk '{Arr[$1]=sprintf("%s %s",Arr[$1],$2)} END {for ( i in Arr) {printf("%s %s\n",i,Arr)}}' filex
a3  42 07
a5  20 23
a7  14
a8  16

//Jadu

Thanks for replying,

This won't solve my problem as I need an ordered way where it sholud be clear that which score belongs to which file n if I'll cat that identity will be lost and I wouln't knw if '42' belonged to file a,b or c in the following output.

a3 42 07
a5 20 23
a7 14

your help is really appretiated.

era · June 5, 2008, 2:36am

What I was trying to suggest was that you would change the input files so they have an explicit value for each possible label. So for example filec would become

(Note also the reordering of the fields a3 and a5.)

Once you have that, the rest should be trivial. But maybe modifying the files (or maintaining a modified duplicate for each input file) isn't a very elegant solution.

radoulov · June 5, 2008, 7:46am

Use nawk or /usr/xpg4/bin/awk on Solaris.

awk '{
if (!_[$1]++) id[++n] = $1
fid[FILENAME,$1] = $2
if (FNR == 1) fn[++c] = FILENAME
} END {
  printf "id\t"
  for (i=1; i<=c; i++)
    printf "%s\t", fn
  print
  for (j=1; j<=n; j++) {
    printf "%s\t", id[j]
    for (i=1; i<=c; i++)
      printf "%s\t", (fn SUBSEP id[j]) in fid ? fid[fn SUBSEP id[j]] : "00"  
    print  
    }
}' file*

With your files:

$ head file*
==> filea <==
a5 20
a8 16

==> fileb <==
a3 42
a7 14

==> filec <==
a5 23
a3 07
$ nawk '{
if (!_[$1]++) id[++n] = $1
> if (!_[$1]++) id[++n] = $1
> fid[FILENAME,$1] = $2
> if (FNR == 1) fn[++c] = FILENAME
> } END {
    printf "%s\t", id[j]
>   printf "id\t"
>   for (i=1; i<=c; i++)
>     printf "%s\t", fn
>   print
>   for (j=1; j<=n; j++) {
>     printf "%s\t", id[j]
>     for (i=1; i<=c; i++)
>   printf "%s\t", (fn SUBSEP id[j]) in fid ? fid[fn SUBSEP id[j]] : "00"
> print
> }
> }' file*
id      filea   fileb   filec
a5      20      00      23
a8      16      00      00
a3      00      42      07
a7      00      14      00

smriti_shridhar · June 11, 2008, 12:22am

Thanks Radoulov for your support.

It solved my problem.