Sort and join multiple columns using awk

quincyjones · January 30, 2013, 1:23pm

Is it possible to join all the values after sorting them based on 1st column key and replace empty rows with 0 like below ?

input

a1	0	a1	1	a1	1	a3	1	b2	1
a2	1	a4	1	a2	1	a4	1	c4	1
a3	1	d1	1	a3	1	b1	1	d1	1
a4	1			c4	1	b2	1		
b1	1								
b2	1								
c4	1								
d1	1

output

a1	0	1	1	0	0
a2	1	0	1	0	0
a3	1	0	1	1	0
a4	1	1	0	1	0
b1	1	0	0	1	0
b2	1	0	0	1	1
c4	1	0	1	0	1
d1	1	1	0	0	1

neutronscott · January 30, 2013, 4:20pm

mute@clt:~/temp/quincyjones$ ./script
a1      0       1       1       0       0
a2      1       0       1       0       0
a3      1       0       1       1       0
a4      1       1       0       1       0
b1      1       0       0       1       0
b2      1       0       0       1       1
c4      1       0       1       0       1
d1      1       1       0       0       1

#!/bin/sh
awk 'BEGIN { FS=OFS="\t"; }
{
        if (NF > maxNF) maxNF=NF
        for (i=1;i<=NF;i+=2) {
                if (length($i) < 1) continue #skip blanks
                #new, map it
                if (!name_to_idx[$i]) {
                        idx_to_name[++idx]=$i
                        name_to_idx[$i]=idx
                }
                arr[$i,i]=$(i+1) #make matrix
        }
}
END {
        for (i=1;i<=idx;i++) {
                printf("%s",idx_to_name)
                for (j=1;j<maxNF;j+=2)
                        printf("%s%d", OFS, arr[idx_to_name,j])
                print "" #newline
        }
}' file | sort -k1,1

Yoda · January 30, 2013, 4:52pm

Here is another dirty approach using join command.

Since join command cannot accept more than 2 files at a time, we have to split the orignal file and perform the join operation:

awk -F'\t' '{ print $1,$2 > "file1"; print $3,$4 > "file2"; print $5,$6 > "file3"; print $7,$8 > "file4"; print $9,$10 > "file5"; }' filename
join -a1 -1 1 -2 1 -o 1.1 1.2 2.2 -e "0" file1 file2 > j1
join -a1 -1 1 -2 1 -o 1.1 1.2 1.3 2.2 -e "0" j1 file3 > j2
join -a1 -1 1 -2 1 -o 1.1 1.2 1.3 1.4 2.2 -e "0" j2 file4 > j3
join -a1 -1 1 -2 1 -o 1.1 1.2 1.3 1.4 1.5 2.2 -e "0" j3 file5

Scrutinizer · January 30, 2013, 5:39pm

Assuming the columns are sorted like in the sample, try:

awk '{s=$1; for(i=1; i<=5; i++) {A[$(i*2-1),i]=$(i*2); s=s OFS A[$1,i]+0} print s}' OFS='\t' file

Yoda · January 30, 2013, 6:02pm

Scrutinizer, I see some minor discrepancy in the o/p:

$ awk '{s=$1; for(i=1; i<=5; i++) {A[$(i*2-1),i]=$(i*2); s=s OFS A[$1,i]+0} print s}' OFS='\t' filename
a1      0       1       1       0       0                     
a2      1       0       1       0       0                     
a3      1       0       1       1       0                     
a4      1       1       0       1       0                     
b1      1       0       0       1       0                     
b2      1       0       1       0       1                    
c4      1       1       0       0       1
d1      1       1       0       0       1

EDIT: I noticed discrepancy even in my join command o/p.

EDIT: Got it fixed after setting the field separator to tab spaces -F'\t'

EDIT: Scrutinizer, setting the field separator to tab spaces for your awk code gives correct o/p

Scrutinizer · January 30, 2013, 6:15pm

OK, thanks, so:

awk '{s=$1; for(i=1; i<=5; i++) {A[$(i*2-1),i]=$(i*2); s=s OFS A[$1,i]+0} print s}' FS='\t' OFS='\t' file

quincyjones · January 31, 2013, 9:19am

@Scrutinizer: Is it possible to apply your code even on unsorted file ?
@Neutron: Your code making all 0.* values as 0 ?

@Neutron : Is it possible to define the number of value-columns for a key column, when I run the script? In the above example (1 key and 1 value). This example (1 key 5 value cols)

a1	0	1	1	0	1	a1	1	1	1	0	1	a1	1	1	1	1	1	a1	0	0	0	0	1	a1	1	1	1	1	1	1
a2	1	0	1	0	1	a4	1	1	1	1	1	a4	0	0	1	1	1	a4	0	0	0	0	1	a4	1	1	1	1	1	1
a3	1	0	1	0	1	d1	1	1	1	0	1	d1	1	1	0	0	1	d1	0	0	0	0	1	d1	1	1	1	1	1	1
a4	1	0	1	0	1	c4	1	1	1	0	0	c4	1	1	1	0	0	c4	0	0	0	0	1	c4	1	1	1	1	1	1
b1	1	0	1	0	1																									
b2	1	1	1	0	1																									
c4	1	1	1	0	1																									
d1	1	1	1	0	1

---------- Post updated at 09:19 AM ---------- Previous update was at 04:50 AM ----------

Scrutinizer · January 31, 2013, 10:14am

With unsorted columns you could try this with two passes to the same file:

awk '
  NR==FNR{
    for(i=1; i<=5; i++) A[$(i*2-1),i]=$(i*2)
    next
  }
  {
    s=$1
    for(i=1; i<=5; i++) s=s OFS A[$1,i]+0
    print s
  }
' FS='\t' OFS='\t' infile infile

quincyjones · January 31, 2013, 10:34am

Thanx. Is it possible to apply this to the 2nd example I mentioned > thanx in advance!