Common records using AWK

Hi,

To be honest, I am really impressed and amazed at the pace at which I find solutions to unsolved coding mysteries in this forum.

I have two input files like this:

input1.txt

x y z 1 2 3 
a b c 4 -3 7
k l m n 0 p
1 2 a b c 4 

input2.txt

x y z 9 0 -1
a b c 0 6 9
k l m 8 o p
1 2 a f x 9

Output

x y z 1 2 3 9 0 -1
a b c 4 -3 7 0 6 9
k l m n 0 p 8 o p
1 2 a b c 4 f x 9

The number of columns might change. To keep it simple: my final output file should contain the columns common to both files, followed by their respective varying columns side by side.

I tried the a[$1]=$2; next approach in awk, but it doesn't work. Can a join do this?

Any help is appreciated.

Thanks in advance
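
To the "Can a join do this?" question: it can, with a bit of juggling, because join only keys on a single field. Below is a rough, untested sketch, assuming bash, whitespace-separated fields, and that the glue character "_" never occurs in the key fields. Note that join needs its inputs sorted on the key, so the output rows come out in key order, and only keys present in both files survive.

# glue the first three fields into a single key, drop them from the record,
# and sort on that key so join will accept the input
key() { awk '{k=$1"_"$2"_"$3; $1=$2=$3=""; print k $0}' "$1" | sort -k1,1; }

# join the two keyed streams, then split the key back into three columns
join -j 1 <(key input1.txt) <(key input2.txt) |
awk '{split($1, k, "_"); $1 = k[1] " " k[2] " " k[3]; print}'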

Try this script:

#!/bin/bash
FILE1="input1.txt"
FILE2="input2.txt"
# transpose each file so that columns become lines
TR1=`awk '{for (i=1;i<=NF;i++){a[i]=a[i]" "$i}}END{for (i=1;i<=NF;i++){print a[i]}}' $FILE1`
TR2=`awk '{for (i=1;i<=NF;i++){a[i]=a[i]" "$i}}END{for (i=1;i<=NF;i++){print a[i]}}' $FILE2`
# columns common to both files, then those only in FILE1, then those only in FILE2
TR3=`comm -12 <(echo "$TR1"|sort) <(echo "$TR2"|sort)`
TR3="$TR3\n"`comm -23 <(echo "$TR1"|sort) <(echo "$TR2"|sort)`
TR3="$TR3\n"`comm -13 <(echo "$TR1"|sort) <(echo "$TR2"|sort)`
# transpose back into rows
OUT=`echo -e "$TR3" | awk '{for (i=1;i<=NF;i++){a[i]=a[i]" "$i}}END{for (i=1;i<=NF;i++){print a[i]}}'`
echo "$OUT"
awk '{
  ind=sprintf("%s %s %s", $1, $2, $3)                    # first three fields form the key
  str[ind]=sprintf("%s %s %s %s", str[ind], $4, $5, $6)  # collect fields 4-6 from each file
}
END {
  for (i in str) {
    printf("%s%s\n", i, str[i])
  }
}' input1.txt input2.txt

Another awk solution:

awk 'NR==FNR{a[$1,$2,$3]=$0; next} {printf a[$1,$2,$3]; $1=$2=$3=""; gsub(/^ */," ",$0); print $0}' file1 file2

Alternatively:

 cut -d" " -f4- file2 | paste -d" " file1 -

The awk solution works on unsorted files as well, whereas the cut/paste one assumes both files list their records in the same order. If input2's lines were shuffled, the keyed awk would still pair the right records, while cut | paste would blindly combine line 1 with line 1, line 2 with line 2, and so on.


Works perfectly.

---------- Post updated at 10:10 AM ---------- Previous update was at 10:09 AM ----------

I tried the awk; it works great. The cut command does the same thing, just pasting the matching columns side by side. Both work fine, but I preferred the awk.

Life wouldn't have been easy without unix.com.

---------- Post updated at 05:08 PM ---------- Previous update was at 10:10 AM ----------

Can someone tell me how to do this when there are more than 3 columns per record but I still want to match on the first three columns only?

If my example changes to

1.txt

x y z 1 2 3 4 5 6 7 8 9
a b c d e f 9 7 8 9 90 1

2.txt

x y z 2 4 5 6 7 8 1 0 0
a b c g h i 9 3 1 4 5 6

Output

x y z 1 2 3 4 5 6 7 8 9 2 4 5 6 7 8 1 0 0
a b c d e f 9 7 8 9 90 1 g h i 9 3 1 4 5 6

I tried the following, adapted from the previous response, but had no luck:

awk '{ ind=sprintf("%s %s %s", $1, $2, $3); str[ind]=sprintf("%s %s %s %s %s %s %s %s", str[ind], $4, $5, $6, $7, $8, $9, $10) } END { for (i in str) { printf("%s%s\n", i, str[i]) } }' 1.txt 2.txt

In my previous post, you have two solutions.

I tried both of them. The awk script matches the first three columns but prints only the 4th, 5th and 6th columns. What I need is to match on the first three columns and print all the remaining columns, no matter how many there are.

The cut command just pastes each record against the record in the same position, which was not helpful.

Appreciate your time. Thanks in advance.

What exactly is the output when you run the commands? Can you please post it? Also, what is your system?
It works on Linux with GNU awk and cut:

$ cat 1.txt 
x y z 1 2 3 4 5 6 7 8 9
a b c d e f 9 7 8 9 90 1
$ cat 2.txt 
x y z 2 4 5 6 7 8 1 0 0
a b c g h i 9 3 1 4 5 6
$ awk 'NR==FNR{a[$1,$2,$3]=$0; next} {printf a[$1,$2,$3]; $1=$2=$3=""; gsub(/^ */," ",$0); print $0}' 1.txt 2.txt
x y z 1 2 3 4 5 6 7 8 9 2 4 5 6 7 8 1 0 0
a b c d e f 9 7 8 9 90 1 g h i 9 3 1 4 5 6
$ cut -d" " -f4- 2.txt | paste -d" " 1.txt -
x y z 1 2 3 4 5 6 7 8 9 2 4 5 6 7 8 1 0 0
a b c d e f 9 7 8 9 90 1 g h i 9 3 1 4 5 6

Thank You.

Works like a charm.

---------- Post updated at 04:27 PM ---------- Previous update was at 03:07 PM ----------

I was wondering what if there are more than two files.

Any thoughts?

Yes. You could do it like this:

awk 'NR==FNR{a[NR]=$0; next} {$1=$2=$3=""; gsub(/^ */," ",$0); a[FNR]=a[FNR] $0}END{for(i=1; i<=FNR; i++) print a[i];}' *.txt

or a shell loop:

cp first.txt out.txt
 for i in *.txt ; do 
   [[ "$i" = first.txt ]] && continue
   cut -d" " -f4- "$i" | paste -d" " out.txt - > tmpFile
   mv tmpFile out.txt
 done

But this is the most fun way of doing it -- recursively:

#!/bin/bash

function myPaste {
    # base case: only one file left, just print it
    if [[ $# -le 1 ]] ; then
        cat "$1"
        return
    fi
    arg=$1
    shift

    # merge the remaining files recursively, strip their first three
    # columns and paste the result after the first file
    myPaste "$@" | cut -d" " -f4- | paste -d" " "$arg" -
}

myPaste "$@"

You would save this as myPaste.sh, make it executable and then call it:

./myPaste.sh *.txt

Note that all of the above are dumb: they do not check whether the first 3 columns are actually the same, they just strip the first three columns no matter what. The awk solution for 2 files keyed on [$1,$2,$3] works on unsorted files as well and actually matches the first three fields. It could be extended to n files; you would just need one extra loop in the END block to walk the keys in order -- see the sketch below.
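
A rough, untested sketch of that n-file extension, keyed on the first three fields and keeping the row order of the first file; 3.txt is just a hypothetical third file with the same layout:

awk '
FNR == 1 { nfile++ }                              # count which file we are reading
{
  key = $1 SUBSEP $2 SUBSEP $3
  if (nfile == 1) {                               # first file: remember row order, keep whole line
    order[++n] = key
    out[key] = $0
    next
  }
  $1 = $2 = $3 = ""                               # later files: drop the key fields ...
  gsub(/^ */, " ", $0)
  out[key] = out[key] $0                          # ... and append the rest
}
END {
  for (i = 1; i <= n; i++) print out[order[i]]    # the extra loop: print in the first file's order
}' 1.txt 2.txt 3.txt

It assumes every key shows up in every file; a missing key would simply leave that file's columns out of the merged line.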
