Compare two csv files by two colums and create third file combining data from them.

agb2008 · February 15, 2009, 2:57pm

I've got two large csv text table files with different number of columns each.
I have to compare them based on first two columns and create resulting file
that would in case of matched first two columns include all values from first one and all values (except first two colums) from second one. I searched forums and as well google - and was unable to find solution for that issue.

file1 example:

123 234 678 974 943
345 55 567 21 357
456 732 583 61 75

file2 example:

123 234 582 699 23 11 935
22 467 215 376 87 54
456 732 32 14

Expected resulting file:

123 234 678 974 943 582 699 23 11 935
456 732 583 61 75 32 14

just to compare I've been trying to use following code in awk, that I found in google:

awk 'NR==FNR {++a[$1,$2];next} a[$1,$2]' file1 file2 > file3

but I am stuck with understanding how to add data from second file for matched line...

rubin · February 15, 2009, 3:30pm

You can work on something like this:

awk  'NR==FNR { a[$1 FS $2]=$0; next } $1 FS $2 in a { printf a[$1 FS $2]; sub($1 FS $2,""); print }' file1 file2 > file3

summer_cherry · February 15, 2009, 10:14pm

nawk '{
if (NR==FNR)
	_[$1$2]=$0
else
{
	if (_[$1$2]!=""){
		printf("%s",_[$1$2])
		for(i=3;i<=NF;i++)
			printf(" %s",$i)
		printf "\n"
	}
}
}' file1 file2

agb2008 · February 16, 2009, 12:51am

rubin, summer_cherry,

Thank you very much both - your code help me to resolve issue that I had. If it is possible - could you please comment your code - to get better understanding how this script works... I am still learning awk.

rubin · February 16, 2009, 9:34am

NR==FNR {         # while in the first file, 
a[$1 FS $2]=$0    #+load array a indexed by first field, space ( default FS ), second field;  
next              #+then jump to the next record, without testing the following actions for the current record.  }
$1 FS $2 in a {   # Here starts the processing of file2. If field 1, space, and field 2 are found as indexes in array a, 
printf a[$1FS$2]  #+print the whole record ($0) of file 1 saved above in the array a ; 
sub($1 FS $2,"")  # substitute field 1, field 2 ( including the space in between ) with null string ;
print             # print the whole current record ( what's left ) of the second file.  }

drl · February 16, 2009, 10:26am

Hi.

You can also combine the first 2 fields to treat them as a single entity, use join (a sort is probably necessary), finally divide the first field into 2:

#!/usr/bin/env bash

# @(#) s2       Demonstrate join on 2 adjacent fields.

echo
set +o nounset
LC_ALL=C ; LANG=C ; export LC_ALL LANG
echo "Environment: LC_ALL = $LC_ALL, LANG = $LANG"
echo "(Versions displayed with local utility \"version\")"
version >/dev/null 2>&1 && version "=o" $(_eat $0 $1) sed join
set -o nounset
echo

FILE1=data1
FILE2=data2

echo " Data file $FILE1:"
cat $FILE1

echo
echo " Data file $FILE2:"
cat $FILE2

# Pre-process input / sort, join / post-process output.
echo
echo " Results:"
sed 's/ /_/' $FILE1 | sort -k1,1 >t1
sed 's/ /_/' $FILE2 | sort -k1,1 >t2

join -j 1 t1 t2 | sed 's/_/ /'

echo
echo " Expected output:"
cat expected-output.txt

exit 0

Producing:

$ ./s2

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: SunOS, 5.10, i86pc
GNU bash 3.00.16
sed - ( /usr/xpg4/bin/sed Aug 9 2005 )
join - ( /usr/bin/join Jan 22 2005 )

 Data file data1:
123 234 678 974 943
345 55 567 21 357
456 732 583 61 75

 Data file data2:
123 234 582 699 23 11 935
22 467 215 376 87 54
456 732 32 14

 Results:
123 234 678 974 943 582 699 23 11 935
456 732 583 61 75 32 14

 Expected output:
123 234 678 974 943 582 699 23 11 935
456 732 583 61 75 32 14

cheers, drl