Matching by key fields

I have a file (key.dat) that contains two columns:

AA|1234|
BB|567|
CC|8910|

I have another file (extract.dat) that contains some data:

SD|458|John|Smith|
AA|3345|Frank|Williams|
AA|1234|Bill|Garner|
BD|0098|Yu|Lin|
BB|567|Gail|Hansen|
CC|8910|Ken|Nielsen|

I want to compare the two files on the first two columns (an inner join) and print the matching lines of extract.dat (see below):

AA|1234|Bill|Garner|
BB|567|Gail|Hansen|
CC|8910|Ken|Nielsen|

Any help is most appreciated.
Thanks,

  • CB

Hi.

For the case posted, a modern fixed-string grep, fgrep, suffices:

#!/usr/bin/env bash

# @(#) s1       Demonstrate special case of join, grep used to extract.

echo
set +o nounset
LC_ALL=C ; LANG=C ; export LC_ALL LANG
echo "Environment: LC_ALL = $LC_ALL, LANG = $LANG"
echo "(Versions displayed with local utility \"version\")"
version >/dev/null 2>&1 && version "=o" $(_eat $0 $1) fgrep
set -o nounset
echo

FILE1=data1
FILE2=data2

echo " Data file $FILE1:"
cat $FILE1

echo
echo " Data file $FILE2:"
cat $FILE2

echo
echo " Results:"
fgrep -f $FILE1 $FILE2

exit 0

Producing:

% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.11-x1, i686
Distribution        : Xandros Desktop 3.0.3 Business
GNU bash 2.05b.0
fgrep (GNU grep) 2.5.1

 Data file data1:
AA|1234|
BB|567|
CC|8910|

 Data file data2:
SD|458|John|Smith|
AA|3345|Frank|Williams|
AA|1234|Bill|Garner|
BD|0098|Yu|Lin|
BB|567|Gail|Hansen|
CC|8910|Ken|Nielsen|

 Results:
AA|1234|Bill|Garner|
BB|567|Gail|Hansen|
CC|8910|Ken|Nielsen|

See man fgrep for details ... cheers, drl

man grep

grep -f key.dat extract.dat
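One note on that command (my observation, not danmero's): plain grep reads the lines of key.dat as basic regular expressions. In BREs the "|" is a literal character, so it happens to work here, but grep -F (the modern spelling of fgrep) forces fixed-string matching and is typically faster with many patterns. A sketch, recreating the question's files just so it runs standalone:

```shell
# Build the sample files from the question.
cat > key.dat <<'EOF'
AA|1234|
BB|567|
CC|8910|
EOF

cat > extract.dat <<'EOF'
SD|458|John|Smith|
AA|3345|Frank|Williams|
AA|1234|Bill|Garner|
BD|0098|Yu|Lin|
BB|567|Gail|Hansen|
CC|8910|Ken|Nielsen|
EOF

# -F: treat each line of key.dat as a fixed string (same as fgrep),
# so "|" is matched literally, never as a regex metacharacter.
grep -F -f key.dat extract.dat
```

Be aware that grep matches a key anywhere in the line, not only at the start; with keys shaped like "AA|1234|" that distinction rarely matters, but it is not a true field-wise join.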

Thanks danmero. This works very well for a small dataset. I wonder how well it will perform with an 800-million-record extract file and a 300K-record key file.

Thanks,

  • CB

Use GNU awk (gawk), new awk (nawk), or POSIX awk (/usr/xpg4/bin/awk):

awk -F\| '
NR == FNR { _[$1,$2]; next }
($1,$2) in _
' key.dat extract.dat 
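To make that concrete, here is a self-contained run with the question's sample data written out first (comments added; the awk logic is unchanged):

```shell
# Recreate the question's key and extract files.
cat > key.dat <<'EOF'
AA|1234|
BB|567|
CC|8910|
EOF

cat > extract.dat <<'EOF'
SD|458|John|Smith|
AA|3345|Frank|Williams|
AA|1234|Bill|Garner|
BD|0098|Yu|Lin|
BB|567|Gail|Hansen|
CC|8910|Ken|Nielsen|
EOF

# NR == FNR holds only while the first file (key.dat) is being read:
# store each (field1, field2) pair as an array subscript, then skip to
# the next record.  For extract.dat, print any line whose first two
# fields form a stored key -- an inner join on columns 1 and 2.
awk -F'|' '
NR == FNR { _[$1,$2]; next }
($1,$2) in _
' key.dat extract.dat
```

Unlike the grep approach, this compares fields exactly, so a key can never match in the middle of a line.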

Hi, CB.

You might benchmark these approaches to see which is best for your situation.

I would start with the smallest practical set to see what the overhead might be, then perhaps 1, 2, 5, and 10% of the data to see how the run time grows. Many complex solutions do not scale linearly. I think both will be somewhat memory-intensive if you have 300K lines in your key.dat file.
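A sketch of that sampling idea, using synthetic stand-in files (the sizes and the K-prefixed keys are my invention, scaled down just to have something measurable):

```shell
# Generate small stand-ins for key.dat and extract.dat
# (hypothetical data; scale the counts up toward your real sizes).
awk 'BEGIN { for (i = 0; i < 1000; i++) printf "K%d|%d|\n", i, i * 7 }' \
    > key.dat
awk 'BEGIN { srand(42)
             for (i = 0; i < 100000; i++) {
                 k = int(rand() * 5000)
                 printf "K%d|%d|First%d|Last%d|\n", k, k * 7, i, i
             } }' > extract.dat

# Trial slice: roughly 1% of the extract; raise pct to 2, 5, 10
# and watch how the timings grow.
total=$(wc -l < extract.dat)
pct=1
n=$(( total * pct / 100 ))
[ "$n" -ge 1 ] || n=1
head -n "$n" extract.dat > sample.dat

time grep -F -f key.dat sample.dat > /dev/null
time awk -F'|' 'NR == FNR { _[$1,$2]; next } ($1,$2) in _' \
    key.dat sample.dat > /dev/null
```

The awk variant holds the whole key set in memory but scans the extract once; grep -F with many patterns may behave very differently across versions, which is exactly why measuring on slices first is worthwhile.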

Keep us posted ... cheers, drl

PS: I used fgrep to avoid any interpretation of special characters in your patterns. The sample you provided used only "|", which egrep would interpret as alternation, but I don't know how representative your sample was.
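For instance (a contrived key, not from CB's data), a key containing "." behaves differently under regex grep and fixed-string grep:

```shell
# A key containing "." -- as a regex, "." matches any character.
printf 'A.|12|\n' > key.regex
printf 'A.|12|x|\nAB|12|y|\n' > data.txt

# Plain grep treats the key as a regex: "." is a wildcard,
# so BOTH lines match.
grep -f key.regex data.txt

# grep -F (fgrep) matches the key literally:
# only the first line matches.
grep -F -f key.regex data.txt
```

With real keys like "AA|1234|" the two agree, but fixed-string matching removes any doubt.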