I have a file (key.dat) that contains two columns:
AA|1234|
BB|567|
CC|8910|
I have another file (extract.dat) that contains some data:
SD|458|John|Smith|
AA|3345|Frank|Williams|
AA|1234|Bill|Garner|
BD|0098|Yu|Lin|
BB|567|Gail|Hansen|
CC|8910|Ken|Nielsen|
I want to compare the two files on the first two columns (an inner join) and print the matching lines of extract.dat (see below):
AA|1234|Bill|Garner|
BB|567|Gail|Hansen|
CC|8910|Ken|Nielsen|
Any help is most appreciated.
Thanks,
drl
February 1, 2009, 8:16am
2
Hi.
For the case posted, a modern fixed-string grep, fgrep, suffices:
#!/usr/bin/env bash
# @(#) s1 Demonstrate special case of join, grep used to extract.
echo
set +o nounset
LC_ALL=C ; LANG=C ; export LC_ALL LANG
echo "Environment: LC_ALL = $LC_ALL, LANG = $LANG"
echo "(Versions displayed with local utility \"version\")"
version >/dev/null 2>&1 && version "=o" $(_eat $0 $1) fgrep
set -o nounset
echo
FILE1=data1
FILE2=data2
echo " Data file $FILE1:"
cat $FILE1
echo
echo " Data file $FILE2:"
cat $FILE2
echo
echo " Results:"
fgrep -f $FILE1 $FILE2
exit 0
Producing:
% ./s1
Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.11-x1, i686
Distribution : Xandros Desktop 3.0.3 Business
GNU bash 2.05b.0
fgrep (GNU grep) 2.5.1
Data file data1:
AA|1234|
BB|567|
CC|8910|
Data file data2:
SD|458|John|Smith|
AA|3345|Frank|Williams|
AA|1234|Bill|Garner|
BD|0098|Yu|Lin|
BB|567|Gail|Hansen|
CC|8910|Ken|Nielsen|
Results:
AA|1234|Bill|Garner|
BB|567|Gail|Hansen|
CC|8910|Ken|Nielsen|
See man fgrep for details ... cheers, drl
man grep
grep -f key.dat extract.dat
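One caveat with any pattern-file grep is that a key can match anywhere in a record, not only at the front. A small variation (the file name key.anchored is mine) anchors every key at the start of the line; plain grep uses basic regular expressions, so "|" stays literal and "^" anchors:

```shell
#!/bin/sh
# A key like "BB|567|" could in principle match in the middle of a line;
# prefixing each pattern with "^" rules that out.
sed 's/^/^/' key.dat > key.anchored
grep -f key.anchored extract.dat
```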
Thanks danmero. This works very well for a small dataset. I wonder how well it will perform with an 800-million-record extract.dat and a 300K-record key.dat.
Thanks,
Use GNU awk (gawk), New awk (nawk)
or POSIX awk (/usr/xpg4/bin/awk):
awk -F\| '
NR == FNR { _[$1,$2]; next }    # first file: remember each key pair
($1,$2) in _                    # second file: print lines whose key was seen
' key.dat extract.dat
drl
February 1, 2009, 12:52pm
6
Hi, CB.
You might benchmark these approaches to see which is best for your situation.
I would start with the smallest practical set to gauge the overhead, then perhaps 1, 2, 5, and 10% of the data to see how the run time grows; many such solutions do not scale linearly. I think both will be somewhat memory intensive if you have 300K lines in your key.dat file.
Keep us posted ... cheers, drl
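If holding 300K keys in memory turns out to be a problem, a sort-merge with the standard join utility is another option. join handles only a single join field, so the sketch below (the *.sorted file names are mine) first builds a composite key from the two columns, and it assumes the data contains no tab characters; note the output comes back in key order, not the original order of extract.dat:

```shell
#!/bin/sh
# Build a single tab-separated join field from columns 1 and 2,
# sort both files on it, join, then strip the helper field.
awk -F'|' '{print $1 SUBSEP $2 "\t" $0}' key.dat     | LC_ALL=C sort > key.sorted
awk -F'|' '{print $1 SUBSEP $2 "\t" $0}' extract.dat | LC_ALL=C sort > extract.sorted
join -t "$(printf '\t')" key.sorted extract.sorted | cut -f3
```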
PS I used fgrep to avoid any interpretation of special characters in your patterns. The sample you provided used only "|", which egrep would treat as a metacharacter (alternation), but I don't know how representative your sample was.
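To illustrate the difference with made-up files (key.dot and data.dot are mine): under plain grep a dot in a key matches any character, while grep -F (the same as fgrep) takes the key literally:

```shell
#!/bin/sh
# "A.|1|" as a basic regular expression matches "AX|1|" as well;
# as a fixed string it matches only itself.
printf 'A.|1|\n' > key.dot
printf 'AX|1|x|\nA.|1|y|\n' > data.dot
grep    -f key.dot data.dot   # two matches: the "." matched "X"
grep -F -f key.dot data.dot   # one match: only the literal "A.|1|" line
```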