Using AWK to match CSV files with duplicate patterns

Dear awk users,

I am trying to use awk to match records across two moderately large CSV files. File1 is a pattern file with 173,200 lines, many of which are repeated. The order in which these lines appear is important, and I would like to preserve it in the output. File2 is a data file with 456,000 unique lines.

File1.csv:

_Year01
23_01_192001
23_02_192001
23_01_192001
23_04_192001
23_05_192001
23_03_192001
23_02_192001
23_03_192001
23_05_192001
23_04_192001
23_05_192001
23_05_192001
23_06_192001
_192001
25_01_192001
25_02_192001
...

File2.csv:

23,01,192001,0.09,23.40,-0.79,0.,1252.,23_01_192001
23,03,192001,0.79,28.30,-0.63,0.,1110.,23_03_192001
23,04,192002,0.15,37.40,-0.98,0.,748.,23_04_192002
23,06,192002,1.42,38.70,2.78,0.,720.,23_06_192002
23,03,192002,0.54,34.30,-1.05,0.,832.,23_03_192002
23,02,192002,0.54,31.50,-1.04,0.,918.,23_02_192002
23,01,192002,0.77,30.60,-0.82,0.,935.,23_01_192002
23,05,192002,0.65,36.30,-1.00,0.,784.,23_05_192002
23,04,192003,5.18,45.10,0.58,6.,595.,23_04_192003
23,02,192003,5.24,42.30,0.94,0.,682.,23_02_192003

I want to extract the lines in File2 whose column 9 matches the alphanumeric key in File1, preserving the key order of File1. After reading several posts on this and other forums, I tried:

awk -F\, 'FNR==NR{a[$1]=$9;next}{print $0, a[$1]}' File1.csv File2.csv

which just prints out File2.csv, and

awk -F\, 'FNR==NR{a[$1]=$9;next}$9 in a' File1.csv File2.csv

which generates results in the wrong order.

I realize this type of question is simple and well-documented. I thought I had correctly grasped the logic of the program, but apparently I have not. I would be very grateful for pointers as to where I am going wrong.

-i

Did you try the grep command with the -f option?

grep -f File1.csv File2.csv

--ahamed

Thanks @ahamed101.

I did try grep -f, and there are two problems. First, with duplicate entries in the pattern file, grep still prints each matching line of File2 only once, and in File2's order, which destroys the ordering of File1.csv that I am trying to preserve. Second, grep is notoriously inefficient for this task: for 173k patterns I would need to split File1.csv into chunks in a loop and search each chunk against File2.csv, and even then searching >10k patterns at a time takes several seconds (other posts have profiled similar performance). The performance issue is not a total deal-breaker, but (a) I am going to have to perform a large number of these matches, some with bigger pattern files, and (b) awk is *fast*, so it would be great if I could find a more efficient solution.
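
For reference, the chunked workaround I was describing looks roughly like this (the 10k chunk size is arbitrary, and -F just treats the keys as fixed strings rather than regexes):

split -l 10000 File1.csv chunk_      # break the pattern file into 10k-line pieces
for f in chunk_*; do
    grep -F -f "$f" File2.csv        # search File2 with each chunk of patterns
done
rm -f chunk_*

And even this still returns matches in File2's order within each chunk, so it does nothing for the ordering problem.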

-i

It looks like you have File1 and File2 reversed in your awk statements: since you want the output in File1's order, read File2 first and index it by column 9, then look up each File1 line in that array (the structure of your first statement is otherwise close to right). You may also want to filter out the keys that start with "_", since they have no counterpart in File2.
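
Something along these lines should work (untested sketch; it assumes each File1 line is exactly the key to look up):

awk -F, '
    FNR==NR    { seen[$9] = $0; next }   # first file (File2): index each line by column 9
    /^_/       { next }                  # skip keys like _Year01 and _192001
    $0 in seen { print seen[$0] }        # second file (File1): print matches in File1 order
' File2.csv File1.csv

Duplicate keys in File1 simply print their matching File2 line again, so both the order and the multiplicity of the pattern file are preserved.

--ahamed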