Using AWK to match CSV files with duplicate patterns

Dear awk users,

I am trying to use awk to match records across two moderately large CSV files. File1 is a pattern file with 173,200 lines, many of which are repeated. The order in which these lines appear is important, and I would like to preserve it in the output. File2 is a data file with 456,000 unique lines.

File1.csv:

_Year01
23_01_192001
23_02_192001
23_01_192001
23_04_192001
23_05_192001
23_03_192001
23_02_192001
23_03_192001
23_05_192001
23_04_192001
23_05_192001
23_05_192001
23_06_192001
_192001
25_01_192001
25_02_192001
...

File2.csv:

23,01,192001,0.09,23.40,-0.79,0.,1252.,23_01_192001
23,03,192001,0.79,28.30,-0.63,0.,1110.,23_03_192001
23,04,192002,0.15,37.40,-0.98,0.,748.,23_04_192002
23,06,192002,1.42,38.70,2.78,0.,720.,23_06_192002
23,03,192002,0.54,34.30,-1.05,0.,832.,23_03_192002
23,02,192002,0.54,31.50,-1.04,0.,918.,23_02_192002
23,01,192002,0.77,30.60,-0.82,0.,935.,23_01_192002
23,05,192002,0.65,36.30,-1.00,0.,784.,23_05_192002
23,04,192003,5.18,45.10,0.58,6.,595.,23_04_192003
23,02,192003,5.24,42.30,0.94,0.,682.,23_02_192003

I want to extract the lines in File2 whose column 9 matches the alphanumeric key in File1, preserving the key order of File1. After reading several posts on this and other forums, I tried:

awk -F\, 'FNR==NR{a[$1]=$9;next}{print $0, a[$1]}' File1.csv File2.csv

which just prints out File2.csv, and

awk -F\, 'FNR==NR{a[$1]=$9;next}$9 in a' File1.csv File2.csv

which generates results in the wrong order.

I realize this type of question is simple and well-documented. I thought I had correctly grasped the logic of the program, but apparently I have not. I would be very grateful for pointers as to where I am going wrong.

-i

Did you try the grep command with the -f option?

grep -f File1.csv File2.csv

--ahamed

Thanks @ahamed101.

I did try grep -f, and there are two problems. First, with duplicate entries in the pattern file, grep still prints each matching line of File2 only once, and in File2's order, which destroys the ordering of File1.csv that I am trying to preserve. Second, grep is notoriously inefficient for this task: for 173k patterns I would need to split File1.csv into chunks in a loop and search each chunk against File2.csv, and even then searching >10k patterns at a time takes several seconds (other posts have profiled similar performance). The performance issue is not a total deal-breaker, but (a) I am going to have to perform a large number of these matches, some with bigger pattern files, and (b) awk is *fast*, so it would be great if I could find a more efficient solution.
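
For reference, the chunked workaround I was describing looks roughly like this (the 10k chunk size is arbitrary, and -F just treats the keys as fixed strings rather than regexes):

split -l 10000 File1.csv chunk_      # break the pattern file into 10k-line pieces
for f in chunk_*; do
    grep -F -f "$f" File2.csv        # search File2 with each chunk of patterns
done
rm -f chunk_*

And even this still returns matches in File2's order within each chunk, so it does nothing for the ordering problem.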

-i

It looks like you have File1 and File2 reversed in your awk statements: since you want the output in File1's order, read File2 first and index it by column 9, then look up each File1 line in that array (the structure of your first statement is otherwise close to right). You may also want to filter out the keys that start with "_", since they have no counterpart in File2.
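
Something along these lines should work (untested sketch; it assumes each File1 line is exactly the key to look up):

awk -F, '
    FNR==NR    { seen[$9] = $0; next }   # first file (File2): index each line by column 9
    /^_/       { next }                  # skip keys like _Year01 and _192001
    $0 in seen { print seen[$0] }        # second file (File1): print matches in File1 order
' File2.csv File1.csv

Duplicate keys in File1 simply print their matching File2 line again, so both the order and the multiplicity of the pattern file are preserved.

--ahamed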