Merging very large CSV files in Unix

Hi,

I have two very large CSV files, which I want to merge (equi-join) based on a key (column).

One of the files (say F1) has ~30 million records and 700 columns. The other file (say F2) has the same number of records but fewer columns (say 50). I want to create an output file by joining the two on a column common to F1 and F2.

Something like:

F1 =>

Key V1 .. V600
1111 .................
2222 .................
3333 .................

F2 =>

Key L1 .. L50
2222 .................
1111 .................
3333 .................

The merged file would be:

Key V1 .. V600 L1 .. L50
1111 .................
2222 .................
3333 .................

Please note that the files are not sorted.

Any insights would be appreciated.

Thank you!
-V

Could this help you?

 
awk -F"," 'NR==FNR{temp=$1;$1="";a[temp]=$0;next}
a[$1]{print $0,a[$1]}' file2.csv file1.csv
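
One caveat: this holds all of file2.csv in memory, which is why the smaller file is read first. At ~30 million rows x 50 columns the array can still run to several GB, so make sure the box has the RAM for it.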

Try man join (Linux).

EDIT: Although you would need to sort the files...
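
Something along these lines (an untested sketch; it assumes the key is the first column, a plain comma delimiter with no quoted commas, and files named F1.csv/F2.csv):

# Force one collation so sort and join agree on ordering
export LC_ALL=C
sort -t',' -k1,1 F1.csv > F1.sorted
sort -t',' -k1,1 F2.csv > F2.sorted
join -t',' -1 1 -2 1 F1.sorted F2.sorted > merged.csv

sort(1) spills to temporary files on disk, so it copes with inputs bigger than RAM, and join then streams both sorted files a line at a time.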

Hi Pravin27,

Thank you. I will try it and circle back with feedback.

Hi CarloM,

I would prefer to avoid sorting, given the amount of data.

Thanks again to both of you.

-V