Merging very large CSV files in Unix


I have two very large CSV files that I want to merge (equi-join) on a key column.

One of the files (say F1) has ~30 million records and 700 columns. The other file (say F2) has the same number of records but fewer columns (say 50). I want to create an output file by joining the two on a column common to F1 and F2.

Something like:

F1 =>
Key V1 .. V600
1111 .................
2222 .................
3333 .................

F2 =>

Key L1 .. L50
2222 .................
1111 .................
3333 .................

The merged file would be:

Key V1 .. V600 L1 .. L50
1111 .................
2222 .................
3333 .................

Please note that the files are not sorted.

Any insights would be appreciated.

Thank you!

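Could this help you?

awk -F',' '
NR==FNR { a[$1] = substr($0, index($0, ",") + 1); next }  # file2.csv: keep columns 2..N by key
($1 in a) { print $0 "," a[$1] }                          # file1.csv: append the matching columns
' file2.csv file1.csv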
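
To sanity-check the idea, here is a minimal sketch of the same hash-join approach on two tiny made-up files. The sample data, the temp directory, and the array name `rest` are hypothetical, not from the thread. It assumes plain comma-separated fields (no quoted commas) and unique keys in the second file. Only the smaller file is held in memory; the larger one is streamed, so neither input has to be sorted.

```shell
#!/bin/sh
# Build two tiny, unsorted sample files (made-up data) in a temp directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'Key,V1,V2' '1111,a,b' '2222,c,d' '3333,e,f' > "$tmpdir/file1.csv"
printf '%s\n' 'Key,L1' '2222,x' '1111,y' '3333,z' > "$tmpdir/file2.csv"

# Hash join: load file2.csv into an array keyed on column 1,
# then stream file1.csv and append the stored columns to each matching row.
merged=$(awk -F',' '
    NR == FNR { rest[$1] = substr($0, index($0, ",") + 1); next }
    ($1 in rest) { print $0 "," rest[$1] }
' "$tmpdir/file2.csv" "$tmpdir/file1.csv")

printf '%s\n' "$merged"
rm -rf "$tmpdir"
```

The header line is joined like any other row because both files use the same name for the key column. Memory use grows with the size of the file read first, so it is worth putting the narrower file (F2 here) in that position.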

Try man join (Linux).

EDIT: Although you would need to sort the files...
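
For reference, a sort-then-join run might look like the sketch below. It is only an illustration on tiny made-up files: the data, the temp directory, and the `f1.sorted`/`f2.sorted` names are assumptions. It also assumes the key is the first column, each file has one header line, and fields contain no quoted commas. The headers are set aside so that only the data rows are sorted.

```shell
#!/bin/sh
# Sort-then-join sketch on two tiny made-up files; all file names are placeholders.
export LC_ALL=C   # sort and join must agree on the collation order
tmpdir=$(mktemp -d)
printf '%s\n' 'Key,V1,V2' '3333,e,f' '1111,a,b' '2222,c,d' > "$tmpdir/file1.csv"
printf '%s\n' 'Key,L1' '2222,x' '1111,y' '3333,z' > "$tmpdir/file2.csv"

# Sort only the data rows (everything after the header) on the first field.
tail -n +2 "$tmpdir/file1.csv" | sort -t, -k1,1 > "$tmpdir/f1.sorted"
tail -n +2 "$tmpdir/file2.csv" | sort -t, -k1,1 > "$tmpdir/f2.sorted"

# Build the merged header by hand, then join the sorted bodies on field 1.
joined=$(
    printf '%s,%s\n' "$(head -n 1 "$tmpdir/file1.csv")" \
                     "$(head -n 1 "$tmpdir/file2.csv" | cut -d, -f2-)"
    join -t, "$tmpdir/f1.sorted" "$tmpdir/f2.sorted"
)

printf '%s\n' "$joined"
rm -rf "$tmpdir"
```

The cost here is the two sorts plus the disk space for the sorted copies; the join itself only reads each sorted file once.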

Hi Pravin27,

Thank you. I will try this and report back with feedback.

Hi CarloM,

I would rather not sort the files, given the large amount of data.

Thanks again to both of you.
