Combine common lines from 2 huge files

rochitsharma · May 13, 2010, 3:44am

Hi,

I have 2 huge files, each with more than 10 million lines. The files look like this:

File 1

45905099 2059
942961505 3007
8450875165 7007
615565331 3015
9415586035 9012
9871573 5367
4415655 4011
44415539519 5361
3250659295 4001
5950718618 9367

File 2

44415539519      TQ03      99.86 12-MAY-10 09.36.45.453366 AM
5950718618      ZT04         53 01-MAY-10 02.42.55.600218 PM
94121628      TH04      98.73 11-MAY-10 08.57.42.617615 PM
941488      TZ03      49.86 10-APR-10 07.46.27.920278 PM
4415655      TR03      49.86 10-MAY-10 11.47.39.701701 AM
84224643      TR03      49.86 10-MAY-10 09.58.07.313377 AM
8860320024      TR03      48.86 12-MAY-10 10.00.59.901523 AM
6614414138      TR03      44.86 06-MAY-10 06.59.46.958793 PM
9442381886      TR03      44.86 03-MAY-10 05.01.44.008156 PM
999631410      TR03      45.86 04-APR-10 07.40.31.117461 PM

I need to create an output file containing the entries whose 1st column appears in both files. A sample:

Output:

44415539519 5361      TQ03      99.86 12-MAY-10 09.36.45.453366 AM
5950718618 9367      ZT04         53 01-MAY-10 02.42.55.600218 PM
4415655 4011      TR03      49.86 10-MAY-10 11.47.39.701701 AM

Please suggest a solution other than join, as the join utility takes a lot of time and the files need to be sorted before join can be applied.

Thanks & Regards

malcomex999 · May 13, 2010, 4:12am

awk 'NR==FNR{for(i=1;++i<=NF;) _[$1]=_[$1] FS $i;next}$1 in _{print $0,_[$1]}' file2 file1
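A commented sketch of the same hash-lookup idea may help when reading the one-liner. It is a variant, not the command above: it loads file1 (the shorter lines) into memory and streams file2, so the output follows file2's order and keeps file2's original spacing, as in the requested sample. The names `val` and `rest` and the temporary directory are made up for illustration, and the sketch assumes the key starts in column 1, keys are unique in file1, and file1's entries fit in memory.

```shell
# Sketch only: the sample data is an excerpt of the rows from the question.
dir=$(mktemp -d)

cat > "$dir/file1" <<'EOF'
9871573 5367
4415655 4011
44415539519 5361
5950718618 9367
EOF

cat > "$dir/file2" <<'EOF'
44415539519      TQ03      99.86 12-MAY-10 09.36.45.453366 AM
5950718618      ZT04         53 01-MAY-10 02.42.55.600218 PM
94121628      TH04      98.73 11-MAY-10 08.57.42.617615 PM
4415655      TR03      49.86 10-MAY-10 11.47.39.701701 AM
EOF

out=$(awk '
    # First file (file1): remember the second column under its key.
    NR == FNR { val[$1] = $2; next }

    # Second file (file2): print only lines whose key was seen in file1.
    $1 in val {
        # Everything after the key, with the original spacing intact.
        rest = substr($0, length($1) + 1)
        print $1, val[$1] rest
    }
' "$dir/file1" "$dir/file2")

printf '%s\n' "$out"
rm -rf "$dir"
```

On this excerpt the three printed lines match the sample output in the question. Neither file needs to be sorted, since each line of file2 costs one array lookup.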

Please use code tags next time for better readability!

anon57720281 · May 13, 2010, 9:31am

Mate, I doubt you can beat the speed of join, especially if you want
to do it with unsorted files.
Programmatically, that would be ridiculously expensive.

sorted would be N1 x N2 searches
unsorted something like:
N1 x N2!

ridiculous.
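For scale, here is a rough operation count for three ways of matching the files. It is an estimate under stated assumptions, not a benchmark: both files have N1 = N2 = 10^7 lines, sorting costs about N log2 N comparisons, an awk associative array behaves like a hash table with constant expected lookup time, and I/O, memory, and constant factors are ignored.

```latex
\begin{align*}
\text{nested scan:} \quad & N_1 N_2 = 10^{7} \cdot 10^{7} = 10^{14} \\
\text{sort, then merge:} \quad & N_1 \log_2 N_1 + N_2 \log_2 N_2 + (N_1 + N_2) \approx 4.9 \times 10^{8} \\
\text{hash lookup:} \quad & N_1 + N_2 = 2 \times 10^{7}
\end{align*}
```

Under these assumptions, only the nested scan (comparing every line of one file against every line of the other) reaches the N1 x N2 range. Sorting first, or looking keys up in a hash, stays within a few hundred million steps.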