Merge two files based on a 3rd key file

dips_ag · January 6, 2011, 12:37am

Hi,
I want to merge the two files based on the key file's columns.

The key file:

 
DATE~DATE 
HOUSE~IN_HOUSE 
CUST~IN_CUST 
PRODUCT~PRODUCT 
ADDRESS~CUST_ADDR 
BASIS_POINTS~BASIS_POINTS 
...

The other 2 files are From_file & To_file -
The From_file:

 
DATE|date/time|29|9 
HOUSE|integer|15|0 
CUST|integer|15|0 
PRODUCT|integer|15|0 
ADDRESS|string|256|0 
BASIS_POINTS|number|4|5 
....

The To_file:

DATE|date/time|29|9 
IN_HOUSE|integer|15|0 
IN_CUST|integer|15|0 
PRODUCT|integer|15|0 
CUST_ADDR|string|256|0 
BASIS_POINTS|number|4|5  
...

The o/p file has to be created by matching the left column in the key file to the first column in the from_file and substituting the entire line from the from_file. Similarly for the right column in the key file........to that of the first column in the to_file.

Hence the o/p would look something like this:
OUTPUT:

 
DATE|date/time|29|9~DATE|date/time|29|9 
HOUSE|integer|15|0~IN_HOUSE|integer|15|0
CUST|integer|15|0~IN_CUST|integer|15|0 
PRODUCT|integer|15|0~PRODUCT|integer|15|0 
ADDRESS|string|256|0~CUST_ADDR|string|256|0 
BASIS_POINTS|number|4|5~BASIS_POINTS|number|4|5 
...

I came accross a link Matching string on two files based on match rules. in this forum about the matching strings in 2 files......but I got really confused :(!! Please can someone suggest?

-dips

pravin27 · January 6, 2011, 10:34am

Hi,
Try this,

 awk -F"~" 'NR==FNR{a[$1]++;b[$2]++;next} {split($0,c,"|");if(a[c[1]]){printf $0"~";getline tmp < "to_file"; split(tmp,d,"|");if(b[d[1]]){print tmp}}}' key_file from_file

anurag.singh · January 6, 2011, 10:52am

awk -F\| 'FILENAME=="from_file"{a[$1]=$0;next;}FILENAME=="to_file"{b[$1]=$0;next;}{FS="~";if(a[$1]) $1=a[$1];if(b[$2]) $2=b[$2];}1' OFS="~" from_file to_file key_file

Scrutinizer · January 6, 2011, 11:53am

Small variation:

awk '{sub(/ *$/,x)} NR==FNR{A[$1]=$0;m++;next} NR-FNR==m{B[$1]=$0;next} {print A[$1],B[$2]}' FS=\| OFS=\~ fromfile tofile FS=\~ keyfile

anurag.singh · January 6, 2011, 12:08pm

@Scrutinizer, given that every keyfile entry has reference in from and to files. If so, null check if a[$1] and b[$2] can be removed from my post (Also FILENAME=="key_file" check is not needed).

Scrutinizer · January 6, 2011, 12:19pm

@anurag, there is almost no difference with your solution, that is why I called it a small variation. I removed the trailing spaces which seemed to be present in the input files through the sub statement and used a different way to vary field separators and determining the file.. I think the checks are not needed, since it will produce an empty reference, which is preferable IMO.

DGPickett · January 6, 2011, 12:52pm

Seems a bit fluffy if they have serious amounts of data. I suspect 3-4 sorts, 2 joins and 1 intermediate file would do it in bulk: sort two files, join them, possibly resort the result, sort the third file, join it and the result.

With the right SQL flat file tools, you can just do it as a three way join.

dips_ag · January 7, 2011, 6:18am

Hi,

Thanks for the responses!

Anurag & Scrutinizer - Could you both please explain your code?

-dips

anurag.singh · January 7, 2011, 7:10am

post #3: (by myself)

Set field separator to pipe character i.e. |
If current file read is "from_file", store each record ($0) in array a indexed as 1dt field in that record, and move to next line (This step will execute for all lines in from_file AND ONLY for from_file only)
If current file read is "to_file", store each record ($0) in array b indexed as 1dt field in that record and move to next line (This step will execute for all lines in to_file AND ONLY for to_file only)
Set field separator to tilde (~) character (By this time, both from_file and to_file are processed, and below steps are executed only for 3rd file i.e. key_file)
If array a has non-null value at index $1 (1st field in key_file), set array a value as the 1st field (Override 1st field of this line with value in array a)
If array b has non-null value at index $2 (2nd field in key_file), set array b value as the 2nd field (Override 2nd field of this line with value in array b)
print lines in key_file (modified as per step 5 and 6) having output field separator as ~

Post #4: (By Scrutinizer) [If more than one file is passed to awk then, FNR gives line no in current file only, but NR gives current line from start (as if all files were in one file)]

Set field separator to pipe character i.e. |
Remove all trailing spaces from currrent line [Not done in earlier post]
If current file read is "fromfile" (NR==FNR will be true only for 1st file passed to the command), store line in array A with 1st field as index (As was done eariler) and also increment m (So m will have line no processed so far)
If current file read is "tofile", store line in array B with 1st field as index (As was done eariler). [NR-FNR==m will be true ONLY for 2nd file]
Set field separator to tilde (~) character (It is set after 1st and 2nd file is processed)
While processing lines in 3rd file, print array A value at index of 1st field and print array B value at index of 2nd field having output field separator as ~

Both posts are pretty much same, just implemented in different ways.

dips_ag · January 10, 2011, 3:10am

Hi Anurag,

Thanks for your time and detailed explanation!

-dips