Linux - Join 2 csv files with common key

RichZR · February 2, 2016, 8:40am

Hi,

I am trying to join 2 csv files, to create a 3rd output file with the joined data.

Below is an example of my Input Data:

Input File 1

NAME, FAV_FOOD, FAV_DRINK, ID, GENDER
Bob, Fish, Coke, 1, M
Lisa, Rice, Water, 2, F
Jenny, Noodle, Tea, 3, F
Ken, Pizza, Coffee, 4, M
Lisa, Cake, Milk, 2, F
Ken, Sandwich, Juice, 4, M

Input File 2

NAME, ID, AGE, COUNTRY
Lisa, 2, F, 22, UK 
Ken, 4, M, 12, Iceland
Jenny, 3, F, 18, France
Bob, 1, M, 31, Ireland

The Keys to join are 'Name' and 'ID';
The joined file should look like the below:

Output File

NAME, FAV_FOOD, FAV_DRINK, ID, GENDER, AGE, COUNTRY
Bob, Fish, Coke, 1, M, 31, Ireland
Lisa, Rice, Water, 2, F, 22, UK 
Jenny, Noodle, Tea, 3, F, 18, France
Ken, Pizza, Coffee, 4, M, 12, Iceland
Lisa, Cake, Milk, 2, F, 22, UK 
Ken, Sandwich, Juice, 4, M, 12, Iceland

Please could you suggest how I can perform this efficiently, as I will be joining large volumes of data from both files.

Many Thanks.

RudiC · February 2, 2016, 8:50am

Any attempts/ideas/thoughts from your side?

---------- Post updated at 14:50 ---------- Previous update was at 14:49 ----------

And, better use code tags for data, in lieu of quote tags...

RavinderSingh13 · February 2, 2016, 8:52am

Hello RichZR,

Adding one more query to Rudi's questions. I can see like fields are not always constant in Input_file2 eg. Ken, 4, M, , 12, Iceland and other lines have 5 fields only. So is it the case or it is a typo, could you please confirm on same. Also code tags as per forum rules for Inputs/commands/codes which you use in your posts.

Thanks,
R. Singh

RichZR · February 2, 2016, 9:03am

Hi Ravinder,

Apologies - That was a typo from my side. I have amended this now in my original post. Also amended the Quote section to a Code Section.

---------- Post updated at 02:03 PM ---------- Previous update was at 01:58 PM ----------

Hi Rudi,

I have found the following examples within the forum which I am sure could be tailored towards my requirement, however I am not massively experienced within Awk, to breakdown the following examples into sections I can understand to tweak:

Example 1:

awk 'FNR==NR{A[$1]=$1;next} ($1 in A){print $0 FS A[$1]} !($1 in A){print $0}' file1 FS=, file2

Example 2:

awk 'FNR==NR{T[$1]=$3 FS $4; next} {print $0, T[$1]}' file2 file1

Cheers.

RavinderSingh13 · February 2, 2016, 9:09am

Hello RichZR,

Assuming you have missed a field in Input_file2 too, so here I have added field GENDER field into Input_file2 in very first line as follows.

cat Input_file2
NAME, ID, GENDER, AGE, COUNTRY
Lisa, 2, F, 22, UK
Ken, 4, M, 12, Iceland
Jenny, 3, F, 18, France
Bob, 1, M, 31, Ireland

Could you please try following and let me know if this helps you.

awk -F"," 'FNR==NR{A[$1 FS $2]=$4 FS $5;next} ($1 FS $(NF-1) in A){print $0 FS A[$1 FS $(NF-1)]}' Input_file2 Input_file1

Output will be as follows.

NAME, FAV_FOOD, FAV_DRINK, ID, GENDER, AGE, COUNTRY
Bob, Fish, Coke, 1, M, 31, Ireland
Lisa, Rice, Water, 2, F, 22, UK
Jenny, Noodle, Tea, 3, F, 18, France
Ken, Pizza, Coffee, 4, M, 12, Iceland
Lisa, Cake, Milk, 2, F, 22, UK
Ken, Sandwich, Juice, 4, M , 12, Iceland

EDIT: Wanted to add point here if any field is NOT present from Input_file1 which is present in Input_file2 it is going to leave that line.
Also adding an explanation here so that it will be good for you in case you get some same kind of requirement in future, hope this helps.

awk -F","                       ######## Making field separator as comma(,) here.
'FNR==NR{                       ######## Putting condition FNR==NR which will only be TRUE when first file(Input_file2) in this example will be TRUE, where FNR is a awk's variable which will be REST each time a new file is being read and NR will be keep on increasing till the all files being read completely. NR(Number of records)
A[$1 FS $2]=$4 FS $5;           ######## Making an array named A whose index is $1 and field separator and value of $2, then assigning A's value to $4 and $5's value, where $1 is first field of line, $2 is 2nd field, $3 is 3rd field and $4 is fourth field etc.
next}                           ######## here I am using next statement which tells awk NOT to process further statements.
($1 FS $(NF-1) in A){           ######## Now while second file(Input_file1) in this example is being read if $1 field separator and 2nd last column($(NF-1)) if these values are present as an index into array A. then perform the following statements.
print $0 FS A[$1 FS $(NF-1)]    ######## Now print the value of complete line with value of array A whose index is $1 field seprator $(NF-1).
}' Input_file2 Input_file1      ######## mentioning here both of the input files.

Thanks,
R. Singh

RudiC · February 2, 2016, 9:12am

On top of what RavinderSingh13 said, spaces aren't used consistently as well, esp. in the end of lines. So, don't expect a perfect result, either. Howsoever, try

awk '   
FNR == NR       {T[$1,$2] = $0
                 sub ("^" $1 FS $2 FS $3, "", T[$1,$2])
                 next
                }
$1,$4 in T      {print $0  T[$1,$4]
                }
' FS=, file2 file1
NAME, FAV_FOOD, FAV_DRINK, ID, GENDER, COUNTRY
Bob, Fish, Coke, 1, M , 31, Ireland
Lisa, Rice, Water, 2, F , 22, UK
Jenny, Noodle, Tea, 3, F , 18, France
Ken, Pizza, Coffee, 4, M , 12, Iceland
Lisa, Cake, Milk, 2, F , 22, UK
Ken, Sandwich, Juice, 4, M , 12, Iceland

looney · February 2, 2016, 11:39am

Hi RudiC ,
kindly explain below highlighted part , as without using it we are getting the same result.
sub ("^" $1 FS $2 FS $3, "", T[$1,$2])
Thanks,

RudiC · February 2, 2016, 11:52am

That's a safety measure so that string will be replaced at the begin-of-line only.