Merging two files by comparing three fields

Hunter85 · June 6, 2009, 4:06pm

Hi Experts,

I need your timely help. I have a problem with merging two files. Here my situation :

Here I have to compare first three fields from FILE1 with FILE2. If they are equal, I have to append the remaining values from FILE2 with FILE1 to create the output.

FILE1:
Class Category Item Field1 Field2 Field3
Class1 Sports Ball 11 12 13
Class2 Academic Bat 21 22 23
Class3 Academic Pen 31 32 33
Class4 Gift Birthday 41 42 43

FILE2:
Class Category Item Field4 Field5
Class1 Sports Ball 14 15
Class2 Academic Bat 24 25
Class3 Academic Pen 34 35
Class5 Books Maths 54 55

OUTPUT FILE (FILE3) :

Class Category Item Field1 Field2 Field3 Field4 Field5
Class1 Sports Ball 11 12 13 14 15
Class2 Academic Bat 21 22 23 24 25
Class3 Academic Pen 31 32 33 34 35
Class4 Gift Birthday 41 42 43 0 0

Your valuable suggestions are highly appreciated. Thankyou very much.

PS : If it contains only one field to compare from both file, "Join" command would help to create the output.

join -a1 -e "0" FILE1 FILE2

durden_tyler · June 6, 2009, 4:40pm

One way to do it with perl:

$
$ cat file1
Class1 Sports Ball 11 12 13
Class2 Academic Bat 21 22 23
Class3 Academic Pen 31 32 33
Class4 Gift Birthday 41 42 43
$
$ cat file2
Class1 Sports Ball 14 15
Class2 Academic Bat 24 25
Class3 Academic Pen 34 35
Class5 Books Maths 54 55
$
$ perl -ne 'BEGIN {open(F,"file2"); while(<F>){split; $x{$_[0].":".$_[1].":".$_[2]}=" $_[3] $_[4]"} close(F)}
>   { chomp; split; $y=$_[0].":".$_[1].":".$_[2]; print $_,defined $x{$y}?$x{$y}:" 0 0","\n" }' file1
Class1 Sports Ball 11 12 13 14 15
Class2 Academic Bat 21 22 23 24 25
Class3 Academic Pen 31 32 33 34 35
Class4 Gift Birthday 41 42 43 0 0
$
$

tyler_durden

Hunter85 · June 8, 2009, 10:10am

Tyler_durden : You are a cool man. Got o/p with small concern.
One doubt...

If the fields contain spaces in their values i,e

if class1 is "Class One"
class 2 is "Class Two-2"

OR Birthday may be "Birth Day"

the script is not working. But I need the script to be working as the same even if the fieldnames contains spaces.

To remove space from the field I may use : sed 's/ //g'

Guide me how to use this inside perl. Thankyou very much.

durden_tyler · June 10, 2009, 5:18pm

If the field name contains spaces, then they are essentially two fields and not one field. The way the perl script worked earlier was:

(1) Loop through file2 and create a hash with key-value pairs as follows:

%x = ("Class1 Sports Ball",   "11 12 13",
      "Class2 Academic Bat",  "21 22 23",
      "Class3 Academic Pen",  "31 32 33",
      "Class4 Gift Birthday", "41 42 43");

(2) Now loop through file2, tokenize the input line, form the key using the first 3 tokens, print then entire line and then print the value of the key for this line. If value doesn't exist print " 0 0".

Now, for step (1) above, the hash key is formed using this expression:

$_[0].":".$_[1].":".$_[2]

So, for this input line of file2:

Class1 Sports Ball 11 12 13

the values would get assigned as follows:

$_[0] = "Class1"
$_[1] = "Sports"
$_[2] = "Ball"

But for a line like this in file2:

Class One Sports Ball 11 12 13

the first three values would get assigned as follows:

$_[0] = "Class"
$_[1] = "One"
$_[2] = "Sports"

Now, this would work if:

(a) the keys remain unique in both the files, and

(b) you tweak the script so that the key values are:

" $_[4] $_[5]"

instead of this:

" $_[3] $_[4]"

(You push the tokens one place while forming the key value because of that extra token.)

Here's the example of the revised code for the revised data:

$
$ cat file1
Class One Sports Ball 14 15
Class Two Academic Bat 24 25
Class Three Academic Pen 34 35
Class Four Books Maths 54 55
$
$ cat file2
Class One Sports Ball 11 12 13
Class Two Academic Bat 21 22 23
Class Three Academic Pen 31 32 33
Class Four Gift Birthday 41 42 43
$
$ perl -ne 'BEGIN {open(F,"file2");
>                  while(<F>){split; $x{$_[0].":".$_[1].":".$_[2]}=" $_[4] $_[5]"}
>                  close(F)}
>   { chomp; split;
>     $y=$_[0].":".$_[1].":".$_[2]; print $_,defined $x{$y}?$x{$y}:" 0 0","\n"
>   }' file1
Class One Sports Ball 14 15 11 12
Class Two Academic Bat 24 25 21 22
Class Three Academic Pen 34 35 31 32
Class Four Books Maths 54 55 0 0
$
$

Of course, I hope you could see the limitation of this approach.

You must know how many tokens would be created after the split and how they would be divided into keys and values.

As long as you know that, and are able to create keys and values consistently (after splitting), it might work well for your files.

But if you are thinking that that script would work for both these sets of files:

SET 1:
$
$ cat file1
Class1 Sports Ball 11 12 13
Class2 Academic Bat 21 22 23
Class3 Academic Pen 31 32 33
Class4 Gift Birthday 41 42 43
$
$ cat file2
Class1 Sports Ball 14 15
Class2 Academic Bat 24 25
Class3 Academic Pen 34 35
Class5 Books Maths 54 55
$
$

and

SET 2:
$
$ cat file1
Class One Sports Ball 14 15
Class Two Academic Bat 24 25
Class Three Academic Pen 34 35
Class Four Books Maths 54 55
$
$ cat file2
Class One Sports Ball 11 12 13
Class Two Academic Bat 21 22 23
Class Three Academic Pen 31 32 33
Class Four Gift Birthday 41 42 43
$

then you are mistaken; because tokens 1, 2 and 3 form the UNIQUE key in the first set of files, whereas tokens 1, 2, 3 and 4 form the UNIQUE key in the second set of files.

For cases like these, you may want to use a regex to split and create key-value pairs for hash.

Hope that helps,
tyler_durden