Help with merge data based on similarity

patrick87 · December 19, 2011, 5:30am

Input_file

data1    USA    100    ASE
data3    UK    20    GWQR
data4    Brazil    40    QWE
data2    Scotland    60    THWE
data5    USA    40    QWERR

Reference_file

USA    12312    34532
1324    Brazil    23321
231    3421    Scotland
342    34235    UK
231    141    England

Desired_output_file:

data1    USA    100    ASE    USA    12312    34532
data5    USA    40    QWERR    USA    12312    34532
data4    Brazil    40    QWE    1324    Brazil    23321
data3    UK    20    GWQR    342    34235    UK
data2    Scotland    60    THWE    231    3421    Scotland

I would like to print out those content that shared between column 2 of Input_file with column 1, 2, 3 of reference_file.
Below is the way I deal with it:

Step 1: merge the share info between column 2 of input_file with column 1 of reference_file:
perl -e ' $col1=1; $col2=0; ($f1,$f2)=@ARGV; open(F2,$f2); while  (<F2>) { s/\r?\n//; @F=split /\t/, $_; $line2{$F[$col2]} .= "$_\n"  }; $count2 = $.; open(F1,$f1); while (<F1>) { s/\r?\n//; @F=split  /\t/, $_; $x = $line2{$F[$col1]}; if ($x) { $num_changes = ($x =~  s/^/$_\t/gm); print $x; $merged += $num_changes } } warn "\nJoining $f1  column $col1 with $f2 column $col2\n$f1: $. lines\n$f2: $count2  lines\nMerged file: $merged lines\n"; ' Input_file Reference_file >  tmp1.txt
data1    USA    100    ASE    USA    12312    34532
data5    USA    40    QWERR    USA    12312    34532

Step 2: merge the share info between column 2 of input_file with column 2 of reference_file:
perl -e ' $col1=1; $col2=1; ($f1,$f2)=@ARGV; open(F2,$f2); while  (<F2>) { s/\r?\n//; @F=split /\t/, $_; $line2{$F[$col2]} .= "$_\n"  }; $count2 = $.; open(F1,$f1); while (<F1>) { s/\r?\n//; @F=split  /\t/, $_; $x = $line2{$F[$col1]}; if ($x) { $num_changes = ($x =~  s/^/$_\t/gm); print $x; $merged += $num_changes } } warn "\nJoining $f1  column $col1 with $f2 column $col2\n$f1: $. lines\n$f2: $count2  lines\nMerged file: $merged lines\n"; ' Input_file Reference_file >  tmp2.txt 
data4    Brazil    40    QWE    1324    Brazil    23321

Step 3: merge the share info between column 2 of input_file with column 3 of reference_file:
perl -e ' $col1=1; $col2=2; ($f1,$f2)=@ARGV; open(F2,$f2); while  (<F2>) { s/\r?\n//; @F=split /\t/, $_; $line2{$F[$col2]} .= "$_\n"  }; $count2 = $.; open(F1,$f1); while (<F1>) { s/\r?\n//; @F=split  /\t/, $_; $x = $line2{$F[$col1]}; if ($x) { $num_changes = ($x =~  s/^/$_\t/gm); print $x; $merged += $num_changes } } warn "\nJoining $f1  column $col1 with $f2 column $col2\n$f1: $. lines\n$f2: $count2  lines\nMerged file: $merged lines\n"; ' Input_file Reference_file >  tmp3.txt 
data3    UK    20    GWQR    342    34235    UK
data2    Scotland    60    THWE    231    3421    Scotland

Concetate all tmp*.txt together:
cat tmp1.txt tmp2.txt tmp3.txt > Desired_output_file.txt

It seems like awk "if..else...else if" condition able to save the progress time?
Thanks for any advice.

Skrynesaver · December 19, 2011, 6:33am

As you're using perl already, why not read the reference file into a hash keyed on the non-numeric field in each record and then use that to create the new records as you read the input file?

$ perl -Mstrict -e '
my %index;
open (my $ref,"<","reference.dat");
while (<$ref>){
   chomp;
   my $record=$_;
   my @record=split(/\s+/,$record);
   for my $field (@record){
      if ($field !~ /^\d+$/){
         $index{$field} = $record;
      }
   }
}
close $ref;
open(my $in, "<","input.dat");
while(<$in>){
   chomp;
   my @fields=split /\s+/,$_;
   print "$_ $index{$fields[1]}\n";
}'