Perl - multiple keys and merging two files

Lokesha · September 17, 2013, 7:18pm

Hi,

I'm not a regular coder, but sometimes I write basic Perl scripts, so Perl is a bit difficult for me :).

I'm merging two files a.txt and b.txt into c.txt:

a.txt
------
x001;frtb70;xyz;109
x001;frvt65;sec;239
x003;wqax34;jul;659
x004;yhud43;yhn;760

b.txt
------
x001;abcd80;xyz;193
x001;crrp28;xse;456
x002;lmno10;xyz;784
x002;jfds65;jfd;739
x002;juop88;jup;879
x003;yulo90;rem;542
x003;kihl98;dnt;312
x004;urel25;ewb;342


c.txt [output]
------
x001;frtb70;xyz;109
x001;frvt65;sec;239
x002;lmno10;xyz;784
x002;jfds65;jfd;739
x002;juop88;jup;879
x003;wqax34;jul;659
x004;yhud43;yhn;760


The only condition is: I need all the lines from a.txt in c.txt.
But before selecting a line from b.txt for c.txt, I first need to look in a.txt. If the line's key is already present in a.txt, I shouldn't write that b.txt line to c.txt [output]. In all the files, the first column can be treated as the key, but it may contain duplicates. That is the challenge for me.

Below is the script I've written. The problem is that, because I'm using a hash for both input files, it drops lines that share the same key value. But I need to keep all lines of a.txt even when their keys are the same. The same is true for b.txt, except that it should skip lines whose key is already present in a.txt.

#!/usr/bin/env perl

sub prepareHash {
	#my ($in_file, $primary_Key, $delimiter) = @_;
	my $in_file   = shift;
	my $key       = shift;
	my $delimiter = shift;
	
  my @line_tokens;
  my %FILE_Hash;
  open( IN_FILE, "< $in_file" ) or die "Can't open $in_file : $!";
	  
  while (<IN_FILE>) {
     my $in_line = $_;
     chomp($in_line);
     @line_tokens = split(/$delimiter/, $in_line);
	   $FILE_Hash{$line_tokens[$key]} = $in_line; 
  }
  
  close IN_FILE;

  return %FILE_Hash;
}

my $input1 = "/export/home/a.txt";
my $input2 = "/export/home/b.txt";
my $output = "/export/home/c.txt";

my %A_Hash  = prepareHash($input1, 0 , ";" );
my %B_Hash  = prepareHash($input2, 0 , ";" );

open( OUT_FILE, "> $output" ) or die "Can't open $output : $!";

for my $a_key ( sort keys %A_Hash ) {
   $a_key =~ s/\s+$//;
   my $a_line = $A_Hash{$a_key};
   print OUT_FILE $a_line . "\n";
}

  # Compare a.txt and b.txt. Only write the b.txt lines whose key is not in a.txt into c.txt
  for my $b_key ( sort keys %B_Hash ) {
     $b_key =~ s/\s+$//;
     
     if ( ! exists $A_Hash{$b_key} ) {
      my $b_line = $B_Hash{$b_key};
      print OUT_FILE $b_line . "\n";
     } else {
      print "$B_Hash{$b_key} is already written into c.txt from a.txt, hence skipping\n";
     }
  }

close OUT_FILE;

Can any of you help me please?
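
For illustration, a minimal sketch of why the plain hash drops lines: assigning to the same key overwrites the earlier value, whereas pushing onto an array reference stored under that key keeps every line. The names %last_line and %all_lines are hypothetical and not part of the script above; the sample lines are the first three of a.txt.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# First three lines of a.txt; two of them share the key x001.
my @lines = ( 'x001;frtb70;xyz;109', 'x001;frvt65;sec;239', 'x003;wqax34;jul;659' );

my ( %last_line, %all_lines );    # hypothetical names, for illustration only
for my $line (@lines) {
    my ($key) = split /;/, $line;
    $last_line{$key} = $line;             # overwrites any earlier line with this key
    push @{ $all_lines{$key} }, $line;    # keeps every line with this key
}

print scalar( keys %last_line ), " lines survive in the plain hash\n";
print "x001 has ", scalar( @{ $all_lines{x001} } ), " lines in the hash of arrays\n";
```

With three input lines, only two survive in the plain hash, because the second x001 line replaces the first.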

durden_tyler · September 17, 2013, 11:19pm

$ 
$ cat a.txt
x001;frtb70;xyz;109
x001;frvt65;sec;239
x003;wqax34;jul;659
x004;yhud43;yhn;760
$ 
$ cat b.txt
x001;abcd80;xyz;193
x001;crrp28;xse;456
x002;lmno10;xyz;784
x002;jfds65;jfd;739
x002;juop88;jup;879
x003;yulo90;rem;542
x003;kihl98;dnt;312
x004;urel25;ewb;342
$ 
$ 
$ perl -F";" -lane 'if ($ARGV eq "a.txt") { push @{$x{$F[0]}},$_ }
                    else { push @{$y{$F[0]}},$_ if not defined $x{$F[0]} }
                    END {
                      @x {keys %y} = values %y;
                      foreach $k (sort keys %x) { print foreach (@{$x{$k}}) }
                    }' a.txt b.txt
x001;frtb70;xyz;109
x001;frvt65;sec;239
x002;lmno10;xyz;784
x002;jfds65;jfd;739
x002;juop88;jup;879
x003;wqax34;jul;659
x004;yhud43;yhn;760
$ 
$

Lokesha · September 18, 2013, 1:00am

Hi durden_tyler, Thank you very much for your reply.

But the code you posted is too advanced for me. I need it in a script rather than as a command-line one-liner. How can I convert your one-liner into a script?

Thanks & Regards,
Lokesha

royalibrahim · September 18, 2013, 8:15am

Somehow this code gives the expected output, but I am still figuring out how I got it, given that I am not specifying the delimiter ';'.

perl -lane '$hash{$F[0]} = $_; END { foreach (sort keys %hash) {print $hash{$_}}}' b.txt a.txt > c.txt
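
On the delimiter question: with -a and no -F, Perl autosplits each line on whitespace. These lines contain no whitespace, so the first field is the whole line and the hash ends up keyed by entire lines. A small sketch comparing the two splits (the variable names are only for illustration):

```perl
use strict;
use warnings;

my $line = 'x001;frtb70;xyz;109';

my @by_space = split ' ', $line;    # what -a does when -F is not given
my @by_semi  = split /;/, $line;    # what -a does with -F";"

print "whitespace split: $by_space[0]\n";    # the whole line
print "semicolon split:  $by_semi[0]\n";     # just x001
```

Keyed by whole lines, the hash would hold every distinct line of both files, so the one-liner likely prints all of them in sorted order rather than merging on the first field.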

Lokesha · September 18, 2013, 8:17am

Thanks royalibrahim,

But I need it in a Perl script instead of running it on the command line.
Can you help me?

Regards.

in2nix4life · September 18, 2013, 10:07am

Here's a basic variation in script form. Hope it helps.

#!/usr/bin/perl
#

use strict;

# vars we need
my $file_a = "a.txt";
my $file_b = "b.txt";
my $file_c = "c.txt";
my %HASH;
my @FILEA;
my @FILEB;
my @UNIQUE;

# open a.txt and b.txt in read mode and c.txt in append mode
open(FILEA, "<$file_a") or die "Unable to open $file_a.\n";
open(FILEB, "<$file_b") or die "Unable to open $file_b.\n";
open(FILEC, ">>$file_c") or die "Unable to write to $file_c.\n";

# store a.txt and b.txt into arrays
@FILEA = <FILEA>;
@FILEB = <FILEB>;

# write the contents of a.txt to c.txt
foreach(@FILEA) {
    print FILEC $_;
}

# map the contents of a.txt to a hash
%HASH = map{$_ => 1} @FILEA;

# use grep function to parse out lines that exist
# in both a.txt and b.txt
@UNIQUE = grep(! defined $HASH{$_}, @FILEB);

# write the results to c.txt
foreach(@UNIQUE) {
    print FILEC $_;
}

# close files
close(FILEA);
close(FILEB);
close(FILEC);

# done
exit(0);

Lokesha · September 18, 2013, 11:03am

Thanks in2nix4life.

The problem with your script is the line below:

@UNIQUE = grep(! defined $HASH{$_}, @FILEB);

I think the code above matches on the entire line. Since every line in a.txt differs from every line in b.txt, the script simply concatenates both files into the output file 'c.txt', because no whole line of a.txt matches a line of b.txt.

We need to match only the first field of b.txt against a.txt. If the first field does not occur in a.txt, the line has to be written to the output file c.txt.

input file: a.txt
------------------
x001;frtb70;xyz;109
x001;frvt65;sec;239
m003;wqax34;jul;659
y004;yhud43;yhn;760


input file: b.txt
------------------
x001;abcd80;xyz;193
x001;crrp28;xse;456
p002;lmno10;xyz;784
p002;jfds65;jfd;739
p002;juop88;jup;879
m003;yulo90;rem;542
m003;kihl98;dnt;312
y004;urel25;ewb;342


expected output file: c.txt
---------------------------
x001;frtb70;xyz;109
x001;frvt65;sec;239
p002;lmno10;xyz;784
p002;jfds65;jfd;739
p002;juop88;jup;879
m003;wqax34;jul;659
y004;yhud43;yhn;760

Selecting output lines based on the first field of the input files is the important part here, and that is where I'm failing. Any idea would be very useful.

Thanks.
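
A minimal sketch of matching on the first field rather than the whole line, assuming ';' as the delimiter. The helper first_field and the hash %seen_key are hypothetical names, and the arrays hold a few of the sample lines instead of reading the files.

```perl
use strict;
use warnings;

my @file_a = ( "x001;frtb70;xyz;109\n", "m003;wqax34;jul;659\n" );
my @file_b = ( "x001;abcd80;xyz;193\n", "p002;lmno10;xyz;784\n", "m003;yulo90;rem;542\n" );

# Hypothetical helper: return the first ';'-separated field of a line.
sub first_field {
    my ($line) = @_;
    my ($key) = split /;/, $line;
    return $key;
}

# Index a.txt by key rather than by whole line.
my %seen_key;
$seen_key{ first_field($_) } = 1 for @file_a;

# Keep only the b.txt lines whose key never occurs in a.txt.
my @unique = grep { !exists $seen_key{ first_field($_) } } @file_b;

print @unique;
```

Here only the p002 line is kept, since x001 and m003 both occur as keys in a.txt.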

durden_tyler · September 18, 2013, 12:28pm

By processing files "a.txt" and "b.txt" separately,
splitting their inputs on delimiter ";"
and adding the relevant "push" statements
and then merging the hashes
and finally iterating through the hash and printing the result.
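
The steps above could be sketched in script form roughly as follows. This is only one possible reading of them: merge_by_key is a hypothetical helper, and the lines are held in arrays here so the sketch runs on its own (in a real script they would be read from a.txt and b.txt and chomped).

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: merge two lists of ';'-delimited lines by first field.
sub merge_by_key {
    my ( $a_lines, $b_lines ) = @_;
    my ( %x, %y );

    # Group the a.txt lines by key.
    for my $line (@$a_lines) {
        my ($key) = split /;/, $line;
        push @{ $x{$key} }, $line;
    }

    # Group the b.txt lines by key, but only for keys absent from a.txt.
    for my $line (@$b_lines) {
        my ($key) = split /;/, $line;
        push @{ $y{$key} }, $line unless exists $x{$key};
    }

    # Merge the two hashes, then flatten in sorted key order.
    @x{ keys %y } = values %y;
    return map { @{ $x{$_} } } sort keys %x;
}

my @file_a = ( 'x001;frtb70;xyz;109', 'x001;frvt65;sec;239', 'm003;wqax34;jul;659', 'y004;yhud43;yhn;760' );
my @file_b = ( 'x001;abcd80;xyz;193', 'p002;lmno10;xyz;784', 'p002;jfds65;jfd;739', 'm003;yulo90;rem;542' );

print "$_\n" for merge_by_key( \@file_a, \@file_b );
```

Note that sorting the keys puts the output in string order (m003, p002, x001, y004), which is not the same as the order of the lines in the input files.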

Lokesha · September 18, 2013, 6:07pm

Thanks for the help & reply.

---------- Post updated at 03:37 AM ---------- Previous update was at 01:53 AM ----------

Here is the final script, which works 100% for the requirement.

#!/usr/bin/env perl

use strict;
use warnings;

my $afile   = '/home/home/a.csv';
my $bfile   = '/home/home/b.csv';
my $outfile = '/home/home/c.csv';

my %a_hash;

open( FILE_C, "> $outfile" ) or die "Can't open $outfile : $!";

# Copy every line of the first file and remember its key (first field).
open( FILE_A, "< $afile" ) or die "Can't open $afile : $!";
while ( my $aline = <FILE_A> ) {
    chomp $aline;
    $a_hash{$1} = 1 if $aline =~ /^(.+?);/;
    print FILE_C $aline . "\n";
}

# Copy only those lines of the second file whose key was not seen above.
open( FILE_B, "< $bfile" ) or die "Can't open $bfile : $!";
while ( my $bline = <FILE_B> ) {
    chomp $bline;
    next if $bline =~ /^(.+?);/ and exists $a_hash{$1};
    print FILE_C $bline . "\n";
}

close FILE_A;
close FILE_B;
close FILE_C;