How to compare two files using UNIX?

I have two files, each of which has primary key column(s) in every row.

I need to compare the two files and produce output in the following format:

Primary key(s),file1 value,file2 value

Both input files are comma-separated.

I have accomplished this using Perl, but it takes around two hours to generate the output for files with 50,000 lines.

Can it be done using shell or awk in a much faster way?

You probably have unnecessarily complex Perl code; yes, awk, Perl, or other languages all handle this well. We run similar comparisons on 10M rows in under two minutes.

Please give us a few sample rows, the expected output, and point out which field(s) form the primary key.

My sample input files:

File1

10,807,3616114150,1,RVRSL,X,NA,2615574
11,199,607440419,0,ORGNL,1,NA,157844
12,807,25004011647,1,RVRSL,X,NA,1411925

File2

10,807,3616114150,1,RVRSL,X,NA2,2615574
11,199,607440419,0,ORGNL,1,NA,157344
12,807,25004011647,1,RVR2SL,X,NA,1411925

Here, the first two columns are primary keys.

Sample output:

For the first row alone (primary key combination 10,807):

10,807,3616114150,3616114150,Y
10,807,1,1,Y
10,807,RVRSL,RVRSL,Y
10,807,X,X,Y
10,807,NA,NA2,N
10,807,2615574,2615574,Y

The trailing Y or N indicates whether the values for that column match (Y) or mismatch (N).

---------- Post updated at 07:59 PM ---------- Previous update was at 06:57 PM ----------

This is my Perl code:

It would be great if someone could tune it to perform better, or provide a faster solution using awk/shell.

#!/usr/bin/perl
use strict;
use warnings;

my $input_file1 = shift;
my $input_file2 = shift;
my $starttime   = shift;

open my $firfile, '<', $input_file1 or die "Unable to open file: [$!]";
open my $secfile, '<', $input_file2 or die "Unable to open file: [$!]";
open my $outfile, '>', "compareoutput_$starttime.txt" or die "Can't create file: [$!]";

# Number of leading columns that form the primary key
my $primary_col = `head -1 pk_count.txt`;
chomp $primary_col;

# Load each file into a hash keyed on the primary key columns
my (%firHash, %secHash);
while (<$firfile>) {
    my @cols = split /,/;
    my $key  = join(',', @cols[0 .. $primary_col - 1]);
    $firHash{$key} = $_;
}
while (<$secfile>) {
    my @cols = split /,/;
    my $key  = join(',', @cols[0 .. $primary_col - 1]);
    $secHash{$key} = $_;
}

# Compare rows that share a primary key, column by column.
# NOTE: this nested loop scans every key of the second hash for every
# key of the first -- O(n^2) -- which is where the hours go.
foreach my $key1 (sort keys %firHash) {
    foreach my $key2 (sort keys %secHash) {
        next unless $key1 eq $key2;
        my @file1 = split /,/, $firHash{$key1};
        my @file2 = split /,/, $secHash{$key2};
        for (my $count = $primary_col; $count < @file1; $count++) {
            chomp $file1[$count];
            chomp $file2[$count];
            my $column_num = $count + $primary_col - 1;
            my $flag = ($file1[$count] eq $file2[$count]) ? 'Y' : 'N';
            print $outfile join(',', $key1, $column_num,
                                $file1[$count], $file2[$count], $flag), "\n";
        }
    }
}

nawk -f gp.awk file1 file2

gp.awk:

BEGIN {
  FS=OFS=","
}
{ idx = $1 OFS $2 }                  # composite primary key: first two fields
FNR==NR { f1[idx]=$0; next }         # first file: store whole row keyed by idx
idx in f1 {                          # second file: key also present in file1
  n=split(f1[idx],a,FS)
  for(i=3;i<=NF;i++)                 # compare the non-key columns
    print idx, a[i], $i, (a[i]==$i)?"Y":"N"
}
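
As for tuning the Perl itself: the killer is the nested key loop, which scans every key of the second hash for every key of the first. A direct hash lookup does the same join in linear time. Here is a minimal, untested sketch of just the comparison stage, assuming the two hashes, $primary_col, and $outfile are set up as in your script:

foreach my $key (sort keys %firHash) {
    next unless exists $secHash{$key};   # O(1) lookup instead of scanning every key
    my @file1 = split /,/, $firHash{$key};
    my @file2 = split /,/, $secHash{$key};
    for (my $i = $primary_col; $i < @file1; $i++) {
        chomp $file1[$i];
        chomp $file2[$i];
        my $flag = ($file1[$i] eq $file2[$i]) ? 'Y' : 'N';
        print $outfile join(',', $key, $i + $primary_col - 1,
                            $file1[$i], $file2[$i], $flag), "\n";
    }
}

This produces the same output in the same order; only the lookup strategy changes.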

@vgersh99 :
Your code worked perfectly and ran in just a few seconds on a 100k-line file. That was great. Thanks a lot.

I do have to make two changes to this code, which I tried but could not complete.

First: the number of primary key columns is not constant; it is stored in a file.

For reference, from my Perl code:

$primary_col = `head -1 pk_count.txt`; # To get the number of primary keys

I'm not sure how to make this change in your code.

Second:
I also need a column sequence number in the output.
Taking my previous example, the output would look like:

10,807,1,3616114150,3616114150,Y
10,807,2,1,1,Y
10,807,3,RVRSL,RVRSL,Y
10,807,4,X,X,Y
10,807,5,NA,NA2,N
10,807,6,2615574,2615574,Y

where 10,807 is the primary key and the number next to it is just a sequence; it has to restart at 1 and increment up to n for each primary key.

Something along these lines:
# default 2 keys
nawk -f gp.awk file1 file2
#
# with 3 keys
nawk -v keys=3 -f gp.awk file1 file2
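
And if the key count lives in pk_count.txt, as in your Perl script, you can feed it in with command substitution (assuming the file's first line holds just the number):

nawk -v keys="$(head -1 pk_count.txt)" -f gp.awk file1 file2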

gp.awk:

BEGIN {
  FS=OFS=","
  if (!keys) keys=2                  # number of leading key columns; override with -v keys=N
}
{ for(i=1;i<=keys;i++) idx=(i==1)?$i:idx OFS $i }   # build the composite key
FNR==NR { f1[idx]=$0; next }         # first file: store whole row keyed by idx
idx in f1 {                          # second file: key also present in file1
  n=split(f1[idx],a,FS)
  seq=0                              # column sequence, restarts at 1 per key
  for(i=keys+1;i<=NF;i++)
    print idx, ++seq, a[i], $i, (a[i]==$i)?"Y":"N"
}

Thanks a lot, vgersh99... Your code worked perfectly. It produced the output in under a minute for a 100k-line file.