best method to compare 2 big files in unix

Hi,

I have a requirement to compare 2 files which can contain 40 million or more records each, with more than 20 fields to compare.
Currently I am using awk scripting, and since awk runs into memory problems, I am not able to process files of more than 10 million records.

Any suggestions or pointers to change my logic would be a great help. I thought of splitting the files into 1 million records each, but then I would miss a few records in doing so.

Thanks in advance.
Rashmi

Would fgrep help you?

fgrep -v -f file2 file1 >file3

This will output file3 containing all lines from file1 that are not in file2.

What Operating System and version do you have?
What Shell do you use?

Does "wc -l filename" give an accurate answer for the number of lines? This tells me whether there is at least a chance of processing the file in unix Shell.
Are the files sorted to a simple order, or are they in random order?

Exactly how big are these files according to "ls -la"? The size is probably more important than the number of records, but the maximum size of any record is also important.
If any file is larger than 2GB it may be impossible to process with basic unix Shell commands (depends on the version of Shell).

Are the files definitely unix text files suitable for processing in unix Shell?
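
For reference, checks along these lines would answer most of those questions (file1 is just a placeholder for the real file name, and the '|' delimiter and first-field key are assumptions at this point; "sort -c" prints nothing when the file is already in order):

wc -l file1
ls -la file1
head -3 file1
sort -c -t'|' -k1,1 file1 && echo "sorted on field 1"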

When comparing these files, are you just interested in whether they are different?

If you are trying to do more sophisticated processing on the differences, do you have a high-level programming environment such as Oracle and the ability to write applications to process the data?

Don't forget to post sample input, expected processing, and sample expected output.

Ps. "fgrep" is not even vaguely suited to this task.
Pps. If "awk" fails, please post the "awk" script along with the environmental and numerical facts.

Hi,
I am working on Sun Microsystems SunOS 5.10.
My file is the output of an sql query which fetches more than 40 million records from the Oracle database,
from 2 different systems. I need to compare these 2 files by finding which fields are not matching and which
records are missing from file1 and file2. The 1st column would be displayed as it is, and if the fields match, then it should put Y, else N, in the output file.

For example:
file 1
--------

field1|field2|field3|
abc|123|234
def|345|456
hij|567|678

file2
---------

field1|field2|field3|
abc|890|234
hij|567|658

output file

field1|field2|field3|
abc|N|Y
def|345|456
hij|Y|N

Regarding the code I am using right now: I would be sorting the files before I start processing, and here the control_file tells
me which fields I need to compare. If the fields match, then it should put Y, else N, in the output file.

Thanks in advance

Sorting is probably going to be the biggest overhead by far, and I don't see a way to avoid it...

Assuming the two files are sorted by the first field (an sql query can do that), my proposal in Perl:

use strict;
use warnings;

# Output record separator: append a newline to every print.
$\ = "\n";
# Output field separator: join printed lists with '|'.
$, = '|';

if (@ARGV < 2) {
    print "USAGE: $0 <file1> <file2>";
    exit 1;
}

my $inputfile1 = shift @ARGV;
open F1, '<', $inputfile1 or die $inputfile1;

my $inputfile2 = shift @ARGV;
open F2, '<', $inputfile2 or die $inputfile2;

# Read and compare the header lines; both files must have the same layout.
my $h1 = <F1>; chomp $h1;
my $h2 = <F2>; chomp $h2;

if ($h1 eq $h2) {
    print $h1;
}
else {
    print STDERR "$0: different headers\n";
    exit 1;
}

# Pending record from each file: the key (first field) and the remaining fields.
# An undefined key means "read the next record from that file".
my $k1 = undef; my @F1 = ();
my $k2 = undef; my @F2 = ();

# Sorted-merge loop: only one record per file is held in memory at any time,
# so memory use does not grow with the size of the files.
while (1) {
    unless (defined $k1) { $_ = <F1>; last unless defined $_; chomp; ($k1, @F1) = split /\|/; }
    unless (defined $k2) { $_ = <F2>; last unless defined $_; chomp; ($k2, @F2) = split /\|/; }

    # Key present in only one file: print that record unchanged and advance that file.
    if ($k1 lt $k2) { print $k1, @F1; $k1 = undef; next; }
    if ($k2 lt $k1) { print $k2, @F2; $k2 = undef; next; }

    # Same key in both files: print Y or N per field (assumes both records have the same field count).
    print $k1, map { $_ eq shift @F2 ? 'Y' : 'N' } @F1;

    $k1 = undef;
    $k2 = undef;
}

# One file is exhausted: flush the pending record, if any, then print the
# remaining records of the other file unchanged.
if (defined $k1) { print $k1, @F1; }
while (<F1>) { chomp; print; }
if (defined $k2) { print $k2, @F2; }
while (<F2>) { chomp; print; }
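
Not tested here, but a rough sketch of how this could be run on Solaris, assuming the script above is saved as compare.pl (a hypothetical name) and the exports are not already ordered by the first field. The header line has to be kept out of the sort (on Linux use "tail -n +2" instead of "tail +2"):

( head -1 file1 ; tail +2 file1 | sort -t'|' -k1,1 ) > file1.srt
( head -1 file2 ; tail +2 file2 | sort -t'|' -k1,1 ) > file2.srt
perl compare.pl file1.srt file2.srt > result.txt

If the SQL exports are already produced with an ORDER BY on the first field, the two sort steps can be skipped.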

Thanks, can you please explain the code to me, as I don't know Perl scripting. Is there a way to do it in unix? Would reading the file using fopen consume less memory while doing the comparison and then writing to the output file?

Please advise.

Lateral thought.
It seems weird to export data on this scale when you have Oracle.
Is there a geographic, network, or Oracle version problem that prevents making both databases available to one Oracle program?
Failing that, is there space in one of the databases to load the comparative data into a temporary table?
Failing that, can a single Oracle program read the opposing flat file (which has been exported in key sequence order) and compare with its own database?
Failing that, can you create a new database to load the two files purely for comparison?
Bottom line: Do you have an Oracle programmer?