Hi Everyone,
I have two files i.e. one file2 contains today's data and the other file1 contains Yesterday's data.
The data in the files contains 226 columns and the data for the coulums separated by a Pipe "|" delimiter.
Now, I have 4 Primary keys (coulumns) by which I have to compare file2 and file1 and generate a diff file that is the data that is not present in file2 but present in file1 and vice versa.
The sample of the data that contains in file1 is ( 1 record):
Since I have 4 primary keys, I though of doing the code using Hashes.
Could someone please help me out. I am looking for a desperate help on this.
Really appreciated if someone could help out with some ideas.
Really appreciate your time in writing the awk script.
But each of these files have 500K records.
But I have managed to write a perl script for the diff file. i.e.
#!/usr/local/bin/perl
$self = $0;
$self =~ s!^.*/!!;
#
$[ = 1; # = number of first index into arrays and strings
#
$FIELD_SEPARATOR = '\t';
$FIELD_NUMBER_LIST =('38','82');
$field_separator = $FIELD_SEPARATOR;
$field_number_list = $FIELD_NUMBER_LIST;
#
while (@ARGV)
{
$_ = shift;
if (/^-F$/) { $field_separator = shift; }
elsif (/^-L$/) { $field_number_list = shift; }
elsif (/^-F.+$/) { $field_separator = substr($_,$[+2); }
elsif (/^-L.+$/) { $field_number_list = substr($_,$[+2); }
#else { push(@filename, $_); }
}
#
$file_a = 'file1';
$file_b = 'file2';
#
unless (($file_a ne "") && (-f $file_a))
{
die "Error: Can't find file '$file_a'!\n";
}
unless (($file_b ne "") && (-f $file_b))
{
die "Error: Can't find file '$file_b'!\n";
}
#
@index_list = split(/,/, $field_number_list);
#
# Scan first file, Pass 1:
open(FILE_A, "<$file_a") || die "Can't open '$file_a': $!\n";
#
while (<FILE_A>)
{
chop if /\n$/;
undef $key;
undef @field;
@field = split(/$field_separator/o);
foreach $index (@index_list)
{
if (defined $key)
{
$key .= "\n" . $field[$index];
}
else
{
$key = $field[$index];
}
}
$intersection{$key} = 1;
}
#
close(FILE_A);
# Scan second file, Pass 1:
#
$empty_intersection = 1;
#
open(FILE_B, "<$file_b") || die "Can't open '$file_b': $!\n";
#
while (<FILE_B>)
{
chop if /\n$/;
undef $key;
undef @field;
@field = split(/$field_separator/o);
foreach $index (@index_list)
{
if (defined $key)
{
$key .= "\n" . $field[$index];
}
else
{
$key = $field[$index];
}
}
$code = $intersection{$key};
if ($code == 1)
{
$intersection{$key} = 3;
$empty_intersection = 0;
}
else
{
if ($code != 3) { $intersection{$key} = 2; }
}
}
#
close(FILE_B);
#
# Prepare output file names:
$file_a_1 = $file_a . '.1';
#
# Scan first file, Pass 2:
#
open(FILE_A, "<$file_a") || die "Can't open '$file_a': $!\n";
open(FILE_A_1, ">$file_a_1") || die "Can't write '$file_a_1': $!\n";
#
while (<FILE_A>)
{
chop if /\n$/;
undef $key;
undef @field;
@field = split(/$field_separator/o);
foreach $index (@index_list)
{
if (defined $key)
{
$key .= "\n" . $field[$index];
}
else
{
$key = $field[$index];
}
}
if ($intersection{$key} == 3)
{
#
}
else
{
print FILE_A_1 $_, "\n";
}
}
#
close(FILE_A);
close(FILE_A_1);
#
# Scan second file, Pass 2:
#
open(FILE_B, "<$file_b") || die "Can't open '$file_b': $!\n";
open(FILE_A_1, ">>$file_a_1") || die "Can't write '$file_a_1': $!\n";
#
while (<FILE_B>)
{
chop if /\n$/;
undef $key;
undef @field;
@field = split(/$field_separator/o);
foreach $index (@index_list)
{
if (defined $key)
{
$key .= "\n" . $field[$index];
}
else
{
$key = $field[$index];
}
}
if ($intersection{$key} == 3)
{
# }
else
{
print FILE_A_1 $_, "\n";
}
}
#
close(FILE_B);
close(FILE_A_1);
#
# Display results:
#
printf("The Diff file created '%s'\n\n", $file_a_1);
#
The above code works perfectly for generating the diff file i.e. depending upon the primary keys (here taken 2) the outfile contains the records that exists in file1 but not in file2 and the records that exists in file2 but not in file1.
Now,
I need to compare the whole record(line) if the primary keys in file1 matches with the primary keys in file2. If both the lines are equal then discard else write to the outfile.
Could someone please help me out in order to the above step.
File2.diff --> Records that exists in File2 but not in File1
same1.txt --> Records that exists in both the files'
Here,
The file size of the same.txt and same1.txt is different.
--> 506108009 Jun 15 01:50 same.txt
--> 505878904 Jun 15 01:52 same1.txt
So, is there anyway that we can capture the records if the Primary Keys ($38 and $82) of both the files matches but the data in the other columns is not the same.
To solve the above issue, I thought once we have the files same.txt and same1.txt we should compare the number of characters in the a line in File1 with number of characters in a line in file2 , if there is a change then write to a file else discard (do nothing or leave it)
Could you please share any ideas to solve the above issue. It would be really grateful.