The word in BOLD is the part that differs between input file 1 and input file 2.
I would like to export the words that appear only in input file 1 but not in input file 2 into another output file.
Thanks.
Comparing both files, XXXXXX and XXXXXXXXXXX in input file 2 originally represent
"SGFDSG" and "KFPSDKFSPFS" in input file 1.
"SGFDSG" and "KFPSDKFSPFS" are the data that I would like to appear in the desired output file.
Thanks.
More a discussion post than a serious attempt at a commercial solution: here is a chronically inefficient method which works for small files.
It assumes that there are no "tab" or "space" characters in either input file, because "tab" is used to separate the results of the "paste" statement (which just places the corresponding lines from file1.txt and file2.txt side by side).
#!/bin/ksh
paste file1.txt file2.txt | awk '{print $1,$2}' | while read line1 line2
do
    if [ "${line1}" = "${line2}" ]
    then
        printf "\n"    # Lines identical - output blank line
        continue
    fi
    #
    counter=0
    echo "${line1}" | fold -w 1 | while read char1
    do
        counter=`expr ${counter} + 1`
        char2=`echo "${line2}" | cut -c${counter}`
        if [ "${char1}" = "${char2}" ]
        then
            printf " "                 # Characters match - single space
        else
            printf "%s" "${char1}"     # Characters differ - emit char from file 1
        fi
    done
    printf "\n"    # Newline at end of each output line
done
./scriptname
SGFDSG KFPSDKFSPF
NDSFNS
EQWRQERFWP
RWPOI
Hi, expert.
You are right.
I would like to find out what the XXXXXX in file 2 is, based on the data source of file 1.
Do you have any idea how to solve the problem?
Thanks first.
---------- Post updated at 12:12 AM ---------- Previous update was at 12:10 AM ----------
I just tried using the awk command. It seems to take a long time when input file 1 and input file 2 are very large files, e.g. 1GB.
Does any Perl language expert have a better idea or solution?
Thanks, Scrutinizer.
I just tried your awk command. It worked fine.
However, I found that it requires a huge amount of memory when comparing two huge files (>1GB).
Do you have any better idea to get around this problem?
---------- Post updated at 09:03 PM ---------- Previous update was at 09:01 PM ----------
Thanks, rdcwayx.
Your awk command worked fine.
When comparing two huge data files, do you have any suggestion to reduce the memory required by the awk command?
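The awk commands from the earlier posts are not quoted in this thread, but one general way to cut the memory footprint is to stream both files line by line instead of loading either one into an array. A minimal sketch (assuming the same file1.txt/file2.txt names used above, and that both files have the same number of lines):

```shell
#!/bin/sh
# Stream file1.txt line by line; fetch the matching line of file2.txt
# with getline, so only the current pair of lines is held in memory.
awk -v f2=file2.txt '
{
    getline line2 < f2                    # corresponding line from file 2
    if ($0 == line2) { print ""; next }   # identical lines -> blank line
    n = (length($0) > length(line2)) ? length($0) : length(line2)
    out = ""
    for (i = 1; i <= n; i++) {
        c1 = substr($0, i, 1)
        c2 = substr(line2, i, 1)
        out = out ((c1 == c2) ? " " : c1) # keep only the differing chars
    }
    print out
}' file1.txt
```

Memory use is then bounded by the longest line rather than the file size, at the cost of reading both files once each.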
That surprises me. The little program shouldn't store more than about two times two lines in its internal variables at any time... Are you sure the application itself is using that memory, and that it is not caching by the OS, which on Linux, for example, is in fact free memory? How long are the lines? How did you determine the memory use?
This example would work in bash/ksh93 only on most OSes. But you can always first prepare the input files using the fold command and then use those files as input...
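One way to read that fold suggestion: split each line into one character per line with fold -w 1, pair the two streams with paste, and compare field-wise. A sketch for a single pair of lines (process substitution, hence the bash/ksh93 restriction; the sample strings are the ones from the thread):

```shell
#!/bin/bash
line1="SGFDSG"
line2="XXXXXX"

# fold -w 1 emits one character per line; paste pairs the two streams
# with a tab; awk prints a space for matching characters and the
# file-1 character for mismatches.
paste <(printf '%s\n' "$line1" | fold -w 1) \
      <(printf '%s\n' "$line2" | fold -w 1) |
awk -F'\t' '{ printf "%s", ($1 == $2) ? " " : $1 } END { print "" }'
# prints SGFDSG
```

If the two lines have different lengths, paste leaves the missing field empty, so surplus characters from line 1 are printed as-is.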
use strict;
use warnings;
use File::Basename;

my $NAME = basename $0;
$\ = "\n";    # output record separator: print appends a newline
$, = '';      # output field separator: print joins list elements directly
$" = '';

if (2 != @ARGV) {
    print STDERR 'USAGE: ', $NAME, ' <file1> <file2>';
    exit 1;
}
my $F1 = shift @ARGV;
my $F2 = shift @ARGV;
my $len = length($F1) > length($F2) ? length($F1) : length($F2);
my $fmt = "\%-${len}s(%d): \%s\n";

open F1, '<', $F1 or die "$F1: $!";
open F2, '<', $F2 or die "$F2: $!";

my $L1;
my $L2;
while (1) {
    $L1 = <F1>;
    $L2 = <F2>;
    last unless defined $L1 && defined $L2;
    if ($L1 eq $L2) {
        print '';    # identical lines - output a blank line
        next;
    }
    chomp $L1;
    chomp $L2;
    my @L1 = split //, $L1;    # explode both lines into characters
    my @L2 = split //, $L2;
    my @R  = ();
    while (0 < @L1 || 0 < @L2) {
        my $c1 = shift @L1; $c1 = ' ' unless defined $c1;
        my $c2 = shift @L2; $c2 = ''  unless defined $c2;
        push @R, $c1 eq $c2 ? ' ' : $c1;
    }
    print @R;
}
# if file-1 is longer than file-2
while (defined $L1) {
    chomp $L1;
    print $L1;
    $L1 = <F1>;
}
# if file-2 is longer than file-1
while (defined $L2) {
    chomp $L2;
    print ' ' x length($L2);
    $L2 = <F2>;
}
It was not a typo that $c1 is assigned a space when undefined while $c2 is assigned an empty string when undefined. The former causes a space to be appended to the result when line 1 is shorter than line 2; the latter causes the value of $c1 to be appended when line 2 is shorter than line 1.
A billion-plus characters will take some time to process.