Help with a script or command to show the differences between two input files?

perl_beginner · December 26, 2010, 2:11am

I have two files right now.
Input file 1:

>data_1
DSFDFDSGFDSGSGEGTRTRERPOYIORPGKKGDSPKFSDKFPSDKFSPFS
>data_34
WEEREREWREWOIQOPIEPDSKLFNDSFNSKNCASKJHDAFHAOUDFEOWWIOUFEWIUEWIRUEWIRUEWIORUEWOREWR
>data_21
ASDASDQWEQWRQERFWPOTGIUWEIPOFIOFDSNFKSJDNFSKDHFKDSJHFKDSJHF
>data_4
SDFDSFREREWPTIOEPOTIERWPOI
.
.

Input file 2:

>data_1
DSFDFDXXXXXXSGEGTRTRERPOYIORPGKKGDSPKFSDXXXXXXXXXXS
>data_34
WEEREREWREWOIQOPIEPDSKLFXXXXXXKNCASKJHDAFHAOUDFEOWWIOUFEWIUEWIRUEWIRUEWIORUEWOREWR
>data_21
ASDASDQWXXXXXXXXXXOTGIUWEIPOFIOFDSNFKSJDNFSKDHFKDSJHFKDSJHF
>data_4
SDFDSFREREWPTIOEPOTIEXXXXX
.
.

Desired output

      SGFDSG                            KFPSDKFSPFS

                        NDSFNS

        EQWRQERFWP

                    RWPOI
.
.

The text shown in the desired output is the part that differs between input file 1 and input file 2.
I would like to export the text that appears in input file 1 but not in input file 2 to another output file.
Thanks.

expert · December 26, 2010, 8:14am

Do you want the letters from the first file that correspond to the XXXXXX runs in the second file?

Is that what you want?

perl_beginner · December 26, 2010, 9:19am

Hi expert, I want to print the data in file 1 that the XXXXXX in file 2 originally represented.
eg.
Input file 1

>data_1
DSFDFDSGFDSGSGEGTRTRERPOYIORPGKKGDSPKFSDKFPSDKFSPFS

Input file 2

>data_1
DSFDFDXXXXXXSGEGTRTRERPOYIORPGKKGDSPKFSDXXXXXXXXXXS

Comparing both files, XXXXXX and XXXXXXXXXXX in input file 2 originally represented
"SGFDSG" and "KFPSDKFSPFS" in input file 1.
"SGFDSG" and "KFPSDKFSPFS" are the data that I want to appear in the desired output file.
Thanks.

methyl · December 26, 2010, 7:16pm

More a discussion post than a serious attempt at a commercial solution. Here is a chronically inefficient method which works for small files.
It assumes that there are no "tab" or "space" characters in either input file, because "tab" separates the results of the "paste" statement (which just places the corresponding lines from file1.txt and file2.txt side by side).

#!/bin/ksh
# Pair up corresponding lines of the two files, then compare them character by character.
paste file1.txt file2.txt | awk '{print $1,$2}' | while read -r line1 line2
do
        if [ "${line1}" = "${line2}" ]
        then
                printf "\n"             # Output blank line
                continue
        fi
        #
        counter=0
        echo "${line1}"|fold -w 1|while read -r char1
        do
                counter=`expr ${counter} + 1`
                char2=`echo "${line2}"|cut -c${counter}`
                if [ "${char1}" = "${char2}" ]
                then
                        printf " "      # Single space
                else
                        printf "%s" "${char1}"  # Character only in file 1
                fi
        done
        printf "\n"     # Newline
done


./scriptname

      SGFDSG                            KFPSDKFSPF 

                        NDSFNS                                                  

        EQWRQERFWP                                         

                     RWPOI

Hmm. I got "KFPSDKFSPF" not "KFPSDKFSPFS".

drl · December 26, 2010, 8:04pm

Hi.

In whatever language one wishes: read a character from each file (or read lines, then step through the characters). Apply the following logic:

file1 file2 matchX? result
a     b     no      space->output
c     X     yes     c->output
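As a rough sketch of the logic in the table above (not code from any post in this thread), the following shell function wraps a small awk program. The function name unmask and the use of substr() to step through the characters are assumptions made here for illustration. It also assumes that both files have the same number of lines and that the mask character is a literal X.

```shell
# unmask FILE1 FILE2
# Hypothetical helper: for each character position, print the FILE1 character
# when FILE2 holds the mask character X there, otherwise print a space.
unmask() {
    awk -v f2="$2" '
    {
        # Read the matching line of FILE2; treat a missing line as empty.
        if ((getline other < f2) <= 0) other = ""
        n = length($0)
        for (i = 1; i <= n; i++) {
            c1 = substr($0, i, 1)
            c2 = substr(other, i, 1)
            printf "%s", ((c2 == "X") ? c1 : " ")
        }
        print ""
    }' "$1"
}
```

Called as `unmask file1 file2 > output`, it writes one output line per input line. Header lines such as >data_1 come out as runs of spaces rather than empty lines.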

@methyl:
I got the same off-by-one result. I think the picture template of Xs from the OP for that line is wrong ... cheers, drl

perl_beginner · December 31, 2010, 12:12am

Hi, expert.
You are right.
I would like to find out what the XXXXXX in file 2 stands for, using file 1 as the data source.
Do you have any idea how to solve the problem?
Thanks in advance.

---------- Post updated at 12:12 AM ---------- Previous update was at 12:10 AM ----------

I just tried the awk command. It seems to take a long time when input file 1 and input file 2 are very large, e.g. 1GB.
Does any Perl expert have a better idea or solution?

rdcwayx · December 31, 2010, 1:01am

awk '{n=split($0,a,"");getline < "file2"; split($0,b,"");
      for (i=1;i<=n;i++) printf "%s", (a[i]==b[i])?" ":a[i];printf "\n"}' file1

Scrutinizer · December 31, 2010, 3:54am

See if this works faster:

awk -F '' '{getline s<f;split(s,T);for(i=1;i<=NF;i++)if($i==T[i])$i=" "}1' OFS= f=file2 file1

Try mawk instead of awk if you have that available...

perl_beginner · January 2, 2011, 9:03pm

Thanks, Scrutinizer.
I just tried your awk command. It worked fine.
However, I found that it requires a huge amount of memory when comparing two huge files (>1GB).
Do you have a better idea for solving this problem?

---------- Post updated at 09:03 PM ---------- Previous update was at 09:01 PM ----------

Thanks, rdcwayx.
Your awk command worked fine.
When comparing two huge data files, do you have any suggestion for reducing the memory required by the awk command?

Scrutinizer · January 3, 2011, 12:15am

That surprises me. The little program shouldn't store more than about two times two lines at any time in its internal variables... Are you sure the application itself is using that memory, and that it is not the OS file cache (as on Linux, for example), which is in effect free memory? How long are the lines? How did you determine the memory use?

perl_beginner · January 3, 2011, 1:32am

Some of the read lengths are around 10,000,000 or more.
The huge memory use of the awk program shows up when I run "top" in the bash shell.

Scrutinizer · January 3, 2011, 2:39am

That is a bit much. Perhaps you could introduce a couple of linefeeds and limit the line length to for example 80 characters:

awk '{getline s<f;split(s,T);for(i=1;i<=NF;i++)if($i==T[i])$i=" "}1' FS= OFS= f=<(fold -w80 file2) <(fold -w80 file1)

On most operating systems this example works only in bash/ksh93. But you can always prepare the input files first using the fold command and then use those files as input....
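One way to prepare the folded files first, as suggested above, is sketched below. The function name fold_unmask, the temporary-file handling with mktemp, and the use of substr() instead of an empty field separator are assumptions made for this illustration, not part of the one-liner above. It relies on corresponding lines in the two files having equal length, so that fold breaks both at the same positions. The output is wrapped at the same width as the folded input.

```shell
# fold_unmask FILE1 FILE2 [WIDTH]
# Hypothetical helper: fold both inputs to WIDTH columns (default 80) into
# temporary files, then blank out every FILE1 character that equals the
# character at the same position in FILE2.
fold_unmask() {
    width=${3:-80}
    tmp1=$(mktemp) || return 1
    tmp2=$(mktemp) || { rm -f "$tmp1"; return 1; }
    fold -w "$width" "$1" > "$tmp1"
    fold -w "$width" "$2" > "$tmp2"
    awk -v f2="$tmp2" '
    {
        # Read the matching folded line of FILE2; treat a missing line as empty.
        if ((getline other < f2) <= 0) other = ""
        n = length($0)
        for (i = 1; i <= n; i++) {
            c1 = substr($0, i, 1)
            printf "%s", ((c1 == substr(other, i, 1)) ? " " : c1)
        }
        print ""
    }' "$tmp1"
    rm -f "$tmp1" "$tmp2"
}
```

A call such as `fold_unmask file1 file2 80 > output` keeps every line awk has to hold at 80 characters or fewer, at the cost of two temporary copies of the input on disk.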

perl_beginner · January 3, 2011, 4:52am

Thanks for your advice, Scrutinizer.

m.d.ludwig · January 3, 2011, 8:03am

As I like PERL:

use strict;
use warnings;
use File::Basename;

my $NAME = basename $0;

$\ = "\n";
$, = '';
$" = '';

if (2 != @ARGV) {
    print STDERR 'USAGE: ', $NAME, ' <file1> <file2>';
    exit 1;
}

my $F1 = shift @ARGV;
my $F2 = shift @ARGV;

my $len = length($F1) > length($F2) ? length($F1) : length($F2);
my $fmt = "\%-${len}s(%d): \%s\n";

open F1, '<', $F1 or die $F1;
open F2, '<', $F2 or die $F2;

my $L1;
my $L2;

while (1) {
    $L1 = <F1>;
    $L2 = <F2>; 

    last unless defined $L1 && defined $L2;

    if ($L1 eq $L2) {
        print '';
        next;
    }

    chomp $L1;
    chomp $L2;

    my @L1 = split //, $L1;
    my @L2 = split //, $L2;

    my @R = ();

    while (0 < @L1 || 0 < @L2 ) {
        my $c1 = shift @L1; $c1 = ' ' unless defined $c1;
        my $c2 = shift @L2; $c2 = ''  unless defined $c2;

        push @R, $c1 eq $c2 ? ' ' : $c1;
    }

    print @R;
}

# if file-1 is longer than file-2

while (defined $L1) {
    chomp $L1;
    print $L1;
    $L1 = <F1>;
}

# if file-2 is longer than file-1

while (defined $L2) {
    chomp $L2;    # do not count the trailing newline as a character
    print ' ' x length($L2);
    $L2 = <F2>;
}

It is not a typo that $c1 is set to a space when undefined while $c2 is set to an empty string when undefined. The former causes a space to be added to the result if line 1 is shorter than line 2; the latter causes the value of $c1 to be added if line 2 is shorter than line 1.
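The padding rule described above can be tried in isolation. The sketch below restates it in awk inside a shell function rather than in Perl. The name diff_pad and the two-string interface are assumptions made for illustration and are not part of the script above.

```shell
# diff_pad LINE1 LINE2
# Hypothetical helper: print the LINE1 characters that differ from LINE2,
# using the same padding rule for lines of unequal length.
diff_pad() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = (length(a) > length(b)) ? length(a) : length(b)
        for (i = 1; i <= n; i++) {
            c1 = substr(a, i, 1)
            if (c1 == "") c1 = " "      # line 1 exhausted: pad with a space
            c2 = substr(b, i, 1)        # line 2 exhausted: empty string
            printf "%s", ((c1 == c2) ? " " : c1)
        }
        print ""
    }'
}
```

With these definitions, `diff_pad ABCDE ABC` prints three spaces followed by DE, and `diff_pad ABC ABXYZ` prints two spaces, C, and two more spaces.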

A billion+ characters will take some time to process.

perl_beginner · January 4, 2011, 2:52am

Thanks, m.d.ludwig.
Your Perl script ran faster and used less memory.