Help with a script or command to show the differences between two input files?

I have two files right now.
Input file 1:

>data_1
DSFDFDSGFDSGSGEGTRTRERPOYIORPGKKGDSPKFSDKFPSDKFSPFS
>data_34
WEEREREWREWOIQOPIEPDSKLFNDSFNSKNCASKJHDAFHAOUDFEOWWIOUFEWIUEWIRUEWIRUEWIORUEWOREWR
>data_21
ASDASDQWEQWRQERFWPOTGIUWEIPOFIOFDSNFKSJDNFSKDHFKDSJHFKDSJHF
>data_4
SDFDSFREREWPTIOEPOTIERWPOI
.
.

Input file 2:

>data_1
DSFDFDXXXXXXSGEGTRTRERPOYIORPGKKGDSPKFSDXXXXXXXXXXS
>data_34
WEEREREWREWOIQOPIEPDSKLFXXXXXXKNCASKJHDAFHAOUDFEOWWIOUFEWIUEWIRUEWIRUEWIORUEWOREWR
>data_21
ASDASDQWXXXXXXXXXXOTGIUWEIPOFIOFDSNFKSJDNFSKDHFKDSJHFKDSJHF
>data_4
SDFDSFREREWPTIOEPOTIEXXXXX
.
.

Desired output

      SGFDSG                            KFPSDKFSPFS

                        NDSFNS

        EQWRQERFWP

                    RWPOI
.
.

The words in bold are the parts that differ between input file 1 and input file 2.
I would like to export the text that appears only in input file 1, and not in input file 2, to a separate output file.
Thanks.

Do you want the letters from the first file that were replaced by XXXXXX in the second file?

Is that what you want?

Hi expert, I want to print the data in file 1 that the XXXXXX runs in file 2 originally stood for.
E.g.:
Input file 1

>data_1
DSFDFDSGFDSGSGEGTRTRERPOYIORPGKKGDSPKFSDKFPSDKFSPFS

Input file 2

>data_1
DSFDFDXXXXXXSGEGTRTRERPOYIORPGKKGDSPKFSDXXXXXXXXXXS

Comparing both files, XXXXXX and XXXXXXXXXXX in input file 2 originally represented
"SGFDSG" and "KFPSDKFSPFS" in input file 1.
"SGFDSG" and "KFPSDKFSPFS" are the data that I want to appear in the desired output file.
Thanks.

More a discussion post than a serious attempt at a commercial solution. Here is a chronically inefficient method which works for small files.
It assumes that there are no "tab" or "space" characters in either input file, because "tab" separates the results of the "paste" statement (which just places the corresponding lines from file1.txt and file2.txt side by side).

#!/bin/ksh
paste file1.txt file2.txt | awk '{print $1,$2}' | while read line1 line2
do
        if [ "${line1}" = "${line2}" ]
        then
                printf "\n"             # Output blank line
                continue
        fi
        #
        counter=0
        echo "${line1}"|fold -w 1|while read char1
        do
                counter=`expr ${counter} + 1`
                char2=`echo "${line2}"|cut -c${counter}`
                if [ "${char1}" = "${char2}" ]
                then
                        printf " "      # Single space
                else
                        printf "%s" "${char1}"
                fi
        done
        printf "\n"     # Newline
done


./scriptname

      SGFDSG                            KFPSDKFSPF 

                        NDSFNS                                                  

        EQWRQERFWP                                         

                     RWPOI

Hmm. I got "KFPSDKFSPF" not "KFPSDKFSPFS".

Hi.

In whatever language one wishes: read a character from each file (or read lines, then step through the characters). Apply the following logic:

file1 file2 matchX? result
a     b     no      space->output
c     X     yes     c->output
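
For what it's worth, the table above could be sketched per line roughly like this. It is only an illustration, not a tuned solution: the function name mask_diff is made up for this example, and it assumes the two strings are corresponding lines of file 1 and file 2 and that "X" never occurs as real data.

```shell
# mask_diff LINE1 LINE2  (hypothetical helper name, not from the thread)
# Prints LINE1's character wherever LINE2 has an "X", and a space elsewhere.
mask_diff() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = length(a)
        out = ""
        for (i = 1; i <= n; i++) {
            # matchX? yes -> character from file 1, no -> space
            out = out ((substr(b, i, 1) == "X") ? substr(a, i, 1) : " ")
        }
        print out
    }'
}

mask_diff "DSFDFDSGFDSGSG" "DSFDFDXXXXXXSG"
```

Note that this tests for "X" in file 2 rather than comparing the two characters for equality, so it gives a different result from the equality-based scripts whenever the files differ somewhere other than at an X.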

@methyl:
I got the same off-by-one result. I think the picture template of Xs from the OP for that line is wrong ... cheers, drl

Hi, expert.
You are right.
I would like to find out what the XXXXXX in file 2 stands for, using file 1 as the data source.
Do you have any idea how to solve the problem?
Thanks in advance.

I just tried the awk command. It seems to take a long time when input file 1 and input file 2 are very large, e.g. 1 GB.
Does any Perl expert have a better idea or solution?

awk '{n=split($0,a,"");getline < "file2"; split($0,b,"");
      for (i=1;i<=n;i++) printf "%s", (a[i]==b[i])?" ":a[i];printf "\n"}' file1

See if this works faster:

awk -F '' '{getline s<f;split(s,T);for(i=1;i<=NF;i++)if($i==T[i])$i=" "}1' OFS= f=file2 file1

Try mawk instead of awk if you have that available...

Thanks, Scrutinizer.
I just tried your awk command. It worked fine :)
I found that it requires a huge amount of memory when comparing two huge files (>1 GB).
Do you have any better idea for solving this problem?

Thanks, rdcwayx.
Your awk command worked fine :)
When comparing two huge data files, do you have any suggestion for reducing the memory required by the awk command?

That surprises me. The little program shouldn't store more than about two times two lines at any time in its internal variables... Are you sure the application itself is using that memory, and that it is not OS caching, as happens on Linux for example, which is in fact free memory? How long are the lines? How did you determine the memory use?

Some of the read lengths are around 10,000,000 characters or more.
The huge memory use of the awk program shows up when I run "top" in the bash shell :(

That is a bit much. Perhaps you could introduce a couple of linefeeds and limit the line length to for example 80 characters:

awk '{getline s<f;split(s,T);for(i=1;i<=NF;i++)if($i==T[i])$i=" "}1' FS= OFS= f=<(fold -w80 file2) <(fold -w80 file1)

On most operating systems this example works only in bash/ksh93. But you can always prepare the input files with the fold command first and then use those files as input.
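
For shells without process substitution, the "prepare the files first" route might look roughly like the sketch below. The function name compare_folded and the temporary file handling are assumptions made for this example; mktemp is not strictly POSIX but is available on most systems. It uses substr instead of an empty field separator, so it should not depend on gawk/mawk behaviour.

```shell
# compare_folded FILE1 FILE2 [WIDTH]  (hypothetical helper, default width 80)
# Folds both files to the same width, then blanks out every character of
# file 1 that is identical to the character at the same position in file 2.
compare_folded() {
    width=${3:-80}
    tmp1=$(mktemp) || return 1
    tmp2=$(mktemp) || { rm -f "$tmp1"; return 1; }
    fold -w "$width" "$1" > "$tmp1"
    fold -w "$width" "$2" > "$tmp2"
    awk -v f="$tmp2" '{
        # read the matching line of the folded second file
        if ((getline s < f) <= 0) s = ""
        out = ""
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            out = out ((c == substr(s, i, 1)) ? " " : c)
        }
        print out
    }' "$tmp1"
    rm -f "$tmp1" "$tmp2"
}
```

It would be called as, for example, compare_folded file1 file2 80. Because both files are folded at the same width, the columns still line up, but the output is folded as well, so one long input line comes out as several short ones.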

Thanks for your advice, Scrutinizer.

As I like PERL:

use strict;
use warnings;
use File::Basename;

my $NAME = basename $0;

$\ = "\n";
$, = '';
$" = '';

if (2 != @ARGV) {
    print STDERR 'USAGE: ', $NAME, ' <file1> <file2>';
    exit 1;
}

my $F1 = shift @ARGV;
my $F2 = shift @ARGV;

my $len = length($F1) > length($F2) ? length($F1) : length($F2);
my $fmt = "\%-${len}s(%d): \%s\n";

open F1, '<', $F1 or die $F1;
open F2, '<', $F2 or die $F2;

my $L1;
my $L2;

while (1) {
    $L1 = <F1>;
    $L2 = <F2>; 

    last unless defined $L1 && defined $L2;

    if ($L1 eq $L2) {
        print '';
        next;
    }

    chomp $L1;
    chomp $L2;

    my @L1 = split //, $L1;
    my @L2 = split //, $L2;

    my @R = ();

    while (0 < @L1 || 0 < @L2 ) {
        my $c1 = shift @L1; $c1 = ' ' unless defined $c1;
        my $c2 = shift @L2; $c2 = ''  unless defined $c2;

        push @R, $c1 eq $c2 ? ' ' : $c1;
    }

    print @R;
}

# if file-1 is longer than file-2

while (defined $L1) {
    chomp $L1;
    print $L1;
    $L1 = <F1>;
}

# if file-2 is longer than file-1

while (defined $L2) {
    chomp $L2;    # drop the newline so it is not counted as a column
    print ' ' x length($L2);
    $L2 = <F2>;
}

It was not a typo that $c1 is assigned a space if not defined and $c2 an empty string if not defined. The former causes a space to be added to the result if line 1 is shorter than line 2; the latter causes the value of $c1 to be added if line 2 is shorter than line 1.
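
To make the padding rule concrete, here is a small shell/awk sketch of the same per-line logic. It is not a translation of the whole Perl script; the helper name diff_line is invented for this illustration, and it takes the two lines as arguments rather than reading files.

```shell
# diff_line LINE1 LINE2  (hypothetical helper name)
# Same rule as the inner loop: a missing character of line 1 counts as a
# space, a missing character of line 2 counts as the empty string.
diff_line() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        la = length(a); lb = length(b)
        n = (la > lb) ? la : lb
        out = ""
        for (i = 1; i <= n; i++) {
            c1 = (i <= la) ? substr(a, i, 1) : " "  # line 1 exhausted: pad with a space
            c2 = (i <= lb) ? substr(b, i, 1) : ""   # line 2 exhausted: never equal to c1
            out = out ((c1 == c2) ? " " : c1)
        }
        print out
    }'
}

diff_line "ABCDE" "ABX"     # line 2 shorter: the tail of line 1 is printed
diff_line "AB" "ABXYZ"      # line 1 shorter: padded with spaces to the length of line 2
```

The first call should print two spaces followed by "CDE", and the second a line of five spaces.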

A billion+ characters will take some time to process.

Thanks, m.d.ludwig.
Your Perl script ran faster and required less memory :)