Finding character mismatch position in two strings

etherite · July 15, 2008, 10:32am

Hello,

I would like to find an efficient way to compare a pair of strings that differ at one position, and return the difference and position.

For example:

String1 123456789

String2 123454789

returning something like: position 6, 6/4

Thanks in advance,

Mike

joeyg · July 15, 2008, 10:54am

I think it has an option to process byte by byte; seems to be what you are looking for.

jim_mcnamara · July 15, 2008, 11:31am

or awk

echo 12345678 12345478 | \
awk ' BEGIN {pos=0}
    {
     # compare up to the length of the longer string
     max=(length($1) >= length($2))? length($1): length($2)
     for(i=1; pos == 0 && i <= max; i++)
     {
         v1=substr($1, i, 1)
         v2=substr($2, i, 1)
         if(v1 != v2){ pos=i }   # remember the first mismatch; this ends the loop
     }
    }
    END { if(pos) {printf("%d %s/%s\n", pos, v1, v2) }}'

etherite · July 15, 2008, 12:21pm

An awk solution is great! Thanks Jim.

I've also just found cmp in the GNU diffutils package, but yours is pretty much what I was looking for.
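For comparison, here is a rough sketch of the cmp route. cmp works on files rather than strings, so this writes each string to a temporary file first. The helper name byte_diff is made up for illustration, and the sketch assumes mktemp is available. With -l, cmp lists each differing byte as a 1-based decimal offset followed by the two byte values in octal.

```shell
# byte_diff STRING1 STRING2 (hypothetical helper, not part of diffutils)
byte_diff() {
    f1=$(mktemp) || return 1
    f2=$(mktemp) || { rm -f "$f1"; return 1; }
    printf '%s' "$1" > "$f1"
    printf '%s' "$2" > "$f2"
    # -l: print offset (decimal) and both byte values (octal) for every difference.
    # cmp exits with status 1 when the files differ, which is not an error here.
    cmp -l "$f1" "$f2" || true
    rm -f "$f1" "$f2"
}

byte_diff 123456789 123454789
```

For the strings in the question this reports offset 6 with the values 66 and 64, which are the octal codes of the characters 6 and 4.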

etherite · July 21, 2008, 6:49am

Oh bother!

It turns out that I didn't fully explain what I was trying to do. Jim's solution works for the single pair of strings that I wish to compare. However, I actually have a file with one pair of strings on each line, and I would like to carry out the comparison on each line in turn. Jim's awk script just checks the first line.

Sorry if I am being dumb about this.

Mike

era · July 21, 2008, 8:45am

Here's a minor adaptation of jim's script. It prints the line number and the offset, or nothing if both tokens are identical.

awk '{ pos=0
     max=(length($1) >= length($2))? length($1): length($2)
     for(i=1; pos == 0 && i <= max; i++)
     {
         v1=substr($1, i, 1)
         v2=substr($2, i, 1)
         # record the offset so the loop stops at the first mismatch
         if(v1 != v2) { pos=i; printf "%d: %d %s/%s\n", NR, pos, v1, v2 }
     }
    }' filename
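To try the per-line approach without a data file, here is a self-contained sketch that wraps the same idea in a shell function and feeds it three sample lines. The function name first_mismatch and the sample data are invented for illustration. It prints the characters with %s so that non-digit strings also display correctly, and uses break to stop at the first mismatch.

```shell
# first_mismatch [FILE...] (hypothetical helper)
# Reads lines of "string1 string2" and reports, for each line that differs,
# "line: position char1/char2" for the first mismatching position.
first_mismatch() {
    awk '{
        n = (length($1) >= length($2)) ? length($1) : length($2)
        for (i = 1; i <= n; i++) {
            a = substr($1, i, 1)
            b = substr($2, i, 1)
            if (a != b) {
                printf "%d: %d %s/%s\n", NR, i, a, b
                break   # only the first difference on each line is reported
            }
        }
    }' "$@"
}

printf '%s\n' '123456789 123454789' 'abcdef abcdef' 'hello help' | first_mismatch
```

The sample run prints "1: 6 6/4" and "3: 4 l/p"; the second line produces no output because both strings are identical. Passing a file name instead of piping, as in first_mismatch filename, works the same way.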