Perl: Regex for Search and Replace that has a flexible match

Hi,

I'm trying to match the front and back of a sequence. It works when there is an exact match (obviously), but I need the regex to be more flexible. When we get strings of nucleotides, their prefixes and suffixes sometimes aren't exact matches. Sometimes there is an extra letter, sometimes a letter is missing, and sometimes both.

For example, if I were trying to match the string "Imhungry" at the front of a string and replace it with nothing, I would use the following code.

$sequence =~ s/^.*?Imhungry//s;

This works great, but I need help writing some flexibility into the regex so that it also captures instances where
[1] a single letter is missing, e.g. "Imungry" or "mungry".
[2] a single letter is added (any letter), e.g. "Immhungry" or "Imhungryy".
[3] both, e.g. "Imhungyy" or "Immungryy" *notice this last example has two single-letter duplications and one deletion

Thanks!

If this is too absurd let me know.

With a {0,2} quantifier on each character I think I can do this.

$sequence =~ s/^.*?I{0,2}m{0,2}h{0,2}u{0,2}n{0,2}g{0,2}r{0,2}y{0,2}//s;
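One thing worth checking with this pattern: every character is allowed to occur zero times, so the pattern can match the empty string. The short sketch below (the sample string 'xxImhungryACGT' is made up for illustration) suggests that, combined with the lazy .*? in front, the substitution can succeed at position 0 without removing anything.

```perl
use strict;
use warnings;

# The pattern from the post above, with {0,2} on every character.
my $loose = qr/I{0,2}m{0,2}h{0,2}u{0,2}n{0,2}g{0,2}r{0,2}y{0,2}/;

# Every quantifier may match zero times, so the empty string matches.
print "matches the empty string\n" if '' =~ /^$loose$/;

# The lazy .*? first tries to consume nothing, and the pattern then
# matches zero characters at position 0, so nothing gets removed.
(my $sequence = 'xxImhungryACGT') =~ s/^.*?$loose//s;
print "$sequence\n";    # prints xxImhungryACGT (unchanged)
```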

There are transforms like soundex that nullify spelling differences.

A regex that tolerates a missing or extra copy of every byte of the key gets too loose, fast. You might construct an extended regex where, for an n-byte key, only one byte at a time is made flexible, so each alternative still requires the other n-1 bytes, e.g., for 'abcd', 'a*bcd|ab*cd|abc*d|abcd'.
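The alternation can be generated from the key instead of typed by hand. This is only a sketch of that idea: the helper name one_flexible_pattern is made up, and it uses {0,2} (absent, once, or doubled) on the flexible byte rather than *, which is an assumption borrowed from the earlier attempt in this thread.

```perl
use strict;
use warnings;

# Build an alternation in which exactly one character of $key is
# flexible (absent, present once, or doubled) per alternative.
# All other characters in that alternative must match exactly.
sub one_flexible_pattern {
    my ($key) = @_;
    my @chars = split //, $key;
    my @alts;
    for my $i (0 .. $#chars) {
        my @parts = map { quotemeta($_) } @chars;   # escape regex metacharacters
        $parts[$i] .= '{0,2}';                      # relax only position $i
        push @alts, join '', @parts;
    }
    my $alternation = join '|', @alts;
    return qr/(?:$alternation)/;
}

my $prefix = one_flexible_pattern('Imhungry');

# Sample inputs are invented: exact key, one deletion, one duplication.
for my $sequence ('xxImhungryACGT', 'xxImungryACGT', 'xxImmhungryACGT') {
    (my $trimmed = $sequence) =~ s/^.*?$prefix//s;
    print "$sequence -> $trimmed\n";
}
```

This handles one deletion or one duplication per key, not two edits at once, and it does not cover an arbitrary inserted letter. Because the first alternative that succeeds wins, a doubled final letter (as in "Imhungryy") can leave the extra letter behind.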

I suppose you could write a scoring system that counts how many bytes of the key are extra or missing in each candidate match, sort by the score, and cut off at an 80% score or something.
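One way such a score might look is sketched below. It swaps in Levenshtein edit distance as the count of extra, missing, or changed bytes, which is an assumption and not something settled in this thread. The names edit_distance and match_score, the normalisation by the longer string, and the candidate list are all made up for illustration.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Levenshtein edit distance: the minimum number of single-character
# insertions, deletions, and substitutions turning $s into $t.
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));
    for my $i (1 .. length($s)) {
        my @cur = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min(
                $prev[$j] + 1,            # deletion
                $cur[$j - 1] + 1,         # insertion
                $prev[$j - 1] + $cost,    # match or substitution
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Score in [0, 1]: 1 means identical, lower means more edits.
sub match_score {
    my ($key, $candidate) = @_;
    my $longest = max(length($key), length($candidate));
    return 1 if $longest == 0;
    return 1 - edit_distance($key, $candidate) / $longest;
}

my $key    = 'Imhungry';
my $cutoff = 0.8;    # the "80% or something" threshold; tune as needed

# Score each candidate, drop those below the cutoff, best score first.
my @ranked = sort { $b->[1] <=> $a->[1] }
             grep { $_->[1] >= $cutoff }
             map  { [ $_, match_score($key, $_) ] }
             qw(Imungry Immhungry Immungryy Imhappy);

printf "%-10s %.3f\n", @$_ for @ranked;
```

With this scoring, "Imungry" is one edit away from the key and scores 0.875, while "Immungryy" is two edits away and scores about 0.78, so the 0.8 cutoff would drop it. A lower cutoff would be needed to keep examples like that one.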

I think you are right. The regex I wrote is too loose. I'm going to give the extended regex a try and then decide whether I should use a scoring system. Thanks for responding.