Question about Perl regex

For a programming exercise, I am meant to design a Perl script that detects double letters in a text file.

I tried the following expressions


# Check for any double letter within the alphabet

/[a-zA-Z]+/

# Check for any repetition of an alphanumeric character

/\w+/

I'm aware that the + means to search for one or more occurrences of that character, but neither of these met the requirements of my program.

Also

/[a-zA-Z]{1}/

did not prove helpful either.

After doing some searching, I stumbled across the correct form of the regex for the double letter case. It turned out to be

/(.)\1/

Now I know that . refers to any single character and the \1 refers to the first character in the line being read (if s/..../.... is being used), but I'm still puzzled as to why /(.)\1/ works instead of /[a-zA-Z]+/ for the case of double letters?

many thanks
James

* Incorrect text removed *

/[a-zA-Z]+/ only means matching a contiguous sequence of one or more letters, so not only 'AA' or 'zz' will match; 'Az' will match too, and so will a single 'A'.

\1 is a backreference to what was matched by the parentheses in the regexp. So /(.)\1/ finds a double occurrence of whatever (.) matched. It is similar to $1 but is used inside the regexp. It is discussed in some detail here:

perlretut - perldoc.perl.org
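To make the backreference concrete, here is a minimal sketch that collects the doubled letters in a string. The sub name double_letters is made up for this example, and restricting the capture to [a-zA-Z] (so doubled spaces or digits are not reported) is my assumption about what the exercise wants.

```perl
use strict;
use warnings;

# Hypothetical helper: return every doubled letter pair found in a string.
sub double_letters {
    my ($text) = @_;
    my @pairs;
    # ([a-zA-Z]) captures one letter; \1 requires that same letter again.
    while ($text =~ /([a-zA-Z])\1/g) {
        push @pairs, "$1$1";
    }
    return @pairs;
}

print join(', ', double_letters('a bookkeeper from Mississippi')), "\n";
```

For a text file, the same sub could be called on each line inside a while (my $line = <$fh>) loop. Note that the match is case-sensitive, so 'Aa' is not reported as a double letter.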

That is not correct. Using \1 is perfectly good Perl code. \1 and $1 really have two separate uses. See the link I posted in my previous post. A short test shows they do not do the same thing:

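$_ = 'foobar';
# $1 is interpolated when the pattern is built; it is still empty here,
# so this is effectively /(.)/ and matches the first character.
if (/(.)$1/) {
   print "\$1 = $1", "\n";
}
# \1 is a backreference: it must match the same text that group 1 captured.
if (/(.)\1/) {
   print "\\1 = $1", "\n";
}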

output:

$1 = f
\1 = o

Thanks everyone for your messages.

Also, I found that re-reading my notes in more detail was very helpful!

This does not work:

/[a-zA-Z]+/

because it means one or more of the characters inside the square brackets, any of the characters, in any order. You want to find two of the same character repeated in a string, not one or more of any character inside the [] brackets.
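A quick way to see the difference is to run both patterns over a few strings. This is only a small sketch; the sub name compare_patterns and the sample strings are my own choices for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: report whether each of the two patterns matches.
sub compare_patterns {
    my ($s) = @_;
    my $run    = $s =~ /[a-zA-Z]+/ ? 1 : 0;   # one or more letters, any letters
    my $double = $s =~ /(.)\1/     ? 1 : 0;   # some character followed by itself
    return ($run, $double);
}

for my $s ('Az', 'zz', 'x', 'book') {
    my ($run, $double) = compare_patterns($s);
    printf "%-5s letters+: %d  (.)\\1: %d\n", $s, $run, $double;
}
```

The letter-run pattern matches all four strings, while the backreference pattern matches only 'zz' and 'book', the two that actually contain a doubled character.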

Interesting and thoughtful question. You use "(" and ")" to capture (remember) the text that a sub-pattern matched, and you recall that remembered text with "\" followed by a single digit (a backreference).

In your particular case, "(.)\1" means: match any one character, remember it, then require that same character again immediately afterwards.

You can extend this method to find words with several double letters in a row. '(.)\1(.)\2(.)\3' will match any word with three consecutive double letters, e.g. bookkeeper.
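Here is a short sketch of that pattern in use. The sub name has_three_doubles and the word list are just for illustration. Because the three groups sit directly next to each other, the doubles have to be adjacent, so a word like committee (mm, tt, ee, but with an 'i' in between) is not matched.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the word contains three double letters in a row.
sub has_three_doubles {
    my ($word) = @_;
    return $word =~ /(.)\1(.)\2(.)\3/ ? 1 : 0;
}

for my $word (qw(bookkeeper balloon committee)) {
    print "$word: ", (has_three_doubles($word) ? 'match' : 'no match'), "\n";
}
```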

Ok I misunderstood. Thanks for bringing it up. Removed the incorrect paragraph from the original post to avoid distracting others viewing the thread.

It's good to have someone at your back who says "you're wrong", isn't it? :wink:

The logic can be extended further to find "double" words made of a repeated three-character half, like cancan and booboo:

/\b(\w\w\w)\1\b/

or repeated words, like "the the":


/\b(\w\w\w)\s\1\b/

and so on and so on...... (three repeated words) :wink:
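If the fixed length of three is too limiting, one possible generalization is to capture \w+ instead of \w\w\w. The sketch below is only that, a sketch; the sub names is_doubled_word and repeated_words are invented for this example, and it anchors the doubled-word check with \A and \z because it assumes a single word is passed in.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the whole word is some string written twice.
sub is_doubled_word {
    my ($word) = @_;
    return $word =~ /\A(\w+)\1\z/ ? 1 : 0;
}

# Hypothetical helper: return each word that is immediately repeated,
# separated only by whitespace, e.g. "the the".
sub repeated_words {
    my ($text) = @_;
    my @found;
    while ($text =~ /\b(\w+)\s+\1\b/g) {
        push @found, $1;
    }
    return @found;
}

print "cancan is doubled\n" if is_doubled_word('cancan');
print join(', ', repeated_words('Paris in the the spring is is nice')), "\n";
```

The \b anchors matter in repeated_words: without them, "this is" would be reported because the "is" at the end of "this" is followed by another "is".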