matched characters - regular expression

jianma · June 16, 2011, 12:10pm

Hi,

I have been reading the book "Unix Shell Programming". It says: the regular expression ^\(.\)\1 matches the first character on the line and stores it in register 1. Then the expression matches whatever is stored in register 1, as specified by the \1. The net effect of this regular expression is to match the first two characters on a line if they are both the same character.

I don't fully understand this regular expression, especially "Then the expression matches whatever is stored in the register 1, as specified by the \1." Can someone explain it in more detail and more clearly?

Thanks!

Skrynesaver · June 16, 2011, 12:18pm

If you surround part of a regular expression with escaped parentheses (or unescaped parentheses in extended or Perl-compatible regular expressions, e.g. egrep or Perl), you are asking the regex engine to remember whatever that part matched and store it as the next back reference. Thus

 grep '^\(.\)\1' file

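will match any character at the start of a line (the . character)
and store it in the first back reference, which can be addressed as \1. The \1 that follows then matches only a second copy of that same character, so a line starting with "aa" matches while a line starting with "ab" does not.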
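
To see the back reference at work, here is a small sketch that runs the same grep over a few made-up lines (the sample words are hypothetical, not from the book). Only lines whose first two characters are identical should come through; "apple" is there to show that a doubled letter later in the line does not count, because the pattern is anchored with ^.

```shell
# Hypothetical sample lines, one per printf argument.
# '^\(.\)\1' = start of line, capture any one character, then the same character again.
doubled=$(printf '%s\n' 'aardvark' 'apple' 'llama' 'zebra' '  indented' |
  grep '^\(.\)\1')

# Expected to keep aardvark, llama and the line starting with two spaces
# (the . matches a space just like any other character).
printf '%s\n' "$doubled"
```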
A more obvious example might be a file with records of the form
user homeNode
from which we wish to create an internal mailing list:

sed -E 's/^([^ ]+) ([^ ]+)$/\1@\2/' users_file.txt

This would print a series of email addresses of the form user@homeNode, with \1 standing for the captured user and \2 for the captured node.
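
A minimal sketch of that substitution on made-up records (alice, bob and the node names are hypothetical). It assumes a sed that accepts -E for extended regular expressions, as GNU and BSD sed do. Without -E, sed uses basic regular expressions, where the groups must be written \( \) and + is an ordinary character, so the second command spells out the same thing in that syntax.

```shell
# Hypothetical records of the form "user homeNode".
records=$(printf '%s\n' 'alice node1' 'bob node2')

# ERE form: -E makes ( ) grouping operators and + a quantifier.
ere=$(printf '%s\n' "$records" | sed -E 's/^([^ ]+) ([^ ]+)$/\1@\2/')

# BRE form (default sed syntax): groups are \( \), and "one or more"
# is written [^ ][^ ]* because + is a literal character here.
bre=$(printf '%s\n' "$records" | sed 's/^\([^ ][^ ]*\) \([^ ][^ ]*\)$/\1@\2/')

printf '%s\n' "$ere"
```

Both commands should produce the same addresses; which one to use is mostly a matter of which syntax you find easier to read.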

Then again, re-reading that, I'm not sure it helps. The following may show it more clearly, assuming you've looked at alternation.

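Another example would be if you wished to match all of an HTML tag that could contain a ">" character in it (e.g. <img alt="Next>" src="/images/next_button.gif"/>):

<([^>"']+|(["'])(?:(?!\2).)*\2)+>

Here each repetition matches either a run of characters that are neither ">" nor a quote, or a complete quoted string that uses either single or double quotes. We capture the quote type in the second set of parentheses (the first being the alternation), and then keep matching characters until the end of the quoted string is marked by the quote we previously matched, \2. A back reference cannot be used inside a bracket expression ([^\2] would not mean "anything but the captured quote"), so the (?!\2). lookahead does that job; it needs a Perl-compatible engine such as Perl itself or grep -P. The first alternative also has to exclude quotes, otherwise it would swallow the opening quote and the match would stop at the ">" inside the attribute.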

jianma · June 16, 2011, 5:05pm

Hi Skrynesaver,

Thanks!