regexp: really tired, guru help needed...

ulrith · November 23, 2010, 8:40am

Hello all! Please help me with the following complex regexp which works in egrep:

egrep '\"http:\/\/ccc\.bbb\.com\/documents\/0000\/[0-9]{4}\/([^\.]+\.[a-z]{3})[^\"]*\"' my-file.html

But it silently does nothing in sed:

sed "s/\"http:\/\/ccc\.bbb\.com\/documents\/0000\/[0-9]{4}\/([^\.]+\.[a-z]{3})[^\"]*\"/aaa/g" my-file.html

or

sed 's/\"http:\/\/ccc\.bbb\.com\/documents\/0000\/[0-9]{4}\/([^\.]+\.[a-z]{3})[^\"]*\"/aaa/g' my-file.html

Just can't figure out what's wrong...

Scrutinizer · November 23, 2010, 8:47am

Try:

sed 's|"http://ccc\.bbb\.com/documents/0000/[0-9]\{4\}/[^.][^.]*\.[a-z]\{3\}[^"]*"|aaa|g' infile

ulrith · November 23, 2010, 8:55am

It works!!! Thank you!

Scrutinizer · November 23, 2010, 8:58am

Could you give a sample of your input? (I adjusted my post, because you changed the name to ccc.bbb.com later BTW)

durden_tyler · November 23, 2010, 9:03am

sed 's/"http:\/\/ccc\.bbb\.com\/documents\/0000\/[0-9][0-9][0-9][0-9]\/[^.][^.]*\.[a-z][a-z][a-z][^"]*"/aaa/g' my-file.html

tyler_durden

bakunin · November 23, 2010, 9:07am

I suppose your problem is some unescaped grouping constructs:

Your regex:

"/\"http:\/\/ccc\.bbb\.com\/documents\/0000\/[0-9]{4}\/([^\.]+\.[a-z]{3})[^\"]*\"/

probably correct:

"/\"http:\/\/ccc\.bbb\.com\/documents\/0000\/[0-9]\{4\}\/\([^\.]+\.[a-z]\{3\}\)[^\"]*\"/

If you want to repeat the previous regex part several times you have to use "regex\{<n>\}", not "regex{<n>}" and grouping has to be done with "\(<regex>\)", not "(<regex>)".
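To illustrate the escaped forms, here is a minimal sketch. The sample input line is made up (only the URL shape comes from the thread), and it assumes the goal of the group was to capture the file name, which "\1" then reuses in the replacement:

```shell
# Hypothetical sample input; only the URL shape comes from the thread.
line='<a href="http://ccc.bbb.com/documents/0000/1234/report.pdf?x=1">doc</a>'

# BRE (sed's default): the interval is \{4\}, grouping is \( \),
# and \1 in the replacement stands for whatever the group captured.
name=$(printf '%s\n' "$line" |
  sed 's|.*"http://ccc\.bbb\.com/documents/0000/[0-9]\{4\}/\([^.][^.]*\.[a-z]\{3\}\)[^"]*".*|\1|')

printf '%s\n' "$name"
```

Using "|" as the delimiter, as Scrutinizer did, avoids having to escape every "/" in the URL.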

Furthermore, the "+" is not part of POSIX basic regexps, which sed uses by default; some (usually GNU/Linux) utilities understand the escaped "\+" as "one or more", similar to "*" meaning "zero or more". I am not entirely sure how you meant it, but you would probably be better off making this clear by either escaping the "+" or using "<regex><regex>*" instead of "<regex>+" if you want to match one or more instances of <regex>.
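A small sketch of the "one or more" spellings in sed's default (basic) mode, using a made-up input string. The first two forms should be portable; the escaped "\+" is only mentioned in a comment because it is an extension and may not work everywhere:

```shell
# Hypothetical sample string, not from the original post.
s='report.pdf'

# Portable: write the atom once, then the same atom with "*".
a=$(printf '%s\n' "$s" | sed 's/[^.][^.]*\./X./')

# Also POSIX: an interval with a lower bound of 1 and no upper bound.
b=$(printf '%s\n' "$s" | sed 's/[^.]\{1,\}\./X./')

# Unescaped "+" in basic mode is just a literal plus sign, so this
# pattern finds nothing here and the input passes through unchanged.
c=$(printf '%s\n' "$s" | sed 's/[^.]+\./X./')

# (GNU sed additionally accepts [^.]\+ as an extension.)
printf '%s\n%s\n%s\n' "$a" "$b" "$c"
```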

I hope this helps.

bakunin

ulrith · November 23, 2010, 9:18am

Thank you all guys. You are really fast.

Scrutinizer · November 23, 2010, 9:40am

FWIW, the egrep could be shortened to:

grep -E '"http://ccc\.bbb\.com/documents/0000/[0-9]{4}/[^.]+\.[a-z]{3}[^"]*"'
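Note that this pattern uses "{4}" and "+" unescaped, so it needs extended mode (egrep, or equivalently grep -E). A rough sketch of the difference, using a made-up two-line sample file:

```shell
# Hypothetical sample file; its contents are made up for illustration.
tmp=$(mktemp)
printf '%s\n' \
  '<a href="http://ccc.bbb.com/documents/0000/1234/report.pdf?x=1">doc</a>' \
  '<a href="http://ccc.bbb.com/other/page.html">other</a>' > "$tmp"

pattern='"http://ccc\.bbb\.com/documents/0000/[0-9]{4}/[^.]+\.[a-z]{3}[^"]*"'

# Extended mode: {4} is an interval and + means "one or more".
ere_count=$(grep -E -c "$pattern" "$tmp")

# Basic mode: {4} and + are taken literally, so no line matches.
# grep exits non-zero when nothing matches, hence the "|| true".
bre_count=$(grep -c "$pattern" "$tmp" || true)

printf 'ERE: %s, BRE: %s\n' "$ere_count" "$bre_count"
rm -f "$tmp"
```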