Python re.search vs re.sub

metallica1973 · January 27, 2016, 5:58pm

I am having trouble understanding why these two commands differ, with one producing the desired result and the other not. An example:

>>> import re
>>> capture_str = 'xserver-xorg-video-qxl-dbg (0.1.1-2+b2 [s390x], 0.1.1-2+b1 [amd64, armel, armhf, i386, mips, mipsel, powerpc], 0.1.1-2 [arm64, ppc64el]) X.Org X server -- QXL display driver (debugging symbols)'
>>> 
>>> re.search(r'(?<=\[s390x\]\, ).*', capture_str).group(0)
'0.1.1-2+b1 [amd64, armel, armhf, i386, mips, mipsel, powerpc], 0.1.1-2 [arm64, ppc64el]) X.Org X server -- QXL display driver (debugging symbols)'
>>> 
>>> re.sub(r'(?<=\[s390x\]\, ).*', '', capture_str)
'xserver-xorg-video-qxl-dbg (0.1.1-2+b2 [s390x], '

I am truly confused as to why "re.sub" doesn't perform the positive lookbehind that re.search does. It appears to do the opposite with the same regex. What is the difference?

durden_tyler · January 31, 2016, 9:14pm

They are working just as expected. For a moment, disregard the fact that your regex is a look-behind assertion.

The "search" function looks for the first place where the pattern matches and returns a match object; group(0) gives you the matched text. Since your pattern ends in a greedy match (.*), the matched text runs to the end of the string.
The "sub" function replaces the part of the string that matches the pattern with "nothing" (a zero-length string). So what is left is the part of the string before the matched text, and that is what is returned.

Here's an example in my python REPL:

>>> import re
>>> 
>>> x = "The rain in Spain falls mainly in the plain"
>>> 
>>> re.search(r'in.*', x).group(0)
'in in Spain falls mainly in the plain'
>>>

I search for a string that starts with "in" and extends as far as it can, i.e. to the end.
The matched part of the string x is shown in square brackets below:

The ra[in in Spain falls mainly in the plain]

and that is returned.

Now, if I use the same regex in the "sub()" method, then it means that I want to substitute the matched part by something. Python does the substitution and returns the resultant string. Here's the test with the same string; I substitute the part that matches the regex by "#":

>>> 
>>> x
'The rain in Spain falls mainly in the plain'
>>> 
>>> re.sub(r'in.*', '#', x)
'The ra#'
>>> 
>>>

You used an empty string instead of '#', hence the matched part was simply chopped off.

The results of these methods are "opposite" of each other because of the specific regex that you used. It matches a part of the string and goes on till the end. If you replace that by a null string, then the remainder is the part *before* the matched string.
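A quick way to convince yourself of this is to glue the two results back together. The snippet below is only a minimal sketch using the same example string; the variable names (pattern, matched, remainder) are my own choices for illustration:

```python
import re

x = "The rain in Spain falls mainly in the plain"
pattern = r'in.*'

m = re.search(pattern, x)
matched = m.group(0)                # the text that search() reports
remainder = re.sub(pattern, '', x)  # what is left once sub() removes the match

# The match starts exactly where the remainder ends,
# so the two pieces put together rebuild the original string.
assert remainder == x[:m.start()]
assert remainder + matched == x

print(repr(remainder))  # 'The ra'
print(repr(matched))    # 'in in Spain falls mainly in the plain'
```

This only works out so neatly because the match runs to the end of the string, so there is nothing left after it.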

If your pattern had matched only a piece in the middle of the string, instead of running greedily to the end, then the results would not have looked "opposite".
An example:

>>> x
'The rain in Spain falls mainly in the plain'
>>> 
>>> re.search(r"Spain", x).group(0)
'Spain'
>>> 
>>> re.sub(r"Spain", "USA", x)
'The rain in USA falls mainly in the plain'
>>>
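For comparison, a quantifier that is non-greedy in the strict sense (.*? instead of .*) stops at the earliest point where the rest of the pattern can still match. Here is a small sketch; the pattern 'in.*?n' is just something I picked for illustration, and count=1 limits sub() to the first match so the result is easy to compare:

```python
import re

x = "The rain in Spain falls mainly in the plain"

# Greedy: .* grabs as much as it can, then backs up to the last "n".
greedy = re.search(r'in.*n', x).group(0)

# Non-greedy: .*? grabs as little as it can, stopping at the first "n" that works.
lazy = re.search(r'in.*?n', x).group(0)

# Remove only the first non-greedy match.
trimmed = re.sub(r'in.*?n', '', x, count=1)

print(repr(greedy))   # 'in in Spain falls mainly in the plain'
print(repr(lazy))     # 'in in'
print(repr(trimmed))  # 'The ra Spain falls mainly in the plain'
```

With the non-greedy version, text survives on both sides of the match, so search() and sub() no longer look like mirror images of each other.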

The look-behind assertion simply requires that the text immediately before the match satisfies the assertion; that text is checked, but it is not a part of the actual match.
So if my regex is "(?<=a)in.*" then it matches "in" and everything after it, provided it has "a" before it.
But "a" is not part of the match. Hence it is not returned.

>>> x
'The rain in Spain falls mainly in the plain'
>>> 
>>> re.search(r'(?<=a)in.*', x).group(0)
'in in Spain falls mainly in the plain'
>>>

And if I substitute the matched part above with an empty string, then the remainder will be "The ra", as seen below:

>>> x
'The rain in Spain falls mainly in the plain'
>>> 
>>> re.sub(r'(?<=a)in.*', '', x)
'The ra'
>>> 
>>>

Again, "a" is not part of the matched string, hence it is not replaced.
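The same reasoning applies to your original string. As a rough sketch (the string is split across lines only for readability, and the names pattern, m and kept are mine), you can check that the '[s390x], ' text sits just before the match rather than inside it:

```python
import re

capture_str = ('xserver-xorg-video-qxl-dbg (0.1.1-2+b2 [s390x], '
               '0.1.1-2+b1 [amd64, armel, armhf, i386, mips, mipsel, powerpc], '
               '0.1.1-2 [arm64, ppc64el]) X.Org X server -- QXL display driver '
               '(debugging symbols)')
pattern = r'(?<=\[s390x\]\, ).*'

m = re.search(pattern, capture_str)      # match starts right after '[s390x], '
kept = re.sub(pattern, '', capture_str)  # everything before the match survives

# The look-behind text is outside the match, so sub() leaves it in place.
assert capture_str[:m.start()].endswith('[s390x], ')
assert kept == capture_str[:m.start()]
assert kept + m.group(0) == capture_str

print(repr(kept))  # 'xserver-xorg-video-qxl-dbg (0.1.1-2+b2 [s390x], '
```

So both functions honour the look-behind in the same way; one returns the matched text and the other returns what remains once that text is removed.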

Hope that helps.

metallica1973 · February 1, 2016, 11:11am

Simply Awesome. Many thanks