re.sub unexpected behaviour

Javier Collado · Jul 6, 2010

Hello,

Let's imagine that we have a simple function that generates a
replacement for a regular expression:

def process(match):
    return match.string

If we use that simple function with re.sub using a simple pattern and
a string we get the expected output:
>>> re.sub('123', process, '123')
'123'

However, if the string passed to re.sub contains a trailing new line
character, then we get an extra new line character unexpectedly:
>>> re.sub(r'123', process, '123\n')
'123\n\n'

If we try to get the same result using a replacement string, instead
of a function, the strange behaviour cannot be reproduced:
>>> re.sub(r'123', '123', '123')
'123'

>>> re.sub('123', '123', '123\n')
'123\n'

Is there any explanation for this? If I'm skipping something when
using a replacement function with re.sub, please let me know.

Best regards,
Javier

Steven D'Aprano · Jul 6, 2010

> Hello,
>
> Let's imagine that we have a simple function that generates a
> replacement for a regular expression:
>
> def process(match):
>     return match.string
>
> If we use that simple function with re.sub using a simple pattern and a
> string we get the expected output:
> >>> re.sub('123', process, '123')
> '123'
>
> However, if the string passed to re.sub contains a trailing new line
> character, then we get an extra new line character unexpectedly:
> >>> re.sub(r'123', process, '123\n')
> '123\n\n'

I don't know why you say it is unexpected. The regex "123" matched the
first three characters of "123\n". Those three characters are replaced by
a copy of the string you are searching "123\n", which gives "123\n\n"
exactly as expected.

Perhaps these examples might help:
>>> re.sub('o', process, 'Hello World')
'HellHello World WHello Worldrld'

Here's a simplified pure-Python equivalent of what you are doing:

def replace_with_match_string(target, s):
    n = s.find(target)
    if n != -1:
        s = s[:n] + s + s[n+len(target):]
    return s
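
As a quick check, the sketch below compares that helper with re.sub restricted to a single replacement. The count=1 argument and the test strings are illustrative assumptions: the helper only handles the first occurrence, so the two agree only when re.sub is limited the same way.

```python
import re

def replace_with_match_string(target, s):
    # Replace the first occurrence of target with the whole string s.
    n = s.find(target)
    if n != -1:
        s = s[:n] + s + s[n + len(target):]
    return s

def process(match):
    # match.string is the entire subject string, not just the matched text.
    return match.string

# (pattern, subject) pairs chosen for illustration; the patterns are plain
# literals, so str.find and the regex look for the same text.
cases = [('123', '123'), ('123', '123\n'), ('o', 'Hello World')]

for target, subject in cases:
    via_find = replace_with_match_string(target, subject)
    via_re = re.sub(target, process, subject, count=1)
    # Print both results and whether they agree.
    print(repr(via_find), repr(via_re), via_find == via_re)
```

Without count=1, re.sub replaces every match, which is why 'Hello World' gains two copies of the full string when the pattern is 'o'.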

> If we try to get the same result using a replacement string, instead of
> a function, the strange behaviour cannot be reproduced:
> >>> re.sub(r'123', '123', '123')
> '123'
>
> >>> re.sub('123', '123', '123\n')
> '123\n'

The regex "123" matches the first three characters of "123\n", which is
then replaced by "123", giving "123\n", exactly as expected.

>>> re.sub('o', '123', 'Hello World')
'Hell123 W123rld'

Javier Collado · Jul 6, 2010

Thanks for your answers. They helped me to realize that I was
mistakenly using match.string (the whole string) when I should be
using match.group(0) (the whole match).

Best regards,
Javier
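
The fix described in the last message can be sketched as follows. This is a minimal illustration that reuses the function name process from the thread and assumes the intent was to return the matched text unchanged; the final version of the function was not posted.

```python
import re

def process(match):
    # match.group(0) is only the text matched by the pattern;
    # match.string would be the entire string passed to re.sub.
    return match.group(0)

# With group(0) the substitution is an identity operation,
# so the trailing newline is no longer duplicated.
result = re.sub(r'123', process, '123\n')
print(repr(result))  # '123\n'
```

Because the replacement is now exactly the matched text, the output equals the input for any pattern and string.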
