re.sub unexpected behaviour

Javier Collado

Hello,

Let's imagine that we have a simple function that generates a
replacement for a regular expression:

def process(match):
    return match.string

If we use that simple function with re.sub using a simple pattern and
a string we get the expected output:
>>> re.sub('123', process, '123')
'123'

However, if the string passed to re.sub contains a trailing newline
character, then we unexpectedly get an extra newline character:
>>> re.sub(r'123', process, '123\n')
'123\n\n'

If we try to get the same result using a replacement string, instead
of a function, the strange behaviour cannot be reproduced:
>>> re.sub(r'123', '123', '123')
'123'

>>> re.sub('123', '123', '123\n')
'123\n'

Is there any explanation for this? If I'm skipping something when
using a replacement function with re.sub, please let me know.

Best regards,
Javier
 
Steven D'Aprano

> Hello,
>
> Let's imagine that we have a simple function that generates a
> replacement for a regular expression:
>
> def process(match):
>     return match.string
>
> If we use that simple function with re.sub using a simple pattern and a
> string we get the expected output:
> >>> re.sub('123', process, '123')
> '123'
>
> However, if the string passed to re.sub contains a trailing new line
> character, then we get an extra new line character unexpectedly:
> >>> re.sub(r'123', process, '123\n')
> '123\n\n'

I don't know why you say it is unexpected. The regex "123" matched the
first three characters of "123\n". Those three characters are replaced by
a copy of the string you are searching, "123\n", which gives "123\n\n",
exactly as expected.

Perhaps these examples might help:
'HellHello World WHello Worldrld'
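
The call that produced the output above did not survive in this copy of the thread. As a hedged reconstruction (an assumption, not necessarily the exact call originally posted), replacing every 'o' in 'Hello World' with the process function from the original question yields the same string:

```python
import re

def process(match):
    # match.string is the entire subject string, not just the matched text
    return match.string

# Each of the two 'o' characters is replaced by the whole string 'Hello World'
result = re.sub('o', process, 'Hello World')
print(result)  # HellHello World WHello Worldrld
```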


Here's a simplified pure-Python equivalent of what you are doing:

def replace_with_match_string(target, s):
    n = s.find(target)
    if n != -1:
        s = s[:n] + s + s[n+len(target):]
    return s
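
As a quick check, the helper above can be run side by side with re.sub on the string from the question. This is only a sketch: the helper handles just the first occurrence of the target, whereas re.sub replaces every match, so the two agree here only because "123" occurs once. The variable names subject, manual and via_re are illustrative.

```python
import re

def replace_with_match_string(target, s):
    # Splice a copy of the whole string in place of the first occurrence
    n = s.find(target)
    if n != -1:
        s = s[:n] + s + s[n + len(target):]
    return s

def process(match):
    # Returns the entire subject string for every match
    return match.string

subject = '123\n'
manual = replace_with_match_string('123', subject)
via_re = re.sub('123', process, subject)

print(repr(manual))  # '123\n\n'
print(repr(via_re))  # '123\n\n'
```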


> If we try to get the same result using a replacement string, instead of
> a function, the strange behaviour cannot be reproduced:
> >>> re.sub(r'123', '123', '123')
> '123'
>
> >>> re.sub('123', '123', '123\n')
> '123\n'

The regex "123" matches the first three characters of "123\n", which is
then replaced by "123", giving "123\n", exactly as expected.
'Hell123 W123rld'
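
The call behind the output above is missing from this copy of the thread. One call that produces it, assuming the subject string was 'Hello World' and the pattern was 'o' (both inferred from the output, not confirmed by the source), uses a plain replacement string:

```python
import re

# A replacement string is inserted as-is for each match of 'o'
result = re.sub('o', '123', 'Hello World')
print(result)  # Hell123 W123rld
```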
 
Javier Collado

Thanks for your answers. They helped me to realize that I was
mistakenly using match.string (the whole string) when I should be
using match.group(0) (the whole match).
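
A minimal sketch of that fix, reusing the process name from the original question: returning match.group(0) hands back only the matched text, so the trailing newline is no longer duplicated.

```python
import re

def process(match):
    # group(0) is only the text the pattern matched, here '123'
    return match.group(0)

m = re.search(r'123', '123\n')
print(repr(m.string))    # '123\n'  (the whole subject string)
print(repr(m.group(0)))  # '123'    (just the match)

result = re.sub(r'123', process, '123\n')
print(repr(result))      # '123\n'
```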

Best regards,
Javier
 
