re.sub() backreference bug?

jemminger

using this code:

import re
s = 'HelloWorld19-FooBar'
s = re.sub(r'([A-Z]+)([A-Z][a-z])', "\1_\2", s)
s = re.sub(r'([a-z\d])([A-Z])', "\1_\2", s)
s = re.sub('-', '_', s)
s = s.lower()
print "s: %s" % s

i expect to get:
hello_world19_foo_bar

but instead i get:
hell☺_☻orld19_fo☺_☻ar

(in case the above doesn't come across the same, it's:
hellX_Yorld19_foX_Yar, where X is a white smiley face and Y is a black
smiley face !!)

is this a bug, or am i doing something wrong?

tested on
Python 2.4.1 (#65, Mar 30 2005, 09:13:57) [MSC v.1310 32 bit (Intel)]
on win32

and
Python 2.4.4c0 (#2, Jul 30 2006, 15:43:58) [GCC 4.1.2 20060715
(prerelease) (Debian 4.1.1-9)] on linux2
 
Tim Chase

Looks like you need to be using "raw" strings for your
replacements as well:

s = re.sub(r'([A-Z]+)([A-Z][a-z])', r"\1_\2", s)
s = re.sub(r'([a-z\d])([A-Z])', r"\1_\2", s)

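This lets the backslashes reach re.sub as backslashes instead of
being consumed as escape sequences by the string literal (in a
non-raw string, "\1" and "\2" are read as octal escapes, i.e. the
characters with codes 1 and 2, which is where the smiley faces
come from)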
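
To make the difference concrete, here is a small sketch (not from the original posts) that runs the same substitutions with a plain and a raw replacement string. The helper name to_snake is made up for illustration; everything else comes from the code above.

```python
import re

# In an ordinary string literal, "\1" is an octal escape: one character
# with code 1, not a backslash followed by the digit 1.
plain = "\1_\2"
raw = r"\1_\2"

assert plain == "\x01_\x02"
assert len(plain) == 3   # chr(1), underscore, chr(2)
assert len(raw) == 5     # backslash, 1, underscore, backslash, 2

def to_snake(name, repl):
    """Hypothetical helper: the thread's substitutions with a given replacement."""
    name = re.sub(r'([A-Z]+)([A-Z][a-z])', repl, name)
    name = re.sub(r'([a-z\d])([A-Z])', repl, name)
    return name.replace('-', '_').lower()

broken = to_snake('HelloWorld19-FooBar', plain)
fixed = to_snake('HelloWorld19-FooBar', raw)

print(repr(broken))  # control characters instead of the captured groups
print(repr(fixed))   # 'hello_world19_foo_bar'
```

With the plain string, re.sub never sees a backreference at all; it just inserts the two control characters literally.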

-tkc
 
John Machin

Tim's given you the solution to the problem: with the re module,
*always* use raw strings in regexes and substitution strings.

Here's a simple diagnostic tool that you can use when the visual
presentation of a result leaves you wondering [did you get smiley faces
on Windows in IDLE? on Linux?]:

>>> print repr(s)
'hell\x01_\x02orld19_fo\x01_\x02ar'
>>> print "s: %r" % s
s: 'hell\x01_\x02orld19_fo\x01_\x02ar'
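
The same idea as a standalone script, in case it helps. This is only a sketch: control_chars is a hypothetical helper (not part of any library) that lists where the unprintable characters sit, and the print() calls are written so they run on both Python 2 and 3.

```python
# The string the broken substitution produced, per the repr() output above.
s = 'hell\x01_\x02orld19_fo\x01_\x02ar'

# repr() shows unprintable characters as escapes instead of glyphs.
shown = repr(s)
print(shown)

def control_chars(text):
    """Hypothetical helper: (index, hex code) for each character below 0x20."""
    return [(i, hex(ord(ch))) for i, ch in enumerate(text) if ord(ch) < 32]

found = control_chars(s)
print(found)
```

Seeing \x01 and \x02 in the output is the hint that "\1" and "\2" were turned into characters before re.sub ever got them.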

HTH,
John
 
Tim Chase

Tim's given you the solution to the problem: with the re module,
*always* use raw strings in regexes and substitution strings.


"always" is so...um...carved in stone. One can forgo using raw
strings if one prefers having one's strings look like they were
trampled by a stampede of creatures with backslash-shaped hooves...

uh...yeah...stick with raw strings. :)
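
For the curious, a rough sketch of what the hoof-trampled version looks like: every backslash doubled, which gives exactly the same strings as the raw literals. The variable names here are just for illustration.

```python
import re

# Doubling each backslash in a plain literal yields the same string
# as the raw literal; it is just harder to read.
escaped = "\\1_\\2"
raw = r"\1_\2"
assert escaped == raw

pattern_escaped = "([a-z\\d])([A-Z])"
pattern_raw = r"([a-z\d])([A-Z])"
assert pattern_escaped == pattern_raw

result = re.sub(pattern_escaped, escaped, 'HelloWorld19')
print(result)  # Hello_World19
```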

-tkc
 
