re.sub() backreference bug?

Discussion in 'Python' started by jemminger@gmail.com, Aug 17, 2006.

  1. Guest

    using this code:

    import re
    s = 'HelloWorld19-FooBar'
    s = re.sub(r'([A-Z]+)([A-Z][a-z])', "\1_\2", s)
    s = re.sub(r'([a-z\d])([A-Z])', "\1_\2", s)
    s = re.sub('-', '_', s)
    s = s.lower()
    print "s: %s" % s

    i expect to get:
    hello_world19_foo_bar

    but instead i get:
    hell☺_☻orld19_fo☺_☻ar

    (in case the above doesn't come across the same, it's:
    hellX_Yorld19_foX_Yar, where X is a white smiley face and Y is a black
    smiley face !!)

    is this a bug, or am i doing something wrong?

    tested on
    Python 2.4.1 (#65, Mar 30 2005, 09:13:57) [MSC v.1310 32 bit (Intel)]
    on win32

    and
    Python 2.4.4c0 (#2, Jul 30 2006, 15:43:58) [GCC 4.1.2 20060715
    (prerelease) (Debian 4.1.1-9)] on linux2
    , Aug 17, 2006
    #1

  2. Tim Chase Guest

    > s = re.sub(r'([A-Z]+)([A-Z][a-z])', "\1_\2", s)
    > s = re.sub(r'([a-z\d])([A-Z])', "\1_\2", s)
    > i expect to get:
    > hello_world19_foo_bar
    >
    > but instead i get:
    > hell☺_☻orld19_fo☺_☻ar



    Looks like you need to be using "raw" strings for your
    replacements as well:

    s = re.sub(r'([A-Z]+)([A-Z][a-z])', r"\1_\2", s)
    s = re.sub(r'([a-z\d])([A-Z])', r"\1_\2", s)

    This lets the backslashes reach re.sub as literal backslashes
    instead of being consumed as escape sequences by the string
    literal (in a non-raw string, "\1" and "\2" are octal escapes
    for the characters with codes 1 and 2).
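
    A minimal sketch of the difference, for anyone who wants to check it. The helper name to_snake is hypothetical (it is not from the original post), and the code is written so it also runs on Python 3, which the thread predates:

```python
import re

# In a normal string literal, "\1" is an octal escape: one character, code 1.
print(len("\1_\2"), repr("\1_\2"))    # 3 characters
# In a raw string literal the backslashes survive, so re.sub sees group references.
print(len(r"\1_\2"), repr(r"\1_\2"))  # 5 characters

def to_snake(s):
    """Hypothetical helper wrapping the poster's steps, with raw replacements."""
    s = re.sub(r'([A-Z]+)([A-Z][a-z])', r'\1_\2', s)
    s = re.sub(r'([a-z\d])([A-Z])', r'\1_\2', s)
    return s.replace('-', '_').lower()

print(to_snake('HelloWorld19-FooBar'))  # hello_world19_foo_bar
```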

    -tkc
    Tim Chase, Aug 17, 2006
    #2

  3. John Machin Guest

    wrote:
    > using this code:
    >
    > import re
    > s = 'HelloWorld19-FooBar'
    > s = re.sub(r'([A-Z]+)([A-Z][a-z])', "\1_\2", s)
    > s = re.sub(r'([a-z\d])([A-Z])', "\1_\2", s)
    > s = re.sub('-', '_', s)
    > s = s.lower()
    > print "s: %s" % s
    >
    > i expect to get:
    > hello_world19_foo_bar
    >
    > but instead i get:
    > hell☺_☻orld19_fo☺_☻ar
    >
    > (in case the above doesn't come across the same, it's:
    > hellX_Yorld19_foX_Yar, where X is a white smiley face and Y is a black
    > smiley face !!)
    >
    > is this a bug, or am i doing something wrong?
    >


    Tim's given you the solution to the problem: with the re module,
    *always* use raw strings in regexes and substitution strings.

    Here's a simple diagnostic tool that you can use when the visual
    presentation of a result leaves you wondering [did you get smiley faces
    on Windows in IDLE? on Linux?]:

    >>> print repr(s)
    'hell\x01_\x02orld19_fo\x01_\x02ar'
    >>> print "s: %r" % s
    s: 'hell\x01_\x02orld19_fo\x01_\x02ar'
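
    The same diagnostic as a standalone script, as a rough sketch. It reproduces the non-raw replacement from the original post on purpose; the variable name broken is an illustrative choice, and print() is called with parentheses so the script is not tied to Python 2:

```python
import re

s = 'HelloWorld19-FooBar'

# Deliberately non-raw replacement: Python turns "\1" and "\2" into the
# characters with codes 1 and 2 before re.sub ever sees the string.
broken = re.sub(r'([a-z\d])([A-Z])', "\1_\2", s)
broken = broken.replace('-', '_').lower()

# repr() shows the control characters as escapes instead of terminal glyphs.
print(repr(broken))  # 'hell\x01_\x02orld19_fo\x01_\x02ar'
```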

    HTH,
    John
    John Machin, Aug 18, 2006
    #3
  4. Tim Chase Guest

    > Tim's given you the solution to the problem: with the re module,
    > *always* use raw strings in regexes and substitution strings.



    "always" is so...um...carved in stone. One can forgo using raw
    strings if one prefers having one's strings look like they were
    trampled by a stampede of creatures with backslash-shaped hooves...

    uh...yeah...stick with raw strings. :)
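
    For the curious, a small sketch of what the trampled version looks like. Doubling every backslash in a normal string yields exactly the same string as the raw literal; the variable names here are illustrative only:

```python
import re

# Raw literals and their doubled-backslash equivalents are the same strings.
pattern_raw = r'([a-z\d])([A-Z])'
pattern_escaped = '([a-z\\d])([A-Z])'
repl_raw = r'\1_\2'
repl_escaped = '\\1_\\2'

print(pattern_raw == pattern_escaped)  # True
print(repl_raw == repl_escaped)        # True

# Either spelling works with re.sub; the raw one is just easier to read.
result = re.sub(pattern_escaped, repl_escaped, 'FooBar')
print(result)  # Foo_Bar
```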

    -tkc
    Tim Chase, Aug 18, 2006
    #4
  5. jeff emminger Guest

    thanks - that's the trick.

    On 8/17/06, Tim Chase wrote:
    > Looks like you need to be using "raw" strings for your
    > replacements as well:
    >
    > s = re.sub(r'([A-Z]+)([A-Z][a-z])', r"\1_\2", s)
    > s = re.sub(r'([a-z\d])([A-Z])', r"\1_\2", s)
    >
    > This lets the backslashes reach re.sub as literal backslashes
    > instead of being consumed as escape sequences by the string
    > literal (in a non-raw string, "\1" and "\2" are octal escapes
    > for the characters with codes 1 and 2).
    >
    > -tkc
    >
    jeff emminger, Aug 18, 2006
    #5