'\\' in regex affects the following parenthesis?

Discussion in 'Python' started by voxiac@gmail.com, Apr 22, 2007.

  1. Guest

    Could someone tell me why:
    >>> import re
    >>> p = re.compile('\\.*\\(.*)')


    Fails with message:

    Traceback (most recent call last):
    File "<pyshell#12>", line 1, in <module>
    re.compile('\\dir\\(file)')
    File "C:\Python25\lib\re.py", line 180, in compile
    return _compile(pattern, flags)
    File "C:\Python25\lib\re.py", line 233, in _compile
    raise error, v # invalid expression
    error: unbalanced parenthesis

    I thought '\\' should just be interpreted as a single '\' and not
    affect anything afterwards...

    The script 'redemo.py' shipped with Python by default is just fine
    about this regex however.
     
    , Apr 22, 2007
    #1

  2. Paul McGuire Guest

    On Apr 21, 6:56 pm, wrote:
    > Could someone tell me why:
    >
    > >>> import re
    > >>> p = re.compile('\\.*\\(.*)')

    >
    > Fails with message:
    >
    > Traceback (most recent call last):
    > File "<pyshell#12>", line 1, in <module>
    > re.compile('\\dir\\(file)')
    > File "C:\Python25\lib\re.py", line 180, in compile
    > return _compile(pattern, flags)
    > File "C:\Python25\lib\re.py", line 233, in _compile
    > raise error, v # invalid expression
    > error: unbalanced parenthesis
    >
    > I thought '\\' should just be interpreted as a single '\' and not
    > affect anything afterwards...
    >
    > The script 'redemo.py' shipped with Python by default is just fine
    > about this regex however.


    You are getting overlap between the Python string literal \\ escaping
    and re's \\ escaping. In a Python string literal '\\' gets collapsed
    down to '\', so to get your desired result, you would need to double-
    double every '\', as in:

    p = re.compile('\\\\.*\\\\(.*)')

    Ugly, no? Fortunately, Python has a special form for string literals,
    called "raw" which suppresses Python's processing of \'s for escaping
    - I think this was done expressly to help simplify entering re
    strings. To use raw format for a string literal, just precede the
    opening quotation mark with an r. Here is your original string, using
    a raw literal:

    p = re.compile(r'\\.*\\(.*)')

    This will compile ok.
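A quick way to check the claim above is to compare the two spellings directly. This is only a sketch; the variable names are made up for illustration and are not from the original post.

```python
import re

# Plain literal: every backslash has to be typed twice, so four typed
# backslashes become two backslash characters in the string.
doubled = '\\\\.*\\\\(.*)'

# Raw literal: backslashes are kept exactly as typed.
raw = r'\\.*\\(.*)'

# Both spellings produce the very same string.
same_string = (doubled == raw)

# That string compiles, and it contains one capturing group.
compiled = re.compile(raw)
print(same_string, compiled.groups)
```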

    (Sometimes these literals are referred to as "raw strings" - I think
    this is confusing because new users think this is a special type of
    string type, different from str. This creates the EXACT SAME type of
    str; the r just tells the compiler/interpreter to handle the quoted
    literal a little differently. So I prefer to call them "raw
    literals".)
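To illustrate the point about types, here is a small check (a sketch with hypothetical variable names) showing that a raw literal and a plain literal give the same kind of object:

```python
raw = r'\d+'      # raw literal: backslash kept as typed
plain = '\\d+'    # plain literal: backslash escaped by hand

# Both are ordinary str objects holding the same three characters.
print(type(raw) is str, raw == plain, len(raw))
```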

    -- Paul
     
    Paul McGuire, Apr 22, 2007
    #2
    1. Advertising


  3. John Machin Guest

    On Apr 22, 9:56 am, wrote:
    > Could someone tell me why:
    >
    > >>> import re
    > >>> p = re.compile('\\.*\\(.*)')


    Short answer: *ALWAYS* use raw strings for regexes in Python source
    files.

    Long answer:

    '\\.*\\(.*)' is equivalent to
    r'\.*\(.*)'

    So what re.compile is seeing is:

    \. -- a literal dot or period or full stop (not a metacharacter)
    * -- meaning 0 or more occurrences of the dot
    \( -- a literal left parenthesis
    . -- dot metacharacter meaning any character bar a newline
    * -- meaning 0 or more occurrences of almost anything
    ) -- a right parenthesis grouping metacharacter; a bit lonely hence
    the exception.
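The breakdown above can be reproduced in a few lines. This is a sketch written for current Python 3, where the exception class is spelled re.error; the exact wording of the message may carry extra detail depending on the version.

```python
import re

pattern = '\\.*\\(.*)'

# After the string literal is processed, only 8 characters remain: \.*\(.*)
equivalent = (pattern == r'\.*\(.*)')
length = len(pattern)

# The opening parenthesis is escaped, so the closing one has no partner.
try:
    re.compile(pattern)
    error_message = None
except re.error as exc:
    error_message = str(exc)

print(equivalent, length, error_message)
```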

    What you probably want is:

    \\ -- literal backslash
    .* -- any stuff
    \\ -- literal backslash
    (.*) -- grouped (any stuff)
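One way to keep that intended structure visible in source code is the re.VERBOSE flag, which allows whitespace and comments inside the pattern. The sketch below assumes a sample path of \dir\file, which is made up for illustration.

```python
import re

# Raw triple-quoted literal, so each \\ below reaches re as two
# backslash characters, i.e. one literal backslash in the regex.
pattern = re.compile(r"""
    \\      # literal backslash
    .*      # any stuff
    \\      # literal backslash
    (.*)    # grouped (any stuff)
""", re.VERBOSE)

# Hypothetical sample input: the nine characters \dir\file
m = pattern.match(r'\dir\file')
group = m.group(1)
print(group)
```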


    >
    > Fails with message:
    >
    > Traceback (most recent call last):
    > File "<pyshell#12>", line 1, in <module>
    > re.compile('\\dir\\(file)')
    > File "C:\Python25\lib\re.py", line 180, in compile
    > return _compile(pattern, flags)
    > File "C:\Python25\lib\re.py", line 233, in _compile
    > raise error, v # invalid expression
    > error: unbalanced parenthesis
    >
    > I thought '\\' should just be interpreted as a single '\' and not
    > affect anything afterwards...


    The second and third paragraphs of the re docs
    (http://docs.python.org/lib/module-re.html) cover this:
    """
    Regular expressions use the backslash character ("\") to indicate
    special forms or to allow special characters to be used without
    invoking their special meaning. This collides with Python's usage of
    the same character for the same purpose in string literals; for
    example, to match a literal backslash, one might have to write '\\\\'
    as the pattern string, because the regular expression must be "\\",
    and each backslash must be expressed as "\\" inside a regular Python
    string literal.

    The solution is to use Python's raw string notation for regular
    expression patterns; backslashes are not handled in any special way in
    a string literal prefixed with "r". So r"\n" is a two-character string
    containing "\" and "n", while "\n" is a one-character string
    containing a newline. Usually patterns will be expressed in Python
    code using this raw string notation.
    """

    Recommended reading: http://www.amk.ca/python/howto/regex/regex.html#SECTION000420000000000000000
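The two examples from the quoted documentation are easy to verify. This is a minimal sketch; the variable names are illustrative only.

```python
import re

raw_n = r"\n"     # two characters: a backslash and the letter n
newline = "\n"    # one character: a newline

four_typed = '\\\\'   # plain literal, four backslashes typed
two_raw = r"\\"       # raw literal, two backslashes typed

# Both spellings are the same two-character regex, which matches
# a single literal backslash.
same = (four_typed == two_raw)
hit = re.match(two_raw, "\\") is not None

print(len(raw_n), len(newline), same, hit)
```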

    >
    > The script 'redemo.py' shipped with Python by default is just fine
    > about this regex however.


    That's because you are typing the regex into a Tkinter app. The same
    would hold if you read the regex from (say) a config file or typed it
    at a raw_input prompt. The common factor is that the text does not
    pass through the extra level of backslash processing that a Python
    string literal applies.
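To see why text arriving at runtime behaves differently, the sketch below fakes a one-line config file with io.StringIO and builds its contents without writing any backslash in the source. The file stand-in and all names here are assumptions for illustration, not part of redemo.py.

```python
import io
import re

BACKSLASH = chr(92)  # one backslash character, built without any escaping

# The 10 characters a user would type into a file or a text box: \\.*\\(.*)
file_text = BACKSLASH * 2 + '.*' + BACKSLASH * 2 + '(.*)'

# Stand-in for a config file holding that text on a single line.
config_file = io.StringIO(file_text + '\n')
typed = config_file.readline().rstrip('\n')

# The text reaches re.compile exactly as typed; no literal processing occurs.
compiled = re.compile(typed)
print(len(typed), compiled.groups)
```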

    HTH,
    John
     
    John Machin, Apr 22, 2007
    #3
