python3 raw strings and \u escapes

Discussion in 'Python' started by rurpy@yahoo.com, May 30, 2012.

  1. Guest

    In python2, "\u" escapes are processed in raw unicode
    strings. That is, ur'\u3000' is a string of length 1
    consisting of the IDEOGRAPHIC SPACE unicode character.

    In python3, "\u" escapes are not processed in raw strings.
    r'\u3000' is a string of length 6 consisting of a backslash,
    'u', '3' and three '0' characters.

    This breaks a lot of my code because in python 2
    re.split (ur'[\u3000]', u'A\u3000A') ==> [u'A', u'A']
    but in python 3 (the result of running 2to3),
    re.split (r'[\u3000]', 'A\u3000A' ) ==> ['A\u3000A']

    I can remove the "r" prefix from the regex string but then
    if I have other regex backslash symbols in it, I have to
    double all the other backslashes -- the very thing that
    the r-prefix was invented to avoid.

    Or I can leave the "r" prefix and replace something like
    r'[ \u3000]' with r'[  ]'. But that is confusing because
    one can't distinguish between the space character and
    the ideographic space character. It is also a problem if a
    reader of the code doesn't have a font that can display
    the character.

    Was there a reason for dropping the lexical processing of
    \u escapes in strings in python3 (other than to add another
    annoyance in a long list of python3 annoyances?)

    And is there no choice for me but to choose between the two
    poor choices I mention above to deal with this problem?
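The difference described above can be sketched in a few lines (a minimal demo, assuming CPython 3; the re result is noted as version-dependent rather than asserted, since later 3.x releases changed it):

```python
import re

# Raw strings in Python 3 leave \u escapes untouched:
print(len(r'\u3000'))  # 6 characters: backslash, 'u', '3', '0', '0', '0'
print(len('\u3000'))   # 1 character: the IDEOGRAPHIC SPACE itself

# On Python 3.2 (the version under discussion) the raw pattern therefore
# failed to split on the ideographic space; re's handling of \uXXXX in
# patterns is version-dependent, so this result varies by release:
print(re.split(r'[\u3000]', 'A\u3000A'))
```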
     
    , May 30, 2012
    #1

  2. On 30.05.2012 08:52, wrote:

    > This breaks a lot of my code because in python 2
    > re.split (ur'[\u3000]', u'A\u3000A') ==> [u'A', u'A']
    > but in python 3 (the result of running 2to3),
    > re.split (r'[\u3000]', 'A\u3000A' ) ==> ['A\u3000A']
    >
    > I can remove the "r" prefix from the regex string but then
    > if I have other regex backslash symbols in it, I have to
    > double all the other backslashes -- the very thing that
    > the r-prefix was invented to avoid.
    >
    > Or I can leave the "r" prefix and replace something like
    > r'[ \u3000]' with r'[  ]'. But that is confusing because
    > one can't distinguish between the space character and
    > the ideographic space character. It is also a problem if a
    > reader of the code doesn't have a font that can display
    > the character.
    >
    > Was there a reason for dropping the lexical processing of
    > \u escapes in strings in python3 (other than to add another
    > annoyance in a long list of python3 annoyances?)


    Probably it is more consistent. Alas, it makes the whole thing
    incompatible with Py2.

    But if you think about it: why should \u be processed when \r, \n
    etc. are not processed either?


    > And is there no choice for me but to choose between the two
    > poor choices I mention above to deal with this problem?


    There is a 3rd one: use r'[ ' + '\u3000' + ']'. Not very nice to read,
    but should do the trick...
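A quick check of this third option (a sketch; the test string with the extra 'B C' is invented for illustration):

```python
import re

# Raw piece + cooked piece holding the \u escape, joined before compiling:
pattern = r'[ ' + '\u3000' + r']'
print(re.split(pattern, 'A\u3000B C'))  # → ['A', 'B', 'C']
```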


    Thomas
     
    Thomas Rachel, May 30, 2012
    #2

  3. Guest

    On 05/30/2012 05:54 AM, Thomas Rachel wrote:
    > On 30.05.2012 08:52, wrote:
    >
    >> This breaks a lot of my code because in python 2
    >> re.split (ur'[\u3000]', u'A\u3000A') ==> [u'A', u'A']
    >> but in python 3 (the result of running 2to3),
    >> re.split (r'[\u3000]', 'A\u3000A' ) ==> ['A\u3000A']
    >>
    >> I can remove the "r" prefix from the regex string but then
    >> if I have other regex backslash symbols in it, I have to
    >> double all the other backslashes -- the very thing that
    >> the r-prefix was invented to avoid.
    >>
    >> Or I can leave the "r" prefix and replace something like
    >> r'[ \u3000]' with r'[  ]'. But that is confusing because
    >> one can't distinguish between the space character and
    >> the ideographic space character. It is also a problem if a
    >> reader of the code doesn't have a font that can display
    >> the character.
    >>
    >> Was there a reason for dropping the lexical processing of
    >> \u escapes in strings in python3 (other than to add another
    >> annoyance in a long list of python3 annoyances?)

    >
    > Probably it is more consistent. Alas, it makes the whole thing
    > incompatible with Py2.
    >
    > But if you think about it: why should \u be processed when \r, \n
    > etc. are not processed either?


    Maybe the blame is elsewhere then... If the re module
    interprets (in a regex string) the 2-character string
    consisting of r'\' followed by 'n' as a single newline
    character, then why wasn't re changed for Python 3 to
    interpret the 6-character string, r'\u3000' as a single
    unicode character to correspond with Python's lexer no
    longer doing that (as it did in Python 2)?
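This is essentially what happened later: from Python 3.3 on, the re module expands \uXXXX and \UXXXXXXXX escapes in patterns itself, so on 3.3+ the raw-string idiom works again:

```python
import re

# On Python 3.3+ the regex compiler, not the string literal, expands \u3000,
# even though the raw string delivers it as six separate characters:
print(re.split(r'[\u3000]', 'A\u3000A'))  # → ['A', 'A'] on 3.3 and later
```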

    >> And is there no choice for me but to choose between the two
    >> poor choices I mention above to deal with this problem?

    >
    > There is a 3rd one: use r'[ ' + '\u3000' + ']'. Not very nice to read,
    > but should do the trick...


    I guess the "+"s could be left out allowing something
    like,

    '[ \u3000]' r'\w+ \d{3}'

    but I'll have to try it a little; maybe just doubling
    backslashes won't be much worse. I did that for years
    in Perl and lived through it.
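Adjacent string literals may indeed mix cooked and raw pieces, so the concatenation can stay implicit (a sketch; the pattern and sample text are invented for illustration):

```python
import re

# The cooked piece carries the \u escape; the raw piece carries the
# regex backslash, with no '+' needed between the literals:
pattern = '[ \u3000]' r'\w+'
print(re.findall(pattern, 'x Aa\u3000Bb'))  # → [' Aa', '\u3000Bb']
```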
     
    , May 30, 2012
    #3
  4. Guest

    On 05/30/2012 10:46 AM, Terry Reedy wrote:
    > On 5/30/2012 2:52 AM, wrote:
    >> In python2, "\u" escapes are processed in raw unicode
    >> strings. That is, ur'\u3000' is a string of length 1
    >> consisting of the IDEOGRAPHIC SPACE unicode character.

    >
    > That surprised me until I rechecked the fine manual and found:
    >
    > "When an 'r' or 'R' prefix is present, a character following a backslash
    > is included in the string without change, and all backslashes are left
    > in the string."
    >
    > "When an 'r' or 'R' prefix is used in conjunction with a 'u' or 'U'
    > prefix, then the \uXXXX and \UXXXXXXXX escape sequences are processed
    > while all other backslashes are left in the string."
    >
    > When 'u' was removed in Python 3, a choice had to be made and the first
    > must have seemed to be the obvious one, or perhaps the automatic one.
    >
    > In 3.3, 'u' is being restored. I have inquired on pydev list whether the
    > difference above should also be restored, and mentioned this thread.


    As mentioned in a different message, another option might
    be to leave raw strings as-is (more consistent, since all
    backslashes are treated the same) and have the "re" module
    un-escape "\uxxxx" (and similar) literals in regex strings
    (also more consistent, since that is what it does with '\\n',
    '\\t', etc.)

    I do realize, though, that this may have backward-compatibility
    problems that make it impossible to do.
     
    , May 30, 2012
    #4
  5. jmfauth Guest

    On May 30, 13:54, Thomas Rachel <nutznetz-0c1b6768-bfa9-48d5-
    > wrote:
    > On 30.05.2012 08:52, wrote:
    >
    >
    >
    > > This breaks a lot of my code because in python 2
    > >        re.split (ur'[\u3000]', u'A\u3000A') ==> [u'A', u'A']
    > > but in python 3 (the result of running 2to3),
    > >        re.split (r'[\u3000]', 'A\u3000A' ) ==>  ['A\u3000A']

    >
    > > I can remove the "r" prefix from the regex string but then
    > > if I have other regex backslash symbols in it, I have to
    > > double all the other backslashes -- the very thing that
    > > the r-prefix was invented to avoid.

    >
    > > Or I can leave the "r" prefix and replace something like
    > > r'[ \u3000]' with r'[  ]'.  But that is confusing because
    > > one can't distinguish between the space character and
    > > the ideographic space character.  It is also a problem if a
    > > reader of the code doesn't have a font that can display
    > > the character.

    >
    > > Was there a reason for dropping the lexical processing of
    > > \u escapes in strings in python3 (other than to add another
    > > annoyance in a long list of python3 annoyances?)

    >
    > Probably it is more consistent. Alas, it makes the whole thing
    > incompatible with Py2.
    >
    > But if you think about it: why should \u be processed when \r, \n
    > etc. are not processed either?
    >
    > > And is there no choice for me but to choose between the two
    > > poor choices I mention above to deal with this problem?

    >
    > There is a 3rd one: use   r'[ ' + '\u3000' + ']'. Not very nice to read,
    > but should do the trick...
    >
    > Thomas


    I suggest looking at the problem differently. Python 3
    managed to bring order to the mismatched character-coding
    model that Python 2 offered.

    In your case, the

    >>> import unicodedata as ud
    >>> ud.name('\u3000')
    'IDEOGRAPHIC SPACE'

    "character" (in fact a unicode code point) is just as much
    a "character" as

    >>> ud.name('a')
    'LATIN SMALL LETTER A'

    The code point / unicode logic that Python 3 proposes and
    follows is straightforward.

    >>> s = 'a\u3000é\u3000€'
    >>> s.split('\u3000')
    ['a', 'é', '€']
    >>> import re
    >>> re.split('\u3000', s)
    ['a', 'é', '€']


    The backslash, used as a "real backslash", remains what it
    was in Python 2. Note the absence of r'...' .

    >>> s = 'a\\b\\c'
    >>> print(s)
    a\b\c
    >>> s.split('\\')
    ['a', 'b', 'c']
    >>> re.split('\\\\', s)
    ['a', 'b', 'c']
    >>> hex(ord('\\'))
    '0x5c'
    >>> re.split('\u005c\u005c', s)
    ['a', 'b', 'c']

    jmf
     
    jmfauth, May 30, 2012
    #5
  6. jmfauth Guest

    On May 30, 08:52, "" <> wrote:
    > In python2, "\u" escapes are processed in raw unicode
    > strings.  That is, ur'\u3000' is a string of length 1
    > consisting of the IDEOGRAPHIC SPACE unicode character.
    >
    > In python3, "\u" escapes are not processed in raw strings.
    > r'\u3000' is a string of length 6 consisting of a backslash,
    > 'u', '3' and three '0' characters.
    >
    > This breaks a lot of my code because in python 2
    >       re.split (ur'[\u3000]', u'A\u3000A') ==> [u'A', u'A']
    > but in python 3 (the result of running 2to3),
    >       re.split (r'[\u3000]', 'A\u3000A' ) ==> ['A\u3000A']
    >
    > I can remove the "r" prefix from the regex string but then
    > if I have other regex backslash symbols in it, I have to
    > double all the other backslashes -- the very thing that
    > the r-prefix was invented to avoid.
    >
    > Or I can leave the "r" prefix and replace something like
    > r'[ \u3000]' with r'[  ]'.  But that is confusing because
    > one can't distinguish between the space character and
    > the ideographic space character.  It is also a problem if a
    > reader of the code doesn't have a font that can display
    > the character.
    >
    > Was there a reason for dropping the lexical processing of
    > \u escapes in strings in python3 (other than to add another
    > annoyance in a long list of python3 annoyances?)
    >
    > And is there no choice for me but to choose between the two
    > poor choices I mention above to deal with this problem?



    I suggest looking at the problem differently. Python 3
    managed to bring order to the mismatched character-coding
    model that Python 2 offered.

    The 'IDEOGRAPHIC SPACE' and 'REVERSE SOLIDUS' (backslash)
    "characters" (in fact unicode code points) are just (normal)
    "characters". The backslash, used as an escaping command,
    keeps its function.

    Note the absence of r'...'

    >>> s = 'a\u3000é\u3000€'
    >>> s.split('\u3000')
    ['a', 'é', '€']
    >>> import re
    >>> re.split('\u3000', s)
    ['a', 'é', '€']


    >>> s = 'a\\b\\c'
    >>> print(s)
    a\b\c
    >>> s.split('\\')
    ['a', 'b', 'c']
    >>> re.split('\\\\', s)
    ['a', 'b', 'c']
    >>> hex(ord('\\'))
    '0x5c'
    >>> re.split('\u005c\u005c', s)
    ['a', 'b', 'c']

    jmf
     
    jmfauth, May 31, 2012
    #6
  7. Guest

    On 05/30/2012 09:07 AM, wrote:
    > On 05/30/2012 05:54 AM, Thomas Rachel wrote:
    >> On 30.05.2012 08:52, wrote:
    >>
    >>> This breaks a lot of my code because in python 2
    >>> re.split (ur'[\u3000]', u'A\u3000A') ==> [u'A', u'A']
    >>> but in python 3 (the result of running 2to3),
    >>> re.split (r'[\u3000]', 'A\u3000A' ) ==> ['A\u3000A']
    >>>
    >>> I can remove the "r" prefix from the regex string but then
    >>> if I have other regex backslash symbols in it, I have to
    >>> double all the other backslashes -- the very thing that
    >>> the r-prefix was invented to avoid.
    >>>
    >>> Or I can leave the "r" prefix and replace something like
    >>> r'[ \u3000]' with r'[  ]'. But that is confusing because
    >>> one can't distinguish between the space character and
    >>> the ideographic space character. It is also a problem if a
    >>> reader of the code doesn't have a font that can display
    >>> the character.
    >>>
    >>> Was there a reason for dropping the lexical processing of
    >>> \u escapes in strings in python3 (other than to add another
    >>> annoyance in a long list of python3 annoyances?)

    >>
    >> Probably it is more consistent. Alas, it makes the whole thing
    >> incompatible with Py2.
    >>
    >> But if you think about it: why should \u be processed when \r, \n
    >> etc. are not processed either?

    >
    > Maybe the blame is elsewhere then... If the re module
    > interprets (in a regex string) the 2-character string
    > consisting of r'\' followed by 'n' as a single newline
    > character, then why wasn't re changed for Python 3 to
    > interpret the 6-character string, r'\u3000' as a single
    > unicode character to correspond with Python's lexer no
    > longer doing that (as it did in Python 2)?
    >
    >>> And is there no choice for me but to choose between the two
    >>> poor choices I mention above to deal with this problem?

    >>
    >> There is a 3rd one: use r'[ ' + '\u3000' + ']'. Not very nice to read,
    >> but should do the trick...

    >
    > I guess the "+"s could be left out allowing something
    > like,
    >
    > '[ \u3000]' r'\w+ \d{3}'
    >
    > but I'll have to try it a little; maybe just doubling
    > backslashes won't be much worse. I did that for years
    > in Perl and lived through it.


    Just for some closure, there are many places in my code
    that I had/have to track down and change. But the biggest
    problem so far is a lexer module that is structured as many
    dozens of little functions, each with a docstring that is
    a regex string.

    The only way I found to change these and maintain sanity was
    to go through them, remove the "r" prefix from any strings
    that contain "\unnnn" literals, and then double any other
    backslashes in the string.

    Since these are docstrings, creating them with executable
    code was awkward, and using adjacent string concatenation
    led to a very confusing mix of string styles. Strings that
    used concatenation often had a single logical regex structure
    (eg a character set "[...]") split between two strings.
    The extra quote characters were as visually confusing as
    doubled backslashes in many cases.

    Strings with doubled backslashes, although harder to read,
    were much easier to edit reliably and, in their way, more
    regular. It does make this module look very Perlish
    though... :)
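The shape the module ended up with can be sketched with a toy stand-in for a Ply-style rule (the function name and pattern are invented; the point is that a cooked docstring expands \u3000 to one character while '\\w' survives as the two regex characters \w):

```python
import re

def t_IDEO_WORD(t):
    '[ \u3000]\\w+'   # cooked docstring: \u3000 becomes one char, \\w stays \w
    return t

# A Ply-like lexer reads its pattern out of the docstring:
pattern = t_IDEO_WORD.__doc__
print(re.findall(pattern, 'x\u3000abc'))  # → ['\u3000abc']
```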
     
    , May 31, 2012
    #7
  8. Guest

    On 05/31/2012 03:10 PM, Chris Angelico wrote:
    > On Fri, Jun 1, 2012 at 6:28 AM, <> wrote:
    >> ... a lexer module that is structured as many
    >> dozens of little functions, each with a docstring that is
    >> a regex string.

    >
    > This may be a good opportunity to take a step back and ask yourself:
    > Why so many functions, each with a regular expression in its
    > docstring?


    Because that's the way David Beazley designed Ply?
    http://dabeaz.com/ply/

    Personally, I think it's an abuse of docstrings but
    he never asked me for my opinion...
     
    , May 31, 2012
    #8
  9. This is a related question.

    I perform an octal dump on a file:
    $ od -cx file
    0000000 h e l l o w o r l d \n
    6568 6c6c 206f 6f77 6c72 0a64

    I want to output the names of those characters:
    $ python3
    Python 3.2.3 (default, May 19 2012, 17:01:30)
    [GCC 4.6.3] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import unicodedata
    >>> unicodedata.name("\u0068")
    'LATIN SMALL LETTER H'
    >>> unicodedata.name("\u0065")
    'LATIN SMALL LETTER E'

    But, how to do this programmatically:
    >>> first_two_letters = "6568 6c6c 206f 6f77 6c72 0a64".split()[0]
    >>> first_two_letters
    '6568'
    >>> first_letter = "00" + first_two_letters[2:]
    >>> first_letter
    '0068'

    Now what?
     
    Jason Friedman, Jun 16, 2012
    #9
  10. MRAB Guest

    On 16/06/2012 00:42, Jason Friedman wrote:
    > This is a related question.
    >
    > I perform an octal dump on a file:
    > $ od -cx file
    > 0000000 h e l l o w o r l d \n
    > 6568 6c6c 206f 6f77 6c72 0a64
    >
    > I want to output the names of those characters:
    > $ python3
    > Python 3.2.3 (default, May 19 2012, 17:01:30)
    > [GCC 4.6.3] on linux2
    > Type "help", "copyright", "credits" or "license" for more information.
    >>>> import unicodedata
    >>>> unicodedata.name("\u0068")
    > 'LATIN SMALL LETTER H'
    >>>> unicodedata.name("\u0065")
    > 'LATIN SMALL LETTER E'
    >
    > But, how to do this programmatically:
    >>>> first_two_letters = "6568 6c6c 206f 6f77 6c72 0a64".split()[0]
    >>>> first_two_letters
    > '6568'
    >>>> first_letter = "00" + first_two_letters[2:]
    >>>> first_letter
    > '0068'
    >
    > Now what?


    >>> hex_code = "65"
    >>> unicodedata.name(chr(int(hex_code, 16)))
    'LATIN SMALL LETTER E'
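Scaling this one-liner up to the whole od -x line needs one more detail: od -x prints little-endian 16-bit words, so each four-digit group holds its second byte first (a sketch; the repr() fallback for unnamed control characters like '\n' is an assumption, not something from the thread):

```python
import unicodedata

line = "6568 6c6c 206f 6f77 6c72 0a64"  # od -x words for "hello world\n"

chars = []
for word in line.split():
    high, low = word[:2], word[2:]
    chars.append(chr(int(low, 16)))   # low byte is the earlier byte in the file
    chars.append(chr(int(high, 16)))

print(''.join(chars))                 # hello world (plus a newline)
for c in chars:
    # name() raises for unnamed code points like '\n'; repr() is a fallback
    print(unicodedata.name(c, repr(c)))
```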
     
    MRAB, Jun 16, 2012
    #10
  11. >> This is a related question.
    >>
    >> I perform an octal dump on a file:
    >> $ od -cx file
    >> 0000000   h   e   l   l   o       w   o   r   l   d  \n
    >>            6568    6c6c    206f    6f77    6c72    0a64
    >>
    >> I want to output the names of those characters:
    >> $ python3
    >> Python 3.2.3 (default, May 19 2012, 17:01:30)
    >> [GCC 4.6.3] on linux2
    >> Type "help", "copyright", "credits" or "license" for more information.
    >>>>> import unicodedata
    >>>>> unicodedata.name("\u0068")
    >> 'LATIN SMALL LETTER H'
    >>>>> unicodedata.name("\u0065")
    >> 'LATIN SMALL LETTER E'
    >>
    >> But, how to do this programmatically:
    >>>>> first_two_letters = "6568 6c6c 206f 6f77 6c72 0a64".split()[0]
    >>>>> first_two_letters
    >> '6568'
    >>>>> first_letter = "00" + first_two_letters[2:]
    >>>>> first_letter
    >> '0068'
    >>
    >> Now what?


    >>>> hex_code = "65"
    >>>> unicodedata.name(chr(int(hex_code, 16)))
    > 'LATIN SMALL LETTER E'


    Very helpful, thank you MRAB.

    The finished product: http://pastebin.com/4egQcke2.
     
    Jason Friedman, Jun 16, 2012
    #11
