python3 raw strings and \u escapes


rurpy

In python2, "\u" escapes are processed in raw unicode
strings. That is, ur'\u3000' is a string of length 1
consisting of the IDEOGRAPHIC SPACE unicode character.

In python3, "\u" escapes are not processed in raw strings.
r'\u3000' is a string of length 6 consisting of a backslash,
'u', '3' and three '0' characters.

This breaks a lot of my code because in python 2
re.split (ur'[\u3000]', u'A\u3000A') ==> [u'A', u'A']
but in python 3 (the result of running 2to3),
re.split (r'[\u3000]', 'A\u3000A' ) ==> ['A\u3000A']
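The length difference is easy to check in Python 3; a quick sketch (the split example shows that dropping the "r" prefix restores the old behaviour):

```python
import re

# In Python 3, the raw string keeps the escape as six characters,
# while the cooked string holds one IDEOGRAPHIC SPACE:
assert len(r'\u3000') == 6
assert len('\u3000') == 1

# Dropping the "r" prefix restores the Python 2 split result:
assert re.split('[\u3000]', 'A\u3000A') == ['A', 'A']
```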

I can remove the "r" prefix from the regex string but then
if I have other regex backslash symbols in it, I have to
double all the other backslashes -- the very thing that
the r-prefix was invented to avoid.

Or I can leave the "r" prefix and replace something like
r'[ \u3000]' with r'[  ]'. But that is confusing because
one can't distinguish between the space character and
the ideographic space character. It is also a problem if a
reader of the code doesn't have a font that can display
the character.

Was there a reason for dropping the lexical processing of
\u escapes in strings in Python 3 (other than to add another
annoyance to a long list of Python 3 annoyances)?

And is there no choice for me but to choose between the two
poor choices I mention above to deal with this problem?
 
Thomas Rachel

On 30.05.2012 08:52, (e-mail address removed) wrote:
This breaks a lot of my code because in python 2
re.split (ur'[\u3000]', u'A\u3000A') ==> [u'A', u'A']
but in python 3 (the result of running 2to3),
re.split (r'[\u3000]', 'A\u3000A' ) ==> ['A\u3000A']

I can remove the "r" prefix from the regex string but then
if I have other regex backslash symbols in it, I have to
double all the other backslashes -- the very thing that
the r-prefix was invented to avoid.

Or I can leave the "r" prefix and replace something like
r'[ \u3000]' with r'[  ]'. But that is confusing because
one can't distinguish between the space character and
the ideographic space character. It is also a problem if a
reader of the code doesn't have a font that can display
the character.

Was there a reason for dropping the lexical processing of
\u escapes in strings in Python 3 (other than to add another
annoyance to a long list of Python 3 annoyances)?

Probably it is more consistent. Alas, it makes the whole thing
incompatible with Py2.

But if you think about it: why should \u be processed when \r, \n
etc. are not?

And is there no choice for me but to choose between the two
poor choices I mention above to deal with this problem?

There is a 3rd one: use r'[ ' + '\u3000' + ']'. Not very nice to read,
but should do the trick...
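A quick check of the suggestion (the `\s*` tail here is only illustrative, to show other regex backslashes staying in raw strings):

```python
import re

# The regex backslashes stay in raw strings; only the \u escape
# lives in an ordinary string so the lexer processes it.
pattern = r'[ ' + '\u3000' + r']\s*'
assert re.split(pattern, 'A\u3000B C') == ['A', 'B', 'C']
```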


Thomas
 
rurpy

On 30.05.2012 08:52, (e-mail address removed) wrote:
This breaks a lot of my code because in python 2
re.split (ur'[\u3000]', u'A\u3000A') ==> [u'A', u'A']
but in python 3 (the result of running 2to3),
re.split (r'[\u3000]', 'A\u3000A' ) ==> ['A\u3000A']

I can remove the "r" prefix from the regex string but then
if I have other regex backslash symbols in it, I have to
double all the other backslashes -- the very thing that
the r-prefix was invented to avoid.

Or I can leave the "r" prefix and replace something like
r'[ \u3000]' with r'[  ]'. But that is confusing because
one can't distinguish between the space character and
the ideographic space character. It is also a problem if a
reader of the code doesn't have a font that can display
the character.

Was there a reason for dropping the lexical processing of
\u escapes in strings in Python 3 (other than to add another
annoyance to a long list of Python 3 annoyances)?

Probably it is more consistent. Alas, it makes the whole thing
incompatible with Py2.

But if you think about it: why should \u be processed when \r, \n
etc. are not?

Maybe the blame is elsewhere then... If the re module
interprets (in a regex string) the 2-character sequence
of a backslash followed by 'n' as a single newline
character, then why wasn't re changed in Python 3 to
interpret the 6-character string r'\u3000' as a single
unicode character, to correspond with Python's lexer no
longer doing that (as it did in Python 2)?
And is there no choice for me but to choose between the two
poor choices I mention above to deal with this problem?

There is a 3rd one: use r'[ ' + '\u3000' + ']'. Not very nice to read,
but should do the trick...

I guess the "+"s could be left out, allowing something
like:

'[ \u3000]' r'\w+ \d{3}'

but I'll have to try it a little; maybe just doubling
backslashes won't be much worse. I did that for years
in Perl and lived through it.
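The implicit-concatenation variant does work; a minimal check (the `\w+` pattern here is only illustrative):

```python
import re

# Adjacent string literals are concatenated at compile time, so raw
# and cooked pieces can be mixed without "+":
pattern = '[ \u3000]' r'\w+'
assert pattern == '[ \u3000]\\w+'
assert re.findall(pattern, 'x\u3000abc def') == ['\u3000abc', ' def']
```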
 
R

rurpy

That surprised me until I rechecked the fine manual and found:

"When an 'r' or 'R' prefix is present, a character following a backslash
is included in the string without change, and all backslashes are left
in the string."

"When an 'r' or 'R' prefix is used in conjunction with a 'u' or 'U'
prefix, then the \uXXXX and \UXXXXXXXX escape sequences are processed
while all other backslashes are left in the string."

When 'u' was removed in Python 3, a choice had to be made and the first
must have seemed to be the obvious one, or perhaps the automatic one.

In 3.3, 'u' is being restored. I have inquired on pydev list whether the
difference above should also be restored, and mentioned this thread.

As mentioned in a different message, another option might
be to leave raw strings as is (more consistent, since all
backslashes are treated the same) and have the "re" module
un-escape "\uxxxx" (and similar) literals in regex strings
(also more consistent, since that's what it does with '\\n',
'\\t', etc.)

I do realize, though, that this may have backward-compatibility
problems that make it impossible to do.
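For what it's worth, this is close to what eventually happened: since Python 3.3 the re module itself interprets \uXXXX and \UXXXXXXXX escapes in patterns, so the raw-string form works again:

```python
import re

# On Python 3.3+, re processes the \u escape inside the pattern,
# giving the same result the old ur'...' literals gave in Python 2:
assert re.split(r'[\u3000]', 'A\u3000A') == ['A', 'A']
```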
 
jmfauth

On 30.05.2012 08:52, (e-mail address removed) wrote:


This breaks a lot of my code because in python 2
       re.split (ur'[\u3000]', u'A\u3000A') ==> [u'A', u'A']
but in python 3 (the result of running 2to3),
       re.split (r'[\u3000]', 'A\u3000A' ) ==>  ['A\u3000A']
I can remove the "r" prefix from the regex string but then
if I have other regex backslash symbols in it, I have to
double all the other backslashes -- the very thing that
the r-prefix was invented to avoid.
Or I can leave the "r" prefix and replace something like
r'[ \u3000]' with r'[  ]'.  But that is confusing because
one can't distinguish between the space character and
the ideographic space character. It is also a problem if a
reader of the code doesn't have a font that can display
the character.
Was there a reason for dropping the lexical processing of
\u escapes in strings in Python 3 (other than to add another
annoyance to a long list of Python 3 annoyances)?

Probably it is more consistent. Alas, it makes the whole thing
incompatible with Py2.

But if you think about it: why should \u be processed when \r, \n
etc. are not?
And is there no choice for me but to choose between the two
poor choices I mention above to deal with this problem?

There is a 3rd one: use   r'[ ' + '\u3000' + ']'. Not very nice to read,
but should do the trick...

Thomas

I suggest looking at the problem differently. Python 3
succeeded in putting order into the mismatched character
handling that Python 2 offered.

In your case, the 'IDEOGRAPHIC SPACE' "character" (in fact
a unicode code point) is just as much a "character" as
'LATIN SMALL LETTER A'. The code point / unicode logic that
Python 3 proposes and follows becomes straightforward:

>>> s = 'a\u3000é\u3000€'
>>> s.split('\u3000')
['a', 'é', '€']
>>> import re
>>> re.split('\u3000', s)
['a', 'é', '€']


The backslash, used as a "real backslash", remains what it
really was in Python 2. Note the absence of r'...':

>>> s = 'a\\b\\c'
>>> print(s)
a\b\c
>>> s.split('\\')
['a', 'b', 'c']
>>> re.split('\\\\', s)
['a', 'b', 'c']

jmf
 
jmfauth

In python2, "\u" escapes are processed in raw unicode
strings.  That is, ur'\u3000' is a string of length 1
consisting of the IDEOGRAPHIC SPACE unicode character.

In python3, "\u" escapes are not processed in raw strings.
r'\u3000' is a string of length 6 consisting of a backslash,
'u', '3' and three '0' characters.

This breaks a lot of my code because in python 2
      re.split (ur'[\u3000]', u'A\u3000A') ==> [u'A', u'A']
but in python 3 (the result of running 2to3),
      re.split (r'[\u3000]', 'A\u3000A' ) ==> ['A\u3000A']

I can remove the "r" prefix from the regex string but then
if I have other regex backslash symbols in it, I have to
double all the other backslashes -- the very thing that
the r-prefix was invented to avoid.

Or I can leave the "r" prefix and replace something like
r'[ \u3000]' with r'[  ]'.  But that is confusing because
one can't distinguish between the space character and
the ideographic space character. It is also a problem if a
reader of the code doesn't have a font that can display
the character.

Was there a reason for dropping the lexical processing of
\u escapes in strings in Python 3 (other than to add another
annoyance to a long list of Python 3 annoyances)?

And is there no choice for me but to choose between the two
poor choices I mention above to deal with this problem?


I suggest looking at the problem differently. Python 3
succeeded in putting order into the mismatched character
handling that Python 2 offered.

The 'IDEOGRAPHIC SPACE' and 'REVERSE SOLIDUS' (backslash)
"characters" (in fact unicode code points) are just (normal)
"characters". The backslash, used as an escaping command,
keeps its function.

Note the absence of r'...':

>>> s = 'a\u3000é\u3000€'
>>> s.split('\u3000')
['a', 'é', '€']
>>> import re
>>> re.split('\u3000', s)
['a', 'é', '€']

>>> s = 'a\\b\\c'
>>> print(s)
a\b\c
>>> s.split('\\')
['a', 'b', 'c']
>>> re.split('\\\\', s)
['a', 'b', 'c']

jmf
 
rurpy

On 30.05.2012 08:52, (e-mail address removed) wrote:
This breaks a lot of my code because in python 2
re.split (ur'[\u3000]', u'A\u3000A') ==> [u'A', u'A']
but in python 3 (the result of running 2to3),
re.split (r'[\u3000]', 'A\u3000A' ) ==> ['A\u3000A']

I can remove the "r" prefix from the regex string but then
if I have other regex backslash symbols in it, I have to
double all the other backslashes -- the very thing that
the r-prefix was invented to avoid.

Or I can leave the "r" prefix and replace something like
r'[ \u3000]' with r'[  ]'. But that is confusing because
one can't distinguish between the space character and
the ideographic space character. It is also a problem if a
reader of the code doesn't have a font that can display
the character.

Was there a reason for dropping the lexical processing of
\u escapes in strings in Python 3 (other than to add another
annoyance to a long list of Python 3 annoyances)?

Probably it is more consistent. Alas, it makes the whole thing
incompatible with Py2.

But if you think about it: why should \u be processed when \r, \n
etc. are not?

Maybe the blame is elsewhere then... If the re module
interprets (in a regex string) the 2-character sequence
of a backslash followed by 'n' as a single newline
character, then why wasn't re changed in Python 3 to
interpret the 6-character string r'\u3000' as a single
unicode character, to correspond with Python's lexer no
longer doing that (as it did in Python 2)?
And is there no choice for me but to choose between the two
poor choices I mention above to deal with this problem?

There is a 3rd one: use r'[ ' + '\u3000' + ']'. Not very nice to read,
but should do the trick...

I guess the "+"s could be left out, allowing something
like:

'[ \u3000]' r'\w+ \d{3}'

but I'll have to try it a little; maybe just doubling
backslashes won't be much worse. I did that for years
in Perl and lived through it.

Just for some closure, there are many places in my code
that I had/have to track down and change. But the biggest
problem so far is a lexer module that is structured as many
dozens of little functions, each with a docstring that is
a regex string.

The only way I found to change these and maintain sanity was
to go through them and remove the "r" prefix from any strings
that contain "\unnnn" literals, and then double any other
backslashes in the string.

Since these are docstrings, creating them with executable
code was awkward, and using adjacent string concatenation
led to a very confusing mix of string styles. Strings that
used concatenation often had a single logical regex structure
(eg a character set "[...]") split between two strings.
The extra quote characters were as visually confusing as
doubled backslashes in many cases.

Strings with doubled backslashes, although harder to read,
were much easier to edit reliably and, in their way, more
regular. It does make this module look very Perlish
though... :)
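A hypothetical Ply-style token function illustrating the chosen style (the token name and pattern are made up, not from the actual lexer): the docstring is a non-raw string, so the lexer processes \u3000 into one character while the doubled backslashes stay as regex escapes.

```python
import re

def t_JWORD(t):
    '[ \u3000]\\w+\\d{3}'   # non-raw: \u3000 is one char; \\w, \\d stay regex escapes
    return t

# The docstring compiles to the intended pattern:
assert re.match(t_JWORD.__doc__, '\u3000abc123')
```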
 
rurpy

This may be a good opportunity to take a step back and ask yourself:
Why so many functions, each with a regular expression in its
docstring?

Because that's the way David Beazley designed Ply?
http://dabeaz.com/ply/

Personally, I think it's an abuse of docstrings but
he never asked me for my opinion...
 
Jason Friedman

This is a related question.

I perform an octal dump on a file:
$ od -cx file
0000000   h   e   l   l   o       w   o   r   l   d  \n
           6568    6c6c    206f    6f77    6c72    0a64

I want to output the names of those characters:
$ python3
Python 3.2.3 (default, May 19 2012, 17:01:30)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> unicodedata.name("\u0068")
'LATIN SMALL LETTER H'
>>> unicodedata.name("\u0065")
'LATIN SMALL LETTER E'

But, how to do this programmatically:
>>> first_two_letters = "6568 6c6c 206f 6f77 6c72 0a64".split()[0]
>>> first_two_letters
'6568'
>>> first_letter = "00" + first_two_letters[2:]
>>> first_letter
'0068'

Now what?
 
MRAB

This is a related question.

I perform an octal dump on a file:
$ od -cx file
0000000   h   e   l   l   o       w   o   r   l   d  \n
           6568    6c6c    206f    6f77    6c72    0a64

I want to output the names of those characters:
$ python3
Python 3.2.3 (default, May 19 2012, 17:01:30)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> unicodedata.name("\u0068")
'LATIN SMALL LETTER H'
>>> unicodedata.name("\u0065")
'LATIN SMALL LETTER E'

But, how to do this programmatically:
>>> first_two_letters = "6568 6c6c 206f 6f77 6c72 0a64".split()[0]
>>> first_two_letters
'6568'
>>> first_letter = "00" + first_two_letters[2:]
>>> first_letter
'0068'

Now what?

>>> hex_code = "65"
>>> unicodedata.name(chr(int(hex_code, 16)))
'LATIN SMALL LETTER E'
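MRAB's one-liner generalizes to the whole dump; a sketch, assuming od -x's usual little-endian word order (each four-digit group holds two bytes in swapped order):

```python
import unicodedata

words = "6568 6c6c 206f 6f77 6c72 0a64".split()
names = []
for w in words:
    # od -x shows 16-bit words little-endian: "6568" is byte 0x68 ('h')
    # followed by byte 0x65 ('e'), so swap each pair back.
    for byte_hex in (w[2:], w[:2]):
        ch = chr(int(byte_hex, 16))
        try:
            names.append(unicodedata.name(ch))
        except ValueError:  # control characters like '\n' have no name
            names.append(repr(ch))
```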
 
Jason Friedman

This is a related question.
I perform an octal dump on a file:
$ od -cx file
0000000   h   e   l   l   o       w   o   r   l   d  \n
           6568    6c6c    206f    6f77    6c72    0a64

I want to output the names of those characters:
$ python3
Python 3.2.3 (default, May 19 2012, 17:01:30)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> unicodedata.name("\u0068")
'LATIN SMALL LETTER H'
>>> unicodedata.name("\u0065")
'LATIN SMALL LETTER E'

But, how to do this programatically:
 first_two_letters = "6568    6c6c    206f    6f77    6c72
 0a64".split()[0]
 first_two_letters
'6568'

 first_letter = "00" + first_two_letters[2:]
 first_letter

'0068'

Now what?
hex_code = "65"
unicodedata.name(chr(int(hex_code, 16)))
'LATIN SMALL LETTER E'

Very helpful, thank you MRAB.

The finished product: http://pastebin.com/4egQcke2.
 