python3 raw strings and \u escapes

Discussion in 'Python' started by rurpy@yahoo.com, May 30, 2012.

  1. Guest

    In python2, "\u" escapes are processed in raw unicode
    strings. That is, ur'\u3000' is a string of length 1
    consisting of the IDEOGRAPHIC SPACE unicode character.

    In python3, "\u" escapes are not processed in raw strings.
    r'\u3000' is a string of length 6 consisting of a backslash,
    'u', '3' and three '0' characters.

    This breaks a lot of my code because in python 2
    re.split (ur'[\u3000]', u'A\u3000A') ==> [u'A', u'A']
    but in python 3 (the result of running 2to3),
    re.split (r'[\u3000]', 'A\u3000A' ) ==> ['A\u3000A']

    I can remove the "r" prefix from the regex string but then
    if I have other regex backslash symbols in it, I have to
    double all the other backslashes -- the very thing that
    the r-prefix was invented to avoid.

    Or I can leave the "r" prefix and replace something like
    r'[ \u3000]' with r'[  ]'. But that is confusing because
    one can't distinguish between the space character and
    the ideographic space character. It is also a problem if a
    reader of the code doesn't have a font that can display
    the character.

    Was there a reason for dropping the lexical processing of
    \u escapes in strings in python3 (other than to add another
    annoyance in a long list of python3 annoyances?)

    And is there no choice for me but to choose between the two
    poor choices I mention above to deal with this problem?
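The difference described above can be sketched in a few lines (a minimal demo, assuming CPython 3; the re result is noted as version-dependent rather than asserted, since later 3.x releases changed it):

```python
import re

# Raw strings in Python 3 leave \u escapes untouched:
print(len(r'\u3000'))  # 6 characters: backslash, 'u', '3', '0', '0', '0'
print(len('\u3000'))   # 1 character: the IDEOGRAPHIC SPACE itself

# On Python 3.2 (the version under discussion) the raw pattern therefore
# failed to split on the ideographic space; re's handling of \uXXXX in
# patterns is version-dependent, so this result varies by release:
print(re.split(r'[\u3000]', 'A\u3000A'))
```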
     
    , May 30, 2012
    #1

  2. On 30.05.2012 08:52, wrote:

    > This breaks a lot of my code because in python 2
    > re.split (ur'[\u3000]', u'A\u3000A') ==> [u'A', u'A']
    > but in python 3 (the result of running 2to3),
    > re.split (r'[\u3000]', 'A\u3000A' ) ==> ['A\u3000A']
    >
    > I can remove the "r" prefix from the regex string but then
    > if I have other regex backslash symbols in it, I have to
    > double all the other backslashes -- the very thing that
    > the r-prefix was invented to avoid.
    >
    > Or I can leave the "r" prefix and replace something like
    > r'[ \u3000]' with r'[  ]'. But that is confusing because
    > one can't distinguish between the space character and
    > the ideographic space character. It is also a problem if a
    > reader of the code doesn't have a font that can display
    > the character.
    >
    > Was there a reason for dropping the lexical processing of
    > \u escapes in strings in python3 (other than to add another
    > annoyance in a long list of python3 annoyances?)


    Probably it is more consistent. Alas, it makes the whole thing
    incompatible with Py2.

    But if you think about it: why should \u be processed when \r, \n
    etc. are not processed either?


    > And is there no choice for me but to choose between the two
    > poor choices I mention above to deal with this problem?


    There is a 3rd one: use r'[ ' + '\u3000' + ']'. Not very nice to read,
    but should do the trick...
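A quick check of this third option (a sketch; the test string with the extra 'B C' is invented for illustration):

```python
import re

# Raw piece + cooked piece holding the \u escape, joined before compiling:
pattern = r'[ ' + '\u3000' + r']'
print(re.split(pattern, 'A\u3000B C'))  # → ['A', 'B', 'C']
```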


    Thomas
     
    Thomas Rachel, May 30, 2012
    #2

  3. Guest

    On 05/30/2012 05:54 AM, Thomas Rachel wrote:
    > On 30.05.2012 08:52, wrote:
    >
    >> This breaks a lot of my code because in python 2
    >> re.split (ur'[\u3000]', u'A\u3000A') ==> [u'A', u'A']
    >> but in python 3 (the result of running 2to3),
    >> re.split (r'[\u3000]', 'A\u3000A' ) ==> ['A\u3000A']
    >>
    >> I can remove the "r" prefix from the regex string but then
    >> if I have other regex backslash symbols in it, I have to
    >> double all the other backslashes -- the very thing that
    >> the r-prefix was invented to avoid.
    >>
    >> Or I can leave the "r" prefix and replace something like
    >> r'[ \u3000]' with r'[  ]'. But that is confusing because
    >> one can't distinguish between the space character and
    >> the ideographic space character. It is also a problem if a
    >> reader of the code doesn't have a font that can display
    >> the character.
    >>
    >> Was there a reason for dropping the lexical processing of
    >> \u escapes in strings in python3 (other than to add another
    >> annoyance in a long list of python3 annoyances?)

    >
    > Probably it is more consistent. Alas, it makes the whole thing
    > incompatible with Py2.
    >
    > But if you think about it: why should \u be processed when \r, \n
    > etc. are not processed either?


    Maybe the blame is elsewhere then... If the re module
    interprets (in a regex string) the 2-character string
    consisting of r'\' followed by 'n' as a single newline
    character, then why wasn't re changed for Python 3 to
    interpret the 6-character string, r'\u3000' as a single
    unicode character to correspond with Python's lexer no
    longer doing that (as it did in Python 2)?
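This is essentially what happened later: from Python 3.3 on, the re module expands \uXXXX and \UXXXXXXXX escapes in patterns itself, so on 3.3+ the raw-string idiom works again:

```python
import re

# On Python 3.3+ the regex compiler, not the string literal, expands \u3000,
# even though the raw string delivers it as six separate characters:
print(re.split(r'[\u3000]', 'A\u3000A'))  # → ['A', 'A'] on 3.3 and later
```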

    >> And is there no choice for me but to choose between the two
    >> poor choices I mention above to deal with this problem?

    >
    > There is a 3rd one: use r'[ ' + '\u3000' + ']'. Not very nice to read,
    > but should do the trick...


    I guess the "+"s could be left out allowing something
    like,

    '[ \u3000]' r'\w+ \d{3}'

    but I'll have to try it a little; maybe just doubling
    backslashes won't be much worse. I did that for years
    in Perl and lived through it.
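Adjacent string literals may indeed mix cooked and raw pieces, so the concatenation can stay implicit (a sketch; the pattern and sample text are invented for illustration):

```python
import re

# The cooked piece carries the \u escape; the raw piece carries the
# regex backslash, with no '+' needed between the literals:
pattern = '[ \u3000]' r'\w+'
print(re.findall(pattern, 'x Aa\u3000Bb'))  # → [' Aa', '\u3000Bb']
```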
     
    , May 30, 2012
    #3
  4. Guest

    On 05/30/2012 10:46 AM, Terry Reedy wrote:
    > On 5/30/2012 2:52 AM, wrote:
    >> In python2, "\u" escapes are processed in raw unicode
    >> strings. That is, ur'\u3000' is a string of length 1
    >> consisting of the IDEOGRAPHIC SPACE unicode character.

    >
    > That surprised me until I rechecked the fine manual and found:
    >
    > "When an 'r' or 'R' prefix is present, a character following a backslash
    > is included in the string without change, and all backslashes are left
    > in the string."
    >
    > "When an 'r' or 'R' prefix is used in conjunction with a 'u' or 'U'
    > prefix, then the \uXXXX and \UXXXXXXXX escape sequences are processed
    > while all other backslashes are left in the string."
    >
    > When 'u' was removed in Python 3, a choice had to be made and the first
    > must have seemed to be the obvious one, or perhaps the automatic one.
    >
    > In 3.3, 'u' is being restored. I have inquired on pydev list whether the
    > difference above should also be restored, and mentioned this thread.


    As mentioned in a different message, another option might
    be to leave raw strings as-is (more consistent, since all
    backslashes are treated the same) and have the "re" module
    un-escape "\uxxxx" (and similar) literals in regex strings
    (also more consistent, since that is what it does with '\\n',
    '\\t', etc.)

    I do realize, though, that this may have backward-compatibility
    problems that make it impossible to do.
     
    , May 30, 2012
    #4
  5. jmfauth Guest

    On May 30, 13:54, Thomas Rachel <nutznetz-0c1b6768-bfa9-48d5-
    > wrote:
    > On 30.05.2012 08:52, wrote:
    >
    >
    >
    > > This breaks a lot of my code because in python 2
    > >        re.split (ur'[\u3000]', u'A\u3000A') ==> [u'A', u'A']
    > > but in python 3 (the result of running 2to3),
    > >        re.split (r'[\u3000]', 'A\u3000A' ) ==>  ['A\u3000A']

    >
    > > I can remove the "r" prefix from the regex string but then
    > > if I have other regex backslash symbols in it, I have to
    > > double all the other backslashes -- the very thing that
    > > the r-prefix was invented to avoid.

    >
    > > Or I can leave the "r" prefix and replace something like
    > > r'[ \u3000]' with r'[  ]'.  But that is confusing because
    > > one can't distinguish between the space character and
    > > the ideographic space character.  It is also a problem if a
    > > reader of the code doesn't have a font that can display
    > > the character.

    >
    > > Was there a reason for dropping the lexical processing of
    > > \u escapes in strings in python3 (other than to add another
    > > annoyance in a long list of python3 annoyances?)

    >
    > Probably it is more consistent. Alas, it makes the whole thing
    > incompatible with Py2.
    >
    > But if you think about it: why should \u be processed when \r, \n
    > etc. are not processed either?
    >
    > > And is there no choice for me but to choose between the two
    > > poor choices I mention above to deal with this problem?

    >
    > There is a 3rd one: use   r'[ ' + '\u3000' + ']'. Not very nice to read,
    > but should do the trick...
    >
    > Thomas


    I suggest looking at the problem differently. Python 3
    managed to bring order to the mismatched character-coding
    model that Python 2 offered.

    In your case, the

    >>> import unicodedata as ud
    >>> ud.name('\u3000')
    'IDEOGRAPHIC SPACE'

    "character" (in fact a unicode code point) is just as much
    a "character" as

    >>> ud.name('a')
    'LATIN SMALL LETTER A'

    The code point / unicode logic that Python 3 proposes and
    follows is straightforward.

    >>> s = 'a\u3000é\u3000€'
    >>> s.split('\u3000')
    ['a', 'é', '€']
    >>> import re
    >>> re.split('\u3000', s)
    ['a', 'é', '€']


    The backslash, used as a "real backslash", remains what it
    was in Python 2. Note the absence of r'...' .

    >>> s = 'a\\b\\c'
    >>> print(s)
    a\b\c
    >>> s.split('\\')
    ['a', 'b', 'c']
    >>> re.split('\\\\', s)
    ['a', 'b', 'c']
    >>> hex(ord('\\'))
    '0x5c'
    >>> re.split('\u005c\u005c', s)
    ['a', 'b', 'c']

    jmf
     
    jmfauth, May 30, 2012
    #5
  6. jmfauth Guest

    On May 30, 08:52, "" <> wrote:
    > In python2, "\u" escapes are processed in raw unicode
    > strings.  That is, ur'\u3000' is a string of length 1
    > consisting of the IDEOGRAPHIC SPACE unicode character.
    >
    > In python3, "\u" escapes are not processed in raw strings.
    > r'\u3000' is a string of length 6 consisting of a backslash,
    > 'u', '3' and three '0' characters.
    >
    > This breaks a lot of my code because in python 2
    >       re.split (ur'[\u3000]', u'A\u3000A') ==> [u'A', u'A']
    > but in python 3 (the result of running 2to3),
    >       re.split (r'[\u3000]', 'A\u3000A' ) ==> ['A\u3000A']
    >
    > I can remove the "r" prefix from the regex string but then
    > if I have other regex backslash symbols in it, I have to
    > double all the other backslashes -- the very thing that
    > the r-prefix was invented to avoid.
    >
    > Or I can leave the "r" prefix and replace something like
    > r'[ \u3000]' with r'[  ]'.  But that is confusing because
    > one can't distinguish between the space character and
    > the ideographic space character.  It is also a problem if a
    > reader of the code doesn't have a font that can display
    > the character.
    >
    > Was there a reason for dropping the lexical processing of
    > \u escapes in strings in python3 (other than to add another
    > annoyance in a long list of python3 annoyances?)
    >
    > And is there no choice for me but to choose between the two
    > poor choices I mention above to deal with this problem?



    I suggest looking at the problem differently. Python 3
    managed to bring order to the mismatched character-coding
    model that Python 2 offered.

    The 'IDEOGRAPHIC SPACE' and 'REVERSE SOLIDUS' (backslash)
    "characters" (in fact unicode code points) are just (normal)
    "characters". The backslash, used as an escaping command,
    keeps its function.

    Note the absence of r'...'

    >>> s = 'a\u3000é\u3000€'
    >>> s.split('\u3000')
    ['a', 'é', '€']
    >>> import re
    >>> re.split('\u3000', s)
    ['a', 'é', '€']


    >>> s = 'a\\b\\c'
    >>> print(s)
    a\b\c
    >>> s.split('\\')
    ['a', 'b', 'c']
    >>> re.split('\\\\', s)
    ['a', 'b', 'c']
    >>> hex(ord('\\'))
    '0x5c'
    >>> re.split('\u005c\u005c', s)
    ['a', 'b', 'c']

    jmf
     
    jmfauth, May 31, 2012
    #6
  7. Guest

    On 05/30/2012 09:07 AM, wrote:
    > On 05/30/2012 05:54 AM, Thomas Rachel wrote:
    >> On 30.05.2012 08:52, wrote:
    >>
    >>> This breaks a lot of my code because in python 2
    >>> re.split (ur'[\u3000]', u'A\u3000A') ==> [u'A', u'A']
    >>> but in python 3 (the result of running 2to3),
    >>> re.split (r'[\u3000]', 'A\u3000A' ) ==> ['A\u3000A']
    >>>
    >>> I can remove the "r" prefix from the regex string but then
    >>> if I have other regex backslash symbols in it, I have to
    >>> double all the other backslashes -- the very thing that
    >>> the r-prefix was invented to avoid.
    >>>
    >>> Or I can leave the "r" prefix and replace something like
    >>> r'[ \u3000]' with r'[  ]'. But that is confusing because
    >>> one can't distinguish between the space character and
    >>> the ideographic space character. It is also a problem if a
    >>> reader of the code doesn't have a font that can display
    >>> the character.
    >>>
    >>> Was there a reason for dropping the lexical processing of
    >>> \u escapes in strings in python3 (other than to add another
    >>> annoyance in a long list of python3 annoyances?)

    >>
    >> Probably it is more consistent. Alas, it makes the whole thing
    >> incompatible with Py2.
    >>
    >> But if you think about it: why should \u be processed when \r, \n
    >> etc. are not processed either?

    >
    > Maybe the blame is elsewhere then... If the re module
    > interprets (in a regex string) the 2-character string
    > consisting of r'\' followed by 'n' as a single newline
    > character, then why wasn't re changed for Python 3 to
    > interpret the 6-character string, r'\u3000' as a single
    > unicode character to correspond with Python's lexer no
    > longer doing that (as it did in Python 2)?
    >
    >>> And is there no choice for me but to choose between the two
    >>> poor choices I mention above to deal with this problem?

    >>
    >> There is a 3rd one: use r'[ ' + '\u3000' + ']'. Not very nice to read,
    >> but should do the trick...

    >
    > I guess the "+"s could be left out allowing something
    > like,
    >
    > '[ \u3000]' r'\w+ \d{3}'
    >
    > but I'll have to try it a little; maybe just doubling
    > backslashes won't be much worse. I did that for years
    > in Perl and lived through it.


    Just for some closure, there are many places in my code
    that I had/have to track down and change. But the biggest
    problem so far is a lexer module that is structured as many
    dozens of little functions, each with a docstring that is
    a regex string.

    The only way I found to change these and maintain sanity was
    to go through them, remove the "r" prefix from any strings
    that contain "\unnnn" literals, and then double any other
    backslashes in the string.

    Since these are docstrings, creating them with executable
    code was awkward, and using adjacent string concatenation
    led to a very confusing mix of string styles. Strings that
    used concatenation often had a single logical regex structure
    (eg a character set "[...]") split between two strings.
    The extra quote characters were as visually confusing as
    doubled backslashes in many cases.

    Strings with doubled backslashes, although harder to read,
    were much easier to edit reliably and, in their way, more
    regular. It does make this module look very Perlish
    though... :)
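The shape the module ended up with can be sketched with a toy stand-in for a Ply-style rule (the function name and pattern are invented; the point is that a cooked docstring expands \u3000 to one character while '\\w' survives as the two regex characters \w):

```python
import re

def t_IDEO_WORD(t):
    '[ \u3000]\\w+'   # cooked docstring: \u3000 becomes one char, \\w stays \w
    return t

# A Ply-like lexer reads its pattern out of the docstring:
pattern = t_IDEO_WORD.__doc__
print(re.findall(pattern, 'x\u3000abc'))  # → ['\u3000abc']
```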
     
    , May 31, 2012
    #7
  8. Guest

    On 05/31/2012 03:10 PM, Chris Angelico wrote:
    > On Fri, Jun 1, 2012 at 6:28 AM, <> wrote:
    >> ... a lexer module that is structured as many
    >> dozens of little functions, each with a docstring that is
    >> a regex string.

    >
    > This may be a good opportunity to take a step back and ask yourself:
    > Why so many functions, each with a regular expression in its
    > docstring?


    Because that's the way David Beazley designed Ply?
    http://dabeaz.com/ply/

    Personally, I think it's an abuse of docstrings but
    he never asked me for my opinion...
     
    , May 31, 2012
    #8
  9. This is a related question.

    I perform an octal dump on a file:
    $ od -cx file
    0000000 h e l l o w o r l d \n
    6568 6c6c 206f 6f77 6c72 0a64

    I want to output the names of those characters:
    $ python3
    Python 3.2.3 (default, May 19 2012, 17:01:30)
    [GCC 4.6.3] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import unicodedata
    >>> unicodedata.name("\u0068")
    'LATIN SMALL LETTER H'
    >>> unicodedata.name("\u0065")
    'LATIN SMALL LETTER E'

    But, how to do this programmatically:
    >>> first_two_letters = "6568 6c6c 206f 6f77 6c72 0a64".split()[0]
    >>> first_two_letters
    '6568'
    >>> first_letter = "00" + first_two_letters[2:]
    >>> first_letter
    '0068'

    Now what?
     
    Jason Friedman, Jun 16, 2012
    #9
  10. MRAB Guest

    On 16/06/2012 00:42, Jason Friedman wrote:
    > This is a related question.
    >
    > I perform an octal dump on a file:
    > $ od -cx file
    > 0000000 h e l l o w o r l d \n
    > 6568 6c6c 206f 6f77 6c72 0a64
    >
    > I want to output the names of those characters:
    > $ python3
    > Python 3.2.3 (default, May 19 2012, 17:01:30)
    > [GCC 4.6.3] on linux2
    > Type "help", "copyright", "credits" or "license" for more information.
    >>>> import unicodedata
    >>>> unicodedata.name("\u0068")
    > 'LATIN SMALL LETTER H'
    >>>> unicodedata.name("\u0065")
    > 'LATIN SMALL LETTER E'
    >
    > But, how to do this programmatically:
    >>>> first_two_letters = "6568 6c6c 206f 6f77 6c72 0a64".split()[0]
    >>>> first_two_letters
    > '6568'
    >>>> first_letter = "00" + first_two_letters[2:]
    >>>> first_letter
    > '0068'
    >
    > Now what?


    >>> hex_code = "65"
    >>> unicodedata.name(chr(int(hex_code, 16)))
    'LATIN SMALL LETTER E'
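Scaling this one-liner up to the whole od -x line needs one more detail: od -x prints little-endian 16-bit words, so each four-digit group holds its second byte first (a sketch; the repr() fallback for unnamed control characters like '\n' is an assumption, not something from the thread):

```python
import unicodedata

line = "6568 6c6c 206f 6f77 6c72 0a64"  # od -x words for "hello world\n"

chars = []
for word in line.split():
    high, low = word[:2], word[2:]
    chars.append(chr(int(low, 16)))   # low byte is the earlier byte in the file
    chars.append(chr(int(high, 16)))

print(''.join(chars))                 # hello world (plus a newline)
for c in chars:
    # name() raises for unnamed code points like '\n'; repr() is a fallback
    print(unicodedata.name(c, repr(c)))
```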
     
    MRAB, Jun 16, 2012
    #10
  11. >> This is a related question.
    >>
    >> I perform an octal dump on a file:
    >> $ od -cx file
    >> 0000000   h   e   l   l   o       w   o   r   l   d  \n
    >>            6568    6c6c    206f    6f77    6c72    0a64
    >>
    >> I want to output the names of those characters:
    >> $ python3
    >> Python 3.2.3 (default, May 19 2012, 17:01:30)
    >> [GCC 4.6.3] on linux2
    >> Type "help", "copyright", "credits" or "license" for more information.
    >>>>> import unicodedata
    >>>>> unicodedata.name("\u0068")
    >> 'LATIN SMALL LETTER H'
    >>>>> unicodedata.name("\u0065")
    >> 'LATIN SMALL LETTER E'
    >>
    >> But, how to do this programmatically:
    >>>>> first_two_letters = "6568 6c6c 206f 6f77 6c72 0a64".split()[0]
    >>>>> first_two_letters
    >> '6568'
    >>>>> first_letter = "00" + first_two_letters[2:]
    >>>>> first_letter
    >> '0068'
    >>
    >> Now what?


    >>>> hex_code = "65"
    >>>> unicodedata.name(chr(int(hex_code, 16)))
    > 'LATIN SMALL LETTER E'


    Very helpful, thank you MRAB.

    The finished product: http://pastebin.com/4egQcke2.
     
    Jason Friedman, Jun 16, 2012
    #11
