What encoding does u'...' syntax use?


Ron Garret

I would have thought that the answer would be: the default encoding
(duh!) But empirically this appears not to be the case:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb5 in position 0:
ordinal not in range(128)

µ

(That last character shows up as a micron sign despite the fact that my
default encoding is ascii, so it seems to me that that unicode string
must somehow have picked up a latin-1 encoding.)
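
For context, the session that produced this was presumably something along
these lines (Python 2, with the default ascii codec; what the print line
actually displays depends on the terminal):

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> unicode('\xb5')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb5 in position 0: ordinal not in range(128)
>>> print u'\xb5'
µ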

rg
 

Stefan Behnel

Ron said:
I would have thought that the answer would be: the default encoding
(duh!) But empirically this appears not to be the case:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb5 in position 0:
ordinal not in range(128)
µ

(That last character shows up as a micron sign despite the fact that my
default encoding is ascii, so it seems to me that that unicode string
must somehow have picked up a latin-1 encoding.)

You are mixing up console output and internal data representation. What you
see in the last line is what the Python interpreter makes of your unicode
string when passing it into stdout, which in your case seems to use a
latin-1 encoding (check your environment settings for that).

BTW, Unicode is not an encoding. Wikipedia will tell you more.
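
A quick way to see what is involved at the console boundary (a sketch; the
encoding reported below depends entirely on your terminal settings):

>>> import sys
>>> sys.stdout.encoding          # what print uses for unicode strings on a terminal
'UTF-8'
>>> print u'\xb5'                # encoded with that codec on the way out
µ
>>> u'\xb5'.encode('latin-1')    # the same character as a single latin-1 byte
'\xb5'
>>> u'\xb5'.encode('utf-8')      # and as two utf-8 bytes
'\xc2\xb5'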

Stefan
 

Stefan Behnel

Stefan said:
What you
see in the last line is what the Python interpreter makes of your unicode
string when passing it into stdout, which in your case seems to use a
latin-1 encoding (check your environment settings for that).

The "seems to" is misleading. The example doesn't actually tell you
anything about the encoding used by your console, except that it can
display non-ASCII characters.

Stefan
 

Ron Garret

Stefan Behnel said:
You are mixing up console output and internal data representation. What you
see in the last line is what the Python interpreter makes of your unicode
string when passing it into stdout, which in your case seems to use a
latin-1 encoding (check your environment settings for that).

BTW, Unicode is not an encoding. Wikipedia will tell you more.

Yes, I know that. But every concrete representation of a unicode string
has to have an encoding associated with it, including unicode strings
produced by the Python parser when it parses the ascii string "u'\xb5'"

My question is: what is that encoding? It can't be ascii. So what is
it?

Put this another way: I would have thought that when the Python parser
parses "u'\xb5'" it would produce the same result as calling
unicode('\xb5'), but it doesn't. Instead it seems to produce the same
result as calling unicode('\xb5', 'latin-1'). But my default encoding
is not latin-1, it's ascii. So where is the Python parser getting its
encoding from? Why does parsing "u'\xb5'" not produce the same error as
calling unicode('\xb5')?
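
A side-by-side comparison in a Python 2 session (ascii default encoding
assumed) shows the behaviour in question:

>>> u'\xb5'                        # parsed as a unicode literal
u'\xb5'
>>> unicode('\xb5')                # decode byte 0xb5 with the default codec
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb5 in position 0: ordinal not in range(128)
>>> unicode('\xb5', 'latin-1')     # decode the same byte as latin-1
u'\xb5'
>>> u'\xb5' == unicode('\xb5', 'latin-1')
True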

rg
 

Terry Reedy

Ron said:
I would have thought that the answer would be: the default encoding
(duh!) But empirically this appears not to be the case:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb5 in position 0:
ordinal not in range(128)

The unicode function is usually used to decode bytes read from *external
sources*, each of which can have its own encoding. So the function
(actually, developer crew) refuses to guess and uses the ascii common
subset.
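
A sketch of the "external bytes need an explicit encoding" point (the file
name and its encoding here are purely hypothetical):

# Bytes read from outside the program carry no encoding of their own,
# so decode them explicitly instead of relying on the ascii default.
data = open('measurements.txt', 'rb').read()   # hypothetical latin-1 encoded file
text = data.decode('latin-1')                  # explicit choice; unicode(data) would assume ascii
print repr(text)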

Unicode literals are *in the source file*, which can only have one
encoding (for a given source file).

Ron said:
(That last character shows up as a micron sign despite the fact that my
default encoding is ascii, so it seems to me that that unicode string
must somehow have picked up a latin-1 encoding.)

I think latin-1 was the default without a coding cookie line. (Maybe
utf-8 in 3.0).
 

Matthew Woodcraft

Ron Garret said:
Put this another way: I would have thought that when the Python parser
parses "u'\xb5'" it would produce the same result as calling
unicode('\xb5'), but it doesn't. Instead it seems to produce the same
result as calling unicode('\xb5', 'latin-1'). But my default encoding
is not latin-1, it's ascii. So where is the Python parser getting its
encoding from? Why does parsing "u'\xb5'" not produce the same error
as calling unicode('\xb5')?

There is no encoding involved other than ascii, only processing of a
backslash escape.

The backslash escape '\xb5' is converted to the unicode character whose
ordinal number is B5h. This gives the same result as
"\xb5".decode("latin-1") because the unicode numbering is the same as
the 'latin-1' numbering in that range.
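
Both claims are easy to check in a Python 2 session:

>>> u'\xb5' == unichr(0xb5)              # the escape just names code point B5h
True
>>> u'\xb5' == '\xb5'.decode('latin-1')  # same character, because latin-1 maps byte B5h to U+00B5
True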

-M-
 

Martin v. Löwis

Ron Garret said:
Yes, I know that. But every concrete representation of a unicode string
has to have an encoding associated with it, including unicode strings
produced by the Python parser when it parses the ascii string "u'\xb5'"

My question is: what is that encoding?

The internal representation is either UTF-16, or UTF-32; which one is
a compile-time choice (i.e. when the Python interpreter is built).
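
The choice made for a particular interpreter shows up in sys.maxunicode:

>>> import sys
>>> sys.maxunicode   # 65535 on a narrow (2-byte) build, 1114111 on a wide (4-byte) build
1114111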

Ron Garret said:
Put this another way: I would have thought that when the Python parser
parses "u'\xb5'" it would produce the same result as calling
unicode('\xb5'), but it doesn't.

Right. In the former case, \xb5 denotes a Unicode character, namely
U+00B5, MICRO SIGN. It is the same as u"\u00b5", and still the same
as u"\N{MICRO SIGN}". By "the same", I mean "the very same".

OTOH, unicode('\xb5') is something entirely different. '\xb5' is a
byte string with length 1, with a single byte with the numeric
value 0xb5, or 181. It does not, per se, denote any specific character.
It only gets a character meaning when you try to decode it to unicode,
which you do with unicode('\xb5'). This is short for

unicode('\xb5', sys.getdefaultencoding())

and sys.getdefaultencoding() is (or should be) "ascii". Now, in
ASCII, byte 0xb5 does not have a meaning (i.e. it does not denote
a character at all), hence you get a UnicodeError.

Ron Garret said:
Instead it seems to produce the same
result as calling unicode('\xb5', 'latin-1').

Sure. However, this is only by coincidence, because latin-1 has the same
code points as Unicode (for 0..255).

Ron Garret said:
But my default encoding
is not latin-1, it's ascii. So where is the Python parser getting its
encoding from? Why does parsing "u'\xb5'" not produce the same error as
calling unicode('\xb5')?

Because \xb5 *directly* refers to character U+00b5, with no
byte-oriented encoding in-between.

Regards,
Martin
 

Martin v. Löwis

Terry Reedy said:
Unicode literals are *in the source file*, which can only have one
encoding (for a given source file).


I think latin-1 was the default without a coding cookie line. (Maybe
utf-8 in 3.0).

It is, but that's irrelevant for the example. In the source

u'\xb5'

all characters are ASCII (i.e. all of "letter u", "single
quote", "backslash", "letter x", "letter b", "digit 5").
As a consequence, this source text has the same meaning in all
supported source encodings (as source encodings must be ASCII
supersets).

The Unicode literal shown here does not get its interpretation
from Latin-1. Instead, it directly gets its interpretation from
the Unicode coded character set. The string is a short-hand
for

u'\u00b5'

and this denotes character U+00B5 (just as u'\u20ac' denotes
U+20AC; the same holds for any other u'\uXXXX').
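
The equivalences are easy to verify interactively:

>>> u'\xb5' == u'\u00b5' == u'\N{MICRO SIGN}'
True
>>> u'\u20ac'        # likewise a direct reference to U+20AC, EURO SIGN
u'\u20ac'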

HTH,
Martin
 

Ron Garret

Martin v. Löwis said:
The Unicode literal shown here does not get its interpretation
from Latin-1. Instead, it directly gets its interpretation from
the Unicode coded character set. The string is a short-hand
for

u'\u00b5'

Ah, that makes sense. Thanks!

rg
 

Ron Garret

Martin v. Löwis said:
Because \xb5 *directly* refers to character U+00b5, with no
byte-oriented encoding in-between.

OK, I think I get it now. Thanks!

rg
 

Terry Reedy

Martin v. Löwis wrote:
It is, but that's irrelevant for the example. In the source

u'\xb5'

all characters are ASCII (i.e. all of "letter u", "single
quote", "backslash", "letter x", "letter b", "digit 5").
As a consequence, this source text has the same meaning in all
supported source encodings (as source encodings must be ASCII
supersets).

I think I understand now that the coding cookie only matters if I use an
editor that actually stores *non-ascii* bytes in the file for the Python
parser to interpret.
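
A small file makes the contrast concrete (a sketch, assuming the file is
actually saved as utf-8 to match its cookie):

# -*- coding: utf-8 -*-
a = u'\xb5'    # ASCII-only source text: U+00B5 under any declared source encoding
b = u'µ'       # real non-ASCII bytes in the file: the coding cookie decides how they decode
print a == b   # True here, because the utf-8 bytes 0xC2 0xB5 decode to U+00B5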
 

Aahz

Martin v. Löwis said:
The internal representation is either UTF-16, or UTF-32; which one is
a compile-time choice (i.e. when the Python interpreter is built).

Wait, I thought it was UCS-2 or UCS-4? Or am I misremembering the
countless threads about the distinction between UTF and UCS?
 

Thorsten Kampe

* "Martin v. Löwis" (Sat, 21 Feb 2009 00:15:08 +0100)
The internal representation is either UTF-16, or UTF-32; which one is
a compile-time choice (i.e. when the Python interpreter is built).

I'm pretty much sure it is UCS-2 or UCS-4. (Yes, I know there is only a
slight difference to UTF-16/UTF-32).

Thorsten
 

Denis Kasak

Thorsten Kampe said:
I'm pretty much sure it is UCS-2 or UCS-4. (Yes, I know there is only a
slight difference to UTF-16/UTF-32).

I wouldn't call the difference that slight, especially between UTF-16
and UCS-2, since the former can encode all Unicode code points, while
the latter can only encode those in the BMP.
 

Martin v. Löwis

Aahz said:
Wait, I thought it was UCS-2 or UCS-4? Or am I misremembering the
countless threads about the distinction between UTF and UCS?

You are not misremembering. I personally never found them conclusive,
and, with PEP 261, I think, calling the 2-byte version "UCS-2" is
incorrect.

Regards,
Martin
 

Martin v. Löwis

Denis Kasak said:
I wouldn't call the difference that slight, especially between UTF-16
and UCS-2, since the former can encode all Unicode code points, while
the latter can only encode those in the BMP.

Indeed. As Python *can* encode all characters even in 2-byte mode
(since PEP 261), it seems clear that Python's Unicode representation
is *not* strictly UCS-2 anymore.
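
On a narrow (2-byte) build this shows up as surrogate pairs; a sketch, with
output as it would appear on a narrow build (a wide build reports length 1):

>>> s = u'\U0001d11e'    # MUSICAL SYMBOL G CLEF, outside the BMP
>>> len(s)               # stored as a surrogate pair
2
>>> s.encode('utf-8')    # the codec still writes it as one four-byte character
'\xf0\x9d\x84\x9e'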

Regards,
Martin
 

Denis Kasak

Martin v. Löwis said:
Indeed. As Python *can* encode all characters even in 2-byte mode
(since PEP 261), it seems clear that Python's Unicode representation
is *not* strictly UCS-2 anymore.

Since we're already discussing this, I'm curious - why was UCS-2
chosen over plain UTF-16 or UTF-8 in the first place for Python's
internal storage?
 

Adam Olsen

Aahz said:
Wait, I thought it was UCS-2 or UCS-4? Or am I misremembering the
countless threads about the distinction between UTF and UCS?

Nope, that's partly mislabeling and partly a bug. UCS-2/UCS-4 refer
to Unicode 1.1 and earlier, with no surrogates. We target Unicode
5.1.

If you naively encode UCS-2 as UTF-8 you really end up with CESU-8.
You miss the step where you combine surrogate pairs (which only exist
in UTF-16) into a single supplementary character. Lo and behold,
that's actually what current python does in some places. It's not
pretty.

See bugs #3297 and #3672.
 

Martin v. Löwis

Denis Kasak said:
Since we're already discussing this, I'm curious - why was UCS-2
chosen over plain UTF-16 or UTF-8 in the first place for Python's
internal storage?

You mean, originally? Originally, the choice was only between UCS-2
and UCS-4; choice was in favor of UCS-2 because of size concerns.
UTF-8 was ruled out easily because it doesn't allow constant-size
indexing; UTF-16 essentially for the same reason (plus there was
no point to UTF-16, since there were no assigned characters outside
the BMP).
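
The indexing point can be sketched like this: with a variable-width encoding
such as utf-8, position i is a byte offset, not a character offset:

>>> u = u'\xb5\u20ac'       # two characters
>>> u[1]                    # constant-time indexing by character
u'\u20ac'
>>> b = u.encode('utf-8')   # five bytes: two for the micro sign, three for the euro sign
>>> len(b), b[1]            # indexing the bytes does not give the second character
(5, '\xb5')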

Regards,
Martin
 

Denis Kasak

Martin v. Löwis said:
You mean, originally? Originally, the choice was only between UCS-2
and UCS-4; choice was in favor of UCS-2 because of size concerns.
UTF-8 was ruled out easily because it doesn't allow constant-size
indexing; UTF-16 essentially for the same reason (plus there was
no point to UTF-16, since there were no assigned characters outside
the BMP).

Yes, I failed to realise how long ago the unicode data type was
implemented originally. :)
Thanks for the explanation.
 
