Unicode literals and byte string interpretation.

Discussion in 'Python' started by Fletcher Johnson, Oct 28, 2011.

  1. If I create a new Unicode object u'\x82\xb1\x82\xea\x82\xcd', how does
    this creation process interpret the bytes in the byte string? Does it
    assume the string represents a utf-16 encoding, a utf-8 encoding,
    etc.?

    For reference, the string is これは in the 'shift-jis' encoding.
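
    This can be checked directly in the interpreter. A minimal sketch in
    Python 3 syntax, where the str/bytes distinction is explicit (in
    Python 2 the u'...' literal behaves the same way):

    ```python
    # The u'...' literal below is six *characters* whose code points are
    # 0x82, 0xb1, 0x82, 0xea, 0x82, 0xcd -- no decoding takes place.
    s = u'\x82\xb1\x82\xea\x82\xcd'
    print([hex(ord(c)) for c in s])
    # ['0x82', '0xb1', '0x82', '0xea', '0x82', '0xcd']

    # To get これは, start from *bytes* and decode them as shift-jis:
    data = b'\x82\xb1\x82\xea\x82\xcd'
    print(data.decode('shift-jis'))   # これは
    ```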
     
    Fletcher Johnson, Oct 28, 2011
    #1

  2. David Riley Guest

    Try it and see! One test case is worth a thousand words. And Python has an interactive interpreter. :)


    - Dave
     
    David Riley, Oct 28, 2011
    #2

  3. Encodings define how characters are represented in bytes. I think
    probably what you're looking for is a byte string with those hex
    values in it, which you can then turn into a Unicode string:
    u'\u3053\u308c\u306f'

    The u'....' notation is for Unicode strings, which are not encoded in
    any way. The last line of the above is a valid way of entering that
    string in your source code, identifying Unicode characters by their
    codepoints.
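
    A quick sketch of what that means (Python 3 syntax; in Python 2 the
    same \uXXXX escapes work inside u'...' literals):

    ```python
    # \uXXXX escapes name Unicode code points directly, so this literal
    # is the same string you would get by typing the characters themselves
    # in a UTF-8 source file.
    by_codepoint = u'\u3053\u308c\u306f'
    typed = u'これは'
    print(by_codepoint == typed)                # True
    print([hex(ord(c)) for c in by_codepoint])  # ['0x3053', '0x308c', '0x306f']
    ```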

    ChrisA
     
    Chris Angelico, Oct 28, 2011
    #3
  4. It doesn't, because there is no byte-string. You have created a Unicode
    object from a literal string of unicode characters, not bytes. Those
    characters are:

    Dec Hex Char
    130 0x82 ‚
    177 0xb1 ±
    130 0x82 ‚
    234 0xea ê
    130 0x82 ‚
    205 0xcd Í

    Don't be fooled that all of the characters happen to be in the range
    0-255, that is irrelevant.

    None of the above. It assumes nothing. It takes a string of characters,
    end of story.
    No it is not. The way to get a unicode literal with those characters is
    to use a unicode-aware editor or terminal:
    >>> for c in u'\u3053\u308c\u306f':
    ...     print ord(c), hex(ord(c)), c
    ...
    12371 0x3053 こ
    12428 0x308c れ
    12399 0x306f は


    You are confusing characters with bytes. I believe that what you are
    thinking of is the following: you start with a byte string, and then
    decode it into unicode:
    >>> print '\x82\xb1\x82\xea\x82\xcd'.decode('shift-jis')
    これは


    If you get the encoding wrong, you will get the wrong characters:
    >>> print '\x82\xb1\x82\xea\x82\xcd'.decode('utf-16')
    놂춂


    If you start with the Unicode characters, you can encode it into various
    byte strings:
    >>> u'\u3053\u308c\u306f'.encode('utf-8')
    '\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xaf'
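
    The whole round trip fits in one self-contained snippet (Python 3
    syntax; 'latin-1' below is chosen purely as an illustrative wrong
    guess, it is not from the original post):

    ```python
    raw = b'\x82\xb1\x82\xea\x82\xcd'   # shift-jis bytes for これは

    # bytes --decode--> text
    text = raw.decode('shift-jis')
    print(text)                          # これは

    # text --encode--> bytes, with whatever codec you choose
    print(text.encode('utf-8'))          # b'\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xaf'

    # Decoding with the wrong codec silently yields the wrong characters
    # (mojibake) -- latin-1 maps each byte straight to U+0080..U+00FF:
    print(raw.decode('latin-1'))
    ```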
     
    Steven D'Aprano, Oct 28, 2011
    #4
  5. Thanks Steven. You are right. I was confusing characters with bytes.
     
    Fletcher Johnson, Nov 1, 2011
    #5
