Unicode literals and byte string interpretation.

Fletcher Johnson

If I create a new Unicode object u'\x82\xb1\x82\xea\x82\xcd', how does
this creation process interpret the bytes in the byte string? Does it
assume the string represents a utf-16 encoding, a utf-8 encoding,
etc...?

For reference the string is これは in the 'shift-jis' encoding.
 
David Riley

Try it and see! One test case is worth a thousand words. And Python has an interactive interpreter. :)


- Dave
 
Chris Angelico

Encodings define how characters are represented in bytes. I think
probably what you're looking for is a byte string with those hex
values in it, which you can then turn into a Unicode string:
u'\u3053\u308c\u306f'

The u'....' notation is for Unicode strings, which are not encoded in
any way. The last line of the above is a valid way of entering that
string in your source code, identifying Unicode characters by their
codepoints.
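
A minimal sketch of that conversion, written in Python 3 syntax (an assumption; the thread itself uses Python 2, where the byte string has no b prefix). The names raw and text are illustrative only:

```python
# Sketch for Python 3; `raw` and `text` are illustrative names.
raw = b'\x82\xb1\x82\xea\x82\xcd'   # six bytes; no characters yet
text = raw.decode('shift_jis')      # interpret those bytes as Shift-JIS

print(len(raw), len(text))          # byte count, then character count
print(ascii(text))                  # the result written as code point escapes
print(text == '\u3053\u308c\u306f') # same string as the escaped literal
```

The byte string needs an encoding to become text; the \uXXXX literal does not, because it already names the characters themselves.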

ChrisA
 
Steven D'Aprano

> If I create a new Unicode object u'\x82\xb1\x82\xea\x82\xcd' how does
> this creation process interpret the bytes in the byte string?

It doesn't, because there is no byte-string. You have created a Unicode
object from a literal string of unicode characters, not bytes. Those
characters are:

Dec Hex Char
130 0x82 ‚
177 0xb1 ±
130 0x82 ‚
234 0xea ê
130 0x82 ‚
205 0xcd Í

Don't be fooled that all of the characters happen to be in the range
0-255, that is irrelevant.
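
One way to check this is to ask for the code points directly. The sketch below assumes Python 3, where every plain string literal is a Unicode string, so the u prefix is dropped; the variable names are illustrative:

```python
import unicodedata

# In a text literal, \xNN names the code point U+00NN, not a byte.
s = '\x82\xb1\x82\xea\x82\xcd'

codepoints = [ord(c) for c in s]
print(codepoints)  # [130, 177, 130, 234, 130, 205]

# Show the general category and name of each character.
# U+0082 is a control character and has no name, hence the fallback.
for c in s:
    print(hex(ord(c)), unicodedata.category(c),
          unicodedata.name(c, '(unnamed control character)'))

# The same string spelled with \u escapes.
same = (s == '\u0082\u00b1\u0082\u00ea\u0082\u00cd')
print(same)
```

Because U+0082 is a control character, how it appears in a table like the one above depends on the terminal or page that renders it.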

> Does it
> assume the string represents a utf-16 encoding, a utf-8 encoding,
> etc...?

None of the above. It assumes nothing. It takes a string of characters,
end of story.

> For reference the string is これは in the 'shift-jis' encoding.

No it is not. The way to get a unicode literal with those characters is
to use a unicode-aware editor or terminal:
>>> s = u'これは'
>>> for c in s:
...     print ord(c), hex(ord(c)), c
...
12371 0x3053 こ
12428 0x308c れ
12399 0x306f は


You are confusing characters with bytes. I believe that what you are
thinking of is the following: you start with a byte string, and then
decode it into unicode:
これは


If you get the encoding wrong, you will get the wrong characters:
놂춂
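
A sketch of the same bytes decoded three ways, assuming Python 3. The choice of UTF-16 (little-endian) as the wrong encoding is an assumption; it appears to match the characters shown above, but the thread does not say which codec was used:

```python
raw = b'\x82\xb1\x82\xea\x82\xcd'

right = raw.decode('shift_jis')   # the encoding the bytes were written in
wrong = raw.decode('utf-16-le')   # pairs the bytes up as 16-bit units

print(ascii(right))
print(ascii(wrong))

# Some wrong guesses fail outright instead of producing wrong characters.
utf8_failed = False
try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    utf8_failed = True
    print('not valid UTF-8:', exc.reason)
```

A wrong codec can either silently produce different characters or raise an error, so the bytes alone never tell you which encoding is correct.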


If you start with the Unicode characters, you can encode them into various
byte strings:
>>> s.encode('shift-jis')
'\x82\xb1\x82\xea\x82\xcd'
>>> s.encode('utf-8')
'\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xaf'
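
The round trip can be sketched in Python 3 as follows (an assumption; Python 2 prints the byte strings without the b prefix). The dictionary encoded and the extra UTF-16 case are illustrative additions, not part of the original post:

```python
text = '\u3053\u308c\u306f'  # the three characters, given by code point

# Encode the same text with several codecs and compare the bytes.
encoded = {name: text.encode(name)
           for name in ('shift_jis', 'utf-8', 'utf-16-le')}

for name, data in encoded.items():
    print(name, len(data), data)
    # Decoding with the matching codec recovers the original text.
    assert data.decode(name) == text
```

The text is the same in every case; only the byte representation changes with the encoding.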
 
Fletcher Johnson

Thanks Steven. You are right. I was confusing characters with bytes.
 
