Unicode literals and byte string interpretation.

Discussion in 'Python' started by Fletcher Johnson, Oct 28, 2011.

  1. If I create a new Unicode object u'\x82\xb1\x82\xea\x82\xcd', how does
    this creation process interpret the bytes in the byte string? Does it
    assume the string represents a utf-16 encoding, a utf-8 encoding, or
    something else?

    For reference, the string is これは in the 'shift-jis' encoding.
    Fletcher Johnson, Oct 28, 2011

  2. Try it and see! One test case is worth a thousand words. And Python has an interactive interpreter. :)
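Taking that advice, a short session (sketched here in Python 3 syntax, where the bytes/text distinction is explicit; in Python 2 the b prefix is optional) shows the two interpretations side by side:

```python
# The same hex values mean different things as bytes vs. characters.
raw = b'\x82\xb1\x82\xea\x82\xcd'     # six bytes
print(raw.decode('shift-jis'))        # decoded as Shift-JIS -> これは
text = u'\x82\xb1\x82\xea\x82\xcd'    # six code points: U+0082, U+00B1, ...
print(len(text))                      # -> 6 characters, no decoding involved
```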

    - Dave
    David Riley, Oct 28, 2011

  3. Encodings define how characters are represented in bytes. I think
    probably what you're looking for is a byte string with those hex
    values in it, which you can then turn into a Unicode string:

    The u'....' notation is for Unicode strings, which are not encoded in
    any way. The \u escape form is a valid way of entering that string in
    your source code, identifying Unicode characters by their codepoints.
    Chris Angelico, Oct 28, 2011
  4. It doesn't, because there is no byte-string. You have created a Unicode
    object from a literal string of unicode characters, not bytes. Those
    characters are:

    Dec Hex Char
    130 0x82 ‚
    177 0xb1 ±
    130 0x82 ‚
    234 0xea ê
    130 0x82 ‚
    205 0xcd Í

    Don't be fooled by the fact that all of the characters happen to be in
    the range 0-255; that is irrelevant.

    None of the above. It assumes nothing. It takes a string of characters,
    end of story.
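    A quick check (sketched in Python 3 syntax) confirms this: the literal contains six characters whose codepoints merely happen to fit in one byte each:

    ```python
    s = u'\x82\xb1\x82\xea\x82\xcd'
    assert len(s) == 6                    # six characters, not a byte count
    codepoints = [hex(ord(c)) for c in s]
    print(codepoints)   # ['0x82', '0xb1', '0x82', '0xea', '0x82', '0xcd']
    ```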
    No it is not. The way to get a unicode literal with those characters is
    to use a unicode-aware editor or terminal:
    >>> for c in u'これは':
    ....     print ord(c), hex(ord(c)), c
    ....
    12371 0x3053 こ
    12428 0x308c れ
    12399 0x306f は

    You are confusing characters with bytes. I believe that what you are
    thinking of is the following: you start with a byte string, and then
    decode it into unicode:
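    For example (a sketch in Python 3 syntax; the original session here was Python 2):

    ```python
    byte_string = b'\x82\xb1\x82\xea\x82\xcd'         # bytes on disk or on the wire
    unicode_string = byte_string.decode('shift-jis')  # bytes -> text
    assert isinstance(unicode_string, str)            # text now, no longer bytes
    print(unicode_string)                             # -> これは
    ```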

    If you get the encoding wrong, you will get the wrong characters:
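    Decoding the same bytes with a mismatched codec (again sketched in Python 3 syntax) silently yields different characters:

    ```python
    data = b'\x82\xb1\x82\xea\x82\xcd'
    # latin-1 maps each byte straight to the codepoint of the same value:
    wrong = data.decode('latin-1')
    right = data.decode('shift-jis')
    assert wrong != right
    assert wrong == u'\x82\xb1\x82\xea\x82\xcd'   # the six "one-byte" characters again
    ```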

    If you start with the Unicode characters, you can encode them into
    various byte strings:
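    For instance (Python 3 syntax), each codec produces a different byte string from the same three characters:

    ```python
    text = u'\u3053\u308c\u306f'                  # これは
    assert text.encode('shift-jis') == b'\x82\xb1\x82\xea\x82\xcd'
    assert text.encode('utf-8') == b'\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xaf'
    print(text.encode('utf-16-le'))               # yet another byte sequence
    ```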
    Steven D'Aprano, Oct 28, 2011
  5. Thanks Steven. You are right. I was confusing characters with bytes.
    Fletcher Johnson, Nov 1, 2011
