Unicode literals and byte string interpretation.

Discussion in 'Python' started by Fletcher Johnson, Oct 28, 2011.

  1. If I create a new Unicode object u'\x82\xb1\x82\xea\x82\xcd', how does
    this creation process interpret the bytes in the byte string? Does it
    assume the string represents a utf-16 encoding, a utf-8 encoding,
    etc.?

    For reference, the string is これは in the 'shift-jis' encoding.
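
    This can be checked directly in the interpreter. A minimal sketch in
    Python 3 syntax, where the str/bytes distinction is explicit (in
    Python 2 the u'...' literal behaves the same way):

    ```python
    # The u'...' literal below is six *characters* whose code points are
    # 0x82, 0xb1, 0x82, 0xea, 0x82, 0xcd -- no decoding takes place.
    s = u'\x82\xb1\x82\xea\x82\xcd'
    print([hex(ord(c)) for c in s])
    # ['0x82', '0xb1', '0x82', '0xea', '0x82', '0xcd']

    # To get これは, start from *bytes* and decode them as shift-jis:
    data = b'\x82\xb1\x82\xea\x82\xcd'
    print(data.decode('shift-jis'))   # これは
    ```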
     
    Fletcher Johnson, Oct 28, 2011
    #1

  2. David Riley Guest

    Try it and see! One test case is worth a thousand words. And Python has an interactive interpreter. :)


    - Dave
     
    David Riley, Oct 28, 2011
    #2

  3. Encodings define how characters are represented in bytes. I think
    probably what you're looking for is a byte string with those hex
    values in it, which you can then turn into a Unicode string:
    u'\u3053\u308c\u306f'

    The u'....' notation is for Unicode strings, which are not encoded in
    any way. The last line of the above is a valid way of entering that
    string in your source code, identifying Unicode characters by their
    codepoints.
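
    A quick sketch of what that means (Python 3 syntax; in Python 2 the
    same \uXXXX escapes work inside u'...' literals):

    ```python
    # \uXXXX escapes name Unicode code points directly, so this literal
    # is the same string you would get by typing the characters themselves
    # in a UTF-8 source file.
    by_codepoint = u'\u3053\u308c\u306f'
    typed = u'これは'
    print(by_codepoint == typed)                # True
    print([hex(ord(c)) for c in by_codepoint])  # ['0x3053', '0x308c', '0x306f']
    ```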

    ChrisA
     
    Chris Angelico, Oct 28, 2011
    #3
  4. It doesn't, because there is no byte-string. You have created a Unicode
    object from a literal string of unicode characters, not bytes. Those
    characters are:

    Dec Hex Char
    130 0x82 ‚
    177 0xb1 ±
    130 0x82 ‚
    234 0xea ê
    130 0x82 ‚
    205 0xcd Í

    Don't be fooled that all of the characters happen to be in the range
    0-255, that is irrelevant.

    None of the above. It assumes nothing. It takes a string of characters,
    end of story.
    No it is not. The way to get a unicode literal with those characters is
    to use a unicode-aware editor or terminal:
    >>> for c in u'\u3053\u308c\u306f':
    ...     print ord(c), hex(ord(c)), c
    ...
    12371 0x3053 こ
    12428 0x308c れ
    12399 0x306f は


    You are confusing characters with bytes. I believe that what you are
    thinking of is the following: you start with a byte string, and then
    decode it into unicode:
    >>> print '\x82\xb1\x82\xea\x82\xcd'.decode('shift-jis')
    これは


    If you get the encoding wrong, you will get the wrong characters:
    >>> print '\x82\xb1\x82\xea\x82\xcd'.decode('utf-16')
    놂춂


    If you start with the Unicode characters, you can encode it into various
    byte strings:
    >>> u'\u3053\u308c\u306f'.encode('utf-8')
    '\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xaf'
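
    The whole round trip fits in one self-contained snippet (Python 3
    syntax; 'latin-1' below is chosen purely as an illustrative wrong
    guess, it is not from the original post):

    ```python
    raw = b'\x82\xb1\x82\xea\x82\xcd'   # shift-jis bytes for これは

    # bytes --decode--> text
    text = raw.decode('shift-jis')
    print(text)                          # これは

    # text --encode--> bytes, with whatever codec you choose
    print(text.encode('utf-8'))          # b'\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xaf'

    # Decoding with the wrong codec silently yields the wrong characters
    # (mojibake) -- latin-1 maps each byte straight to U+0080..U+00FF:
    print(raw.decode('latin-1'))
    ```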
     
    Steven D'Aprano, Oct 28, 2011
    #4
  5. Thanks Steven. You are right. I was confusing characters with bytes.
     
    Fletcher Johnson, Nov 1, 2011
    #5
