A few questions about encoding

Discussion in 'Python' started by Νικόλαος Κούρας, Jun 9, 2013.

  1. A few questions about encoding please:

    >> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
    >> values up to 256?


    >Because then how do you tell when you need one byte, and when you need
    >two? If you read two bytes, and see 0x4C 0xFA, does that mean two
    >characters, with ordinal values 0x4C and 0xFA, or one character with
    >ordinal value 0x4CFA?


    I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256.


    >> UTF-8 and UTF-16 and UTF-32
    >> I thought the number beside UTF- was to declare how many bits the
    >> character set was using to store a character into the hdd, no?


    >Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
    >UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
    >values to make a surrogate pair.


    A surrogate pair is like hitting, for example, Ctrl-A, which means it is a combination character that consists of 2 different characters?
    Is this what a surrogate is? A pair of 2 chars?


    >UTF-8 uses 8-bit values, but sometimes
    >it combines two, three or four of them to represent a single code-point.


    'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65)
    'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is > 127 )
    'a chinese ideogram' to be utf8 encoded needs 4 bytes to be stored ? (since ordinal > 65000 )

    The number of bytes needed to store a character solely depends on the character's ordinal value in the Unicode table?
     
    Νικόλαος Κούρας, Jun 9, 2013
    #1

  2. On 9 Jun 2013 11:49, "Νικόλαος Κούρας" <> wrote:
    >
    > A few questions about encoding please:
    >
    > >> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
    > >> values up to 256?

    >
    > >Because then how do you tell when you need one byte, and when you need
    > >two? If you read two bytes, and see 0x4C 0xFA, does that mean two
    > >characters, with ordinal values 0x4C and 0xFA, or one character with
    > >ordinal value 0x4CFA?

    >
    > I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant
    > up to 256, not above 256.
    >
    >
    > >> UTF-8 and UTF-16 and UTF-32
    > >> I thought the number beside UTF- was to declare how many bits the
    > >> character set was using to store a character into the hdd, no?

    >
    > >Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
    > >UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
    > >values to make a surrogate pair.

    >
    > A surrogate pair is like hitting, for example, Ctrl-A, which means it is a
    > combination character that consists of 2 different characters?
    > Is this what a surrogate is? A pair of 2 chars?
    >
    >
    > >UTF-8 uses 8-bit values, but sometimes
    > >it combines two, three or four of them to represent a single code-point.

    >
    > 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65)
    > 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is > 127 )
    > 'a chinese ideogram' to be utf8 encoded needs 4 bytes to be stored ? (since ordinal > 65000 )
    >
    > The number of bytes needed to store a character solely depends on the
    > character's ordinal value in the Unicode table?
    > --
    > http://mail.python.org/mailman/listinfo/python-list


    In short, a utf-8 character takes 1 to 4 bytes. A utf-16 character takes 2
    to 4 bytes. A utf-32 always takes 4 bytes.

    The process of converting characters to bytes is called encoding. The
    opposite is decoding. This is all made transparent in python with the
    encode() and decode() methods. You normally don't care about this kind of
    thing.
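
    A minimal sketch of that round trip (my own addition, not part of the
    original post), assuming Python 3:

    # str.encode() turns characters into bytes; bytes.decode() reverses it.
    text = 'αβγ'                       # three Greek characters
    data = text.encode('utf-8')        # b'\xce\xb1\xce\xb2\xce\xb3'
    assert data.decode('utf-8') == text
    print(len(text), len(data))        # 3 characters, 6 bytes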
     
    Fábio Santos, Jun 9, 2013
    #2

  3. Νικόλαος Κούρας

    Nobody Guest

    On Sun, 09 Jun 2013 03:44:57 -0700, Νικόλαος Κούρας wrote:

    >>> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
    >>> values up to 256?

    >
    >>Because then how do you tell when you need one byte, and when you need
    >>two? If you read two bytes, and see 0x4C 0xFA, does that mean two
    >>characters, with ordinal values 0x4C and 0xFA, or one character with
    >>ordinal value 0x4CFA?

    >
    > I mean utf-8 could use 1 byte for storing the 1st 256 characters. I
    > meant up to 256, not above 256.


    But then you've used up all 256 possible bytes for storing the first 256
    characters, and there aren't any left for use in multi-byte sequences.

    You need some means to distinguish between a single-byte character and an
    individual byte within a multi-byte sequence.

    UTF-8 does that by allocating specific ranges to specific purposes.
    0x00-0x7F are single-byte characters, 0x80-0xBF are continuation bytes of
    multi-byte sequences, 0xC0-0xFF are leading bytes of multi-byte sequences.

    This scheme has the advantage of making UTF-8 non-modal, i.e. if a byte is
    corrupted, added or removed, it will only affect the character containing
    that particular byte; the decoder can re-synchronise at the beginning of
    the following character.

    OTOH, with encodings such as UTF-16, UTF-32 or ISO-2022, adding or
    removing a byte will result in desynchronisation, with all subsequent
    characters being corrupted.
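
    A small sketch of those byte ranges (my own illustration, assuming
    Python 3): classify each byte of a UTF-8 encoded string.

    def utf8_byte_kind(b):
        # Ranges as described above: 0x00-0x7F single-byte, 0x80-0xBF
        # continuation, 0xC0-0xFF leading byte of a multi-byte sequence.
        if b <= 0x7F:
            return 'single-byte (ASCII)'
        if b <= 0xBF:
            return 'continuation'
        return 'leading'

    for b in 'aα中'.encode('utf-8'):   # 1-, 2- and 3-byte characters
        print(hex(b), utf8_byte_kind(b))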

    > A surrogate pair is like hitting, for example, Ctrl-A, which means it is a
    > combination character that consists of 2 different characters? Is this
    > what a surrogate is? A pair of 2 chars?


    A surrogate pair is a pair of 16-bit codes used to represent a single
    Unicode character whose code is greater than 0xFFFF.

    The 2048 codepoints from 0xD800 to 0xDFFF inclusive aren't used to
    represent characters, but "surrogates". Unicode characters with codes
    in the range 0x10000-0x10FFFF are represented in UTF-16 as a pair of
    surrogates. First, 0x10000 is subtracted from the code, giving a value in
    the range 0-0xFFFFF (20 bits). The top ten bits are added to 0xD800 to
    give a value in the range 0xD800-0xDBFF, while the bottom ten bits are
    added to 0xDC00 to give a value in the range 0xDC00-0xDFFF.
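
    A short sketch of that arithmetic (my own illustration, assuming Python 3),
    using U+1D401 as the example codepoint:

    cp = 0x1D401                      # a character above 0xFFFF
    v = cp - 0x10000                  # 20-bit value
    high = 0xD800 + (v >> 10)         # top ten bits  -> high surrogate
    low = 0xDC00 + (v & 0x3FF)        # bottom ten bits -> low surrogate
    print(hex(high), hex(low))        # 0xd835 0xdc01
    # Cross-check against Python's own UTF-16 (big-endian, no BOM) encoder:
    assert chr(cp).encode('utf-16-be') == bytes([high >> 8, high & 0xFF,
                                                 low >> 8, low & 0xFF])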

    Because the codes used for surrogates aren't valid as individual
    characters, scanning a string for a particular character won't
    accidentally match part of a multi-word character.

    > 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65)
    > 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is > 127 )
    > 'a chinese ideogram' to be utf8 encoded needs 4 bytes to be stored ? (since ordinal > 65000 )


    Most Chinese, Japanese and Korean (CJK) characters have codepoints within
    the BMP (i.e. <= 0xFFFF), so they only require 3 bytes in UTF-8. The
    codepoints above the BMP are mostly for archaic ideographs (those no
    longer in normal use), mathematical symbols, dead languages, etc.

    > The number of bytes needed to store a character solely depends on the
    > character's ordinal value in the Unicode table?


    Yes. UTF-8 is essentially a mechanism for representing 31-bit unsigned
    integers such that smaller integers require fewer bytes than larger
    integers (subsequent revisions of Unicode cap the range of possible
    codepoints to 0x10FFFF, as that's all that UTF-16 can handle).
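
    A quick check of that claim (my own addition, assuming Python 3): the
    UTF-8 length grows with the codepoint's ordinal value.

    for cp in (0x41, 0x3B1, 0x4E2D, 0x1D401):        # 'A', 'α', '中', '𝐁'
        print(hex(cp), len(chr(cp).encode('utf-8')))  # 1, 2, 3, 4 bytes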
     
    Nobody, Jun 9, 2013
    #3
  4. On Sun, Jun 9, 2013 at 12:44 PM, Νικόλαος Κούρας <> wrote:
    > A few questions about encoding please:
    >
    >>> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
    >>> values up to 256?

    >
    >>Because then how do you tell when you need one byte, and when you need
    >>two? If you read two bytes, and see 0x4C 0xFA, does that mean two
    >>characters, with ordinal values 0x4C and 0xFA, or one character with
    >>ordinal value 0x4CFA?

    >
    > I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256.


    It is required so the computer can know where characters begin.
    0x0080 (first non-ASCII character) becomes 0xC280 in UTF-8. Further
    details here: http://en.wikipedia.org/wiki/UTF-8#Description
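
    A quick check of that example (my own addition, assuming Python 3):

    print('\u0080'.encode('utf-8'))   # b'\xc2\x80' -- U+0080 becomes 0xC2 0x80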

    >>> UTF-8 and UTF-16 and UTF-32
    >>> I thought the number beside UTF- was to declare how many bits the
    >>> character set was using to store a character into the hdd, no?

    >
    >>Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
    >>UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
    >>values to make a surrogate pair.

    >
    > A surrogate pair is like hitting, for example, Ctrl-A, which means it is a combination character that consists of 2 different characters?
    > Is this what a surrogate is? A pair of 2 chars?


    http://en.wikipedia.org/wiki/UTF-16#Code_points_U.2B10000_to_U.2B10FFFF

    Long story short: codepoint - 0x10000 (up to 20 bits) → two 10-bit
    numbers → 0xD800 + first_half, 0xDC00 + second_half. Rephrasing:

    We take MATHEMATICAL BOLD CAPITAL B (U+1D401). If you have UTF-8: 𝐁

    It is over 0xFFFF, and we need to use surrogate pairs. We end up with
    0xD401, or 0b1101010000000001. Both representations are worthless, as
    we have a 16-bit number, not a 20-bit one. We throw in some leading
    zeroes and end up with 0b00001101010000000001. Split it in half and
    we get 0b0000110101 and 0b0000000001, which we can now shorten to
    0b110101 and 0b1, or translate to hex as 0x0035 and 0x0001. 0xD800 +
    0x0035 and 0xDC00 + 0x0001 → 0xD835 0xDC01. Type it into python and:

    >>> b'\xD8\x35\xDC\x01'.decode('utf-16be')

    '𝐁'

    And before you ask: that "BE" stands for Big-Endian. Little-Endian
    would mean reversing the bytes within each code unit, which would make it
    '\x35\xD8\x01\xDC' (for example, 'a' (U+0061) comes out as the bytes
    0x61 0x00 in a little-endian encoding).
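
    A quick illustration of the byte order (my own addition, assuming Python 3):

    print('a'.encode('utf-16-be'))   # b'\x00a'  -> bytes 0x00 0x61
    print('a'.encode('utf-16-le'))   # b'a\x00'  -> bytes 0x61 0x00, reversed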

    Another question you may ask: 0xD800…0xDFFF are reserved in Unicode
    for the purposes of UTF-16, so there are no conflicts.

    >>UTF-8 uses 8-bit values, but sometimes
    >>it combines two, three or four of them to represent a single code-point.

    >
    > 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65)
    > 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is > 127 )


    yup. α is at 0x03B1, or 945 decimal.
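
    A quick check (my own addition, assuming Python 3):

    print('α'.encode('utf-8'))   # b'\xce\xb1' -- two bytes, as expected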

    > 'a chinese ideogram' to be utf8 encoded needs 4 bytes to be stored ? (since ordinal > 65000 )


    Not necessarily, as CJK characters start at U+2E80, which is in the
    3-byte range (0x0800 through 0xFFFF) — the table is here:
    http://en.wikipedia.org/wiki/UTF-8#Description

    --
    Kwpolska <http://kwpolska.tk> | GPG KEY: 5EAAEA16
    stop html mail | always bottom-post
    http://asciiribbon.org | http://caliburn.nl/topposting.html
     
    Chris "Kwpolska" Warrick, Jun 9, 2013
    #4
  5. On Wed, 12 Jun 2013 09:09:05 +0000, Νικόλαος Κούρας wrote:

    > Isn't 14 bits way too many to store a character ?


    No.

    There are 1114111 possible characters in Unicode. (And in Japan, they
    sometimes use TRON instead of Unicode, which has even more.)

    If you list out all the combinations of 14 bits:

    0000 0000 0000 00
    0000 0000 0000 01
    0000 0000 0000 10
    0000 0000 0000 11
    [...]
    1111 1111 1111 10
    1111 1111 1111 11

    you will see that there are only 32767 (2**15-1) such values. You can't
    fit 1114111 characters with just 32767 values.



    --
    Steven
     
    Steven D'Aprano, Jun 12, 2013
    #5
  6. On 12/6/2013 12:24 μμ, Steven D'Aprano wrote:
    > On Wed, 12 Jun 2013 09:09:05 +0000, Νικόλαος Κούρας wrote:
    >
    >> Isn't 14 bits way too many to store a character ?

    >
    > No.
    >
    > There are 1114111 possible characters in Unicode. (And in Japan, they
    > sometimes use TRON instead of Unicode, which has even more.)
    >
    > If you list out all the combinations of 14 bits:
    >
    > 0000 0000 0000 00
    > 0000 0000 0000 01
    > 0000 0000 0000 10
    > 0000 0000 0000 11
    > [...]
    > 1111 1111 1111 10
    > 1111 1111 1111 11
    >
    > you will see that there are only 32767 (2**15-1) such values. You can't
    > fit 1114111 characters with just 32767 values.
    >
    >
    >

    Thanks Steven,
    So, how many bytes does UTF-8 store for codepoints > 127 ?

    example for codepoint 256, 1345, 16474 ?
     
    Νικόλαος Κούρας, Jun 12, 2013
    #6
  7. Νικόλαος Κούρας

    Dave Angel Guest

    On 06/12/2013 05:24 AM, Steven D'Aprano wrote:
    > On Wed, 12 Jun 2013 09:09:05 +0000, Νικόλαος Κούρας wrote:
    >
    >> Isn't 14 bits way too many to store a character ?

    >
    > No.
    >
    > There are 1114111 possible characters in Unicode. (And in Japan, they
    > sometimes use TRON instead of Unicode, which has even more.)
    >
    > If you list out all the combinations of 14 bits:
    >
    > 0000 0000 0000 00
    > 0000 0000 0000 01
    > 0000 0000 0000 10
    > 0000 0000 0000 11
    > [...]
    > 1111 1111 1111 10
    > 1111 1111 1111 11
    >
    > you will see that there are only 32767 (2**15-1) such values. You can't
    > fit 1114111 characters with just 32767 values.
    >
    >


    Actually, it's worse. There are 16384 such values (2**14), assuming you
    include null, which you did in your list.
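
    A quick check of the arithmetic (my own addition, assuming Python 3):

    print(2**14)      # 16384 distinct 14-bit patterns
    print(0x110000)   # 1114112 codepoints in the range U+0000..U+10FFFF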

    --
    DaveA
     
    Dave Angel, Jun 12, 2013
    #7
  8. On 12.06.2013 13:23, Νικόλαος Κούρας wrote:
    > So, how many bytes does UTF-8 store for codepoints > 127 ?


    What has your research turned up? I personally consider it lazy and
    disrespectful to get lots of pointers that you could use for further
    research and then ask for more info before you have even followed those links.


    > example for codepoint 256, 1345, 16474 ?


    Yes, examples exist. Gee, if only there were an information network that
    you could access and where you could locate information on various
    programming-related topics somehow. Seriously, someone should invent
    this thing! But still, even without it, you have all the tools (i.e.
    Python) in your hand to generate these examples yourself! Check out ord,
    bin, encode, decode for a start.
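
    For instance, a small sketch along those lines (my own addition, not
    Ulrich's, assuming Python 3), using one of the codepoints asked about:

    ch = chr(16474)                    # the character at codepoint 16474
    print(ord(ch), bin(ord(ch)))       # 16474 0b100000001011010
    print(ch.encode('utf-8'))          # b'\xe4\x81\x9a' -> three bytes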


    Uli
     
    Ulrich Eckhardt, Jun 12, 2013
    #8
  9. Νικόλαος Κούρας

    Nobody Guest

    On Wed, 12 Jun 2013 14:23:49 +0300, Νικόλαος Κούρας wrote:

    > So, how many bytes does UTF-8 store for codepoints > 127 ?


    U+0000..U+007F 1 byte
    U+0080..U+07FF 2 bytes
    U+0800..U+FFFF 3 bytes
    >=U+10000 4 bytes


    So, 1 byte for ASCII, 2 bytes for other Latin characters, Greek, Cyrillic,
    Arabic, and Hebrew, 3 bytes for Chinese/Japanese/Korean, 4 bytes for dead
    languages and mathematical symbols.

    The mechanism used by UTF-8 allows sequences of up to 6 bytes, for a total
    of 31 bits, but UTF-16 is limited to U+10FFFF (slightly more than 20 bits).
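
    A quick check of those boundaries (my own illustration, assuming Python 3):

    for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000):
        print(hex(cp), len(chr(cp).encode('utf-8')))   # 1, 2, 2, 3, 3, 4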
     
    Nobody, Jun 12, 2013
    #9
  10. On Wed, 12 Jun 2013 14:23:49 +0300, Νικόλαος Κούρας wrote:

    > So, how many bytes does UTF-8 store for codepoints > 127 ?


    Two, three or four, depending on the codepoint.


    > example for codepoint 256, 1345, 16474 ?


    You can do this yourself. I have already given you enough information in
    previous emails to answer this question on your own, but here it is again:

    Open an interactive Python session, and run this code:

    c = ord(16474)
    len(c.encode('utf-8'))


    That will tell you how many bytes are used for that example.



    --
    Steven
     
    Steven D'Aprano, Jun 13, 2013
    #10
  11. On Wed, 12 Jun 2013 21:30:23 +0100, Nobody wrote:

    > The mechanism used by UTF-8 allows sequences of up to 6 bytes, for a
    > total of 31 bits, but UTF-16 is limited to U+10FFFF (slightly more than
    > 20 bits).


    Same with UTF-8 and UTF-32, both of which are limited to U+10FFFF because
    that is what Unicode is limited to.

    The *mechanism* of UTF-8 can go up to 6 bytes (or even 7 perhaps?), but
    that's not UTF-8, that's UTF-8-plus-extra-codepoints. Likewise the
    mechanism of UTF-32 could go up to 0xFFFFFFFF, but doing so means you
    don't have Unicode chars any more, and hence your byte-string is not
    valid UTF-32:

    py> b = b'\xFF'*8
    py> b.decode('UTF-32')
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf32' codec can't decode bytes in position 0-3:
    codepoint not in range(0x110000)


    --
    Steven
     
    Steven D'Aprano, Jun 13, 2013
    #11
  12. On Thu, Jun 13, 2013 at 11:40 AM, Steven D'Aprano
    <> wrote:
    > The *mechanism* of UTF-8 can go up to 6 bytes (or even 7 perhaps?), but
    > that's not UTF-8, that's UTF-8-plus-extra-codepoints.


    And a proper UTF-8 decoder will reject "\xC0\x80" and "\xed\xa0\x80",
    even though mathematically they would translate into U+0000 and U+D800
    respectively. The UTF-16 *mechanism* is limited to no more than
    Unicode has currently used, but I'm left wondering if that's actually
    the other way around - that Unicode planes were deemed to stop at the
    point where UTF-16 can't encode any more. Not that it matters; with
    most of the current planes completely unallocated, it seems unlikely
    we'll be needing more.
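
    A quick demonstration of that rejection (my own sketch, assuming Python 3):

    for bad in (b'\xc0\x80', b'\xed\xa0\x80'):   # overlong U+0000, lone surrogate
        try:
            bad.decode('utf-8')
        except UnicodeDecodeError as e:
            print(bad, '->', e.reason)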

    ChrisA
     
    Chris Angelico, Jun 13, 2013
    #12
  13. On 13/6/2013 3:13 πμ, Steven D'Aprano wrote:
    > On Wed, 12 Jun 2013 14:23:49 +0300, Νικόλαος Κούρας wrote:
    >
    >> So, how many bytes does UTF-8 store for codepoints > 127 ?

    >
    > Two, three or four, depending on the codepoint.


    The number of bytes needed by UTF-8 to store a code-point (character)
    depends on the ordinal value of the code-point in the Unicode charset,
    correct?

    If this is correct, then the higher the ordinal value (which is a decimal
    integer) in the Unicode charset, the more bytes are needed for storage.

    It's like the bigger a decimal integer is, the bigger the binary number it
    produces.

    Is this correct?


    >> example for codepoint 256, 1345, 16474 ?

    >
    > You can do this yourself. I have already given you enough information in
    > previous emails to answer this question on your own, but here it is again:
    >
    > Open an interactive Python session, and run this code:
    >
    > c = ord(16474)
    > len(c.encode('utf-8'))
    >
    >
    > That will tell you how many bytes are used for that example.

    This is actually wrong.

    ord()'s argument must be a character for which we expect its ordinal value.

    >>> chr(16474)

    '䁚'

    Some Chinese symbol.
    So code-point '䁚' has a Unicode ordinal value of 16474, correct?

    wherein encoding this glyph's ordinal value to binary gives us
    the following bytes:

    >>> bin(16474).encode('utf-8')

    b'0b100000001011010'

    Now, we take two symbols out:

    the 'b' symbol, which is there to tell us that we are looking at a bytes
    object, as well as the
    '0b' symbol, which is there to tell us that we are looking at a binary
    representation of a bytes object.

    Thus, there we count 15 bits left.
    So it says 15 bits, which is 1 bit less than 2 bytes.
    Are the above statements correct, please?


    but thinking this through more and more:

    >>> chr(16474).encode('utf-8')

    b'\xe4\x81\x9a'
    >>> len(b'\xe4\x81\x9a')

    3

    it seems that the bytestring the encode process produces is of length 3.

    So I take it that is 3 bytes?

    but there is a mismatch between what >>> bin(16474).encode('utf-8') and >>>
    chr(16474).encode('utf-8') are telling us here.

    Care to explain that too please ?
     
    Νικόλαος Κούρας, Jun 13, 2013
    #13
  14. On 12/6/2013 11:30 μμ, Nobody wrote:
    > On Wed, 12 Jun 2013 14:23:49 +0300, Νικόλαος Κούρας wrote:
    >
    >> So, how many bytes does UTF-8 store for codepoints > 127 ?

    >
    > U+0000..U+007F 1 byte
    > U+0080..U+07FF 2 bytes
    > U+0800..U+FFFF 3 bytes
    > >=U+10000 4 bytes


    'U' stands for Unicode code-point, which means a character, right?

    How are you able to tell up to which character utf-8 needs 1 byte or 2
    bytes or 3?


    And some of the bytes' bits are used to tell where a code-point's
    representation stops, right? I mean, if we have a code-point that needs
    2 bytes to be stored, then the high bit must be set to 1 to signify that
    this character's encoding stops at 2 bytes.

    I just know that 2^8 = 256; that is, at first look, 256 places, which means
    256 positions to hold a code-point, which in turn means a character.

    We take the high bit out and then we have 2^7, which is enough positions
    for 0-127 standard ASCII. The high bit is set to '0' to signify that the
    char is encoded in 1 byte.

    Please tell me whether I understood correctly so far.

    But how about for 2 or 3 or 4 bytes?

    Am I saying it correctly?
     
    Νικόλαος Κούρας, Jun 13, 2013
    #14
  15. Νικόλαος Κούρας

    jmfauth Guest

    ------

    UTF-8, Unicode (consortium): 1 to 4 *Unicode Transformation Units*

    UTF-8, ISO 10646: 1 to 6 *Unicode Transformation Units*

    (still current, unless really recently modified)

    jmf
     
    jmfauth, Jun 13, 2013
    #15
  16. On Thu, Jun 13, 2013 at 4:21 PM, Νικόλαος Κούρας <> wrote:
    > How are you able to tell up to which character utf-8 needs 1 byte or 2
    > bytes or 3?


    You look up Wikipedia, using the handy links that have been put to you
    MULTIPLE TIMES.

    ChrisA
     
    Chris Angelico, Jun 13, 2013
    #16
  17. On Thu, 13 Jun 2013 09:09:19 +0300, Νικόλαος Κούρας wrote:

    > On 13/6/2013 3:13 πμ, Steven D'Aprano wrote:


    >> Open an interactive Python session, and run this code:
    >>
    >> c = ord(16474)
    >> len(c.encode('utf-8'))
    >>
    >>
    >> That will tell you how many bytes are used for that example.

    > This is actually wrong.
    >
    > ord()'s argument must be a character for which we expect its ordinal
    > value.


    Gah!

    That's twice I've screwed that up. Sorry about that!


    > >>> chr(16474)

    > '䁚'
    >
    > Some Chinese symbol.
    > So code-point '䁚' has a Unicode ordinal value of 16474, correct?


    Correct.


    > wherein encoding this glyph's ordinal value to binary gives us
    > the following bytes:
    >
    > >>> bin(16474).encode('utf-8')

    > b'0b100000001011010'


    No! That creates a string from 16474 in base two:

    '0b100000001011010'

    The leading 0b is just syntax to tell you "this is base 2, not base 8
    (0o) or base 10 or base 16 (0x)". Also, leading zero bits are dropped.

    Then you encode the string '0b100000001011010' into UTF-8. There are 17
    characters in this string, and they are all ASCII characters so they take
    up 1 byte each, giving you bytes b'0b100000001011010' (in ASCII form). In
    hex form, they are:

    b'\x30\x62\x31\x30\x30\x30\x30\x30\x30\x30\x31\x30\x31\x31\x30\x31\x30'

    which takes up a lot more room, which is why Python prefers to show ASCII
    characters as characters rather than as hex.

    What you want is:

    chr(16474).encode('utf-8')


    [...]
    > Thus, there we count 15 bits left.
    > So it says 15 bits, which is 1 bit less than 2 bytes. Are the above
    > statements correct, please?


    No. There are 17 BYTES there. The string "0" doesn't get turned into a
    single bit. It still takes up a full byte, 0x30, which is 8 bits.
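
    A side-by-side check of the two expressions (my own addition, assuming
    Python 3):

    print(len(bin(16474).encode('utf-8')))   # 17 -- bytes of the text '0b100000001011010'
    print(len(chr(16474).encode('utf-8')))   # 3  -- bytes of the character itself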


    > but thinking this through more and more:
    >
    > >>> chr(16474).encode('utf-8')

    > b'\xe4\x81\x9a'
    > >>> len(b'\xe4\x81\x9a')

    > 3
    >
    > it seems that the bytestring the encode process produces is of length 3.


    Correct! Now you have got the right idea.




    --
    Steven
     
    Steven D'Aprano, Jun 13, 2013
    #17
  18. On 13/6/2013 10:11 πμ, Steven D'Aprano wrote:

    >> >>> chr(16474)

    >> '䁚'
    >>
    >> Some Chinese symbol.
    >> So code-point '䁚' has a Unicode ordinal value of 16474, correct?

    >
    > Correct.
    >
    >
    >> wherein encoding this glyph's ordinal value to binary gives us
    >> the following bytes:
    >>
    >> >>> bin(16474).encode('utf-8')

    >> b'0b100000001011010'


    An observation here that I would ask you to please confirm as valid.

    1. A code-point and the code-point's ordinal value are associated in a
    Unicode charset. They have the so-called 1:1 mapping.

    So, I was under the impression that encoding the code-point into
    utf-8 was the same as encoding the code-point's ordinal value into utf-8.

    That is why I tried:
    bin(16474).encode('utf-8') instead of chr(16474).encode('utf-8')

    So, now I believe they are two different things.
    The code-point *is what actually* needs to be encoded and *not* its
    ordinal value.


    > The leading 0b is just syntax to tell you "this is base 2, not base 8
    > (0o) or base 10 or base 16 (0x)". Also, leading zero bits are dropped.


    But bytes objects are represented with '\x' instead of the aforementioned
    '0x'. Why is that?


    > No! That creates a string from 16474 in base two:
    > '0b100000001011010'


    I disagree here.
    16474 is a number in base 10. Doing bin(16474) we get the binary
    representation of the number 16474 and not a string.
    Why do you say we receive a string while python presents a binary number?


    > Then you encode the string '0b100000001011010' into UTF-8. There are 17
    > characters in this string, and they are all ASCII characters so they take
    > up 1 byte each, giving you bytes b'0b100000001011010' (in ASCII form).


    0b100000001011010 stands for a number in base 2 to me, not a string.
    Have I understood something wrong?
     
    Νικόλαος Κούρας, Jun 13, 2013
    #18
  19. On 13/6/2013 10:58 πμ, Chris Angelico wrote:
    > On Thu, Jun 13, 2013 at 5:42 PM, Νικόλαος Κούρας <> wrote:
    >> On 13/6/2013 10:11 πμ, Steven D'Aprano wrote:
    >>> No! That creates a string from 16474 in base two:
    >>> '0b100000001011010'

    >>
    >> I disagree here.
    >> 16474 is a number in base 10. Doing bin(16474) we get the binary
    >> representation of the number 16474 and not a string.
    >> Why do you say we receive a string while python presents a binary number?

    >
    > You can disagree all you like. Steven cited a simple point of fact,
    > one which can be verified in any Python interpreter. Nikos, you are
    > flat wrong here; bin(16474) creates a string.


    Indeed, python enclosed it in single quotes as '0b100000001011010' and not
    as 0b100000001011010, which in fact makes it a string.

    But since bin(16474) seems to create a string rather than an expected
    number (at least in my mind), then how do we get the binary
    representation of the number 16474 as a number?
     
    Νικόλαος Κούρας, Jun 13, 2013
    #19
  20. On Thu, Jun 13, 2013 at 6:08 PM, Νικόλαος Κούρας <> wrote:
    > On 13/6/2013 10:58 πμ, Chris Angelico wrote:
    >>
    >> On Thu, Jun 13, 2013 at 5:42 PM, Νικόλαος Κούρας <>
    >> wrote:
    >>
    >>> On 13/6/2013 10:11 πμ, Steven D'Aprano wrote:
    >>>>
    >>>> No! That creates a string from 16474 in base two:
    >>>> '0b100000001011010'
    >>>
    >>>
    >>> I disagree here.
    >>> 16474 is a number in base 10. Doing bin(16474) we get the binary
    >>> representation of the number 16474 and not a string.
    >>> Why do you say we receive a string while python presents a binary number?

    >>
    >>
    >> You can disagree all you like. Steven cited a simple point of fact,
    >> one which can be verified in any Python interpreter. Nikos, you are
    >> flat wrong here; bin(16474) creates a string.

    >
    >
    > Indeed, python enclosed it in single quotes as '0b100000001011010' and not as
    > 0b100000001011010, which in fact makes it a string.
    >
    > But since bin(16474) seems to create a string rather than an expected
    > number (at least in my mind), then how do we get the binary representation of
    > the number 16474 as a number?


    In Python 2:
    >>> 16474
    16474

    In Python 3, you have to fiddle around with ctypes, but broadly
    speaking, the same thing.
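
    To illustrate the point being made (my own sketch, assuming Python 3): an
    int has no inherent base, only its string representations do.

    n = 0b100000001011010      # a binary literal; the value is just the int 16474
    print(n)                   # 16474
    print(bin(n))              # bin(n) returns the string '0b100000001011010'
    print(format(n, 'b'))      # same digits as a string, without the prefix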

    ChrisA
     
    Chris Angelico, Jun 13, 2013
    #20
