UTF-8 question from Dive into Python 3

Discussion in 'Python' started by carlo, Jan 17, 2011.

  1. carlo

    carlo Guest

    Hi,
    recently I had to study *seriously* Unicode and encodings for one
    project in Python, but I was left with a couple of doubts that arose
    after reading the Unicode chapter of the Dive into Python 3 book by
    Mark Pilgrim.

    1- Mark says:
    "Also (and you’ll have to trust me on this, because I’m not going to
    show you the math), due to the exact nature of the bit twiddling,
    there are no byte-ordering issues. A document encoded in UTF-8 uses
    the exact same stream of bytes on any computer."
    Is it true UTF-8 does not have any "big-endian/little-endian" issue
    because of its encoding method? And if it is true, why does Mark (and
    everyone else) write about UTF-8 with and without a BOM some chapters
    later? What would the BOM's purpose be then?

    2- If that were true, can you point me to some documentation about the
    math that, as Mark says, demonstrates this?

    thank you
    Carlo
     
    carlo, Jan 17, 2011
    #1

  2. On 17.01.2011 23:19, carlo wrote:

    > Is it true UTF-8 does not have any "big-endian/little-endian" issue
    > because of its encoding method? And if it is true, why does Mark (and
    > everyone else) write about UTF-8 with and without a BOM some chapters
    > later? What would the BOM's purpose be then?


    Can't answer your other questions, but the UTF-8 BOM is simply a
    marker saying "This is a UTF-8 text file, not an ASCII text file".

    If I'm not wrong, this was a Microsoft invention, and surely one of
    their brightest ideas. I really wish that this had been done for
    ANSI some decades ago. Determining the encoding of a text file is
    hard or even impossible because such a mark was never introduced.
     
    Alexander Kapps, Jan 17, 2011
    #2

  3. Tim Harig Guest

    On 2011-01-17, carlo <> wrote:
    > Is it true UTF-8 does not have any "big-endian/little-endian" issue
    > because of its encoding method? And if it is true, why does Mark (and
    > everyone else) write about UTF-8 with and without a BOM some chapters
    > later? What would the BOM's purpose be then?


    Yes, it is true. The BOM simply identifies the encoding as UTF-8:

    http://unicode.org/faq/utf_bom.html#bom5

    > 2- If that were true, can you point me to some documentation about the
    > math that, as Mark says, demonstrates this?


    It is true because UTF-8 is essentially an 8-bit encoding: once it
    exhausts the addressable space of the current byte, it moves on to the
    next one. Since the bytes are accessed and assessed sequentially, they
    must be in big-endian order.
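
    A quick Python 3 illustration of that byte-order independence: the
    UTF-8 bytes for a string are identical on any platform, while UTF-16
    needs an explicit little- or big-endian variant.

        s = "héllo"
        print(s.encode("utf-8"))      # b'h\xc3\xa9llo' on every machine
        print(s.encode("utf-16-le"))  # b'h\x00\xe9\x00l\x00l\x00o\x00'
        print(s.encode("utf-16-be"))  # b'\x00h\x00\xe9\x00l\x00l\x00o'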
     
    Tim Harig, Jan 17, 2011
    #3
  4. On Mon, 17 Jan 2011 14:19:13 -0800 (PST)
    carlo <> wrote:
    > Is it true UTF-8 does not have any "big-endian/little-endian" issue
    > because of its encoding method?


    Yes.

    > And if it is true, why does Mark (and
    > everyone else) write about UTF-8 with and without a BOM some chapters
    > later? What would the BOM's purpose be then?


    "BOM" in this case is a misnomer. For UTF-8, it is only used as a
    marker (a magic number, if you like) to signal that a given text file
    is UTF-8. The UTF-8 "BOM" does not say anything about byte order; and,
    actually, it does not change with endianness.

    (note that it is not required to put a UTF-8 "BOM" at the beginning of
    text files; it is just a hint that some tools use when
    generating/reading UTF-8)
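
    A small Python sketch of that behaviour; the standard "utf-8-sig"
    codec writes the signature when encoding and strips it when decoding:

        import codecs

        print(codecs.BOM_UTF8)                        # b'\xef\xbb\xbf'
        print("hi".encode("utf-8-sig"))               # b'\xef\xbb\xbfhi'
        print(b"\xef\xbb\xbfhi".decode("utf-8-sig"))  # 'hi'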

    > 2- If that were true, can you point me to some documentation about the
    > math that, as Mark says, demonstrates this?


    Math? UTF-8 is simply a byte-oriented (rather than word-oriented)
    encoding. There is no math involved, it just works by construction.

    Regards

    Antoine.
     
    Antoine Pitrou, Jan 17, 2011
    #4
  5. carlo

    carlo Guest

    On 17 Gen, 23:34, Antoine Pitrou <> wrote:
    > On Mon, 17 Jan 2011 14:19:13 -0800 (PST)
    >
    > carlo <> wrote:
    > > Is it true UTF-8 does not have any "big-endian/little-endian" issue
    > > because of its encoding method?

    >
    > Yes.
    >
    > > And if it is true, why does Mark (and
    > > everyone else) write about UTF-8 with and without a BOM some chapters
    > > later? What would the BOM's purpose be then?

    >
    > "BOM" in this case is a misnomer. For UTF-8, it is only used as a
    > marker (a magic number, if you like) to signal that a given text file
    > is UTF-8. The UTF-8 "BOM" does not say anything about byte order; and,
    > actually, it does not change with endianness.
    >
    > (note that it is not required to put a UTF-8 "BOM" at the beginning of
    > text files; it is just a hint that some tools use when
    > generating/reading UTF-8)
    >
    > > 2- If that were true, can you point me to some documentation about the
    > > math that, as Mark says, demonstrates this?

    >
    > Math? UTF-8 is simply a byte-oriented (rather than word-oriented)
    > encoding. There is no math involved, it just works by construction.
    >
    > Regards
    >
    > Antoine.


    thank you all, I eventually found
    http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf#G7404
    which clears things up.
    No math in fact, as Tim and Antoine pointed out.
     
    carlo, Jan 17, 2011
    #5
  6. On Jan 17, 2:19 pm, carlo <> wrote:
    > Hi,
    > recently I had to study *seriously* Unicode and encodings for one
    > project in Python, but I was left with a couple of doubts that arose
    > after reading the Unicode chapter of the Dive into Python 3 book by
    > Mark Pilgrim.
    >
    > 1- Mark says:
    > "Also (and you’ll have to trust me on this, because I’m not going to
    > show you the math), due to the exact nature of the bit twiddling,
    > there are no byte-ordering issues. A document encoded in UTF-8 uses
    > the exact same stream of bytes on any computer."

    . . .
    > 2- If that were true, can you point me to some documentation about the
    > math that, as Mark says, demonstrates this?


    I believe Mark was referring to the bit-twiddling described in
    the Design section at http://en.wikipedia.org/wiki/UTF-8 .
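
    For the curious, here is a rough sketch of that bit layout in Python,
    for illustration only (real code should just use str.encode("utf-8")):
    the leading bits of the first byte say how many bytes follow, and every
    continuation byte starts with the bits 10, so the byte sequence is fully
    defined without reference to any machine word order.

        def utf8_encode(cp):
            # Rough sketch of the UTF-8 bit layout for code points up to
            # U+10FFFF; not a replacement for str.encode("utf-8").
            if cp < 0x80:              # 0xxxxxxx
                return bytes([cp])
            elif cp < 0x800:           # 110xxxxx 10xxxxxx
                return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
            elif cp < 0x10000:         # 1110xxxx 10xxxxxx 10xxxxxx
                return bytes([0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F),
                              0x80 | (cp & 0x3F)])
            else:                      # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                return bytes([0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                              0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])

        assert utf8_encode(ord("é")) == "é".encode("utf-8")   # b'\xc3\xa9'
        assert utf8_encode(0x20AC) == "€".encode("utf-8")     # b'\xe2\x82\xac'

    Because the layout is defined byte by byte, the resulting sequence is
    the same regardless of the machine's word endianness.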

    Raymond
     
    Raymond Hettinger, Jan 18, 2011
    #6
  7. Tim Harig Guest

    On 2011-01-19, Tim Roberts <> wrote:
    > Tim Harig <> wrote:
    >>On 2011-01-17, carlo <> wrote:
    >>
    >>> 2- If that were true, can you point me to some documentation about the
    >>> math that, as Mark says, demonstrates this?

    >>
    >>It is true because UTF-8 is essentially an 8-bit encoding: once it
    >>exhausts the addressable space of the current byte, it moves on to the
    >>next one. Since the bytes are accessed and assessed sequentially, they
    >>must be in big-endian order.

    >
    > You were doing excellently up to that last phrase. Endianness only applies
    > when you treat a series of bytes as a larger entity. That doesn't apply to
    > UTF-8. None of the bytes is more "significant" than any other, so by
    > definition it is neither big-endian nor little-endian.


    It depends on how you process it, and it doesn't generally make much
    difference in Python. Accessing UTF-8 data from C can be much trickier
    if you use a multibyte type to store the data. In that case, if you
    happen to be on a little-endian architecture, it may be necessary to
    remember that the data is not in the order that your processor expects
    for numeric operations and comparisons. That is why the FAQ I linked to
    says yes to the fact that you can consider UTF-8 to always be in
    big-endian order. Essentially all byte based data is big-endian.
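
    A rough Python illustration of the point under discussion: the UTF-8
    bytes themselves never change, but the integer you get by loading them
    into a machine word depends on the byte order you read them with.

        import struct

        data = "é".encode("utf-8")                # b'\xc3\xa9'
        print(hex(struct.unpack(">H", data)[0]))  # 0xc3a9, big-endian read
        print(hex(struct.unpack("<H", data)[0]))  # 0xa9c3, little-endian read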
     
    Tim Harig, Jan 19, 2011
    #7
  8. On Wed, 19 Jan 2011 11:34:53 +0000 (UTC)
    Tim Harig <> wrote:
    > That is why the FAQ I linked to
    > says yes to the fact that you can consider UTF-8 to always be in big-endian
    > order.


    It certainly doesn't. Read better.

    > Essentially all byte based data is big-endian.


    This is pure nonsense.
     
    Antoine Pitrou, Jan 19, 2011
    #8
  9. Tim Harig Guest

    Considering your post contained no information or evidence for your
    negations, I shouldn't even bother responding. I will bite once.
    Hopefully next time your arguments will contain some pith.

    On 2011-01-19, Antoine Pitrou <> wrote:
    > On Wed, 19 Jan 2011 11:34:53 +0000 (UTC)
    > Tim Harig <> wrote:
    >> That is why the FAQ I linked to
    >> says yes to the fact that you can consider UTF-8 to always be in big-endian
    >> order.

    >
    > It certainly doesn't. Read better.


    - Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If
    - yes, then can I still assume the remaining UTF-8 bytes are in big-endian
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    - order?
    ^^^^^^
    -
    - A: Yes, UTF-8 can contain a BOM. However, it makes no difference as
    ^^^
    - to the endianness of the byte stream. UTF-8 always has the same byte
    ^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    - order. An initial BOM is only used as a signature -- an indication that
    ^^^^^^
    - an otherwise unmarked text file is in UTF-8. Note that some recipients of
    - UTF-8 encoded data do not expect a BOM. Where UTF-8 is used transparently
    - in 8-bit environments, the use of a BOM will interfere with any protocol
    - or file format that expects specific ASCII characters at the beginning,
    - such as the use of "#!" at the beginning of Unix shell scripts.

    The question that was not addressed was whether you can consider UTF-8
    to be little endian. I pointed out why you cannot always make that
    assumption in my previous post.

    UTF-8 has no apparent endianness if you only store it as a byte stream.
    It does however have a byte order. If you store it using multibytes
    (six bytes for all UTF-8 possibilities), which is useful if you want
    to have one storage container for each letter as opposed to one for
    each byte(1), the bytes will still have the same order, but you have
    interrupted its sole existence as a byte stream and have returned it
    to the underlying multibyte-oriented representation. If you attempt
    any numeric or binary operations on what is now a multibyte sequence,
    the processor will interpret the data using its own endian rules.

    If your processor is big-endian, then you don't have any problems.
    The processor will interpret the data in the order that it is stored.
    If your processor is little endian, then it will effectively change the
    order of the bytes for its own evaluation.

    So, you can always assume big-endian and things will work out correctly,
    while you cannot always make the same assumption for little endian
    without potential issues. The same holds true for any byte stream data.
    That is why I say that byte streams are essentially big endian. It is
    all a matter of how you look at it.

    I prefer to look at all data as endian even if it doesn't create
    endian issues because it forces me to consider any endian issues that
    might arise. If none do, I haven't really lost anything. If you simply
    assume that any byte sequence cannot have endian issues you ignore the
    possibility that such issues might arise. When an issue like the
    one above does arise, you end up with a potential bug.

    (1) For Unicode it is probably better to convert the characters to
    UTF-32/UCS-4 for internal processing; but creating a container large
    enough to hold any length of UTF-8 character will work.
     
    Tim Harig, Jan 19, 2011
    #9
  10. On Wed, 19 Jan 2011 14:00:13 +0000 (UTC)
    Tim Harig <> wrote:
    >
    > - Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If
    > - yes, then can I still assume the remaining UTF-8 bytes are in big-endian
    > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    > - order?
    > ^^^^^^
    > -
    > - A: Yes, UTF-8 can contain a BOM. However, it makes no difference as
    > ^^^
    > - to the endianness of the byte stream. UTF-8 always has the same byte
    > ^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    > - order.
    > ^^^^^^


    Which certainly doesn't mean that byte order can be called "big
    endian" for any recognized definition of the latter. Similarly, ASCII
    text has its own order, which certainly can't be characterized as either
    "little endian" or "big endian".

    > UTF-8 has no apparent endianness if you only store it as a byte stream.
    > It does however have a byte order. If you store it using multibytes
    > (six bytes for all UTF-8 possibilities), which is useful if you want
    > to have one storage container for each letter as opposed to one for
    > each byte(1)


    That's a ridiculous proposition. Why would you waste so much space?
    UTF-8 exists *precisely* so that you can save space with most scripts.
    If you are ready to use 4+ bytes per character, just use UTF-32 which
    has much nicer properties.

    Bottom line: you are not describing UTF-8, only your own foolish
    interpretation of it. UTF-8 does not have any endianness since it is a
    byte stream and does not care about "machine words".

    Antoine.
     
    Antoine Pitrou, Jan 19, 2011
    #10
  11. Adam Skutt Guest

    On Jan 19, 9:00 am, Tim Harig <> wrote:
    >
    > So, you can always assume big-endian and things will work out correctly,
    > while you cannot always make the same assumption for little endian
    > without potential issues.  The same holds true for any byte stream data.


    You need to spend some serious time programming a serial port or other
    byte/bit-stream oriented interface, and then you'll realize the folly
    of your statement.

    > That is why I say that byte streams are essentially big endian. It is
    > all a matter of how you look at it.


    It is nothing of the sort. Some byte streams are, in fact, little
    endian: when the bytes are combined into larger objects, the least-
    significant byte in the object comes first. A lot of industrial/
    embedded stuff has byte streams with the LSB leading in the sequence;
    CAN comes to mind as an example.

    The only way to know is for the standard describing the stream to tell
    you what to do.
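
    A hypothetical Python illustration (the two-byte field here is made
    up, but it is the shape many little-endian wire formats take):

        # A 16-bit field transmitted least-significant byte first.
        payload = bytes([0x34, 0x12])
        value = int.from_bytes(payload, "little")   # 0x1234
        print(hex(value))
        # Reading the same two bytes big-endian gives 0x3412 instead,
        # which is why the protocol spec has to tell you the byte order.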

    >
    > I prefer to look at all data as endian even if it doesn't create
    > endian issues because it forces me to consider any endian issues that
    > might arise.  If none do, I haven't really lost anything.  
    > If you simply assume that any byte sequence cannot have endian issues you ignore the
    > possibility that such issues might arise.


    No, you must assume nothing unless you're told how to combine the
    bytes within a sequence into a larger element. Plus, not all byte
    streams support such operations! Some byte streams really are just a
    sequence of bytes and the bytes within the stream cannot be
    meaningfully combined into larger data types. If I give you a series
    of 8-bit (so 1 byte) samples from an analog-to-digital converter, tell
    me how to combine them into a 16, 32, or 64-bit integer. You cannot
    do it without altering the meaning of the samples; it is a completely
    nonsensical operation.

    Adam
     
    Adam Skutt, Jan 19, 2011
    #11
  12. Tim Harig Guest

    On 2011-01-19, Adam Skutt <> wrote:
    > On Jan 19, 9:00 am, Tim Harig <> wrote:
    >> That is why I say that byte streams are essentially big endian. It is
    >> all a matter of how you look at it.

    >
    > It is nothing of the sort. Some byte streams are, in fact, little
    > endian: when the bytes are combined into larger objects, the least-
    > significant byte in the object comes first. A lot of industrial/
    > embedded stuff has byte streams with the LSB leading in the sequence;
    > CAN comes to mind as an example.


    You are correct. Point well made.
     
    Tim Harig, Jan 19, 2011
    #12
  13. Tim Harig Guest

    On 2011-01-19, Antoine Pitrou <> wrote:
    > On Wed, 19 Jan 2011 14:00:13 +0000 (UTC)
    > Tim Harig <> wrote:
    >> UTF-8 has no apparent endianness if you only store it as a byte stream.
    >> It does however have a byte order. If you store it using multibytes
    >> (six bytes for all UTF-8 possibilities), which is useful if you want
    >> to have one storage container for each letter as opposed to one for
    >> each byte(1)

    >
    > That's a ridiculous proposition. Why would you waste so much space?


    Space is only one tradeoff. There are many others to consider. I have
    created data structures with much higher overhead than that because
    they happen to make the problem easier and significantly faster for the
    operations that I am performing on the data.

    For many operations, it is just much faster and simpler to use a single
    character-based container as opposed to having to process an entire byte
    stream to determine individual letters from the bytes or to having
    adaptive size containers to store the data.

    > UTF-8 exists *precisely* so that you can save space with most scripts.


    UTF-8 has many reasons for existing. One of the biggest is that it
    is compatible with tools that were designed to process ASCII and other
    8-bit encodings.

    > If you are ready to use 4+ bytes per character, just use UTF-32 which
    > has much nicer properties.


    I already mentioned UTF-32/UCS-4 as a probable alternative; but I might
    not want to have to worry about converting the encodings back and forth
    before and after processing them. That said, and more importantly, many
    variable-length byte streams may not have alternate representations the
    way Unicode does.
     
    Tim Harig, Jan 19, 2011
    #13
  14. On Wed, 19 Jan 2011 16:03:11 +0000 (UTC)
    Tim Harig <> wrote:
    >
    > For many operations, it is just much faster and simpler to use a single
    > character-based container as opposed to having to process an entire byte
    > stream to determine individual letters from the bytes or to having
    > adaptive size containers to store the data.


    You *have* to "process the entire byte stream" in order to determine
    boundaries of individual letters from the bytes if you want to use a
    "character based container", regardless of the exact representation.
    Once you do that it shouldn't be very costly to compute the actual code
    points. So, "much faster" sounds a bit dubious to me; especially if you
    factor in the cost of memory allocation, and the fact that a larger
    container will fit less easily in a data cache.

    > That said, and more importantly, many
    > variable length byte streams may not have alternate representations as
    > unicode does.


    This whole thread is about UTF-8 (see title) so I'm not sure what kind
    of relevance this is supposed to have.
     
    Antoine Pitrou, Jan 19, 2011
    #14
  15. Tim Harig Guest

    On 2011-01-19, Antoine Pitrou <> wrote:
    > On Wed, 19 Jan 2011 16:03:11 +0000 (UTC)
    > Tim Harig <> wrote:
    >>
    >> For many operations, it is just much faster and simpler to use a single
    >> character-based container as opposed to having to process an entire byte
    >> stream to determine individual letters from the bytes or to having
    >> adaptive size containers to store the data.

    >
    > You *have* to "process the entire byte stream" in order to determine
    > boundaries of individual letters from the bytes if you want to use a
    > "character based container", regardless of the exact representation.


    Right, but I only have to do that once. After that, I can directly address
    any piece of the stream that I choose. If I leave the information as a
    simple UTF-8 stream, I would have to walk the stream again, I would have to
    walk through the first byte of all the characters from the beginning to
    make sure that I was only counting multibyte characters once until I found
    the character that I actually wanted. Converting to a fixed byte
    representation (UTF-32/UCS-4) or separating all of the bytes for each
    UTF-8 into 6 byte containers both make it possible to simply index the
    letters by a constant size. You will note that Python does the former.
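
    A small Python sketch of the difference between indexing the raw UTF-8
    bytes and indexing the decoded, fixed-width sequence of code points:

        data = "naïve café".encode("utf-8")
        print(len(data), data[3])    # 12 bytes; index 3 is a lone continuation byte

        text = data.decode("utf-8")  # decode once...
        print(len(text), text[3])    # ...then 10 characters, and index 3 is 'v'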

    UTF-32/UCS-4 conversion is definitely superior if you are actually
    doing any major processing, but it adds the complexity and overhead of
    requiring the bit twiddling to make the conversions (once in, once
    again out). Some programs don't really care enough about what the data
    actually contains to make it worthwhile. They just want to be able to
    use the characters as black boxes.

    > Once you do that it shouldn't be very costly to compute the actual code
    > points. So, "much faster" sounds a bit dubious to me; especially if you


    You could, I suppose, keep a separate list of pointers to each letter so
    that you could use the pointer list for indexing, or keep a list of the
    character sizes so that you can add them up and calculate the
    variable-width index; but that adds overhead as well.
     
    Tim Harig, Jan 19, 2011
    #15
  16. On Wed, 19 Jan 2011 18:02:22 +0000 (UTC)
    Tim Harig <> wrote:
    > On 2011-01-19, Antoine Pitrou <> wrote:
    > > On Wed, 19 Jan 2011 16:03:11 +0000 (UTC)
    > > Tim Harig <> wrote:
    > >>
    > >> For many operations, it is just much faster and simpler to use a single
    > >> character-based container as opposed to having to process an entire byte
    > >> stream to determine individual letters from the bytes or to having
    > >> adaptive size containers to store the data.

    > >
    > > You *have* to "process the entire byte stream" in order to determine
    > > boundaries of individual letters from the bytes if you want to use a
    > > "character based container", regardless of the exact representation.

    >
    > Right, but I only have to do that once.


    You only have to decode once as well.

    > If I leave the information as a
    > simple UTF-8 stream,


    That's not what we are talking about. We are talking about the supposed
    benefits of your 6-byte representation scheme versus proper decoding
    into fixed width code points.

    > UTF-32/UCS-4 conversion is definitely superior if you are actually
    > doing any major processing, but it adds the complexity and overhead of
    > requiring the bit twiddling to make the conversions (once in, once again out).


    "Bit twiddling" is not something processors are particularly bad at.
    Actually, modern processors are much better at arithmetic and logic
    than at recovering from mispredicted branches, which seems to suggest
    that discovering boundaries probably eats most of the CPU cycles.

    > Converting to a fixed byte
    > representation (UTF-32/UCS-4) or separating all of the bytes for each
    > UTF-8 into 6 byte containers both make it possible to simply index the
    > letters by a constant size. You will note that Python does the
    > former.


    Indeed, Python chose the wise option. Actually, I'd be curious of any
    real-world software which successfully chose your proposed approach.
     
    Antoine Pitrou, Jan 19, 2011
    #16
  17. Tim Harig Guest

    On 2011-01-19, Antoine Pitrou <> wrote:
    > On Wed, 19 Jan 2011 18:02:22 +0000 (UTC)
    > Tim Harig <> wrote:
    >> Converting to a fixed byte
    >> representation (UTF-32/UCS-4) or separating all of the bytes for each
    >> UTF-8 into 6 byte containers both make it possible to simply index the
    >> letters by a constant size. You will note that Python does the
    >> former.

    >
    > Indeed, Python chose the wise option. Actually, I'd be curious of any
    > real-world software which successfully chose your proposed approach.


    The point is basically the same. I created an example because it
    was simpler to follow for demonstration purposes than an actual UTF-8
    conversion to any official multibyte format. You obviously have no
    other purpose than to be contrary, so we ended up following tangents.

    As soon as you start to convert to a multibyte format the endian issues
    occur. For UTF-8 on big-endian hardware, this is anti-climactic because
    all of the bits are already stored in the proper order. Little-endian
    systems will probably convert to a native endian format. If you choose
    to ignore that, that is your prerogative. Have a nice day.
     
    Tim Harig, Jan 19, 2011
    #17
  18. On Wed, 19 Jan 2011 19:18:49 +0000 (UTC)
    Tim Harig <> wrote:
    > On 2011-01-19, Antoine Pitrou <> wrote:
    > > On Wed, 19 Jan 2011 18:02:22 +0000 (UTC)
    > > Tim Harig <> wrote:
    > >> Converting to a fixed byte
    > >> representation (UTF-32/UCS-4) or separating all of the bytes for each
    > >> UTF-8 into 6 byte containers both make it possible to simply index the
    > >> letters by a constant size. You will note that Python does the
    > >> former.

    > >
    > > Indeed, Python chose the wise option. Actually, I'd be curious of any
    > > real-world software which successfully chose your proposed approach.

    >
    > The point is basically the same. I created an example because it
    > was simpler to follow for demonstration purposes than an actual UTF-8
    > conversion to any official multibyte format. You obviously have no
    > other purpose than to be contrary [...]


    Right. You were the one who jumped in and tried to lecture everyone on
    how UTF-8 was "big-endian", and now you are abandoning the one esoteric
    argument you found in support of that.

    > As soon as you start to convert to a multibyte format the endian issues
    > occur.


    Ok. Good luck with your "endian issues" which don't exist.
     
    Antoine Pitrou, Jan 19, 2011
    #18
  19. Terry Reedy Guest

    On 1/19/2011 1:02 PM, Tim Harig wrote:

    > Right, but I only have to do that once. After that, I can directly address
    > any piece of the stream that I choose. If I leave the information as a
    > simple UTF-8 stream, I would have to walk the stream again, I would have to
    > walk through the first byte of all the characters from the beginning to
    > make sure that I was only counting multibyte characters once until I found
    > the character that I actually wanted. Converting to a fixed byte
    > representation (UTF-32/UCS-4) or separating all of the bytes for each
    > UTF-8 into 6 byte containers both make it possible to simply index the
    > letters by a constant size. You will note that Python does the former.


    The idea of using a custom fixed-width padded version of UTF-8 streams
    was initially shocking to me, but I can imagine that there are
    specialized applications, which slice-and-dice uninterpreted segments,
    for which that is appropriate. However, it is not germane to the folly
    of prefixing standard UTF-8 streams with a 3-byte magic number,
    mislabelled a 'byte-order mark', thus making them non-standard.
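
    A quick Python sketch of the breakage the Unicode FAQ quoted earlier
    alludes to: with the signature prepended, the file no longer starts
    with the "#!" that a Unix kernel looks for.

        script = "#!/bin/sh\necho hi\n".encode("utf-8-sig")
        print(script[:4])   # b'\xef\xbb\xbf#' -- no longer starts with b'#!'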

    --
    Terry Jan Reedy
     
    Terry Reedy, Jan 19, 2011
    #19
  20. jmfauth Guest

    On Jan 19, 11:33 pm, Terry Reedy <> wrote:
    > On 1/19/2011 1:02 PM, Tim Harig wrote:
    >
    > > Right, but I only have to do that once.  After that, I can directly address
    > > any piece of the stream that I choose.  If I leave the information as a
    > > simple UTF-8 stream, I would have to walk the stream again, I would have to
    > > walk through the first byte of all the characters from the beginning to
    > > make sure that I was only counting multibyte characters once until I found
    > > the character that I actually wanted.  Converting to a fixed byte
    > > representation (UTF-32/UCS-4) or separating all of the bytes for each
    > > UTF-8 into 6 byte containers both make it possible to simply index the
    > > letters by a constant size.  You will note that Python does the former.

    >
    > The idea of using a custom fixed-width padded version of UTF-8 streams
    > was initially shocking to me, but I can imagine that there are
    > specialized applications, which slice-and-dice uninterpreted segments,
    > for which that is appropriate. However, it is not germane to the folly
    > of prefixing standard UTF-8 streams with a 3-byte magic number,
    > mislabelled a 'byte-order mark', thus making them non-standard.
    >



    Unicode Book, 5.2.0, Chapter 2, Section 14, Page 51, paragraph
    *Unicode Signature*.
     
    jmfauth, Jan 20, 2011
    #20
