UTF-8 question from Dive into Python 3

carlo

Hi,
Recently I had to study Unicode and encodings *seriously* for a
project in Python, but I was left with a couple of doubts that arose
after reading the Unicode chapter of Dive into Python 3 by Mark
Pilgrim.

1- Mark says:
"Also (and you’ll have to trust me on this, because I’m not going to
show you the math), due to the exact nature of the bit twiddling,
there are no byte-ordering issues. A document encoded in UTF-8 uses
the exact same stream of bytes on any computer."
Is it true that UTF-8 does not have any "big-endian/little-endian"
issue because of its encoding method? And if it is true, why does Mark
(and everyone else) write about UTF-8 with and without a BOM some
chapters later? What would be the purpose of the BOM then?

2- If that were true, can you point me to some documentation about the
math that, as Mark says, demonstrates this?

thank you
Carlo
 
Alexander Kapps

Is it true that UTF-8 does not have any "big-endian/little-endian"
issue because of its encoding method? And if it is true, why does Mark
(and everyone else) write about UTF-8 with and without a BOM some
chapters later? What would be the purpose of the BOM then?

Can't answer your other questions, but the UTF-8 BOM is simply a
marker saying "This is a UTF-8 text file, not an ASCII text file."

If I'm not wrong, this was a Microsoft invention, and surely one of
their brightest ideas. I really wish that this had been done for
ANSI some decades ago. Determining the encoding of text files is
hard to impossible because such a mark was never introduced.
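
For what it's worth, a minimal Python 3 sketch of that marker in
action (using the standard codecs module):

    # The UTF-8 "BOM" is just the three fixed bytes EF BB BF.
    import codecs

    print(codecs.BOM_UTF8)             # b'\xef\xbb\xbf'

    # The 'utf-8-sig' codec writes the signature when encoding and
    # strips it, if present, when decoding:
    data = 'hello'.encode('utf-8-sig')
    print(data)                        # b'\xef\xbb\xbfhello'
    print(data.decode('utf-8-sig'))    # 'hello'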
 
Tim Harig

Is it true that UTF-8 does not have any "big-endian/little-endian"
issue because of its encoding method? And if it is true, why does Mark
(and everyone else) write about UTF-8 with and without a BOM some
chapters later? What would be the purpose of the BOM then?

Yes, it is true. The BOM simply identifies the encoding as UTF-8:

http://unicode.org/faq/utf_bom.html#bom5

2- If that were true, can you point me to some documentation about the
math that, as Mark says, demonstrates this?

It is true because UTF-8 is essentially an 8-bit encoding: once it
exhausts the addressable space of the current byte, it moves on to the
next one. Since the bytes are accessed and assessed sequentially, they
must be in big-endian order.
 
Antoine Pitrou

Is it true that UTF-8 does not have any "big-endian/little-endian"
issue because of its encoding method?

Yes.

And if it is true, why does Mark (and everyone else) write about UTF-8
with and without a BOM some chapters later? What would be the purpose
of the BOM then?

"BOM" in this case is a misnomer. For UTF-8, it is only used as a
marker (a magic number, if you like) to signal than a given text file
is UTF-8. The UTF-8 "BOM" does not say anything about byte order; and,
actually, it does not change with endianness.

(note that it is not required to put an UTF-8 "BOM" at the beginning of
text files; it is just a hint that some tools use when
generating/reading UTF-8)
2- If that were true, can you point me to some documentation about the
math that, as Mark says, demonstrates this?

Math? UTF-8 is simply a byte-oriented (rather than word-oriented)
encoding. There is no math involved, it just works by construction.
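
For instance (a quick Python 3 sketch of the "by construction" point):
the UTF-8 bytes for a string never vary with the machine, while UTF-16
genuinely comes in two byte orders, which is what a real BOM is for:

    s = '\u20ac'  # the euro sign, U+20AC

    # UTF-8: one fixed byte sequence, on any computer:
    print(s.encode('utf-8'))      # bytes E2 82 AC

    # UTF-16: two possible byte orders, hence a real BOM:
    print(s.encode('utf-16-be'))  # bytes 20 AC
    print(s.encode('utf-16-le'))  # bytes AC 20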

Regards

Antoine.
 
carlo

"BOM" in this case is a misnomer. For UTF-8, it is only used as a
marker (a magic number, if you like) to signal than a given text file
is UTF-8. The UTF-8 "BOM" does not say anything about byte order; and,
actually, it does not change with endianness.

(note that it is not required to put an UTF-8 "BOM" at the beginning of
text files; it is just a hint that some tools use when
generating/reading UTF-8)


Math? UTF-8 is simply a byte-oriented (rather than word-oriented)
encoding. There is no math involved, it just works by construction.

Regards

Antoine.

Thank you all. I eventually found
http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf#G7404
which clears things up. No math in fact, as Tim and Antoine pointed
out.
 
Raymond Hettinger

Hi,
Recently I had to study Unicode and encodings *seriously* for a
project in Python, but I was left with a couple of doubts that arose
after reading the Unicode chapter of Dive into Python 3 by Mark
Pilgrim.

1- Mark says:
"Also (and you’ll have to trust me on this, because I’m not going to
show you the math), due to the exact nature of the bit twiddling,
there are no byte-ordering issues. A document encoded in UTF-8 uses
the exact same stream of bytes on any computer." . . .
2- If that were true, can you point me to some documentation about the
math that, as Mark says, demonstrates this?

I believe Mark was referring to the bit-twiddling described in
the Design section at http://en.wikipedia.org/wiki/UTF-8 .
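
As a rough illustration of that bit twiddling (a sketch of my own in
Python 3, not Mark's math): a three-byte sequence packs the 16 bits of
a code point into the pattern 1110xxxx 10xxxxxx 10xxxxxx, always
emitting the high bits first, which is why no byte-order question ever
arises:

    cp = ord('\u20ac')  # U+20AC, the euro sign

    encoded = bytes([
        0b11100000 | (cp >> 12),          # lead byte: top 4 bits
        0b10000000 | ((cp >> 6) & 0x3F),  # continuation: middle 6 bits
        0b10000000 | (cp & 0x3F),         # continuation: low 6 bits
    ])
    assert encoded == '\u20ac'.encode('utf-8')  # bytes E2 82 AC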

Raymond
 
Tim Harig

You were doing excellently up to that last phrase. Endianness only
applies when you treat a series of bytes as a larger entity. That
doesn't apply to UTF-8. None of the bytes is more "significant" than
any other, so by definition it is neither big-endian nor
little-endian.

It depends on how you process it, and it doesn't generally make much
difference in Python. Accessing UTF-8 data from C can be much trickier
if you use a multibyte type to store the data. In that case, if you
happen to be on a little-endian architecture, it may be necessary to
remember that the data is not in the order that your processor expects
it to be in for numeric operations and comparisons. That is why the
FAQ I linked to says yes to the fact that you can consider UTF-8 to
always be in big-endian order. Essentially all byte-based data is
big-endian.
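
Here is a sketch of the effect I mean, using Python's struct module to
mimic what a C program would see if it loaded the same four bytes into
a 32-bit integer on each kind of machine (a made-up example):

    import struct

    data = b'\xe2\x82\xac\x00'  # UTF-8 for U+20AC, padded to 4 bytes

    # What a big-endian CPU sees in the word:
    print(hex(struct.unpack('>I', data)[0]))  # 0xe282ac00

    # What a little-endian CPU sees in the very same bytes:
    print(hex(struct.unpack('<I', data)[0]))  # 0xac82e2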
 
Antoine Pitrou

That is why the FAQ I linked to says yes to the fact that you can
consider UTF-8 to always be in big-endian order.

It certainly doesn't. Read better.

Essentially all byte-based data is big-endian.

This is pure nonsense.
 
Tim Harig

Considering your post contained no information or evidence for your
negations, I shouldn't even bother responding, but I will bite once.
Hopefully next time your arguments will contain some pith.

It certainly doesn't. Read better.

- Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If
- yes, then can I still assume the remaining UTF-8 bytes are in big-endian
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- order?
^^^^^^
-
- A: Yes, UTF-8 can contain a BOM. However, it makes no difference as
^^^
- to the endianness of the byte stream. UTF-8 always has the same byte
^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- order. An initial BOM is only used as a signature -- an indication that
^^^^^^
- an otherwise unmarked text file is in UTF-8. Note that some recipients of
- UTF-8 encoded data do not expect a BOM. Where UTF-8 is used transparently
- in 8-bit environments, the use of a BOM will interfere with any protocol
- or file format that expects specific ASCII characters at the beginning,
- such as the use of "#!" at the beginning of Unix shell scripts.

The question that was not addressed was whether you can consider UTF-8
to be little endian. I pointed out why you cannot always make that
assumption in my previous post.

UTF-8 has no apparent endianness if you only store it as a byte
stream. It does, however, have a byte order. If you store it using
multibyte containers (six bytes covers all UTF-8 possibilities), which
is useful if you want to have one storage container for each letter as
opposed to one for each byte(1), the bytes will still have the same
order, but you have interrupted its sole existence as a byte stream
and have returned it to the underlying multibyte-oriented
representation. If you attempt any numeric or binary operations on
what is now a multibyte sequence, the processor will interpret the
data using its own endian rules.

If your processor is big-endian, then you don't have any problems.
The processor will interpret the data in the order that it is stored.
If your processor is little-endian, then it will effectively change
the order of the bytes for its own evaluation.

So, you can always assume big-endian and things will work out
correctly, while you cannot always make the same assumption for
little-endian without potential issues. The same holds true for any
byte-stream data. That is why I say that byte streams are essentially
big-endian. It is all a matter of how you look at it.

I prefer to look at all data as endian even if it doesn't create
endian issues, because it forces me to consider any endian issues that
might arise. If none do, I haven't really lost anything. If you simply
assume that a byte sequence cannot have endian issues, you ignore the
possibility that such issues might arise. When an issue like the one
above does, you end up with a potential bug.

(1) For Unicode it is probably better to convert the characters to
UTF-32/UCS-4 for internal processing; but creating a container large
enough to hold any length of UTF-8 character will work.
 
Antoine Pitrou

- Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If
- yes, then can I still assume the remaining UTF-8 bytes are in big-endian
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- order?
^^^^^^
-
- A: Yes, UTF-8 can contain a BOM. However, it makes no difference as
^^^
- to the endianness of the byte stream. UTF-8 always has the same byte
^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- order.
^^^^^^

Which certainly doesn't mean that byte order can be called "big
endian" by any recognized definition of the latter. Similarly, ASCII
text has its own order, which certainly can't be characterized as
either "little endian" or "big endian".
UTF-8 has no apparent endianness if you only store it as a byte
stream. It does, however, have a byte order. If you store it using
multibyte containers (six bytes covers all UTF-8 possibilities), which
is useful if you want to have one storage container for each letter as
opposed to one for each byte(1)

That's a ridiculous proposition. Why would you waste so much space?
UTF-8 exists *precisely* so that you can save space with most scripts.
If you are ready to use 4+ bytes per character, just use UTF-32, which
has much nicer properties.

Bottom line: you are not describing UTF-8, only your own foolish
interpretation of it. UTF-8 does not have any endianness since it is a
byte stream and does not care about "machine words".

Antoine.
 
Adam Skutt

So, you can always assume big-endian and things will work out
correctly, while you cannot always make the same assumption for
little-endian without potential issues. The same holds true for any
byte-stream data.

You need to spend some serious time programming a serial port or other
byte/bit-stream oriented interface, and then you'll realize the folly
of your statement.
That is why I say that byte streams are essentially big-endian. It is
all a matter of how you look at it.

It is nothing of the sort. Some byte streams are, in fact, little
endian: when the bytes are combined into larger objects, the
least-significant byte in the object comes first. A lot of
industrial/embedded stuff has byte streams with the LSB leading in the
sequence; CAN comes to mind as an example.

The only way to know is for the standard describing the stream to tell
you what to do.
I prefer to look at all data as endian even if it doesn't create
endian issues, because it forces me to consider any endian issues that
might arise. If none do, I haven't really lost anything.

If you simply assume that a byte sequence cannot have endian issues,
you ignore the possibility that such issues might arise.

No, you must assume nothing unless you're told how to combine the
bytes within a sequence into a larger element. Plus, not all byte
streams support such operations! Some byte streams really are just a
sequence of bytes, and the bytes within the stream cannot be
meaningfully combined into larger data types. If I give you a series
of 8-bit (so 1-byte) samples from an analog-to-digital converter, tell
me how to combine them into a 16-, 32-, or 64-bit integer. You cannot
do it without altering the meaning of the samples; it is a completely
nonsensical operation.
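
To make the little-endian stream case concrete (a sketch of a made-up
sensor frame in Python 3, not actual CAN code): a 16-bit value
transmitted least-significant byte first has to be reassembled the way
the standard dictates, not the way a big-endian assumption suggests:

    import struct

    # Two bytes off the wire, LSB first, encoding the value 0x0102:
    frame = bytes([0x02, 0x01])

    value = struct.unpack('<H', frame)[0]  # little-endian 16-bit
    print(value)  # 258, i.e. 0x0102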

Adam
 
Tim Harig

It is nothing of the sort. Some byte streams are, in fact, little
endian: when the bytes are combined into larger objects, the
least-significant byte in the object comes first. A lot of
industrial/embedded stuff has byte streams with the LSB leading in the
sequence; CAN comes to mind as an example.

You are correct. Point well made.
 
Tim Harig

That's a ridiculous proposition. Why would you waste so much space?

Space is only one tradeoff. There are many others to consider. I have
created data structures with much higher overhead than that because
they happen to make the problem easier and significantly faster for the
operations that I am performing on the data.

For many operations, it is just much faster and simpler to use a
single character-based container, as opposed to having to process an
entire byte stream to determine individual letters from the bytes, or
having adaptively sized containers to store the data.

UTF-8 exists *precisely* so that you can save space with most scripts.

UTF-8 has many reasons for existing. One of the biggest is that it
is compatible with tools that were designed to process ASCII and other
8-bit encodings.

If you are ready to use 4+ bytes per character, just use UTF-32, which
has much nicer properties.

I already mentioned UTF-32/UCS-4 as a probably better alternative; but
I might not want to have to worry about converting the encodings back
and forth before and after processing. That said, and more
importantly, many variable-length byte streams may not have alternate
representations the way Unicode does.
 
Antoine Pitrou

For many operations, it is just much faster and simpler to use a
single character-based container, as opposed to having to process an
entire byte stream to determine individual letters from the bytes, or
having adaptively sized containers to store the data.

You *have* to "process the entire byte stream" in order to determine
boundaries of individual letters from the bytes if you want to use a
"character based container", regardless of the exact representation.
Once you do that, it shouldn't be very costly to compute the actual
code points. So, "much faster" sounds a bit dubious to me, especially
if you factor in the cost of memory allocation, and the fact that a
larger container will fit less easily in a data cache.

That said, and more importantly, many variable-length byte streams may
not have alternate representations the way Unicode does.

This whole thread is about UTF-8 (see title) so I'm not sure what kind
of relevance this is supposed to have.
 
Tim Harig

You *have* to "process the entire byte stream" in order to determine
boundaries of individual letters from the bytes if you want to use a
"character based container", regardless of the exact representation.

Right, but I only have to do that once. After that, I can directly
address any piece of the stream that I choose. If I left the
information as a simple UTF-8 stream, I would have to walk the stream
again: walk through the first byte of each character from the
beginning, making sure I counted each multibyte character only once,
until I found the character I actually wanted. Converting to a
fixed-width representation (UTF-32/UCS-4) or separating the bytes of
each UTF-8 character into 6-byte containers both make it possible to
simply index the letters by a constant size. You will note that Python
does the former.
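
A rough sketch of the difference in Python 3 (a toy example of my
own): finding the n-th character in raw UTF-8 means skipping
continuation bytes every time, whereas a decoded fixed-width
representation indexes in a single step:

    def nth_char(utf8_bytes, n):
        """Walk raw UTF-8, counting lead bytes, to find character n."""
        count = -1
        for i, b in enumerate(utf8_bytes):
            if b & 0xC0 != 0x80:  # not a 10xxxxxx continuation byte
                count += 1
                if count == n:
                    j = i + 1     # gather this character's bytes
                    while j < len(utf8_bytes) and \
                            utf8_bytes[j] & 0xC0 == 0x80:
                        j += 1
                    return utf8_bytes[i:j].decode('utf-8')
        raise IndexError(n)

    data = 'h\u00e9llo\u20ac'.encode('utf-8')
    print(nth_char(data, 5))        # the euro sign, via a linear walk
    print(data.decode('utf-8')[5])  # the same, via constant-size indexing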

UTF-32/UCS-4 conversion is definitely superior if you are actually
doing any major processing, but it adds the complexity and overhead of
requiring the bit twiddling to make the conversions (once in, once
again out). Some programs don't really care enough about what the data
actually contains to make it worthwhile. They just want to be able to
use the characters as black boxes.
Once you do that, it shouldn't be very costly to compute the actual
code points. So, "much faster" sounds a bit dubious to me, especially
if you

You could, I suppose, keep a separate list of pointers to each letter
so that you could use the pointer list for indexing, or keep a list of
the character sizes so that you can add them up and calculate the
variable-width index; but that adds overhead as well.
 
Antoine Pitrou

Right, but I only have to do that once.

You only have to decode once as well.

If I left the information as a simple UTF-8 stream,

That's not what we are talking about. We are talking about the
supposed benefits of your 6-byte representation scheme versus proper
decoding into fixed-width code points.

UTF-32/UCS-4 conversion is definitely superior if you are actually
doing any major processing, but it adds the complexity and overhead of
requiring the bit twiddling to make the conversions (once in, once
again out).

"Bit twiddling" is not something processors are particularly bad at.
Actually, modern processors are much better at arithmetic and logic
than at recovering from mispredicted branches, which seems to suggest
that discovering boundaries probably eats most of the CPU cycles.
Converting to a fixed-width representation (UTF-32/UCS-4) or
separating the bytes of each UTF-8 character into 6-byte containers
both make it possible to simply index the letters by a constant size.
You will note that Python does the former.

Indeed, Python chose the wise option. Actually, I'd be curious of any
real-world software which successfully chose your proposed approach.
 
Tim Harig

Indeed, Python chose the wise option. Actually, I'd be curious of any
real-world software which successfully chose your proposed approach.

The point is basically the same. I created an example because it was
simpler to follow for demonstration purposes than an actual UTF-8
conversion to any official multibyte format. You obviously have no
other purpose than to be contrary, so we ended up following tangents.

As soon as you start to convert to a multibyte format, the endian
issues occur. For UTF-8 on big-endian hardware, this is anticlimactic
because all of the bits are already stored in the proper order.
Little-endian systems will probably convert to a native endian format.
If you choose to ignore that, that is your prerogative. Have a nice
day.
 
Antoine Pitrou

Indeed, Python chose the wise option. Actually, I'd be curious of any
real-world software which successfully chose your proposed approach.

The point is basically the same. I created an example because it was
simpler to follow for demonstration purposes than an actual UTF-8
conversion to any official multibyte format. You obviously have no
other purpose than to be contrary [...]

Right. You were the one who jumped in and tried to lecture everyone on
how UTF-8 was "big-endian", and now you are abandoning the one esoteric
argument you found in support of that.
As soon as you start to convert to a multibyte format, the endian
issues occur.

Ok. Good luck with your "endian issues" which don't exist.
 
Terry Reedy

Right, but I only have to do that once. After that, I can directly
address any piece of the stream that I choose. If I left the
information as a simple UTF-8 stream, I would have to walk the stream
again: walk through the first byte of each character from the
beginning, making sure I counted each multibyte character only once,
until I found the character I actually wanted. Converting to a
fixed-width representation (UTF-32/UCS-4) or separating the bytes of
each UTF-8 character into 6-byte containers both make it possible to
simply index the letters by a constant size. You will note that Python
does the former.

The idea of using a custom fixed-width padded version of a UTF-8
stream was initially shocking to me, but I can imagine that there are
specialized applications, which slice and dice uninterpreted segments,
for which that is appropriate. However, it is not germane to the folly
of prefixing standard UTF-8 streams with a 3-byte magic number,
mislabelled a 'byte order mark', thus making them non-standard.
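
(A quick way to see the problem in Python 3: decode a
signature-prefixed stream with the plain 'utf-8' codec and the "mark"
survives as a character, exactly where a parser expecting "#!" would
look.)

    bom_script = b'\xef\xbb\xbf#!/bin/sh\n'

    print(repr(bom_script.decode('utf-8')[:3]))      # '\ufeff#!'
    print(repr(bom_script.decode('utf-8-sig')[:2]))  # '#!'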
 
jmfauth

The idea of using a custom fixed-width padded version of a UTF-8
stream was initially shocking to me, but I can imagine that there are
specialized applications, which slice and dice uninterpreted segments,
for which that is appropriate. However, it is not germane to the folly
of prefixing standard UTF-8 streams with a 3-byte magic number,
mislabelled a 'byte order mark', thus making them non-standard.


Unicode Standard, Version 5.2.0, Chapter 2, Section 14, page 51,
paragraph *Unicode Signature*.
 
