Is this a bug? BOM decoded with UTF8

pekka niiranen

Hi there,

I have two files, "my.utf8" and "my.utf16", which
both contain a BOM and two "a" characters.

Contents of "my.utf8" in HEX:
EFBBBF6161

Contents of "my.utf16" in HEX:
FEFF6161


For some reason Python 2.4 decodes the BOM for UTF-8
but not for UTF-16. See below:
>>> import codecs
>>> fh = codecs.open("my.utf8", "rb", "utf8")
>>> fh.readlines()
[u'\ufeffaa']  # BOM is decoded, why?
>>> fh.close()
>>> fh = codecs.open("my.utf16", "rb", "utf16")
>>> fh.readlines()
[u'\u6161']  # No BOM here
>>> fh.close()

Is there a trick to read a UTF-8 encoded file with the BOM not decoded?

-pekka-
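For reference: later Python versions (2.5 and up, including Python 3) ship a "utf-8-sig" codec that strips a leading BOM on decoding, which answers pekka's question directly. A minimal sketch of the difference, recreating the "my.utf8" file from the thread:

```python
import codecs

# Recreate pekka's file: UTF-8 BOM (EF BB BF) followed by two "a" characters.
with open("my.utf8", "wb") as f:
    f.write(b"\xef\xbb\xbfaa")

# The plain utf-8 codec keeps the BOM as an ordinary U+FEFF character ...
with codecs.open("my.utf8", "r", "utf-8") as fh:
    with_bom = fh.read()

# ... while utf-8-sig (added in Python 2.5) strips a leading BOM.
with codecs.open("my.utf8", "r", "utf-8-sig") as fh:
    without_bom = fh.read()

print(repr(with_bom))     # '\ufeffaa'
print(repr(without_bom))  # 'aa'
```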
 
Martin v. Löwis

pekka said:
I have two files "my.utf8" and "my.utf16" which
both contain BOM and two "a" characters.

Contents of "my.utf8" in HEX:
EFBBBF6161

Contents of "my.utf16" in HEX:
FEFF6161

This is not true: this byte string does not denote
two "a" characters. Instead, it is a single character
U+6161.
Is there a trick to read UTF8 encoded file with BOM not decoded?

It's very easy: just drop the first character if it is the BOM.

The UTF-8 codec will never do this on its own.

Regards,
Martin
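Martin's suggestion - drop the first character if it happens to be U+FEFF - can be sketched like this, again recreating the file from the thread:

```python
import codecs

# Recreate pekka's UTF-8 file with a BOM in front of "aa".
with open("my.utf8", "wb") as f:
    f.write(b"\xef\xbb\xbfaa")

with codecs.open("my.utf8", "r", "utf-8") as fh:
    text = fh.read()

# The utf-8 codec decodes the BOM as an ordinary U+FEFF character,
# so strip it by hand, as Martin suggests.
if text.startswith(u"\ufeff"):
    text = text[1:]

print(repr(text))  # 'aa'
```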
 
pekka niiranen

Martin said:
This is not true: this byte string does not denote
two "a" characters. Instead, it is a single character
U+6161.
Correct, I used a hex editor to create those files.
It's very easy: just drop the first character if it is the BOM.

I know it's easy (string.replace()) but why does UTF-16 do
it on its own then? Is that according to the Unicode standard or just
a Python convention?
The UTF-8 codec will never do this on its own.


Never? Hmm, so that is not going to change in future versions?
 
Diez B. Roggisch

I know it's easy (string.replace()) but why does UTF-16 do
it on its own then? Is that according to the Unicode standard or just
a Python convention?

BOM is Microsoft-proprietary crap. UTF-16 is defined in the Unicode
standard.
 
Brian Quinlan

Diez said:
BOM is Microsoft-proprietary crap. UTF-16 is defined in the Unicode
standard.

What are you talking about? The BOM and UTF-16 go hand-in-hand.
Without a Byte Order Mark, you can't unambiguously determine whether big-
or little-endian UTF-16 was used. If, for example, you came across a
UTF-16 text file containing this hexadecimal data: 2200

what would you assume? That it is a quote character in little-endian
format, or that it is a for-all symbol in big-endian format?

For more details, see:
http://www.unicode.org/faq/utf_bom.html#BOM

Cheers,
Brian
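Brian's 2200 example is easy to check directly (shown here with the generic codec names; U+2200 is the FOR ALL symbol):

```python
# The same two bytes, interpreted with each byte order.
data = b"\x22\x00"
print(data.decode("utf-16-le"))  # '"'       (U+0022 QUOTATION MARK)
print(data.decode("utf-16-be"))  # '\u2200'  (FOR ALL)

# With a BOM in front, the generic utf-16 codec resolves the
# ambiguity itself and does not return the BOM as data.
print(b"\xff\xfe\x22\x00".decode("utf-16"))  # '"'
```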
 
Diez B. Roggisch

What are you talking about? The BOM and UTF-16 go hand-in-hand.
Without a Byte Order Mark, you can't unambiguously determine whether big-
or little-endian UTF-16 was used. If, for example, you came across a
UTF-16 text file containing this hexadecimal data: 2200
what would you assume? That it is a quote character in little-endian
format, or that it is a for-all symbol in big-endian format?

I'm well aware of the need for a BOM for fixed-size multi-byte characters like
UTF-16.

But I don't see the need for it in a UTF-8 byte sequence, and I first
encountered it in MS tool output - I can't remember when and what exactly
that was. And I have to confess that I attributed it to a stupidity from
MS. But according to the FAQ you mentioned, it is apparently legal in UTF-8
too. Nevertheless, the FAQ states:

"""
Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If
yes, then can I still assume the remaining UTF-8 bytes are in big-endian
order?


A: Yes, UTF-8 can contain a BOM. However, it makes no difference as to the
endianness of the byte stream. UTF-8 always has the same byte order. An
initial BOM is only used as a signature - an indication that an otherwise
unmarked text file is in UTF-8. Note that some recipients of UTF-8 encoded
data do not expect a BOM. Where UTF-8 is used transparently in 8-bit
environments, the use of a BOM will interfere with any protocol or file
format that expects specific ASCII characters at the beginning, such as the
use of "#!" at the beginning of Unix shell scripts. [AF] & [MD]
"""

So they admit that it makes no sense - especially as decoding a UTF-8 string
with any 8-bit encoding like latin1 will succeed.

So in the end, I stand corrected. But I still think it's crap - but not MS
crap. :)
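The latin1 point is easy to demonstrate: latin1 maps every byte value 0x00-0xFF to a character, so decoding with it can never fail, whereas UTF-8 rejects most arbitrary byte sequences. A small sketch:

```python
data = b"T\xc3\xbcr\xff"  # valid UTF-8 for "Tür", plus a stray 0xFF byte

# latin1 happily decodes anything (producing mojibake here) ...
print(data.decode("latin1"))  # 'TÃ¼rÿ'

# ... while utf-8 raises on the invalid trailing byte.
try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    print("utf-8 failed:", exc)
```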
 
Nick Coghlan

Diez said:
So they admit that it makes no sense - especially as decoding a UTF-8 string
with any 8-bit encoding like latin1 will succeed.

So in the end, I stand corrected. But I still think it's crap - but not MS
crap. :)

Oh, good. I'm not the only person who went "A BOM in UTF-8 data? WTF do you need
a byte order marker for when you have 8-bit data?"

It also clarifies Martin's comment about the UTF-8 codec ignoring the existence
of this piece of silliness :)

Cheers,
Nick.
 
Brian Quinlan

Diez said:
I'm well aware of the need for a BOM for fixed-size multi-byte characters like
UTF-16.

But I don't see the need for it in a UTF-8 byte sequence, and I first
encountered it in MS tool output - I can't remember when and what exactly
that was. And I have to confess that I attributed it to a stupidity from
MS. But according to the FAQ you mentioned, it is apparently legal in UTF-8
too. Nevertheless, the FAQ states:
[snipped]
So they admit that it makes no sense - especially as decoding a UTF-8 string
with any 8-bit encoding like latin1 will succeed.

They say that it makes no sense as a byte-order indicator, but they
indicate that it can be used as a file signature.

And I'm not sure what you mean about decoding a UTF-8 string given any
8-bit encoding. Of course the encoding must be known:
>>> u'Tür'.encode('utf-8').decode('latin1').encode('latin1')
'T\xc3\xbcr'

I can assure you that most Germans can differentiate between "Tür" and
"TÃ¼r".

Using a BOM with UTF-8 makes it easy to identify it as such AND it
shouldn't break any properly written Unicode-aware tools.

Cheers,
Brian
 
Diez B. Roggisch

They say that it makes no sense as a byte-order indicator, but they
indicate that it can be used as a file signature.

And I'm not sure what you mean about decoding a UTF-8 string given any
8-bit encoding. Of course the encoding must be known:

That every UTF-8 string can be decoded with any byte-sized encoding. Does it
make sense? No. But does it fail (as decoding UTF-8 frequently does)? No.

So if you are in a situation where you _don't_ know the encoding, decoding
can only be based on a heuristic. And a UTF-8 BOM can be part of that
heuristic - but it is still only a hint. Besides that, lots of tools don't
produce it. E.g. everything that produces/consumes XML doesn't need it.
>>> u'Tür'.encode('utf-8').decode('latin1').encode('latin1')
'T\xc3\xbcr'

If the encoding has to be known anyway, the BOM becomes obsolete.
I can assure you that most Germans can differentiate between "Tür" and
"TÃ¼r".

Oh, Germans can. Computers, OTOH, can't. You could try to use common words
like "für" and so on for a heuristic. But that is no guarantee.
Using a BOM with UTF-8 makes it easy to identify it as such AND it
shouldn't break any properly written Unicode-aware tools.

As the FAQ states, that can very well happen.
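Diez's "BOM as one hint in a heuristic" idea could be sketched as follows (the function name and fallback order are my own illustration, not from the thread):

```python
def guess_decode(data):
    # A leading BOM is a strong hint for the encoding ...
    if data.startswith(b"\xef\xbb\xbf"):
        return data[3:].decode("utf-8")
    if data[:2] in (b"\xff\xfe", b"\xfe\xff"):
        return data.decode("utf-16")  # the utf-16 codec consumes the BOM itself
    # ... otherwise try the strict codec first and fall back to
    # latin1, which accepts any byte sequence.
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("latin1")

print(guess_decode(b"\xef\xbb\xbfaa"))      # 'aa'  (BOM-signed UTF-8)
print(guess_decode(b"\xff\xfea\x00a\x00"))  # 'aa'  (little-endian UTF-16)
print(guess_decode(b"T\xfcr"))              # 'Tür' (latin1 fallback)
```

It is still only a heuristic, as Diez says: a latin1 file that happens to start with those byte sequences would be misdetected.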
 
Martin v. Löwis

What are you talking about? The BOM and UTF-16 go hand-in-hand. Without
a Byte Order Mark, you can't unambiguously determine whether big- or
little-endian UTF-16 was used.

In the old days, UCS-2 was *implicitly* big-endian. It was only
when Microsoft got that wrong that the little-endian version of UCS-2
came along. So while the BOM is now part of all relevant specifications,
it is still "Microsoft crap".

"some higher level protocols", "can be useful" - not
"is inherent part of all byte-level encodings".

Regards,
Martin
 
