Is this a bug? BOM decoded with UTF8

pekka niiranen

Hi there,

I have two files, "my.utf8" and "my.utf16", which
both contain a BOM and two "a" characters.

Contents of "my.utf8" in HEX:
EFBBBF6161

Contents of "my.utf16" in HEX:
FEFF6161


For some reason Python 2.4 decodes the BOM for UTF-8
but not for UTF-16. See below:
>>> import codecs
>>> fh = codecs.open("my.utf8", "rb", "utf8")
>>> fh.readlines()
[u'\ufeffaa']  # BOM is decoded, why?
>>> fh.close()
>>> fh = codecs.open("my.utf16", "rb", "utf16")
>>> fh.readlines()
[u'\u6161']  # No BOM here
>>> fh.close()

Is there a trick to read a UTF-8 encoded file with the BOM not decoded?

-pekka-
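For reference: later Python versions (2.5 and up, including Python 3) ship a "utf-8-sig" codec that strips a leading BOM on decoding, which answers pekka's question directly. A minimal sketch of the difference, recreating the "my.utf8" file from the thread:

```python
import codecs

# Recreate pekka's file: UTF-8 BOM (EF BB BF) followed by two "a" characters.
with open("my.utf8", "wb") as f:
    f.write(b"\xef\xbb\xbfaa")

# The plain utf-8 codec keeps the BOM as an ordinary U+FEFF character ...
with codecs.open("my.utf8", "r", "utf-8") as fh:
    with_bom = fh.read()

# ... while utf-8-sig (added in Python 2.5) strips a leading BOM.
with codecs.open("my.utf8", "r", "utf-8-sig") as fh:
    without_bom = fh.read()

print(repr(with_bom))     # '\ufeffaa'
print(repr(without_bom))  # 'aa'
```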
 
Martin v. Löwis

pekka said:
I have two files "my.utf8" and "my.utf16" which
both contain BOM and two "a" characters.

Contents of "my.utf8" in HEX:
EFBBBF6161

Contents of "my.utf16" in HEX:
FEFF6161

This is not true: this byte string does not denote
two "a" characters. Instead, it is a single character
U+6161.
Is there a trick to read UTF8 encoded file with BOM not decoded?

It's very easy: just drop the first character if it is the BOM.

The UTF-8 codec will never do this on its own.

Regards,
Martin
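Martin's suggestion - drop the first character if it happens to be U+FEFF - can be sketched like this, again recreating the file from the thread:

```python
import codecs

# Recreate pekka's UTF-8 file with a BOM in front of "aa".
with open("my.utf8", "wb") as f:
    f.write(b"\xef\xbb\xbfaa")

with codecs.open("my.utf8", "r", "utf-8") as fh:
    text = fh.read()

# The utf-8 codec decodes the BOM as an ordinary U+FEFF character,
# so strip it by hand, as Martin suggests.
if text.startswith(u"\ufeff"):
    text = text[1:]

print(repr(text))  # 'aa'
```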
 
pekka niiranen

Martin said:
This is not true: this byte string does not denote
two "a" characters. Instead, it is a single character
U+6161.
Correct, I used a hex editor to create those files.
It's very easy: just drop the first character if it is the BOM.

I know it's easy (string.replace()) but why does UTF-16 do
it on its own then? Is that according to the Unicode standard or just
a Python convention?
The UTF-8 codec will never do this on its own.


Never? Hmm, so that is not going to change in future versions?
 
Diez B. Roggisch

I know it's easy (string.replace()) but why does UTF-16 do
it on its own then? Is that according to the Unicode standard or just
a Python convention?

BOM is Microsoft-proprietary crap. UTF-16 is defined in the Unicode
standard.
 
Brian Quinlan

Diez said:
BOM is Microsoft-proprietary crap. UTF-16 is defined in the Unicode
standard.

What are you talking about? The BOM and UTF-16 go hand-in-hand.
Without a Byte Order Mark, you can't unambiguously determine whether big-
or little-endian UTF-16 was used. If, for example, you came across a
UTF-16 text file containing this hexadecimal data: 2200

what would you assume? That it is a quote character in little-endian
format, or that it is a for-all symbol in big-endian format?

For more details, see:
http://www.unicode.org/faq/utf_bom.html#BOM

Cheers,
Brian
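Brian's 2200 example is easy to check directly (shown here with the generic codec names; U+2200 is the FOR ALL symbol):

```python
# The same two bytes, interpreted with each byte order.
data = b"\x22\x00"
print(data.decode("utf-16-le"))  # '"'       (U+0022 QUOTATION MARK)
print(data.decode("utf-16-be"))  # '\u2200'  (FOR ALL)

# With a BOM in front, the generic utf-16 codec resolves the
# ambiguity itself and does not return the BOM as data.
print(b"\xff\xfe\x22\x00".decode("utf-16"))  # '"'
```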
 
Diez B. Roggisch

What are you talking about? The BOM and UTF-16 go hand-in-hand.
Without a Byte Order Mark, you can't unambiguously determine whether big-
or little-endian UTF-16 was used. If, for example, you came across a
UTF-16 text file containing this hexadecimal data: 2200
what would you assume? That it is a quote character in little-endian
format, or that it is a for-all symbol in big-endian format?

I'm well aware of the need for a BOM for fixed-size multi-byte characters like
UTF-16.

But I don't see the need for it in a UTF-8 byte sequence, and I first
encountered it in MS tool output - I can't remember when and what exactly
that was. And I have to confess that I attributed it to a stupidity from
MS. But according to the FAQ you mentioned, it is apparently legal in UTF-8
too. Nevertheless, the FAQ states:

"""
Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If
yes, then can I still assume the remaining UTF-8 bytes are in big-endian
order?


A: Yes, UTF-8 can contain a BOM. However, it makes no difference as to the
endianness of the byte stream. UTF-8 always has the same byte order. An
initial BOM is only used as a signature - an indication that an otherwise
unmarked text file is in UTF-8. Note that some recipients of UTF-8 encoded
data do not expect a BOM. Where UTF-8 is used transparently in 8-bit
environments, the use of a BOM will interfere with any protocol or file
format that expects specific ASCII characters at the beginning, such as the
use of "#!" at the beginning of Unix shell scripts. [AF] & [MD]
"""

So they admit that it makes no sense - especially as decoding a UTF-8 string
with any 8-bit encoding like latin1 will succeed.

So in the end, I stand corrected. But I still think it's crap - but not MS
crap. :)
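The latin1 point is easy to demonstrate: latin1 maps every byte value 0x00-0xFF to a character, so decoding with it can never fail, whereas UTF-8 rejects most arbitrary byte sequences. A small sketch:

```python
data = b"T\xc3\xbcr\xff"  # valid UTF-8 for "Tür", plus a stray 0xFF byte

# latin1 happily decodes anything (producing mojibake here) ...
print(data.decode("latin1"))  # 'TÃ¼rÿ'

# ... while utf-8 raises on the invalid trailing byte.
try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    print("utf-8 failed:", exc)
```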
 
Nick Coghlan

Diez said:
So they admit that it makes no sense - especially as decoding a UTF-8 string
with any 8-bit encoding like latin1 will succeed.

So in the end, I stand corrected. But I still think it's crap - but not MS
crap. :)

Oh, good. I'm not the only person who went "A BOM in UTF-8 data? WTF do you need
a byte order marker for when you have 8-bit data?"

It also clarifies Martin's comment about the UTF-8 codec ignoring the existence
of this piece of silliness :)

Cheers,
Nick.
 
Brian Quinlan

Diez said:
I'm well aware of the need for a BOM for fixed-size multi-byte characters like
UTF-16.

But I don't see the need for it in a UTF-8 byte sequence, and I first
encountered it in MS tool output - I can't remember when and what exactly
that was. And I have to confess that I attributed it to a stupidity from
MS. But according to the FAQ you mentioned, it is apparently legal in UTF-8
too. Nevertheless, the FAQ states:
[snipped]
So they admit that it makes no sense - especially as decoding a UTF-8 string
with any 8-bit encoding like latin1 will succeed.

They say that it makes no sense as a byte-order indicator, but they
indicate that it can be used as a file signature.

And I'm not sure what you mean about decoding a UTF-8 string given any
8-bit encoding. Of course the encoding must be known:
>>> u'Tür'.encode('utf-8').decode('latin1').encode('latin1')
'T\xc3\xbcr'

I can assure you that most Germans can differentiate between "Tür" and
"TÃ¼r".

Using a BOM with UTF-8 makes it easy to identify it as such AND it
shouldn't break any properly written Unicode-aware tools.

Cheers,
Brian
 
Diez B. Roggisch

They say that it makes no sense as a byte-order indicator, but they
indicate that it can be used as a file signature.

And I'm not sure what you mean about decoding a UTF-8 string given any
8-bit encoding. Of course the encoding must be known:

That every UTF-8 string can be decoded with any byte-sized encoding. Does it
make sense? No. But does it fail (as decoding UTF-8 frequently does)? No.

So if you are in a situation where you _don't_ know the encoding, decoding
can only be based on a heuristic. And a UTF-8 BOM can be part of that
heuristic - but it is still only a hint. Besides that, lots of tools don't
produce it. E.g. everything that produces/consumes XML doesn't need it.
>>> u'Tür'.encode('utf-8').decode('latin1').encode('latin1')
'T\xc3\xbcr'

If the encoding has to be known anyway, the BOM becomes obsolete.
I can assure you that most Germans can differentiate between "Tür" and
"TÃ¼r".

Oh, Germans can. Computers, OTOH, can't. You could try to use common words
like "für" and so on for a heuristic. But that is no guarantee.
Using a BOM with UTF-8 makes it easy to identify it as such AND it
shouldn't break any properly written Unicode-aware tools.

As the FAQ states, that can very well happen.
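Diez's "BOM as one hint in a heuristic" idea could be sketched as follows (the function name and fallback order are my own illustration, not from the thread):

```python
def guess_decode(data):
    # A leading BOM is a strong hint for the encoding ...
    if data.startswith(b"\xef\xbb\xbf"):
        return data[3:].decode("utf-8")
    if data[:2] in (b"\xff\xfe", b"\xfe\xff"):
        return data.decode("utf-16")  # the utf-16 codec consumes the BOM itself
    # ... otherwise try the strict codec first and fall back to
    # latin1, which accepts any byte sequence.
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("latin1")

print(guess_decode(b"\xef\xbb\xbfaa"))      # 'aa'  (BOM-signed UTF-8)
print(guess_decode(b"\xff\xfea\x00a\x00"))  # 'aa'  (little-endian UTF-16)
print(guess_decode(b"T\xfcr"))              # 'Tür' (latin1 fallback)
```

It is still only a heuristic, as Diez says: a latin1 file that happens to start with those byte sequences would be misdetected.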
 
Martin v. Löwis

What are you talking about? The BOM and UTF-16 go hand-in-hand. Without
a Byte Order Mark, you can't unambiguously determine whether big- or
little-endian UTF-16 was used.

In the old days, UCS-2 was *implicitly* big-endian. It was only
when Microsoft got that wrong that the little-endian version of UCS-2
came along. So while the BOM is now part of all relevant specifications,
it is still "Microsoft crap".

"some higher level protocols", "can be useful" - not
"is inherent part of all byte-level encodings".

Regards,
Martin
 
