: > (Oliver's erroneous statement

: > storing the bytecount is two bytes more because the byte 0xff 0xef get
: > prepended automatically,
: The BOM is the relevant encoding of the Unicode character U+FEFF. No
: way is it 0xff 0xef.
Oops, I goofed up here, and the twisted order shows exactly what a byte
order mark is good for. Just imagine if this had been transmitted as
UCS-2, in Big Endian order.
: The various encoded byte patterns are shown in
: that Unicode FAQ, and in utf-8 it's *three* bytes.
Again, my fault. Shouldn't post when I'm too tired.
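
For reference, a minimal Python sketch of the byte patterns being discussed. It encodes U+FEFF in the three encodings mentioned in this thread; the variable names are only illustrative, and it assumes Python 3.8 or later for `bytes.hex()` with a separator.

```python
import codecs

BOM = "\ufeff"  # U+FEFF, the character that serves as the byte order mark

# Encode the same character in the encodings under discussion.
utf8 = BOM.encode("utf-8")          # three bytes
utf16_be = BOM.encode("utf-16-be")  # two bytes, big endian
utf16_le = BOM.encode("utf-16-le")  # two bytes, little endian

for name, data in (("utf-8", utf8),
                   ("utf-16-be", utf16_be),
                   ("utf-16-le", utf16_le)):
    # Print each pattern as space-separated hex bytes.
    print(f"{name:10} {data.hex(' ')}")

# The standard library ships the same patterns as constants.
assert utf8 == codecs.BOM_UTF8
assert utf16_be == codecs.BOM_UTF16_BE
assert utf16_le == codecs.BOM_UTF16_LE
```

In none of these encodings does the pattern come out as 0xff 0xef; the closest is the little-endian UTF-16 form, 0xff 0xfe.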
: > This makes sense with UCS-2 Unicode (the "original" Unicode
: > encoding)
: Yes, but "UCS-2" is out of date:
:
: http://www.unicode.org/faq/basic_q.html#23
But several (notably MS-based) applications still let the user choose
between UCS-2, UTF-8 _and_ "Unicode".
: > but not with UTF-8 (8-bit transformation format of Unicode) because
: > the characters encoded in UTF-8 are self-synchronizing and no
: > information about byte order is needed.
: Nevertheless, the Unicode FAQ points out that utf-8 can usefully
: start with a BOM as an encoding signature.
The FAQ says so, but...
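
As a rough illustration of what "encoding signature" means in practice, here is a small Python sketch that sniffs a leading BOM. The function name `sniff_bom` is hypothetical, and the sketch deliberately covers only UTF-8 and UTF-16; it is not a complete detector.

```python
import codecs

# Known signatures, checked in order. UTF-32 is left out on purpose:
# its little-endian BOM starts with the same two bytes as UTF-16-LE,
# so a full detector would have to test UTF-32 first.
_SIGNATURES = (
    ("utf-8", codecs.BOM_UTF8),
    ("utf-16-be", codecs.BOM_UTF16_BE),
    ("utf-16-le", codecs.BOM_UTF16_LE),
)

def sniff_bom(data: bytes):
    """Return (encoding, payload) if data starts with a known BOM,
    otherwise (None, data) unchanged."""
    for name, bom in _SIGNATURES:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data

# Python's "utf-8-sig" codec writes the UTF-8 BOM when encoding.
signed = "hi".encode("utf-8-sig")
print(sniff_bom(signed))
print(sniff_bom(b"hi"))
```

A reader that does not know about the signature sees three extra bytes at the start of the data instead.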
: > In contrast, other programs behaving correctly frequently complain
: > if the BOM appears where it simply doesn't belong.
: Except that it is not inherently incorrect for it to appear at the
: beginning of a utf-8 stream - but see the cited FAQ for details.
But my experience (with shell scripts, the interpretation of shebang lines
in Perl scripts, etc.) runs contrary to that. A UTF-8-encoded file _with_ a
BOM causes unnecessary hiccups, even if choking on it goes against the
formal spec.
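
A minimal Python sketch of the shebang problem, assuming the loader recognizes an interpreter line only when the file's very first two bytes are "#!". The helper names `starts_with_shebang` and `strip_utf8_bom` are hypothetical, and the script text is made up for the example.

```python
import codecs

def starts_with_shebang(data: bytes) -> bool:
    """True if the very first two bytes are '#!'."""
    return data.startswith(b"#!")

def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM if present; otherwise return data as is."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# A made-up script, once without and once with a leading BOM.
script = b'#!/usr/bin/perl\nprint "hello\\n";\n'
with_bom = codecs.BOM_UTF8 + script

print(starts_with_shebang(script))                    # True
print(starts_with_shebang(with_bom))                  # False
print(starts_with_shebang(strip_utf8_bom(with_bom)))  # True
```

With the BOM in place the file starts with 0xef 0xbb 0xbf rather than "#!", so a check of this kind fails until the signature is removed.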
: Seems to me you would have done well to read that FAQ yourself, before
: putting misleading opinions on the record.
Sorry, I should have consulted the FAQ, but I stand by my negative experiences
with superfluous BOMs.
Oliver.