UTF-8 vs w_char


Ben Bacarisse

Stephen Sprunk said:
I thought Java only modified UTF-8 (to encode embedded NULs as the
overlong sequence 0xC0 0x80),

I think Java's "modified UTF-8" also encodes UTF-16 surrogate pairs by
encoding each of the two 16-bit values using UTF-8 rules.
and even then only in certain cases, such
as serializing objects.

That would be good. It should never appear in the wild because other
systems will report the overlong null encoding as an error and will get
the mis-coded surrogate pairs completely wrong.

<snip>
 

Xavier Roche

I find this intriguing. Why do they modify UTF-16? Can you at least
give a pointer so I can google the whole story? My sanity is arguably
already compromised, no worries.

http://www.oracle.com/technetwork/articles/javase/supplementary-142654.html#Modified_UTF-8

"The incompatibility between modified UTF-8 and standard UTF-8 stems
from two differences. First, modified UTF-8 represents the character
U+0000 as the two-byte sequence 0xC0 0x80, whereas standard UTF-8 uses
the single byte value 0x0. Second, modified UTF-8 represents
supplementary characters by separately encoding the two surrogate code
units of their UTF-16 representation. Each of the surrogate code units
is represented by three bytes, for a total of six bytes. Standard UTF-8,
on the other hand, uses a single four byte sequence for the complete
character. "
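
For the concrete difference, here is a small sketch (my own illustration,
not Java's code; U+10400 is just an arbitrary supplementary character)
that encodes one code point both ways and prints the bytes:

/* Standard UTF-8 vs "modified UTF-8"/CESU-8 for U+10400. */
#include <stdio.h>

/* Encode a supplementary code point as standard UTF-8 (4-byte form). */
static int utf8_encode4(unsigned long cp, unsigned char *out)
{
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Encode a 16-bit value (here, a surrogate) with the 3-byte UTF-8 pattern. */
static int utf8_encode3(unsigned v, unsigned char *out)
{
    out[0] = (unsigned char)(0xE0 | (v >> 12));
    out[1] = (unsigned char)(0x80 | ((v >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (v & 0x3F));
    return 3;
}

int main(void)
{
    unsigned long cp = 0x10400;   /* a supplementary character */
    unsigned char std[4], mod[6];
    unsigned hi = (unsigned)(0xD800 + ((cp - 0x10000) >> 10));  /* UTF-16 surrogates */
    unsigned lo = (unsigned)(0xDC00 + ((cp - 0x10000) & 0x3FF));
    int i;

    utf8_encode4(cp, std);        /* standard UTF-8: one 4-byte sequence  */
    utf8_encode3(hi, mod);        /* modified UTF-8: each surrogate coded */
    utf8_encode3(lo, mod + 3);    /* separately, 6 bytes in total         */

    printf("standard: ");
    for (i = 0; i < 4; i++) printf("%02X ", (unsigned)std[i]);
    printf("\nmodified: ");
    for (i = 0; i < 6; i++) printf("%02X ", (unsigned)mod[i]);
    putchar('\n');
    return 0;   /* prints F0 90 90 80 vs ED A0 81 ED B0 80 */
}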
 

Stephen Sprunk

But there are some issues. Presumably the resource compiler will
have to spit strings out as non-human readable unsigned char[]
arrays.

They're human-readable if the software you're displaying them with
understands UTF-8, and most of it does these days--particularly if
you put the (invalid) "UTF-8 BOM" at the start of your files.

BabyX comes with a resource compiler which generates images, fonts,
and strings. It dumps them as C source files.

TTF fonts, which it understands, come with a Unicode value for every
glyph, so it stores those values and, internally, Baby X looks them up
before doing its own rasterisation. Strings are currently spat out as
normal C strings. So if you enter Fred as a string value, it will
produce a variable char *fred_str = "Fred"; but currently the string
code has no support for non-ASCII. Is it possible to spit out a UTF-8
string and have most editors display it in a (polyglot)
human-readable form?

We can quibble over "most", but many editors will do so, especially if
you put the (invalid) "UTF-8 BOM" at the start of your files. Try it
with your favorite editor and find out. Your odds of success are best
if you're using a UTF-8-flavored locale, e.g. "en_US.UTF-8" on Linux.
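
To make that concrete, here is a hypothetical fragment of
resource-compiler output (the symbol names and string are invented, not
Baby X's actual output). Both variants hand the compiler the same bytes;
only the first is human-readable, and only in an editor that decodes UTF-8:

/* Option 1: raw UTF-8 in the literal (readable in a UTF-8-aware editor). */
const char *greeting_str = "Grüße";   /* bytes 47 72 C3 BC C3 9F 65 */

/* Option 2: hex escapes (ugly, but survives any editor or code page).
   The split literal stops 'e' being swallowed by the \x9f escape. */
const char *greeting_str_escaped = "Gr\xc3\xbc\xc3\x9f" "e";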

S
 

Stephen Sprunk

A great many programs for Windows
continue to use 8-bit characters with different code pages (if they
consider the idea of non-ASCII characters at all). Those that
support wider characters tend to get mixed up between UTF-16, UCS-2
and wchar_t, such as by assuming that each wchar_t corresponds to a
single character. That works fine in testing - until somebody tries
to use the program with Egyptian hieroglyphics.

Actually, it breaks as soon as someone tries to use the program with
combining (rather than precomposed) characters, which are within the
BMP. No encoding is safe from that problem.
Linux has 32-bit wchar_t, which can obviously support UTF-32 and
therefore all Unicode. But it is not much used - UTF-8 is the
standard in the *nix world.

... because UTF-8 is mostly transparent to code which isn't Unicode
aware, which avoids having to duplicate every API call that currently
uses (char*) as Windows had to do. The main problem is string length
calculations, but that's a rather hairy problem regardless of what
encoding is used.
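
As a minimal sketch of that length problem (my own example): strlen()
counts bytes, while counting code points in UTF-8 only means skipping the
10xxxxxx continuation bytes. Neither count is "characters" in the
user-visible sense, of course:

#include <stdio.h>
#include <string.h>

/* Count code points in a (presumed valid) UTF-8 string: every byte
   that is not a continuation byte starts a new code point. */
static size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

int main(void)
{
    const char *s = "na\xc3\xafve";   /* UTF-8 "naïve": 6 bytes, 5 code points */
    printf("bytes: %zu, code points: %zu\n", strlen(s), utf8_codepoints(s));
    return 0;
}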

S
 

Keith Thompson

Stephen Sprunk said:
... because UTF-8 is mostly transparent to code which isn't Unicode
aware, which avoids having to duplicate every API call that currently
uses (char*) as Windows had to do. The main problem is string length
calculations, but that's a rather hairy problem regardless of what
encoding is used.

That makes sense. Windows went with 16-bit characters for file
names and similar things, making 16-bit (UCS-2 or later UTF-16)
pervasive. Unix and Linux stuck with 8-bit characters for file
names. That might have made the use of non-ASCII characters not
exceeding 65535 in file names easier for Windows than for Linux for
a while, but it made for a smoother transition from ASCII to UTF-8 on
the Unix side.

(Unless I've got the history wrong, which is entirely possible.)
 

Ben Bacarisse

Stephen Sprunk said:
On 04-Nov-13 04:21, Malcolm McLean wrote:
[...] Is it possible to spit out a UTF-8
string and have most editors display it in a (polyglot)
human-readable form?

We can quibble over "most", but many editors will do so, especially if
you put the (invalid) "UTF-8 BOM" at the start of your files.

I'd say that any *nix tool that only works when it sees an initial ZERO
WIDTH NO-BREAK SPACE (that's what the BOM is) is broken. Maybe I'm
taking "especially" the wrong way, but I'd try without and ditch the
tool if it did not respect my locale setting.

By the way, why is it invalid?
Try it
with your favorite editor and find out. Your odds of success are best
if you're using a UTF-8-flavored locale, e.g. "en_US.UTF-8" on Linux.

Yes, <punch>that's the way to do it</punch>. Modern Linux distros have
very good UTF-8 support. Because I use it all the time (multilingual
family), I am constantly disappointed that I can't do so here. (The RFC
for Usenet has embraced UTF-8, but it still seems to make people angry.)
 

Malcolm McLean

We can quibble over "most", but many editors will do so, especially if
you put the (invalid) "UTF-8 BOM" at the start of your files. Try it
with your favorite editor and find out. Your odds of success are best
if you're using a UTF-8-flavored locale, e.g. "en_US.UTF-8" on Linux.
The Baby X resource compiler is designed to convert external data into
embedded binary data within the program. So it generates C source files,
which are mainly non-human-meaningful binary dumps, but contain C symbols
which will normally be given human-meaningful names, and can be linked
with the rest of the program.
However, currently strings are human-readable. The resource compiler just
outputs a C string literal, as you'd expect. Now if we go to Unicode,
obviously it would be nice to keep the strings still readable. But will
a C compiler accept a UTF-8 file with a BOM marker?
 

Ben Bacarisse

Malcolm McLean said:
The Baby X resource compiler is designed to convert external data into
embedded binary data within the program. So it generates C source files,
which are mainly non-human-meaningful binary dumps, but contain C symbols
which will normally be given human-meaningful names, and can be linked
with the rest of the program.
However, currently strings are human-readable. The resource compiler just
outputs a C string literal, as you'd expect. Now if we go to Unicode,
obviously it would be nice to keep the strings still readable. But will
a C compiler accept a UTF-8 file with a BOM marker?

Please forget about the BOM, at least as far as *nix platforms are
concerned (unless, of course, you need a zero width no-break space).
It's going to bite you one day and it serves no useful purpose. It's
useful only with 16-bit encodings, where there is some doubt about the
byte order.
 

Keith Thompson

Ben Bacarisse said:
Please forget about the BOM, at least as far as *nix platforms are
concerned (unless, of course, you need a zero width no-break space).
It's going to bite you one day and it serves no useful purpose. It's
useful only with 16-bit encodings, where there is some doubt about the
byte order.

Agreed.

However, a UTF-8 BOM could be used as a way to distinguish between UTF-8
and ASCII, for files that contain no other non-ASCII characters. (But
it's not a *good* way to make that distinction, since UTF-8 files with no
BOM are still valid UTF-8.)

A couple of data points: gcc 4.7.2 and clang 3.0, both on Linux Mint,
accept UTF-8 source files with an initial BOM.
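
For anyone who wants to repeat the experiment, here is a sketch of one
way to do it (the file name and test string are my own invention):
generate a source file that starts with the EF BB BF bytes and contains
a raw UTF-8 literal, then see whether your compiler accepts it:

#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("bom_test.c", "wb");
    if (!fp)
        return 1;
    fputs("\xEF\xBB\xBF", fp);   /* the "UTF-8 BOM" */
    fputs("#include <stdio.h>\n"
          "int main(void) { puts(\"h\xc3\xa9llo\"); return 0; }\n", fp);
    fclose(fp);
    return 0;   /* then try: cc bom_test.c */
}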
 

Xavier Roche

On 04/11/2013 21:17, Keith Thompson wrote:
However, a UTF-8 BOM could be used as a way to distinguish between UTF-8
and ASCII, for files that contain no other non-ASCII characters.

[ I have always considered UTF8 BOM as a really ugly thing. ]

What if you `cat *.h > foo.h'? Will the ZWNBSP be gently ignored by
compilers if found in the middle of a file?
 

Keith Thompson

Xavier Roche said:
On 04/11/2013 21:17, Keith Thompson wrote:
However, a UTF-8 BOM could be used as a way to distinguish between UTF-8
and ASCII, for files that contain no other non-ASCII characters.

[ I have always considered UTF8 BOM as a really ugly thing. ]

What if you `cat *.h > foo.h'? Will the ZWNBSP be gently ignored by
compilers if found in the middle of a file?

A quick experiment with gcc and clang indicates that the answer is yes --
but I wouldn't want to count on it.
 

Stephen Sprunk

Stephen Sprunk said:
[...] Is it possible to spit out a UTF-8 string and have most
editors display it in a (polyglot) human-readable form?

We can quibble over "most", but many editors will do so, especially
if you put the (invalid) "UTF-8 BOM" at the start of your files.

I'd say that any *nix tool that only works when it sees an initial
ZERO WIDTH NO-BREAK SPACE (that's what the BOM is) is broken. Maybe
I'm taking "especially" the wrong way, but I'd try without and ditch
the tool if it did not respect my locale setting.

If you have a UTF-8-flavored locale, it should work without the BOM; if
you don't, the BOM will often cause software to ignore the locale and switch to
UTF-8. It's an ingenious solution to the proliferation of character
encodings on the Internet.

Some software (e.g. web browsers) will use heuristics to attempt to
guess the encoding used, and UTF-8 is fairly easy to recognize, so UTF-8
sometimes works even without the "BOM" _or_ a UTF-8 locale.
By the way, why is it invalid?

UTF-16 needs a BOM due to endianness ambiguity. UTF-8, however, is a
byte-oriented encoding; there is no byte order to be marked, so to call
it a byte-order mark is invalid.

The "UTF-8 BOM" is a legitimate ZWNBSP, of course, but one that the user
did not intentionally put in the file and, in most cases, is unable to
get rid of or even see, which causes problems with software that doesn't
understand UTF-8 or requires certain bytes at the start of a file, e.g.
Unix's #! syntax.
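
When a program does want to tolerate such files, the workaround is
small; a sketch (my own helper, not any standard API):

#include <stdio.h>

/* Open a file and skip a leading "UTF-8 BOM" (EF BB BF) if present,
   so the rest of the program never sees the stray ZWNBSP. */
static FILE *fopen_skip_bom(const char *path)
{
    FILE *fp = fopen(path, "rb");
    if (fp) {
        unsigned char b[3];
        if (fread(b, 1, 3, fp) != 3 ||
            b[0] != 0xEF || b[1] != 0xBB || b[2] != 0xBF)
            rewind(fp);   /* no BOM: start again from byte 0 */
        /* else: leave the stream positioned just past the BOM */
    }
    return fp;
}
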
Yes, <punch>that's the way to do it</punch>. Modern Linux distros
have very good UTF-8 support. Because I use it all the time
(multilingual family), I am constantly disappointed that I can't do
so here. (The RFC for Usenet has embraced UTF-8, but it still seems
to make people angry.)

I set my mail client and newsreader to UTF-8 long ago, and if anyone
complains, I just point them to the relevant RFCs; any anger they feel
is their problem, not mine. It usually seems to stem from them using
obsolete software, and that's not my problem either.

S
 

Stephen Sprunk

Agreed.

However, a UTF-8 BOM could be used as a way to distinguish between
UTF-8 and ASCII, for files that contain no other non-ASCII
characters. (But it's not a *good* way to make that distinction,
since UTF-8 files with no BOM are still valid UTF-8.)

It's clearly not perfect. However, if you're reading some text and have
no information about what encoding it's in, a UTF-16 BOM or "UTF-8 BOM"
is a clear sign that's how it should be interpreted. Without that, you
have to either use complicated (and unreliable) heuristics or just punt
and use some default encoding, which Murphy's Law tells us will usually
be the wrong one.

As ugly as it may be, proliferation of the "UTF-8 BOM" has solved far
more problems than it has caused.
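
For what "fairly easy to recognize" amounts to in practice, here is a
rough sketch (my own code, and not a full validator: it does not reject
overlong sequences or encoded surrogates). A buffer that passes this
check and contains at least one byte >= 0x80 is almost certainly UTF-8,
because well-formed multi-byte sequences rarely occur by accident in
legacy 8-bit encodings:

#include <stddef.h>

static int looks_like_utf8(const unsigned char *p, size_t n)
{
    size_t i = 0;
    while (i < n) {
        unsigned char c = p[i];
        size_t len, k;
        if (c < 0x80)                len = 1;   /* ASCII */
        else if ((c & 0xE0) == 0xC0) len = 2;
        else if ((c & 0xF0) == 0xE0) len = 3;
        else if ((c & 0xF8) == 0xF0) len = 4;
        else return 0;                          /* invalid lead byte */
        if (i + len > n)
            return 0;                           /* truncated sequence */
        for (k = 1; k < len; k++)
            if ((p[i + k] & 0xC0) != 0x80)
                return 0;                       /* bad continuation byte */
        i += len;
    }
    return 1;
}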

S
 

Stephen Sprunk

I think Java's "modified UTF-8" also encodes UTF-16 surrogate pairs
by encoding each of the two 16-bit values using UTF-8 rules.

Ah, I forgot about that part. CESU-8 does that too, and many alleged
"UTF-8" implementations are actually CESU-8. I suspect it's usually
related to a UTF-16 implementation that doesn't handle surrogates
properly, which seems to be the rule rather than the exception.
That would be good. It should never appear in the wild because
other systems will report the overlong null encoding as an error and
will get the mis-coded surrogate pairs completely wrong.

http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8
"In normal usage, the Java programming language supports standard UTF-8
when reading and writing strings through InputStreamReader and
OutputStreamWriter. However it uses Modified UTF-8 for object
serialization, for the Java Native Interface, and for embedding constant
strings in class files."

S
 

Ben Bacarisse

Stephen Sprunk said:
Stephen Sprunk said:
On 04-Nov-13 04:21, Malcolm McLean wrote:
[...] Is it possible to spit out a UTF-8 string and have most
editors display it in a (polyglot) human-readable form?

We can quibble over "most", but many editors will do so, especially
if you put the (invalid) "UTF-8 BOM" at the start of your files.

I'd say that any *nix tool that only works when it sees an initial
ZERO WIDTH NO-BREAK SPACE (that's what the BOM is) is broken. Maybe
I'm taking "especially" the wrong way, but I'd try without and ditch
the tool if it did not respect my locale setting.

If you have a UTF-8-flavored locale, it should work without the BOM; if
you don't, the BOM will often cause software to ignore the locale and switch to
UTF-8.

Personally, I don't want that behaviour. If I don't have a
UTF-8-flavoured locale set, I want the software to respect that fact.
It's an ingenious solution to the proliferation of character
encodings on the Internet.

I've not come across it often, but every time it's been a pain, one way
or another. Maybe your mileage has been different, but I would rather
not see it inserted by any software.

The "UTF-8 BOM" is a legitimate ZWNBSP, of course, but one that the user
did not intentionally put in the file and, in most cases, is unable to
get rid of or even see, which causes problems with software that doesn't
understand UTF-8 or requires certain bytes at the start of a file, e.g.
Unix's #! syntax.

It causes problems even with software that does understand UTF-8.
Modern grep understands UTF-8, but grep ^# will miss a #include if a C
file starts with a ZWNBSP. Maybe the file is a list of files to tar, or
to pass to xargs or to... you name it, most likely the ZWNBSP will make
it go wrong.

<snip>
 

Stephen Sprunk

Personally, I don't want that behaviour. If I don't have a
UTF-8-flavoured locale set, I want the software to respect that
fact.

Unfortunately, a certain popular OS does not _allow_ the user to select
a UTF-8-flavored locale. IIRC, that OS also happens to be where the
proliferation of "UTF-8 BOM"s started, though due to an unrelated issue.

(I'm talking about Windows, of course. While it's now possible to
programmatically select CP_UTF8 for character-encoding conversions,
there is still no way to set it as the user's default.)
I've not come across it often, but every time it's been a pain, one
way or another. Maybe your mileage has been different, but I would
rather not see it inserted by any software.

It's not perfect, but in my experience it solves more problems than it
causes, particularly on a certain OS. YMMV.

It's certainly less common in the Unix world, but that's primarily
because most Unix software assumes UTF-8 by default anyway. That's
certainly a lot simpler than the mess that Microsoft has created.
It causes problems even with software that does understand UTF-8.
Modern grep understands UTF-8, but grep ^# will miss a #include if a
C file starts with a ZWNBSP. Maybe the file is a list of files to
tar, or to pass to xargs or to... you name it, most likely the ZWNBSP
will make it go wrong.

AFAIK, those programs _don't_ really understand UTF-8; they just deal
with strings of arbitrary bytes, and UTF-8 "just works" with little or
no code change required, which is again why it's so popular.

Yes, a "UTF-8 BOM" does sometimes cause problems, but a UTF-16 (or
UTF-32) BOM probably will as well in the same cases, if they're
supported at all. They're often not, since they require substantial
code changes to convert all your char strings to wchar_t strings, use
wcs*() rather than str*(), etc.
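
A sketch of the kind of change meant here (my own example; it assumes a
UTF-8 locale such as "en_US.UTF-8" is installed): the narrow string has
to be converted with mbstowcs() before wcslen() and friends are of any use:

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

int main(void)
{
    const char *narrow = "h\xc3\xa9llo";   /* UTF-8 "héllo": 6 bytes */
    wchar_t wide[32];
    size_t n;

    if (!setlocale(LC_ALL, "en_US.UTF-8")) {   /* need a UTF-8 locale */
        fputs("UTF-8 locale not available\n", stderr);
        return 1;
    }
    n = mbstowcs(wide, narrow, 32);            /* decode multibyte -> wchar_t */
    if (n == (size_t)-1) {
        perror("mbstowcs");
        return 1;
    }
    printf("strlen: %zu bytes, wcslen: %zu wide characters\n",
           strlen(narrow), wcslen(wide));      /* 6 vs 5 */
    return 0;
}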

S
 

Malcolm McLean

Yes, UTF-8 is almost always the best choice. It is usually the most
space efficient, it avoids little-endian/big-endian issues (which
Windows screwed up by making little-endian UTF-16/UCS-2 the default if
there is no BOM, even though the Unicode standard says big-endian is the
default), it means that non-international ASCII just works as expected,
and everything except character length functions can just treat strings
as old-fashioned ASCII strings.
Also string breaking and cursor positioning functions. Some things work with
UTF-8 transparently, some don't, and some will work reasonably well for
left-to-right languages but break with things like Hebrew, where the word
has to be written right to left and the base letters need to be decorated
with vowel marks (which are optional: it's the same word with or without
the vowels, which adds another point of difficulty).

From reading around, it seems that UTF-8 is the best option. But nothing is
problem-free.
 

Stephen Sprunk

Also string breaking and cursor positioning functions.

That gets into the difference between code points and grapheme clusters,
which is mostly orthogonal to the encoding. UTF-8 and UTF-16 both make
it rather easy to avoid splitting a single code point, unlike many other
encodings, but that's not enough.
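
For instance (a made-up illustration): precomposed U+00E9 and the
combining pair U+0065 U+0301 display as the same grapheme, so splitting
safely between code points is still not splitting safely between
characters as the user sees them:

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *precomposed = "\xc3\xa9";   /* U+00E9: 1 code point, 2 bytes */
    const char *combining   = "e\xcc\x81";  /* U+0065 U+0301: 2 code points, 3 bytes */

    /* On a UTF-8 terminal both print as "é". */
    printf("%s = %zu bytes, %s = %zu bytes\n",
           precomposed, strlen(precomposed), combining, strlen(combining));
    return 0;
}
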
Some things work with UTF-8 transparently, some don't, and some will
work reasonably well for left-to-right languages but break with
things like Hebrew, where the word has to be written right to left
and the base letters need to be decorated with vowel marks (which are
optional: it's the same word with or without the vowels, which adds
another point of difficulty).

Combining characters, directionality, equivalence, sorting and many
other features mean full Unicode support is a truly monstrous task.
However, many programs can either ignore it (treating text as opaque
binary blobs) or farm the work out to common libraries, depending on
what they actually _do_ with the text they're processing.
From reading around, it seems that UTF-8 is the best option. But
nothing is problem-free.

Agreed. Unicode is quite complex, especially in the code to actually
display (or worse, edit) it. No encoding is immune from its inherent
complexity, and in some ways that's actually the easiest part to deal
with, but some encodings (e.g. UTF-16) make it worse than necessary.

S
 

Ben Bacarisse

Stephen Sprunk said:
Unfortunately, a certain popular OS does not _allow_ the user to select
a UTF-8-flavored locale. IIRC, that OS also happens to be where the
proliferation of "UTF-8 BOM"s started, though due to an unrelated issue.

(I'm talking about Windows, of course. While it's now possible to
programmatically select CP_UTF8 for character-encoding conversions,
there is still no way to set it as the user's default.)


It's not perfect, but in my experience it solves more problems than it
causes, particularly on a certain OS. YMMV.

My remarks were about *nix. I know very little about Windows handling
of UTF-8 so I'm happy to take your word that adding a ZWNBSP to the
start of Windows UTF-8 files helps more than it hurts.
It's certainly less common in the Unix world, but that's primarily
because most Unix software assumes UTF-8 by default anyway.

I did not know that (but then there's a lot of different Unixes out
there). My impression was that most Unix software assumes the C locale
unless you tell it otherwise.

AFAIK, those programs _don't_ really understand UTF-8; they just deal
with strings of arbitrary bytes, and UTF-8 "just works" with little or
no code change required, which is again why it's so popular.

I don't get this at all. I can grep for "é?" and it will match an
optional e-acute character despite it being multi-byte. But if I prefix
the call with LANG=C it does not (as I'd expect and want).

However, that's not really the point. I should have said that the
initial ZWNBSP *also* causes problems with programs that understand
UTF-8. In other words, it's not that the program "doesn't understand
UTF-8" that causes problems (as you seemed to suggest) it's that the
blasted things is there are all. All the examples I gave go wrong
regardless of how well the programs understand UTF-8.

<snip>
 
