What is with you French? Nuking the Pacific is not enough?
Racist, on top of it. I've worked in both France and Germany,
and it is a fact of life that both languages have characters
which aren't present in ASCII, but which are more or less
necessary if the text is to be understood, or at least appear
normal. From what I've seen of other languages, this seems to
be the usual case. Long before Unicode, different regions
developed different encodings to handle non-US ASCII characters,
because a definite need for it was felt.
I think my claim is valid: most, i.e. 50% or more, of the text files
I use are ASCII. If it weren't for your .sig having a few 8859-1
characters in it, your posts would be ASCII as well.
Not all my posts. I frequently post to fr.comp.lang.c++ and
de.comp.lang.iso-c++ as well, and my posts there contain
characters which are not ASCII.
Formally, of course, the issue is far from simple. If you're
dealing with text data over the network, you have to be ready to
handle different code sets. In practice, most protocols will
insist on either one of the Unicode encodings or an encoding
which shares the first 128 characters with ASCII for the start
of the headers, until you've transmitted the information as to
which encoding you are actually using. And if you know that it
is text, and that it starts with a header, picking between
UTF-32BE, UTF-32LE, UTF-16BE, UTF-16LE and a byte encoding is
trivial, and that allows you to get through until you've read
the real encoding.
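To make the "trivial" part concrete, here is a minimal sketch of how
that choice could be made from the first four bytes. It assumes the
stream starts with a non-NUL ASCII character (a header letter, '<',
etc.), that there is no BOM, and that the text contains no NUL
characters. The names EncodingFamily and guessEncodingFamily are
made up for the example; they are not from any library.

```cpp
#include <cstddef>

// Hypothetical names, for illustration only.
enum class EncodingFamily {
    Utf32BE, Utf32LE, Utf16BE, Utf16LE, ByteOriented, Unknown
};

// Guess the encoding family from the first four bytes of a stream.
// Assumptions: the first character is non-NUL ASCII, there is no BOM,
// and the text itself contains no NUL characters.
EncodingFamily guessEncodingFamily(const unsigned char* p, std::size_t n)
{
    if (n < 4) {
        return EncodingFamily::Unknown;     // not enough data to decide
    }
    const bool z0 = p[0] == 0;
    const bool z1 = p[1] == 0;
    const bool z2 = p[2] == 0;
    const bool z3 = p[3] == 0;

    if (z0 && z1 && z2 && !z3) return EncodingFamily::Utf32BE;  // 00 00 00 xx
    if (!z0 && z1 && z2 && z3) return EncodingFamily::Utf32LE;  // xx 00 00 00
    if (z0 && !z1)             return EncodingFamily::Utf16BE;  // 00 xx .. ..
    if (!z0 && z1)             return EncodingFamily::Utf16LE;  // xx 00 .. ..
    if (!z0 && !z1)            return EncodingFamily::ByteOriented; // xx yy
    return EncodingFamily::Unknown;         // e.g. 00 00 00 00
}
```

A "byte encoding" result only says that the header can be read as
ASCII-compatible bytes; the actual code set (8859-1, UTF-8, ...) still
has to come from the header itself.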
And of course, most of the newer protocols just say: it has to
be UTF-8.
Still, even on Windows, most text files are created as 8 bit. The
only tool I use regularly that produces utf-16 files is regedit,
although it will read utf-8 files correctly.
I suspect very few applications will read utf-16 in a conforming way.
I don't know if ISO-10646 has been updated, but a while back, utf-16 was a
stateful encoding (it still is for all intents and purposes). Any
time you read a reversed BOM you need to swap endianness. I have met
very few programmers that know what a surrogate pair is.
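For anyone who hasn't met them: a surrogate pair is two 16 bit code
units (a high one in D800..DBFF followed by a low one in DC00..DFFF)
that together encode one code point above U+FFFF. A rough sketch of
a decoder follows. The function names are invented for the example,
and the handling of a reversed BOM (FFFE) anywhere in the stream just
follows the description above; it is not a claim about what the
current standard requires. Dropping every FEFF and replacing lone
surrogates with U+FFFD are simplifications, too.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper names, for illustration only.
constexpr bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
constexpr bool isLowSurrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Combine a high/low surrogate pair into one code point (U+10000..U+10FFFF).
constexpr char32_t combineSurrogates(char16_t high, char16_t low)
{
    return 0x10000
         + ((static_cast<char32_t>(high) - 0xD800) << 10)
         + (static_cast<char32_t>(low) - 0xDC00);
}

// Swap the two bytes of a 16 bit code unit.
constexpr char16_t swapBytes(char16_t u)
{
    return static_cast<char16_t>((u << 8) | (u >> 8));
}

// Decode UTF-16 code units to code points. The only state carried
// along is whether the byte order is currently swapped.
std::u32string decodeUtf16(const std::u16string& in)
{
    std::u32string out;
    bool swapped = false;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char16_t u = swapped ? swapBytes(in[i]) : in[i];
        if (u == 0xFFFE) {          // reversed BOM: flip the byte order
            swapped = !swapped;
            continue;
        }
        if (u == 0xFEFF) {          // BOM in the expected order: skip it
            continue;
        }
        if (isHighSurrogate(u) && i + 1 < in.size()) {
            const char16_t next = swapped ? swapBytes(in[i + 1]) : in[i + 1];
            if (isLowSurrogate(next)) {
                out.push_back(combineSurrogates(u, next));
                ++i;                // the low surrogate has been consumed
                continue;
            }
        }
        if (isHighSurrogate(u) || isLowSurrogate(u)) {
            out.push_back(0xFFFD);  // lone surrogate: replacement character
            continue;
        }
        out.push_back(u);           // ordinary BMP code unit
    }
    return out;
}
```

For example, the code units D83D DE00 decode to the single code point
U+1F600, and the units FFFE 4100 decode to "A" once the byte order has
been flipped.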
I have met very few programmers who even know that there exist
character sets which aren't encoded using single, 8 bit
characters. I'm not saying that ignorance isn't widespread,
but I will try to fight it, whenever I can.
Well, there are a lot of websites that claim to push utf-8, and most
browsers support utf-8 well - even bidi selection works like it should,
which is quite cool.
It's making headway. But a lot of code and text is old code and
text. And it's not going to go away anytime soon.
Yes. That's right. You need to have a lib that is robust enough to
tell you.
Or write one yourself.