raddog58c said:
The data is stored in UNICODE whether you require it or not.
Well, Unicode is not a storage encoding system, or anything like that.
Unicode is primarily a mapping from characters (in the linguistic, conceptual
sense, not in the C/C++ data type sense) to numbers. And you can't directly
store numbers in computers. You can store bitstreams, and thus you need an
extra step to encode the numbers into bitstreams. There are many such
encodings: ASCII, UTF-8, UTF-16, etc., some of them lossy (e.g. ASCII).
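To make the distinction concrete, here's a minimal Java sketch (Java being
the language under discussion) that encodes one string of characters into two
different bitstreams. I'm using the artist name from later in this post as
sample text; the byte values in the comments assume the standard UTF-8 and
US-ASCII charsets, with ASCII's lossiness showing up as '?' (byte 63):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingDemo {
    public static void main(String[] args) {
        String s = "TËЯRA"; // characters, i.e. Unicode code points

        // The same abstract text, encoded into two different bitstreams:
        byte[] utf8  = s.getBytes(StandardCharsets.UTF_8);
        byte[] ascii = s.getBytes(StandardCharsets.US_ASCII); // lossy encoding

        System.out.println(Arrays.toString(utf8));  // [84, -61, -117, -48, -81, 82, 65]
        System.out.println(Arrays.toString(ascii)); // [84, 63, 63, 82, 65] -- 63 is '?'
    }
}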
raddog58c said:
I'm not writing multinational code at this juncture. In 25+ years of
programming, the number of times I've needed multinational character
sets can be counted on one hand with fingers to spare.
Well, I don't know what kind of software you write, so I can't comment
much on that. But consider how many people have asked the developers of
WinAmp (a once-popular mp3 player) to support Unicode characters, so that
WinAmp could properly display the names of my English, French, Russian,
Japanese and Korean songs. They refused to do so, stating that 90% of the
Internet is English (a figure I'm sure they just made up).
There are several problems with this argument.
First of all, internet usage in Asia is huge. Gold farming (which
essentially comes down to playing video games online for pay) is a
1-billion-dollar business in Korea alone
(http://arstechnica.com/news.ars/post/20061227-8503.html), and playing video
games online is a tiny segment of the internet usage pie chart compared to
web browsing, e-mail or file sharing, for example. According to
http://www.internetworldstats.com/stats2.htm, North America accounts for
only 20% of internet usage, and while internet usage there has grown at a
rate of 100+% (i.e. doubled) over 7 years, internet usage in the rest of
the world has grown at a rate of 200+% (i.e. tripled) over the same period.
This last diagram really says it all:
http://www.internetworldstats.com/stats.htm
Second of all, just because one is an English-only speaker doesn't mean
one wouldn't benefit from the ability to display characters outside of ASCII
but within Unicode. Another poster presented the example of being able to
display mathematical symbols. I'll return to my mp3s for an additional
example.
One of the ID3 tags for my mp3s contains what I believe to be Russian
characters. I'm not sure, because I don't actually speak Russian. The artist
name can be viewed at
http://en.wikipedia.org/wiki/TËЯRA and it's
very easy for an English speaker to recognize: It's a T, an E with two dots
on top, a backwards R, a forwards R, and an A. And the pronunciation "Terra"
comes intuitively. But try to load an ID3 tag with this text via an
ASCII-only mp3 player, and you'll only see gibberish.
See, I don't even speak Russian, and yet I benefit from my software
being able to display Russian characters. That's why Unicode is more than
just "supporting other countries' languages". It's about being able to
represent text that you would normally find all around you in real life on
your computer.
raddog58c said:
You might find it archaic, but I find it wasteful. It's a waste
converting into and out of a format you never use.
What formats do you think one is converting to and from? There are bits
on the hard drive or in RAM, and you need to somehow semantically treat
these bits as if they represented text. From what I understand, in C you
actually manipulate these bits almost directly, and so an algorithm (e.g.
testing whether a character is numeric) designed to work with ASCII will
not work with EBCDIC, and vice versa. In Java, things are a bit more
high-level: you *don't* work directly with bits. Instead, you work with
characters. Theoretically, how these characters are represented in the JVM
shouldn't matter to you (in practice, for backwards-compatibility reasons,
it has "leaked out" that the internal representation is UTF-16-like). They
might internally be stored as UTF-16, UTF-8, or some crazy undocumented
internal format. It doesn't matter, because you shouldn't be manipulating
the bits that represent those characters; you should be dealing with the
characters directly. Any algorithm (e.g. testing whether a character is
numeric) will work regardless of the encoding, because the actual encoding
is (supposed to be) abstracted away.
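As a rough illustration of that difference, here's a small Java sketch; the
EBCDIC byte value is the standard EBCDIC code for the digit 5, and the
byte-level check is my own stand-in for the kind of ASCII-assuming test one
might write in C:

public class CharVsByteDemo {
    // A byte-level check that silently assumes ASCII, the way one might in C:
    static boolean isAsciiDigitByte(byte b) {
        return b >= 0x30 && b <= 0x39; // '0'..'9' in ASCII
    }

    public static void main(String[] args) {
        // In EBCDIC the digits '0'..'9' are encoded as 0xF0..0xF9, so the
        // ASCII-assuming byte test gets EBCDIC data wrong:
        byte ebcdicFive = (byte) 0xF5;
        System.out.println(isAsciiDigitByte(ebcdicFive)); // false

        // At the character level the encoding is already abstracted away:
        System.out.println(Character.isDigit('5')); // true, however the JVM
                                                    // stores chars internally
    }
}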
Now if you have a String of characters in memory, and you want to store
it on disk somehow, there are many encodings to do this, just like if you
wanted to store a binary tree on disk somehow, there are many encodings to
do this. *This* is where any "converting" might occur, though the term
"converting" is misleading: "encoding" would be a better term. You can
encode the text as ASCII, UTF-8, or some other format. And if you want to
read the bitstream from disk and convert it back to text, a decoding stage
occurs.
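Here's a minimal sketch of that encode/decode round trip in Java, assuming a
made-up file name and picking UTF-8 arbitrarily (any Charset could stand in
for it):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class EncodeDecodeDemo {
    public static void main(String[] args) throws IOException {
        String text = "TËЯRA"; // characters in memory

        // Encoding stage: characters -> bytes, in an encoding we choose explicitly.
        Files.write(Paths.get("artist.txt"), text.getBytes(StandardCharsets.UTF_8));

        // Decoding stage: bytes -> characters, using the same encoding.
        byte[] bytes = Files.readAllBytes(Paths.get("artist.txt"));
        String decoded = new String(bytes, StandardCharsets.UTF_8);

        System.out.println(decoded.equals(text)); // true: the round trip is lossless
    }
}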
In C, there's no similar stage, because once again, there's no
abstracting the encoding away from the text. If you want to replicate C's
behaviour in Java, rather than reading in text, read in bytes. Then, you
can manipulate the bytes in any way you like, and if you think these bytes
represent text, you'll have to guess at the encoding (ASCII? EBCDIC? UTF-8?)
just like you would with C. And just like in C, an "isNumeric()" algorithm
written with the assumption of ASCII will fail for other encodings. And just
like in C, no encoding or decoding stage secretly occurs beneath the covers.
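A hedged sketch of that byte-level, C-style approach in Java; the file name
is hypothetical, and the three charsets are just example guesses:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class RawBytesDemo {
    public static void main(String[] args) throws IOException {
        // Read raw bytes, C-style: no decoding happens behind the scenes.
        byte[] raw = Files.readAllBytes(Paths.get("id3-title.bin"));

        // If we believe the bytes represent text, we have to guess the encoding:
        String asAscii  = new String(raw, StandardCharsets.US_ASCII);
        String asUtf8   = new String(raw, StandardCharsets.UTF_8);
        String asLatin1 = new String(raw, StandardCharsets.ISO_8859_1);

        // Guess wrong and, just as in C, the result is gibberish.
        System.out.println(asAscii);
        System.out.println(asUtf8);
        System.out.println(asLatin1);
    }
}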
raddog58c said:
Why don't you convert your data into a Russian character set? Since
you're never communicating in Russian, when you need English, swap
back. What's the big deal?
Data loss. There isn't a one-to-one correspondence between Russian
characters and English characters.
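As a sketch of that data loss, here's what the suggested round trip might
look like in Java, assuming the JRE ships the KOI8-R charset (a common
Russian character set that standard JREs include, as far as I know):

import java.nio.charset.Charset;

public class RoundTripLossDemo {
    public static void main(String[] args) {
        Charset russian = Charset.forName("KOI8-R"); // assumed available in the JRE

        String original = "TËЯRA";
        // "Convert" into the Russian character set, then "swap back":
        String roundTripped = new String(original.getBytes(russian), russian);

        System.out.println(roundTripped); // T?ЯRA -- the Latin Ë has no KOI8-R
                                          // mapping, so it comes back as '?'
    }
}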
raddog58c said:
That's what I'm saying. It's conversion to a format that I'm not
personally using. Some people need it; some don't; yet we all pay for
it.
Let's say I never use the pipe character: |, and I'd be perfectly happy
if whenever someone sent me text containing the pipe character, it would
instead get converted to some lossy gibberish. Why don't we simply invent a
new encoding scheme, more efficient than ASCII, so that I wouldn't have to
pay for this character that I don't need? Well, we certainly *could* do
that, but it'd be a lot of work to support such a small proportion of all
computer users.
Similarly, people who speak only English form such a small proportion of
all computer users. It's been a lot of work trying to support these people.
Why should we all pay for it?
- Oliver