Chris Uppal
gk said: so, this means, each encoding recognises other encoding.....and thats
why they are able to revert back.
Not quite. Your argument is sensible but what you don't (yet ;-) know is that
all or nearly all character encodings overlap for a certain range of
characters. Specifically, the printable ASCII characters have the same
numerical values in CP1252, ISO8859-1, and nearly all other character encodings
(including ASCII). What's more, the code-points (numbers to you and me) that
Unicode assigns to those characters are the same too.
So the String "ABC" contains the chars with numerical values 0x41 0x42 0x43. If
we translate that to bytes using ISO8859-1 then we will get bytes with values
0x41 0x42 0x43. But don't let that mislead you: outside that limited range
(essentially the printable characters in the range 32-126) things become very
different.
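A minimal sketch of that overlap, using only the charsets every Java platform is required to provide. The class and method names (EncodingOverlap, hex, encodeHex) are made up for illustration. Plain ASCII text comes out as the same bytes whichever encoding is used, while a single accented character does not:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingOverlap {
    // Render bytes as space-separated upper-case hex, e.g. "41 42 43".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encode the text with the given charset and show the resulting bytes.
    static String encodeHex(String text, Charset cs) {
        return hex(text.getBytes(cs));
    }

    public static void main(String[] args) {
        Charset[] charsets = {
            StandardCharsets.US_ASCII,
            StandardCharsets.ISO_8859_1,
            StandardCharsets.UTF_8
        };
        // Inside the ASCII range: every charset gives identical bytes.
        for (Charset cs : charsets) {
            System.out.println(cs + " ABC -> " + encodeHex("ABC", cs)); // 41 42 43
            System.out.println(cs + " abc -> " + encodeHex("abc", cs)); // 61 62 63
        }
        // Outside it: e-acute (U+00E9) is one byte in ISO-8859-1,
        // two bytes in UTF-8, and unmappable (replaced by '?') in US-ASCII.
        for (Charset cs : charsets) {
            System.out.println(cs + " e-acute -> " + encodeHex("\u00E9", cs));
        }
    }
}
```

The last loop is where the encodings part company: ISO-8859-1 gives E9, UTF-8 gives C3 A9, and US-ASCII gives 3F because String.getBytes substitutes the charset's replacement byte for anything it cannot map.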
In a way that overlap is very handy. It means that if someone sends me an
old-fashioned, 8-bit, text file (not Unicode) written in English then the
chances are that I'll be able to read it without having to try to find out
what code-page the author used to create it. Which is a good thing because (a)
there's a good chance that the author hasn't got the faintest idea what a
code-page /is/ let alone which one s/he used to create the file, and (b) I
don't want to mess around trying to change code-page. Unfortunately, that only
works for text using the restricted range of characters. As soon as you start
using accented characters, or characters from non-English orthographies, the
whole thing breaks down and life becomes very awkward. Which is what Unicode
is /intended/ to avoid.
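To make the breakdown concrete, here is a small sketch of what happens when text is written with one encoding and read back with another. The names (WrongCodePage, misread, codePoints) are hypothetical, and the two charsets are just convenient stand-ins for "the author's code-page" and "my code-page":

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WrongCodePage {
    // Encode with the writer's charset, then decode with the reader's.
    static String misread(String text, Charset writer, Charset reader) {
        return new String(text.getBytes(writer), reader);
    }

    // List the code-points of a string, e.g. "U+0063 U+0061".
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        Charset utf8 = StandardCharsets.UTF_8;
        Charset latin1 = StandardCharsets.ISO_8859_1;

        // Pure ASCII survives the mismatch untouched.
        System.out.println(codePoints(misread("cafe", utf8, latin1)));

        // "cafe" with e-acute (U+00E9), written as UTF-8 and read as
        // ISO-8859-1: the two UTF-8 bytes turn into two separate characters.
        System.out.println(codePoints(misread("caf\u00E9", utf8, latin1)));

        // The other way round: the lone byte 0xE9 is not valid UTF-8, so the
        // decoder substitutes the replacement character U+FFFD.
        System.out.println(codePoints(misread("caf\u00E9", latin1, utf8)));
    }
}
```

The first line printed is the harmless case; the other two show the accented character turning into U+00C3 U+00A9 in one direction and U+FFFD in the other.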
But in a way, it's a very Bad Thing too. Because of the overlap, it's very
hard (at least for people handling mostly English text) to see when they've
made a mistake with their programming. Or when they've carelessly, or
sloppily, made assumptions about the code-page in use. It would be nice to
have (perhaps as part of the standard JDK) a debugging Charset which mapped
Unicode data to some sort of recognisable gibberish -- case-inverted or even
"rot13" would do. For all I know, there could be one there already, and I've
missed it...
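For what it's worth, here is a rough sketch of what such a debugging Charset might look like. Nothing like this ships with the JDK as far as I know: the class name Rot13Charset and the charset name "X-DEBUG-ROT13" are invented for the example. To keep it short I've assumed it only needs to handle 7-bit ASCII, so anything else is reported as unmappable or malformed:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical debugging charset: ASCII with the letters rotated by 13.
public class Rot13Charset extends Charset {
    public Rot13Charset() {
        super("X-DEBUG-ROT13", null); // invented name, no aliases
    }

    // Rotate ASCII letters by 13 places; leave everything else alone.
    static char rot13(char c) {
        if (c >= 'a' && c <= 'z') return (char) ('a' + (c - 'a' + 13) % 26);
        if (c >= 'A' && c <= 'Z') return (char) ('A' + (c - 'A' + 13) % 26);
        return c;
    }

    @Override
    public boolean contains(Charset cs) {
        // It can represent every ASCII character, and nothing more.
        return cs instanceof Rot13Charset || cs.name().equals("US-ASCII");
    }

    @Override
    public CharsetEncoder newEncoder() {
        return new CharsetEncoder(this, 1.0f, 1.0f) {
            @Override
            protected CoderResult encodeLoop(CharBuffer in, ByteBuffer out) {
                while (in.hasRemaining()) {
                    if (!out.hasRemaining()) return CoderResult.OVERFLOW;
                    char c = in.get();
                    if (c > 0x7F) {
                        in.position(in.position() - 1); // un-read the char
                        return CoderResult.unmappableForLength(1);
                    }
                    out.put((byte) rot13(c));
                }
                return CoderResult.UNDERFLOW;
            }
        };
    }

    @Override
    public CharsetDecoder newDecoder() {
        return new CharsetDecoder(this, 1.0f, 1.0f) {
            @Override
            protected CoderResult decodeLoop(ByteBuffer in, CharBuffer out) {
                while (in.hasRemaining()) {
                    if (!out.hasRemaining()) return CoderResult.OVERFLOW;
                    byte b = in.get();
                    if (b < 0) {
                        in.position(in.position() - 1); // un-read the byte
                        return CoderResult.malformedForLength(1);
                    }
                    out.put(rot13((char) b));
                }
                return CoderResult.UNDERFLOW;
            }
        };
    }

    public static void main(String[] args) {
        Charset debug = new Rot13Charset();
        byte[] bytes = "Hello".getBytes(debug);
        // Correct code decodes with the same charset and gets the text back.
        System.out.println(new String(bytes, debug)); // Hello
        // Code that silently assumes ASCII now shows obvious gibberish.
        System.out.println(new String(bytes, StandardCharsets.US_ASCII)); // Uryyb
    }
}
```

The idea is that any code path which encodes with one charset and decodes with an assumed one stops working "by accident" even for plain English text, which is exactly the kind of mistake the overlap normally hides.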
-- chris