Stroustrup says: "A variable of type 'char' can hold a
character of the implementation's character set."
I have numerous doubts related to character sets.
What do we mean by 'implementation's character set'? Do we
mean 'Operating System' by implementation? If yes, then is
the notion of char independent of compiler (i.e. dependent
only on OS)?
In this case (supposing that Stroustrup is trying to express
exactly what the standard says), the character set in question
is the "implementation's basic execution character set", and
the "implementation" is the C++ implementation (the compiler
and its library), not the operating system. In C++, the basic
execution character set consists of exactly 100 characters: the
96 characters of the basic source character set, plus '\a',
'\b', '\r' and '\0'. Anything else is part of the extended
execution character set.
The standard imposes a couple of constraints on the basic
character set: its members must be representable in a single
char (no multibyte characters), and they must have non-negative
values when stored in a char (no high bit set if char is
signed). It makes no such requirements for the extended
character set, however.
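As a rough illustration of what that guarantees (a sketch, not
a conformance test):

    #include <cassert>

    int main()
    {
        // Members of the basic execution character set must fit
        // in a single char and have non-negative values, even if
        // plain char is a signed type.
        char const basics[] = "0123456789abcxyzABCXYZ+-*/%;";
        for (char const* p = basics; *p != '\0'; ++p) {
            assert(*p >= 0);    // guaranteed for these characters
        }
        return 0;
    }

No such assertion would be safe for, say, 'é' on a platform
where plain char is signed.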
Does Japanese Win XP have a different character set than
English Win XP? If yes, then what are those character sets? How
can I find the names of these character sets?
That's all very implementation-defined. The Windows boxes I
have access to all use ISO 8859-1 as the default extended
execution character set; I don't know if that's universal,
however, or if it can be changed dynamically (by changing the
locale).
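If you only need the answer under Windows, one non-portable
possibility (an assumption on my part; the standard offers
nothing like it) is to ask for the active ANSI code page:

    #include <windows.h>
    #include <iostream>

    int main()
    {
        // GetACP() returns the ANSI code page used for char-based
        // text: e.g. 1252 on most Western European systems, 932
        // (Shift-JIS) on Japanese ones.
        std::cout << "ANSI code page: " << GetACP() << '\n';
        return 0;
    }

That number is about as close as Windows comes to naming the
character set in use.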
What is platform encoding? Will the platform encoding change
if I change the locale of the OS?
What IS the platform encoding? And how can you determine it
within the program? There are no simple and portable answers.
For wchar_t, Windows and AIX use UTF-16, I think; Linux uses
UTF-32, and Solaris UCS-4. The first two regardless of the
locale.
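You can at least observe the width of wchar_t on a given
platform; a minimal sketch:

    #include <climits>
    #include <iostream>

    int main()
    {
        // 16 bits suggests UTF-16 (Windows, AIX); 32 bits suggests
        // UTF-32/UCS-4 (Linux, Solaris). The width alone doesn't
        // prove the encoding, of course.
        std::cout << "wchar_t: " << sizeof(wchar_t) * CHAR_BIT
                  << " bits\n";
        return 0;
    }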
For char, on the other hand, the encoding may (and probably
does) depend on the locale. But the locale only determines how
the functions in the C++ library (and probably other libraries)
interpret the encoding; it has no effect outside of the program:
if you write UTF-8 to a file, and then send the file to a
printer which assumes ISO 8859-1, the printer will interpret
the bytes as ISO 8859-1, regardless of the locale which was
active when you wrote the file.
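To make that concrete, here's a sketch which writes two raw
bytes: they're 'é' to anything reading UTF-8, and 'Ã©' to
anything reading ISO 8859-1.

    #include <cstdio>

    int main()
    {
        // 0xC3 0xA9 is the UTF-8 encoding of 'é' (U+00E9). A reader
        // assuming ISO 8859-1 sees the same bytes as 'Ã' then '©'.
        // The bytes themselves carry no encoding information.
        unsigned char const bytes[] = { 0xC3, 0xA9 };
        std::fwrite(bytes, 1, sizeof(bytes), stdout);
        std::fputc('\n', stdout);
        return 0;
    }

Redirect the output to a file and open it in viewers configured
for each encoding, and you'll see the difference.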
I know these questions might be very easy for most of you
No, they're not. They're a source of problems for even the most
experienced programmers.
but I am seriously confused.
Just remember that the machine itself doesn't know anything
about character encodings. A char is just so many bits (usually
8), with a numeric value, and nothing more. It's the individual
programs which "interpret" the numeric value as a character, and
thus define the encoding. It's a question of the conventions
used by each of the programs to ensure that they interpret the
numeric values in the same way. C and C++ use the locale to
condition how the standard library interprets such encodings,
and hopefully any other libraries will do the same. Beyond
that, different
systems have different conventions, which are more or less
respected. (Unix, for example, defines a number of environment
variables which map directly to the categories in <locale.h>;
programs are supposed to respect these. Except that for
display, the terminal windows will use the encoding of the font
which has been selected, rather than one determined from the
environment variables.)
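A sketch of how a program picks those environment variables up,
assuming a Unix-like system where the locale name embeds the
encoding:

    #include <clocale>
    #include <cstdio>

    int main()
    {
        // Passing "" selects the locale named by the environment
        // (LC_ALL, LC_CTYPE, LANG, ...). The returned name often
        // hints at the encoding: "en_US.UTF-8", "fr_FR.ISO8859-1".
        char const* name = std::setlocale(LC_ALL, "");
        std::printf("locale: %s\n", name ? name : "(not supported)");
        return 0;
    }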
Can you please suggest some good books or online documents?
Regretfully, I don't know of any that cover everything. Perhaps
the one that comes the closest is "Fontes et Codages", by Yannis
Haralambous, but it's mainly concerned with display
considerations, and much less with program-internal
considerations; it's also very Unicode-oriented. (I'm pretty
sure that the book has been translated into English. Check
O'Reilly's pages, searching for the author, and you should find
it.)