FWIW: as far as I can tell, the API of most OSes is encoding
independent. You send out a stream of bytes, and cross your
fingers that wherever it goes interprets it in the same
encoding you use. Thus, for example, under X, the encoding
depends on the font being used. Create your files in an xterm
using UTF-8, and then do an ls in an xterm using ISO 8859-1, and
the results will be strange, to say the least.
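To make the xterm example concrete, here is a small sketch of what
happens when the same two bytes are read under both encodings. The
function names are mine, and the UTF-8 decoder assumes well-formed
input and does no validation, so treat it as an illustration rather
than something to ship.

```cpp
#include <cstddef>
#include <string>

// ISO 8859-1 maps each byte to the code point with the same value.
std::u32string decode_latin1(const std::string& bytes) {
    std::u32string out;
    for (unsigned char b : bytes) {
        out.push_back(static_cast<char32_t>(b));
    }
    return out;
}

// Illustrative UTF-8 decoder: assumes well-formed input, no validation.
std::u32string decode_utf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char b = bytes[i];
        // Number of continuation bytes implied by the lead byte.
        int extra = b < 0x80 ? 0 : b < 0xE0 ? 1 : b < 0xF0 ? 2 : 3;
        // Payload bits of the lead byte: 5, 4 or 3 bits for 2, 3 or 4 byte forms.
        char32_t cp = extra == 0 ? b : b & (0x3F >> extra);
        ++i;
        for (int k = 0; k < extra && i < bytes.size(); ++k, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}
```

Given the bytes 0xC3 0xA9, decode_utf8 yields the single code point
U+00E9 (e with acute accent), while decode_latin1 yields two code
points, U+00C3 and U+00A9, which is the sort of thing ls shows you in
the wrong xterm.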
The long-term tendency, of course, is to use UTF-8 everywhere,
at least externally. (Depending on what you are doing with the
text, it may be simpler to use UTF-32 internally. Although I'm
not really convinced---any serious text processing has to deal
with multi-word characters anyway.)
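A sketch of why UTF-32 doesn't give you one element per character:
an "e" followed by a combining acute accent is two char32_t elements
but one character as far as the user is concerned. The helper below
is hypothetical and only knows about the Combining Diacritical Marks
block (U+0300 to U+036F); real code would need the full Unicode
segmentation rules.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: recognizes only U+0300..U+036F, which is far
// from the complete set of combining characters.
bool is_combining_mark(char32_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

// Counts base characters, treating each combining mark as part of the
// preceding character. A rough approximation for illustration only.
std::size_t count_user_characters(const std::u32string& s) {
    std::size_t n = 0;
    for (char32_t c : s) {
        if (!is_combining_mark(c)) {
            ++n;
        }
    }
    return n;
}
```

For the string U"e\u0301", size() is 2 while count_user_characters
returns 1, so even with UTF-32 you can't index by character.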
Locales can affect some issues. In particular, the locale you
imbue an fstream with controls code translation when reading and
writing.
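A minimal sketch of imbuing a stream, assuming a wide stream so that
the locale's codecvt facet actually has something to translate. The
function name is mine. Which locale names exist (and which encodings
they imply) is platform specific, so the caller passes the locale in
rather than the sketch guessing a name.

```cpp
#include <fstream>
#include <locale>
#include <string>

// Reads the first line of `path`, converting external bytes to wchar_t
// according to the codecvt facet of `loc`.
std::wstring read_first_line(const std::string& path, const std::locale& loc) {
    std::wifstream in;
    in.imbue(loc);   // imbue before opening, so it applies from the first byte
    in.open(path);
    std::wstring line;
    std::getline(in, line);
    return line;
}
```

Called with std::locale::classic() this should only be relied on for
plain ASCII input; for anything else you would pass a locale
constructed from a name your system actually provides, and
std::locale's constructor throws std::runtime_error if it doesn't.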
No. You can't store anything through a void*.
That's easier said than done. You can imbue the stream with
locale "C" to begin with, read a number of bytes, guess the
encoding, seek to the beginning, imbue the correct locale for
that encoding, and then read the file in the desired encoding.
How well this works will depend, largely, on how well you can
guess, which in turn depends on a lot of external factors:
-- Some formats, e.g. HTML, provide for the information in
clear text. In such cases, you're not really guessing, you
know (except that you'll doubtlessly end up having to read
text whose authors didn't insert the necessary information).
-- If you know the input is Unicode, and is text, you can
usually determine which format from the first 10 or 20
bytes.
-- If you have to deal with different ISO 8859-n encodings,
it's almost impossible to determine which one is being used,
regardless of how many bytes you read. If you can find some
bytes with the top bit set, however, you should be able to
distinguish ISO 8859 from any of the Unicode formats.
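The read, guess and rewind steps above might be sketched as follows.
All of the names (Guess, looks_like_utf8, guess_encoding, sniff_file)
are mine. The guess uses byte order marks where present, which is
only one way of recognizing the Unicode formats, and otherwise falls
back on the top-bit test; the UTF-8 check looks at lead and
continuation byte structure only. It makes no attempt to tell one
ISO 8859-n from another.

```cpp
#include <cstddef>
#include <fstream>
#include <ios>
#include <locale>
#include <string>

enum class Guess { Utf8Bom, Utf16LE, Utf16BE, Utf32LE, Utf32BE, Ascii, Utf8, EightBit };

// Structural check only; a sequence cut off at the end of the buffer is tolerated.
bool looks_like_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = s[i++];
        int extra;
        if (b < 0x80) extra = 0;
        else if (b >= 0xC2 && b <= 0xDF) extra = 1;
        else if (b >= 0xE0 && b <= 0xEF) extra = 2;
        else if (b >= 0xF0 && b <= 0xF4) extra = 3;
        else return false;  // stray continuation byte or invalid lead byte
        for (; extra > 0 && i < s.size(); --extra, ++i) {
            if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) return false;
        }
    }
    return true;
}

Guess guess_encoding(const std::string& head) {
    auto starts = [&head](const std::string& bom) {
        return head.compare(0, bom.size(), bom) == 0;
    };
    // UTF-32LE must be tested before UTF-16LE: its BOM begins with FF FE too.
    if (starts(std::string("\xFF\xFE\x00\x00", 4))) return Guess::Utf32LE;
    if (starts(std::string("\x00\x00\xFE\xFF", 4))) return Guess::Utf32BE;
    if (starts("\xEF\xBB\xBF")) return Guess::Utf8Bom;
    if (starts("\xFF\xFE")) return Guess::Utf16LE;
    if (starts("\xFE\xFF")) return Guess::Utf16BE;

    bool high_bit = false;
    for (unsigned char b : head) {
        if (b >= 0x80) high_bit = true;
    }
    if (!high_bit) return Guess::Ascii;  // nothing to distinguish the candidates
    return looks_like_utf8(head) ? Guess::Utf8 : Guess::EightBit;
}

// Expects a stream opened in binary mode. Reads up to n bytes in the
// "C" locale, rewinds, and returns the guess; the caller then imbues
// whatever locale corresponds to the guess before reading for real.
Guess sniff_file(std::ifstream& in, std::size_t n = 64) {
    in.imbue(std::locale::classic());
    std::string head(n, '\0');
    in.read(&head[0], static_cast<std::streamsize>(n));
    head.resize(static_cast<std::size_t>(in.gcount()));
    in.clear();   // a short read sets eofbit and failbit
    in.seekg(0);  // back to the beginning
    return guess_encoding(head);
}
```

A result of EightBit only says "not ASCII and not UTF-8"; deciding
between the various 8 bit encodings still needs information from
outside the file.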
See http://www.unicode.org/ and
http://www.cl.cam.ac.uk/~mgk25/unicode.html. A fair amount of
Haralambous' excellent book, "Fonts & Encodings", is concerned
with Unicode as well.
Typically, regardless of your internal format, you have to deal
with a variety of external formats as well.
--
James Kanze (GABI Software)
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34