C++ and internationalization/localization


kasthurirangan.balaji

Hi,

I am looking for advice on i18n/l10n using C++. I understand the basics of
wchar_t and wstreams, and I have gone through the net for some knowledge.
Everything speaks of encodings like UTF-8/UTF-16/ISO 8859 and so on. I also
see that by using UTF-8, we can achieve i18n and l10n with plain char itself.
It would be great if someone could provide or point out C++ resources with
lots of example programs. Books are also welcome. References should also
cover persistence and network communications.

Thanks,
Balaji.
 

Daniel T.


I have localized several games over the years; you can often get away with a
single-byte encoding, depending on which languages you are localizing for.
For example, my current programs all use Latin-9 and are localized for
English, Spanish, French, German, Italian, Danish, Dutch, Norwegian, and
Swedish.

Of course, if you have to support Asian character sets, then a double-byte
encoding is practically a must. The best approach, IMO, is to come up with an
interface for dealing with strings that exposes UTF-16 (or maybe even UTF-32)
to callers, then write several classes that all implement that interface but
store the data internally in some other, possibly more compact, format like
UTF-16, UTF-8, or even Latin-1 or Latin-9. That way you can switch encodings
with just a recompile and measure which one makes the best use of resources.
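
Something along these lines; just a sketch, and the class and member names are
my own invention, nothing standard. Client code sees only UTF-32 code points,
while the storage encoding stays a per-implementation detail:

#include <cstddef>
#include <cstdint>
#include <string>

// Abstract interface: callers work in code points, implementations choose storage.
class LocalizedString {
public:
    virtual ~LocalizedString() {}
    virtual std::size_t length() const = 0;                      // length in code points
    virtual std::uint32_t codePointAt(std::size_t i) const = 0;  // UTF-32 view for callers
    virtual void append(std::uint32_t codePoint) = 0;
};

// One possible implementation: Latin-1 storage, one byte per character,
// enough for the western European languages mentioned above.
class Latin1String : public LocalizedString {
    std::string data_;
public:
    std::size_t length() const { return data_.size(); }
    std::uint32_t codePointAt(std::size_t i) const {
        // Latin-1 maps 1:1 onto U+0000..U+00FF
        return static_cast<unsigned char>(data_[i]);
    }
    void append(std::uint32_t cp) {
        data_ += static_cast<char>(cp);   // assumes cp <= 0xFF; a real class would check
    }
};

A UTF-8 or UTF-16 implementation would plug into the same interface, so
comparing memory use is just a matter of swapping the concrete class and
recompiling.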

One thing I haven't had to deal with is right-to-left or top-to-bottom
issues. The two Japanese titles I have done accepted left-to-right text
display without complaint.

My current batch of localizations is for the Nintendo DS, which has two
rather small screens (256 x 192) that limit the amount of text that can be
displayed on one line. On one of the titles, I am limited to 17 characters
per line in dialog, and many of the other text bits are limited to 10, and
sometimes even 8, characters. This has proven quite difficult in some
languages. Of course, these kinds of issues probably aren't a problem on
Windows systems, and I know they aren't an issue on Mac systems.
 

Martin York

Have a look at ICU:

http://www-306.ibm.com/software/globalization/icu/index.jsp

Basically you want to store the data in some portable format (one of the UTF
encodings is the easy choice). How you display it is then OS dependent. But
this is where ICU comes in: it allows conversion from pretty much anything to
anything, and UTF is a good starting point (though it will depend on your
requirements).
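
A minimal sketch of what that conversion looks like with ICU, assuming you
link against the icuuc library; UnicodeString stores UTF-16 internally:

#include <unicode/unistr.h>
#include <iostream>
#include <string>

int main() {
    // "Grüsse" spelled out as raw UTF-8 bytes so the source file encoding doesn't matter
    std::string utf8In = "Gr\xC3\xBCsse";

    // Decode UTF-8 into ICU's internal UTF-16 representation
    icu::UnicodeString text = icu::UnicodeString::fromUTF8(utf8In);

    // ...ICU can now collate, case-map, transcode, etc. Encode back to UTF-8:
    std::string utf8Out;
    text.toUTF8String(utf8Out);
    std::cout << utf8Out << '\n';   // displays correctly on a UTF-8 terminal
}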

Most modern OSes (Windows/Mac) have a Unicode API for displaying text (not
sure about UNIX). The plain char APIs are now usually simple wrappers that
call the Unicode version of the same function after some conversion.

Hope that helps.
 

kasthurirangan.balaji


Thanks Daniel/Martin. I have just started looking at ICU. Isn't it possible
to use the locale features of C++? I purchased the book "Standard C++
IOStreams and Locales" by Langer/Kreft. Can I not use void* to store the
data and use encoders/decoders, i.e. read the first few bytes, determine the
encoding from them, apply the corresponding decoding, and then display within
the appropriate locale? Also, I would like to know about UTF-8 vs. UTF-16. I
came across utf8cpp (sourceforge.net), which is a UTF-8 C++ library, but it
has converters for UTF-16 too; I do not know why.
I may be totally wrong, as I am very new to this subject.

Thanks,
Balaji.
 

Daniel T.

Isn't it possible to use the locale features of C++? I purchased the book
"Standard C++ IOStreams and Locales" by Langer/Kreft. Can I not use void* to
store the data and use encoders/decoders, i.e. read the first few bytes,
determine the encoding from them, apply the corresponding decoding, and then
display within the appropriate locale?

I don't really know much about the locale system. Since we work in a rather
tight niche, we don't have the luxury of lots of OS support; we had to roll
our own code, even including on-screen text display.

In answer to your question though: no, you cannot "read the first few bytes"
and then assume you know what encoding is being used for a particular chunk
of text. You have to know up front what encoding the text uses, or have a
standard method of determining the encoding through some sort of header. One
nice article I just found on the subject:
http://www.joelonsoftware.com/printerFriendly/articles/Unicode.html
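
The "header" approach, as a rough sketch: check for a Unicode byte order mark
at the start of the file. Note that the absence of a BOM only means
"unknown"; it doesn't prove the data is Latin-9 or anything else.

#include <cstddef>
#include <fstream>
#include <string>

// Returns the encoding suggested by a leading BOM, or "unknown" if none is found.
std::string guessEncodingFromBOM(const char* filename) {
    std::ifstream in(filename, std::ios::binary);
    unsigned char b[4] = { 0, 0, 0, 0 };
    in.read(reinterpret_cast<char*>(b), 4);
    std::size_t n = static_cast<std::size_t>(in.gcount());

    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return "UTF-8";
    // UTF-32 must be tested before UTF-16, because its BOM also starts with FF FE
    if (n >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00) return "UTF-32LE";
    if (n >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF) return "UTF-32BE";
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE) return "UTF-16LE";
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF) return "UTF-16BE";
    return "unknown";
}
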
Also, I would like to know about UTF-8 vs. UTF-16. I came across utf8cpp
(sourceforge.net), which is a UTF-8 C++ library, but it has converters for
UTF-16 too; I do not know why.

It probably has those converters because Windows uses UTF-16LE for a lot of
stuff (for example, Excel spreadsheets can be saved as "Unicode text" files,
which are actually UTF-16LE).

I took a quick look at the docs for the library, and although I'm sure it can
be helpful for someone who is very careful, it isn't very type-safe at all.
As a result, you would have to use Hungarian notation, or some other means,
to keep track of which strings contain which encodings.

You, and your company, need to make some decisions up front.

1) What will be the default encoding for your projects?

This will be the encoding that is used for all the strings the program
stores internally. The strings themselves will have to be in some
external file and referenced in code through some sort of mapping. If
you are writing for just one OS, then you should probably use whatever
default the OS itself uses, and look to see what tools the OS provides
to manage that data. You may want to decide the default encoding on a
project-by-project basis. I have written some programs for my company that
default to ASCII, some to Latin-1, some to Latin-9, and some to UTF-16LE.
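
A hypothetical sketch of that kind of mapping; the file format and every name
here are made up, purely to illustrate the "strings live outside the code"
idea:

#include <fstream>
#include <map>
#include <string>

class StringTable {
    std::map<std::string, std::string> strings_;   // id -> text in the project's default encoding
public:
    // Assumed file format: one "id<TAB>text" pair per line, one file per language.
    bool load(const char* filename) {
        std::ifstream in(filename);
        std::string line;
        while (std::getline(in, line)) {
            std::string::size_type tab = line.find('\t');
            if (tab != std::string::npos)
                strings_[line.substr(0, tab)] = line.substr(tab + 1);
        }
        return in.eof();   // true if we read to the end rather than failing part way
    }

    const std::string& get(const std::string& id) const {
        static const std::string missing = "???";   // make untranslated ids visible in testing
        std::map<std::string, std::string>::const_iterator it = strings_.find(id);
        return it == strings_.end() ? missing : it->second;
    }
};

Code then asks for table.get("menu.quit") and never embeds the text itself,
so swapping "strings_fr.txt" for "strings_de.txt" is a data change, not a
code change.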

2) What other encodings do your projects need to support?

Depending on the type of project, you may not need to support any other
encodings; if that's the case, then great, you can keep everything uniform.
Otherwise, you will need converters from each encoding to your default and
back, you will have to decide what to do when one of the two encodings
contains a character that isn't supported by the other, and you will have to
decide how you are going to tag text so you can know what encoding it is
using.
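
For the "character not supported by the other encoding" decision, one common
policy is to substitute a placeholder. A hand-rolled sketch, converting UTF-8
down to Latin-1 with '?' for anything that doesn't fit (no validation of
malformed or overlong sequences):

#include <cstddef>
#include <string>

std::string utf8ToLatin1(const std::string& in) {
    std::string out;
    for (std::string::size_type i = 0; i < in.size(); ) {
        unsigned char lead = in[i];
        unsigned long cp;
        std::size_t len;
        if      (lead < 0x80)           { cp = lead;        len = 1; }   // ASCII
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }   // 2-byte sequence
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }   // 3-byte sequence
        else                            { cp = lead & 0x07; len = 4; }   // 4-byte sequence
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k)
            cp = (cp << 6) | (in[i + k] & 0x3F);              // fold in continuation bytes
        out += (cp <= 0xFF) ? static_cast<char>(cp) : '?';    // '?' marks unmappable characters
        i += len;
    }
    return out;
}

In production you would more likely lean on ICU or the OS conversion
routines, but the policy question (replace, drop, or reject) is yours either
way.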

3) How are you going to handle formatting issues?

If your program deals with weights and measures, money, or some other
data that is displayed differently depending on the language, you need
to be ready to handle that.
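
The standard locale machinery already carries a lot of this formatting. A
small sketch; the POSIX-style locale names are assumptions and may be spelled
differently (or be missing) on any given system:

#include <iostream>
#include <locale>
#include <stdexcept>

int main() {
    double amount = 1234.56;
    try {
        std::cout.imbue(std::locale("de_DE.UTF-8"));   // German grouping: 1.234,56
        std::cout << amount << '\n';
        std::cout.imbue(std::locale("en_US.UTF-8"));   // US grouping: 1,234.56
        std::cout << amount << '\n';
    } catch (const std::runtime_error&) {
        // the named locale isn't installed on this machine
    }
}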

4) What about hyphenation?

You may not have to worry about this at all, I do. Like I said before,
some of my text blocks have a 17 character line max. Several of the
languages that the program displays in have words (individual words mind
you) that are more than 17 characters long. Some of my fonts are so
large that they can only display 8 characters on one line. Of course the
goal is to reduce your dependence on these sorts of issues, and that
means knowing about them up front. With the Mac OS, the whole window or
dialog box can be different depending on the language, different sized
buttons, in different places, and even a different sized window. I don't
have that sort of luxury right now though.
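
For what it's worth, the wrapping itself (ignoring proper hyphenation, which
is language-specific) is the easy part. A naive greedy wrap for a fixed
per-line character budget; it counts bytes, so it is only correct for
single-byte encodings:

#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

std::vector<std::string> wrap(const std::string& text, std::size_t maxLen) {
    std::vector<std::string> lines;
    std::istringstream words(text);
    std::string word, line;
    while (words >> word) {
        if (word.size() > maxLen) {                     // over-long word: flush, then hard-split
            if (!line.empty()) { lines.push_back(line); line.clear(); }
            while (word.size() > maxLen) {
                lines.push_back(word.substr(0, maxLen));
                word.erase(0, maxLen);
            }
        }
        if (line.empty())                                 line = word;
        else if (line.size() + 1 + word.size() <= maxLen) line += ' ' + word;
        else                      { lines.push_back(line); line = word; }
    }
    if (!line.empty()) lines.push_back(line);
    return lines;
}

The hard part is exactly what's described above: deciding what to do when a
single German or Dutch word simply doesn't fit in 17 (or 8) characters.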
 

James Kanze

FWIW: as far as I can tell, the API of most OSes is encoding-independent. You
send out a stream of bytes and cross your fingers that wherever it goes
interprets it in the same encoding you use. Thus, for example, under X, the
encoding depends on the font being used. Create your files in an xterm using
UTF-8, then do an ls in an xterm using ISO 8859-1, and the results will be
strange, to say the least.

The long-term tendency, of course, is to use UTF-8 everywhere, at least
externally. (Depending on what you are doing with the text, it may be simpler
to use UTF-32 internally. Although I'm not really convinced: any serious text
processing has to deal with multi-word characters, such as combining
sequences, anyway.)
Isn't it possible to use the locale features of C++? I purchased the book
"Standard C++ IOStreams and Locales" by Langer/Kreft.

Locales can affect some issues. In particular, the locale you
imbue an fstream with controls code translation when reading and
writing.
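
A sketch of that mechanism: a wide file stream whose imbued locale carries a
UTF-8 conversion facet, so the bytes on disk are decoded into wchar_t as you
read. std::codecvt_utf8 is C++11 (deprecated since C++17, but it illustrates
the point):

#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

int main() {
    std::wifstream in("text_utf8.txt");
    // Attach the UTF-8 facet before any reading happens
    in.imbue(std::locale(in.getloc(), new std::codecvt_utf8<wchar_t>));

    std::wstring line;
    while (std::getline(in, line)) {
        // 'line' holds decoded wide characters, not raw UTF-8 bytes
    }
}
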
Can I not use void* to store the data and use encoders/decoders ...

No. You can't store anything through a void*; at some point, something still
has to know what the bytes actually mean.
... like read the first few bytes, determine the encoding from them, apply
the corresponding decoding, and then display within the appropriate locale?

That's easier said than done. You can imbue the stream with locale "C" to
begin with, read a number of bytes, guess the encoding, seek back to the
beginning, imbue the correct locale for that encoding, and then read the file
in the desired encoding. How well this works will depend, largely, on how
well you can guess, which in turn depends on a lot of external factors:

 -- Some formats, e.g. HTML, provide for the information in
    clear text.  In such cases, you're not really guessing, you
    know (except that you'll doubtlessly end up having to read
    text whose authors didn't insert the necessary information).

 -- If you know the input is Unicode, and is text, you can
    usually determine which format from the first 10 or 20
    bytes.

 -- If you have to deal with different ISO 8859-n encodings,
    it's almost impossible to determine which one is being used,
    regardless of how many bytes you read.  If you can find some
    bytes with the top bit set, however, you should be able to
    distinguish ISO 8859 from any of the Unicode formats.
Also, I would like to know about UTF-8 vs. UTF-16.

See http://www.unicode.org/ and
http://www.cl.cam.ac.uk/~mgk25/unicode.html. A fair amount of
Haralambous's excellent book, "Fonts & Encodings", is concerned
with Unicode as well.
I came across utf8cpp (sourceforge.net), which is a UTF-8 C++ library, but it
has converters for UTF-16 too; I do not know why.

Typically, regardless of your internal format, you have to deal
with a variety of external formats as well.
 

kasthurirangan.balaji

Thanks to all for your replies. I have just started, and I will post my
findings in a new thread.

Thanks,
Balaji.
 
