Steve said:
I've been charged with investigating the possibilities of internationalizing
[is that a word?!]
Unfortunately yes.
To investigate, use Google to find the BBC news web sites for these locales:
- Spanish (because it's easy)
- Chinese (because it's hard)
- Arabic (right-to-left)
- Sanskrit (super-hard)
Now use your libraries in applications that transmit strings you copied
from the web sites. Visually check to ensure your GUI outputs glyphs that
look the same as what you copied from the web sites.
Hindi in Win32 is hardest because the OS supports all the others as explicit
code pages. But Win32 only displays the Devanagari character set if you
compile your program in that magic UNICODE mode, forcing wide strings.
std::strings are used all over the place, and unfortunately a mixture of
isalpha,isdigit,etc. functions from the C library and C++ locale stuff.
Okay. Take the unit tests for your library that pass strings, copy each test
case, and upgrade the copy to send it
(Oh, you don't have unit tests? You have a bigger problem, so read /Working
Effectively with Legacy Code/ by Mike Feathers, and add them first before
making such a radical change.)
To fully embrace i18n I'm wondering if we have to fully make the switch to
everything being wide ( wstring, wcin, wcout, wide streams, etc.)
That's not the whole story. I don't know if the following dissertation will
help proportional to its length, but it's essentially all I know about the
topic...
The word "glyph" has five glyphs and four phonemes. A "phoneme" is the
smallest difference in sound that can change a word's meaning. For example,
f is softer than ph, so flip has a meaning different than ... you get the
idea.
"Ligatures" are links between two glyphs, such as fl, with a link at the
top. "Accented" characters, like á, might be considered one glyph or two.
And many languages use "vowel signs" to modify consonants to introduce
vowels, such as the tilde in the Spanish word niña ("neenya"), meaning
"girl".
A "script" is a set of glyphs that write a language. A "char set" is a table
of integers, one for each glyph in a script. A "code point" is one glyph's
index in that char set. Programmers often say "character" when they mean
"one data element of a string", so it could casually mean either 8-bit char
elements or 16-bit wchar_t elements. An "encoding" is a way to pack a char
set as a sequence of characters, all with the same bit-count. A "code page"
is an identifier to select an encoding. A "glossary" is a list of useful
phrases translated into two or more languages. A "collating order" sorts a
culture's glyphs so readers can find things in lists by name. A "locale" is
a culture's script, char set, encoding, collating order, glossary, icons,
colors, sounds, formats, and layouts, all bundled into a seamless GUI
experience.
Different locales required different encodings and character widths for
various reasons. In the beginning, there was ASCII, based on encoding the
Latin alphabet, without accent marks, into a 7-bit protocol. Early systems
reserved the 8th bit for a parity check. Then cultures with short phonetic
alphabets computerized their own glyphs. Each culture claimed the same
"high-ASCII" range of the 8 bits in a byte: the ones with the 8th bit turned
on. User interface software, to enable more than one locale, selects the
"meaning" of the high-ASCII characters by selecting a "code page". On some
hardware devices, this variable literally selected the hardware page of a
jump table to convert codes into glyphs.
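To make the code page idea concrete, here is a minimal sketch of how one
high-ASCII byte changes meaning when the page changes. The names CodePage and
decode_byte are mine, invented for this post, and the function only covers
the letter ranges of ISO 8859-1 and ISO 8859-7, not the full tables.

```cpp
#include <cstdint>

// Hypothetical names for this sketch; these are not Win32 identifiers.
enum class CodePage { Latin1, Greek };   // ISO 8859-1 and ISO 8859-7

// Map one 8-bit character to a Unicode code point under the chosen page.
// Returns 0xFFFD (REPLACEMENT CHARACTER) for bytes this sketch skips.
std::uint32_t decode_byte(unsigned char byte, CodePage page)
{
    if (byte < 0x80)
        return byte;                     // both pages leave ASCII in place

    switch (page) {
    case CodePage::Latin1:
        if (byte >= 0xA0)
            return byte;                 // Latin-1 maps straight across
        break;
    case CodePage::Greek:
        // The Greek letters sit at a fixed offset; 0xD2 is unassigned.
        if (byte >= 0xC1 && byte <= 0xFE && byte != 0xD2)
            return byte + 0x2D0u;
        break;
    }
    return 0xFFFD;
}
```

The same byte, 0xE1, comes back as U+00E1 (a with acute) under
CodePage::Latin1 and as U+03B1 (alpha) under CodePage::Greek.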
Modern GUIs still use code page numbers, typically defined by the
International Organization for Standardization (ISO), or its member
committees. The ISO
8859-7 encoding, for example, stores Latin characters in their ASCII
locations, and Greek characters in the high-ASCII. Internationalize a
resource file to Greek like this:
LANGUAGE LANG_GREEK, SUBLANG_NEUTRAL
#pragma code_page(1253)
STRINGTABLE DISCARDABLE
BEGIN
IDS_WELCOME "Υποδοχή στην Ελλάδα."
END
The quoted Greek words might appear as garbage on your desktop, in a real RC
file, in a USENET post, or in a compiled application. On WinXP, fix this by
opening the Regional and Language Options applet, and switching the combo
box labeled "Select a language to match the language version of the
non-Unicode programs you want to use" to Greek.
That user interface verbiage uses "non-Unicode" to mean the "default code
page". When a program runs using that resource, the code page "1253"
triggers the correct interpretation, as (roughly) ISO 8859-7.
MS Windows sometimes supports more than one code page per locale. The two
similar pages, 1253 and ISO 8859-7, differ by a couple of glyphs.
Some languages require more than 127 glyphs. To fit these locales within
8-bit hardware, more complex encodings map some glyphs into more than one
byte. The bytes without their 8th bit still encode ASCII, but any byte with
its 8th bit set is a member of a short sequence of multiple bytes that
require some math formula to extract their actual char set index. These
"Multiple Byte Character Sets" support locale-specific code pages for
cultures from Arabia to Vietnam. However, you cannot put glyphs from too
many different cultures into the same string. OS support functions cannot
expect strings with mixed code pages.
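As a rough illustration of that "math formula" style of parsing, this sketch
counts glyph-level characters in a byte string. I picked EUC-KR (Korean) as
the example encoding because its rule is simple: a byte below 0x80 is ASCII,
and a byte in 0xA1..0xFE starts a two-byte pair. The function name is
hypothetical, and real MBCS code should use the OS conversion functions.

```cpp
#include <cstddef>
#include <string>

// Count characters in a byte string, assuming EUC-KR rules.
// Returns the count, or -1 if a pair is truncated or out of range.
long count_euc_kr_chars(const std::string& bytes)
{
    auto in_pair_range = [](unsigned char b) {
        return b >= 0xA1 && b <= 0xFE;
    };

    long count = 0;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        if (lead < 0x80) {
            i += 1;                          // plain ASCII, one byte
        } else {
            if (!in_pair_range(lead) || i + 1 >= bytes.size())
                return -1;                   // stray or truncated lead byte
            const unsigned char trail =
                static_cast<unsigned char>(bytes[i + 1]);
            if (!in_pair_range(trail))
                return -1;                   // malformed trail byte
            i += 2;                          // one glyph, two bytes
        }
        ++count;
    }
    return count;
}
```

A four-byte string holding "A", the pair 0xB0 0xA1, and "B" counts as three
characters, which is why strlen() is the wrong tool for such strings.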
Sanskrit shares a very popular script called Devanagari with several other
Asian languages. (Watch the movie "Seven Years in Tibet" to see a big
ancient document, written with beautiful flowing Devanagari, explaining why
Brad Pitt is not allowed in Tibet.)
Devanagari's code page could have been 57002, based on the standard "Indian
Script Code for Information Interchange". MS Windows does not support this
locale-specific code page. Accessing Devanagari and writing Sanskrit (or
most other modern Indian languages) requires the Mother of All Char Sets,
Unicode.
ISO 10646, and the "Unicode Consortium", maintain the complete char set
of all humanity's glyphs. To reduce the total count, Unicode supplies many
shortcuts. For example, many fonts place glyph clusters, such as accented
characters, into one glyph. Unicode usually defines each glyph component
separately, and relies on software to merge glyphs into one letter. That
rule helps Unicode not fill up with all permutations of combinations of
ligating accented modified characters.
Many letters, such as ñ, have more than one Unicode representation. Such a
glyph could be a single code point (L"\xF1"), grandfathered in from a
well-established char set, or could be a composition of two glyphs
(L"n\x303"). The C languages introduce wide string literals with an L; on
Win32 their characters are 16 bits wide.
Text handling functions must not assume each data character is one glyph, or
compare strings using naïve character comparisons. Functions that process
Unicode support commands to merge all compositions, or expand all
compositions.
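Here is a toy sketch of what "merge all compositions" means, for the single
pair n + COMBINING TILDE only. The function name compose_tilde_n is made up
for this post. Real normalization covers thousands of pairs, so treat this as
an illustration of the idea, not as a substitute for a Unicode library.

```cpp
#include <cstddef>
#include <string>

// Toy composition: fold "n" + U+0303 into U+00F1, leave the rest alone.
std::wstring compose_tilde_n(const std::wstring& in)
{
    std::wstring out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == L'n' && i + 1 < in.size() && in[i + 1] == L'\x303') {
            out += L'\xF1';   // LATIN SMALL LETTER N WITH TILDE
            ++i;              // skip the COMBINING TILDE we just merged
        } else {
            out += in[i];
        }
    }
    return out;
}
```

Before composing, L"\xF1" and L"n\x303" have different lengths (1 and 2) and
compare unequal, even though both display as the same letter. After passing
the second one through compose_tilde_n() they match.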
The C languages support a wide character type, wchar_t (16 bits on Win32),
and a matching wcs*() function for every str*() function. The strcmp()
function, to compare 8-bit strings, has a matching wcscmp() function to
compare wide strings.
These functions return 0 when their string arguments match.
Irritatingly, documentation for wcscmp() often claims it can compare
"Unicode" strings. This Characterization Test demonstrates how that claim
misleads:
TEST_(TestCase, Hoijarvi)
{
    std::string str("Höijärvi");
    WCHAR composed[20] = {0};

    // Convert from the default code page to wide characters,
    // decomposing each accented glyph into base + combining mark.
    MultiByteToWideChar(
        CP_ACP,
        MB_COMPOSITE,
        str.c_str(),
        -1,
        composed,
        sizeof composed / sizeof composed[0]  // size in WCHARs, not bytes
    );

    CPPUNIT_ASSERT(0 != wcscmp(L"Höijärvi", composed));
    CPPUNIT_ASSERT(0 == wcscmp(L"Ho\x308ija\x308rvi", composed));
    CPPUNIT_ASSERT(0 == lstrcmpW(L"Höijärvi", composed));

    CPPUNIT_ASSERT_EQUAL
    (
        CSTR_EQUAL,
        CompareStringW
        (
            LOCALE_USER_DEFAULT,
            NORM_IGNORECASE,
            L"höijärvi", -1,
            composed, -1
        )
    );
}
The test starts with an 8-bit string, "Höijärvi", expressed in this post's
code page, ISO 8859-1, also known as Latin 1. Then MultiByteToWideChar()
converts it into a Unicode string with all glyphs decomposed into their
constituents.
The first assertion reveals that wcscmp() compares raw characters, and
thinks "ö" differs from "o\x308", where \x308 is the COMBINING DIAERESIS
code point.
The second assertion proves the exact bits inside composed contain primitive
o and a glyphs followed by combining diæreses.
This assertion...
CPPUNIT_ASSERT(0 == lstrcmpW(L"Höijärvi", composed));
...reveals the MS Windows function lstrcmpW() correctly matches glyphs, not
their constituent characters.
The long assertion with CompareStringW() demonstrates how to augment
lstrcmpW()'s internal behavior with more complex arguments.
If we pushed this experiment into archaic Chinese glyphs, it would soon show
that wchar_t cannot hold all glyphs equally, each at their raw Unicode
index. Despite Unicode's careful paucity, human creativity has spawned more
than 65,535 code points.
Whatever the size of your characters, you must store Unicode using its own
kind of Multiple Byte Character Set.
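For 16-bit characters, that scheme is UTF-16, which splits any code point
above 0xFFFF into a "surrogate pair" of two 16-bit units. A minimal sketch,
assuming the input is already a valid code point (the function name to_utf16
is mine):

```cpp
#include <string>

// Encode one Unicode code point as UTF-16 code units.
// Assumes cp is valid: at most 0x10FFFF and not itself a surrogate.
std::u16string to_utf16(char32_t cp)
{
    std::u16string out;
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);            // fits in one unit
    } else {
        const char32_t v = cp - 0x10000;             // 20 bits remain
        out += static_cast<char16_t>(0xD800 + (v >> 10));    // high half
        out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));  // low half
    }
    return out;
}
```

A common Chinese glyph such as U+4E2D stays one unit, but U+20000, from the
block of rarer ideographs, becomes the pair 0xD840 0xDC00. So even in wide
strings, one glyph can take more than one character.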
The Unicode Transformation Formats (UTF-8, UTF-16) convert raw Unicode code
points into sequences of characters of a fixed bit width.
MS Windows, roughly speaking, represents UTF-8 as a code page among many.
However, roughly speaking again, when an application compiles with the
_UNICODE flag turned on, and executes on a version of Windows derived from
WinNT, it obeys UTF-16 as a code page, regardless of locale.
Because a _UNICODE-enabled application can efficiently use UTF-16 to store a
glyph from any culture, such applications needn't link their locales to
specific code pages. They can manipulate strings containing any glyph. In
this mode, all glyphs are created equal.
Put another way, UTF-8 can store characters of any Unicode code point, but
Win32 programs can only easily make use of UTF-16 characters.
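To show that UTF-8 really is just another multiple-byte encoding, here is a
minimal sketch of an encoder for one code point. The name to_utf8 is mine,
and the function assumes its input is a valid code point; it does no error
checking.

```cpp
#include <string>

// Encode one Unicode code point as UTF-8 bytes.
// Assumes cp is valid: at most 0x10FFFF and not a surrogate.
std::string to_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                         // ASCII
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));           // 2 bytes
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));          // 3 bytes
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));          // 4 bytes
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

ASCII stays one byte, ñ (U+00F1) becomes 0xC3 0xB1, the Devanagari letter KA
(U+0915) becomes 0xE0 0xA4 0x95, and U+20000 takes four bytes. Every byte of
a multi-byte sequence has its 8th bit set, just like the older MBCS pages.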
Does anybody have any good advice, pointers, websites, etc. that could
help.
Read /Developing International Software/ by MS Press.
One other question. C++ has the concept of locales and you can declare
locale objects. They define all the peculiarities of different languages.
Now, obviously, the C++ standard doesn't define all these languages, so my
question is - who does? For example, if I set the locale to, say, Norwegian,
how does my app look for the finer details of the Norwegian language?
Wow. I need to learn the answer to that one, too.