Going wide and international

Steve · Jun 29, 2005

Hi,

I've been charged with investigating the possibilities of internationalizing
[is that a word?!] our C++ libraries.

std::strings are used all over the place, and unfortunately a mixture of
isalpha,isdigit,etc. functions from the C library and C++ locale stuff.

To fully embrace i18n I'm wondering if we have to fully make the switch to
everything being wide ( wstring, wcin, wcout, wide streams, etc.)

Does anybody have any good advice, pointers, websites, etc. that could help?

One other question. C++ has the concept of locales and you can declare
locale objects. They define all the peculiarities of different languages.
Now, obviously, the C++ standard doesn't define all these languages, so my
question is - who does? For example, if I set the locale to, say, Norwegian,
how does my app look for the finer details of the Norwegian language?

Thanks for any help.

Phlip · Jun 29, 2005

Steve said:
I've been charged with investigating the possibilities of internationalizing
[is that a word?!]

Unfortunately yes.

our C++ libraries.

To investigate, use Google to find the BBC news web sites for these locales:

- Spanish (because it's easy)
- Chinese (because it's hard)
- Arabic (right-to-left)
- Sanskrit (super-hard)

Now use your libraries in applications that transmit strings you copied
from the web sites. Visually check to ensure your GUI outputs glyphs that
look the same as what you copied from the web sites.

Hindi in Win32 is hardest because the OS supports all the others as explicit
code pages. But Win32 only displays the Devanagari character set if you
compile your program in that magic UNICODE mode, forcing wide strings.

std::strings are used all over the place, and unfortunately a mixture of
isalpha,isdigit,etc. functions from the C library and C++ locale stuff.

Okay. Take the unit tests for your library that pass strings, copy each test
case, and upgrade the copy to send it

(Oh, you don't have unit tests? You have a bigger problem, so read /Working
Effectively with Legacy Code/ by Mike Feathers, and add them first before
making such a radical change.)

To fully embrace i18n I'm wondering if we have to fully make the switch to
everything being wide ( wstring, wcin, wcout, wide streams, etc.)

That's not the whole story. I don't know if the following dissertation will
help proportional to its length, but it's essentially all I know about the
topic...

The word "glyph" has five glyphs and four phonemes. A "phoneme" is the
smallest difference in sound that can change a word's meaning. For example,
f is softer than ph, so flip has a meaning different than ... you get the
idea.

"Ligatures" are links between two glyphs, such as fl, with a link at the
top. "Accented" characters, like á, might be considered one glyph or two.
And many languages use "vowel signs" to modify consonants to introduce
vowels, such as the tilde in the Spanish word niña ("neenya"), meaning
"girl".

A "script" is a set of glyphs that write a language. A "char set" is a table
of integers, one for each glyph in a script. A "code point" is one glyph's
index in that char set. Programmers often say "character" when they mean
"one data element of a string", so it could casually mean either 8-bit char
elements or 16-bit wchar_t elements. An "encoding" is a way to pack a char
set as a sequence of characters, all with the same bit-count. A "code page"
is an identifier to select an encoding. A "glossary" is a list of useful
phrases translated into two or more languages. A "collating order" sorts a
culture's glyphs so readers can find things in lists by name. A "locale" is
a culture's script, char set, encoding, collating order, glossary, icons,
colors, sounds, formats, and layouts, all bundled into a seamless GUI
experience.

Different locales require different encodings and character widths for
various reasons. In the beginning, there was ASCII, based on encoding the
Latin alphabet, without accent marks, into a 7-bit protocol. Early systems
reserved the 8th bit for a parity check. Then cultures with short phonetic
alphabets computerized their own glyphs. Each culture claimed the same
"high-ASCII" range of the 8 bits in a byte, the ones with the 8th bit turned
on. User interface software, to enable more than one locale, selects the
"meaning" of the high-ASCII characters by selecting a "code page". On some
hardware devices, this variable literally selected the hardware page of a
jump table to convert codes into glyphs.

Modern GUIs still use code page numbers, typically defined by the
"International Organization for Standardization", or its member committees. The ISO
8859-7 encoding, for example, stores Latin characters in their ASCII
locations, and Greek characters in the high-ASCII. Internationalize a
resource file to Greek like this:

LANGUAGE LANG_GREEK, SUBLANG_NEUTRAL
#pragma code_page(1253)

STRINGTABLE DISCARDABLE
BEGIN
IDS_WELCOME "Υποδοχή στην Ελλάδα."
END

The quoted Greek words might appear as garbage on your desktop, in a real RC
file, in a USENET post, or in a compiled application. On WinXP, fix this by
opening the Regional and Language Options applet, and switching the combo
box labeled "Select a language to match the language version of the
non-Unicode programs you want to use" to Greek.

That user interface verbiage uses "non-Unicode" to mean the "default code
page". When a program runs using that resource, the code page "1253"
triggers the correct interpretation, as (roughly) ISO 8859-7.

MS Windows sometimes supports more than one code page per locale. The two
similar pages, 1253 and ISO 8859-7, differ by a couple of glyphs.
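
To make the "same byte, different meaning" idea concrete, here is a minimal sketch, assuming only the two ISO pages mentioned above. The names CodePage and decodeByte are hypothetical, and the Greek branch covers just the letter ranges of ISO 8859-7, not its punctuation or accented letters.

```cpp
// Hypothetical helper for illustration: decode one byte under a code page.
enum class CodePage { Latin1, Greek };  // ISO 8859-1 and ISO 8859-7

// Returns the Unicode code point for the byte, or U+FFFD (REPLACEMENT
// CHARACTER) for bytes this sketch does not map.
inline char32_t decodeByte(unsigned char byte, CodePage page)
{
    if (byte < 0x80) return byte;               // ASCII is shared by both pages
    if (page == CodePage::Latin1) return byte;  // ISO 8859-1 maps 1:1 onto U+0080..U+00FF

    // ISO 8859-7: the basic Greek letters sit at a fixed offset from Unicode.
    const bool capital = byte >= 0xC1 && byte <= 0xD9 && byte != 0xD2;  // 0xD2 is unassigned
    const bool small   = byte >= 0xE1 && byte <= 0xF9;
    if (capital || small) return byte + 0x2D0;
    return 0xFFFD;
}
```

So decodeByte(0xE1, CodePage::Latin1) yields the code point for á, while decodeByte(0xE1, CodePage::Greek) yields the one for α. Nothing in the byte itself says which was meant; the code page is out-of-band information.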

Some languages require more than 127 glyphs. To fit these locales within
8-bit hardware, more complex encodings map some glyphs into more than one
byte. The bytes without their 8th bit still encode ASCII, but any byte with
its 8th bit set is a member of a short sequence of multiple bytes that
require some math formula to extract their actual char set index. These
"Multiple Byte Character Sets" support locale-specific code pages for
cultures from Arabia to Vietnam. However, you cannot put glyphs from too
many different cultures into the same string. OS support functions cannot
expect strings with mixed code pages.

Sanskrit shares a very popular script called Devanagari with several other
Asian languages. (Watch the movie "Seven Years in Tibet" to see a big
ancient document, written with beautiful flowing Devanagari, explaining why
Brad Pitt is not allowed in Tibet.)

Devanagari's code page could have been 57002, based on the standard "Indian
Script Code for Information Interchange". MS Windows does not support this
locale-specific code page. Accessing Devanagari and writing Sanskrit (or
most other modern Indian languages) requires the Mother of All Char Sets,
Unicode.

ISO 10646, and the "Unicode Consortium", maintain the complete char set
of all humanity's glyphs. To reduce the total count, Unicode supplies many
shortcuts. For example, many fonts place glyph clusters, such as accented
characters, into one glyph. Unicode usually defines each glyph component
separately, and relies on software to merge glyphs into one letter. That
rule helps Unicode not fill up with all permutations of combinations of
ligating accented modified characters.

Many letters, such as ñ, have more than one Unicode representation. Such a
glyph could be a single code point (L"\xF1"), grandfathered in from a
well-established char set, or could be a composition of two glyphs
(L"n\x303"). The C languages introduce 16-bit string literals with an L.

Text handling functions must not assume each data character is one glyph, or
compare strings using naïve character comparisons. Functions that process
Unicode support commands to merge all compositions, or expand all
compositions.
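
As a rough illustration of "merge all compositions", here is a toy sketch. It is not a real normalizer: composeKnownPairs is a hypothetical helper that knows only the three base-plus-combining-mark pairs that come up in this post, and it uses std::u32string so one element is always one code point.

```cpp
#include <string>

// Hypothetical helper, not a real Unicode normalizer: merges only the few
// base letter + combining mark pairs listed below.
inline std::u32string composeKnownPairs(const std::u32string& text)
{
    std::u32string out;
    for (char32_t c : text) {
        if (!out.empty()) {
            char32_t& prev = out.back();
            char32_t merged = 0;
            if (c == 0x0303 && prev == U'n') merged = 0x00F1;  // n + COMBINING TILDE     -> ñ
            if (c == 0x0308 && prev == U'o') merged = 0x00F6;  // o + COMBINING DIAERESIS -> ö
            if (c == 0x0308 && prev == U'a') merged = 0x00E4;  // a + COMBINING DIAERESIS -> ä
            if (merged != 0) {
                prev = merged;  // replace the base letter with the precomposed form
                continue;
            }
        }
        out.push_back(c);
    }
    return out;
}
```

With this, U"nin\u0303a" and U"ni\u00F1a" compare unequal as raw strings but equal once both pass through composeKnownPairs(). Production code would use a library that implements the full Unicode normalization forms instead of a hand-written table.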

The C languages support a 16-bit character type, wchar_t, and a matching
wcs*() function for every str*() function. The strcmp() function, to compare
8-bit strings, has a matching wcscmp() function to compare 16-bit strings.
These functions return 0 when their string arguments match.

Irritatingly, documentation for wcscmp() often claims it can compare
"Unicode" strings. This Characterization Test demonstrates how that claim
misleads:

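TEST_(TestCase, Hoijarvi)
{
    // 8-bit source string, interpreted in the system's ANSI code page
    std::string str("Höijärvi");
    WCHAR composed[20] = {0};

    // MB_COMPOSITE requests base letters followed by combining marks
    MultiByteToWideChar(
        CP_ACP,
        MB_COMPOSITE,
        str.c_str(),
        -1,
        composed,
        sizeof composed / sizeof composed[0]  // buffer size in WCHARs, not bytes
    );

    // Raw comparison: precomposed and decomposed forms differ
    CPPUNIT_ASSERT(0 != wcscmp(L"Höijärvi", composed));
    CPPUNIT_ASSERT(0 == wcscmp(L"Ho\x308ija\x308rvi", composed));

    // Linguistic comparison: the two forms match
    CPPUNIT_ASSERT(0 == lstrcmpW(L"Höijärvi", composed));

    CPPUNIT_ASSERT_EQUAL
    (
        CSTR_EQUAL,
        CompareStringW
        (
            LOCALE_USER_DEFAULT,
            NORM_IGNORECASE,
            L"höijärvi", -1,
            composed, -1
        )
    );
}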

The test starts with an 8-bit string, "Höijärvi", expressed in this post's
code page, ISO 8859-1, also known as Latin 1. Then MultiByteToWideChar()
converts it into a Unicode string with all glyphs decomposed into their
constituents.

The first assertion reveals that wcscmp() compares raw characters, and
thinks "ö" differs from "o\x308", where \x308 is the COMBINING DIAERESIS
code point.

The second assertion proves the exact bits inside composed contain primitive
o and a glyphs followed by combining diæreses.

This assertion...

CPPUNIT_ASSERT(0 == lstrcmpW(L"Höijärvi", composed));

....reveals the MS Windows function lstrcmpW() correctly matches glyphs, not
their constituent characters.

The long assertion with CompareStringW() demonstrates how to augment
lstrcmpW()'s internal behavior with more complex arguments.

If we pushed this experiment into archaic Chinese glyphs, it would soon show
that wchar_t cannot hold all glyphs equally, each at their raw Unicode
index. Despite Unicode's careful paucity, human creativity has spawned more
than 65,535 code points.

Whatever the size of your characters, you must store Unicode using its own
kind of Multiple Byte Character Set.

UTF converts raw Unicode to encodings within characters of fixed bit widths.
MS Windows, roughly speaking, represents UTF-8 as a code page among many.
However, roughly speaking again, when an application compiles with the
_UNICODE flag turned on, and executes on a version of Windows derived from
WinNT, it obeys UTF-16 as a code page, regardless of locale.

Because a _UNICODE-enabled application can efficiently use UTF-16 to store a
glyph from any culture, such applications needn't link their locales to
specific code pages. They can manipulate strings containing any glyph. In
this mode, all glyphs are created equal.
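
For the code points above 65,535, UTF-16 spends two 16-bit units (a "surrogate pair") on one code point. A minimal sketch of that arithmetic follows; toUtf16 is a hypothetical name, and the function assumes its input is a valid code point (it does not reject values in the surrogate range or above U+10FFFF).

```cpp
#include <string>

// Encode one Unicode code point as one or two UTF-16 code units.
// Assumes a valid code point; no error checking in this sketch.
inline std::u16string toUtf16(char32_t codePoint)
{
    std::u16string units;
    if (codePoint < 0x10000) {
        // Fits in a single 16-bit unit
        units.push_back(static_cast<char16_t>(codePoint));
    } else {
        // Split the 20 bits above 0x10000 into a high and a low surrogate
        const char32_t offset = codePoint - 0x10000;
        units.push_back(static_cast<char16_t>(0xD800 + (offset >> 10)));
        units.push_back(static_cast<char16_t>(0xDC00 + (offset & 0x3FF)));
    }
    return units;
}
```

So ñ (U+00F1) stays one unit, while a rare Han ideograph such as U+20000 becomes the pair 0xD840 0xDC00. Code that counts or indexes 16-bit elements still has to watch for such pairs.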

Put another way, UTF-8 can store characters of any Unicode code point, but
Win32 programs can only easily make use of UTF-16 characters.
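
To show what "UTF-8 can store any code point" means at the byte level, here is a minimal encoder sketch. toUtf8 is a hypothetical name, and like any toy it assumes a valid code point and does no error checking.

```cpp
#include <string>

// Encode one Unicode code point as 1 to 4 UTF-8 bytes.
// Assumes a valid code point (not a surrogate, not above U+10FFFF).
inline std::string toUtf8(char32_t cp)
{
    std::string bytes;
    if (cp < 0x80) {
        bytes.push_back(static_cast<char>(cp));                        // plain ASCII
    } else if (cp < 0x800) {
        bytes.push_back(static_cast<char>(0xC0 | (cp >> 6)));          // lead byte, 2-byte form
        bytes.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        bytes.push_back(static_cast<char>(0xE0 | (cp >> 12)));         // lead byte, 3-byte form
        bytes.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        bytes.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        bytes.push_back(static_cast<char>(0xF0 | (cp >> 18)));         // lead byte, 4-byte form
        bytes.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        bytes.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        bytes.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return bytes;
}
```

ASCII passes through unchanged, ñ takes two bytes, a Devanagari letter takes three, and anything beyond U+FFFF takes four. Every byte of a multi-byte sequence has its 8th bit set, which is the same trick the older Multiple Byte Character Sets used.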

Does anybody have any good advice, pointers, websites, etc. that could help.

Read /Developing International Software/ by MS Press.

One other question. C++ has the concept of locales and you can declare
locale objects. They define all the peculiarities of different languages.
Now, obviously, the C++ standard doesn't define all these languages, so my
question is - who does? For example, if I set the locale to, say, Norwegian,
how does my app look for the finer details of the Norwegian language?

Wow. I need to learn the answer to that one, too.

Alf P. Steinbach · Jun 29, 2005

* Phlip:

Wow. I need to learn the answer to that one, too.

<ot>
There's not one Norwegian language. There are two main language families:
ordinary Norwegian and samisk (spoken by the indigenous people). I don't
know much about samisk, but as I understand it it's not _one_ language.
For ordinary Norwegian there are three main variants: bokmål (the usual),
nynorsk (an attempt at rescuing the written language, has to be used to
some extent in e.g. broadcasting, by law), and riksmål (a conservative
variant). Last I checked, some years ago, only bokmål and nynorsk had
standard ISO identifiers. The Norwegian alphabet is simple: it's the English
one with æøåÆØÅ tacked on at the end (after zZ); considering that some
English words are spelled with 'æ' it's amazing they don't have that in the
English alphabet, but then not everything in the world is by design.
</ot>
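
A toy comparator makes the "æøå after z" rule concrete. This is only a sketch under that one assumption: norwegianRank and norwegianLess are hypothetical names, case is folded, and every character outside a-z plus æ, ø, å simply sorts last. Real Norwegian collation has more rules than this.

```cpp
#include <algorithm>
#include <string>

// Hypothetical rank table: a..z first, then æ, ø, å; anything else last.
inline int norwegianRank(wchar_t c)
{
    // Fold the handled letters to lower case
    if (c >= L'A' && c <= L'Z') c = static_cast<wchar_t>(c - L'A' + L'a');
    if (c == L'\u00C6') c = L'\u00E6';  // Æ -> æ
    if (c == L'\u00D8') c = L'\u00F8';  // Ø -> ø
    if (c == L'\u00C5') c = L'\u00E5';  // Å -> å

    if (c >= L'a' && c <= L'z') return c - L'a';
    if (c == L'\u00E6') return 26;
    if (c == L'\u00F8') return 27;
    if (c == L'\u00E5') return 28;
    return 29;
}

// Strict weak ordering usable with std::sort
inline bool norwegianLess(const std::wstring& a, const std::wstring& b)
{
    return std::lexicographical_compare(
        a.begin(), a.end(), b.begin(), b.end(),
        [](wchar_t x, wchar_t y) { return norwegianRank(x) < norwegianRank(y); });
}
```

Passing norwegianLess to std::sort puts a name starting with Å after one starting with Z, which a plain code point comparison happens to do as well, but it also puts æ before ø before å, which code point order does not.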

Should perhaps be mentioned that:

Wide strings in C++ are not guaranteed to e.g. support Unicode UCS2.

Essentially the language itself provides no support at all; like in other
areas it merely _enables_ higher level technology that relies on implementation
specifics and on such specifics being so common that they're de-facto standard.

msalters · Jun 29, 2005

Phlip said:
The C languages introduce 16-bit string literals with an L.

Actually, it's 16+ bits. char is 8+ bits.

If we pushed this experiment into archaic Chinese glyphs, it would soon show
that wchar_t cannot hold all glyphs equally, each at their raw Unicode
index. Despite Unicode's careful paucity, human creativity has spawned more
than 65,535 code points.

Actually, wchar_t by definition can hold all glyphs equally.
All glyphs /in the wchar_t charset/, to be precise. If
wchar_t is precisely 16 bits, then the logical conclusion
is that the wchar_t charset cannot be Unicode 4.

For this reason, some Unix variants have a 32-bit wchar_t
which has room enough for all code points.
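
A quick way to find out what a given compiler does, sketched with hypothetical helper names wcharBits and wcharHoldsAllCodePoints. It only inspects the width and range of wchar_t; it says nothing about which char set the implementation actually uses for it.

```cpp
#include <climits>
#include <cwchar>

// Width of wchar_t in bits on this implementation
constexpr int wcharBits()
{
    return static_cast<int>(sizeof(wchar_t) * CHAR_BIT);
}

// True when every code point up to U+10FFFF fits in a single wchar_t
constexpr bool wcharHoldsAllCodePoints()
{
    return static_cast<unsigned long>(WCHAR_MAX) >= 0x10FFFFul;
}
```

On a platform with a 16-bit wchar_t the second function returns false, and on one with a 32-bit wchar_t it returns true.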

HTH,
Michiel Salters

msalters · Jun 29, 2005

Alf said:
The Norwegian alphabet is simple: it's the English
one with æøåÆØÅ tacked on at the end (after zZ);

After? You mean it doesn't sort æå/ÆÅ with a/A ?
Just shows how surprising i18n is.

Regards,
Michiel Salters

Steve · Jun 29, 2005

* Alf P. Steinbach:

<ot>
There's not one Norwegian language. There are two main language families:
ordinary Norwegian and samisk (spoken by the indigenous people).

I didn't know this. Apologies to the Norwegians reading this.

Steve · Jun 29, 2005

(Oh, you don't have unit tests? You have a bigger problem, so read /Working
Effectively with Legacy Code/ by Mike Feathers, and add them first before
making such a radical change.)

Hmmm... Well, yes.

That's not the whole story.

I kind of suspected it wouldn't be.

I don't know if the following dissertation will
help proportional to its length, but it's essentially all I know about the
topic...

[snipped dissertation]

Read /Developing International Software/ by MS Press.

Will do.

Thanks very much Phlip. Seriously, that dissertation was really helpful.

I think it proves just how little I know now and I have more of an uphill
struggle than I first assumed.

Phlip · Jun 29, 2005

Steve said:
Thanks very much Phlip. Seriously, that dissertation was really helpful.

I think it proves just how little I know now and I have more of an uphill
struggle than I first assumed.

Thank you, but:

That's why I said Spanish then Chinese then Arabic then Sanskrit. Start with
the easiest encodings first. English shares ISO Latin 1 with all Western
Europe (prob'ly except Norway).

And Google iconv. Even if you don't use it in production, your test-side
code can use it for your experiments, and for assertions.

Steve · Jun 29, 2005

Thank you, but:

That's why I said Spanish then Chinese then Arabic then Sanskrit. Start with
the easiest encodings first. English shares ISO Latin 1 with all Western
Europe (prob'ly except Norway).

And Google iconv. Even if you don't use it in production, your test-side
code can use it for your experiments, and for assertions.

Interestingly, we do - indirectly. We use libxml2 which depends on iconv!

Alf P. Steinbach · Jun 29, 2005

* Phlip:

English shares ISO Latin 1 with all Western Europe (prob'ly except
Norway).

Uhm, I think you have that backwards. ISO Latin 1 fully supports
Norwegian (fact), but possibly not all of Western Europe (e.g., I would
be surprised if Gaelic and Icelandic are fully supported). Except if
Western Europe is _defined_ as what ISO Latin 1 supports... ;-)

Phlip · Jun 29, 2005

I didn't know this. Apologies to the Norwegians reading this.

The question was not strictly appropriate for most localization situations.
Try: How does my app look for the finer details of the language that most
_rich_ Norwegians use?

;-)

Uhm, I think you have that backwards. ISO Latin 1 fully supports
Norwegian (fact), but possibly not all of Western Europe (e.g., I would
be surprised if Gaelic and Icelandic are fully supported). Except if
Western Europe is _defined_ as what ISO Latin 1 supports... ;-)

And I apologize to those tenaciously clinging to Gælic!!!

(Seriously, folks, I fell into the i18n engineer's jargon that uses "Western
European" to mean "everyone we can reach with ISO Latin 1".)

Actually, it's 16+ bits. char is 8+ bits.

Okay; thanks. Rather than pervert that paragraph with extra parenthetic
data, I added a single paragraph, above, to cover everything:

Another point of complexity: I will persist in referring
to char as 8-bit and wchar_t as 16-bit, even though the
letter of the C Standard says they may store
more bits. These rules permit the C languages to fully
exploit various hardware architectures.

Actually, wchar_t by definition can hold all glyphs equally.
All glyphs /in the wchar_t charset/, to be precise. If
wchar_t is precisely 16 bits, then the logical conclusion
is that the wchar_t charset cannot be Unicode 4.

I don't see how to use that. wchar_t holds all the characters that wchar_t
holds.

I thought wchar_t was a data type, not a char set, and that functions using
it were responsible for its interpretation.

The OP needs Dietmar Kuehl or someone to whip out an .imbue() that can cover
Norwegian, to get us all back on track...
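
In the meantime, a minimal sketch of what such an .imbue() could look like. The standard only guarantees the "C" locale and the user-preferred "" locale; every other name is implementation-defined, so the POSIX-style name "nb_NO.UTF-8" mentioned below is an assumption that may or may not exist on a given machine. localeOrClassic and formatWith are hypothetical helper names.

```cpp
#include <locale>
#include <sstream>
#include <stdexcept>
#include <string>

// Try a named locale; fall back to the classic "C" locale when the
// implementation does not provide it (std::locale throws in that case).
inline std::locale localeOrClassic(const char* name)
{
    try {
        return std::locale(name);
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}

// Format a number through a stream imbued with the given locale.
inline std::string formatWith(const std::locale& loc, double value)
{
    std::ostringstream out;
    out.imbue(loc);  // numeric formatting now comes from loc's facets
    out << value;
    return out.str();
}
```

Calling formatWith(localeOrClassic("nb_NO.UTF-8"), 1234.5) uses whatever Norwegian conventions the platform ships, if it ships any. That is also the short answer to "who defines them": the C++ library takes the data from the implementation, typically the operating system's locale database.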

Phlip · Jun 29, 2005

msalters said:
After? You mean it doesn't sort æå/ÆÅ with a/A ?

What do you get when you convert æ to title case? You get Ae squished
together, which my Office2000 can't seem to generate, so I can't paint it
here, though it would probably get ???-ed anyway.

Just shows how surprising i18n is.

My Office2000 feature, Format->Change Case->Title Case just converted æ to
Æ...

Alf P. Steinbach · Jun 29, 2005

It doesn't.

What do you get when you convert æ to title case?

'Æ'.
</ot>

Phlip · Jun 29, 2005

Alf said:
It doesn't.

How do I look Ålf up in an Oslo phone book? After Zuckerman??

'Æ'.

http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-Chapter04b

"Case is an important property for characters in Latin, Greek, Cyrillic,
Armenian and Georgian scripts. For these scripts, both upper- and
lowercase characters are encoded. Because some Latin and Greek digraphs were
included in Unicode, it was necessary to add additional case forms to deal
with the situation in which a string has an initial uppercase character.
Thus, for these digraphs there are upper-, lower- and titlecase characters;
for example,

U+01CA Ǌ LATIN CAPITAL LETTER NJ,
U+01CB ǋ LATIN CAPITAL LETTER N WITH SMALL LETTER J, and
U+01CC ǌ LATIN SMALL LETTER NJ.

Likewise, there are properties giving uppercase, lowercase and title case
mappings for characters. Thus, U+01CA has a lowercase mapping of U+01CC and
a titlecase mapping of U+01CB."

All of Ǌ, ǋ and ǌ are singly typeset glyphs, ligated together.

Just my luck they didn't pick Æ for their example, but you get the idea...
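
The three-way mapping from that quote can be sketched as a tiny lookup table. CaseTriple, toTitle and toLower are hypothetical names, and the table holds only the NJ digraph from the quote plus Æ from this thread; a real implementation would read the mappings from the Unicode character database.

```cpp
// One row per letter: its uppercase, titlecase and lowercase code points.
struct CaseTriple { char32_t upper, title, lower; };

// Hypothetical two-row table for illustration only.
static const CaseTriple kCaseTable[] = {
    { 0x01CA, 0x01CB, 0x01CC },  // NJ digraph: titlecase differs from uppercase
    { 0x00C6, 0x00C6, 0x00E6 },  // Æ: titlecase is the same as uppercase
};

inline const CaseTriple* findCase(char32_t c)
{
    for (const CaseTriple& row : kCaseTable)
        if (c == row.upper || c == row.title || c == row.lower) return &row;
    return nullptr;
}

// Characters missing from the table are returned unchanged.
inline char32_t toTitle(char32_t c)
{
    const CaseTriple* row = findCase(c);
    return row ? row->title : c;
}

inline char32_t toLower(char32_t c)
{
    const CaseTriple* row = findCase(c);
    return row ? row->lower : c;
}
```

toTitle(0x01CA) gives U+01CB and toLower(0x01CA) gives U+01CC, matching the quoted mappings, while toTitle(0x00E6) gives plain Æ.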

John Carson · Jun 30, 2005

msalters said:
Actually, it's 16+ bits. char is 8+ bits.

I don't know about C, but in C++ wchar_t is at least as large as char and no
larger than long. It could therefore be 8 bits, say. See Stroustrup TCPL, p.
75.

Richard Herring · Jul 8, 2005

Phlip said:

[...]

The word "glyph" has five glyphs and four phonemes. A "phoneme" is the
smallest difference in sound that can change a word's meaning. For
example, f is softer than ph, so flip has a meaning different than ...
you get the idea.

I'm not sure that's a good example. Most English-speakers pronounce <ph>
and <f> the same. Both written forms represent the same phoneme, /f/,
whose exact sound will depend on the phonetic context in which it's
pronounced.

"Ligatures" are links between two glyphs, such as fl, with a link at the
top. "Accented" characters, like á, might be considered one glyph or two.
And many languages use "vowel signs" to modify consonants to introduce
vowels, such as the tilde in the Spanish word niña ("neenya"), meaning
"girl".

Nitpick:
That's a diacritic, not a "vowel sign", and it palatalises the consonant
but doesn't introduce any new vowel. The Spanish sound written <ñ> is
one phoneme, not two. Vowel signs are diacritics used in Devanagari and
related abugida scripts, where the basic symbols represent
consonant-plus-vowel and the vowel signs change which vowel is meant.

A "script" is a set of glyphs that write a language. A "char set" is a table
of integers, one for each glyph in a script. A "code point" is one glyph's
index in that char set. Programmers often say "character" when they mean
"one data element of a string", so it could casually mean either 8-bit char
elements or 16-bit wchar_t elements. An "encoding" is a way to pack a char
set as a sequence of characters, all with the same bit-count. A "code page"
is an identifier to select an encoding. A "glossary" is a list of useful
phrases translated into two or more languages. A "collating order" sorts a
culture's glyphs so readers can find things in lists by name. A "locale" is
a culture's script, char set, encoding, collating order, glossary, icons,
colors, sounds, formats, and layouts, all bundled into a seamless GUI
experience.

We wish!
