Pete said:
Phlip wrote:
An object of type wchar_t holds a character, not a glyph. A glyph can be
made up of more than one character. In Unicode, for example, LATIN SMALL
LETTER O followed by DIAERESIS is two characters that represent the same
glyph as the single character LATIN SMALL LETTER O WITH DIAERESIS. Both
will show up as a single blob of stuff (a glyph) on the display screen.
Right. Deep within the "mind" of the lowly wcschr() function, such things
are hypothetical. It will match a combining diaeresis as if it were an
independent glyph, and won't match its precomposed equivalent. That's why
short posts on such topics are risky, and the alternative is long boring
posts. But feel free to nitpick...
The word "glyph" has five glyphs and four phonemes. A "phoneme" is the
smallest difference in sound that can change a word's meaning. For example,
f is softer than ph, so flip has a meaning different than ... you get the
idea.
"Ligatures" are links between two glyphs, such as fl, with a link at the
top. "Accented" characters, like á, might be considered one glyph or two.
And many languages use "vowel signs" to modifying consonants to introduce
vowels, such as the tilde in the Spanish word niña ("neenya"), meaning
"girl".
[A pause to check my post's encoding. It will go out as Western Europe,
meaning ISO Latin 1. I suspect that's also ISO 8859-1.]
[That's funny, because I thought I had it set to Greek these days for some
strange reason...]
A "script" is a set of glyphs that write a language. A "char set" is a table
of integers, one for each glyph in a script. A "code point" is one glyph's
index in that char set. Programmers often say "character" when they mean
"one data element of a string", so it could casually mean either 8-bit char
elements or 16-bit wchar_t elements. An "encoding" is a way to pack a char
set as a sequence of characters, all with the same bit-count. A "code page"
is an identifier to select an encoding. A "glossary" is a list of useful
phrases translated into two or more languages. A "collating order" sorts a
culture's glyphs so readers can find things in lists by name. A "locale" is
a culture's script, char set, encoding, collating order, glossary, icons,
colors, sounds, formats, and layouts, all bundled into a seamless GUI
experience.
Different locales require different encodings and character widths, for
various historical reasons. In the beginning, there was ASCII, based on
encoding the Latin alphabet, without accent marks, into a 7-bit protocol.
Early systems reserved the 8th bit for a parity check. Then cultures with
short phonetic alphabets computerized their own glyphs. Each culture claimed
the same "high-ASCII" range of a byte: the codes with the 8th bit turned on.
User interface software, to enable more than one locale, selects the
"meaning" of the high-ASCII characters by selecting a "code page". On some
hardware devices, this variable literally selected the hardware page of a
jump table to convert codes into glyphs.
Modern GUIs still use code page numbers, typically defined by the
International Organization for Standardization (ISO) or its member
committees. The ISO 8859-7 encoding, for example, stores Latin characters in
their ASCII locations, and Greek characters in the high-ASCII range.
<warning topicality="off">
Internationalize a resource file to Greek like this:
LANGUAGE LANG_GREEK, SUBLANG_NEUTRAL
#pragma code_page(1253)
STRINGTABLE DISCARDABLE
BEGIN
IDS_WELCOME "?p?d??? st?? ????da." // <-- imagine Greek there
END
</warning>
The quoted Greek words might appear as garbage on your desktop, in a real RC
file, in a USENET post [like this one], or in a compiled application. On
WinXP, fix this by opening the Regional and Language Options applet, and
switching the combo box labeled "Select a language to match the language
version of the non-Unicode programs you want to use" to Greek. Unless the
garbage is ? marks, in which case a library function somewhere has already
replaced the garbage with placeholders.
That user interface verbiage uses "non-Unicode" to mean the "default code
page". When a program runs using that resource, the code page "1253"
triggers the correct interpretation, as (roughly) ISO 8859-7.
MS Windows sometimes supports more than one code page per locale. The two
similar pages, 1253 and ISO 8859-7, differ by a couple of glyphs.
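Here's a minimal sketch to see exactly which glyphs differ, assuming a
Windows build; 28597 is the Windows code page number assigned to ISO 8859-7:

#include <stdio.h>
#include <windows.h>

int main()
{
    // Decode each high-ASCII byte under both Greek code pages, and
    // report every index where the two encodings disagree.
    for (int index = 0x80; index <= 0xFF; ++index)
    {
        char narrow[2] = { (char)index, 0 };
        WCHAR fromCp1253[2] = {0};
        WCHAR fromIso[2] = {0};

        MultiByteToWideChar(1253, 0, narrow, -1, fromCp1253, 2);
        MultiByteToWideChar(28597, 0, narrow, -1, fromIso, 2);

        if (fromCp1253[0] != fromIso[0])
            printf("0x%02X: U+%04X vs U+%04X\n",
                index, fromCp1253[0], fromIso[0]);
    }
    return 0;
}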
Some languages require more than 127 glyphs. To fit these locales within
8-bit hardware, more complex encodings map some glyphs into more than one
byte. The bytes without their 8th bit still encode ASCII, but any byte with
its 8th bit set is a member of a short sequence of bytes that requires some
math formula to extract the actual char set index. These "Multiple Byte
Character Sets" support locale-specific code pages for cultures from Arabia
to Vietnam. However, you cannot put glyphs from too many different cultures
into the same string. OS support functions cannot expect strings with mixed
code pages.
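As a sketch of the math those support functions do, here's one way a program
might count glyphs, not bytes, in a double-byte string on Windows.
IsDBCSLeadByteEx() is the real Win32 call that detects the first byte of a
two-byte sequence; code page 932, Shift-JIS Japanese, stands in for any DBCS
code page:

#include <windows.h>

// Count glyphs, not bytes, in a string encoded with a DBCS code page.
int countGlyphs(char const * bytes, UINT codePage)
{
    int glyphs = 0;
    while (*bytes)
    {
        // A lead byte announces a two-byte sequence; step over both bytes.
        bytes += IsDBCSLeadByteEx(codePage, (BYTE)*bytes) ? 2 : 1;
        ++glyphs;
    }
    return glyphs;
}

Call it as countGlyphs(someText, 932), and each two-byte Japanese glyph
counts once.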
Sanskrit shares a very popular script called Devanagari with several other
Asian languages. (Watch the movie "Seven Years in Tibet" to see a big
ancient document, written with beautiful flowing Devanagari, explaining why
Brad Pitt is not allowed in Tibet.)
Devanagari's code page could have been 57002, based on the standard "Indian
Script Code for Information Interchange". MS Windows does not support this
locale-specific code page. Accessing Devanagari and writing Sanskrit (or
most other modern Indian languages) requires the Mother of All Char Sets,
Unicode.
The ISO 10646 standard and the "Unicode Consortium" together maintain the
complete char set of all humanity's glyphs. To reduce the total count,
Unicode supplies many shortcuts. For example, many fonts place glyph
clusters, such as accented characters, into one glyph. Unicode usually
defines each glyph component separately, and relies on software to merge the
components into one letter. That rule keeps Unicode from filling up with all
permutations of combinations of ligating accented modified characters.
Many letters, such as ñ, have more than one Unicode representation. Such a
glyph could be a single code point (L"\xF1"), grandfathered in from a
well-established char set, or could be a composition of two code points
(L"n\x303"). The C languages introduce wide string literals, 16 bits per
element on Windows, with an L prefix.
Text handling functions must not assume each data character is one glyph, or
compare strings using naïve character comparisons. Functions that process
Unicode support commands to merge all compositions, or expand all
compositions.
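On Windows, for example, the real FoldStringW() function performs both
operations; a minimal sketch, assuming a WinNT-derived system:

#include <windows.h>

void foldDemo()
{
    WCHAR merged[20] = {0};
    WCHAR expanded[20] = {0};

    // Merge base characters and combining marks into precomposed
    // code points: L"n\x303" becomes L"\xF1".
    FoldStringW(MAP_PRECOMPOSED, L"n\x303", -1, merged, 20);

    // Expand precomposed code points into base character plus
    // combining mark: L"\xF1" becomes L"n\x303".
    FoldStringW(MAP_COMPOSITE, L"\xF1", -1, expanded, 20);
}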
The C languages support a wide character type, wchar_t (16 bits on Windows),
and a matching wcs*() function for every str*() function. The strcmp()
function, to compare 8-bit strings, has a matching wcscmp() function to
compare 16-bit strings. These functions return 0 when their string arguments
match.
Irritatingly, documentation for wcscmp() often claims it can compare
"Unicode" strings. This Characterization Test demonstrates how that claim
misleads:
TEST_(TestCase, Hoijarvi)
{
    std::string str("Höijärvi");
    WCHAR composed[20] = {0};
    MultiByteToWideChar(
        CP_ACP,
        MB_COMPOSITE,
        str.c_str(),
        -1,
        composed,
        sizeof composed / sizeof *composed  // length in WCHARs, not bytes
    );
    CPPUNIT_ASSERT(0 != wcscmp(L"Höijärvi", composed));
    CPPUNIT_ASSERT(0 == wcscmp(L"Ho\x308ija\x308rvi", composed));
    CPPUNIT_ASSERT(0 == lstrcmpW(L"Höijärvi", composed));
    CPPUNIT_ASSERT_EQUAL
    (
        CSTR_EQUAL,
        CompareStringW
        (
            LOCALE_USER_DEFAULT,
            NORM_IGNORECASE,
            L"höijärvi", -1,
            composed, -1
        )
    );
}
The test starts with an 8-bit string, "Höijärvi", expressed in this post's
code page, ISO 8859-1, also known as Latin 1. Then MultiByteToWideChar()
converts it into a Unicode string with all glyphs decomposed into their
constituents.
The first assertion reveals that wcscmp() compares raw characters, and
thinks "ö" differs from "o\x308", where \x308 is the COMBINING DIAERESIS
code point.
The second assertion proves the exact bits inside composed contain primitive
o and a glyphs followed by combining diaereses.
This assertion...
CPPUNIT_ASSERT(0 == lstrcmpW(L"Höijärvi", composed));
...reveals the MS Windows function lstrcmpW() correctly matches glyphs, not
their constituent characters.
The long assertion with CompareStringW() demonstrates how to augment
lstrcmpW()'s internal behavior with more complex arguments.
If we pushed this experiment into archaic Chinese glyphs, it would soon show
that wchar_t cannot hold all glyphs equally, each at its raw Unicode index.
Despite Unicode's careful economy, human creativity has spawned more than
65,535 code points.
Whatever the size of your characters, you must store Unicode using its own
kind of Multiple Byte Character Set.
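UTF-16, for example, spends a "surrogate pair" of two 16-bit characters on
each code point above 65,535. A minimal sketch, assuming Windows' 16-bit
wchar_t, using U+1D11E MUSICAL SYMBOL G CLEF:

#include <assert.h>
#include <wchar.h>

int main()
{
    // U+1D11E encodes in UTF-16 as the surrogate pair 0xD834 0xDD1E,
    // so this one glyph occupies two wchar_t elements.
    wchar_t const * clef = L"\xD834\xDD1E";
    assert(wcslen(clef) == 2);
    return 0;
}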
A UTF ("Unicode Transformation Format") converts raw Unicode code points
into sequences of characters with fixed bit widths. MS Windows, roughly
speaking, represents UTF-8 as a code page among many. However, roughly
speaking again, when an application compiles with the _UNICODE flag turned
on, and executes on a version of Windows derived from WinNT, it obeys UTF-16
as a code page, regardless of locale.
Because a _UNICODE-enabled application can efficiently use UTF-16 to store a
glyph from any culture, such applications needn't link their locales to
specific code pages. They can manipulate strings containing any glyph. In
this mode, all glyphs are created equal.
Put another way, UTF-8 can store characters of any Unicode code point, but
Win32 programs can only easily make use of UTF-16 characters.
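So a Win32 program converts at its boundaries; a minimal sketch, using the
real CP_UTF8 code page constant:

#include <windows.h>

// Convert a UTF-8 byte string into the UTF-16 that Win32 handles natively.
int toUtf16(char const * utf8, WCHAR * utf16, int capacity)
{
    return MultiByteToWideChar(CP_UTF8, 0, utf8, -1, utf16, capacity);
}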
Which has nothing at all to do with the original problem.
Right: wcschr() can't be slow, so something else was going on.
Get more Greek here:
http://www.greencheese.org/TheFrogs