Poll: Which type would you prefer for UTF-8 string literals in C++0x

Martin B.

Hi!

This is a poll to get *short* opinions on the planned literal type for
the UTF-8 string literals in C++0x. (Please, no Unicode lectures ;-)

The current draft for C++0x specifies for UTF-8 string literals:
( http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3126.pdf )
[N3126, p28, §2.14.5]
[Item 6] A string literal that begins with u8, such as u8"asdf",
is a UTF-8 string literal [...]
[Item 7] Ordinary string literals and UTF-8 string literals are
also referred to as narrow string literals. A narrow
string literal has type "array of n const char" [...]

Compare this with the wide character types:
[N3126, p28, §2.14.5]
[Item 8] A string literal that begins with u, such as u"asdf",
is a char16_t string literal. A char16_t string
literal has type "array of n const char16_t" [...]
[Item 9] A string literal that begins with U, such as U"asdf",
is a char32_t string literal. A char32_t string literal
has type "array of n const char32_t" [...]

QUESTION: For the upcoming UTF-8 string literals, which type would you
prefer?

a) The current proposal, "array of n const char" is great!
b) "array of n const unsigned char" would be better!
(Because I'm using libxml2 ;-)
c) FCS! Add a distinct char8_t and make u8 literals use that!


thanks a lot!
br,
Martin
 
Alf P. Steinbach /Usenet

* Martin B., on 31.08.2010 17:35:
Hi!

This is a poll to get *short* opinions on the planned literal type for the UTF-8
string literals in C++0x. (Please, no Unicode lectures ;-)

The current draft for C++0x specifies for UTF-8 string literals:
( http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3126.pdf )
[N3126, p28, §2.14.5]
[Item 6] A string literal that begins with u8, such as u8"asdf",
is a UTF-8 string literal [...]
[Item 7] Ordinary string literals and UTF-8 string literals are
also referred to as narrow string literals. A narrow
string literal has type "array of n const char" [...]

Compare this with the wide character types:
[N3126, p28, §2.14.5]
[Item 8] A string literal that begins with u, such as u"asdf",
is a char16_t string literal. A char16_t string
literal has type "array of n const char16_t" [...]
[Item 9] A string literal that begins with U, such as U"asdf",
is a char32_t string literal. A char32_t string literal
has type "array of n const char32_t" [...]

QUESTION: For the upcoming UTF-8 string literals, which type would you prefer?

a) The current proposal, "array of n const char" is great!
b) "array of n const unsigned char" would be better!
(Because I'm using libxml2 ;-)
c) FCS! Add a distinct char8_t and make u8 literals use that!

Not sure this is the right thing to focus on.

Is there C++0x support for wide character arguments to std::ofstream & friends?

That's needed in practice on the Windows platform (you can't write correct
programs using those beasts without it), and moreover there is existing
practice that can be standardized, rather than inventing something new.


Cheers,

- Alf
 
Martin B.

Martin B. said:
QUESTION: For the upcoming UTF-8 string literals, which type would you
prefer?

[...]
c) FCS! Add a distinct char8_t and make u8 literals use that!

[...]

As for c), this is already present - char is different from signed char
or unsigned char. The problem is that most mainstream implementations
define plain char as signed. Adding a new type would only cause more
confusion IMO.

With "distinct" I meant distinct from the normal character literals.
That is, I personally would prefer u8"" literals to have a distinct
type from normal "" literals.

cheers,
Martin
 
tni

Logically, b) would be better of course. However, as there are zillions
of text interfaces using char and most of them work fine with UTF-8, I
would vote for a).

As for c), this is already present - char is different from signed char
or unsigned char. The problem is that most mainstream implementations
define plain char as signed. Adding a new type would only cause more
confusion IMO.

UTF-8 is clearly unsigned data. Stuffing it into that horrible piece of
legacy crap called char is IMO insane.

Given that char can alias with anything, I don't see your point that a)
is better for using legacy APIs.
 
Pavel

Martin said:
Hi!

This is a poll to get *short* opinions on the planned literal type for
the UTF-8 string literals in C++0x. (Please, no Unicode lectures ;-)

The current draft for C++0x specifies for UTF-8 string literals:
( http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3126.pdf )
[N3126, p28, §2.14.5]
[Item 6] A string literal that begins with u8, such as u8"asdf",
is a UTF-8 string literal [...]
[Item 7] Ordinary string literals and UTF-8 string literals are
also referred to as narrow string literals. A narrow
string literal has type "array of n const char" [...]

Compare this with the wide character types:
[N3126, p28, §2.14.5]
[Item 8] A string literal that begins with u, such as u"asdf",
is a char16_t string literal. A char16_t string
literal has type "array of n const char16_t" [...]
[Item 9] A string literal that begins with U, such as U"asdf",
is a char32_t string literal. A char32_t string literal
has type "array of n const char32_t" [...]

QUESTION: For the upcoming UTF-8 string literals, which type would you
prefer?

a) The current proposal, "array of n const char" is great!
b) "array of n const unsigned char" would be better!
(Because I'm using libxml2 ;-)
c) FCS! Add a distinct char8_t and make u8 literals use that!


thanks a lot!
br,
Martin
IMHO the idea seems half-baked without library support for
transforming the literal to and from std::string, u16string, u32string
and wstring, and maybe more (substrings etc.).

To safely and fully support the above, the whole literal must be of a
separate type. This is different from c) option above.

To access the literal content as char or unsigned char, the intention
should be expressed explicitly. This would partly protect against
erroneous usage under the assumption that the literal provides pointer
arithmetic for access to UTF-8 code points within the literal. For example:

std::utf8_string us = u8"xyz";
const char* p1 = us.c_begin();
const unsigned char* p2 = us.uc_begin();

-Pavel
 
BGB / cr88192

Pavel said:
Martin said:
[...]
IMHO the idea seems to be half-baked without library support for
transforming the literal to and from std::string,u16string,u32string and
wstring and maybe more (substrings etc).

To safely and fully support the above, the whole literal must be of a
separate type. This is different from c) option above.

To access the literal content as char or unsigned char, the intention
should be expressed explicitly. This would partly protect against
erroneous usage in assumption the literal provides pointer arithmetic for
access to UTF8 code points within the literal. For example:

std::utf8_string us=u8"xyz";
const char *p1=us.c_begin();
const unsigned char *p2=us.uc_begin();

why not have it just return an "unsigned char" pointer and make it be
otherwise like a normal string literal?...
yes, naive pointer arithmetic may put one in the middle of a codepoint, but
what really does it matter?...

but, then again, I have a compiler (for C) where UTF-8 is the default
character encoding anyways (for narrow strings).

char *s="foo \u3FAC\n";
would handle the codepoint as UTF-8 (as an extension)...

in addition to the more usual:
wchar_t *s=L"foo \u3FAC\n";
occurrence...

(my implementation assumes 16-bit wchar_t and UTF-16 strings for this...).

usually, UTF-8 strings are cast to 'unsigned char *' when they need to be
worked on, but otherwise are passed around as normal 'char *' strings for
consistency with existing practice and API functions, ...


a vaguely similar pattern could be followed, and an explicit marker for
UTF-8 could make sense...

however, returning string literals as a complex type would be IMO
sub-optimal, as it breaks with existing patterns...
 
Pavel

BGB said:
Pavel said:
Martin said:
[...]
IMHO the idea seems to be half-baked without library support for
transforming the literal to and from std::string,u16string,u32string and
wstring and maybe more (substrings etc).

To safely and fully support the above, the whole literal must be of a
separate type. This is different from c) option above.

To access the literal content as char or unsigned char, the intention
should be expressed explicitly. This would partly protect against
erroneous usage in assumption the literal provides pointer arithmetic for
access to UTF8 code points within the literal. For example:

std::utf8_string us=u8"xyz";
const char *p1=us.c_begin();
const unsigned char *p2=us.uc_begin();

why not have it just return an "unsigned char" pointer and make it be
otherwise like a normal string literal?...
yes, naive pointer arithmetic may put one in the middle of a codepoint, but
what really does it matter?...

but, then again, I have a compiler (for C) where UTF-8 is the default
character encoding anyways (for narrow strings).

char *s="foo \u3FAC\n";
would handle the codepoint as UTF-8 (as an extension)...

in addition to the more usual:
wchar_t *s=L"foo \u3FAC\n";
occurence...

(my implementation assumes 16-bit wchar_t and UTF-16 strings for this...).

usually, UTF-8 strings are cast to 'unsigned char *' when they need to be
worked on, but otherwise are passed around as normal 'char *' strings for
consistency with existing practice and API functions, ...


a vaguely similar pattern could be followed, and an explicit marker for
UTF-8 could make sense...

however, returning string literals as a complex type would be IMO
sub-optimal, as it breaks with existing patterns...
I agree it does. I do not see anything wrong with that, however. Why
should we cram more and more semantics into the old syntax? I understand
when it's done for compatibility with existing code, but a utf-8
literal does not seem to carry any syntactic legacy that should be
preserved.

I thought more about it and realized that, without library support, it
will be even worse than I first thought. Imagine you need to process a
utf-8 literal to get Unicode characters out of it. Portable and
efficient code would have to work out different cases of endianness and
alignment that would be a piece of cake for the standard library
implementation, as it has access to implementation-specific stuff. I
would feel like I was back in the '80s, but not in a good sense, if I
had to do it once again for a just-released "latest-n-greatest"
language. But of course, the meaning #28 for square brackets is much
more important than our mundane needs like decent library support for
basic day-to-day problems... I guess I will retire before being able to
split a file path into directory and base name in C++ using the
standard library... sad.

-Pavel
 
James Kanze

Logically, b) would be better of course. However, as there are
zillions of text interfaces using char and most of them work
fine with UTF-8, I would vote for a).

The question raises a more general issue: should the encoding be
part of the type? Or in other words, should UTF-8 strings and
characters (in general) have a different type than e.g. ISO
8859-1 (which in turn should have a different type than ISO
8859-2)?

I think one could argue both ways, but historically, narrow
characters have always been char/char*/basic_string<char>,
regardless of the encoding, and I don't think it would work to
change this now.
As for c), this is already present - char is different from
signed char or unsigned char. The problem is that most
mainstream implementations define plain char as signed. Adding
a new type would only cause more confusion IMO.

I've usually seen a general convention that text is char, and
that small integers are either signed char or unsigned char. Of
course, using a signed type to represent characters is an
anomaly; allowing it is arguably an error in the initial
specification of C (but making plain char unsigned on a PDP-11
had a very significant negative impact on performance). In some
ways, I'd like to see a requirement that plain char be unsigned,
or even that it be more restricted, only supporting operations
which might make sense on a character (no multiplication, for
example). But practically speaking, it's not going to happen,
and practically speaking, QoI considerations will ensure that
all implementations will support things like ISO 8859-1 or UTF-8
on plain char---if plain char is signed, they will ensure that
there is a lossless two way conversion between an int in the
range 0-UCHAR_MAX (as returned by streambuf::sgetc, for example)
and char. (Note that according to the standard, the results of
converting a value in the range SCHAR_MAX+1-UCHAR_MAX to a char
is implementation defined, and---at least according to the
C standard---may result in an implementation defined signal. In
practice, I wouldn't worry about it.)
 
Martin B.

The question raises a more general issue: should the encoding be
part of the type? Or in other words, should UTF-8 strings and
characters (in general) have a different type than e.g. ISO
8859-1 (which in turn should have a different type than ISO
8859-2)?

And C++0x started to define two character types char16_t and char32_t
where the encoding *is* part of the type.

And I know at least one instance where this makes sense:
// ...
const char16_t* utf16str_1 = u"...";
const char32_t* utf32str_1 = U"...";
const wchar_t* wstr_1 = L"...";

Now, if a maintainer were to add another string constant:
const char32_t* utf32str_2 = u"..."; // OOPS. Typo!
a conforming compiler reports an error, because char16_t* is not
convertible to char32_t*!
I think one could argue both ways, but historically, narrow
characters have always been char/char*/basic_string<char>,
regardless of the encoding, and I don't think it would work to
change this now.

But the problem is that, historically, there was only *one* way that
string constants were added to a program:
char* str = "...";

With UTF-8 string constants, we get:
const char* utf8str_1 = u8"...";
const char* utf8str_2 = "..."; // OOPS. Typo!

The compiler has no chance to diagnose this. But it would have, if UTF-8
characters had their own type.


cheers,
Martin
 
Öö Tiib

Hi!

This is a poll to get *short* opinions on the planned literal type for
the UTF-8 string literals in C++0x. (Please, no Unicode lectures ;-)

The current draft for C++0x specifies for UTF-8 string literals:
(http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3126.pdf)
[N3126, p28, §2.14.5]
[Item 6] A string literal that begins with u8, such as u8"asdf",
is a UTF-8 string literal [...]
[Item 7] Ordinary string literals and UTF-8 string literals are
also referred to as narrow string literals. A narrow
string literal has type "array of n const char" [...]

Compare this with the wide character types:
[N3126, p28, §2.14.5]
[Item 8] A string literal that begins with u, such as u"asdf",
is a char16_t string literal. A char16_t string
literal has type "array of n const char16_t" [...]
[Item 9] A string literal that begins with U, such as U"asdf",
is a char32_t string literal. A char32_t string literal
has type "array of n const char32_t" [...]

QUESTION: For the upcoming UTF-8 string literals, which type would you
prefer?

a) The current proposal, "array of n const char" is great!
b) "array of n const unsigned char" would be better!
(Because I'm using libxml2 ;-)
c) FCS! Add a distinct char8_t and make u8 literals use that!

I am somewhat sceptical about usefulness of utf-8 bytes for anything
but storing and transporting texts. Simple operations like
std::toupper will never work with these anyway.

It all smells like another set of pages in already lengthy coding
policies as a reward for a semi-useful literal. Overall a) looks most
dangerous since char is already sort of like a synonym for void and ...
c) looks most promising (if it does not silently convert to char or
unsigned char).
 
Öö Tiib

The string variables are mostly used just for storing, concatenating and
transporting texts. Splitting on ASCII delimiters and searching
substrings also works fine with UTF-8. The only problematic operations
are related to single character manipulations, which are quite rare in my
experience.

I meant these functions in <locale>: do you really never need them?

template < class charT > bool isspace( charT c, const locale& loc );
template < class charT > bool isprint( charT c, const locale& loc );
template < class charT > bool iscntrl( charT c, const locale& loc );
template < class charT > bool isupper( charT c, const locale& loc );
template < class charT > bool islower( charT c, const locale& loc );
template < class charT > bool isalpha( charT c, const locale& loc );
template < class charT > bool isdigit( charT c, const locale& loc );
template < class charT > bool ispunct( charT c, const locale& loc );
template < class charT > bool isxdigit( charT c, const locale& loc );
template < class charT > bool isalnum( charT c, const locale& loc );
template < class charT > bool isgraph( charT c, const locale& loc );
template < class charT > charT toupper( charT c, const locale& loc );
template < class charT > charT tolower( charT c, const locale& loc );
The toupper() function seems anything but simple to me. The standard
example from James Kanze is the German ß, which should go to SS in
uppercase.

Simple ... i meant for the people who say that here it should be
capitalized and here in upper case, here with a bold font. For them all
three feel like tasks of similar complexity. It is a reasonable
requirement: "i want to search for the text i typed
case-insensitively", isn't it?
If std::toupper() with char32_t needs still special post-processing
with German ß or some other exception, then okay. If it does not work
with char8_t then it should throw, and not produce rubbish.
In our software all strings are encoded as UTF-8 internally. On Linux
filesystems and user locales are nowadays UTF-8 by default, so it is easy
to open files, print text, call system functions working with filenames
like glob(), etc. And all those interfaces work with char*.

Yes, that is why i voted c). char* is sort of a byte* for me ... all
POD stuff can be transported with it (or with unsigned char*). Since
the utf-8 string is the most popular kind of char*, it would be good to
have a special type for the byte in it.

char* can still be used, since a utf-8 string is POD too, but it is
not ... correct. It is similar to const correctness: initially it
caused annoyance with interfaces that asked for immutable data with
a pointer to mutable data. Right now that only happens rarely, with some
legacy code or a platform that has rusted in the field for two decades
and now needs a "minor" upgrade.
 
Öö Tiib

 If std::toupper() with char32_t needs still special post-processing
with German ß or some other exception, then okay. If it does not work
with char8_t then it should throw, and not produce rubbish.

Replace "throw" with "static_assert".
 
James Kanze

On Aug 31, 6:35 pm, "Martin B." <[email protected]> wrote:

[...]
I am somewhat sceptical about usefulness of utf-8 bytes for
anything but storing and transporting texts. Simple
operations like std::toupper will never work with these
anyway.

"Simple" operations like std::toupper don't work with most
encodings, including char32_t UTF-32. Mainly because things
like "toupper" aren't simple. (The classical example:
toupper('\u00DF') should result in the two character sequence
"SS".) Any effective toupper has to work on a string level, not
on a character level, and generate a new string (since there is
no one to one mapping to upper). And this can be done in UTF-8,
with the correct tools (which maybe should be part of the
standard).

In practice, a lot of applications aren't concerned with
manipulating individual characters anyway; they need to
recognize separators (but often all of the separators will have
single byte codes in UTF-8), and break the text up into
segments, but not much more. (Or so much more that the
difference between UTF-8 and UTF-32 becomes negligible, e.g.
they need to treat the two code point sequence "\u0061\u0302" as
a single character equal to "\u00E2".)
 
Öö Tiib

[...]

I am somewhat sceptical about usefulness of utf-8 bytes for
anything but storing and transporting texts. Simple
operations like std::toupper will never work with these
anyway.

"Simple" operations like std::toupper don't work with most
encodings, including char32_t UTF-32. Mainly because things
like "toupper" aren't simple. (The classical example:
toupper('\u00DF') should result in the two character sequence
"SS".) Any effective toupper has to work on a string level, not
on a character level, and generate a new string (since there is
no one to one mapping to upper). And this can be done in UTF-8,
with the correct tools (which maybe should be part of the
standard).

Yes, the standard library should contain correct tools and not contain
incorrect and misleading tools. Some function in the standard library
that accepts char* as a character sequence may pretend to be silly and
expect ASCII. If there were a thing like char8_t with an exact meaning,
the old "i thought it's ASCII" trick could not be pulled.
In practice, a lot of applications aren't concerned with
manipulating individual characters anyway; they need to
recognize separators (but often all of the separators will have
single byte codes in UTF-8), and break the text up into
segments, but not much more. (Or so much more that the
difference between UTF-8 and UTF-32 becomes negligible, e.g.
they need to treat the two code point sequence "\u0061\u0302" as
a single character equal to "\u00E2".)

Hmm but ... a very large number of apps have to deal with texts entered
by users or sent by other apps that cannot manage to specify a binary
(or well-formed xml) interface. Such apps always need operations like
capitalizing, case-insensitive search/compare, and date, time, numeric
and money formatting/parsing, and so on. That is not something that can
be called separator searching or breaking up. Also, it all sounds like
the business of <locale>, and if <locale> cannot pull the weight then
it should be kicked out of the standard and something that works should
be put in. If you have to use the ICU4C library anyway to have correct
locale-specific comparison, transformation and regular expression
rules, then the standard should stop pretending that it provides
something.
 
Francesco S. Carta

No, not really. For ASCII they are actually not needed (the locale thing
would just be a complication), and our software does not do any natural
language processing. Our concern is only that if our (say, Japanese)
customer for example decides to use hieroglyphs in script file comments,
file names or database labels, they would be passed and stored correctly.
We are really not interested if some of those hieroglyph denote
punctuation or not.

Well, that would be pretty normal. Strange would be to see a Pharaoh
using ideograms in a label ;-)
 
James Kanze

I meant these functions in <locale>: do you really never need them?
template < class charT > bool isspace( charT c, const locale& loc );
template < class charT > bool isprint( charT c, const locale& loc );
template < class charT > bool iscntrl( charT c, const locale& loc );
template < class charT > bool isupper( charT c, const locale& loc );
template < class charT > bool islower( charT c, const locale& loc );
template < class charT > bool isalpha( charT c, const locale& loc );
template < class charT > bool isdigit( charT c, const locale& loc );
template < class charT > bool ispunct( charT c, const locale& loc );
template < class charT > bool isxdigit( charT c, const locale& loc );
template < class charT > bool isalnum( charT c, const locale& loc );
template < class charT > bool isgraph( charT c, const locale& loc );
template < class charT > charT toupper( charT c, const locale& loc );
template <class charT> charT tolower( charT c, const locale& loc );

In most applications, no. And if you're actually dealing with
full Unicode, neither they nor their wide character equivalents
work: even in UTF-32, you may need several code points to
specify a character.
Simple ... i meant for the people who say that here it should be
capitalized and here in upper case, here with bold font.

These sound like presentation issues (bold font is definitely
one). A lot of applications aren't concerned with presentation.
And those that are, and that need to support full Unicode,
generally can't use the above functions anyway, because several
code points may be necessary to specify a character, even in
Unicode.
For them all three feel tasks with similar complexity. It is
reasonable requirement: "i want to search for the text i typed
in case-insensitively", isn't it?

Maybe, but then you have to define exactly what you mean by
"case-insensitive". In Germany, there are two separate
conventions regarding Umlauts ("ä" may compare equal to "a" or
to "ae", depending on the convention), for example, and of
course, "ß" must compare equal to "SS" (or in certain special
cases, to "SZ", it's context dependent).
If std::toupper() with char32_t needs still special
post-processing with German ß or some other exception, then
okay. If it does not work with char8_t then it should throw,
and not produce rubbish.

That's an interesting proposition; I rather like it.

The simple forms of the functions are useful in many contexts,
where you know that you'll only be treating (or should only be
treating) pure ASCII, for example. They should be supported, if
only for historical reasons. The question is what to do when
something like isalpha is called on something that isn't
a character in the locale specific encoding. The current
specification says to return false (or 0 in the C versions); if
it isn't a character, it isn't an alphabetic character. But
I rather like the idea of throwing an exception: if you pass it
something that isn't a character, then you've probably got the
wrong file, or the wrong data, or whatever. (Alternatively, you
need a function islegal, and then a precondition for the other
functions that islegal returns true.)

I would also be in favor of raising an exception or having
a precondition failure if they are called on a locale which uses
a multibyte encoding (like UTF-8), even if the actual character
in question is only a single byte. (And what about calling
islower in an encoding for an alphabet like Arabic, which
doesn't have case?)
 
James Kanze

[...]
I am somewhat sceptical about usefulness of utf-8 bytes
for anything but storing and transporting texts. Simple
operations like std::toupper will never work with these
anyway.
"Simple" operations like std::toupper don't work with most
encodings, including char32_t UTF-32. Mainly because things
like "toupper" aren't simple. (The classical example:
toupper('\u00DF') should result in the two character sequence
"SS".) Any effective toupper has to work on a string level, not
on a character level, and generate a new string (since there is
no one to one mapping to upper). And this can be done in UTF-8,
with the correct tools (which maybe should be part of the
standard).
Yes, the standard library should contain correct tools and not
contain incorrect and misleading tools. Some function in the
standard library that accepts char* as a character sequence may
pretend to be silly and expect ASCII. If there were a thing
like char8_t with an exact meaning, the old "i thought it's
ASCII" trick could not be pulled.

The standard library does contain "correct" tools, in the sense
that they work according to specification:). They probably are
a bit misleading, but this could be considered a problem of
documentation (which isn't the role of the standard): it's clear
to me that isupper, for example, is meaningless in Unicode, or
with an alphabet which doesn't have case, or ideographs, or any
number of other things: in general, the basic functions in ctype
are only meaningful in "constrained" situations (e.g. to parse
a case insensitive programming language). They generally don't
work well with human languages.
Hmm but ... very lot of apps have to deal with texts entered
by user or sent by other apps that can not manage to specify
a binary (or well- formed xml) interface. Such apps always
need operations like capitalizing, case-insensitive
search/compare, date, time, numeric and money
formatting/parsing and so on.

Do they? A very lot of apps don't do any real text processing
at all.
This is not something that can be called separator searcing or
breaking up? Also it all sounds as being business of <locale>
and if <locale> can not pull the weight then it should be
kicked out from standard and something that works should be
put in.

The issue isn't simple, and there are historical considerations
which have to be taken into account. The current <locale> does
represent a halfway solution, but I don't think that, even now,
we know exactly what is needed for a full solution (but we're
a lot closer than we were).
If you have to use the ICU4C library anyway to have correct
locale-specific comparison, transformation and regular expression
rules, then the standard should stop pretending that it provides
something.

I doubt that even ICU handles all of the cases needed for
correct presentation, although they certainly do a lot more than
anything else I know.

The standard library doesn't pretend to solve all problems. It
offers a minimal set of functionality for certain limited uses,
IMHO more for historical reasons (and the fact that it is needed
for iostream) than for anything else. If you need more, you
need a third party library (if you can find one which is
adequate), or to implement your own code. Full
internationalization is very, very complex, and rather
difficult.
 
Ö

Öö Tiib

In most applications, no.  And if you're actually dealing with
full Unicode, neither they nor their wide character equivalents
work: even in UTF-32, you may need several code points to
specify a character.

I hope we are not discussing applications where localization is
not important or is done trivially by translating a few
messages. There are plenty of such applications too, of course.
Some application that regulates the conversion of frequencies
embedded in some engine or brake does not care about
localization or portability, and so does not use [...]
These sound like presentation issues (bold font is definitely
one). A lot of applications aren't concerned with presentation.
And those that are, and that need to support full Unicode,
generally can't use the above functions anyway, because several
code points may be necessary to specify a character, even in
Unicode.

Yes. There are exceptions and quirks, if not with your code then
with some other application or device in the chain through which
you usually have to pass data around. By coincidence, a friendly
Java team was also working this past week with such
two-code-point ä-s in produced HTML; everything but IE-7
displays them correctly. Such issues have always been there, and
then someone has to replace the "\u0061\u0308" with
one-code-point ä-s (and the like) for IE-7. Usual everyday work
and nothing C++-specific there. Maybe the standard should add
some iscombining() for detecting symbols that are meant to be
parts of combined symbols? Unicode is full of such things, so if
C++ [...]
Maybe, but then you have to define exactly what you mean by
"case-insensitive".  In Germany, there are two separate
conventions regarding Umlauts ("ä" may compare equal to "a" or
to "ae", depending on the convention), for example, and of
course, "ß" must compare equal to "SS" (or in certain special
cases, to "SZ", it's context dependent).

Overall, such local differences always pop up. I have faced
complex exceptional conventions connected to the slang used in
a subject field. The deeper the science, the more obscure the
slang and the convenience conventions about terms and strange
alien symbols the wizards in the field are using. The exceptions
were relatively few, but in often-used terms. Like you say
elsewhere in the thread, even ICU cannot deal with everything
that might be needed.

It can often be solved by special pre- or post-processing
wrappers. For example, search: one can construct transformation
variants with pre-processing that applies the unusual cultural
assumption rules, and then search with std::find_first_of(),
which then compares using the transformations the standard can
provide. If some local transformation variant is missing, there
is always some software-enthusiast specialist who points it out,
and it is then simple to add.
That's an interesting proposition; I rather like it.

Good to read.
The simple forms of the functions are useful in many contexts,
where you know that you'll only be treating (or should only be
treating) pure ASCII, for example.  They should be supported, if
only for historical reasons.  The question is what to do when
something like isalpha is called on something that isn't
a character in the locale specific encoding.  The current
specification says to return false (or 0 in the C versions); if
it isn't a character, it isn't an alphabetic character.  But
I rather like the idea of throwing an exception: if you pass it
something that isn't a character, then you've probably got the
wrong file, or the wrong data, or whatever.  (Alternatively, you
need a function islegal, and then a precondition for the other
functions that islegal returns true.)

Yes, input may be from an unreliable source: for example,
entered by a user or read from a file provided by a user. It is
"dirty" data. If it is expected to be pure ASCII but no one did
even an elementary check that it really is (for example with
such an islegal() as you suggest), then it is a programming
error. It may be a very serious programming error ...
depending on context.
I would also be in favor of raising an exception or having
a precondition failure if they are called on a locale which uses
a multibyte encoding (like UTF-8), even if the actual character
in question is only a single byte.  (And what about calling
islower in an encoding for an alphabet like Arabic, which
doesn't have case?)

I have no experience with the Arabic alphabet. Possibly isupper
and islower should both be false for its letters. Of course the
case-related functionality itself is likely pointless for such
an alphabet, and so should be hidden/removed so as not to
confuse customers.
 
