tolower and Unicode

Kelvin Moss

Hi all,

I want to know whether tolower can handle Unicode characters (code points)
correctly. Is that specified by the Standard? If yes, then a reference to
the Standard would be appreciated. Or is it implementation-defined?

Thanks ..
 
Chris Torek

I want to know if tolower can handle unicode characters (codepoints)
correctly?

The tolower() function is only required to handle values in
the set {EOF, [0..UCHAR_MAX]} (typically -1 to 255 inclusive).
If the tolower() in a given implementation only *does* handle
those, it is clearly not going to handle Unicode codepoints
above U+00FF.
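
To illustrate the domain restriction, here is a minimal sketch of lowercasing a plain char string. The helper name lower_bytes is made up for this example. The cast to unsigned char keeps every argument inside [0..UCHAR_MAX], which matters on implementations where plain char is signed:

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: lowercase a byte string in place with tolower().
   Each char is converted to unsigned char first, so the argument is
   always in [0..UCHAR_MAX], the range tolower() is required to accept. */
void lower_bytes(char *s)
{
    for (size_t i = 0; s[i] != '\0'; i++) {
        s[i] = (char)tolower((unsigned char)s[i]);
    }
}
```

What happens to bytes above 127 depends on the current locale. In a multibyte encoding such as UTF-8 those bytes are only fragments of a character, so this approach cannot case-map them.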

In C99, there is a towupper() function that handles wide
characters. If the implementation happens to use Unicode for
its wide characters (and of course supports C99 well enough),
this will do it; if not, it will not.
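
A rough sketch of the wide-character route follows. The helper name upper_wide is hypothetical. Whether anything beyond ASCII gets mapped depends on the implementation's wchar_t encoding and on the LC_CTYPE locale in effect, so treat this as an illustration, not a guarantee:

```c
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: uppercase a wide string in place, one wchar_t at
   a time. towupper() returns its argument unchanged when the current
   locale defines no uppercase mapping for it. */
void upper_wide(wchar_t *s)
{
    for (size_t i = 0; s[i] != L'\0'; i++) {
        s[i] = (wchar_t)towupper((wint_t)s[i]);
    }
}
```

A program would normally call setlocale(LC_CTYPE, "") first to pick up the user's locale. An implementation that defines the macro __STDC_ISO_10646__ is promising that wchar_t values are ISO 10646 (Unicode) code points.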

There is nothing forcing any given implementation to *not*
handle Unicode with toupper(), but there is nothing forcing it
to do so either.
 
William Ahern

On Mon, 25 Sep 2006 07:23:26 +0000, Chris Torek wrote:
In C99, there is a towupper() function that handles wide characters. If
the implementation happens to use Unicode for its wide characters (and
of course supports C99 well enough), this will do it; if not, it will
not.

There is nothing forcing any given implementation to *not* handle
Unicode with toupper(), but there is nothing forcing it to do so either.

How might towupper() handle changing the case of the German eszett (ß, a
funky-looking B to mine eyes)? The lowercase variant can be represented
by a single integer (32-bit, 64-bit, 1024-bit, w'ever). However, the
uppercase must be "SS", necessitating two integers of some type
(regardless of the encoding format: UTF-8, UTF-16, UTF-32).

Point being, the standard C string manipulation interface CANNOT
fully support Unicode, and IMHO the above example is a rather trivial
proof of why ISO C cannot support Unicode at any level of sufficiency,
short of a scenario where "Unicode" is used as a fancy term for 7-bit
ASCII.
 
William Ahern

On Mon, 25 Sep 2006 18:51:02 -0700, William Ahern wrote:
Point being, the standard C string manipulation interface CANNOT fully
support Unicode, and IMHO the above example is a rather trivial proof of
why ISO C cannot support Unicode at any level of sufficiency, short of a
scenario where "Unicode" is used as a fancy term for 7-bit ASCII.

By "cannot support" I meant neither by its historical nor its wide-character
interfaces. I did not mean to imply the impossibility of a Unicode string
library written in ISO C.

This isn't a problem of C, per se. Languages like Python and Java
have the same problem (partly inherited from C). The real issue
lies in the now-faulty assumptions behind the functional interfaces.

http://www.unicode.org/faq/casemap_charprop.html

- Bill
 
Chris Torek

How might towupper() handle changing the case of the German eszett (ß, a
funky-looking B to mine eyes)? The lowercase variant can be represented
by a single integer (32-bit, 64-bit, 1024-bit, w'ever). However, the
uppercase must be "SS", necessitating two integers of some type
(regardless of the encoding format: UTF-8, UTF-16, UTF-32).

Indeed. According to the Standard, it will have to leave the
lowercase eszett as a lowercase eszett, there being no single
uppercase character for towupper() to return.

Point being, the standard C string manipulation interface CANNOT
fully support Unicode ...

Well, in the sense of "completely translating a lowercase string
to an equivalent (but longer) uppercase string", no. But towupper()
(and indeed plain toupper() as well) *could* do the job of "translating
a lowercase Unicode character to its (single) uppercase equivalent,
where that exists". Of course, as I said above, there is not even
a guarantee that wide strings and wchar_t characters use Unicode
at all.
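
To make the distinction concrete, here is a minimal sketch of string-level uppercasing that lets the output grow. It assumes wchar_t holds Unicode code points, which the Standard does not guarantee. The helper upper_expand and the macro SHARP_S are made up for this example, and only the single eszett special case is handled:

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* ASSUMPTION: wchar_t holds Unicode code points, so U+00DF is eszett. */
#define SHARP_S ((wchar_t)0x00DF)

/* Hypothetical helper: write an uppercased copy of src into dst, which
   has room for dst_cap elements including the terminator. Eszett is
   expanded to "SS"; everything else goes through towupper(). Returns
   the length the full result needs, excluding the terminator, so the
   caller can detect truncation. */
size_t upper_expand(wchar_t *dst, size_t dst_cap, const wchar_t *src)
{
    size_t n = 0;
    for (size_t i = 0; src[i] != L'\0'; i++) {
        if (src[i] == SHARP_S) {
            if (n + 1 < dst_cap) dst[n] = L'S';
            if (n + 2 < dst_cap) dst[n + 1] = L'S';
            n += 2;
        } else {
            if (n + 1 < dst_cap) dst[n] = (wchar_t)towupper((wint_t)src[i]);
            n += 1;
        }
    }
    if (dst_cap > 0) {
        dst[n < dst_cap ? n : dst_cap - 1] = L'\0';
    }
    return n;
}
```

The key difference from towupper() is the signature: it maps a string to a string, so a six-character input like "Straße" can become the seven-character "STRASSE". A real implementation would take its expansions from the Unicode special-casing data instead of hard-coding one character.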
 
