tolower and Unicode

Kelvin Moss · Sep 25, 2006

Hi all,

I want to know whether tolower can handle Unicode characters (code points)
correctly. Is this specified by the Standard? If so, a reference to
the Standard would be appreciated. Or is it implementation-defined?

Thanks ..

Chris Torek · Sep 25, 2006

> I want to know if tolower can handle unicode characters (codepoints)
> correctly?

The tolower() function is only required to handle values in
the set {EOF, [0..UCHAR_MAX]} (typically -1 to 255 inclusive).
If the tolower() in a given implementation only *does* handle
those, it is clearly not going to handle Unicode code points
above U+00FF.
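
To make the domain restriction concrete, here is a minimal sketch of lowercasing a byte string with tolower(). The helper name lower_bytes is hypothetical, not something from the Standard. The cast to unsigned char keeps every argument inside [0..UCHAR_MAX], even on implementations where plain char is signed.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: lowercase a byte string in place with tolower().
 * Each char is converted to unsigned char first, so the value passed to
 * tolower() is always representable as unsigned char. */
void lower_bytes(char *s)
{
    for (size_t i = 0; s[i] != '\0'; i++) {
        s[i] = (char)tolower((unsigned char)s[i]);
    }
}
```

This works byte by byte, so it says nothing about multibyte encodings such as UTF-8; bytes outside the locale's single-byte alphabet are simply left as they are.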

In C99, there is a towupper() function that handles wide
characters. If the implementation happens to use Unicode for
its wide characters (and of course supports C99 well enough),
this will do it; if not, it will not.
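
As a rough sketch of the wide-character route, the following applies towupper() to each element of a wide string. The helper name upper_wide is hypothetical. What actually gets mapped depends on the current LC_CTYPE locale and on how the implementation encodes wchar_t; nothing in the code assumes Unicode.

```c
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: uppercase a wide string in place, one wchar_t at a
 * time. Characters with no uppercase form in the current locale are
 * returned unchanged by towupper(). */
void upper_wide(wchar_t *s)
{
    for (; *s != L'\0'; s++) {
        *s = (wchar_t)towupper((wint_t)*s);
    }
}
```

A program would normally call setlocale(LC_CTYPE, "") from <locale.h> first; in the default "C" locale, an implementation may map only the basic Latin letters.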

There is nothing forcing any given implementation to *not*
handle Unicode with toupper(), but there is nothing forcing it
to do so either.

William Ahern · Sep 26, 2006

On Mon, 25 Sep 2006 07:23:26 +0000, Chris Torek wrote:

> In C99, there is a towupper() function that handles wide characters. If
> the implementation happens to use Unicode for its wide characters (and
> of course supports C99 well enough), this will do it; if not, it will
> not.
>
> There is nothing forcing any given implementation to *not* handle
> Unicode with toupper(), but there is nothing forcing it to do so either.

How might towupper() handle changing the case of the German eszett (funky
looking B to mine eyes)? In the lowercase variant it can be represented
by a single integer (32-bit, 64-bit, 1024-bit, whatever). However, the
uppercase must be "SS", necessitating two integers of some type
(regardless of the encoding format: UTF-8, UTF-16, UTF-32).

Point being, the standard C string manipulation interface CANNOT
fully support Unicode, and IMHO the above example is a rather trivial
proof of why ISO C cannot support Unicode at any level of sufficiency,
short of a scenario where "Unicode" is used as a fancy term for 7-bit
ASCII.

William Ahern · Sep 26, 2006

On Mon, 25 Sep 2006 18:51:02 -0700, William Ahern wrote:

> Point being, the standard C string manipulation interface CANNOT fully
> support Unicode, and IMHO the above example is a rather trivial proof of
> why ISO C cannot support Unicode at any level of sufficiency, short of a
> scenario where "Unicode" is used as a fancy term for 7-bit ASCII.

By "cannot support" I meant neither by its historical nor wide-character
interfaces. I did not mean to imply the impossibility of a Unicode string
library written in ISO C.

This isn't a problem of C, per se. Languages like Python and Java
have the same problem (partly inherited from C). The real issue
lies in the now faulty assumptions behind the functional interfaces.
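
To illustrate the interface point: a one-character-in, one-character-out function has nowhere to put a second character, so any full mapping has to work on strings with a separate output buffer. The sketch below is only an illustration under stated assumptions. The name upper_expand is hypothetical, the code assumes wchar_t holds Unicode code points (so the eszett is 0x00DF), and it special-cases just that one character rather than implementing the full Unicode case-mapping tables.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Assumption: wchar_t values are Unicode code points, so U+00DF is eszett. */
#define SHARP_S ((wchar_t)0x00DF)

/* Hypothetical helper: uppercase src into dst, which has room for dstcap
 * wide characters including the terminator. Returns the length the full
 * result needs (excluding the terminator), so a return value >= dstcap
 * means the output was truncated. */
size_t upper_expand(wchar_t *dst, size_t dstcap, const wchar_t *src)
{
    size_t n = 0;

    for (; *src != L'\0'; src++) {
        if (*src == SHARP_S) {
            /* One input character becomes two output characters. */
            if (n + 1 < dstcap) dst[n] = L'S';
            if (n + 2 < dstcap) dst[n + 1] = L'S';
            n += 2;
        } else {
            if (n + 1 < dstcap) dst[n] = (wchar_t)towupper((wint_t)*src);
            n += 1;
        }
    }
    if (dstcap > 0) {
        dst[n < dstcap ? n : dstcap - 1] = L'\0';
    }
    return n;
}
```

The snprintf-style return value is a design choice for the sketch: the caller can size a buffer from a first call and then call again.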

http://www.unicode.org/faq/casemap_charprop.html

- Bill

Chris Torek · Oct 6, 2006

> How might towupper() handle changing the case of the German eszett (funky
> looking B to mine eyes)? In the lowercase variant it can be represented
> by a single integer (32-bit, 64-bit, 1024-bit, whatever). However, the
> uppercase must be "SS", necessitating two integers of some type
> (regardless of the encoding format: UTF-8, UTF-16, UTF-32).

Indeed. According to the Standard, it will have to leave the
lowercase eszett as a lowercase eszett, there being no uppercase
character equivalent.
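
A small sketch of how a program could observe this: towupper() returns its argument unchanged when there is no corresponding uppercase wide character, so comparing input and output shows whether a mapping was applied. The name towupper_changes is hypothetical, and passing 0x00DF for the eszett assumes wchar_t holds Unicode code points.

```c
#include <wctype.h>

/* Hypothetical helper: returns 1 if towupper() maps wc to a different wide
 * character in the current locale, 0 if it hands wc back unchanged (which
 * is also what happens for characters that are already uppercase). */
int towupper_changes(wint_t wc)
{
    return towupper(wc) != wc;
}
```

On an implementation with Unicode wide characters, one would expect towupper_changes(0x00DF) to be 0 while towupper_changes(L'a') is 1.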

> Point being, the standard C string manipulation interface CANNOT
> fully support Unicode ...

Well, in the sense of "completely translating a lowercase string
to an equivalent (but longer) uppercase string", no. But towupper()
(and indeed plain toupper() as well) *could* do the job of "translating
a lowercase Unicode character to its (single) uppercase equivalent,
where that exists". Of course, as I said above, there is not even
a guarantee that wide strings and wchar_t characters use Unicode
at all.
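
For the last point, C99 does give a way to ask: an implementation defines the macro __STDC_ISO_10646__ when wchar_t values are the coded representations of ISO/IEC 10646 characters. A minimal sketch of checking it follows; the function name wchar_is_iso10646 is hypothetical.

```c
/* Hypothetical helper: returns 1 if the implementation promises that
 * wchar_t values are ISO/IEC 10646 (Unicode) code points, 0 if it makes
 * no such promise. */
int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}
```

A result of 0 does not prove that wide characters are not Unicode, only that the implementation does not guarantee it.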
