convert Unicode to lower/uppercase?

Hallvard B Furuseth · Sep 19, 2003

Has someone got a Python routine or module which converts Unicode
strings to lowercase (or uppercase)?

What I actually need to do is to compare a number of strings in a
case-insensitive manner, so I assume it's simplest to convert to
lower/upper first.

Possibly all strings will be from the latin-1 character set, so I could
convert to 8-bit latin-1, map to lowercase, and convert back, but that
seems rather cumbersome.

Peter Otten · Sep 19, 2003

nospam said:
Has someone got a Python routine or module which converts Unicode
strings to lowercase (or uppercase)?

Toiled and came up with:

ABCÄÖÜß

u'abc\xe4\xf6\xfc'

Peter
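For anyone reading this on a current interpreter, here is a minimal sketch of the case-insensitive comparison asked for above. It assumes Python 3, where every str is Unicode (the thread predates that), and the helper name same_ignoring_case is made up for illustration. It uses str.casefold() rather than lower(), since casefold() is the method intended for caseless matching.

```python
def same_ignoring_case(a: str, b: str) -> bool:
    """Compare two strings caselessly (hypothetical helper, Python 3 assumed)."""
    # casefold() applies Unicode case folding, so "ß" and "ss" compare equal.
    return a.casefold() == b.casefold()


names = ["Ärger", "ÄRGER", "ärger", "Strasse", "Straße"]

# Group the sample strings by their case-folded form.
groups = {}
for name in names:
    groups.setdefault(name.casefold(), []).append(name)
```

With the sample list above, the five strings should collapse into two groups, one per distinct word.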

Hallvard B Furuseth · Sep 19, 2003

Thanks!

jallan · Sep 21, 2003

Peter Otten said:
Toiled and came up with:

u'abc\xe4\xf6\xfc'

Peter

But that really doesn't work properly. According to Unicode specs and
German usage the uppercase of "ß" is actually "SS", that is the single
character "ß" should uppercase to two characters.

Jim Allan

Martin v. Löwis · Sep 21, 2003

jallan said:
But that really doesn't work properly. According to Unicode specs and
German usage the uppercase of "ß" is actually "SS", that is the single
character "ß" should uppercase to two characters.

Can you cite exact chapter and verse of the Unicode specs that say so?
According to the Unicode database,

http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

ß has neither an uppercase mapping nor a lowercase mapping.

Also, in German, the uppercase mapping of ß is of ongoing debate.
For example, the Duden from 1919 says

| Für ß wird in großer Schrift SZ angewandt [...]. Die Verwendung
| _zweier_ Buchstaben für _einen_ Laut ist nur ein Notbehelf, der
| aufhören muß, sobald ein geeigneter Druckbuchstabe für das
| große ß geschaffen ist.

The usage of SZ has only been eliminated in the recent change of
the amtliche Rechtschreibung.

Regards,
Martin

Asun Friere · Sep 22, 2003

Martin v. Löwis said:
The usage of SZ has only been eliminated in the recent change of
the amtliche Rechtschreibung.

And replaced with what? ie. is there now a single capital for SZ?

Gerhard Häring · Sep 22, 2003

Asun said:
And replaced with what? ie. is there now a single capital for SZ?

ß (sz) has not been completely eliminated. After *short* vowels it has
been replaced with ss (Kuß => Kuss, Fluß => Fluss). But after *long*
vowels, it is still used (Maß, Gruß, ...).

-- Gerhard

PS: I was quite disappointed with the reform of German orthography. I'd
have favoured much more radical steps, like elimination of
capitalization of the noun.

Peter Otten · Sep 22, 2003

Martin v. Löwis said:
Can you cite exact chapter and verse of the Unicode specs that say so?
According to the Unicode database,

http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

has neither an uppercase mapping, nor a lowercase mapping.

It seems like UnicodeData.txt does not give the full story. Quoting from
http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt:

[...]
# (For compatibility, the UnicodeData.txt file only contains case mappings for
# characters where they are 1-1, and does not have locale-specific mappings.)
[...]
# <code>; <lower> ; <title> ; <upper> ; (<condition_list> ;)? # <comment>
[...]
# The German es-zed is special--the normal mapping is to SS.
# Note: the titlecase should never occur in practice. It is equal to titlecase(uppercase(<es-zed>))

00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
[...]

Thus, to comply with the standard, "ß".upper() --> "SS" is required.
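To make the record format quoted above concrete, here is a rough sketch that splits one SpecialCasing.txt line into its lower, title and upper fields. It assumes Python 3; the function name parse_special_casing_line is invented for this example, and the optional condition field is simply ignored.

```python
def parse_special_casing_line(line):
    """Parse one SpecialCasing.txt record into (code, lower, title, upper)."""
    data = line.split("#", 1)[0]              # drop the trailing comment
    fields = [f.strip() for f in data.split(";")]
    if len(fields) < 4 or not fields[0]:
        return None                            # blank or comment-only line

    def to_text(hex_codes):
        # Each field is a space-separated list of hexadecimal code points.
        return "".join(chr(int(h, 16)) for h in hex_codes.split())

    # A fifth, optional condition field (e.g. a locale) is ignored here.
    code, lower, title, upper = (to_text(f) for f in fields[:4])
    return code, lower, title, upper


record = "00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S"
code, lower, title, upper = parse_special_casing_line(record)
```

For the es-zed record this yields "ß" as its own lowercase, "Ss" as titlecase and "SS" as uppercase, which is the 1:2 mapping under discussion.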

Also, in German, the uppercase mapping of ß is of ongoing debate.

My personal impression is that, even before the orthography reform in 1998,
the SZ variant was seldom used.
For the "official" rule see http://www.ids-mannheim.de/reform/a2-3.html.

Peter

jallan · Sep 22, 2003

Peter Otten said:
Martin v. Löwis said:

Can you cite exact chapter and verse of the Unicode specs that say so?
According to the Unicode database,

http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

has neither an uppercase mapping, nor a lowercase mapping.

It seems like UnicodeData.txt does not give the full story. Quoting from
http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt:

[...]

# (For compatibility, the UnicodeData.txt file only contains case mappings for
# characters where they are 1-1, and does not have locale-specific mappings.)
[...]
# <code>; <lower> ; <title> ; <upper> ; (<condition_list> ;)? # <comment>
[...]
# The German es-zed is special--the normal mapping is to SS.
# Note: the titlecase should never occur in practice. It is equal to titlecase(uppercase(<es-zed>))

00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
[...]

Thus, to comply with the standard, "ß".upper() --> "SS" is required.

Yes.

Also the Unicode main charts in the annotation for 00DF state:

uppercase is "SS"

See http://www.unicode.org/charts/PDF/U0080.pdf

This note on the character first appeared in Unicode 1.0 (published in
1991) and has been in every revision.

Unicode 1.0, Volume One also lists this in the lower case to upper
case casing tables on page 453.

There is nothing new about this casing requirement.

A further mention occurs in the Unicode 4.0 specifications in Table
4-1 in section 4.2 Case--Normative. See
http://www.unicode.org/versions/Unicode4.0.0/ch04.pdf

This contains the warning:

<< Only legacy implementations that cannot handle case mappings that
increase string lengths should use UnicodeData case mappings alone. The
single-character mappings are insufficient for languages such as
German. >>

So is Python just another shit legacy implementation?

Jim Allan

Martin v. Löwis · Sep 22, 2003

And replaced with what? ie. is there now a single capital for SZ?

Unfortunately, I don't have a current Duden here, but I *think* you
now have to write double-S. There is, of course, the old MASSE vs
MASZE issue - I don't know whether this is considered relevant, as
capitalization is rare, anyway, and ambiguities can be clarified from
the context.

Regards,
Martin

Martin v. Löwis · Sep 22, 2003

Peter Otten said:
# The German es-zed is special--the normal mapping is to SS.
# Note: the titlecase should never occur in practice. It is equal to titlecase(uppercase(<es-zed>))

00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
[...]

Thus, to comply with the standard, "ß".upper() --> "SS" is required.

No. It would be required if .upper would claim to implement
SpecialCasing - but it makes no such claim.

My personal impression is that, even before the orthography reform in 1998,
the SZ variant was seldom used.

There is, of course, the famous "MASSE oder MASZE" example, in particular
in the form "WIR TRINKEN BIER IN MASSEN".

Regards,
Martin

Martin v. Löwis · Sep 22, 2003

So is Python just another shit legacy implementation?

Yes

Regards,
Martin

Asun Friere · Sep 23, 2003

Gerhard Häring said:
PS: I was quite disappointed with the reform of German orthography. I'd
have favoured much more radical steps, like elimination of
capitalization of the noun.

As an English speaker, who occasionally finds himself trying to
decipher German text, let me tell you that little flags like that
--"pick me! I'm a noun!" --are actually quite useful.

jallan · Sep 23, 2003

Martin v. Löwis said:
Peter Otten said:

# The German es-zed is special--the normal mapping is to SS.
# Note: the titlecase should never occur in practice. It is equal to titlecase(uppercase(<es-zed>))

00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
[...]

Thus, to comply with the standard, "ß".upper() --> "SS" is required.

No. It would be required if .upper would claim to implement
SpecialCasing - but it makes no such claim.

Of course not. From http://www.python.org/doc/current/lib/string-methods.html#l2h-203:

<<
*upper( )*
Return a copy of the string converted to uppercase.
>>

This makes no claim about how the magic is done. But there is
certainly an implied claim that it is done correctly.

Unicode specifications are easily available at
http://www.unicode.org/versions/Unicode4.0.0/.

At 3.13 is indicated:

<< The full case mappings for Unicode characters are obtained by using
the mappings from SpecialCasing.txt _plus_ the mappings from
UnicodeData.txt, excluding any latter mappings that would conflict. >>

Case mappings for Unicode require use of SpecialCasing otherwise the
results are not in accord with the Unicode standard.
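The rule from 3.13 quoted above (SpecialCasing entries take precedence, UnicodeData covers the rest) can be sketched roughly as follows. This assumes Python 3. SPECIAL_UPPER is a tiny hand-copied excerpt rather than the real table, and simple_upper is only a stand-in that approximates a UnicodeData-only mapping by leaving any character without a 1:1 uppercase unchanged. All names are invented for the example.

```python
# Tiny hand-copied excerpt of 1:many uppercase mappings; a real table
# would be loaded from SpecialCasing.txt.
SPECIAL_UPPER = {
    "\u00df": "SS",        # LATIN SMALL LETTER SHARP S
    "\u0149": "\u02bcN",   # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
    "\ufb02": "FL",        # LATIN SMALL LIGATURE FL
}


def simple_upper(ch):
    """Stand-in for a UnicodeData.txt-only mapping: 1:1 or unchanged."""
    mapped = ch.upper()
    return mapped if len(mapped) == 1 else ch


def simple_upper_text(text):
    """Length-preserving conversion, as a legacy implementation would do it."""
    return "".join(simple_upper(ch) for ch in text)


def full_upper(text):
    """SpecialCasing entries win; everything else uses the simple mapping."""
    return "".join(SPECIAL_UPPER.get(ch, simple_upper(ch)) for ch in text)
```

Under these assumptions simple_upper_text("Maß") keeps the ß and the length, while full_upper("Maß") grows by one character.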

At 4.2 is found:

<< Only legacy implementations that cannot handle case mappings that
increase string lengths should use UnicodeData case mappings alone.
The single-character mappings are insufficient for languages such as
German >>

I don't see any particular reason why Python "cannot handle case
mappings that increase string lengths".

Unicode again warns that using UnicodeData.txt alone is not
sufficient.

The text continues on "SpecialCasing.txt":

<< Contains additional case mappings that map to more than one
character, such as "ß" to "SS". >>

Section 5.18 Case Mappings goes into further detail about casing
issues and specifically mentions:

<< Case mappings may produce strings of different length than the
original. For example the German character U+00DF ß LATIN SMALL LETTER
SHARP S expands when uppercased to the sequence of two characters "SS".
This also occurs where there is no precomposed character corresponding
to a case mapping, such as with U+0149 ŉ LATIN SMALL LETTER N
PRECEDED BY APOSTROPHE. >>

See also http://www.unicode.org/faq/casemap_charprop-old.html for the
Unicode FAQ which contains:

<<
Q: Why is there no upper-case SHARP S (ß)?

A: There are 139 lower-case letters in Unicode 2.1 that have no direct
uppercase equivalent. Should there be introduced new bogus characters
for all of them, so that when you see an "fl" ligature you can
uppercase it to "FL" without expanding anything? Of course not.

Note that case conversion is inherently language-sensitive, notably in
the case of IPA, which needs to be left strictly alone even when
embedded in another language which is being case converted. The best
you can get is an approximate fit. [JC]

Q: Is all of the Unicode case mapping information in UnicodeData.txt?

A: No. The UnicodeData.txt file includes all of the 1:1 case mappings,
but doesn't include 1:many mappings such as the one needed for
uppercasing ß. Since many parsers now expect this file to have at most
single characters in the case mapping fields, an additional file
(SpecialCasing.txt) was added to provide the 1:many mappings. For more
information, see UTR #21- Case Mappings [MD]
>>

Python specifications make an implied claim of full support for
Unicode and an implied claim that the function upper() uppercases a
string properly.

The implied combined claim is that Python supports Unicode and
supports proper casing in Unicode.

This implied claim is false.

Truly accurate documentation for upper() should say that it uppercases
a string except for those characters where uppercasing would expand a
character to more than one character in which circumstance that
character is not uppercased or uppercased with loss of data.

Python specifications need not say how casing is done, whether by
using Unicode tables directly or by using its own methods that
accomplish the same results.

Users should not have to know such details. They may wish to know
where a particular function does not do what might be expected of it.

Jim Allan

Peter Otten · Sep 23, 2003

jallan said:
I don't see any particular reason why Python "cannot handle case
mappings that increase string lengths".

Now that's a long post. I think it essentially boils down to the above
statement.

Looking into stringobject.c (judging from a first impression,
unicodeobject.c has essentially the same algorithm, but with a few
indirections):

static PyObject *
string_upper(PyStringObject *self)
{
    char *s = PyString_AS_STRING(self), *s_new;
    int i, n = PyString_GET_SIZE(self);
    PyObject *new;

    new = PyString_FromStringAndSize(NULL, n);
    if (new == NULL)
        return NULL;
    s_new = PyString_AsString(new);
    for (i = 0; i < n; i++) {
        int c = Py_CHARMASK(*s++);
        if (islower(c)) {
            *s_new = toupper(c);
        } else
            *s_new = c;
        s_new++;
    }
    return new;
}

The whole routine builds on the assumption that len(s) == len(s.upper()) and
nothing short of a complete rewrite will fix that. But if you volunteer...

Personally, I think it's a long way to go for a little s, sharp as it may be

Peter
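One way around the len(s) == len(s.upper()) assumption, while still writing into a preallocated buffer, is to measure first and fill second. The sketch below only illustrates that idea in Python 3. It is not how CPython does or did it, and EXPANSIONS and both function names are made up for the example.

```python
EXPANSIONS = {"\u00df": "SS"}   # illustrative excerpt, not the full table


def upper_piece(ch):
    # Uppercase form of one character; may be longer than one character.
    return EXPANSIONS.get(ch, ch.upper())


def upper_two_pass(text):
    # Pass 1: work out the final length before allocating anything.
    total = sum(len(upper_piece(ch)) for ch in text)
    out = [""] * total                        # "preallocated" output buffer
    # Pass 2: fill the buffer, advancing by the size of each piece.
    pos = 0
    for ch in text:
        piece = upper_piece(ch)
        out[pos:pos + len(piece)] = piece
        pos += len(piece)
    return "".join(out)
```

The extra pass costs a second walk over the input, but the allocation stays a single fixed-size block, which is roughly what a C version would want.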

Martin v. Löwis · Sep 23, 2003

A: No. The UnicodeData.txt file includes all of the 1:1 case mappings,
but doesn't include 1:many mappings such as the one needed for
uppercasing ß. Since many parsers now expect this file to have at most
single characters in the case mapping fields, an additional file
(SpecialCasing.txt) was added to provide the 1:many mappings. For more
information, see UTR #21- Case Mappings [MD]
Python specifications make an implied claim of full support for
Unicode and an implied claim that the function upper() uppercases a
string properly.

This is a contradiction: SpecialCasing contains 1:n mappings, whereas
.upper() can only return a single result. So how do you think
SpecialCasing should be considered in the implementation of .upper()?

Users should not have to know such details. They may wish to know
where a particular function does not do what might be expected of it.

Things are more difficult than they appear to be.

Regards,
Martin

Martin v. Löwis · Sep 23, 2003

Peter Otten said:
Looking into stringobject.c (judging from a first impression,
unicodeobject.c has essentially the same algorithm, but with a few
indirections):

You are mistaken. The implementation in unicodeobject.c is
fundamentally different. The byte string implementation uses the C
library, the Unicode implementation uses the Unicode character
database. So the former cannot be changed, whereas the latter could,
in theory, be extended to use additional data.

Regards,
Martin

Peter Otten · Sep 23, 2003

Martin said:
You are mistaken. The implementation in unicodeobject.c is
fundamentally different. The byte string implementation uses the C
library, the Unicode implementation uses the Unicode character
database. So the former cannot be changed, whereas the latter could,
in theory, be extended to use additional data.

I followed the code to fixupper() which operates on a preallocated unicode
object and thus cannot cope with a string that expands while being
transformed. I didn't actually resolve the macros.

While we are at it, would it be viable to "abuse" the encoding/decoding
mechanism to do case conversions?

Peter

jallan · Sep 24, 2003

Peter Otten said:
Now that's a long post. I think it essentially boils down to the above
statement.

Looking into stringobject.c (judging from a first impression,
unicodeobject.c has essentially the same algorithm, but with a few
indirections):

static PyObject *
string_upper(PyStringObject *self)
{
    char *s = PyString_AS_STRING(self), *s_new;
    int i, n = PyString_GET_SIZE(self);
    PyObject *new;

    new = PyString_FromStringAndSize(NULL, n);
    if (new == NULL)
        return NULL;
    s_new = PyString_AsString(new);
    for (i = 0; i < n; i++) {
        int c = Py_CHARMASK(*s++);
        if (islower(c)) {
            *s_new = toupper(c);
        } else
            *s_new = c;
        s_new++;
    }
    return new;
}

The whole routine builds on the assumption that len(s) == len(s.upper()) and
nothing short of a complete rewrite will fix that. But if you volunteer...

I would love to if I had the time. Sigh! Maybe in some months.

Personally, I think it's a long way to go for a little s, sharp as it may be

If it were just ß one could throw in a quick conversion of any ß to
ss at the beginning.

But there are over a hundred other characters that expand when
uppercased in http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt,
most of them Greek. Greek is a horror. See
http://www.tlg.uci.edu/~opoudjis/unicode/unicode_adscript.html for the
sad tale.

Unfortunately language and orthography are messy and inconsistent and
illogical and sometimes just silly. But handling orthography properly
involves dealing with these complex rules and subrules and exceptions
to rules rather than ignoring them.

Unicode gives us great power, but with great power comes great
responsibility and lots of niggling code. :-(

Fortunately only the Latin, Greek, Coptic, Cyrillic and Armenian
scripts have such a thing as casing and the Unicode people have
provided data files and algorithms that supposedly handle casing for
these languages acceptably.

From the Conformance requirements for Unicode at
http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf#G29484 (C20):

<< An implementation that purports to support the default casing
operations of case conversion, case detection, and caseless mapping
shall do so in accordance with the definitions and specifications in
Section 3.13, Default Case Operations. >>

This involves even more messy fussing about with context specification
for casing and with what values should be returned from a case
querying function, e.g. "A2" is true as both uppercase and titlecase
but not as lowercase. "3" is true as lowercase, uppercase and
titlecase.
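The case-detection point can be tried out with a small sketch. It assumes Python 3 and takes the reading that a string "is" in a given case when converting it to that case leaves it unchanged. The function names are invented here, and str.title() is only an approximation of the Unicode titlecase operation.

```python
def is_lowercase(text):
    # True when lowercasing would not change the string.
    return text.lower() == text


def is_uppercase(text):
    # True when uppercasing would not change the string.
    return text.upper() == text


def is_titlecase(text):
    # True when titlecasing (as str.title() does it) would not change it.
    return text.title() == text


def case_report(text):
    return {
        "lower": is_lowercase(text),
        "upper": is_uppercase(text),
        "title": is_titlecase(text),
    }
```

Note that this differs from the built-in str.isupper() and str.islower(), which return False for a string such as "3" that contains no cased characters at all.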

Python or any application or language either does or doesn't conform.

I doubt that there is currently any application that can yet honestly
purport to support Unicode default casing operations of case
conversion, case detection and caseless mapping.

Jim Allan

Martin v. Löwis · Sep 24, 2003

Peter Otten said:
While we are at it, would it be viable to "abuse" the
encoding/decoding mechanism to do case conversions?

It might be viable, but I would consider it abuse: for one thing, I'm
not in favour of codecs which do Unicode->Unicode conversions - IMO, a
codec should convert between Unicode and byte strings. Furthermore, a
codec IMO should represent a proper "encoding", which case conversions
would not do.

Instead, it would be much better to provide such functions in a
library, e.g. by wrapping ICU. Then, case conversions should be done
locale-dependent, instead of being general (as .upper currently is).
The locale-dependent way would best operate on explicit locale
objects, so you would spell

locale_object = load_locale("German", "Plattdeutsch")
up_string = locale_object.to_upper(lower_string)

In that case, the upper-case function would stop being a string
method, and be a locale method instead, taking a string argument.
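As a rough illustration of that shape of API, here is a toy locale object with a to_upper method. Everything in it is hypothetical: CasingLocale is not ICU, not the standard locale module, and not the load_locale function sketched above. It only shows the string becoming an argument to a locale method. Python 3 is assumed.

```python
class CasingLocale:
    """Hypothetical locale object; not ICU and not the stdlib locale module."""

    def __init__(self, name, upper_overrides=None):
        self.name = name
        # Per-locale exceptions to the general uppercase mapping.
        self._upper = dict(upper_overrides or {})

    def to_upper(self, text):
        # Locale-specific mappings win; str.upper() handles the rest.
        return "".join(self._upper.get(ch, ch.upper()) for ch in text)


# Turkish uppercases "i" to dotted capital I (U+0130).
turkish = CasingLocale("tr", {"i": "\u0130"})
generic = CasingLocale("und")
```

With these two objects the same input gives different results: turkish.to_upper("istanbul") starts with İ, while generic.to_upper("istanbul") starts with a plain I.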

Regards,
Martin
