convert Unicode to lower/uppercase?

  • Thread starter Hallvard B Furuseth

Hallvard B Furuseth

Has someone got a Python routine or module which converts Unicode
strings to lowercase (or uppercase)?

What I actually need to do is to compare a number of strings in a
case-insensitive manner, so I assume it's simplest to convert to
lower/upper first.

Possibly all strings will be from the latin-1 character set, so I could
convert to 8-bit latin-1, map to lowercase, and convert back, but that
seems rather cumbersome.
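A case-insensitive comparison does not require a round trip through latin-1 if both strings are folded at comparison time. The following is a minimal sketch, assuming a current Python 3 interpreter where `str.casefold()` exists (it did not when this thread was written). The helper names `caseless_key` and `caseless_equal` are made up for illustration.

```python
import unicodedata


def caseless_key(s):
    """Return a key suitable for case-insensitive comparison of s."""
    # Normalize first so that precomposed and decomposed spellings of the
    # same letter compare equal, then apply Unicode case folding.
    return unicodedata.normalize("NFC", s).casefold()


def caseless_equal(a, b):
    """Compare two strings while ignoring case differences."""
    return caseless_key(a) == caseless_key(b)


print(caseless_equal("Stra\u00dfe", "STRASSE"))  # True
```

Case folding maps "ß" to "ss", which is why the two spellings above compare equal; plain `lower()` would keep them distinct.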
 

jallan

Peter Otten said:
Toiled and came up with:

u'abc\xe4\xf6\xfc'

Peter

But that really doesn't work properly. According to Unicode specs and
German usage the uppercase of "ß" is actually "SS", that is the single
character "ß" should uppercase to two characters.

Jim Allan
 
Martin v. Löwis

jallan said:
But that really doesn't work properly. According to Unicode specs and
German usage the uppercase of "ß" is actually "SS", that is the single
character "ß" should uppercase to two characters.

Can you cite exact chapter and verse of the Unicode specs that say so?
According to the Unicode database,

http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

ß (U+00DF) has neither an uppercase mapping nor a lowercase mapping.

Also, in German, the uppercase mapping of ß is of ongoing debate.
For example, the Duden from 1919 says

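| Für ß wird in großer Schrift SZ angewandt [...]. Die Verwendung
| _zweier_ Buchstaben für _einen_ Laut ist nur ein Notbehelf, der
| aufhören muß, sobald ein geeigneter Druckbuchstabe für das
| große ß geschaffen ist.
|
| (In English: For ß, SZ is used in capital letters [...]. The use
| of _two_ letters for _one_ sound is only a makeshift, which must
| end as soon as a suitable printing type for the capital ß has
| been created.)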

The usage of SZ has only been eliminated in the recent change of
the amtliche Rechtschreibung.

Regards,
Martin
 

Asun Friere

Martin v. Löwis said:
The usage of SZ has only been eliminated in the recent change of
the amtliche Rechtschreibung.

And replaced with what? ie. is there now a single capital for SZ?
 
Gerhard Häring

Asun said:
And replaced with what? ie. is there now a single capital for SZ?

ß (sz) has not been completely eliminated. After *short* vowels it has
been replaced with ss (Kuß => Kuss, Fluß => Fluss). But after *long*
vowels, it is still used (Maß, Gruß, ...).

-- Gerhard

PS: I was quite disappointed with the reform of German orthography. I'd
have favoured much more radical steps, like eliminating the
capitalization of nouns.
 

Peter Otten

Martin v. Löwis said:
Can you cite exact chapter and verse of the Unicode specs that say so?
According to the Unicode database,

http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

ß (U+00DF) has neither an uppercase mapping nor a lowercase mapping.

It seems like UnicodeData.txt does not give the full story. Quoting from
http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt:

[...]
# (For compatibility, the UnicodeData.txt file only contains case mappings for
# characters where they are 1-1, and does not have locale-specific mappings.)
[...]
# <code>; <lower> ; <title> ; <upper> ; (<condition_list> ;)? # <comment>
[...]
# The German es-zed is special--the normal mapping is to SS.
# Note: the titlecase should never occur in practice. It is equal to
# titlecase(uppercase(<es-zed>))

00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
[...]

Thus, to comply with the standard, "ß".upper() --> "SS" is required.
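To make the record format concrete, here is a small sketch that decodes the es-zed line quoted above into actual strings. The function name `parse_special_casing` is invented for this example, and it deliberately ignores the optional `<condition_list>` field, so it only suits unconditional records like this one.

```python
def parse_special_casing(line):
    """Parse one unconditional SpecialCasing.txt record into a dict of strings."""
    data = line.split("#", 1)[0]  # drop the trailing comment
    fields = [field.strip() for field in data.split(";")]
    code, lower, title, upper = fields[:4]

    def to_text(field):
        # Each field is a space-separated list of hexadecimal code points.
        return "".join(chr(int(cp, 16)) for cp in field.split())

    return {
        "char": to_text(code),
        "lower": to_text(lower),
        "title": to_text(title),
        "upper": to_text(upper),
    }


record = parse_special_casing(
    "00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S"
)
print(record["upper"])  # SS
```

The title and upper fields each decode to two characters, which is exactly the 1:many case the single-field layout of UnicodeData.txt cannot express.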
Martin v. Löwis said:
Also, in German, the uppercase mapping of ß is of ongoing debate.

My personal impression is that, even before the orthography reform in 1998,
the SZ variant was seldom used.
For the "official" rule see http://www.ids-mannheim.de/reform/a2-3.html.

Peter
 

jallan

Peter Otten said:
Martin v. Löwis said:
Can you cite exact chapter and verse of the Unicode specs that say so?
According to the Unicode database,

http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

ß (U+00DF) has neither an uppercase mapping nor a lowercase mapping.

It seems like UnicodeData.txt does not give the full story. Quoting from
http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt:

[...]
# (For compatibility, the UnicodeData.txt file only contains case mappings for
# characters where they are 1-1, and does not have locale-specific mappings.)
[...]
# <code>; <lower> ; <title> ; <upper> ; (<condition_list> ;)? # <comment>
[...]
# The German es-zed is special--the normal mapping is to SS.
# Note: the titlecase should never occur in practice. It is equal to
# titlecase(uppercase(<es-zed>))

00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
[...]

Thus, to comply with the standard, "ß".upper() --> "SS" is required.

Yes.

Also the Unicode main charts in the annotation for 00DF state:

uppercase is "SS"

See http://www.unicode.org/charts/PDF/U0080.pdf

This note on the character first appeared in Unicode 1.0 (published in
1991) and has been in every revision.

Unicode 1.0, Volume One also lists this in the lower case to upper
case casing tables on page 453.

There is nothing new about this casing requirement.

A further mention occurs in the Unicode 4.0 specifications in Table
4-1 in section 4.2 Case--Normative. See
http://www.unicode.org/versions/Unicode4.0.0/ch04.pdf

This contains the warning:

<< Only legacy implementations that cannot handle case mappings that
increase string lengths should use UnicodeData case mappings alone. The
single-character mappings are insufficient for languages such as
German. >>

So is Python just another shit legacy implementation?

Jim Allan
 
Martin v. Löwis

And replaced with what? ie. is there now a single capital for SZ?

Unfortunately, I don't have a current Duden here, but I *think* you
now have to write double-S. There is, of course, the old MASSE vs
MASZE issue - I don't know whether this is considered relevant, as
capitalization is rare, anyway, and ambiguities can be clarified from
the context.

Regards,
Martin
 
Martin v. Löwis

Peter Otten said:
# The German es-zed is special--the normal mapping is to SS.
# Note: the titlecase should never occur in practice. It is equal to
# titlecase(uppercase(<es-zed>))

00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
[...]

Thus, to comply with the standard, "ß".upper() --> "SS" is required.

No. It would be required if .upper would claim to implement
SpecialCasing - but it makes no such claim.

Peter Otten said:
My personal impression is that, even before the orthography reform in 1998,
the SZ variant was seldom used.

There is, of course, the famous "MASSE oder MASZE" example, in particular
in the form "WIR TRINKEN BIER IN MASSEN".

Regards,
Martin
 

Asun Friere

Gerhard Häring said:
PS: I was quite disappointed with the reform of German orthography. I'd
have favoured much more radical steps, like eliminating the
capitalization of nouns.

As an English speaker, who occasionally finds himself trying to
decipher German text, let me tell you that little flags like that
--"pick me! I'm a noun!" --are actually quite useful.
 

jallan

Peter Otten said:
# The German es-zed is special--the normal mapping is to SS.
# Note: the titlecase should never occur in practice. It is equal to
# titlecase(uppercase(<es-zed>))

00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
[...]

Thus, to comply with the standard, "ß".upper() --> "SS" is required.

No. It would be required if .upper would claim to implement
SpecialCasing - but it makes no such claim.

Of course not. From http://www.python.org/doc/current/lib/string-methods.html#l2h-203:

<<
*upper( )*
Return a copy of the string converted to uppercase.
>>

This makes no claim about how the magic is done. But there is
certainly an implied claim that it is done correctly.

Unicode specifications are easily available at
http://www.unicode.org/versions/Unicode4.0.0/.

At 3.13 is indicated:

<< The full case mappings for Unicode characters are obtained by using
the mappings from SpecialCasing.txt _plus_ the mappings from
UnicodeData.txt, excluding any latter mappings that would conflict. >>

Case mappings for Unicode require use of SpecialCasing otherwise the
results are not in accord with the Unicode standard.

At 4.2 is found:

<< Only legacy implementations that cannot handle case mappings that
increase string lengths should use UnicodeData case mappings alone.
The single-character mappings are insufficient for languages such as
German >>

I don't see any particular reason why Python "cannot handle case
mappings that increase string lengths".

Unicode again warns that using UnicodeData.txt alone is not
sufficient.

The text continues on "SpecialCasing.txt":

<< Contains additional case mappings that map to more than one
character, such as "ß" to "SS". >>

Section 5.18 Case Mappings goes into further detail about casing
issues and specifically mentions:

<< Case mappings may produce strings of different length than the
original. For example the German character U+00DF ß LATIN SMALL LETTER
SHARP S expands when uppercased to the sequence of two characters "SS".
This also occurs where there is no precomposed character corresponding
to a case mapping, such as with U+0149 'n LATIN SMALL LETTER N
PRECEDED BY APOSTROPHE. >>
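For readers on a current interpreter: Python 3 (3.3 and later) applies the full case mappings in `str.upper()`, so the length change described in the quoted passage can be observed directly. This is a short sketch of that behaviour, not a description of the Python version discussed in this thread; the variable names are arbitrary.

```python
sharp_s = "\u00df"      # LATIN SMALL LETTER SHARP S
fl_ligature = "\ufb02"  # LATIN SMALL LIGATURE FL

for ch in (sharp_s, fl_ligature):
    upper = ch.upper()
    # Each of these single characters expands to two characters.
    print(repr(ch), "->", repr(upper), "length", len(ch), "->", len(upper))
```

Code that assumes `len(s) == len(s.upper())`, for example when slicing by precomputed offsets, needs to account for this.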

See also http://www.unicode.org/faq/casemap_charprop-old.html for the
Unicode FAQ which contains:

<<
Q: Why is there no upper-case SHARP S (ß)?

A: There are 139 lower-case letters in Unicode 2.1 that have no direct
uppercase equivalent. Should there be introduced new bogus characters
for all of them, so that when you see an "fl" ligature you can
uppercase it to "FL" without expanding anything? Of course not.

Note that case conversion is inherently language-sensitive, notably in
the case of IPA, which needs to be left strictly alone even when
embedded in another language which is being case converted. The best
you can get is an approximate fit. [JC]

Q: Is all of the Unicode case mapping information in UnicodeData.txt?

A: No. The UnicodeData.txt file includes all of the 1:1 case mappings,
but doesn't include 1:many mappings such as the one needed for
uppercasing ß. Since many parsers now expect this file to have at most
single characters in the case mapping fields, an additional file
(SpecialCasing.txt) was added to provide the 1:many mappings. For more
information, see UTR #21 - Case Mappings [MD]
>>

Python specifications make an implied claim of full support for
Unicode and an implied claim that the function upper() uppercases a
string properly.

The implied combined claim is that Python supports Unicode and
supports proper casing in Unicode.

This implied claim is false.

Truly accurate documentation for upper() should say that it uppercases
a string except for those characters where uppercasing would expand a
character to more than one character, in which case that character is
either left unchanged or uppercased with loss of data.

Python specifications need not say how casing is done, whether by
using Unicode tables directly or by using its own methods that
accomplish the same results.

Users should not have to know such details. They may wish to know
where a particular function does not do what might be expected of it.

Jim Allan
 

Peter Otten

jallan said:
I don't see any particular reason why Python "cannot handle case
mappings that increase string lengths".

Now that's a long post. I think it essentially boils down to the above
statement.

Looking into stringobject.c (judging from a first impression,
unicodeobject.c has essentially the same algorithm, but with a few
indirections):

static PyObject *
string_upper(PyStringObject *self)
{
    char *s = PyString_AS_STRING(self), *s_new;
    int i, n = PyString_GET_SIZE(self);
    PyObject *new;

    new = PyString_FromStringAndSize(NULL, n);
    if (new == NULL)
        return NULL;
    s_new = PyString_AsString(new);
    for (i = 0; i < n; i++) {
        int c = Py_CHARMASK(*s++);
        if (islower(c)) {
            *s_new = toupper(c);
        } else
            *s_new = c;
        s_new++;
    }
    return new;
}

The whole routine builds on the assumption that len(s) == len(s.upper()) and
nothing short of a complete rewrite will fix that. But if you volunteer...
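The structural change such a rewrite needs can be sketched in a few lines of Python: instead of writing into a buffer of fixed length n, collect one output piece per input character and join them at the end. The table `EXPANDING_UPPER` and the function `upper_with_expansion` are hypothetical names, and the table is only a tiny illustrative subset of SpecialCasing.txt.

```python
# Illustrative subset of 1:many uppercase mappings (see SpecialCasing.txt).
EXPANDING_UPPER = {
    "\u00df": "SS",  # LATIN SMALL LETTER SHARP S
    "\ufb02": "FL",  # LATIN SMALL LIGATURE FL
}


def upper_with_expansion(text):
    """Uppercase text, allowing one input character to yield several."""
    pieces = []
    for ch in text:
        # Look for a 1:many mapping first, then fall back to the
        # character's own uppercase form.
        pieces.append(EXPANDING_UPPER.get(ch, ch.upper()))
    # The output length is only known once every piece has been produced.
    return "".join(pieces)


print(upper_with_expansion("Ma\u00df"))  # MASS
```

In C the same idea would mean either a first pass to compute the output size or a growable buffer, which is the part the fixed-size loop above cannot accommodate.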

Personally, I think it's a long way to go for a little s, sharp as it may be :)

Peter
 
Martin v. Löwis

jallan said:
A: No. The UnicodeData.txt file includes all of the 1:1 case mappings,
but doesn't include 1:many mappings such as the one needed for
uppercasing ß. Since many parsers now expect this file to have at most
single characters in the case mapping fields, an additional file
(SpecialCasing.txt) was added to provide the 1:many mappings. For more
information, see UTR #21 - Case Mappings [MD]

Python specifications make an implied claim of full support for
Unicode and an implied claim that the function upper() uppercases a
string properly.

This is a contradiction: SpecialCasing contains 1:n mappings, whereas
.upper() can only return a single result. So how do you think
SpecialCasing should be considered in the implementation of .upper()?

jallan said:
Users should not have to know such details. They may wish to know
where a particular function does not do what might be expected of it.

Things are more difficult than they appear to be.

Regards,
Martin
 
Martin v. Löwis

Peter Otten said:
Looking into stringobject.c (judging from a first impression,
unicodeobject.c has essentially the same algorithm, but with a few
indirections):

You are mistaken. The implementation in unicodeobject.c is
fundamentally different. The byte string implementation uses the C
library, the Unicode implementation uses the Unicode character
database. So the former cannot be changed, whereas the latter could,
in theory, be extended to use additional data.

Regards,
Martin
 

Peter Otten

Martin said:
You are mistaken. The implementation in unicodeobject.c is
fundamentally different. The byte string implementation uses the C
library, the Unicode implementation uses the Unicode character
database. So the former cannot be changed, whereas the latter could,
in theory, be extended to use additional data.

I followed the code to fixupper() which operates on a preallocated unicode
object and thus cannot cope with a string that expands while being
transformed. I didn't actually resolve the macros.

While we are at it, would it be viable to "abuse" the encoding/decoding
mechanism to do case conversions?

Peter
 

jallan

Peter Otten said:
Now that's a long post. I think it essentially boils down to the above
statement.

Looking into stringobject.c (judging from a first impression,
unicodeobject.c has essentially the same algorithm, but with a few
indirections):

static PyObject *
string_upper(PyStringObject *self)
{
    char *s = PyString_AS_STRING(self), *s_new;
    int i, n = PyString_GET_SIZE(self);
    PyObject *new;

    new = PyString_FromStringAndSize(NULL, n);
    if (new == NULL)
        return NULL;
    s_new = PyString_AsString(new);
    for (i = 0; i < n; i++) {
        int c = Py_CHARMASK(*s++);
        if (islower(c)) {
            *s_new = toupper(c);
        } else
            *s_new = c;
        s_new++;
    }
    return new;
}

The whole routine builds on the assumption that len(s) == len(s.upper()) and
nothing short of a complete rewrite will fix that. But if you volunteer...

I would love to if I had the time. Sigh! Maybe in some months.

Peter Otten said:
Personally, I think it's a long way to go for a little s, sharp as it may be :)

If it were just ß one could throw in a quick conversion of any ß to
ss at the beginning.

But there are over a hundred other characters that expand when
uppercased in http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt,
most of them Greek. Greek is a horror. See
http://www.tlg.uci.edu/~opoudjis/unicode/unicode_adscript.html for the
sad tale.

Unfortunately language and orthography are messy and inconsistent and
illogical and sometimes just silly. But handling orthography properly
involves dealing with these complex rules and subrules and exceptions
to rules rather than ignoring them.

Unicode gives us great power, but with great power comes great
responsibility and lots of niggling code. :-(

Fortunately only the Latin, Greek, Coptic, Cyrillic and Armenian
scripts have such a thing as casing and the Unicode people have
provided data files and algorithms that supposedly handle casing for
these languages acceptably.

From the Conformance requirements for Unicode at
http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf#G29484 (C20):

<< An implementation that purports to support the default casing
operations of case conversion, case detection, and caseless mapping
shall do so in accordance with the definitions and specifications in
Section 3.13, Default Case Operations. >>

This involves even more messy fussing about with context specification
for casing and with what values should be returned from a case
querying function, e.g. "A2" is true as both uppercase and titlecase
but not as lowercase. "3" is true as lowercase, uppercase and title
case.
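The case-detection rule described here amounts to "a string is in case X if converting it to case X leaves it unchanged". A rough sketch of that reading follows, with made-up function names. It skips the normalization step the standard applies first and uses `str.title()` as an approximation of the Unicode titlecase operation, so treat it as an illustration rather than a conformant implementation.

```python
def is_lowercase(s):
    """True if lowercasing s would not change it."""
    return s.lower() == s


def is_uppercase(s):
    """True if uppercasing s would not change it."""
    return s.upper() == s


def is_titlecase(s):
    """True if titlecasing s would not change it (approximated by str.title)."""
    return s.title() == s


for sample in ("A2", "3"):
    print(sample, is_lowercase(sample), is_uppercase(sample), is_titlecase(sample))
```

Note that Python's built-in `"3".isupper()` returns False, because the built-in predicates additionally require at least one cased character; that differs from the definition sketched above.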

Python or any application or language either does or doesn't conform.

I doubt that there is currently any application that can yet honestly
purport to support Unicode default casing operations of case
conversion, case detection and caseless mapping.

Jim Allan
 
Martin v. Löwis

Peter Otten said:
While we are at it, would it be viable to "abuse" the
encoding/decoding mechanism to do case conversions?

It might be viable, but I would consider it abuse: for one thing, I'm
not in favour of codecs which do Unicode->Unicode conversions - IMO, a
codec should convert between Unicode and byte strings. Furthermore, a
codec IMO should represent a proper "encoding", which case conversions
would not do.

Instead, it would be much better to provide such functions in a
library, e.g. by wrapping ICU. Then, case conversions should be
locale-dependent, instead of being general (as .upper currently is).
The locale-dependent way would best operate on explicit locale
objects, so you would spell

locale_object = load_locale("German", "Plattdeutsch")
up_string = locale_object.to_upper(lower_string)

In that case, the upper-case function would stop being a string
method, and be a locale method instead, taking a string argument.
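As a rough illustration of that design (and not of ICU or any existing Python API), a locale object could carry its own table of language-specific overrides and fall back to the general mapping otherwise. The class `SimpleLocale`, its constructor arguments, and the Turkish example are assumptions made for this sketch.

```python
class SimpleLocale:
    """Hypothetical locale object holding language-specific case overrides."""

    def __init__(self, language, upper_overrides=None):
        self.language = language
        # Maps single characters to their locale-specific uppercase form.
        self.upper_overrides = dict(upper_overrides or {})

    def to_upper(self, text):
        # Prefer the locale's own rule; otherwise use the general mapping.
        return "".join(
            self.upper_overrides.get(ch, ch.upper()) for ch in text
        )


# Turkish uppercases "i" to U+0130 (capital I with dot above).
turkish = SimpleLocale("tr", {"i": "\u0130"})
german = SimpleLocale("de")

print(turkish.to_upper("istanbul"))
print(german.to_upper("istanbul"))
```

Keeping the conversion on the locale object rather than on the string makes the language dependence explicit at the call site, which is the point of the proposal above.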

Regards,
Martin
 
