convert Unicode to lower/uppercase?

jallan · Sep 25, 2003

A: No. The UnicodeData.txt file includes all of the 1:1 case mappings,
but doesn't include 1:many mappings such as the one needed for
uppercasing ß. Since many parsers now expect this file to have at most
single characters in the case mapping fields, an additional file
(SpecialCasing.txt) was added to provide the 1:many mappings. For more
information, see UTR #21: Case Mappings [MD]
Python specifications make an implied claim of full support for
Unicode and an implied claim that the function upper() uppercases a
string properly.

This is a contradiction: SpecialCasing contains 1:n mappings, whereas
.upper() can only return a single result. So how do you think
SpecialCasing should be considered in the implementation of .upper()?

I am not aware that it is philosophically a *necessary* feature of
.upper() that a single character not be replaced by a string of two or
more characters.

One should fix the contradiction by either changing the behavior of
.upper() so that it will properly case all strings or documenting
clearly that .upper() does not handle particular kinds of casing. Of
course users often don't read the documentation. :-(
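
To make the 1:many case concrete, here is a minimal sketch, assuming a
current Python 3 interpreter rather than the Python of this thread. There,
str.upper() applies the full case mappings, so the result can be longer
than the input:

```python
# Assumes Python 3, where str.upper() uses full (1:many) case mappings.
s = "stra\u00dfe"   # "straße", with U+00DF LATIN SMALL LETTER SHARP S
u = s.upper()       # the sharp s expands to two characters

print(u)
print(len(s), len(u))
```

Code that assumes len(s) == len(s.upper()) will break on input like this.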

Things are more difficult than they appear to be.

Yes.

Again and again one thinks one has a solution for a problem and then
exceptions turn up.

Again and again one finds things that one's code doesn't handle, often
from failure to analyze fully in the initial stages and from adopting
algorithms that prove insufficient to handle the data found in
reality.

Jim Allan

Neil Hodgson · Sep 25, 2003

jallan:

Martin v. Löwis wrote

I am not aware that it is philosophically a *necessary* feature of
.upper() that a single character not be replaced by a string of two or
more characters.

That is not the issue. The issue is that .upper would have to return a
list or map of results (for an illustrative but incorrect example
"ca~non".upper() -> {'portugal':'CANON','spain':'CA~NON'}), which would be
difficult for the caller to make use of without performing some additional
work, finding the correct result for its locale. It is simpler for the
caller to provide a locale argument in the .upper call or in its context.

Neil

Neil Hodgson · Sep 25, 2003

Me:

for an illustrative but incorrect example
"ca~non".upper() -> {'portugal':'CANON','spain':'CA~NON'}),

For a real example from the Microsoft web site, uppercasing "indigo"
(u'\u0069\u006e\u0064\u0069\u0067\u006f') gives "INDIGO"
(u'\u0049\u004e\u0044\u0049\u0047\u004f') for English-US and similar but
with dots above the 'I's for Turkish:
(u'\u0130\u004e\u0044\u0130\u0047\u004f').

Neil

jallan · Sep 26, 2003

Neil Hodgson said:
Me:

For a real example from the Microsoft web site, uppercasing "indigo"
(u'\u0069\u006e\u0064\u0069\u0067\u006f') gives "INDIGO"
(u'\u0049\u004e\u0044\u0049\u0047\u004f') for English-US and similar but
with dots above the 'I's for Turkish:
(u'\u0130\u004e\u0044\u0130\u0047\u004f').

The file http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt
purportedly contains *all* casings for all scripts for all languages
where the casings are not one-to-one or are otherwise not
straightforward.

The *only* locale oddities there are for Lithuanian and the two
languages Turkish and Azeri and concern only dot/no-dot variants of
the letters _i_, _I_, _j_, _J_ and no others.

There are *no* other locale-based oddities. The mess is thankfully
*very* limited in scope.

In my opinion, if the full Unicode casing specification is to be
followed, the most useful solution would be a parameter allowing the
user to choose among (1) normal Latin casing, (2) Turkish/Azeri or (3)
Lithuanian as the casing model for treatment of these letters.

The default for the parameter would either be based on current locale
or be normal Latin casing. I think the latter far better as it is
dangerous to have functions in a language differ from machine to
machine according to the current locale.
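
A rough sketch of such a parameter, assuming Python 3. The function name
upper_with_model and the model names "latin" and "turkic" are made up for
illustration and are not part of any Python API. Only the Turkish/Azeri
uppercase rule is shown; the Lithuanian rules are left out of this sketch.

```python
def upper_with_model(text, model="latin"):
    """Uppercase text under a chosen casing model (hypothetical helper).

    model="latin"  : plain str.upper(), the same on every machine.
    model="turkic" : Turkish/Azeri rule, dotted i -> U+0130.
    """
    if model == "turkic":
        # Map dotted small i to capital I with dot above before the
        # generic uppercasing. Dotless small i (U+0131) already
        # uppercases to plain "I" under str.upper().
        text = text.replace("i", "\u0130")
    elif model != "latin":
        raise ValueError("unknown casing model: %r" % (model,))
    return text.upper()


english = upper_with_model("indigo")
turkish = upper_with_model("indigo", model="turkic")
```

The default ignores the machine's locale, so the same call gives the same
result everywhere unless the caller asks for another model.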

Also, in case someone brings it up, it was formerly standard to
generally omit diacritics on capital letters in Portuguese and in
French (in France but not in Quebec!)

This is no longer the norm for either language. See
http://www.academie-francaise.fr/langue/questions.html#accentuation
and http://www.press.uchicago.edu/Misc/Chicago/cmosfaq/cmosfaq.SpecialCharacters.html.

I have seen academic style sheets with a silly rule that diacritics
should be placed on capital letters as on lowercase letters except for
the word "A". See http://www.alphaacademic.co.uk/fcs.htm and
http://www.sagepub.com/journalManuscript.aspx?pid=9669&sc=1:

<< We use accents on capital letters, but capital A does not take a
grave accent. >>

It would not hurt to make a casing table customizable for such unusual
styles. But that is beyond Unicode's specifications.

A programmer who wishes odd customization beyond the norms of a
language and Unicode specifications can do it through transformations
outside of normal casing.
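
For instance, the style-sheet rule quoted above could be applied as a
separate step after ordinary uppercasing. This is only a sketch, assuming
Python 3; upper_house_style is a hypothetical name, and it handles only
the precomposed character U+00C0, not "A" followed by a combining grave.

```python
def upper_house_style(text):
    """Uppercase normally, then drop the grave accent from capital A."""
    upper = text.upper()                 # standard Unicode casing first
    return upper.replace("\u00c0", "A")  # house rule: no grave on capital A


phrase = upper_house_style("\u00e0 la carte")   # "à la carte"
word = upper_house_style("\u00e9t\u00e9")       # "été", accents kept
```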

Jim Allan
