Is unicode.lower() locale-independent?

Robert Kern

The section on "String Methods"[1] in the Python documentation states that for
the case conversion methods like str.lower(), "For 8-bit strings, this method is
locale-dependent." Is there a guarantee that unicode.lower() is
locale-*in*dependent?

The section on "Case Conversion" in PEP 100[2] suggests this, but the code itself
looks like it may call the C function towlower() if it is available. On OS X
Leopard, the manpage for towlower(3) states that it "uses the current locale"
though it doesn't say exactly *how* it uses it.

This is the bug I'm trying to fix:

http://scipy.org/scipy/numpy/ticket/643
http://dev.laptop.org/ticket/5559

[1] http://docs.python.org/lib/string-methods.html
[2] http://www.python.org/dev/peps/pep-0100/

Thanks.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 
John Machin

The Unicode standard says that case mappings are language-dependent.
It gives the example of the Turkish dotted capital letter I and
dotless small letter i that "caused" the numpy problem. See
http://www.unicode.org/versions/Unicode4.0.0/ch05.pdf#G21180

Here is what the Python 2.5.1 unicode implementation does in an
English-language locale:

>>> import unicodedata as ucd
>>> eyes = u"Ii\u0130\u0131"  # name assumed; the original setup lines were lost
>>> for eye in eyes:
...     print repr(eye), ucd.name(eye)
...
u'I' LATIN CAPITAL LETTER I
u'i' LATIN SMALL LETTER I
u'\u0130' LATIN CAPITAL LETTER I WITH DOT ABOVE
u'\u0131' LATIN SMALL LETTER DOTLESS I
>>> for eye in eyes:
...     print "%r %r %r %r" % (eye, eye.upper(), eye.lower(),
...                            eye.capitalize())
...
u'I' u'I' u'i' u'I'
u'i' u'I' u'i' u'I'
u'\u0130' u'\u0130' u'i' u'\u0130'
u'\u0131' u'I' u'\u0131' u'I'

The conversions for I and i are not correct for a Turkish locale.

I don't know how to repeat the above in a Turkish locale.

However it appears from your bug ticket that you have a much narrower
problem (case-shifting a small known list of English words like VOID)
and can work around it by writing your own locale-independent casing
functions. Do you still need to find out whether Python unicode
casings are locale-dependent?
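
A minimal sketch of what such a locale-independent casing function could look like. It assumes Python 3 (the thread itself is about 2.x), and the names ascii_lower and ascii_upper are hypothetical, not part of numpy or the standard library. Only the 26 ASCII letters are mapped, so neither the locale nor any language-specific rule can change the result.

```python
import string

# Translation tables that cover only the 26 ASCII letters.
_TO_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)
_TO_UPPER = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)


def ascii_lower(text):
    """Lowercase A-Z only; every other character is left unchanged."""
    return text.translate(_TO_LOWER)


def ascii_upper(text):
    """Uppercase a-z only; every other character is left unchanged."""
    return text.translate(_TO_UPPER)


print(ascii_lower("VOID"), ascii_upper("int"))
```

Because the dotted and dotless Turkish letters are not in the tables, they pass through untouched, which is probably what one wants for a fixed list of English keywords.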

Cheers,
John
 
Robert Kern

John said:
> The Unicode standard says that case mappings are language-dependent.
> It gives the example of the Turkish dotted capital letter I and
> dotless small letter i that "caused" the numpy problem. See
> http://www.unicode.org/versions/Unicode4.0.0/ch05.pdf#G21180

That doesn't determine the behavior of unicode.lower(), I don't think. That
specifies semantics for when one is dealing with a given language in the
abstract. That doesn't specify concrete behavior with respect to a given locale
setting on a real computer. For example, my strings 'VOID', 'INT', etc. are all
English, and I want English case behavior. The language of the data and the
transformations I want to apply to the data is English even though the user may
have set the locale to something else.

> Here is what the Python 2.5.1 unicode implementation does in an
> English-language locale:
>
> >>> import unicodedata as ucd
> >>> eyes = u"Ii\u0130\u0131"  # name assumed; the original setup lines were lost
> >>> for eye in eyes:
> ...     print repr(eye), ucd.name(eye)
> ...
> u'I' LATIN CAPITAL LETTER I
> u'i' LATIN SMALL LETTER I
> u'\u0130' LATIN CAPITAL LETTER I WITH DOT ABOVE
> u'\u0131' LATIN SMALL LETTER DOTLESS I
> >>> for eye in eyes:
> ...     print "%r %r %r %r" % (eye, eye.upper(), eye.lower(),
> ...                            eye.capitalize())
> ...
> u'I' u'I' u'i' u'I'
> u'i' u'I' u'i' u'I'
> u'\u0130' u'\u0130' u'i' u'\u0130'
> u'\u0131' u'I' u'\u0131' u'I'
>
> The conversions for I and i are not correct for a Turkish locale.
>
> I don't know how to repeat the above in a Turkish locale.

If you have the correct locale data in your operating system, this should be
sufficient, I believe:

$ LANG=tr_TR python
Python 2.4.3 (#1, Mar 14 2007, 19:01:42)
[GCC 4.1.1 20070105 (Red Hat 4.1.1-52)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
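
The same experiment can also be tried from inside a running interpreter. The following is only a sketch under stated assumptions: it is written for Python 3, the helper name lower_in_locale is hypothetical, and the locale name "tr_TR.UTF-8" may not be installed on a given system, in which case the helper returns None instead of a result.

```python
import locale


def lower_in_locale(text, locale_name):
    """Return text.lower() with locale_name active, or None if unavailable."""
    saved = locale.setlocale(locale.LC_ALL)  # query the current setting
    try:
        locale.setlocale(locale.LC_ALL, locale_name)
    except locale.Error:
        return None  # locale data not installed on this system
    try:
        return text.lower()
    finally:
        locale.setlocale(locale.LC_ALL, saved)  # always restore the old locale


for name in ("C", "tr_TR.UTF-8"):
    print(name, repr(lower_in_locale("VOID I", name)))
```

If the two printed results are identical, the lowercase mapping did not depend on the locale for that interpreter build. The 8-bit str type of Python 2 would have to be tested separately.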

> However it appears from your bug ticket that you have a much narrower
> problem (case-shifting a small known list of English words like VOID)
> and can work around it by writing your own locale-independent casing
> functions. Do you still need to find out whether Python unicode
> casings are locale-dependent?

I would still like to know. There are other places where .lower() is used in
numpy, not to mention the rest of my code.

--
Robert Kern
 
Torsten Bronger

Hi there!

Fredrik said:
> "lower" uses the informative case mappings provided by the Unicode
> character database; see
>
> http://www.unicode.org/Public/4.1.0/ucd/UCD.html
>
> afaik, changing the locale has no influence whatsoever on Python's
> Unicode subsystem.

Slightly off-topic because it's not part of the Unicode subsystem,
but I was once irritated that the non-breaking space (code point 0xA0,
I think) was included in string.whitespace. I cannot reproduce it
on my current system anymore, but I am pretty sure it occurred with
a fr_FR.UTF-8 locale. Is this possible? And who is to blame, or
must my program cope with such things?

Bye,
Torsten.
 
John Machin

> "lower" uses the informative case mappings provided by the Unicode
> character database; see
>
> http://www.unicode.org/Public/4.1.0/ucd/UCD.html

of which the relevant part is
"""
Case Mappings

There are a number of complications to case mappings that occur once
the repertoire of characters is expanded beyond ASCII. For more
information, see Chapter 3 in Unicode 4.0.

For compatibility with existing parsers, UnicodeData.txt only contains
case mappings for characters where they are one-to-one mappings; it
also omits information about context-sensitive case mappings.
Information about these special cases can be found in a separate data
file, SpecialCasing.txt.
"""

It seems that Python doesn't use the SpecialCasing.txt file. Effects
include:
(a) one-to-many mappings don't happen e.g. LATIN SMALL LETTER SHARP S:
u'\xdf'.upper() produces u'\xdf' instead of u'SS'
(b) language-sensitive mappings (e.g. dotted/dotless I/i for Turkish
(and Azeri)) don't happen
(c) context-sensitive mappings don't happen e.g. lower case of GREEK
CAPITAL LETTER SIGMA depends on whether it is the last letter in a
word.
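
For comparison, the three cases can be checked with a few lines. This is a sketch under an explicit assumption: it targets Python 3.3 or later, not the 2.x interpreters discussed in this thread. To my knowledge those later versions apply the unconditional and final-sigma rules, but still no language-sensitive ones.

```python
SHARP_S = "\u00df"        # LATIN SMALL LETTER SHARP S
DOTTED_I = "\u0130"       # LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_I = "\u0131"      # LATIN SMALL LETTER DOTLESS I
ALPHA_SIGMA = "\u0391\u03a3"  # GREEK CAPITAL ALPHA followed by CAPITAL SIGMA

# (a) one-to-many mapping
print(ascii(SHARP_S.upper()))

# (b) language-sensitive mappings: plain I/i ignore Turkish rules
print(ascii("I".lower()), ascii(DOTTED_I.lower()), ascii(DOTLESS_I.upper()))

# (c) context-sensitive mapping: sigma at the end of a word
print(ascii(ALPHA_SIGMA.lower()), ascii("\u03a3".lower()))
```

Running it on a 2.x interpreter would need u'' literals and repr() instead of ascii(), and should reproduce the behaviour John describes.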
 
John Machin

Torsten said:
> Slightly off-topic because it's not part of the Unicode subsystem,
> but I was once irritated that the non-breaking space (code point 0xA0,
> I think) was included in string.whitespace. I cannot reproduce it
> on my current system anymore, but I am pretty sure it occurred with
> a fr_FR.UTF-8 locale. Is this possible? And who is to blame, or
> must my program cope with such things?

The NO-BREAK SPACE is treated as whitespace in the Python unicode
subsystem. As for str objects, the default "C" locale doesn't know it
exists; otherwise AFAIK if the character set for the locale has it, it
will be treated as whitespace.

You were irritated because non-break SPACE was included in
string.whiteSPACE? Surely not! It seems eminently logical to me.
Perhaps you were irritated because str.split() ignored the "no-break"?
If like me you had been faced with removing trailing spaces from text
columns in databases, you surely would have been delighted that
str.rstrip() removed the trailing-padding-for-nicer-layout no-break
spaces that the users had copy/pasted from some clown's website :)

What was the *real* cause of your irritation?
 
Robert Kern

Fredrik said:
> "lower" uses the informative case mappings provided by the Unicode
> character database; see
>
> http://www.unicode.org/Public/4.1.0/ucd/UCD.html
>
> afaik, changing the locale has no influence whatsoever on Python's
> Unicode subsystem.

Even if towlower() gets used? I've found an explicit statement that the
conversion it does can be locale-specific:

http://msdn2.microsoft.com/en-us/library/8h19t214.aspx

Thanks, Fredrik.

--
Robert Kern
 
Carl Banks

> The NO-BREAK SPACE is treated as whitespace in the Python unicode
> subsystem. As for str objects, the default "C" locale doesn't know it
> exists; otherwise AFAIK if the character set for the locale has it, it
> will be treated as whitespace.
>
> You were irritated because non-break SPACE was included in
> string.whiteSPACE? Surely not! It seems eminently logical to me.

To me it seems the point of a non-breaking space is to have something
that's printed as whitespace but not treated as it.

> Perhaps you were irritated because str.split() ignored the "no-break"?
> If like me you had been faced with removing trailing spaces from text
> columns in databases, you surely would have been delighted that
> str.rstrip() removed the trailing-padding-for-nicer-layout no-break
> spaces that the users had copy/pasted from some clown's website :)
>
> What was the *real* cause of your irritation?

If you want to use str.split() to split words, you will foil the user who
wants to not break at a certain point.

Your use of rstrip() is a lot more specialized, if you ask me.


Carl Banks
 
Martin v. Löwis

> The Unicode standard says that case mappings are language-dependent.

I think you are misreading it. 5.18 "Implementation Guides" says
(talking about "most environments") "In such cases, the
language-specific mappings *must not* be used." (emphasis also
in the original spec).

Regards,
Martin
 
Torsten Bronger

Hi there!

John said:
> [...]
>
>> Slightly off-topic because it's not part of the Unicode
>> subsystem, but I was once irritated that the non-breaking space
>> (code point 0xA0, I think) was included in string.whitespace. I
>> cannot reproduce it on my current system anymore, but I am
>> pretty sure it occurred with a fr_FR.UTF-8 locale. Is this
>> possible? And who is to blame, or must my program cope with such
>> things?
>
> The NO-BREAK SPACE is treated as whitespace in the Python unicode
> subsystem. As for str objects, the default "C" locale doesn't know
> it exists; otherwise AFAIK if the character set for the locale has
> it, it will be treated as whitespace.
>
> [...]
>
> What was the *real* cause of your irritation?

I was missing something like string.ascii_whitespace in the string
module. There is string.ascii_lowercase after all, and the
documentation doesn't clearly say string.whitespace is
locale-dependent.

In contrast to lower/uppercase conversions, where often human
language is transformed, the use cases for whitespace handling are
mostly syntactic purposes. And parsing something with
locale-dependent whitespace definitions is broken.

Thus, I had the choice: defining my own whitespace constant, or
forcing the 'C' locale. I chose the latter because I'm not a big
fan of locales anyway.
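
The first option, defining one's own whitespace constant, could look roughly like this. It is only a sketch for Python 3; ASCII_WHITESPACE, ascii_split and ascii_strip are hypothetical names, and the six characters are the ones the "C" locale treats as whitespace.

```python
import re

# Space, tab, newline, carriage return, vertical tab, form feed.
ASCII_WHITESPACE = " \t\n\r\x0b\x0c"
_ASCII_WS_RUN = re.compile(r"[ \t\n\r\x0b\x0c]+")


def ascii_split(text):
    """Split on runs of ASCII whitespace only; NO-BREAK SPACE stays in tokens."""
    return [token for token in _ASCII_WS_RUN.split(text) if token]


def ascii_strip(text):
    """Strip only ASCII whitespace from both ends."""
    return text.strip(ASCII_WHITESPACE)


print(ascii_split("key\u00a0name = value\n"))
```

A parser built on these helpers gives the same tokens regardless of locale, at the cost of not treating other Unicode spaces as separators.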

On my current computer(s), all locales seem to have the same
definition of whitespace as the 'C' locale. I've only seen that one
(broken, in my opinion) French locale which included the NBSP. In
my opinion, this is a trap rather than anything useful. Well, if I
indeed remember it correctly; this is why I asked above, "Is it
possible?".

Bye,
Torsten.
 
John Machin

> I think you are misreading it.

Ummm well, it does say "normative" as opposed to Fredrik's
"informative" ...

> 5.18 "Implementation Guides" says
> (talking about "most environments") "In such cases, the
> language-specific mappings *must not* be used." (emphasis also
> in the original spec).

Here is the paragraph from which you quote:
"""
In most environments, such as in file systems, text is not and cannot
be tagged with language information. In such cases, the language-
specific mappings /must not/ be used. Otherwise, data structures such
as B-trees might be built based on one set of case foldings and used
based on a different set of case foldings. This discrepancy would
cause those data structures to become corrupt. For such environments,
a constant, language-independent, default case folding is required.
"""
This is from the middle of a section titled "Caseless Matching"; this
section starts:
"""
Caseless matching is implemented using case folding, which is the
process of mapping strings to a canonical form where case differences
are erased. Case folding allows for fast
caseless matches in lookups because only binary comparison is
required. It is more than just conversion to lowercase. For example,
it correctly handles cases such as the Greek sigma, so that
<scrambled_in_transmission1> and <scrambled_in_transmission2> will
match.
"""

Python doesn't offer a foldedcase method, and the attitude of 99% of
users would be YAGNI; use this:
foldedcase = lambda x: x.lower()
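
For readers on a newer interpreter: assuming Python 3.3 or later (not the versions discussed here), str.casefold() provides a default, language-independent case folding. The sketch below uses a hypothetical helper name, caseless_equal, to show where folding and lowercasing differ.

```python
def caseless_equal(a, b):
    """Compare two strings using the default (language-independent) case folding."""
    return a.casefold() == b.casefold()


FINAL_SIGMA = "\u03c2"    # GREEK SMALL LETTER FINAL SIGMA
CAPITAL_SIGMA = "\u03a3"  # GREEK CAPITAL LETTER SIGMA

# Folding erases the difference between the two lowercase sigmas ...
print(caseless_equal(FINAL_SIGMA, CAPITAL_SIGMA))
# ... while plain lowercasing keeps them distinct.
print(FINAL_SIGMA.lower() == CAPITAL_SIGMA.lower())
# Sharp s folds to "ss", so these two spellings match as well.
print(caseless_equal("stra\u00dfe", "STRASSE"))
```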

What the paragraph you quoted seems to be warning about is that people
who do implement a fully-principled foldedcase using the Unicode
CaseFolding.txt file should be careful about offering foldedcaseTurkic
and foldedcaseLithuanianDictionary -- both dangerous and YAGNI**2.

This topic seems to be quite different to the topic of whether the
results of unicode.lower does/should depend on the locale or not.
 
John Machin

> To me it seems the point of a non-breaking space is to have something
> that's printed as whitespace but not treated as it.

To me it seems the point of a no-break space is that it's treated as a
space in all respects except that it doesn't "break".

> If you want to use str.split() to split words, you will foil the user who
> wants to not break at a certain point.

Which was exactly my point -- but this would happen only rarely or not
at all in my universe (names, addresses, product descriptions, etc in
databases).

> Your use of rstrip() is a lot more specialized, if you ask me.

Not very specialised at all in my universe -- a standard
transformation that one normally applies to database text is to remove
all leading and trailing whitespace, and compress runs of 1 or more
whitespace characters to a single normal space. Your comment seems to
imply that trailing non-break spaces are significant and should be
preserved ...
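
The standard transformation described above fits in one line. A minimal sketch, assuming Python 3, where str.split() with no argument treats every Unicode whitespace character (including NO-BREAK SPACE) as a separator; normalize_whitespace is a hypothetical name.

```python
def normalize_whitespace(text):
    """Strip leading/trailing whitespace and collapse inner runs to one space."""
    # split() with no argument drops empty fields, so the ends are trimmed too.
    return " ".join(text.split())


print(repr(normalize_whitespace("  Smith\u00a0\u00a0& Sons \t Ltd\u00a0")))
```

Anyone who needs to keep no-break spaces intact would have to split on an explicit character set instead.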
 
Robert Kern

Martin said:
> Right. However, the build option of Python where that's the case is
> deprecated.

Excellent. Thank you.

--
Robert Kern
 
