convert Unicode to lower/uppercase?

Discussion in 'Python' started by Hallvard B Furuseth, Sep 19, 2003.

  1. Has someone got a Python routine or module which converts Unicode
    strings to lowercase (or uppercase)?

    What I actually need to do is to compare a number of strings in a
    case-insensitive manner, so I assume it's simplest to convert to
    lower/upper first.

    Possibly all strings will be from the latin-1 character set, so I could
    convert to 8-bit latin-1, map to lowercase, and convert back, but that
    seems rather cumbersome.
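
    A minimal sketch of the comparison being asked about, assuming a current Python 3 (which postdates this thread), where str.casefold() is meant for caseless matching. The helper name equal_ignoring_case is made up for illustration, and no Unicode normalization is applied.

```python
def equal_ignoring_case(a: str, b: str) -> bool:
    """Compare two strings caselessly via str.casefold() (Python 3.3+)."""
    # casefold() is more aggressive than lower(): it also folds
    # one-to-many cases such as the German sharp s.
    return a.casefold() == b.casefold()


print(equal_ignoring_case("ABCÄÖÜ", "abcäöü"))
print(equal_ignoring_case("Straße", "STRASSE"))
```

    Folding both operands first and then comparing avoids the round trip through latin-1 mentioned above.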

    --
    Hallvard
    Hallvard B Furuseth, Sep 19, 2003
    #1

  2. Peter Otten

    nospam wrote:

    > Has someone got a Python routine or module which converts Unicode
    > strings to lowercase (or uppercase)?


    Toiled and came up with:

    >>> print u"abcäöüß".upper()
    ABCÄÖÜß
    >>> u"ABCÄÖÜ".lower()
    u'abc\xe4\xf6\xfc'

    Peter
    Peter Otten, Sep 19, 2003
    #2

  3. Thanks!

    --
    Hallvard
    Hallvard B Furuseth, Sep 19, 2003
    #3
  4. jallan

    Peter Otten wrote:
    > nospam wrote:
    >
    > > Has someone got a Python routine or module which converts Unicode
    > > strings to lowercase (or uppercase)?

    >
    > Toiled and came up with:
    >
    > >>> print u"abcäöüß".upper()
    > ABCÄÖÜß
    >
    > >>> u"ABCÄÖÜ".lower()
    > u'abc\xe4\xf6\xfc'
    >
    > Peter


    But that really doesn't work properly. According to Unicode specs and
    German usage the uppercase of "ß" is actually "SS", that is the single
    character "ß" should uppercase to two characters.

    Jim Allan
    jallan, Sep 21, 2003
    #4
  5. jallan wrote:

    > But that really doesn't work properly. According to Unicode specs and
    > German usage the uppercase of "ß" is actually "SS", that is the single
    > character "ß" should uppercase to two characters.


    Can you cite exact chapter and verse of the Unicode specs that say so?
    According to the Unicode database,

    http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

    ß has neither an uppercase mapping nor a lowercase mapping.

    Also, in German, the uppercase mapping of ß is a matter of ongoing debate.
    For example, the Duden from 1919 says

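    | Für ß wird in großer Schrift SZ angewandt [...]. Die Verwendung
    | _zweier_ Buchstaben für _einen_ Laut ist nur ein Notbehelf, der
    | aufhören muß, sobald ein geeigneter Druckbuchstabe für das
    | große ß geschaffen ist.

    (In English: "In capitals, SZ is used for ß [...]. Using _two_
    letters for _one_ sound is only a stopgap, which must end as soon
    as a suitable printing type for the capital ß has been created.")

    The usage of SZ has only been eliminated in the recent reform of
    the official orthography (amtliche Rechtschreibung).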

    Regards,
    Martin
    Martin v. Löwis, Sep 21, 2003
    #5
  6. Asun Friere

    "Martin v. Löwis" wrote:
    > The usage of SZ has only been eliminated in the recent change of
    > the amtliche Rechtschreibung.
    >


    And replaced with what? ie. is there now a single capital for SZ?
    Asun Friere, Sep 22, 2003
    #6
  7. Asun Friere wrote:
    > "Martin v. Löwis" <> wrote in message news:<bkkusk$pvi$05$-online.com>...
    >>The usage of SZ has only been eliminated in the recent change of
    >>the amtliche Rechtschreibung.

    >
    > And replaced with what? ie. is there now a single capital for SZ?


    ß (sz) has not been completely eliminated. After *short* vowels it has
    been replaced with ss (Kuß => Kuss, Fluß => Fluss). But after *long*
    vowels, it is still used (Maß, Gruß, ...).

    -- Gerhard

    PS: I was quite disappointed with the reform of German orthography. I'd
    have favoured much more radical steps, like elimination of
    capitalization of the noun.
    Gerhard Häring, Sep 22, 2003
    #7
  8. Peter Otten

    "Martin v. Löwis" wrote:

    > jallan wrote:
    >
    >> But that really doesn't work properly. According to Unicode specs and
    >> German usage the uppercase of "ß" is actually "SS", that is the single
    >> character "ß" should uppercase to two characters.

    >
    > Can you cite exact chapter and verse of the Unicode specs that say so?
    > According to the Unicode database,
    >
    > http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
    >
    > has neither an uppercase mapping, nor a lowercase mapping.


    It seems like UnicodeData.txt does not give the full story. Quoting from
    http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt:

    [...]
    # (For compatibility, the UnicodeData.txt file only contains case mappings for
    # characters where they are 1-1, and does not have locale-specific mappings.)
    [...]
    # <code>; <lower> ; <title> ; <upper> ; (<condition_list> ;)? # <comment>
    [...]
    # The German es-zed is special--the normal mapping is to SS.
    # Note: the titlecase should never occur in practice. It is equal to
    # titlecase(uppercase(<es-zed>))

    00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
    [...]

    Thus, to comply with the standard, "ß".upper() --> "SS" is required.
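
    To make the field layout quoted above concrete, here is a small sketch in Python 3 that parses one SpecialCasing.txt entry. The function name parse_special_casing_line and the returned dictionary keys are assumptions for illustration; only the semicolon-separated layout comes from the file's own header.

```python
def parse_special_casing_line(line):
    """Parse one SpecialCasing.txt entry; return None for blank/comment lines."""
    data = line.split("#", 1)[0].strip()   # drop the trailing comment
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]
    code, lower, title, upper = fields[:4]
    # Whatever follows <upper> is the optional, space-separated condition list.
    conditions = [c for field in fields[4:] for c in field.split()]

    def to_text(hex_codes):
        # "0053 0053" -> "SS"
        return "".join(chr(int(h, 16)) for h in hex_codes.split())

    return {
        "char": to_text(code),
        "lower": to_text(lower),
        "title": to_text(title),
        "upper": to_text(upper),
        "conditions": conditions,
    }


entry = parse_special_casing_line(
    "00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S"
)
print(entry["char"], "->", entry["upper"])
```

    For the sharp s entry this yields a two-character uppercase string, which is exactly the case a one-to-one table cannot express.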

    > Also, in German, the uppercase mapping of ß is of ongoing debate.


    My personal impression is that, even before the orthography reform in 1998,
    the SZ variant was seldom used.
    For the "official" rule see http://www.ids-mannheim.de/reform/a2-3.html.

    Peter
    Peter Otten, Sep 22, 2003
    #8
  9. jallan

    Peter Otten wrote:
    > "Martin v. Löwis" wrote:
    >
    > > jallan wrote:
    > >
    > >> But that really doesn't work properly. According to Unicode specs and
    > >> German usage the uppercase of "ß" is actually "SS", that is the single
    > >> character "ß" should uppercase to two characters.

    > >
    > > Can you cite exact chapter and verse of the Unicode specs that say so?
    > > According to the Unicode database,
    > >
    > > http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
    > >
    > > has neither an uppercase mapping, nor a lowercase mapping.

    >
    > It seems like UnicodeData.txt does not give the full story. Quoting from
    > http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt:
    >
    > [...]
    > # (For compatibility, the UnicodeData.txt file only contains case mappings for
    > # characters where they are 1-1, and does not have locale-specific mappings.)
    > [...]
    > # <code>; <lower> ; <title> ; <upper> ; (<condition_list> ;)? # <comment>
    > [...]
    > # The German es-zed is special--the normal mapping is to SS.
    > # Note: the titlecase should never occur in practice. It is equal to
    > # titlecase(uppercase(<es-zed>))
    >
    > 00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
    > [...]
    >
    > Thus, to comply with the standard, "ß".upper() --> "SS" is required.


    Yes.

    Also the Unicode main charts in the annotation for 00DF state:

    uppercase is "SS"

    See http://www.unicode.org/charts/PDF/U0080.pdf

    This note on the character first appeared in Unicode 1.0 (published in
    1991) and has been in every revision.

    Unicode 1.0, Volume One also lists this in the lower case to upper
    case casing tables on page 453.

    There is nothing new about this casing requirement.

    A further mention occurs in the Unicode 4.0 specifications in Table
    4-1 in section 4.2 Case--Normative. See
    http://www.unicode.org/versions/Unicode4.0.0/ch04.pdf

    This contains the warning:

    << Only legacy implementations that cannot handle case mappings that
    increase string lengths should use UnicodeData case mappings alone. The
    single-character mappings are insufficient for languages such as
    German. >>

    So is Python just another shit legacy implementation?

    Jim Allan
    jallan, Sep 22, 2003
    #9
  10. Asun Friere writes:

    > > The usage of SZ has only been eliminated in the recent change of
    > > the amtliche Rechtschreibung.
    > >

    >
    > And replaced with what? ie. is there now a single capital for SZ?


    Unfortunately, I don't have a current Duden here, but I *think* you
    now have to write double-S. There is, of course, the old MASSE vs
    MASZE issue - I don't know whether this is considered relevant, as
    capitalization is rare, anyway, and ambiguities can be clarified from
    the context.

    Regards,
    Martin
    Martin v. Löwis, Sep 22, 2003
    #10
  11. Peter Otten writes:

    > # The German es-zed is special--the normal mapping is to SS.
    > # Note: the titlecase should never occur in practice. It is equal to
    > # titlecase(uppercase(<es-zed>))
    >
    > 00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
    > [...]
    >
    > Thus, to comply with the standard, "ß".upper() --> "SS" is required.


    No. It would be required if .upper would claim to implement
    SpecialCasing - but it makes no such claim.

    > My personal impression is that, even before the orthography reform in 1998,
    > the SZ variant was seldom used.


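    There is, of course, the famous "MASSE oder MASZE" example (Masse,
    "mass", versus Maße, "measures"), in particular in the form
    "WIR TRINKEN BIER IN MASSEN", which reads either as "in huge
    quantities" (in Massen) or as "in moderation" (in Maßen).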

    Regards,
    Martin
    Martin v. Löwis, Sep 22, 2003
    #11
  12. jallan writes:

    > So is Python just another shit legacy implementation?


    Yes :)

    Regards,
    Martin
    Martin v. Löwis, Sep 22, 2003
    #12
  13. Asun Friere

    Gerhard Häring wrote:

    > PS: I was quite disappointed with the reform of German orthography. I'd
    > have favoured much more radical steps, like elimination of
    > capitalization of the noun.


    As an English speaker, who occasionally finds himself trying to
    decipher German text, let me tell you that little flags like that
    --"pick me! I'm a noun!" --are actually quite useful.
    Asun Friere, Sep 23, 2003
    #13
  14. jallan

    Martin v. Löwis wrote:
    > Peter Otten <> writes:
    >
    > > # The German es-zed is special--the normal mapping is to SS.
    > > # Note: the titlecase should never occur in practice. It is equal to
    > > # titlecase(uppercase(<es-zed>))
    > >
    > > 00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
    > > [...]
    > >
    > > Thus, to comply with the standard, "ß".upper() --> "SS" is required.

    >
    > No. It would be required if .upper would claim to implement
    > SpecialCasing - but it makes no such claim.


    Of course not. From http://www.python.org/doc/current/lib/string-methods.html#l2h-203:

    <<
    *upper( )*
    Return a copy of the string converted to uppercase.
    >>


    This makes no claim about how the magic is done. But there is
    certainly an implied claim that it is done correctly.

    Unicode specifications are easily available at
    http://www.unicode.org/versions/Unicode4.0.0/.

    Section 3.13 states:

    << The full case mappings for Unicode characters are obtained by using
    the mappings from SpecialCasing.txt _plus_ the mappings from
    UnicodeData.txt, excluding any latter mappings that would conflict. >>

    Case mappings for Unicode require use of SpecialCasing otherwise the
    results are not in accord with the Unicode standard.
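
    The rule quoted from section 3.13 can be sketched as a two-table lookup in which the special mappings win. This is only an illustration in Python 3: the two dictionaries are tiny hand-written excerpts standing in for the real data files, and the name full_upper is hypothetical.

```python
# Illustrative excerpts only; the real tables are far larger.
SIMPLE_UPPER = {"a": "A", "\u00e4": "\u00c4", "s": "S"}   # 1:1, as in UnicodeData.txt
SPECIAL_UPPER = {"\u00df": "SS", "\ufb02": "FL"}           # 1:many, as in SpecialCasing.txt


def full_upper(text):
    """Uppercase text, letting SpecialCasing-style entries take precedence."""
    out = []
    for ch in text:
        if ch in SPECIAL_UPPER:
            out.append(SPECIAL_UPPER[ch])          # may add more than one character
        else:
            out.append(SIMPLE_UPPER.get(ch, ch))   # unknown characters pass through
    return "".join(out)


print(full_upper("a\u00e4\u00df"))
```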

    Section 4.2 says:

    << Only legacy implementations that cannot handle case mappings that
    increase string lengths should use UnicodeData case mappings alone.
    The single-character mappings are insufficient for languages such as
    German >>

    I don't see any particular reason why Python "cannot handle case
    mappings that increase string lengths".

    Unicode again warns that using UnicodeData.txt alone is not
    sufficient.

    The text continues on "SpecialCasing.txt":

    << Contains additional case mappings that map to more than one
    character, such as "ß" to "SS". >>

    Section 5.18 Case Mappings goes into further detail about casing
    issues and specifically mentions:

    << Case mappings may produce strings of different length than the
    original. For example the German character U+00DF ß LATIN SMALL LETTER
    SHARP S expands when uppercased to the sequence of two characters "SS".
    This also occurs where there is no precomposed character corresponding
    to a case mapping, such as with U+0149 ŉ LATIN SMALL LETTER N
    PRECEDED BY APOSTROPHE. >>
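
    For comparison, a recent Python 3 (3.3 or later, as far as I know; not the interpreter discussed in this thread) applies these one-to-many mappings in str.upper(). A short sketch that reports the length change for the characters mentioned here; the helper name expansions is made up.

```python
def expansions(chars):
    """Return (character, uppercase form, length of uppercase form) triples."""
    return [(ch, ch.upper(), len(ch.upper())) for ch in chars]


# U+00DF sharp s, U+0149 n preceded by apostrophe, U+FB02 fl ligature
for ch, upper, size in expansions("\u00df\u0149\ufb02"):
    print(f"U+{ord(ch):04X} -> {upper!r} (1 character -> {size})")
```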

    See also http://www.unicode.org/faq/casemap_charprop-old.html for the
    Unicode FAQ which contains:

    <<
    Q: Why is there no upper-case SHARP S (ß)?

    A: There are 139 lower-case letters in Unicode 2.1 that have no direct
    uppercase equivalent. Should there be introduced new bogus characters
    for all of them, so that when you see an "fl" ligature you can
    uppercase it to "FL" without expanding anything? Of course not.

    Note that case conversion is inherently language-sensitive, notably in
    the case of IPA, which needs to be left strictly alone even when
    embedded in another language which is being case converted. The best
    you can get is an approximate fit. [JC]

    Q: Is all of the Unicode case mapping information in UnicodeData.txt?

    A: No. The UnicodeData.txt file includes all of the 1:1 case mappings,
    but doesn't include 1:many mappings such as the one needed for
    uppercasing ß. Since many parsers now expect this file to have at most
    single characters in the case mapping fields, an additional file
    (SpecialCasing.txt) was added to provide the 1:many mappings. For more
    information, see UTR #21- Case Mappings [MD]
    >>


    Python specifications make an implied claim of full support for
    Unicode and an implied claim that the function upper() uppercases a
    string properly.

    The implied combined claim is that Python supports Unicode and
    supports proper casing in Unicode.

    This implied claim is false.

    Truly accurate documentation for upper() should say that it uppercases
    a string except for characters whose uppercase form would expand to
    more than one character; such a character is either left unchanged or
    uppercased with loss of data.

    Python specifications need not say how casing is done, whether by
    using Unicode tables directly or by using its own methods that
    accomplish the same results.

    Users should not have to know such details. They may wish to know
    where a particular function does not do what might be expected of it.

    Jim Allan
    jallan, Sep 23, 2003
    #14
  15. Peter Otten

    jallan wrote:

    > I don't see any particular reason why Python "cannot handle case
    > mappings that increase string lengths".


    Now that's a long post. I think it essentially boils down to the above
    statement.

    Looking into stringobject.c (judging from a first impression,
    unicodeobject.c has essentially the same algorithm, but with a few
    indirections):

    static PyObject *
    string_upper(PyStringObject *self)
    {
        char *s = PyString_AS_STRING(self), *s_new;
        int i, n = PyString_GET_SIZE(self);
        PyObject *new;

        new = PyString_FromStringAndSize(NULL, n);
        if (new == NULL)
            return NULL;
        s_new = PyString_AsString(new);
        for (i = 0; i < n; i++) {
            int c = Py_CHARMASK(*s++);
            if (islower(c)) {
                *s_new = toupper(c);
            } else
                *s_new = c;
            s_new++;
        }
        return new;
    }

    The whole routine builds on the assumption that len(s) == len(s.upper()) and
    nothing short of a complete rewrite will fix that. But if you volunteer...
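
    One way such a rewrite could drop the len(s) == len(s.upper()) assumption is a two-pass scheme: measure the result first, then fill a buffer of that size. The Python 3 sketch below only illustrates the idea and is not how CPython does it; SPECIAL_UPPER is a tiny hypothetical excerpt and upper_one only roughly imitates a one-to-one lookup.

```python
# Illustrative subset of the one-to-many uppercase mappings (hypothetical table).
SPECIAL_UPPER = {"\u00df": "SS", "\ufb02": "FL"}


def upper_one(ch):
    """Uppercase a single character, possibly yielding several characters."""
    if ch in SPECIAL_UPPER:
        return SPECIAL_UPPER[ch]
    up = ch.upper()
    # Keep only one-to-one results here, roughly like a UnicodeData-only lookup.
    return up if len(up) == 1 else ch


def two_pass_upper(s):
    # Pass 1: work out how long the result will be.
    size = sum(len(upper_one(ch)) for ch in s)
    out = [""] * size          # stands in for the preallocated C buffer
    pos = 0
    # Pass 2: fill the buffer, advancing by however many characters each input yields.
    for ch in s:
        for piece in upper_one(ch):
            out[pos] = piece
            pos += 1
    return "".join(out)


print(two_pass_upper("Ma\u00df"))
```

    The extra pass costs a second scan of the input but keeps a single allocation, which is the property the C routine above relies on.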

    Personally, I think it's a long way to go for a little s, sharp as it may be
    :)

    Peter
    Peter Otten, Sep 23, 2003
    #15
  16. jallan writes:

    > A: No. The UnicodeData.txt file includes all of the 1:1 case mappings,
    > but doesn't include 1:many mappings such as the one needed for
    > uppercasing ß. Since many parsers now expect this file to have at most
    > single characters in the case mapping fields, an additional file
    > (SpecialCasing.txt) was added to provide the 1:many mappings. For more
    > information, see UTR #21- Case Mappings [MD]
    > >>

    >
    > Python specifications make an implied claim of full support for
    > Unicode and an implied claim that the function upper() uppercases a
    > string properly.


    This is a contradiction: SpecialCasing contains 1:n mappings, whereas
    .upper() can only return a single result. So how do you think
    SpecialCasing should be considered in the implementation of .upper()?

    > Users should not have to know such details. They may wish to know
    > where a particular function does not do what might be expected of it.


    Things are more difficult than they appear to be.

    Regards,
    Martin
    Martin v. Löwis, Sep 23, 2003
    #16
  17. Peter Otten writes:

    > Looking into stringobject.c (judging from a first impression,
    > unicodeobject.c has essentially the same algorithm, but with a few
    > indirections):


    You are mistaken. The implementation in unicodeobject.c is
    fundamentally different. The byte string implementation uses the C
    library, the Unicode implementation uses the Unicode character
    database. So the former cannot be changed, whereas the latter could,
    in theory, be extended to use additional data.

    Regards,
    Martin
    Martin v. Löwis, Sep 23, 2003
    #17
  18. Peter Otten

    Martin v. Löwis wrote:

    > Peter Otten writes:
    >
    >> Looking into stringobject.c (judging from a first impression,
    >> unicodeobject.c has essentially the same algorithm, but with a few
    >> indirections):

    >
    > You are mistaken. The implementation in unicodeobject.c is
    > fundamentally different. The byte string implementation uses the C
    > library, the Unicode implementation uses the Unicode character
    > database. So the former cannot be changed, whereas the latter could,
    > in theory, be extended to use additional data.


    I followed the code to fixupper() which operates on a preallocated unicode
    object and thus cannot cope with a string that expands while being
    transformed. I didn't actually resolve the macros.

    While we are at it, would it be viable to "abuse" the encoding/decoding
    mechanism to do case conversions?

    Peter
    Peter Otten, Sep 23, 2003
    #18
  19. jallan

    Peter Otten wrote:
    > jallan wrote:
    >
    > > I don't see any particular reason why Python "cannot handle case
    > > mappings that increase string lengths".

    >
    > Now that's a long post. I think it essentially boils down to the above
    > statement.
    >
    > Looking into stringobject.c (judging from a first impression,
    > unicodeobject.c has essentially the same algorithm, but with a few
    > indirections):
    >
    > static PyObject *
    > string_upper(PyStringObject *self)
    > {
    >     char *s = PyString_AS_STRING(self), *s_new;
    >     int i, n = PyString_GET_SIZE(self);
    >     PyObject *new;
    >
    >     new = PyString_FromStringAndSize(NULL, n);
    >     if (new == NULL)
    >         return NULL;
    >     s_new = PyString_AsString(new);
    >     for (i = 0; i < n; i++) {
    >         int c = Py_CHARMASK(*s++);
    >         if (islower(c)) {
    >             *s_new = toupper(c);
    >         } else
    >             *s_new = c;
    >         s_new++;
    >     }
    >     return new;
    > }
    >
    > The whole routine builds on the assumption that len(s) == len(s.upper()) and
    > nothing short of a complete rewrite will fix that. But if you volunteer...


    I would love to if I had the time. Sigh! Maybe in some months.

    > Personally, I think it's a long way to go for a little s, sharp as it may be
    > :)


    If it were just ß one could throw in a quick conversion of any ß to
    ss at the beginning.

    But there are over a hundred other characters that expand when
    uppercased in http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt,
    most of them Greek. Greek is a horror. See
    http://www.tlg.uci.edu/~opoudjis/unicode/unicode_adscript.html for the
    sad tale.

    Unfortunately language and orthography are messy and inconsistent and
    illogical and sometimes just silly. But handling orthography properly
    involves dealing with these complex rules and subrules and exceptions
    to rules rather than ignoring them.

    Unicode gives us great power, but with great power comes great
    responsibility and lots of niggling code. :-(

    Fortunately only the Latin, Greek, Coptic, Cyrillic and Armenian
    scripts have such a thing as casing and the Unicode people have
    provided data files and algorithms that supposedly handle casing for
    these languages acceptably.

    From the Conformance requirements for Unicode at
    http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf#G29484 (C20):

    << An implementation that purports to support the default casing
    operations of case conversion, case detection, and caseless mapping
    shall do so in accordance with the definitions and specifications in
    Section 3.13, Default Case Operations. >>

    This involves even more messy fussing about with context specification
    for casing and with what values should be returned from a case
    querying function, e.g. "A2" is true as both uppercase and titlecase
    but not as lowercase. "3" is true as lowercase, uppercase and
    titlecase.
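
    The case-detection rule described here (a string counts as uppercase when uppercasing leaves it unchanged, and likewise for the other two cases) can be sketched in Python 3 as follows. The three function names are made up, the normalization step of the Unicode definitions is left out, and str.title() is only an approximation of Unicode titlecasing, so treat this as an illustration.

```python
def is_lowercase(s):
    return s.lower() == s


def is_uppercase(s):
    return s.upper() == s


def is_titlecase(s):
    return s.title() == s


for sample in ("A2", "3"):
    print(sample, is_lowercase(sample), is_uppercase(sample), is_titlecase(sample))

# Python's built-in predicates answer a different question: they also
# require at least one cased character, so they reject a bare digit.
print("3".isupper(), "3".islower())
```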

    Python or any application or language either does or doesn't conform.

    I doubt that there is currently any application that can yet honestly
    purport to support Unicode default casing operations of case
    conversion, case detection and caseless mapping.

    Jim Allan
    jallan, Sep 24, 2003
    #19
  20. Peter Otten writes:

    > While we are at it, would it be viable to "abuse" the
    > encoding/decoding mechanism to do case conversions?


    It might be viable, but I would consider it abuse: for one thing, I'm
    not in favour of codecs which do Unicode->Unicode conversions - IMO, a
    codec should convert between Unicode and byte strings. Furthermore, a
    codec IMO should represent a proper "encoding", which case conversions
    would not do.

    Instead, it would be much better to provide such functions in a
    library, e.g. by wrapping ICU. Then, case conversions should be done
    locale-dependent, instead of being general (as .upper currently is).
    The locale-dependent way would best operate on explicit locale
    objects, so you would spell

    locale_object = load_locale("German", "Plattdeutsch")
    up_string = locale_object.to_upper(lower_string)

    In that case, the upper-case function would stop being a string
    method, and be a locale method instead, taking a string argument.
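
    A rough Python 3 sketch of what such a locale object might look like. Everything here is hypothetical: CaseLocale, the registry, and this one-argument load_locale are invented for illustration and wrap no real library such as ICU. The Turkish dotted/dotless i is used as the example because its uppercase form differs from the general mapping.

```python
class CaseLocale:
    """Hypothetical locale object carrying per-language uppercase overrides."""

    def __init__(self, name, upper_overrides=None):
        self.name = name
        self.upper_overrides = dict(upper_overrides or {})

    def to_upper(self, text):
        # Locale-specific entries win; everything else uses the general mapping.
        return "".join(self.upper_overrides.get(ch, ch.upper()) for ch in text)


# Tiny illustrative registry: Turkish uppercases "i" to dotted capital I (U+0130).
_LOCALES = {
    "tr": CaseLocale("tr", {"i": "\u0130"}),
    "de": CaseLocale("de"),
}


def load_locale(language):
    """Look up a locale by language tag (hypothetical API)."""
    return _LOCALES[language]


print(load_locale("tr").to_upper("izmir"))
print(load_locale("de").to_upper("izmir"))
```

    Keeping the mapping on the locale object rather than on the string type means the same string can be uppercased differently depending on which locale is passed in.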

    Regards,
    Martin
    Martin v. Löwis, Sep 24, 2003
    #20