Sorting strings containing special characters (German 'Umlaute')

Discussion in 'Python' started by DierkErdmann@mail.com, Mar 2, 2007.

  1. Guest

    Hi !

    I know that this topic has been discussed in the past, but I could not
    find a working solution for my problem: sorting (lists of) strings
    containing special characters like "ä", "ü", ... (German umlauts).
    Consider the following list:
    l = ["Aber", "Beere", "Ärger"]

    For sorting, the letter "Ä" is supposed to be treated like "Ae",
    therefore sorting this list should yield
    l = ["Aber", "Ärger", "Beere"]

    I know about the locale module and its function strcoll(string1,
    string2), but currently this does not work correctly for me. Consider
    >>> locale.strcoll("Ärger", "Beere")
    1

    Therefore "Ärger" is sorted after "Beere", which is not correct IMO.
    Can someone help?

    Btw: I'm using WinXP (German) and
    >>> locale.getdefaultlocale()
    prints
    ('de_DE', 'cp1252')

    TIA.

    Dierk
     
    , Mar 2, 2007
    #1

  2. Robin Becker Guest

    wrote:
    > Hi !
    >
    > I know that this topic has been discussed in the past, but I could not
    > find a working solution for my problem: sorting (lists of) strings
    > containing special characters like "ä", "ü",... (german umlaute).
    > Consider the following list:
    > l = ["Aber", "Beere", "Ärger"]
    >
    > For sorting the letter "Ä" is supposed to be treated like "Ae",
    > therefore sorting this list should yield
    > l = ["Aber", "Ärger", "Beere"]
    >
    > I know about the module locale and its method strcoll(string1,
    > string2), but currently this does not work correctly for me. Consider
    > >>> locale.strcoll("Ärger", "Beere")
    > 1
    >
    > Therefore "Ärger" is sorted after "Beere", which is not correct IMO.
    > Can someone help?
    >
    > Btw: I'm using WinXP (german) and
    > >>> locale.getdefaultlocale()
    > prints
    > ('de_DE', 'cp1252')
    >
    > TIA.
    >
    > Dierk
    >

    We tried this in a JavaScript version and it seems to work. Sorry for the
    possibly bad translation to Python.

    # coding: cp1252
    def _deSpell(a):
        # Decode the cp1252 byte string and spell out umlauts (ä -> ae, ...).
        u = a.decode('cp1252')
        return (u.replace(u'\u00C4', u'Ae').replace(u'\u00e4', u'ae')
                 .replace(u'\u00D6', u'Oe').replace(u'\u00f6', u'oe')
                 .replace(u'\u00DC', u'Ue').replace(u'\u00fc', u'ue')
                 .replace(u'\u00C5', u'Ao').replace(u'\u00e5', u'ao'))

    def deSort(a, b):
        # cmp-style comparison for list.sort() (Python 2).
        return cmp(_deSpell(a), _deSpell(b))

    l = ["Aber", "Ärger", "Beere"]
    l.sort(deSort)
    print l



    --
    Robin Becker
     
    Robin Becker, Mar 2, 2007
    #2

  3. Peter Otten Guest

    wrote:

    > I know that this topic has been discussed in the past, but I could not
    > find a working solution for my problem: sorting (lists of) strings
    > containing special characters like "ä", "ü",... (german umlaute).
    > Consider the following list:
    > l = ["Aber", "Beere", "Ärger"]
    >
    > For sorting the letter "Ä" is supposed to be treated like "Ae",


    I don't think so:

    >>> sorted(["Ast", "Ärger", "Ara"], locale.strcoll)
    ['Ara', '\xc3\x84rger', 'Ast']
    >>> sorted(["Ast", "Aerger", "Ara"])
    ['Aerger', 'Ara', 'Ast']

    > therefore sorting this list should yield
    > l = ["Aber", "Ärger", "Beere"]
    >
    > I know about the module locale and its method strcoll(string1,
    > string2), but currently this does not work correctly for me. Consider
    > >>> locale.strcoll("Ärger", "Beere")
    > 1
    >
    > Therefore "Ärger" is sorted after "Beere", which is not correct IMO.
    > Can someone help?
    >
    > Btw: I'm using WinXP (german) and
    > >>> locale.getdefaultlocale()
    > prints
    > ('de_DE', 'cp1252')


    The default locale is not used by default; you have to set it explicitly:

    >>> import locale
    >>> locale.strcoll("Ärger", "Beere")
    1
    >>> locale.setlocale(locale.LC_ALL, "")
    'de_DE.UTF-8'
    >>> locale.strcoll("Ärger", "Beere")
    -1

    By the way, you will avoid a lot of "Ärger"* if you use unicode right from
    the start.

    Finally, for efficient sorting, a key function is preferable over a cmp
    function:

    >>> sorted(["Ast", "Ärger", "Ara"], key=locale.strxfrm)
    ['Ara', '\xc3\x84rger', 'Ast']
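
For readers on Python 3, here is a minimal sketch of the same idea: set the locale explicitly, then sort with locale.strxfrm as the key. The helper name sort_with_locale is made up for this illustration, and the resulting order depends on which locales are installed on the machine, so no output is shown.

```python
import locale

def sort_with_locale(words, locale_name=""):
    """Sort words by the collation rules of locale_name ("" = user default)."""
    try:
        # Sorting is not locale-aware until setlocale() has been called.
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        # The requested locale is not available; keep the active one.
        pass
    # strxfrm() turns each string into a key that compares like strcoll().
    return sorted(words, key=locale.strxfrm)

print(sort_with_locale(["Ast", "Ärger", "Ara"]))
```

Whether "Ärger" ends up between "Ara" and "Ast" depends on the platform's collation tables.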

    Peter

    (*) German for "trouble"
     
    Peter Otten, Mar 2, 2007
    #3
  4. writes:
    > For sorting the letter "Ä" is supposed to be treated like "Ae",
    > therefore sorting this list should yield
    > l = ["Aber", "Ärger", "Beere"]


    Are you sure? Maybe I'm thinking of another language, but I thought Ä
    should be sorted together with A, and after A only if the words are
    otherwise equal. E.g. Antwort, Ärger, Beere. A proper strcoll handles
    that by translating "Ärger" to e.g. ["Arger", <something like "E\0\0\0\0">],
    so it can sort first by the unaccented name and then by the rest.
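
A rough sketch of that two-level idea in Python 3, assuming that stripping combining marks with unicodedata is an acceptable stand-in for what a real strcoll does internally. The names strip_accents and two_level_key are hypothetical.

```python
import unicodedata

def strip_accents(word):
    """Return word without combining marks, e.g. 'Ärger' -> 'Arger'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def two_level_key(word):
    # Primary level: the unaccented, case-folded spelling.
    # Secondary level: the original word, used only to break ties.
    return (strip_accents(word).casefold(), word)

words = ["Beere", "Ärger", "Antwort", "Bär", "Bar"]
print(sorted(words, key=two_level_key))
# ['Antwort', 'Ärger', 'Bar', 'Bär', 'Beere']
```

"Bar" and "Bär" share the same primary key, so only the second tuple element decides their order.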

    --
    Hallvard
     
    Hallvard B Furuseth, Mar 2, 2007
    #4
  5. Hallvard B Furuseth wrote:
    > writes:


    >> For sorting the letter "Ä" is supposed to be treated like "Ae",
    >> therefore sorting this list should yield
    >> l = ["Aber", "Ärger", "Beere"]

    >
    > Are you sure? Maybe I'm thinking of another language, I thought Ä
    > should be sorted together with A, but after A if the words are
    > otherwise equal.


    In German, there are some different forms:

    - the classic sorting for e.g. word lists: umlauts and plain vowels
    are of same value (like you mentioned): ä = a

    - name list sorting for e.g. phone books: umlauts have the same
    value as their substitutes (like Dierk described): ä = ae

    There are others, too, but those are the most widely used.
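
The two forms can be sketched as sort keys in Python 3. The table names DICTIONARY and PHONE_BOOK and the helper umlaut_key are invented for this example, and only ä, ö and ü are handled.

```python
# Word-list form: an umlaut has the same value as its plain vowel.
DICTIONARY = str.maketrans({"ä": "a", "ö": "o", "ü": "u"})
# Name-list form: an umlaut has the same value as vowel + "e".
PHONE_BOOK = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})

def umlaut_key(word, table):
    # Lower-case first so that "Ä" and "ä" share one table entry.
    return word.lower().translate(table)

words = ["Ast", "Ärger", "Ara"]
print(sorted(words, key=lambda w: umlaut_key(w, DICTIONARY)))
# ['Ara', 'Ärger', 'Ast']
print(sorted(words, key=lambda w: umlaut_key(w, PHONE_BOOK)))
# ['Ärger', 'Ara', 'Ast']
```

The same three words come out in a different order depending on the form chosen, which is why both answers in this thread can be right.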

    Regards,


    Björn

    --
    BOFH excuse #277:

    Your Flux Capacitor has gone bad.
     
    Bjoern Schliessmann, Mar 2, 2007
    #5
  6. Guest

    On 2 Mar., 15:25, Peter Otten wrote:
    > wrote:
    > > For sorting the letter "Ä" is supposed to be treated like "Ae",

    There are several ways of defining the sorting order. The variant "ä
    equals ae" follows DIN 5007 (according to Wikipedia); defining "ä
    equals a" complies with DIN 5007-1. Therefore both options are
    possible.

    > The default locale is not used by default; you have to set it explicitly
    >
    > >>> import locale
    > >>> locale.strcoll("Ärger", "Beere")
    > 1
    > >>> locale.setlocale(locale.LC_ALL, "")
    > 'de_DE.UTF-8'
    > >>> locale.strcoll("Ärger", "Beere")
    > -1


    On my machine
    >>> locale.setlocale(locale.LC_ALL, "")
    gives
    'German_Germany.1252'

    But this does not affect the sorting order as it does on your
    computer.
    >>> locale.strcoll("Ärger", "Beere")
    yields 1 in both cases.

    Thank you for the hint about using unicode from the start; see the
    difference:
    >>> s1 = unicode("Ärger", "latin-1")
    >>> s2 = unicode("Beere", "latin-1")
    >>> locale.strcoll(s1, s2)
    1
    >>> locale.setlocale(locale.LC_ALL, "")
    'German_Germany.1252'
    >>> locale.strcoll(s1, s2)
    -1

    compared to

    >>> s1 = "Ärger"
    >>> s2 = "Beere"
    >>> locale.strcoll(s1, s2)
    1
    >>> locale.setlocale(locale.LC_ALL, "")
    'German_Germany.1252'
    >>> locale.strcoll(s1, s2)
    1

    Thanks for your help.

    Dierk




     
    , Mar 2, 2007
    #6
  7. Robin Becker Guest

    Bjoern Schliessmann wrote:
    > Hallvard B Furuseth wrote:
    >> writes:

    ........
    >
    > In German, there are some different forms:
    >
    > - the classic sorting for e.g. word lists: umlauts and plain vowels
    > are of same value (like you mentioned): ä = a
    >
    > - name list sorting for e.g. phone books: umlauts have the same
    > value as their substitutes (like Dierk described): ä = ae
    >
    > There are others, too, but those are the most widely used.


    Björn, in one of our projects we are sorting in JavaScript in several
    languages: English, German, the Scandinavian languages, and Japanese. From
    somewhere (I cannot actually remember where) we got this sort-spelling
    function for the Scandinavian languages:

    a
    .replace(/\u00C4/g,'A~') //A umlaut
    .replace(/\u00e4/g,'a~') //a umlaut
    .replace(/\u00D6/g,'O~') //O umlaut
    .replace(/\u00f6/g,'o~') //o umlaut
    .replace(/\u00DC/g,'U~') //U umlaut
    .replace(/\u00fc/g,'u~') //u umlaut
    .replace(/\u00C5/g,'A~~') //A ring
    .replace(/\u00e5/g,'a~~'); //a ring

    Does this actually make sense?
    --
    Robin Becker
     
    Robin Becker, Mar 2, 2007
    #7
  8. Robin Becker wrote:

    > Björn, in one of our projects we are sorting in javascript in
    > several languages English, German, Scandinavian languages,
    > Japanese; from somewhere (I cannot actually remember) we got this
    > sort spelling function for scandic languages
    >
    > a
    > .replace(/\u00C4/g,'A~') //A umlaut
    > .replace(/\u00e4/g,'a~') //a umlaut
    > .replace(/\u00D6/g,'O~') //O umlaut
    > .replace(/\u00f6/g,'o~') //o umlaut
    > .replace(/\u00DC/g,'U~') //U umlaut
    > .replace(/\u00fc/g,'u~') //u umlaut
    > .replace(/\u00C5/g,'A~~') //A ring
    > .replace(/\u00e5/g,'a~~'); //a ring
    >
    > does this actually make sense?


    If I'm not mistaken, this would sort all umlauts after the "pure"
    vowels. This is, according to
    <http://de.wikipedia.org/wiki/Alphabetische_Sortierung>, used in Austria.
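
A possible Python 3 port of the quoted replacements shows the effect. TILDE and tilde_key are hypothetical names; the trick relies on "~" having a higher code point than every ASCII letter.

```python
# Hypothetical Python port of the quoted JavaScript replacements.
TILDE = str.maketrans({
    "Ä": "A~", "ä": "a~", "Ö": "O~", "ö": "o~",
    "Ü": "U~", "ü": "u~", "Å": "A~~", "å": "a~~",
})

def tilde_key(word):
    # "Ärger" becomes "A~rger", which compares greater than "Azur".
    return word.translate(TILDE)

words = ["Ärger", "Azur", "Aber", "Beere"]
print(sorted(words, key=tilde_key))
# ['Aber', 'Azur', 'Ärger', 'Beere']
```

"Ärger" lands after every word starting with a plain "A" but still before "Beere".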

    If you can't read German, the rules given on that Wikipedia page in the
    section "Einsortierungsregeln" (roughly: ordering rules) translate
    as follows:

    "X und Y sind gleich": "X equals Y"
    "X kommt nach Y": "X comes after Y"

    Regards&HTH,


    Björn

    --
    BOFH excuse #146:

    Communications satellite used by the military for star wars.
     
    Bjoern Schliessmann, Mar 2, 2007
    #8
  9. Robin Becker wrote:
    >
    > Björn, in one of our projects we are sorting in javascript in several
    > languages English, German, Scandinavian languages, Japanese; from
    > somewhere (I cannot actually remember) we got this sort spelling
    > function for scandic languages
    >
    > a
    > .replace(/\u00C4/g,'A~') //A umlaut
    > .replace(/\u00e4/g,'a~') //a umlaut
    > .replace(/\u00D6/g,'O~') //O umlaut
    > .replace(/\u00f6/g,'o~') //o umlaut
    > .replace(/\u00DC/g,'U~') //U umlaut
    > .replace(/\u00fc/g,'u~') //u umlaut
    > .replace(/\u00C5/g,'A~~') //A ring
    > .replace(/\u00e5/g,'a~~'); //a ring
    >
    > does this actually make sense?


    I think this order is not correct for Finnish, which is one of the
    Scandinavian languages. The Finnish alphabet in alphabetical order is:

    a-z, å, ä, ö

    If I understand correctly your replacements cause the order of the last
    3 characters to be

    ä, å, ö

    which is wrong.
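
A small Python 3 check of this, assuming plain lower-case input. FINNISH_ALPHABET, finnish_key and tilde_key are names made up for the illustration; tilde_key repeats only the lower-case replacements from the quoted JavaScript.

```python
FINNISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {letter: index for index, letter in enumerate(FINNISH_ALPHABET)}

def finnish_key(word):
    # Characters outside the alphabet sort after all letters (arbitrary choice).
    return [RANK.get(ch, len(RANK)) for ch in word.lower()]

def tilde_key(word):
    # Lower-case subset of the quoted replacements.
    return word.replace("ä", "a~").replace("ö", "o~").replace("å", "a~~")

letters = ["ö", "ä", "å", "z"]
print(sorted(letters, key=tilde_key))
# ['ä', 'å', 'ö', 'z']
print(sorted(letters, key=finnish_key))
# ['z', 'å', 'ä', 'ö']
```

With the tilde replacements "a~" is a prefix of "a~~", so ä sorts before å, and all three letters sort before z instead of after it.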

    HTH,
    Jussi
     
    Jussi Salmela, Mar 4, 2007
    #9
