Sorting strings containing special characters (German 'Umlaute')

DierkErdmann

Hi!

I know that this topic has been discussed in the past, but I could not
find a working solution for my problem: sorting (lists of) strings
containing special characters like "ä", "ü", ... (German umlauts).
Consider the following list:
l = ["Aber", "Beere", "Ärger"]

For sorting, the letter "Ä" is supposed to be treated like "Ae",
therefore sorting this list should yield
l = ["Aber", "Ärger", "Beere"]

I know about the module locale and its function strcoll(string1,
string2), but currently this does not work correctly for me: comparing
"Ärger" with "Beere" returns 1.

Therefore "Ärger" is sorted after "Beere", which is not correct IMO.
Can someone help?

Btw: I'm using WinXP (German), and the default locale prints as
('de_DE', 'cp1252')

TIA.

Dierk
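
For readers who want to reproduce the starting point, here is a minimal sketch in Python 3 (the thread itself uses Python 2). It assumes nothing beyond the list from the post and only shows why a plain sort puts "Ärger" last: without a key, strings are compared by code point.

```python
words = ["Aber", "Beere", "Ärger"]

# "\u00c4" is "Ä"; its code point is far above that of "B".
print(ord("B"), ord("\u00c4"))  # 66 196

# With no key function, sorted() compares code points, so "Ärger" ends up last.
print(sorted(words))  # ['Aber', 'Beere', 'Ärger']
```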
 
Robin Becker

We tried this in a JavaScript version and it seems to work. Sorry for the
long line and the possibly bad translation to Python.


# coding: cp1252
# Python 2: the string literals below are cp1252-encoded byte strings.
_DE_SPELLING = [
    (u'\u00C4', u'Ae'), (u'\u00e4', u'ae'),  # A/a umlaut
    (u'\u00D6', u'Oe'), (u'\u00f6', u'oe'),  # O/o umlaut
    (u'\u00DC', u'Ue'), (u'\u00fc', u'ue'),  # U/u umlaut
    (u'\u00C5', u'Ao'), (u'\u00e5', u'ao'),  # A/a ring
]

def _deSpell(a):
    u = a.decode('cp1252')
    for char, spelling in _DE_SPELLING:
        u = u.replace(char, spelling)
    return u

def deSort(a, b):
    return cmp(_deSpell(a), _deSpell(b))

l = ["Aber", "Ärger", "Beere"]
l.sort(deSort)
print l
 
Peter Otten

DierkErdmann said:
For sorting the letter "Ä" is supposed to be treated like "Ae",

I don't think so:
>>> sorted(["Ast", "Ärger", "Ara"], locale.strcoll)
['Ara', '\xc3\x84rger', 'Ast']
>>> sorted(["Ast", "Aerger", "Ara"])
['Aerger', 'Ara', 'Ast']
DierkErdmann said:
Btw: I'm using WinXP (German), and the default locale prints as
('de_DE', 'cp1252')

The default locale is not used by default; you have to set it explicitly.
After that, the comparison returns -1.

By the way, you will avoid a lot of "Ärger"* if you use unicode right from
the start.

Finally, for efficient sorting, a key function is preferable over a cmp
function:
>>> sorted(["Ast", "Ärger", "Ara"], key=locale.strxfrm)
['Ara', '\xc3\x84rger', 'Ast']
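
The two points above (set the locale explicitly, then sort with a key function) can be sketched in Python 3 roughly as follows. The helper name sort_with_locale is hypothetical, and the result depends on which locales are installed on the machine, so no output is shown for the German case.

```python
import locale

def sort_with_locale(words, locale_name=""):
    """Sort words with the collation rules of locale_name.

    An empty name requests the user's default locale. locale.Error is
    raised if the requested locale is not available on this system.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        # strxfrm turns each string into a key that compares like strcoll.
        return sorted(words, key=locale.strxfrm)
    finally:
        # Restore the previous collation so the change stays local.
        locale.setlocale(locale.LC_COLLATE, previous)

try:
    print(sort_with_locale(["Ast", "Ärger", "Ara"]))
except locale.Error:
    print("default locale is not available on this system")
```

Restoring the previous setting matters because setlocale changes process-wide state.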

Peter

(*) German for "trouble"
 
Hallvard B Furuseth

DierkErdmann said:
For sorting the letter "Ä" is supposed to be treated like "Ae",
therefore sorting this list should yield
l = ["Aber", "Ärger", "Beere"]

Are you sure? Maybe I'm thinking of another language, but I thought Ä should
be sorted together with A, but after A if the words are otherwise equal.
E.g. Antwort, Ärger, Beere. A proper strcoll handles that by
translating "Ärger" to e.g. ["Arger", <something like "E\0\0\0\0">],
then it can sort first by the unaccented name and then by the rest.
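
The two-level idea described above can be imitated in Python 3 with a tuple key. This is only a simplified illustration, not how any real strcoll is implemented; the helper names strip_accents and two_level_key are made up for this sketch.

```python
import unicodedata

def strip_accents(word):
    # NFD splits "Ä" into "A" plus a combining diaeresis; drop the combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def two_level_key(word):
    # First level: the unaccented, case-folded word.
    # Second level: the original word, used only to break ties.
    return (strip_accents(word).casefold(), word)

words = ["Beere", "Ärger", "Antwort", "Arger"]
print(sorted(words, key=two_level_key))
# ['Antwort', 'Arger', 'Ärger', 'Beere']
```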
 
Bjoern Schliessmann

Hallvard said:
Are you sure? Maybe I'm thinking of another language, I thought Ä
should be sorted together with A, but after A if the words are
otherwise equal.

In German, there are some different forms:

- the classic sorting for e.g. word lists: umlauts and plain vowels
are of same value (like you mentioned): ä = a

- name list sorting for e.g. phone books: umlauts have the same
value as their substitutes (like Dierk described): ä = ae

There are others, too, but those are the most widely used.
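
As a rough Python 3 sketch of the two conventions listed above, each one can be written as a key function built on a translation table. The table and function names are invented for this example, and only the three umlauts are handled.

```python
# Word-list convention: an umlaut counts as its base vowel (ä = a).
WORD_LIST = str.maketrans({
    "Ä": "A", "Ö": "O", "Ü": "U",
    "ä": "a", "ö": "o", "ü": "u",
})

# Phone-book convention: an umlaut counts as its substitute (ä = ae).
PHONE_BOOK = str.maketrans({
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ä": "ae", "ö": "oe", "ü": "ue",
})

def word_list_key(word):
    return word.translate(WORD_LIST).casefold()

def phone_book_key(word):
    return word.translate(PHONE_BOOK).casefold()

words = ["Ast", "Ärger", "Ara", "Aerger"]
print(sorted(words, key=word_list_key))   # ['Aerger', 'Ara', 'Ärger', 'Ast']
print(sorted(words, key=phone_book_key))  # ['Ärger', 'Aerger', 'Ara', 'Ast']
```

Under the phone-book key "Ärger" and "Aerger" compare equal, so their relative order is simply the input order (Python's sort is stable).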

Regards,


Björn
 
DierkErdmann

There are several ways of defining the sorting order. The variant "ä
equals ae" follows DIN 5007 (according to Wikipedia); defining "ä
equals a" complies with DIN 5007-1. Therefore both options are
possible.
Peter said:
The default locale is not used by default; you have to set it explicitly

On my machine, setting the locale gives
'German_Germany.1252'

But this does not affect the sorting order as it does on your
computer: the comparison yields 1 in both cases.

Thank you for your hint about using unicode from the beginning. See the
difference: with unicode strings the comparison yields
-1

compared to
1
with byte strings.

Thanks for your help.

Dierk



 
Robin Becker

Bjoern said:
........

In German, there are some different forms:

- the classic sorting for e.g. word lists: umlauts and plain vowels
are of same value (like you mentioned): ä = a

- name list sorting for e.g. phone books: umlauts have the same
value as their substitutes (like Dierk described): ä = ae

There are others, too, but those are the most widely used.

Björn, in one of our projects we are sorting in JavaScript in several
languages: English, German, Scandinavian languages, Japanese. From somewhere
(I cannot actually remember where) we got this sort spelling function for
Scandinavian languages:

a
    .replace(/\u00C4/g, 'A~')   // A umlaut
    .replace(/\u00e4/g, 'a~')   // a umlaut
    .replace(/\u00D6/g, 'O~')   // O umlaut
    .replace(/\u00f6/g, 'o~')   // o umlaut
    .replace(/\u00DC/g, 'U~')   // U umlaut
    .replace(/\u00fc/g, 'u~')   // u umlaut
    .replace(/\u00C5/g, 'A~~')  // A ring
    .replace(/\u00e5/g, 'a~~'); // a ring

Does this actually make sense?
 
Bjoern Schliessmann

Robin said:
Björn, in one of our projects we are sorting in javascript in
several languages English, German, Scandinavian languages,
Japanese; from somewhere (I cannot actually remember) we got this
sort spelling function for scandic languages

a
.replace(/\u00C4/g,'A~') //A umlaut
.replace(/\u00e4/g,'a~') //a umlaut
.replace(/\u00D6/g,'O~') //O umlaut
.replace(/\u00f6/g,'o~') //o umlaut
.replace(/\u00DC/g,'U~') //U umlaut
.replace(/\u00fc/g,'u~') //u umlaut
.replace(/\u00C5/g,'A~~') //A ring
.replace(/\u00e5/g,'a~~'); //a ring

does this actually make sense?

If I'm not mistaken, this would sort all umlauts after the "pure"
vowels. This is, according to
<http://de.wikipedia.org/wiki/Alphabetische_Sortierung>, used in Austria.

If you can't read German, the rules given there in the
section "Einsortierungsregeln" (roughly: ordering rules) translate
as follows:

"X und Y sind gleich": "X equals Y"
"X kommt nach Y": "X comes after Y"
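
The effect of the tilde replacements discussed above can be checked with a small Python 3 translation of the JavaScript snippet. The names TILDE and tilde_key are made up for this sketch, and it assumes all words use the same capitalisation, since no case folding is applied.

```python
# Same replacements as the JavaScript version: umlaut -> vowel + "~".
TILDE = str.maketrans({
    "Ä": "A~", "ä": "a~",
    "Ö": "O~", "ö": "o~",
    "Ü": "U~", "ü": "u~",
    "Å": "A~~", "å": "a~~",
})

def tilde_key(word):
    return word.translate(TILDE)

# "~" (code point 126) is greater than "z" (122), so "A~rger" sorts
# after every word starting with "A" plus a letter, but before "B...".
words = ["Beere", "Ärger", "Azur", "Ast", "Aber"]
print(sorted(words, key=tilde_key))
# ['Aber', 'Ast', 'Azur', 'Ärger', 'Beere']
```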

Regards&HTH,


Björn
 
Jussi Salmela

Robin Becker wrote:
Björn, in one of our projects we are sorting in javascript in several
languages English, German, Scandinavian languages, Japanese; from
somewhere (I cannot actually remember) we got this sort spelling
function for scandic languages

a
.replace(/\u00C4/g,'A~') //A umlaut
.replace(/\u00e4/g,'a~') //a umlaut
.replace(/\u00D6/g,'O~') //O umlaut
.replace(/\u00f6/g,'o~') //o umlaut
.replace(/\u00DC/g,'U~') //U umlaut
.replace(/\u00fc/g,'u~') //u umlaut
.replace(/\u00C5/g,'A~~') //A ring
.replace(/\u00e5/g,'a~~'); //a ring

does this actually make sense?

I think this order is not correct for Finnish, which is one of the
Scandinavian languages. The Finnish alphabet in alphabetical order is:

a-z, å, ä, ö

If I understand correctly your replacements cause the order of the last
3 characters to be

ä, å, ö

which is wrong.
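
To make the difference concrete, here is a Python 3 sketch that compares the tilde replacements with a key built directly from the alphabet order given above. The names are hypothetical and the Finnish key is a deliberate simplification: it ranks single letters only and ignores any further collation rules.

```python
# Tilde replacements from the quoted snippet (lower case only, for brevity).
TILDE = str.maketrans({"ä": "a~", "ö": "o~", "ü": "u~", "å": "a~~"})

def tilde_key(word):
    return word.translate(TILDE)

# Alphabet order as stated in the post: a-z, å, ä, ö.
FINNISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {ch: i for i, ch in enumerate(FINNISH_ALPHABET)}

def finnish_key(word):
    # Assumption: characters outside the alphabet sort after "ö", by code point.
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in word.lower()]

letters = ["ö", "ä", "z", "å", "a"]
print(sorted(letters, key=tilde_key))    # ['a', 'ä', 'å', 'ö', 'z']
print(sorted(letters, key=finnish_key))  # ['a', 'z', 'å', 'ä', 'ö']
```

With the tilde key, "ä" comes before "å", and all three letters land before "z" rather than at the end of the alphabet.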

HTH,
Jussi
 
