Sorting a list of Unicode strings?


oliver

Aug 19, 2007

#1

Hey Guys,

Maybe I'm missing something fundamental here, but if I have a list of
Unicode strings, and I want to sort these alphabetically, then it
places those that begin with unicode characters at the bottom. Is
there a way to avoid this, and make it sort them properly?

I'm sure that this is the "proper way" programmatically with character
entities etc. - but when I have a list of countries, and I have Åland
Islands right at the bottom, it just doesn't look right.

Any help would be really appreciated.

Thanks,
Oliver


Stefan Behnel

Aug 19, 2007

#2

Hey Guys,

.... and girls - maybe ...

Maybe I'm missing something fundamental here, but if I have a list of
Unicode strings, and I want to sort these alphabetically, then it
places those that begin with unicode characters at the bottom.

That's because "Unicode" is more than one alphabet. unicode objects compare
based on the Unicode code point value of each character, so sort() does
likewise.
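
To see the effect concretely, here is a minimal sketch in Python 3 (where every str is a Unicode string). The country names are just sample data I made up for illustration, not the original poster's list.

```python
# Default string comparison goes code point by code point.
# "\u00c5" is the letter A-with-ring, so the second entry reads "Åland Islands".
countries = ["Sweden", "\u00c5land Islands", "Albania", "Zimbabwe"]

# A-with-ring is U+00C5 (197), which is greater than 'Z' (U+005A, 90),
# so a name starting with it lands after every name starting with A-Z.
print(ord("\u00c5"), ord("Z"))
print(sorted(countries))
```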

Stefan


oliver

Aug 19, 2007

#3

... and girls - maybe ...

That's because "Unicode" is more than one alphabet. unicode objects compare
based on the Unicode code point value of each character, so sort() does
likewise.

Stefan

Thanks for putting me right -- gals indeed!

Anyway, I know _why_ it does this, but I really do need it to sort
them correctly based on how humans would look at it.

Any ideas?


Alex Martelli

Aug 19, 2007

#4

...
Anyway, I know _why_ it does this, but I really do need it to sort
them correctly based on how humans would look at it.

Depending on the nationality of those humans, you may need very
different sorting criteria; indeed, in some countries, different sorting
criteria apply to different use cases (such as sorting surnames versus
sorting book titles, etc; sorry, I don't recall specific examples, but
if you delve into sites about i18n issues you'll find some).

In both Swedish and Danish, I believe, A-with-ring sorts AFTER the
letter Z in the alphabet; so, having Aaland (where I'm using Aa for
A-with-ring, since this newsreader has some problem in letting me enter
non-ascii characters;-) sort "right at the bottom", while it "doesn't
look right" to YOU (maybe an English-speaker?) may look right to the
inhabitants of that locality (be they Danes or Swedes -- but I believe
Norwegian may also work similarly in terms of sorting).

The Unicode consortium does define a standard collation algorithm (UCA)
and table (DUCET) to use when you need a locale-independent ordering; at
<http://jtauber.com/blog/2006/01/27/python_unicode_collation_algorithm>
you'll be able to obtain James Tauber's Python implementation of UCA, to
work with the DUCET found at
<http://www.unicode.org/Public/UCA/latest/allkeys.txt>.

I suspect you won't like the collation order you obtain this way, but
you might start from there, subsetting and tweaking the DUCET into an
OUCET (Oliver Unicode Collation Element Table;-) that suits you better.

A simpler, rougher approach, if you think the "right" collation is
obtained by ignoring accents, diacritics, etc (even though the speakers
of many languages that include diacritics, &c, disagree;-) is to use the
key=coll argument in your sorting call, passing a function coll that
maps any Unicode string to what you _think_ it should be like for
sorting purposes. The .translate method of Unicode string objects may
help there: it takes a dict mapping Unicode ordinals to ordinals or
strings (or None for characters you want to delete as part of the
translation).
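
As a rough sketch of the .translate mechanics only (Python 3 syntax, where str is the Unicode string type), the table below shows the three kinds of value the dict can hold. The particular mappings and the sample strings are hypothetical choices made for illustration.

```python
# Keys are code points (ints); values may be a code point, a string, or None.
table = {
    ord("\u00c5"): ord("A"),  # A-with-ring -> A (ordinal to ordinal)
    ord("\u00df"): "ss",      # sharp s -> "ss" (ordinal to string)
    ord("-"): None,           # hyphens are deleted
}

def coll(s):
    """Hypothetical sort key: the string as we want it compared."""
    return s.translate(table)

print(coll("\u00c5land"))
print(coll("Stra\u00dfe-Nord"))
```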

For example, suppose that what we want is the following somewhat silly
collation: we only care about ISO-8859-1 characters, and want to ignore
for sorting purposes any accent (be it grave, acute or circumflex),
umlauts, rings, slashes through letters, tildes, cedillas. htmlentitydefs
has a useful dict called codepoint2name that helps us identify those
"weirdly decorated foreign characters".

def make_transdict():
    # Python 2 module; in Python 3 it is html.entities.
    import htmlentitydefs
    cp2n = htmlentitydefs.codepoint2name
    suffixes = 'acute grave circ uml ring slash tilde cedil'.split()
    td = {}
    for x in range(128, 256):
        if x not in cp2n: continue
        n = cp2n[x]
        for s in suffixes:
            if n.endswith(s):
                # 'Aacute' -> u'A': keep what precedes the suffix. A bare
                # accent such as 'acute' maps to u'', i.e. it is dropped.
                td[x] = unicode(n[:-len(s)])
                break
    return td

def coll(us, td=make_transdict()):
    # Sort key: the string with decorated letters replaced by plain ones.
    return us.translate(td)

listofus.sort(key=coll)

I haven't tested this code, but it should be reasonably easy to fix any
problems it might have, as well as making make_transdict "richer" to
meet your goals. Just be aware that the resulting collation (e.g.,
sorting a-ring just as if it was a plain a) will be ABSOLUTELY WEIRD to
anybody who knows something about Scandinavian languages...!!!-)
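
For comparison, here is a sketch of a different technique from the entity-name table above: Unicode normalization via the standard unicodedata module (Python 3). It decomposes each character and drops the combining marks. The function name strip_marks and the sample names are hypothetical. Note that it is not equivalent to the table approach: a slashed O has no decomposition, so it passes through unchanged.

```python
import unicodedata

def strip_marks(s):
    """Sort key: decompose, then drop combining marks (accents, rings, ...)."""
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# "\u00c5" is A-with-ring, "\u00c9" is E-acute.
names = ["Zimbabwe", "\u00c5land Islands", "Albania", "\u00c9cosse"]
print(sorted(names, key=strip_marks))
```

The same warning applies: this puts A-with-ring among the plain A's, which is what an English-language list might want and exactly what a Scandinavian reader would find wrong.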

Alex


Steve Holden

Aug 20, 2007

#5

Alex said:
Depending on the nationality of those humans, you may need very
different sorting criteria; indeed, in some countries, different sorting
criteria apply to different use cases (such as sorting surnames versus
sorting book titles, etc; sorry, I don't recall specific examples, but
if you delve on sites about i18n issues you'll find some).

Just one example from my own experience. When sorting names in Scotland
(and technically in the rest of the UK too in deference to Scotland,
though this is often ignored) names beginning with "Mc" have to be
sorted /as though/ they began with "Mac". Since the two prefixes are
indistinguishable phonetically it would otherwise mean twice as much
work to look up one of those names.
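
A minimal sketch of that rule as a sort key, in Python 3. The function name surname_key and the sample surnames are hypothetical, and a real directory would need more care (for instance "M'" prefixes, or names such as "Mace" that merely happen to start with "Mac").

```python
def surname_key(name):
    """Hypothetical sort key: treat a leading 'Mc' as if it were 'Mac'."""
    if name.startswith("Mc"):
        name = "Mac" + name[2:]
    # Compare case-insensitively so 'MacDonald' and 'Mackenzie' interleave.
    return name.lower()

surnames = ["Mackenzie", "McAdam", "MacDonald", "Mabon", "McEwan"]
print(sorted(surnames))                   # plain code point order
print(sorted(surnames, key=surname_key))  # "Mc" filed under "Mac"
```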

regards
Steve


thebjorn

Aug 20, 2007

#6

On Aug 19, 8:09 pm, Alex Martelli wrote:
[...]

In both Swedish and Danish, I believe, A-with-ring sorts AFTER the
letter Z in the alphabet; so, having Aaland (where I'm using Aa for
A-with-ring, since this newsreader has some problem in letting me enter
non-ascii characters;-) sort "right at the bottom", while it "doesn't
look right" to YOU (maybe an English-speaker?) may look right to the
inhabitants of that locality (be they Danes or Swedes -- but I believe
Norwegian may also work similarly in terms of sorting).

You're absolutely correct, the Norwegian and Danish alphabets end
with ..xyzæøå, while the Swedish alphabet ends with ..xyzåäö and sort
order follows placement. Indeed, my first reaction to the OP was:
where else would Åland be but at the end? One perhaps interesting
tidbit is that Åland "belongs" to Finland (it's an autonomous,
demilitarized, monolingually Swedish-speaking administrative province
of Finland). The Finnish alphabet is identical to the Swedish
alphabet, including sort order (at least in this case).

For the ASCII-speakers out there, the key point to remember is that
the letter Å (pronounced like the au in British autumn) is not an
ASCII A with a ring on top. The ring-on-top is an intrinsic part of
the letter, in the same way the tail on the letter Q isn't a
decoration of the letter O.
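
If you do want the national orderings described above, one rough way is to rank letters by their position in an explicit alphabet string. This is only a sketch in Python 3 under simplifying assumptions: the two alphabet constants, the make_key helper and the sample words are all hypothetical, and real Swedish or Danish collation has more rules than this.

```python
# Assumed alphabets: a-z followed by the extra letters in each language's order.
SWEDISH = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"         # ...xyz + å ä ö
DANO_NORWEGIAN = "abcdefghijklmnopqrstuvwxyz\u00e6\u00f8\u00e5"  # ...xyz + æ ø å

def make_key(alphabet):
    """Build a sort key that ranks letters by their position in `alphabet`."""
    rank = {ch: i for i, ch in enumerate(alphabet)}
    def key(s):
        # Characters outside the alphabet (spaces, digits, ...) get rank -1,
        # so they sort before every letter.
        return [rank.get(ch, -1) for ch in s.lower()]
    return key

# Öland, Zimbabwe, Åland, Ängelholm, Albania
swedish_words = ["\u00d6land", "Zimbabwe", "\u00c5land", "\u00c4ngelholm", "Albania"]
print(sorted(swedish_words, key=make_key(SWEDISH)))

# Ål, Ærø, Østfold, Zebra
norwegian_words = ["\u00c5l", "\u00c6r\u00f8", "\u00d8stfold", "Zebra"]
print(sorted(norwegian_words, key=make_key(DANO_NORWEGIAN)))
```

Note how Å comes first among the extra letters in Swedish but last in Norwegian and Danish, which is why no single "correct" order exists.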

-- bjorn


Tommy Nordgren

Aug 20, 2007

#7

Hey Guys,

Maybe I'm missing something fundamental here, but if I have a list of
Unicode strings, and I want to sort these alphabetically, then it
places those that begin with unicode characters at the bottom. Is
there a way to avoid this, and make it sort them properly?

I'm sure that this is the "proper way" programmatically with character
entities etc. - but when I have a list of countries, and I have Åland
Islands right at the bottom, it just doesn't look right.

Any help would be really appreciated.

Thanks,
Oliver


That is the correct alphabetical sort order for Åland.
The Swedish letters Å, Ä and Ö sort last in alphabetical order.
Tommy Nordgren


oliver

Aug 20, 2007

#8

Thank you all for your very quick and informative replies. I had
assumed that Å is classed as a standard 'A' based on a list of
countries I was looking at (Wikipedia sorts it like this too, though
that isn't the list I was using: http://en.wikipedia.org/wiki/List_of_countries#A).

I will leave it as it is, with Å at the bottom, if this is the correct
ordering.

Once again, thank you!

Oliver


koen.vanleemput

Aug 30, 2007

#9

The Swedish-language Wikipedia lists it at the bottom ;-)

http://sv.wikipedia.org/wiki/Lista_över_länder#.C3.85

Cheers
~K
