Alphabetical sorts

Discussion in 'Python' started by Ron Adam, Oct 16, 2006.

  1. Ron Adam

    Ron Adam Guest

    I have several applications where I want to sort lists in alphabetical order.
    Most examples of sorting usually sort on the ord() order of the character set as
    an approximation. But that is not always what you want.

    The solution of converting everything to lowercase or uppercase is closer, but
    it would be nice if capitalized words come before lowercase words of the same
    spellings. And I suspect ord() order may not be correct for some character sets.

    So I'm wondering what others have done and whether there is something in the
    standard library I haven't found for doing this.

    Below is my current way of doing it, but I think it can probably be improved
    quite a bit.

    This partial solution also allows ignoring leading characters such as spaces,
    tabs, and underscores by specifying what not to ignore. So '__ABC__' will be
    next to 'ABC'. But this aspect isn't my main concern.

    Maybe some sort of alphabetical order string could be easily referenced for
    various alphabets instead of having to create them manually?

    Also it would be nice if strings with multiple words were ordered correctly.


    Cheers,
    _Ron



    def stripto(s, goodchrs):
        """ Removes leading and trailing characters from string s
            which are not in the string goodchrs.
        """
        badchrs = set(s)
        for c in goodchrs:
            if c in badchrs:
                badchrs.remove(c)
        badchrs = ''.join(badchrs)
        return s.strip(badchrs)


    def alpha_sorted(seq):
        """ Sort a list of strings in 123AaBbCc... order.
        """
        order = ( '0123456789AaBbCcDdEeFfGgHhIiJjKkLlMmNn'
                  'OoPpQqRrSsTtUuVvWwXxYyZz' )
        def chr_index(value, sortorder):
            """ Make a sortable numeric list
            """
            result = []
            for c in stripto(value, order):
                cindex = sortorder.find(c)
                if cindex == -1:
                    cindex = len(sortorder)+ord(c)
                result.append(cindex)
            return result

        deco = [(chr_index(a, order), a) for a in seq]
        deco.sort()
        return list(x[1] for x in deco)
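
The "capitalized words before lowercase words of the same spelling" requirement can also be met with a tuple key, sketched here in Python 3 syntax. This is a different technique from the character-by-character 123AaBbCc order in alpha_sorted: it compares case-insensitively first and uses case only as a tie-breaker. The name alpha_key and its ignore parameter are assumptions made for this illustration, not something from the standard library.

```python
def alpha_key(s, ignore=" \t_"):
    """Sort key: case-insensitive spelling first, then case as a tie-breaker."""
    core = s.strip(ignore)  # drop leading/trailing characters we want to ignore
    # 1. the lowercased spelling groups 'ABC' and 'abc' together
    # 2. the stripped text breaks ties; uppercase letters have lower code
    #    points than lowercase ones, so 'ABC' lands before 'abc'
    # 3. the original string keeps the result deterministic
    return (core.lower(), core, s)


words = ["abc", "__ABC__", "ABC", "Abd"]
print(sorted(words, key=alpha_key))  # ['ABC', '__ABC__', 'abc', 'Abd']
```

Because the first element of the key ignores case, 'Abd' still sorts after every spelling of 'abc', which the plain ord() order would not give.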
    Ron Adam, Oct 16, 2006
    #1

  2. Ron Adam

    Neil Cerutti Guest

    On 2006-10-16, Ron Adam <> wrote:
    >
    > I have several applications where I want to sort lists in
    > alphabetical order. Most examples of sorting usually sort on
    > the ord() order of the character set as an approximation. But
    > that is not always what you want.


    Check out strxfrm in the locale module.

    >>> a = ["Neil", "Cerutti", "neil", "cerutti"]
    >>> a.sort()
    >>> a
    ['Cerutti', 'Neil', 'cerutti', 'neil']
    >>> import locale
    >>> locale.setlocale(locale.LC_ALL, '')
    'English_United States.1252'
    >>> a.sort(key=locale.strxfrm)
    >>> a
    ['cerutti', 'Cerutti', 'neil', 'Neil']

    --
    Neil Cerutti
    Neil Cerutti, Oct 16, 2006
    #2

  3. Ron Adam

    Tuomas Guest

    My application needs to handle different language sorts. Do you know a
    way to apply strxfrm dynamically i.e. without setting the locale?

    Tuomas

    Neil Cerutti wrote:
    > On 2006-10-16, Ron Adam <> wrote:
    >> I have several applications where I want to sort lists in
    >> alphabetical order. Most examples of sorting usually sort on
    >> the ord() order of the character set as an approximation. But
    >> that is not always what you want.
    >
    > Check out strxfrm in the locale module.
    >
    > >>> a = ["Neil", "Cerutti", "neil", "cerutti"]
    > >>> a.sort()
    > >>> a
    > ['Cerutti', 'Neil', 'cerutti', 'neil']
    > >>> import locale
    > >>> locale.setlocale(locale.LC_ALL, '')
    > 'English_United States.1252'
    > >>> a.sort(key=locale.strxfrm)
    > >>> a
    > ['cerutti', 'Cerutti', 'neil', 'Neil']
    Tuomas, Oct 16, 2006
    #3
  4. Ron Adam

    Leo Kislov Guest

    On Oct 16, 2:39 pm, Tuomas <> wrote:
    > My application needs to handle different language sorts. Do you know a
    > way to apply strxfrm dynamically i.e. without setting the locale?


    Collation is almost always locale dependent, so you have to set the locale.
    One day I needed collation that worked on Windows and Linux. It's not
    that polished and not that tested, but it worked for me:

    import locale, os, codecs

    current_encoding = 'ascii'
    current_locale = ''

    def get_collate_encoding(s):
        '''Grab character encoding from locale name'''
        split_name = s.split('.')
        if len(split_name) != 2:
            return 'ascii'
        encoding = split_name[1]
        if os.name == "nt":
            encoding = 'cp' + encoding
        try:
            codecs.lookup(encoding)
            return encoding
        except LookupError:
            return 'ascii'

    def setup_locale(locale_name):
        '''Switch to new collation locale or do nothing if locale
        is the same'''
        global current_locale, current_encoding
        if current_locale == locale_name:
            return
        current_encoding = get_collate_encoding(
            locale.setlocale(locale.LC_COLLATE, locale_name))
        current_locale = locale_name

    def collate_key(s):
        '''Return collation weight of a string'''
        return locale.strxfrm(s.encode(current_encoding, 'ignore'))

    def collate(lst, locale_name):
        '''Sort a list of unicode strings according to locale rules.
        Locale is specified as 2 letter code'''
        setup_locale(locale_name)
        return sorted(lst, key = collate_key)


    words = u'c ch f'.split()
    print ' '.join(collate(words, 'en'))
    print ' '.join(collate(words, 'cz'))

    Prints:

    c ch f
    c f ch
    Leo Kislov, Oct 17, 2006
    #4
  5. Ron Adam

    Ron Adam Guest

    Neil Cerutti wrote:
    > On 2006-10-16, Ron Adam <> wrote:
    >> I have several applications where I want to sort lists in
    >> alphabetical order. Most examples of sorting usually sort on
    >> the ord() order of the character set as an approximation. But
    >> that is not always what you want.
    >
    > Check out strxfrm in the locale module.
    >
    > >>> a = ["Neil", "Cerutti", "neil", "cerutti"]
    > >>> a.sort()
    > >>> a
    > ['Cerutti', 'Neil', 'cerutti', 'neil']
    > >>> import locale
    > >>> locale.setlocale(locale.LC_ALL, '')
    > 'English_United States.1252'
    > >>> a.sort(key=locale.strxfrm)
    > >>> a
    > ['cerutti', 'Cerutti', 'neil', 'Neil']


    Thanks, that helps.

    The documentation for locale.strxfrm() certainly could be more complete. And the
    name isn't intuitive at all. It also corresponds to the C function for
    translating strings, which isn't the same thing.

    For that matter locale.strcoll() isn't documented any better.



    I see this is actually a very complex subject. A little searching found the
    following link on Wikipedia.

    http://en.wikipedia.org/wiki/Alphabetical_order#Compound_words_and_special_characters

    And from there a very informative report:

    http://www.unicode.org/unicode/reports/tr10/
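
The core idea in that report is multi-level comparison: base letters are compared first, accents only break ties between identical base letters, and case only breaks ties after that. The following is a rough sketch of that idea in Python 3 using the standard unicodedata module. It is not the Unicode Collation Algorithm itself (it uses no collation weight tables), and the name multilevel_key is made up for this illustration.

```python
import unicodedata


def multilevel_key(s):
    """Three-level key: base letters, then accents, then case."""
    decomposed = unicodedata.normalize("NFD", s)
    # Level 1: base letters only, combining accents removed and case folded.
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    primary = base.casefold()
    # Level 2: accents count, case is still ignored.
    secondary = decomposed.casefold()
    # Level 3: case; swapcase() makes lowercase sort before uppercase.
    tertiary = decomposed.swapcase()
    return (primary, secondary, tertiary)


words = ["Resume", "r\u00e9sum\u00e9", "rest", "resume"]
print(sorted(words, key=multilevel_key))
```

Replacing decomposed.swapcase() with plain decomposed in the third level would put capitals first instead, which is the kind of switch a configurable class could expose.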


    It looks to me this would be a good candidate for a configurable class.
    Something preferably in the string module where it could be found easier.

    Is there any way to change the behavior of strxfrm or strcoll? For example, have
    caps before lowercase instead of after?


    Cheers,
    Ron
    Ron Adam, Oct 17, 2006
    #5
  6. Ron Adam

    Neil Cerutti Guest

    On 2006-10-17, Ron Adam <> wrote:
    > Neil Cerutti wrote:
    >> On 2006-10-16, Ron Adam <> wrote:
    >>> I have several applications where I want to sort lists in
    >>> alphabetical order. Most examples of sorting usually sort on
    >>> the ord() order of the character set as an approximation.
    >>> But that is not always what you want.

    >>
    >> Check out strxfrm in the locale module.

    >
    > It looks to me this would be a good candidate for a
    > configurable class. Something preferably in the string module
    > where it could be found easier.
    >
    > Is there anyway to change the behavior of strxfrm or strcoll?
    > For example have caps before lowercase, instead of after?


    You can probably get away with writing a strxfrm function that
    spits out numbers that fit your definition of sorting.
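
A minimal sketch of such a function, in Python 3 syntax. The ordering string (digits, then each letter as an uppercase/lowercase pair) and the names ORDER, RANK and custom_xfrm are assumptions chosen for illustration; any other ordering string would work the same way.

```python
import string

# Hypothetical ordering: digits, then each letter as an uppercase/lowercase pair.
ORDER = string.digits + "".join(
    upper + lower
    for upper, lower in zip(string.ascii_uppercase, string.ascii_lowercase)
)
RANK = {char: index for index, char in enumerate(ORDER)}


def custom_xfrm(s):
    """Turn a string into a list of ranks that compares in ORDER order."""
    # Characters outside ORDER sort after everything in it, by code point.
    return [RANK.get(char, len(ORDER) + ord(char)) for char in s]


names = ["Neil", "Cerutti", "neil", "cerutti"]
print(sorted(names, key=custom_xfrm))  # ['Cerutti', 'cerutti', 'Neil', 'neil']
```

Note that this compares character by character, so 'Ab' sorts before 'aa'. That may or may not be what you want from an alphabetical sort.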

    --
    Neil Cerutti
    Whenever I see a homeless guy, I always run back and give him
    money, because I think: Oh my God, what if that was Jesus?
    --Pamela Anderson
    Neil Cerutti, Oct 17, 2006
    #6
  7. Ron Adam

    Jorgen Grahn Guest

    On Mon, 16 Oct 2006 22:22:47 -0500, Ron Adam <> wrote:
    ....
    > I see this is actually a very complex subject.

    ....
    > It looks to me this would be a good candidate for a configurable class.
    > Something preferably in the string module where it could be found easier.


    /And/ choosing a locale shouldn't mean changing a process-global state.
    Sometimes you want to perform something locale-dependent in locale A,
    followed by doing it in locale B. Switching locales today takes time and has
    the same problems as global variables (unless there is another interface I
    am not aware of).

    But I suspect that is already a well-known problem.
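
One way to limit the damage is to save and restore the collation locale around each sort. The sketch below (Python 3 syntax) does that with a context manager; the names collation_locale and sorted_in_locale are hypothetical. It does not remove the underlying problem: setlocale() still changes process-wide state, so this is not safe if other threads depend on the locale at the same time.

```python
import contextlib
import locale


@contextlib.contextmanager
def collation_locale(name):
    """Temporarily switch LC_COLLATE, restoring the previous value on exit."""
    saved = locale.setlocale(locale.LC_COLLATE)  # no second argument: query only
    try:
        locale.setlocale(locale.LC_COLLATE, name)
        yield
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)


def sorted_in_locale(words, name):
    """Sort words by the collation rules of the named locale."""
    with collation_locale(name):
        return sorted(words, key=locale.strxfrm)


# 'C' exists everywhere; other locale names depend on the platform.
print(sorted_in_locale(["b", "a", "B"], "C"))
```

If the requested locale is not installed, setlocale() raises locale.Error and the previous setting is left in place.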

    /Jorgen

    --
    // Jorgen Grahn <grahn@ Ph'nglui mglw'nafh Cthulhu
    \X/ snipabacken.dyndns.org> R'lyeh wgah'nagl fhtagn!
    Jorgen Grahn, Oct 17, 2006
    #7
  8. Ron Adam

    Ron Adam Guest

    Neil Cerutti wrote:
    > On 2006-10-17, Ron Adam <> wrote:
    >> Neil Cerutti wrote:
    >>> On 2006-10-16, Ron Adam <> wrote:
    >>>> I have several applications where I want to sort lists in
    >>>> alphabetical order. Most examples of sorting usually sort on
    >>>> the ord() order of the character set as an approximation.
    >>>> But that is not always what you want.
    >>> Check out strxfrm in the locale module.

    >> It looks to me this would be a good candidate for a
    >> configurable class. Something preferably in the string module
    >> where it could be found easier.
    >>
    >> Is there anyway to change the behavior of strxfrm or strcoll?
    >> For example have caps before lowercase, instead of after?

    >
    > You can probably get away with writing a strxfrm function that
    > spits out numbers that fit your definition of sorting.



    Since that function is 'C' coded in the builtin _locale, it can't be modified by
    python code.

    Looking around some more I found the documentation for the corresponding C
    functions and data structures. It looks like python may just wrap these.

    http://opengroup.org/onlinepubs/007908799/xbd/locale.html


    Here's one example of how to rewrite the Unicode collate in python.

    http://jtauber.com/blog/2006/01

    I haven't tried changing its behavior, but I did notice it treats words with
    hyphens in them differently than strxfrm.



    Here's one way to change caps order.

    a = ["Neil", "Cerutti", "neil", "cerutti"]

    locale.setlocale(locale.LC_ALL, '')
    tmp = [x.swapcase() for x in a]
    tmp.sort(key=locale.strxfrm)
    tmp = [x.swapcase() for x in tmp]
    print tmp


    ['Cerutti', 'cerutti', 'Neil', 'neil']
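
The same trick can be written as a key function, which avoids building the two temporary lists. This is a sketch in Python 3 syntax with made-up names (caps_first_key, caps_first_sorted). As with the version above, capitals only come first if the active locale sorts lowercase before uppercase, so the result depends on the platform and locale.

```python
import locale


def caps_first_key(s):
    """Collation key for s with the roles of upper and lower case exchanged."""
    return locale.strxfrm(s.swapcase())


def caps_first_sorted(words):
    # Uses whatever LC_COLLATE locale is currently active.
    return sorted(words, key=caps_first_key)


a = ["Neil", "Cerutti", "neil", "cerutti"]
print(caps_first_sorted(a))
```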



    Cheers,
    Ron
    Ron Adam, Oct 17, 2006
    #8
