Re: Custom alphabetical sort

Discussion in 'Python' started by Roy Smith, Dec 24, 2012.

  1. Roy Smith

    Roy Smith Guest

    In article <>,
    Pander Musubi <> wrote:

    > Hi all,
    >
    > I would like to sort according to this order:
    >
    > (' ', '.', '\'', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a',
    > 'A', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', 'b', 'B', 'c', 'C',
    > '?', '?', 'd', 'D', 'e', 'E', '?', '?', '?', '?', '?', '?', '?', '?', 'f',
    > 'F', 'g', 'G', 'h', 'H', 'i', 'I', '?', '?', '?', '?', '?', '?', '?', '?',
    > 'j', 'J', 'k', 'K', 'l', 'L', 'm', 'M', 'n', '?', 'N', '?', 'o', 'O', '?',
    > '?', '?', '?', '?', '?', '?', '?', '?', '?', 'p', 'P', 'q', 'Q', 'r', 'R',
    > 's', 'S', 't', 'T', 'u', 'U', '?', '?', '?', '?', '?', '?', '?', '?', 'v',
    > 'V', 'w', 'W', 'x', 'X', 'y', 'Y', 'z', 'Z')
    >
    > How can I do this? The default sorted() does not give the desired result.


    I'm assuming that doesn't correspond to some standard locale's collating
    order, so we really do need to roll our own encoding (and that you have
    a good reason for wanting to do this). I'm also assuming that what I'm
    seeing as question marks are really accented characters in some encoding
    that my news reader just isn't dealing with (it seems to think your post
    was in ISO-2022-CN (Simplified Chinese).

    I'm further assuming that you're starting with a list of unicode
    strings, the contents of which are limited to the above alphabet. I'm
    even further assuming that the volume of data you need to sort is small
    enough that efficiency is not a huge concern.

    Given all that, I would start by writing some code which turned your
    alphabet into a pair of dicts. One maps from the code point to a
    collating sequence number (i.e. ordinals), the other maps back.
    Something like (for python 2.7):

    alphabet = (' ', '.', '\'', '-', '0', '1', '2', '3', '4', '5',
    '6', '7', '8', '9', 'a', 'A', '?', '?', '?', '?',
    [...]
    'v', 'V', 'w', 'W', 'x', 'X', 'y', 'Y', 'z', 'Z')

    map1 = {c: n for n, c in enumerate(alphabet)}
    map2 = {n: c for n, c in enumerate(alphabet)}

    Next, I would write some functions which encode your strings as lists of
    ordinals (and back again)

    def encode(s):
    "encode('foo') ==> [34, 19, 19]" # made-up ordinals
    return [map1[c] for c in s]

    def decode(l):
    "decode([34, 19, 19]) ==> 'foo'"
    return ''.join(map2 for i in l)

    Use these to convert your strings to lists of ints which will sort as
    per your specified collating order, and then back again:

    encoded_strings = [encode(s) for s in original_list]
    encoded_strings.sort()
    sorted_strings = [decode(l) for l in encoded_strings]

    That's just a rough sketch, and completely untested, but it should get
    you headed in the right direction. Or at least one plausible direction.
    Old-time perl hackers will recognize this as the Schwartzian Transform.
     
    Roy Smith, Dec 24, 2012
    #1
    1. Advertising

  2. > > Hi all,
    >
    > >

    >
    > > I would like to sort according to this order:

    >
    > >

    >
    > > (' ', '.', '\'', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a',

    >
    > > 'A', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', 'b', 'B', 'c', 'C',

    >
    > > '?', '?', 'd', 'D', 'e', 'E', '?', '?', '?', '?', '?', '?', '?', '?', 'f',

    >
    > > 'F', 'g', 'G', 'h', 'H', 'i', 'I', '?', '?', '?', '?', '?', '?', '?', '?',

    >
    > > 'j', 'J', 'k', 'K', 'l', 'L', 'm', 'M', 'n', '?', 'N', '?', 'o', 'O', '?',

    >
    > > '?', '?', '?', '?', '?', '?', '?', '?', '?', 'p', 'P', 'q', 'Q', 'r', 'R',

    >
    > > 's', 'S', 't', 'T', 'u', 'U', '?', '?', '?', '?', '?', '?', '?', '?', 'v',

    >
    > > 'V', 'w', 'W', 'x', 'X', 'y', 'Y', 'z', 'Z')

    >
    > >

    >
    > > How can I do this? The default sorted() does not give the desired result.

    >
    >
    >
    > I'm assuming that doesn't correspond to some standard locale's collating
    >
    > order, so we really do need to roll our own encoding (and that you have
    >
    > a good reason for wanting to do this).


    It is for creating a Dutch dictionary. This sorting order is not to be found in an existing locale.

    > I'm also assuming that what I'm
    >
    > seeing as question marks are really accented characters in some encoding
    >
    > that my news reader just isn't dealing with (it seems to think your post
    >
    > was in ISO-2022-CN (Simplified Chinese).
    >
    >
    >
    > I'm further assuming that you're starting with a list of unicode
    >
    > strings, the contents of which are limited to the above alphabet.


    Correct.

    > I'm
    >
    > even further assuming that the volume of data you need to sort is small
    >
    > enough that efficiency is not a huge concern.


    Well, it is for 200,000 - 450,000 words but the code is allowed be slow. It will not be used for web application or something which requires a quick response.

    > Given all that, I would start by writing some code which turned your
    >
    > alphabet into a pair of dicts. One maps from the code point to a
    >
    > collating sequence number (i.e. ordinals), the other maps back.
    >
    > Something like (for python 2.7):
    >
    >
    >
    > alphabet = (' ', '.', '\'', '-', '0', '1', '2', '3', '4', '5',
    >
    > '6', '7', '8', '9', 'a', 'A', '?', '?', '?', '?',
    >
    > [...]
    >
    > 'v', 'V', 'w', 'W', 'x', 'X', 'y', 'Y', 'z', 'Z')
    >
    >
    >
    > map1 = {c: n for n, c in enumerate(alphabet)}
    >
    > map2 = {n: c for n, c in enumerate(alphabet)}


    OK, similar to Thomas' proposal.

    > Next, I would write some functions which encode your strings as lists of
    >
    > ordinals (and back again)
    >
    >
    >
    > def encode(s):
    >
    > "encode('foo') ==> [34, 19, 19]" # made-up ordinals
    >
    > return [map1[c] for c in s]
    >
    >
    >
    > def decode(l):
    >
    > "decode([34, 19, 19]) ==> 'foo'"
    >
    > return ''.join(map2 for i in l)
    >
    >
    >
    > Use these to convert your strings to lists of ints which will sort as
    >
    > per your specified collating order, and then back again:
    >
    >
    >
    > encoded_strings = [encode(s) for s in original_list]
    >
    > encoded_strings.sort()
    >
    > sorted_strings = [decode(l) for l in encoded_strings]
    >
    >
    >
    > That's just a rough sketch, and completely untested, but it should get
    >
    > you headed in the right direction. Or at least one plausible direction.
    >
    > Old-time perl hackers will recognize this as the Schwartzian Transform.


    I will test it and let you know. :) Pander
     
    Pander Musubi, Dec 24, 2012
    #2
    1. Advertising

  3. Roy Smith

    Roy Smith Guest

    In article <>,
    Pander Musubi <> wrote:

    > > I'm assuming that doesn't correspond to some standard locale's collating
    > > order, so we really do need to roll our own encoding (and that you have
    > > a good reason for wanting to do this).

    >
    > It is for creating a Dutch dictionary.


    Wait a minute. You're telling me that Python, of all languages, doesn't
    have a built-in way to sort Dutch words???
     
    Roy Smith, Dec 24, 2012
    #3
  4. >
    >
    >
    > > > I'm assuming that doesn't correspond to some standard locale's collating

    >
    > > > order, so we really do need to roll our own encoding (and that you have

    >
    > > > a good reason for wanting to do this).

    >
    > >

    >
    > > It is for creating a Dutch dictionary.

    >
    >
    >
    > Wait a minute. You're telling me that Python, of all languages, doesn't
    >
    > have a built-in way to sort Dutch words???


    Not when you want Roman characters with diacritics to be sorted in the normal a-Z range.
     
    Pander Musubi, Dec 24, 2012
    #4
  5. On 24/12/2012 17:40, Roy Smith wrote:
    > In article <>,
    > Pander Musubi <> wrote:
    >
    >>> I'm assuming that doesn't correspond to some standard locale's collating
    >>> order, so we really do need to roll our own encoding (and that you have
    >>> a good reason for wanting to do this).

    >>
    >> It is for creating a Dutch dictionary.

    >
    > Wait a minute. You're telling me that Python, of all languages, doesn't
    > have a built-in way to sort Dutch words???
    >


    There's a built-in called secret that's only available to those who are
    Dutch and members of the PSU.

    A slight aside, I understand that the BDFL is currently on holiday. For
    those who want a revolution now is as good a time as any :)

    --
    Cheers.

    Mark Lawrence.
     
    Mark Lawrence, Dec 24, 2012
    #5
  6. On 24 December 2012 16:18, Roy Smith <> wrote:

    > In article <>,
    > Pander Musubi <> wrote:
    >
    > > Hi all,
    > >
    > > I would like to sort according to this order:
    > >
    > > (' ', '.', '\'', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',

    > 'a',
    > > 'A', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', 'b', 'B', 'c',

    > 'C',
    > > '?', '?', 'd', 'D', 'e', 'E', '?', '?', '?', '?', '?', '?', '?', '?',

    > 'f',
    > > 'F', 'g', 'G', 'h', 'H', 'i', 'I', '?', '?', '?', '?', '?', '?', '?',

    > '?',
    > > 'j', 'J', 'k', 'K', 'l', 'L', 'm', 'M', 'n', '?', 'N', '?', 'o', 'O',

    > '?',
    > > '?', '?', '?', '?', '?', '?', '?', '?', '?', 'p', 'P', 'q', 'Q', 'r',

    > 'R',
    > > 's', 'S', 't', 'T', 'u', 'U', '?', '?', '?', '?', '?', '?', '?', '?',

    > 'v',
    > > 'V', 'w', 'W', 'x', 'X', 'y', 'Y', 'z', 'Z')
    > >
    > > How can I do this? The default sorted() does not give the desired result.

    >


    <snip>

    Given all that, I would start by writing some code which turned your
    > alphabet into a pair of dicts. One maps from the code point to a
    > collating sequence number (i.e. ordinals), the other maps back.
    > Something like (for python 2.7):
    >
    > alphabet = (' ', '.', '\'', '-', '0', '1', '2', '3', '4', '5',
    > '6', '7', '8', '9', 'a', 'A', '?', '?', '?', '?',
    > [...]
    > 'v', 'V', 'w', 'W', 'x', 'X', 'y', 'Y', 'z', 'Z')
    >
    > map1 = {c: n for n, c in enumerate(alphabet)}
    > map2 = {n: c for n, c in enumerate(alphabet)}
    >
    > Next, I would write some functions which encode your strings as lists of
    > ordinals (and back again)
    >
    > def encode(s):
    > "encode('foo') ==> [34, 19, 19]" # made-up ordinals
    > return [map1[c] for c in s]
    >
    > def decode(l):
    > "decode([34, 19, 19]) ==> 'foo'"
    > return ''.join(map2 for i in l)
    >
    > Use these to convert your strings to lists of ints which will sort as
    > per your specified collating order, and then back again:
    >
    > encoded_strings = [encode(s) for s in original_list]
    > encoded_strings.sort()
    > sorted_strings = [decode(l) for l in encoded_strings]
    >


    This isn't needed and the not-so-new way to do this is through .sort's key
    attribute.

    encoded_strings = [encode(s) for s in original_list]
    encoded_strings.sort()
    sorted_strings = [decode(l) for l in encoded_strings]

    changes to

    encoded_strings.sort(key=encode)

    [Which happens to be faster </reasonable_guess>]

    Hence you neither need map2 or decode:

    ## CODE ##

    alphabet = (
    ' ', '.', '\'', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a',
    'A', 'ä', 'Ä', 'á', 'Á', 'â', 'Â',
    'à', 'À', 'å', 'Å', 'b', 'B', 'c', 'C', 'ç', 'Ç', 'd', 'D', 'e', 'E', 'ë',
    'Ë', 'é', 'É', 'ê', 'Ê', 'è', 'È',
    'f', 'F', 'g', 'G', 'h', 'H', 'i', 'I', 'ï', 'Ï', 'í', 'Í', 'î','Î', 'ì',
    'Ì', 'j', 'J', 'k', 'K', 'l', 'L',
    'm', 'M', 'n', 'ñ', 'N', 'Ñ', 'o', 'O', 'ö', 'Ö', 'ó', 'Ó', 'ô', 'Ô', 'ò',
    'Ò', 'ø', 'Ø', 'p', 'P', 'q', 'Q',
    'r', 'R', 's', 'S', 't', 'T', 'u', 'U', 'ü', 'Ü', 'ú', 'Ú', 'û','Û', 'ù',
    'Ù', 'v', 'V', 'w', 'W', 'x', 'X',
    'y', 'Y', 'z', 'Z'
    )

    hashindex = {character:index for index, character in enumerate(alphabet)}
    def string2sortlist(string):
    return [hashindex for s in string]

    # Quickly make some stuff to sort. Let's try 200k, as that's what's
    suggested.
    import random
    things_to_sort = ["".join(random.sample(alphabet, random.randint(4, 6)))
    for _ in range(200000)]

    print(things_to_sort[:15])

    things_to_sort.sort(key=string2sortlist)

    print(things_to_sort[:15])

    ## END CODE ##

    Not-so-coincidentally, this is exactly the same as Ian Kelly's extension to
    Tomas Bach's method.
     
    Joshua Landau, Dec 24, 2012
    #6
  7. On Mon, 24 Dec 2012 11:18:37 -0500, Roy Smith wrote:

    > In article <>,
    > Pander Musubi <> wrote:
    >
    >> Hi all,
    >>
    >> I would like to sort according to this order:

    [...]
    > I'm assuming that doesn't correspond to some standard locale's collating
    > order, so we really do need to roll our own encoding (and that you have
    > a good reason for wanting to do this). I'm also assuming that what I'm
    > seeing as question marks are really accented characters in some encoding
    > that my news reader just isn't dealing with (it seems to think your post
    > was in ISO-2022-CN (Simplified Chinese).


    Good lord man, what sort of crappy newsreader software are you using? (It
    claims to be "MT-NewsWatcher/3.5.3b3 (Intel Mac OS X)" -- I think
    anything as bad as that shouldn't advertise what it is.) The OP's post
    was correctly labelled with an encoding, and not an obscure one:

    Content-Type: text/plain; charset=ISO-8859-1

    which if I remember correctly is Latin-1. If your newsreader can't handle
    that, surely it should default to UTF-8, which should give you the right
    results sans question marks.




    --
    Steven
     
    Steven D'Aprano, Dec 24, 2012
    #7
  8. On Monday, December 24, 2012 7:12:43 PM UTC+1, Joshua Landau wrote:
    > On 24 December 2012 16:18, Roy Smith <> wrote:
    >
    >
    >
    >
    > In article <>,
    >
    >  Pander Musubi <> wrote:
    >
    >
    >
    > > Hi all,

    >
    >
    > >

    >
    > > I would like to sort according to this order:

    >
    > >

    >
    > > (' ', '.', '\'', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9','a',

    >
    > > 'A', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', 'b', 'B', 'c', 'C',

    >
    > > '?', '?', 'd', 'D', 'e', 'E', '?', '?', '?', '?', '?', '?', '?', '?', 'f',

    >
    > > 'F', 'g', 'G', 'h', 'H', 'i', 'I', '?', '?', '?', '?', '?', '?', '?', '?',

    >
    > > 'j', 'J', 'k', 'K', 'l', 'L', 'm', 'M', 'n', '?', 'N', '?', 'o', 'O', '?',

    >
    > > '?', '?', '?', '?', '?', '?', '?', '?', '?', 'p', 'P', 'q', 'Q', 'r', 'R',

    >
    > > 's', 'S', 't', 'T', 'u', 'U', '?', '?', '?', '?', '?', '?', '?', '?', 'v',

    >
    >
    > > 'V', 'w', 'W', 'x', 'X', 'y', 'Y', 'z', 'Z')

    >
    > >

    >
    >
    > > How can I do this? The default sorted() does not give the desired result.

    >
    >
    >
    > <snip> 
    >
    >
    >
    >
    > Given all that, I would start by writing some code which turned your
    >
    > alphabet into a pair of dicts.  One maps from the code point to a
    >
    > collating sequence number (i.e. ordinals), the other maps back.
    >
    > Something like (for python 2.7):
    >
    >
    >
    > alphabet = (' ', '.', '\'', '-', '0', '1', '2', '3', '4', '5',
    >
    >             '6', '7', '8', '9', 'a', 'A', '?', '?', '?', '?',
    >
    >             [...]
    >
    >
    >             'v', 'V', 'w', 'W', 'x', 'X', 'y', 'Y', 'z', 'Z')
    >
    >
    >
    > map1 = {c: n for n, c in enumerate(alphabet)}
    >
    > map2 = {n: c for n, c in enumerate(alphabet)}
    >
    >
    >
    > Next, I would write some functions which encode your strings as lists of
    >
    > ordinals (and back again)
    >
    >
    >
    > def encode(s):
    >
    >    "encode('foo') ==> [34, 19, 19]"  # made-up ordinals
    >
    >    return [map1[c] for c in s]
    >
    >
    >
    > def decode(l):
    >
    >    "decode([34, 19, 19]) ==> 'foo'"
    >
    >     return ''.join(map2 for i in l)
    >
    >
    >
    > Use these to convert your strings to lists of ints which will sort as
    >
    > per your specified collating order, and then back again:
    >
    >
    >
    > encoded_strings = [encode(s) for s in original_list]
    >
    > encoded_strings.sort()
    >
    > sorted_strings = [decode(l) for l in encoded_strings]
    >
    >
    >
    > This isn't needed and the not-so-new way to do this is through .sort's key attribute.
    >
    >
    >
    >
    > encoded_strings = [encode(s) for s in original_list]
    > encoded_strings.sort()
    > sorted_strings = [decode(l) for l in encoded_strings]
    >
    >
    >
    > changes to
    >
    >
    >
    >
    > encoded_strings.sort(key=encode)
    >
    >
    >
    > [Which happens to be faster </reasonable_guess>]
    >
    >
    >
    >
    > Hence you neither need map2 or decode:
    >
    >
    > ## CODE ##
    >
    >
    >
    >
    >
    > alphabet = (
    > ' ', '.', '\'', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'A', 'ä', 'Ä', 'á', 'Á', 'â', 'Â',
    >
    >
    > 'à', 'À', 'å', 'Å', 'b', 'B', 'c', 'C', 'ç', 'Ç', 'd', 'D', 'e', 'E', 'ë', 'Ë', 'é', 'É', 'ê', 'Ê', 'è', 'È',
    >
    >
    > 'f', 'F', 'g', 'G', 'h', 'H', 'i', 'I', 'ï', 'Ï', 'í', 'Í', 'î', 'Î', 'ì', 'Ì', 'j', 'J', 'k', 'K', 'l', 'L',
    >
    >
    > 'm', 'M', 'n', 'ñ', 'N', 'Ñ', 'o', 'O', 'ö', 'Ö', 'ó', 'Ó', 'ô', 'Ô', 'ò', 'Ò', 'ø', 'Ø', 'p', 'P', 'q', 'Q',
    >
    >
    > 'r', 'R', 's', 'S', 't', 'T', 'u', 'U', 'ü', 'Ü', 'ú', 'Ú', 'û', 'Û', 'ù', 'Ù', 'v', 'V', 'w', 'W', 'x', 'X',
    >
    >
    > 'y', 'Y', 'z', 'Z'
    > )
    >
    >
    >
    > hashindex = {character:index for index, character in enumerate(alphabet)}
    >
    > def string2sortlist(string):
    > return [hashindex for s in string]
    >
    >
    >
    >
    > # Quickly make some stuff to sort. Let's try 200k, as that's what's suggested.
    > import random
    > things_to_sort = ["".join(random.sample(alphabet, random.randint(4, 6))) for _ in range(200000)]
    >
    >
    >
    >
    > print(things_to_sort[:15])
    >
    >
    > things_to_sort.sort(key=string2sortlist)
    >
    >
    >
    >
    > print(things_to_sort[:15])
    >
    >
    > ## END CODE ##
    >
    >
    >
    >
    > Not-so-coincidentally, this is exactly the same as Ian Kelly's extension to Tomas Bach's method.


    With Python2.7 I had to use

    alphabet = (
    u' ', u'.', u'\'', u'-', u'0', u'1', u'2', u'3', u'4', u'5', u'6', u'7', u'8', u'9', u'a', u'A', u'ä', u'Ä', u'á', u'Á', u'â', u'Â',
    u'à', u'À', u'å', u'Å', u'b', u'B', u'c', u'C', u'ç', u'Ç', u'd', u'D', u'e', u'E', u'ë', u'Ë', u'é', u'É', u'ê', u'Ê', u'è', u'È',
    u'f', u'F', u'g', u'G', u'h', u'H', u'i', u'I', u'ï', u'Ï', u'í', u'Í', u'î', u'Î', u'ì', u'Ì', u'j', u'J', u'k', u'K', u'l', u'L',
    u'm', u'M', u'n', u'ñ', u'N', u'Ñ', u'o', u'O', u'ö', u'Ö', u'ó',u'Ó', u'ô', u'Ô', u'ò', u'Ò', u'ø', u'Ø', u'p', u'P', u'q', u'Q',
    u'r', u'R', u's', u'S', u't', u'T', u'u', u'U', u'ü', u'Ü', u'ú', u'Ú', u'û', u'Û', u'ù', u'Ù', u'v', u'V', u'w', u'W', u'x', u'X',
    u'y', u'Y', u'z', u'Z'
    )

    to prevent

    Traceback (most recent call last):
    File "./sort.py", line 23, in <module>
    things_to_sort.sort(key=string2sortlist)
    File "./sort.py", line 15, in string2sortlist
    return [hashindex for s in string]
    KeyError: '\xc3'

    Thanks very much for this efficient code.
     
    Pander Musubi, Dec 24, 2012
    #8
  9. On Monday, December 24, 2012 7:12:43 PM UTC+1, Joshua Landau wrote:
    > On 24 December 2012 16:18, Roy Smith <> wrote:
    >
    >
    >
    >
    > In article <>,
    >
    >  Pander Musubi <> wrote:
    >
    >
    >
    > > Hi all,

    >
    >
    > >

    >
    > > I would like to sort according to this order:

    >
    > >

    >
    > > (' ', '.', '\'', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9','a',

    >
    > > 'A', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', 'b', 'B', 'c', 'C',

    >
    > > '?', '?', 'd', 'D', 'e', 'E', '?', '?', '?', '?', '?', '?', '?', '?', 'f',

    >
    > > 'F', 'g', 'G', 'h', 'H', 'i', 'I', '?', '?', '?', '?', '?', '?', '?', '?',

    >
    > > 'j', 'J', 'k', 'K', 'l', 'L', 'm', 'M', 'n', '?', 'N', '?', 'o', 'O', '?',

    >
    > > '?', '?', '?', '?', '?', '?', '?', '?', '?', 'p', 'P', 'q', 'Q', 'r', 'R',

    >
    > > 's', 'S', 't', 'T', 'u', 'U', '?', '?', '?', '?', '?', '?', '?', '?', 'v',

    >
    >
    > > 'V', 'w', 'W', 'x', 'X', 'y', 'Y', 'z', 'Z')

    >
    > >

    >
    >
    > > How can I do this? The default sorted() does not give the desired result.

    >
    >
    >
    > <snip> 
    >
    >
    >
    >
    > Given all that, I would start by writing some code which turned your
    >
    > alphabet into a pair of dicts.  One maps from the code point to a
    >
    > collating sequence number (i.e. ordinals), the other maps back.
    >
    > Something like (for python 2.7):
    >
    >
    >
    > alphabet = (' ', '.', '\'', '-', '0', '1', '2', '3', '4', '5',
    >
    >             '6', '7', '8', '9', 'a', 'A', '?', '?', '?', '?',
    >
    >             [...]
    >
    >
    >             'v', 'V', 'w', 'W', 'x', 'X', 'y', 'Y', 'z', 'Z')
    >
    >
    >
    > map1 = {c: n for n, c in enumerate(alphabet)}
    >
    > map2 = {n: c for n, c in enumerate(alphabet)}
    >
    >
    >
    > Next, I would write some functions which encode your strings as lists of
    >
    > ordinals (and back again)
    >
    >
    >
    > def encode(s):
    >
    >    "encode('foo') ==> [34, 19, 19]"  # made-up ordinals
    >
    >    return [map1[c] for c in s]
    >
    >
    >
    > def decode(l):
    >
    >    "decode([34, 19, 19]) ==> 'foo'"
    >
    >     return ''.join(map2 for i in l)
    >
    >
    >
    > Use these to convert your strings to lists of ints which will sort as
    >
    > per your specified collating order, and then back again:
    >
    >
    >
    > encoded_strings = [encode(s) for s in original_list]
    >
    > encoded_strings.sort()
    >
    > sorted_strings = [decode(l) for l in encoded_strings]
    >
    >
    >
    > This isn't needed and the not-so-new way to do this is through .sort's key attribute.
    >
    >
    >
    >
    > encoded_strings = [encode(s) for s in original_list]
    > encoded_strings.sort()
    > sorted_strings = [decode(l) for l in encoded_strings]
    >
    >
    >
    > changes to
    >
    >
    >
    >
    > encoded_strings.sort(key=encode)
    >
    >
    >
    > [Which happens to be faster </reasonable_guess>]
    >
    >
    >
    >
    > Hence you neither need map2 or decode:
    >
    >
    > ## CODE ##
    >
    >
    >
    >
    >
    > alphabet = (
    > ' ', '.', '\'', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'A', 'ä', 'Ä', 'á', 'Á', 'â', 'Â',
    >
    >
    > 'à', 'À', 'å', 'Å', 'b', 'B', 'c', 'C', 'ç', 'Ç', 'd', 'D', 'e', 'E', 'ë', 'Ë', 'é', 'É', 'ê', 'Ê', 'è', 'È',
    >
    >
    > 'f', 'F', 'g', 'G', 'h', 'H', 'i', 'I', 'ï', 'Ï', 'í', 'Í', 'î', 'Î', 'ì', 'Ì', 'j', 'J', 'k', 'K', 'l', 'L',
    >
    >
    > 'm', 'M', 'n', 'ñ', 'N', 'Ñ', 'o', 'O', 'ö', 'Ö', 'ó', 'Ó', 'ô', 'Ô', 'ò', 'Ò', 'ø', 'Ø', 'p', 'P', 'q', 'Q',
    >
    >
    > 'r', 'R', 's', 'S', 't', 'T', 'u', 'U', 'ü', 'Ü', 'ú', 'Ú', 'û', 'Û', 'ù', 'Ù', 'v', 'V', 'w', 'W', 'x', 'X',
    >
    >
    > 'y', 'Y', 'z', 'Z'
    > )
    >
    >
    >
    > hashindex = {character:index for index, character in enumerate(alphabet)}
    >
    > def string2sortlist(string):
    > return [hashindex for s in string]
    >
    >
    >
    >
    > # Quickly make some stuff to sort. Let's try 200k, as that's what's suggested.
    > import random
    > things_to_sort = ["".join(random.sample(alphabet, random.randint(4, 6))) for _ in range(200000)]
    >
    >
    >
    >
    > print(things_to_sort[:15])
    >
    >
    > things_to_sort.sort(key=string2sortlist)
    >
    >
    >
    >
    > print(things_to_sort[:15])
    >
    >
    > ## END CODE ##
    >
    >
    >
    >
    > Not-so-coincidentally, this is exactly the same as Ian Kelly's extension to Tomas Bach's method.


    With Python2.7 I had to use

    alphabet = (
    u' ', u'.', u'\'', u'-', u'0', u'1', u'2', u'3', u'4', u'5', u'6', u'7', u'8', u'9', u'a', u'A', u'ä', u'Ä', u'á', u'Á', u'â', u'Â',
    u'à', u'À', u'å', u'Å', u'b', u'B', u'c', u'C', u'ç', u'Ç', u'd', u'D', u'e', u'E', u'ë', u'Ë', u'é', u'É', u'ê', u'Ê', u'è', u'È',
    u'f', u'F', u'g', u'G', u'h', u'H', u'i', u'I', u'ï', u'Ï', u'í', u'Í', u'î', u'Î', u'ì', u'Ì', u'j', u'J', u'k', u'K', u'l', u'L',
    u'm', u'M', u'n', u'ñ', u'N', u'Ñ', u'o', u'O', u'ö', u'Ö', u'ó',u'Ó', u'ô', u'Ô', u'ò', u'Ò', u'ø', u'Ø', u'p', u'P', u'q', u'Q',
    u'r', u'R', u's', u'S', u't', u'T', u'u', u'U', u'ü', u'Ü', u'ú', u'Ú', u'û', u'Û', u'ù', u'Ù', u'v', u'V', u'w', u'W', u'x', u'X',
    u'y', u'Y', u'z', u'Z'
    )

    to prevent

    Traceback (most recent call last):
    File "./sort.py", line 23, in <module>
    things_to_sort.sort(key=string2sortlist)
    File "./sort.py", line 15, in string2sortlist
    return [hashindex for s in string]
    KeyError: '\xc3'

    Thanks very much for this efficient code.
     
    Pander Musubi, Dec 24, 2012
    #9
  10. Roy Smith

    Dave Angel Guest

    On 12/24/2012 06:19 PM, Pander Musubi wrote:
    > <snip>


    > to prevent
    >
    > Traceback (most recent call last):
    > File "./sort.py", line 23, in <module>
    > things_to_sort.sort(key=string2sortlist)
    > File "./sort.py", line 15, in string2sortlist
    > return [hashindex for s in string]
    > KeyError: '\xc3'
    >
    > Thanks very much for this efficient code.


    Perhaps you missed Ian Kelly's correction of Thomas Bach's approach:

    d = { k: v for v, k in enumerate(cs) }


    def collate(x):
    return list(map(d.get, x))

    sorted(data, key=collate)

    I'd use Ian Kelly's approach. It's not only more compact, it shouldn't
    give an exception for a character not in the table. At least, not for
    Python 2.x. I'm not sure about Python 3, since it can give an exception
    comparing None to int.


    --

    DaveA
     
    Dave Angel, Dec 25, 2012
    #10
  11. On 25 December 2012 06:18, Dave Angel <> wrote:

    > On 12/24/2012 06:19 PM, Pander Musubi wrote:


    <snip>

    > > Thanks very much for this efficient code.

    >
    > Perhaps you missed Ian Kelly's correction of Thomas Bach's approach:
    >
    > d = { k: v for v, k in enumerate(cs) }
    >
    >
    > def collate(x):
    > return list(map(d.get, x))
    >
    > sorted(data, key=collate)
    >
    > I'd use Ian Kelly's approach.



    Well, he was first to it :p


    > It's not only more compact,



    I take offence* here! The only difference was "list(map(d.get, x))" vs
    "[hashindex for s in string]" (11 chars) and my longer naming scheme. If
    you really care enough about those to sway your judgement, shame on you! ;)

    * Not really

    it shouldn't
    > give an exception for a character not in the table.



    That was a choice, not a bug. I didn't want undefined behaviour, so I
    thought I'd leave it to crash on "bad" input than sort in a way that may be
    unwanted. Even Ian Kelly gave this as way of coding it.


    > At least, not for
    > Python 2.x. I'm not sure about Python 3, since it can give an exception
    > comparing None to int.




    Please not that this post was done in humour (but with truth) to delay
    sleep. No offence to Ian or you intended ;).

    Happy After-Christmas!
     
    Joshua Landau, Dec 27, 2012
    #11
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. =?Utf-8?B?YmVub2l0?=

    ListItemCollection Sort Alphabetical

    =?Utf-8?B?YmVub2l0?=, Nov 3, 2005, in forum: ASP .Net
    Replies:
    4
    Views:
    11,482
    =?Utf-8?B?U3JlZWppdGggUmFt?=
    Nov 3, 2005
  2. David

    the Alphabetical Disorder

    David, Feb 27, 2004, in forum: Java
    Replies:
    4
    Views:
    501
    Collin VanDyck
    Feb 27, 2004
  3. Replies:
    4
    Views:
    459
    Peter Flynn
    Oct 23, 2005
  4. Replies:
    7
    Views:
    416
  5. Pander Musubi

    Custom alphabetical sort

    Pander Musubi, Dec 24, 2012, in forum: Python
    Replies:
    8
    Views:
    204
Loading...

Share This Page