sort order for strings of digits

Discussion in 'Python' started by djc, Oct 31, 2012.

  1. djc

    djc Guest

    I learn lots of useful things from the list, some not always welcome. No
    sooner had I found a solution to a minor inconvenience in my code, than
    a recent thread here drew my attention to the fact that it will not work
    for python 3. So suggestions please:

    TODO 2012-10-22: sort order numbers first then alphanumeric('a', 'ab', 'acd', 'bcd', '1a', 'a1', '222 bb', 'b a 4')
    ['1', '10', '101', '13', '1a', '2', '2000', '222 bb', '3', '31', '40',
    'a', 'a1', 'ab', 'acd', 'b a 4', 'bcd']



    Possibly there is a better way but for Python 2.7 this gives the
    required result

    Python 2.7.3 (default, Sep 26 2012, 21:51:14)
    [1, 2, 3, 10, 13, 31, 40, 101, 2000, '1a', '222 bb', 'a', 'a1', 'ab',
    'acd', 'b a 4', 'bcd']


    [str(x) for x in sorted(int(x) if x.isdigit() else x for x in n+s)]
    ['1', '2', '3', '10', '13', '31', '40', '101', '2000', '1a', '222 bb',
    'a', 'a1', 'ab', 'acd', 'b a 4', 'bcd']


    But not for Python 3
    Python 3.2.3 (default, Oct 19 2012, 19:53:16)
    ['1', '10', '101', '13', '1a', '2', '2000', '222 bb', '3', '31', '40',
    'a', 'a1', 'ab', 'acd', 'b a 4', 'bcd']
    Traceback (most recent call last):

    The best I can think of is to split the input sequence into two lists,
    sort each and then join them.
     
    djc, Oct 31, 2012
    #1
    1. Advertisements

  2. djc

    Hans Mulder Guest

    ['1', '10', '101', '13', '1a', '2', '2000', '222 bb', '3', '31', '40',
    'a', 'a1', 'ab', 'acd', 'b a 4', 'bcd']
    That might well be the most readable solution.


    Hope this helps,

    -- HansM
     
    Hans Mulder, Oct 31, 2012
    #2
    1. Advertisements

  3. djc

    Ian Kelly Guest

    In the example you have given they already seem to be split, so you
    could just do:

    sorted(n, key=int) + sorted(s)

    If that's not really the case, then you could construct (str, int)
    tuples as sort keys:

    sorted(n+s, key=lambda x: ('', int(x)) if x.isdigit() else (x, -1))

    Note that the empty string sorts before all numbers here, which may or
    may not be desirable.
     
    Ian Kelly, Oct 31, 2012
    #3
  4. Nope. I'm busy porting my own code from 2.7 to 3.3 and cmp seems to be
    very dead.

    This doesn't help either.

    c:\Users\Mark\Cash\Python>2to3.py
    Traceback (most recent call last):
    File "C:\Python33\Tools\Scripts\2to3.py", line 3, in <module>
    from lib2to3.main import main
    ImportError: No module named main
     
    Mark Lawrence, Oct 31, 2012
    #4
  5. OUCH... Just another reason for my to hang onto the 2.x series as
    long as possible (I only installed 2.7 this summer, I'd been using
    2.5... And a project at my former employment was still running 2.3 and
    having conflicts with a program with a binary extension written for 2.2
    -- and we couldn't find the source for the extension to rebuild it!)
     
    Dennis Lee Bieber, Oct 31, 2012
    #5
  6. According to your example code, you don't have to split the input because
    you already have two lists, one filled with numbers and one filled with
    strings.

    But I think that what you actually have is a single list of strings, and
    you are supposed to sort the strings such that they come in numeric order
    first, then alphanumerical. E.g.:

    ['9', '1000', 'abc2', '55', '1', 'abc', '55a', '1a']
    => ['1', '1a', '9', '55', '55a', '1000', 'abc', 'abc2']

    At least that is what I would expect as the useful thing to do when
    sorting.

    The trick is to take each string and split it into a leading number and a
    trailing alphanumeric string. Either part may be "empty". Here's a pure
    Python solution:

    from sys import maxsize # use maxint in Python 2
    def split(s):
    for i, c in enumerate(s):
    if not c.isdigit():
    break
    else: # aligned with the FOR, not the IF
    return (int(s), '')
    return (int(s[:i] or maxsize), s[i:])

    Now sort using this as a key function:

    py> L = ['9', '1000', 'abc2', '55', '1', 'abc', '55a', '1a']
    py> sorted(L, key=split)
    ['1', '1a', '9', '55', '55a', '1000', 'abc', 'abc2']


    The above solution is not quite general:

    * it doesn't handle negative numbers or numbers with a decimal point;

    * it doesn't handle the empty string in any meaningful way;

    * in practice, you may or may not want to ignore leading whitespace,
    or trailing whitespace after the number part;

    * there's a subtle bug if a string contains a very large numeric prefix,
    finding and fixing that is left as an exercise.
     
    Steven D'Aprano, Oct 31, 2012
    #6
  7. On the contrary. If you are using cmp with sort, your sorts are slow, and
    you should upgrade to using a key function as soon as possible.

    For small lists, you may not notice, but for large lists using a
    comparison function is a BAD IDEA.

    Here's an example: sorting a list of numbers by absolute value.

    py> L = [5, -6, 1, -2, 9, -8, 4, 3, -7, 2, -3]
    py> sorted(L, key=abs)
    [1, -2, 2, 3, -3, 4, 5, -6, -7, -8, 9]
    py> sorted(L, lambda a, b: cmp(abs(a), abs(b)))
    [1, -2, 2, 3, -3, 4, 5, -6, -7, -8, 9]

    But the amount of work done is radically different. Let's temporarily
    shadow the built-ins with patched versions:

    py> _abs = abs
    py> _abs, _cmp = abs, cmp
    py> c1 = c2 = 0
    py> def abs(x):
    .... global c1
    .... c1 += 1
    .... return _abs(x)
    ....
    py> def cmp(a, b):
    .... global c2
    .... c2 += 1
    .... return _cmp(a, b)
    ....

    Now we can see just how much work is done under the hood using a key
    function vs a comparison function:

    py> sorted(L, key=abs)
    [1, -2, 2, 3, -3, 4, 5, -6, -7, -8, 9]
    py> c1
    11

    So the key function is called once for each item in the list. But:


    py> c1 = 0 # reset the count
    py> sorted(L, lambda a, b: cmp(abs(a), abs(b)))
    [1, -2, 2, 3, -3, 4, 5, -6, -7, -8, 9]
    py> c1, c2
    (54, 27)

    The comparison function is called 27 times for a list of nine items (a
    average of 2.5 calls to cmp per item), and abs is called twice for each
    call to cmp. (Well, duh.)

    If the list is bigger, it gets worse:

    py> c2 = 0
    py> x = sorted(L*10, lambda a, b: cmp(abs(a), abs(b)))
    py> c2
    592

    That's an average of 5.4 calls to cmp per item. And it gets even worse as
    the list gets bigger.

    As your lists get bigger, the amount of work done calling the comparison
    function gets ever bigger still. Sorting large lists with a comparison
    function is SLOOOW.

    py> del abs, cmp # remove the monkey-patched versions
    py> L = L*1000000
    py> with Timer():
    .... x = sorted(L, key=abs)
    ....
    time taken: 9.165448 seconds
    py> with Timer():
    .... x = sorted(L, lambda a, b: cmp(abs(a), abs(b)))
    ....
    time taken: 63.579679 seconds


    The Timer() context manager used can be found here:

    http://code.activestate.com/recipes/577896
     
    Steven D'Aprano, Oct 31, 2012
    #7
  8. djc

    DJC Guest

    Sorry for the confusion, the pair of strings was just a way of testing
    variations on the input. So a sequence with any combination of strings
    that can be read as numbers and strings of chars that don't look like
    numbers (even if that string includes digits) is the expected input
    Not quite, what I want is to ensure that if the strings look like
    numbers they are placed in numerical order. ie 1 2 3 10 100 not 1 10 100
    2 3. Cases where a string has some leading digits can be treated as
    strings like any other.
    Well it depends on the use case. In my case the strings are column and
    row labels for a report. I want them to be presented in a convenient to
    read sequence. Which the lexical sorting of the strings that look like
    numbers is not. I want a reasonable do-what-i-mean default sort order
    that can handle whatever strings are used.

    That looks more than general enough for my purposes! I will experiment
    along those lines, thank you.
     
    DJC, Oct 31, 2012
    #8
  9. Correct, now fixed, thanks.
    As it's my own small code base I've blown away all references to cmp,
    it's rich comparisons all the way.
     
    Mark Lawrence, Nov 1, 2012
    #9
  10. But cmp_to_key doesn't actually improve anything. So I'm not sure how
    Py3 has achieved anything; Py2 supported key-based sorting already.

    ChrisA
     
    Chris Angelico, Nov 1, 2012
    #10
  11. You don't actually need to split the string, it's enough to return a
    pair consisting of the number of leading digits followed by the string
    as the key. Here's an implementation using takewhile:
    .... return sum(1 for c in takewhile(str.isdigit, s)) or 1000, s
    ....
    ['1', '1a', '9', '55', '55a', '1000', 'abc', 'abc2']

    Here's why it works:
    [(1, '9'), (4, '1000'), (1000, 'abc2'), (2, '55'), (1, '1'), (1000,
    'abc'), (2, '55a'), (1, '1a')]
     
    Arnaud Delobelle, Nov 1, 2012
    #11
  12. djc

    wxjmfauth Guest

    Le mercredi 31 octobre 2012 16:17:19 UTC+1, djc a écrit :
    .... '222 bb', '3', '31', '40', 'a', 'a1', 'ab',
    .... 'acd', 'b a 4', 'bcd'
    .... ]
    .... if e.isdigit():
    .... n.append(int(e))
    .... else:
    .... s.append(e)
    ....
    ['1', '2', '3', '10', '13', '31', '40', '101', '2000', '1a',
    '222 bb', 'a', 'a1', 'ab', 'acd', 'b a 4', 'bcd']

    jmf
     
    wxjmfauth, Nov 1, 2012
    #12
  13. Yes, but there is a lot of old code pre-dating key-based sorts. There's
    also some examples of code where it isn't obvious how to write it as a
    key-based sort, but a comparison function is simple. And people coming
    from other languages that only support comparison-based sorts (C?) will
    probably continue with what they know.

    Even though key-based sorting is better, there's a lot of comparison
    sorting that falls under "if it ain't broke, don't fix it".

    So even though key-based sorts are better, there are still comparison-
    based sorts in the wild. Python 2 has to support them. Python 3, which is
    allowed to break backwards compatibility, does not. So when porting to 3,
    you have to change the sorts.

    Most of the time it is simple to convert a comparison-based sort to a key-
    based sort. For the cases where you either can't come up with a good key
    function yourself, or were you want to do so mechanically, Python
    provides cmp_to_key.
     
    Steven D'Aprano, Nov 2, 2012
    #13
    1. Advertisements

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments (here). After that, you can post your question and our members will help you out.