# RE: outline-style sorting algorithm

Discussion in 'Python' started by Delaney, Timothy C (Timothy), Apr 29, 2004.

1. ### Delaney, Timothy C (Timothy)Guest

Thorsten Kampe wrote:

>>>>> foo

>> ['1.0', '1.0.1', '1.1.1', '1.2', '1.9', '1.10', '1.11', '1.20',
>> '1.20.1', '1.30']
>>>>> foo.sort()
>>>>> foo

>> ['1.0', '1.0.1', '1.1.1', '1.10', '1.11', '1.2', '1.20', '1.20.1',
>> '1.30', '1.9']
>>
>> Obviously 1.20.1 should be after 1.9 if we look at this as
>> dot-delimited integers, not as decimal numbers.

>
> You need some general approach to avoid the DSU thing:
>
> def funcsort(seq, func):
> """ sort seq by func(item) """
> seq = seq[:]
> seq.sort(lambda x, y: cmp(func(x), func(y)))
> return seq
>
> funcsort(foo, lambda x: map(int, x.split('.')))

I've seen you give this advice several times, today, and IMO it's
completely wrong.

DSU is *exactly* what you want to do. If you want to wrap it inside a
general function, that's fine, but DSU is in almost all cases preferred
to passing a comparison function - it's much faster.

def sorted (seq, cmp=None, key=None):
""" sort seq by func(item) """
if key is None:
seq = seq[:]
else:
seq = [(key(e), i, e) for i, e in enumerate(seq)]

seq.sort(cmp)

if key is None:
return seq
else:
return [i[-1] for i in seq]

foo = ['1.0', '1.0.1', '1.1.1', '1.2', '1.9', '1.10', '1.11',
'1.20', '1.20.1', '1.30']

foo = sorted(foo, key=lambda x: map(int, x.split('.')))
print foo

Note that Python 2.4 will have DSU as a built-in idiom to `list.sort`
i.e. `list.sort(key=func)` but will be somewhat more efficient than the
above. Likewise there will be a new builtin `sorted` which has exactly
the same semantics as the above - it is stable, and it does not ever
compare the actual value if a key function is supplied - only the key
value (the above also compares the position, but that's an
implementation detail and has no visible effect).

Tim Delaney

Delaney, Timothy C (Timothy), Apr 29, 2004

2. ### Thorsten KampeGuest

* Delaney, Timothy C (Timothy) (2004-04-29 02:52 +0100)
> Thorsten Kampe wrote:
>>>>>> foo
>>> ['1.0', '1.0.1', '1.1.1', '1.2', '1.9', '1.10', '1.11', '1.20',
>>> '1.20.1', '1.30']
>>>>>> foo.sort()
>>>>>> foo
>>> ['1.0', '1.0.1', '1.1.1', '1.10', '1.11', '1.2', '1.20', '1.20.1',
>>> '1.30', '1.9']
>>>
>>> Obviously 1.20.1 should be after 1.9 if we look at this as
>>> dot-delimited integers, not as decimal numbers.

>>
>> You need some general approach to avoid the DSU thing:
>>
>> def funcsort(seq, func):
>> """ sort seq by func(item) """
>> seq = seq[:]
>> seq.sort(lambda x, y: cmp(func(x), func(y)))
>> return seq
>>
>> funcsort(foo, lambda x: map(int, x.split('.')))

>
> I've seen you give this advice several times, today, and IMO it's
> completely wrong.
>
> DSU is *exactly* what you want to do. If you want to wrap it inside a
> general function, that's fine, but DSU is in almost all cases preferred
> to passing a comparison function - it's much faster.

My advice was to use a more general approach like 'funcsort' to avoid
reiventing the wheel for every problem. DSU is okay for funcsort.

> def sorted (seq, cmp=None, key=None):

What is 'cmp=None' for?

> """ sort seq by func(item) """
> if key is None:
> seq = seq[:]
> else:
> seq = [(key(e), i, e) for i, e in enumerate(seq)]
>
> seq.sort(cmp)

Are you passing a second comparison function? And if, isn't that the
same as I did and to which you said "preferred to passing a comparison
function"? Are you shadowing the builtin cmp function?

> if key is None:
> return seq
> else:
> return [i[-1] for i in seq]

What is 'key=None' for? Is that for speeding up if someone wants to
pass the indentity function (f(x)=x; lambda x: x)?

> Note that Python 2.4 will have DSU as a built-in idiom to `list.sort`
> i.e. `list.sort(key=func)` but will be somewhat more efficient than the
> above. Likewise there will be a new builtin `sorted` which has exactly
> the same semantics as the above - it is stable, and it does not ever
> compare the actual value if a key function is supplied - only the key
> value (the above also compares the position, but that's an
> implementation detail and has no visible effect).

If it has no visible effects than it would be useless. In my opinion
it has the effect that two items that have the same func(x) stay in
the same position as before. Is that desirable? Was that your goal?

>>> sorted([4, 2], None, lambda x: x % 2)

[4, 2], but [2, 4] if the index was omitted

Thorsten

Thorsten Kampe, Apr 29, 2004

3. ### Thorsten KampeGuest

* Delaney, Timothy C (Timothy) (2004-04-29 02:52 +0100)
> Thorsten Kampe wrote:
>>>>>> foo
>>> ['1.0', '1.0.1', '1.1.1', '1.2', '1.9', '1.10', '1.11', '1.20',
>>> '1.20.1', '1.30']
>>>>>> foo.sort()
>>>>>> foo
>>> ['1.0', '1.0.1', '1.1.1', '1.10', '1.11', '1.2', '1.20', '1.20.1',
>>> '1.30', '1.9']
>>>
>>> Obviously 1.20.1 should be after 1.9 if we look at this as
>>> dot-delimited integers, not as decimal numbers.

>>
>> You need some general approach to avoid the DSU thing:
>>
>> def funcsort(seq, func):
>> """ sort seq by func(item) """
>> seq = seq[:]
>> seq.sort(lambda x, y: cmp(func(x), func(y)))
>> return seq
>>
>> funcsort(foo, lambda x: map(int, x.split('.')))

>
> I've seen you give this advice several times, today, and IMO it's
> completely wrong.
>
> DSU is *exactly* what you want to do. If you want to wrap it inside a
> general function, that's fine, but DSU is in almost all cases preferred
> to passing a comparison function - it's much faster.

If made some tests with lists of one million items:

The first test was with a randomized sequence of numbers from 1 -
1000000. Duplicate numbers were allowed. The comparing function was
the identity (lambda x: x).

You DSU approach took about 43 s and mine took about 86 s.
Than I tested range(1000000) in reverse order:
Your sorted program took about 24 s and mine about 5!

So you cannot simply state that DSU is faster than passing a function
if the structure of the input you want to sort is unknown...

Thorsten

Thorsten Kampe, Apr 29, 2004
4. ### Anton VredegoorGuest

"Delaney, Timothy C (Timothy)" <> wrote:

> foo = sorted(foo, key=lambda x: map(int, x.split('.')))
> print foo

Generality is also about making as little assumptions about the data
as possible and still provide minimal functionality. For example
dropping the "integer" assumption:

def _cmp(x,y):
L,R = x.split('.'),y.split('.')
for a,b in zip(L,R):
r = cmp(len(a),len(b)) or cmp(a,b)
if r:
return r
return cmp(len(L),len(R))

def test():
L = ['1.0', '1.0.1', '1.1.1', '1.2', '1.10',
'1.11', '1.20', '1.20.1', '1.30', '1.9']
L.sort(_cmp)
print L

if __name__=='__main__':
test()

Anton

Anton Vredegoor, Apr 29, 2004
5. ### Terry ReedyGuest

"Thorsten Kampe" <> wrote in message
news:...
> If made some tests with lists of one million items:
>
> The first test was with a randomized sequence of numbers from 1 -
> 1000000. Duplicate numbers were allowed. The comparing function was
> the identity (lambda x: x).

I do not understand what you did, and therefore cannot judge results. A
compare function takes two args (the objects to be compared) and
returns -1, 0, 1. (I believe 0, 1 for <= or >t may be sufficient
currently). How did you use the one-param identity?

Terry J. Reedy

Terry Reedy, Apr 29, 2004
6. ### Thorsten KampeGuest

* Terry Reedy (2004-04-29 20:03 +0100)
> "Thorsten Kampe" <> wrote in message
> news:...
>> If made some tests with lists of one million items:
>>
>> The first test was with a randomized sequence of numbers from 1 -
>> 1000000. Duplicate numbers were allowed. The comparing function was
>> the identity (lambda x: x).

>
> I do not understand what you did, and therefore cannot judge results. A
> compare function takes two args (the objects to be compared) and
> returns -1, 0, 1. (I believe 0, 1 for <= or >t may be sufficient
> currently).

That's the "traditional" approach (and it doesn't make much sense -
see the "Searching and Sorting FAQ" in the Python Cookbook from Tim
Peters: "Why does Python use the three out-come cmp for sorting? Why
doesn't it use a simple less-than comparison instead?". It better to
abstract the function from the comparison.

> How did you use the one-param identity?

I assembled my "pass function" approach and TCD's DSU approach into a
single program[1]

Then I generated two lists[2]:
>>> a = range(1000000); a.reverse()

and
>>> b = randseq(1, 1000000, 1000000, 'repeat')

(meaning: create a list with one million integers 1 <= n <= 1000000;
allow duplicate numbers)

Then I timed [3]
timer(funcsort, a, lambda x: x, 'dsu')) -> 43 sec
timer(funcsort, a, lambda x: x, 'pass')) -> 86 sec
timer(funcsort, b, lambda x: x, 'dsu')) -> 24 sec
timer(funcsort, b, lambda x: x, 'pass')) -> 5 sec

[1]
def funcsort(seq, func, modus = 'dsu'):
""" sort seq by func(item) """
if func is None:
seq = seq[:]
seq.sort()
return seq
elif modus == 'dsu':
seq = [(func(item), index, item) for index, item in enumerate(seq)]
seq.sort()
return [item[2] for item in seq]
elif modus == 'pass':
seq = seq[:]
seq.sort(lambda x, y: cmp(func(x), func(y)))
return seq
[2]
def randseq(range_start, range_end = 0, count = 0, modus = 'norepeat'):
from random import randint

if modus == 'repeat':
return [randint(range_start, range_end) for index in range(count)]

elif isinstance(range_start, int):
return randseq(range(range_start, range_end + 1))[:count]

else: # 'range_start' is a seq, so shuffle
range_start = range_start[:]
randomseq = []

for i in range(len(range_start)):
itemindex = randint(0, len(range_start) - 1)
randomseq.append(range_start[itemindex])
del range_start[itemindex]
return randomseq
[3]
def timer(iteration, *func_and_args):
"""
print the time elapsed (in seconds) evaluating func iteration
times
(default is '1') """
import gc, \
time
from colour import ltyellow, \
reset_color

if isinstance(iteration, int):
func, args = func_and_args[0], func_and_args[1:]
else:
# if first argument is not a number, set func to iteration and iteration
# to '1'
iteration, func, args = 1, iteration, func_and_args

iteration = range(iteration)
gc.collect() # force garbage collection
start_time_cpu = time.clock()
start_time_total = time.time()

for i in iteration:
func(*args)
print ltyellow, 'cpu: %(cpu).3f,' % {'cpu': time.clock() - start_time_cpu}, \
'total: %(total).3f' % {'total': time.time() - start_time_total}, reset_color

Thorsten Kampe, Apr 29, 2004

## Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.