no more comparisons

C

Carl Banks

I was surprised to see that
comparison is slated for death
in Python 3000.

For example:http://www.python.org/dev/peps/pep-3100/
list.sort() and builtin.sorted() methods: eliminate cmp parameter [27] [done]


Hmm, wasn't aware they were taking it that far. You should almost
always avoid using the cmp parameter because it's very inefficient;
instead, look at the key parameter which is a function that maps
objects in a your sequence to a sort key and sorts on that.

So instead of (for a simple example):

s.sort(cmp=lambda a,b: cmp(a.get_id(),b.get_id()))

You would use:

s.sort(key=lambda a:a.get_id())

(However, there are rare cases where you can't easily map your items
to a sortable builtin. I suppose in those cases you'll have to use a
custom comparison proxy. But I digress.)

But there is a rumor of a PEP to restore comparisons.http://mail.python.org/pipermail/python-3000/2008-January/011764.html

Is that going anywhere?
No.


Also, what is the core motivation for removing this functionality?

The basically replaced it with a better one. Instead of the cmp
methods above, use the key method. Instead of __cmp__ method for
overriding object operators, use the rich comparison methods: __lt__,
__gt__, and so on.

Python 2.x currently implements both cmp and rich comparisons at the
same time, but that creates a lot of underlying complexity, so they
got rid of it.


Carl Banks
 
P

Paul Rubin

Carl Banks said:
For example:http://www.python.org/dev/peps/pep-3100/
list.sort() and builtin.sorted() methods: eliminate cmp parameter [27] [done]

Hmm, wasn't aware they were taking it that far. You should almost
always avoid using the cmp parameter because it's very inefficient;

I don't see what's so inefficient about it necessarily. The key
argument invokes a DSU scheme that allocates and releases potentially
a lot of memory, which both uses time and increases the program's
memory region. The cmp option should not be removed. However,
requiring it to be specified as a keyword parameter instead of
just passed as an unlabelled arg is fine.
 
A

Alan Isaac

Paul said:
The cmp option should not be removed. However, requiring
it to be specified as a keyword parameter instead of just
passed as an unlabelled arg is fine.



Sure; I would have no problem with that.

But that is not what is happening.

As for Carl's suggestion to use ``key``:
this is already possible when it is convenient,
but it is not always convenient. (Even aside
from memory considerations.)



By the way, I even saw mention of even removing the

``cmp`` built-in.



Cheers,

Alan Isaac
 
T

Terry Reedy

| > Hmm, wasn't aware they were taking it that far. You should almost
| > always avoid using the cmp parameter because it's very inefficient;
|
| I don't see what's so inefficient about it necessarily.

The key function is called once per list item, for n calls total. The
comparision function is called once per comparision. There are at least
n-1 such calls and typically something on the order of n * lg2(n) calls.
 
D

Dan Bishop

Sure; I would have no problem with that.

But that is not what is happening.

As for Carl's suggestion to use ``key``:
this is already possible when it is convenient,
but it is not always convenient. (Even aside
from memory considerations.)

def cmp_key(cmp_fn):
class CmpWrapper(object):
def __init__(self, obj):
self.obj = obj
def __cmp__(self, other):
return cmp_fn(self.obj, other.obj)
return CmpWrapper
 
P

Paul Rubin

Terry Reedy said:
| I don't see what's so inefficient about it necessarily.

The key function is called once per list item, for n calls total. The
comparision function is called once per comparision. There are at least
n-1 such calls and typically something on the order of n * lg2(n) calls.

Right. Depending on the application, calling the comparison function
lg(n) times may be faster than calling the key function once.

Example: you want to sort a list of strings case-independently, each
string containing a few pages worth of text. There are 10000 strings,
each maybe 10000 chars long. Then (untested):

x.sort(xs, key=lower)

makes a copy of each of the 10000 strings, makes 10000 tuples, sorts,
then releases the temporary strings and tuples. At least 100
megabytes of memory allocated and 100 million chars copied, even
though any give pair of strings will usually differ somewhere in the
first few characters and there's no need to look past those.

from string import lower
from operator import eq
from itertools import *

def xcmp(a,b):
for x,y in izip(a,b)):
c = cmp(x.lower() y.lower())
if c: return c
return 0

x.sort(xcmp)

runs the comparison function about 14000 times, looking at only a few
bytes on most of the calls, and using only a small constant amount of
auxiliary storage, and is likely to be much faster than all that
copying.

Hmm, there is a slight problem with xcmp-- it will fail if a is a
prefix of b. That's fixable but anyway it still makes its point as
written.
 
T

Terry Reedy

| > | I don't see what's so inefficient about it necessarily.
| >
| > The key function is called once per list item, for n calls total. The
| > comparision function is called once per comparision. There are at
least
| > n-1 such calls and typically something on the order of n * lg2(n)
calls.
|
| Right. Depending on the application, calling the comparison function
| lg(n) times may be faster than calling the key function once.
|
| Example: you want to sort a list of strings case-independently, each
| string containing a few pages worth of text. There are 10000 strings,
| each maybe 10000 chars long.

If the strings are pulled in from a file, or stored off to a file, as I
would expect in and real app of this sort, the in-memory data to be sorted
could/should be tuples of the lowercased strings and files positions. Then
no key param is needed.

| Then (untested):
|
| x.sort(xs, key=lower)
|
| makes a copy of each of the 10000 strings, makes 10000 tuples, sorts,
| then releases the temporary strings and tuples. At least 100
| megabytes of memory allocated and 100 million chars copied, even
| though any give pair of strings will usually differ somewhere in the
| first few characters and there's no need to look past those.
|
| from string import lower
| from operator import eq
| from itertools import *
|
| def xcmp(a,b):
| for x,y in izip(a,b)):
| c = cmp(x.lower() y.lower())
| if c: return c
| return 0
|
| x.sort(xcmp)
|
| runs the comparison function about 14000 times,

14 * 10000 = 140000, not 14000. For each of these, you have a call to izip
and multiple calls to .next, cmp, and .lower.

That said, this use case of where the potentially needed key is long
whereas the actually needed key is much shorter is a better use case than
anything I have seen so far.

Still, as I figure it, the initial part of each text is lowercased around
28 times, on average. So there can only be a speed saving if the text is
more than 28 times the effective actual key.

To actually do such a thing, at least more than once, I would be tempted to
have the key function take just the first k chars, for k perhaps 50. If
the texts average, say, 1500 chars, then the key array would be 1/30 *
perhaps 2 (for object overhead) = 1/15 of the original array space. Then
run over the 'sorted' array to see if there are any cases where the first
50 chars match and if so, check more to see if a swap needs to be made.
Easiest would be to do a simple bubble or insert sort with the full
compare.

tjr
 
D

Duncan Booth

Paul Rubin said:
Right. Depending on the application, calling the comparison function
lg(n) times may be faster than calling the key function once.

Example: you want to sort a list of strings case-independently, each
string containing a few pages worth of text. There are 10000 strings,
each maybe 10000 chars long. Then (untested):

x.sort(xs, key=lower)

makes a copy of each of the 10000 strings, makes 10000 tuples, sorts,
then releases the temporary strings and tuples. At least 100
megabytes of memory allocated and 100 million chars copied, even
though any give pair of strings will usually differ somewhere in the
first few characters and there's no need to look past those.

Not tuples actually, there's a special sortwrapper type which is used.
This is slightly better as it stops the comparison leaking through to
the original values without having to store an index value.

As Dan has pointed out, for those rare cases where it actually is better
to use cmp than key you can easily create a wrapper. The only change is
that you now have to work at using cmp instead of it being the first
thing you think of. So here you could use his wrapper and your
comparison function together:

x.sort(key=cmp_key(xcmp))

I'd agree though that the comparison wrapper should be reasonably well
publicised: I'd remembered that there was an easy way to implement cmp
in terms of key but not what it was.
 
A

Alan Isaac

C

Carl Banks

Apparently I'm overlooking something obvious ...

how is this supposed to work if __cmp__ is no longer

being called? (Which was my understanding.)


It won't. In Python 3.0 you'd have to write this class in terms of
rich comparisons (__lt__, __gt__, etc.).


Carl Banks
 
A

Alan Isaac

Dan said:
On Mar 13, 12:38 pm, Alan Isaac wrote:





Carl said:
It won't. In Python 3.0 you'd have to write this class in terms of
rich comparisons (__lt__, __gt__, etc.).





Exactly. So something simple (define an anonymous function)

has become a bit of a pain.



On the other hand, I've looked through my extant code and

have not found a use of ``cmp`` that I cannot work around.

So maybe this is not as bad as I feared. What are some use

cases that will clearly be harder (i.e., at least require

a slightly elaborate wrapper) after this change?



Cheers,

Alan Isaac
 
M

Mark Dickinson

So maybe this is not as bad as I feared.  What are some use

cases that will clearly be harder (i.e., at least require

a slightly elaborate wrapper) after this change?

Sorting tuples, where the second item in the tuple should
have the opposite ordering to the first is going to be a bit
of a pain. Or even worse, where the ordering of the second
item depends on the value of the first item in the tuple.

For example, suppose that (for whatever contrived reason)
you're representing integers in (sign, magnitude) format
by tuples (s, i), where s = 0 or 1 (0 for positive, 1 for
negative) and i is a string representing the absolute
value of the integer. So

(0, '3') represents 3
(1, '14') represents 14

and you want to sort according to numeric value. One
can certainly come up with a key that works, but it's
not quite as straightforward as writing a custom __cmp__.

This isn't a totally contrived example: the internal
Decimal format isn't so different from this.

Mark
 
A

Alan Isaac

Mark said:
Sorting tuples, where the second item in the tuple should
have the opposite ordering to the first is going to be
a bit of a pain. Or even worse, where the ordering of the
second item depends on the value of the first item in the



This is like some examples where I had used cmp,

but if I understand correctly I think it is not a problem.


For example, suppose that (for whatever contrived reason)
you're representing integers in (sign, magnitude) format
by tuples (s, i), where s = 0 or 1 (0 for positive, 1 for
negative) and i is a string representing the absolute
value of the integer. So



Does this do it? ::



key= lambda x: (-x[1],int(x2))



Here I am depending on the lexicographic sorting of tuples.
Without that there would be real trouble.



Cheers,

Alan Isaac
 
D

Dan Bishop

Mark said:
Sorting tuples, where the second item in the tuple should
have the opposite ordering to the first is going to be
a bit of a pain. Or even worse, where the ordering of the
second item depends on the value of the first item in the
tuple.

This is like some examples where I had used cmp,

but if I understand correctly I think it is not a problem.
For example, suppose that (for whatever contrived reason)
you're representing integers in (sign, magnitude) format
by tuples (s, i), where s = 0 or 1 (0 for positive, 1 for
negative) and i is a string representing the absolute
value of the integer. So

Does this do it? ::

key= lambda x: (-x[1],int(x2))

Here I am depending on the lexicographic sorting of tuples.
Without that there would be real trouble.

Even assuming you meant (-x[0], int(x[1])), that sorts negative
numbers in the wrong order. You want

key = lambda x: (-1 if x[0] else 1) * int(x[1])
 
M

Mark Dickinson

Does this do it? ::

    key= lambda x: (-x[1],int(x2))

Here I am depending on the lexicographic sorting of tuples.
Without that there would be real trouble.

Close. :)

I think it needs to be something like:

key = lambda x: (-x[0], int(x[1]) if x[0] == 0 else -int(x[1]))

Relying on lexicographic sorting of tuples seems okay to me.

What bugs me more about this example is that one has to rely
on the existence of a way to negate the usual ordering. This
is fine for numbers (where you can use the unary - operator)
or for numeric strings (where you can convert to int and then
use -), but it's not too much of a stretch to imagine cases
where neither of those options is available (e.g. non-numeric
strings), and where one ends up moving further and further
from readable code and into the realm of arcane trickery...

Mark
 
A

Alan Isaac

Dan said:
Even assuming you meant (-x[0], int(x[1])), that sorts negative
numbers in the wrong order. You want
key = lambda x: (-1 if x[0] else 1) * int(x[1])





Sorry, was speed typing (very badly) and the question

actually had two different problem descriptions.

As you suggest, what I meant was::



key= lambda x: (-x[0], int(x[1]))



which was meant to address:



"Sorting tuples, where the second item in the tuple

should have the opposite ordering to the first is going

to be a bit of a pain."



But you are quite right, the problem was then specified to

numerical order, for which I like your solution. Or even::



key= lambda x: -int(x[1]) if x[0] else int(x[1])





The key point being, if you will, that this use case does

not support the idea that relying on ``key`` will be so bad.

So, what is a case that is really uncomfortable?



Thanks,

Alan Isaac
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,581
Members
45,056
Latest member
GlycogenSupporthealth

Latest Threads

Top