Sorting: too different times. Why?

N

n00m

Any comment:

class Vector:
def __init__(self, x, y):
self.x = x
self.y = y
def __cmp__(self, v):
if self.x < v.x and self.y > v.y:
return -1
return 0

def v_cmp(v1, v2):
if v1.x < v2.x and v1.y > v2.y:
return -1
return 0

from random import randint
from time import time

a = []
for i in range(200000):
a += [Vector(randint(0, 500000), randint(0, 500000))]
b = a[:]
c = a[:]

print 'Sorting...'

t = time()
b.sort(cmp=v_cmp)
print time() - t

t = time()
c.sort()
print time() - t

print b == c


Sorting...
0.906000137329
6.57799983025
True
 
N

n00m

I was expecting the 1st method would be *slower* than the 2nd one :)
Or at least equal... Just random ("intuitive") expectations
 
S

Steven D'Aprano

In the subject line, you write "too different times". You actually want
"two", the number, not "too" as in "too many", "too much". Lots of native
English speakers get this wrong too :)

Any comment:

class Vector:
def __init__(self, x, y):
self.x = x
self.y = y
def __cmp__(self, v):
if self.x < v.x and self.y > v.y:
return -1
return 0


Modern versions of Python (since 2.2 I think?) use __lt__ or __gt__ for
sorting. If the class doesn't have a __lt__ method, Python falls back on
__cmp__.
b.sort(cmp=v_cmp)

This is relatively fast, because you pass a comparison function directly,
so Python doesn't need to look for a __lt__ method then fall back to
__cmp__. It just uses v_cmp, every time.


This is slower, because every comparison looks up the __lt__ and fails,
then tries the __cmp__.

If you change the definition of Vector to include rich comparison methods
as detailed here:

http://docs.python.org/reference/datamodel.html#object.__lt__

sorting will probably be significantly faster still.
 
D

Diez B. Roggisch

n00m said:
Any comment:

class Vector:
def __init__(self, x, y):
self.x = x
self.y = y
def __cmp__(self, v):
if self.x < v.x and self.y > v.y:
return -1
return 0

def v_cmp(v1, v2):
if v1.x < v2.x and v1.y > v2.y:
return -1
return 0

from random import randint
from time import time

a = []
for i in range(200000):

Use xrange instead (unless you are under python3), because for loops you
don't need the full list range creates - xrange is just a generator.
a += [Vector(randint(0, 500000), randint(0, 500000))]

Better use .append here, looks nicer and should also be a bit faster.
b = a[:]
c = a[:]

print 'Sorting...'

t = time()
b.sort(cmp=v_cmp)
print time() - t

t = time()
c.sort()
print time() - t

print b == c


Sorting...
0.906000137329
6.57799983025

I think the main reason is that method-dispatch is more expensive than
function-dispatch. The former must create a bound method before calling,
the latter just works out of the box.

Things get better if you do this:

t = time()
c.sort(cmp=Vector.__cmp__)
print time() - t


Although not the exact same performance - I get

Sorting...
0.677843093872
1.4283311367
True

Diez
 
C

Chris Rebert

I was expecting the 1st method would be *slower* than the 2nd one :)
Or at least equal... Just random ("intuitive") expectations

The second method repeatedly looks up left_item.__class__.__cmp__
(i.e. Vector.__cmp__) when doing the necessary comparisons between the
list items; while these lookups are optimized and are fast, they are
not free and cannot be skipped because Python doesn't know the list
contains only Vectors.
The first method uses the single provided comparison function and thus
does no such lookups; hence, it's faster.

That's my guess anyway.

Cheers,
Chris
 
M

Mark Dickinson

Any comment:

class Vector:
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def __cmp__(self, v):
        if self.x < v.x and self.y > v.y:
            return -1
        return 0

def v_cmp(v1, v2):
    if v1.x < v2.x and v1.y > v2.y:
        return -1
    return 0

from random import randint
from time import time

a = []
for i in range(200000):
    a += [Vector(randint(0, 500000), randint(0, 500000))]
b = a[:]
c = a[:]

print 'Sorting...'

t = time()
b.sort(cmp=v_cmp)
print time() - t

t = time()
c.sort()
print time() - t

print b == c

Sorting...
0.906000137329
6.57799983025
True

Do you get the same magnitude difference if you make Vector a new-
style
class? (I.e., use "class Vector(object)" instead of "class Vector
()".)

Mark
 
D

Dave Angel

n00m said:
Any comment:

<snip>
def v_cmp(v1, v2):
if v1.x < v2.x and v1.y > v2.y:
return -1
return 0
The second part of the compound if is backwards. So if this is headed
for production code, it better get fixed.

DaveA
 
N

n00m

Do you get the same magnitude difference
if you make Vector a new-style class?

Yes (I mean "No"): new-style's much faster

And now it's elephants instead of vectors.
Def: an elephant is smarter than another one IIF
its size is strictly less but its IQ is strictly
greater

I.e. you can't compare (2, 8) to (20, 50)
or let count them as equally smart elephants.
================================================

class Elephant(object):
def __init__(self, size, iq):
self.size = size
self.iq = iq
def __cmp__(self, e):
if self.size < e.size and self.iq > e.iq:
return -1
if self.size > e.size and self.iq < e.iq:
return 1
return 0

def e_cmp(e1, e2):
if e1.size < e2.size and e1.iq > e2.iq:
return -1
if e1.size > e2.size and e1.iq < e2.iq:
return 1
return 0

from random import randint
from time import time

a = []
for i in xrange(200000):
a.append(Elephant(randint(1, 50000), randint(1, 50000)))
b = a[:]
c = a[:]

print 'Sorting...'

t = time()
b.sort(cmp=e_cmp)
print time() - t

t = time()
c.sort()
print time() - t

print b == c


Sorting...
1.56299996376
1.95300006866
True
 
N

n00m

The second part of the compound if is backwards.  So if this is headed
for production code, it better get fixed.

DaveA

Not sure I'm understanding your remark.
 
M

MRAB

Steven said:
In the subject line, you write "too different times". You actually want
"two", the number, not "too" as in "too many", "too much". Lots of native
English speakers get this wrong too :)
[snip]
It could mean that the times are not just different, they're _too_
different, ie a lot more than they are expected to be.
 
N

n00m

Here "meaningful order" is:
if
elephant "a" is smarter than elephant "a[j]"
then "i" must be strictly less than "j"

Of course, to the same effect we could sort them simply
by sizes, but then time of sorting would increase by ~
2 times -- due to decreasing of number of equally smart
things.

But here it does not matter -- for my initial question.
I like all above explanations. Especially that by Chris Rebert.
 
M

MRAB

n00m said:
:) Of course, by "too" I meant "too", as in "tooooo much"

Although it's OK in English to say "too much x" or "too many x", it's
somewhat unnatural to say "too different xs"; it would have to be "the
xs are too different". Nobody said English was logical! :)
 
N

n00m

it's somewhat unnatural to say "too different xs"

Aha. Thanks.
PS
For years I thought that song's title "No Woman No Cry" by Bob Marley
means "No Woman -- No Cry". As if a man got rid of his woman and
stopped
crying, out of her bad behaviour etc.
It turned out to mean "No, woman,.. no cry..."

Or take "Drips" by Eminem. What on earth do the drips mean?

Album: The Eminem Show
Song: Drips

[Eminem] Obie.. yo
[Trice] {*coughing*} I'm sick
[Eminem] Damn, you straight dog?

[Chorus]
That's why I ain't got no time
for these games and stupid tricks
or these bitches on my dick
That's how dudes be gettin sick
That's how dicks be gettin drips
Fallin victims to this shit...
....
....
 
D

Dave Angel

n00m said:
Not sure I'm understanding your remark.
Well, others in the thread have observed the same thing, so maybe it
doesn't matter. But the quoted code had only one if statement:


And the first part of the compound if is a "<" comparison, while the
second part is a ">" comparison

This produces a different sort order than the default tuple comparison.
So it needs fixing.

DaveA
 
M

Mel

MRAB said:
Although it's OK in English to say "too much x" or "too many x", it's
somewhat unnatural to say "too different xs"; it would have to be "the
xs are too different". Nobody said English was logical! :)

Now that James Joyce has written _Finnegans Wake_ there are no grammar
errors and there are no spelling errors. There are only unexpected
meanings.

Mel.
 
S

Steven D'Aprano

and that still isn't a relationship where you can get any meaningful
order out of sorting them:

Not everything has, or need have, a total order. There are relationships
which are only partial (i.e. not all the items are comparable), or
missing transitivity.

A real-world example is pecking order in chickens, or social hierarchies
in general. Using the > operator to mean "higher ranking", you often get
non-transitive hierarchies like the following:

A > B, C, D, E
B > C, E
C > D, E
D > B, E

That is, A > B > C > D > E except that D > B also.

Economic preference is also non-transitive: people may prefer X to Y, and
prefer Y to Z, but prefer Z to X.

It is perfectly legitimate to sort a non-total ordered list, provided you
understand the limitations, including that the order you get will depend
on the order you started with.
 
S

Steven D'Aprano

Or take "Drips" by Eminem. What on earth do the drips mean?

When you have a cold or flu, your nose drips.

Some sexually transmitted diseases make your genitals drip.
 
N

n00m

Some sexually transmitted diseases make your genitals drip.

I suspected this :) Eminem is a famous misogynist
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,767
Messages
2,569,570
Members
45,045
Latest member
DRCM

Latest Threads

Top