Performance: sets vs dicts.

John Nagle

Is the "in" test faster for a dict or a set?
Is "frozenset" faster than "set"? Use case is
for things like applying "in" on a list of 500 or so words
while checking a large body of text.

John Nagle
 
Arnaud Delobelle

John Nagle said:
Is the "in" test faster for a dict or a set?
Is "frozenset" faster than "set"? Use case is
for things like applying "in" on a list of 500 or so words
while checking a large body of text.

John Nagle

IIRC, frozensets are implemented more or less as sets plus a hash
function and immutability, so I would expect "in" to perform exactly the
same as for sets. As for dicts, I would think that the set implementation
is very close to the dict one.

So I wouldn't expect any significant difference between any of them.
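
For what it's worth, the difference Arnaud alludes to is easy to see
interactively; an illustrative snippet (not from the thread), showing that
a frozenset is itself hashable while a set is not:

s = set([1, 2, 3])
f = frozenset([1, 2, 3])

print hash(f)            # frozensets are hashable, so they can be dict keys
try:
    hash(s)
except TypeError as e:
    print e              # unhashable type: 'set'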
 
Peter Otten

John said:
Is the "in" test faster for a dict or a set?
Is "frozenset" faster than "set"? Use case is
for things like applying "in" on a list of 500 or so words
while checking a large body of text.

As Arnaud suspects: no significant difference:

$ python dictperf.py
dict --> 0.210289001465
set --> 0.202902793884
frozenset --> 0.198950052261

$ cat dictperf.py
import random
import timeit

with open("/usr/share/dict/words") as instream:
    words = [line.strip() for line in instream]

#random.seed(42)
sample = random.sample(words, 501)

n = sample.pop()
y = random.choice(sample)

d = dict.fromkeys(sample)
s = set(sample)
f = frozenset(sample)


for lookup in d, s, f:
    print type(lookup).__name__, "-->", timeit.timeit(
        "n in lookup; y in lookup",
        "from __main__ import lookup, n, y")

Peter
 
Seth Rees

Peter Otten said:
As Arnaud suspects: no significant difference:

$ python dictperf.py
dict --> 0.210289001465
set --> 0.202902793884
frozenset --> 0.198950052261

[dictperf.py script snipped]
What about lists versus tuples?
 
Raymond Hettinger

    Is the "in" test faster for a dict or a set?
Is "frozenset" faster than "set"?  Use case is
for things like applying "in" on a list of 500 or so words
while checking a large body of text.

There is no significant difference.
All three are implemented using substantially the same code.


Raymond
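
An illustrative check (not from the thread) that membership testing is
hash-based in all three containers: an unhashable key fails identically
against a dict, a set, and a frozenset.

class Unhashable(object):
    __hash__ = None          # explicitly not hashable

for container in ({}, set(), frozenset()):
    try:
        Unhashable() in container
    except TypeError as e:
        print type(container).__name__, "-->", e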
 
Peter Otten

Seth said:
[quoting of the earlier benchmark exchange snipped]
What about lists versus tuples?

You trade O(1) for O(N) with both, a bad idea for all but the smallest
lists/tuples.

$ python dictperf.py
dict --> 0.221934080124
set --> 0.220952987671
frozenset --> 0.225043058395
list --> 21.5041530132
tuple --> 20.8655071259

By the way, the script will run on your computer, too ;)

Peter
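
Peter's updated script isn't shown in the post; a minimal sketch of how
dictperf.py above might be extended to cover list and tuple (the variable
names l and t are illustrative) is:

import random
import timeit

with open("/usr/share/dict/words") as instream:
    words = [line.strip() for line in instream]

sample = random.sample(words, 501)
n = sample.pop()             # not in the containers below (miss case)
y = random.choice(sample)    # present in all of them (hit case)

d = dict.fromkeys(sample)
s = set(sample)
f = frozenset(sample)
l = list(sample)
t = tuple(sample)

for lookup in d, s, f, l, t:
    print type(lookup).__name__, "-->", timeit.timeit(
        "n in lookup; y in lookup",
        "from __main__ import lookup, n, y")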
 
Aahz

There is no significant difference. All three are implemented using
substantially the same code.

That reminds me: one co-worker (who really should have known better ;-)
had the impression that sets were O(N) rather than O(1). Although
writing that off as a brain-fart seems appropriate, it's also the case
that the docs don't really make that clear; it's only implied by requiring
elements to be hashable. Do you agree that there should be a comment?
 
Paul Rubin

That reminds me: one co-worker (who really should have known better ;-)
had the impression that sets were O(N) rather than O(1). Although
writing that off as a brain-fart seems appropriate, it's also the case
that the docs don't really make that clear; it's only implied by requiring
elements to be hashable. Do you agree that there should be a comment?

It's O(1) with reasonable input distributions but can be O(N) for
adverse distributions. The docs should say something like that, and
include this link:

http://www.cs.rice.edu/~scrosby/hash/CrosbyWallach_UsenixSec2003/index.html
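
A minimal sketch of the kind of adverse distribution Paul describes (the
Collider class is illustrative, not from the thread): when every element
hashes to the same value, each set lookup degrades to a linear probe of one
long collision chain.

import timeit

class Collider(object):
    """All instances hash to the same value, forcing every insert and
    lookup into a single collision chain."""
    def __init__(self, n):
        self.n = n
    def __hash__(self):
        return 0
    def __eq__(self, other):
        return self.n == other.n

good = set(range(2000))                        # well-spread integer hashes
bad = set(Collider(i) for i in range(2000))    # all 2000 elements collide

print timeit.timeit("1999 in good",
                    "from __main__ import good", number=1000)
print timeit.timeit("Collider(1999) in bad",
                    "from __main__ import Collider, bad", number=1000)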
 
Aahz

For settling exactly this kind of confusion, Python's standard library
comes with a module, the timeit module. Your co-worker should have
known better: don't guess about timing performance, measure it.

Or am I missing something here?

Possibly; IMO, people should not need to run timeit to determine basic
algorithmic speed for standard Python datatypes.
 
Paul Rubin

Possibly; IMO, people should not need to run timeit to determine basic
algorithmic speed for standard Python datatypes.

Indeed. Alex Stepanov (designer of C++ Standard Template Library) was
emphatic that algorithm complexity assertions should be part of the
interface of any STL function:

http://www.sgi.com/tech/stl/drdobbs-interview.html

He also said it should be part of the "unwritten contract" between the
module and its user, but I don't understand why he said "unwritten",
since in the C++ STL the complexity statements are part of the written
spec.
 
Arnaud Delobelle

Paul Rubin said:
Indeed. Alex Stepanov (designer of C++ Standard Template Library) was
emphatic that algorithm complexity assertions should be part of the
interface of any STL function:

http://www.sgi.com/tech/stl/drdobbs-interview.html

He also said it should be part of the "unwritten contract" between the
module and its user, but I don't understand why he said "unwritten",
since in the C++ STL the complexity statements are part of the written
spec.

The spec is written, not the contract :)
 
Jerry Hill

Possibly; IMO, people should not need to run timeit to determine basic
algorithmic speed for standard Python datatypes.

http://wiki.python.org/moin/TimeComplexity takes a stab at it. IIRC,
last time this came up, there was some resistance to making promises
about time complexity in the official docs, since that would make
those numbers part of the language, and thus binding on other
implementations.
 
Aahz

http://wiki.python.org/moin/TimeComplexity takes a stab at it. IIRC,
last time this came up, there was some resistance to making promises
about time complexity in the official docs, since that would make
those numbers part of the language, and thus binding on other
implementations.

I'm thoroughly aware of that page and updated it yesterday to make it
easier to find. ;-)

However, I think there are some rock-bottom basic guarantees we can make
regardless of implementation. Does anyone seriously think that an
implementation would be accepted that had anything other than O(1) for
index access into tuples and lists? Dicts that were not O(1) for access
with non-pathological hashing? That we would accept sets having big-O
performance worse than dicts?

I suggest that we should agree on these guarantees and document them in
the core.
 
Stefan Behnel

Jerry Hill, 31.08.2010 14:24:
http://wiki.python.org/moin/TimeComplexity takes a stab at it. IIRC,
last time this came up, there was some resistance to making promises
about time complexity in the official docs, since that would make
those numbers part of the language, and thus binding on other
implementations.

The docs start getting clearer about what's language specification and
what's implementation specific to CPython, so that would just add another
couple of CPython-only bits to what's there already. Although I do consider
facts like list indexing being O(1) pretty much at the
language-rather-than-implementation level. Most Python code simply expects
that and it would degrade a lot of code if that ever changed in whatever
implementation.

Stefan
 
Jerry Hill

I suggest that we should agree on these guarantees and document them in
the core.

I can't get to the online python-dev archives from work (stupid
filter!) so I can't give you a link to the archives, but the original
thread that resulted in the creation of that wiki page was started on
March 9th, 2008 and was titled "Complexity documentation request".

At the time, opposition to formally documenting this seemed pretty
widespread, including from yourself and Guido. You've obviously
changed your mind on the subject, so maybe it's something that would
be worth revisiting, assuming someone wants to write the doc change.
 
Terry Reedy

I'm thoroughly aware of that page and updated it yesterday to make it
easier to find. ;-)

However, I think there are some rock-bottom basic guarantees we can make
regardless of implementation. Does anyone seriously think that an
implementation would be accepted that had anything other than O(1) for
index access into tuples and lists?

Does anyone seriously think that an implementation should be rejected as
an implementation if it intelligently did seq[n] lookups in log2(n)/31
time units for all n (as humans would do), instead of stupidly taking 1
time unit for all n < 2**31 and rejecting all larger values (as 32-bit
CPython does)?
> Dicts that were not O(1) for access with non-pathological hashing?

You would actually be unhappy if small dicts ran even faster than they
do now (while large dicts ran the same)? Not me!

I suggest that we should agree on these guarantees and document them in
the core.

I disagree for several reasons.

1. It would be a category mistake. Python is an abstract algorithm
language. The concept of speed is not applicable. I happen to think it
one of the best, which is why I once dubbed it 'executable pseudocode',
to compare it to similar but inferior, non-executable algorithm
languages still too often used today. To human readers, as readers, the
speed of CPython or anything else is not relevant.

2. A Python interpreter is an abstract algorithm. Algorithms can be
characterized by operation counts, but not by physical universe time.

3. Algorithms, including Python interpreters, can be embodied in
anything that can compute, including humans, von Neumann machines of
various types, and possible future non-von Neumann machines. To limit
'python interpreter' to current von Neumann machines would be
short-sighted. Human interpretation of Python code (and other
specifications) is part of development, testing, and debugging.

4. Guarantee? Who will pay what to whom if what is not performed? ;-) The
Python license intentionally disclaims any guarantee of performance. If
you are using 'guarantee' metaphorically, perhaps try another word.

5. Accepted? If you want to claim that an interpreter that is useless
for your purposes is not really an interpreter, you are free to, I
suppose. Asking the PSF to impose your view on me would, to me, violate
its diversity policy.

6. O-notation was developed for characterizing the asymptotic (large-N)
behavior of abstract algorithms in terms of abstract operation counts.
Left out are differences between operations, lower order terms,
multiplicative constants, how large N must be to get large-N behavior,
and what happens for smaller N. These and other factors make the
translation of O(f(n)) into real time extremely mushy or even wrong.

I already suggested that O(1) lookup is a symptom of simple-minded
arithmetic stupidity, of doing unnecessary work when adding small
numbers. When it is a result of doing most additions slower than
necessary, it is a bad thing, not a good thing, and a bad criterion for
an acceptable algorithm and an acceptable interpreter. Similarly, books
say that O(n*logn) sorting is 'better' than O(n**2) sorting. However, Tim
Peters verified that the opposite is true for real times for small n,
and hence the current list.sort smartly uses a fast O(n**2) algorithm
for small lists (len < 64) and small runs in larger lists. Rejecting
list.sort for this reason would be stupid.
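
For reference, the small-run strategy Terry describes is insertion sort; a
minimal pure-Python sketch of the idea (CPython's real implementation is in
C inside list.sort, and the exact cutoff is an implementation detail):

def insertion_sort(a):
    """O(n**2) in the worst case, but with tiny constants -- the kind of
    algorithm timsort falls back to for short runs."""
    for i in range(1, len(a)):
        item = a[i]
        j = i
        while j > 0 and a[j - 1] > item:
            a[j] = a[j - 1]          # shift larger elements right
            j -= 1
        a[j] = item
    return a

print insertion_sort([5, 2, 9, 1])   # [1, 2, 5, 9]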
 
Paul Rubin

Terry Reedy said:
Does anyone seriously think that an implementation should be rejected
as an implementation if it intellegently did seq[n] lookups in
log2(n)/31 time units for all n (as humans would do), instead of
stupidly taking 1 time unit for all n < 2**31 and rejecting all larger
values (as 32-bit CPython does)?

Er, how can one handle n > 2**31 at all, in 32-bit CPython?

You would actually be unhappy if small dicts ran even faster than they
do now (while large dicts ran the same)? Not me!

Not sure what you are referring to by that. I'd actually be ok with
using O(log(n)) dicts to eliminate the O(n) behavior of the hash-based
implementation on adverse input sets.

Similarly, books say that O(n*logn) sorting is 'better' than O(n**2)
sorting. However, Tim Peters verified that the opposite is true for
real times for small n, and hence the current list.sort smartly uses a
fast O(n**2) algorithm for small lists (len < 64) and small runs in
larger lists. Rejecting list.sort for this reason would be stupid.

That algorithm is still O(n log n) since the n**2 behavior up to
some constant n is effectively an additive constant that asymptotically
is within the specified big-O. 64 actually sounds suspiciously large
for the cutover, but I'm sure Tim tuned it carefully.
 
Stefan Behnel

Lie Ryan, 01.09.2010 15:46:
While I think documenting them would be great for all programmers that
care about practical and theoretical execution speed; I think including
these implementation details in core documentation as a "guarantee"
would be a bad idea for the reasons Terry outlined.

One way of resolving that is by having two sets of documentation (or two
separate sections in the documentation):
- Python -- the language -- documenting Python as an abstract language;
this is the documentation which can be shared across all Python
implementations. This will also be the specification for the Python
language against which other implementations will be measured.
- CPython -- the Python interpreter -- documents implementation details
and performance metrics. It should be properly noted that these are not
part of the language per se. This will be the playground for CPython
experts that need to fine tune their applications to the last drop of
blood and don't mind their application going nuts with the next release
of CPython.

I disagree. I think putting the "obvious" guarantees right into the normal
documentation will actually make programmers aware that there *are*
different implementations (and differences between implementations), simply
because it wouldn't just say "O(1)" but "the CPython implementation of this
method has an algorithmic complexity of O(1), other Python implementations
are known to perform alike at the time of this writing". Maybe without the
last half of the sentence if we really don't know how other implementations
work here, or if we expect that there may well be a reason they may choose
to behave differently, but in most cases, it shouldn't be hard to make that
complete statement.

After all, we basically know what other implementations there are, and we
also know that they tend to match the algorithmic complexities at least for
the major builtin types. It seems quite clear to me as a developer that the
set of builtin types and "collections" types was chosen in order to cover a
certain set of algorithmic complexities and not just arbitrary interfaces.

Stefan
 
Lie Ryan

I'm thoroughly aware of that page and updated it yesterday to make it
easier to find. ;-)

However, I think there are some rock-bottom basic guarantees we can make
regardless of implementation. Does anyone seriously think that an
implementation would be accepted that had anything other than O(1) for
index access into tuples and lists? Dicts that were not O(1) for access
with non-pathological hashing? That we would accept sets having big-O
performance worse than dicts?

I suggest that we should agree on these guarantees and document them in
the core.

While I think documenting them would be great for all programmers that
care about practical and theoretical execution speed; I think including
these implementation details in core documentation as a "guarantee"
would be a bad idea for the reasons Terry outlined.

One way of resolving that is by having two sets of documentation (or two
separate sections in the documentation):
- Python -- the language -- documenting Python as an abstract language;
this is the documentation which can be shared across all Python
implementations. This will also be the specification for the Python
language against which other implementations will be measured.
- CPython -- the Python interpreter -- documents implementation details
and performance metrics. It should be properly noted that these are not
part of the language per se. This will be the playground for CPython
experts that need to fine tune their applications to the last drop of
blood and don't mind their application going nuts with the next release
of CPython.
 
