Performance: sets vs dicts.


Aahz

I can't get to the online python-dev archives from work (stupid
filter!) so I can't give you a link to the archives, but the original
thread that resulted in the creation of that wiki page was started on
March 9th, 2008 and was titled "Complexity documentation request".
http://mail.python.org/pipermail/python-dev/2008-March/077499.html

At the time, opposition to formally documenting this seemed pretty
widespread, including from yourself and Guido. You've obviously
changed your mind on the subject, so maybe it's something that would
be worth revisiting, assuming someone wants to write the doc change.

Looking back at that thread, it's less that I've changed my mind as that
I've gotten a bit more nuanced. I still think that making a full set of
algorithmic guarantees is a Bad Idea, but I think that any implementation
that doesn't have O(1) for list element access is fundamentally broken,
and we should probably document that somewhere.
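Aahz's O(1) claim is easy to probe informally with the standard library's `timeit`; this is a rough sketch (absolute numbers are machine-dependent), not a guarantee:

```python
import timeit

lst = list(range(10**6))

# CPython lists are contiguous arrays of object pointers, so
# lst[i] is a bounds check plus one offset computation wherever
# i falls; timings near the front and near the back should be
# of the same order.
t_front = timeit.timeit(lambda: lst[10], number=100_000)
t_back = timeit.timeit(lambda: lst[999_990], number=100_000)

print(f"front: {t_front:.4f}s  back: {t_back:.4f}s")
```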
 

Stefan Behnel

Aahz, 01.09.2010 17:40:
I still think that making a full set of
algorithmic guarantees is a Bad Idea, but I think that any implementation
that doesn't have O(1) for list element access is fundamentally broken,
and we should probably document that somewhere.

+1

Stefan
 

Terry Reedy

I think that any implementation
that doesn't have O(1) for list element access is fundamentally broken,

Whereas I think that that claim is fundamentally broken in multiple ways.
and we should probably document that somewhere.

I agree that *current* algorithmic behavior of parts of CPython on
typical *current* hardware should be documented not just 'somewhere'
(which I understand it is, in the Wiki) but in a CPython doc included in
the doc set distributed with each release.

Perhaps someone or some group could write a HowTo on Programming with
CPython's Builtin Classes that would describe both the implementation
and performance and also the implications for coding style. In
particular, it could compare CPython's array lists and tuples to singly
linked lists (which are easily created in Python also).
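A singly linked list of the kind mentioned above can indeed be built in a few lines; this is an illustrative sketch (the `Node`/`nth` names are made up here), and it shows why element access in such a structure is O(n) rather than the single offset computation of a CPython array list:

```python
class Node:
    __slots__ = ("value", "next")

    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def linked_from(iterable):
    """Build a singly linked list preserving iteration order."""
    head = None
    for value in reversed(list(iterable)):
        head = Node(value, head)
    return head

def nth(head, n):
    """O(n) access: must walk n links, unlike lst[n]'s one offset."""
    node = head
    for _ in range(n):
        node = node.next
    return node.value

head = linked_from(range(1000))
print(nth(head, 0), nth(head, 999))  # first and last elements
```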

But such a document, after stating that array access may be thought of
as constant time on current hardware to a useful first approximation,
should also state that repeated sequential accesses may be *much*
faster than repeated random accesses. People in the high-performance
computing community are quite aware of this difference between
simplified lies and messy truth. Because of this, array algorithms are
(should be) written differently in Fortran and C because Fortran stores
arrays by columns and C by rows and because it is usually much faster to
access the next item than one far away.
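The cache effect described above can be glimpsed even from Python, though interpreter overhead blunts it; a rough, machine-dependent sketch:

```python
import random
import time

data = list(range(10**6))
order = list(range(10**6))
random.shuffle(order)

# Identical work, different access pattern: the shuffled pass
# tends to defeat CPU cache prefetching, so it is typically slower.
start = time.perf_counter()
total_seq = sum(data[i] for i in range(len(data)))
t_seq = time.perf_counter() - start

start = time.perf_counter()
total_rand = sum(data[i] for i in order)
t_rand = time.perf_counter() - start

assert total_seq == total_rand  # same answer either way
print(f"sequential: {t_seq:.3f}s  shuffled: {t_rand:.3f}s")
```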
 

Arnaud Delobelle

Terry Reedy said:
Whereas I think that that claim is fundamentally broken in multiple ways.


I agree that *current* algorithmic behavior of parts of CPython on
typical *current* hardware should be documented not just 'somewhere'
(which I understand it is, in the Wiki) but in a CPython doc included
in the doc set distributed with each release.

Perhaps someone or some group could write a HowTo on Programming with
CPython's Builtin Classes that would describe both the implementation
and performance and also the implications for coding style. In
particular, it could compare CPython's array lists and tuples to
singly linked lists (which are easily created in Python also).

But such a document, after stating that array access may be thought of
as constant time on current hardware to a useful first approximation,
should also state that repeated sequential accesses may be *much*
faster than repeated random accesses. People in the high-performance
computing community are quite aware of this difference between
simplified lies and messy truth. Because of this, array algorithms are
(should be) written differently in Fortran and C because Fortran
stores arrays by columns and C by rows and because it is usually much
faster to access the next item than one far away.

I don't understand what you're trying to say. Aahz didn't claim that
random list element access was constant time, he said it was O(1) (and
that it should be part of the Python spec that it is).
 

Terry Reedy

Terry Reedy said:
Does anyone seriously think that an implementation should be rejected
as an implementation if it intelligently did seq[n] lookups in
log2(n)/31 time units for all n (as humans would do), instead of
stupidly taking 1 time unit for all n < 2**31 and rejecting all larger
values (as 32-bit CPython does)?

Er, how can one handle n > 2**31 at all, in 32-bit CPython?

I am not sure of what you mean by 'handle'. Ints (longs in 2.x) are not
limited, but indexes are. 2**31 and bigger are summarily rejected as
impossibly too large, even though they might not actually be so these days.
Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    s[1]
IndexError: index out of range

Traceback (most recent call last):
  File "<pyshell#8>", line 1, in <module>
    s[2**32]
IndexError: cannot fit 'int' into an index-sized integer

As far as I know, this is undocumented. In any case, this means that if
it were possible to create a byte array longer than 2**31 on an
otherwise loaded 32-bit linux machine with 2**32 memory, then indexing
the end elements would not be possible, which is to say, O(1) would jump
to O(INF). I do not have such a machine to test whether
big = open('2.01.gigabytes', 'rb').read()
executes or raises an exception. Array size limits are also not documented.
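The limit Terry describes comes from indexes having to fit in a C Py_ssize_t, which is what `sys.maxsize` reports; a small sketch of the behaviour on current CPython (the exact exception messages are implementation details):

```python
import sys

s = b"abc"

# sys.maxsize is the largest value a Py_ssize_t can hold:
# 2**31 - 1 on a 32-bit build, 2**63 - 1 on a 64-bit one.
print("largest representable index:", sys.maxsize)

# An out-of-bounds but representable index is an ordinary IndexError.
try:
    s[10]
except IndexError as e:
    print("IndexError:", e)

# An index too big for Py_ssize_t fails before the bounds check is
# even reached; catching both types hedges across implementations.
try:
    s[sys.maxsize + 1]
except (IndexError, OverflowError) as e:
    print(type(e).__name__, ":", e)
```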
 

John Bokma

Arnaud Delobelle said:
I don't understand what you're trying to say. Aahz didn't claim that
random list element access was constant time, he said it was O(1) (and
that it should be part of the Python spec that it is).

Uhm, O(1) /is/ constant time, see page 45 of Introduction to Algorithms,
2nd edition.
 

Robert Kern

Uhm, O(1) /is/ constant time, see page 45 of Introduction to Algorithms,
2nd edition.

While we often use the term "constant time" as a synonym for O(1) complexity
of an algorithm, Arnaud and Terry are using the term here to mean "an
implementation takes roughly the same amount of wall-clock time every time".

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 

Raymond Hettinger

That reminds me: one co-worker (who really should have known better ;-)
had the impression that sets were O(N) rather than O(1).  Although
writing that off as a brain-fart seems appropriate, it's also the case
that the docs don't really make that clear; it's only implied by the
requirement that elements be hashable. Do you agree that there should
be a comment?

There probably ought to be a HOWTO or FAQ entry on algorithmic complexity
that covers classes and functions where the algorithms are interesting.
That will concentrate the knowledge in one place where performance is a
main theme and where the various alternatives can be compared and contrasted.

I think most users of sets rarely read the docs for sets. The few lines
in the tutorial are enough so that most folks "just get it" and don't read
more detail unless they're attempting something exotic. Our docs have gotten
somewhat voluminous, so it's unlikely that adding that particular
needle to the haystack would have cured your colleague's "brain-fart"
unless he had been focused on a single document talking about the
performance characteristics of various data structures.
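For what it's worth, the difference the co-worker missed is easy to demonstrate; a rough sketch (timings are machine-dependent):

```python
import timeit

n = 100_000
as_list = list(range(n))
as_set = set(as_list)
missing = -1  # worst case for the list: every element is scanned

# O(n) linear scan versus O(1) average-case hash lookup.
t_list = timeit.timeit(lambda: missing in as_list, number=100)
t_set = timeit.timeit(lambda: missing in as_set, number=100)

print(f"list membership: {t_list:.4f}s  set membership: {t_set:.4f}s")
```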


Raymond
 

Terry Reedy

Most generally, that I view Python as a general algorithm language and
not just as a von Neumann machine programming language.

More specifically, that O() claims can be inapplicable, confusing,
misleading, incomplete, or false, especially when applied to real time
and to real systems with finite limits.

Yes, I switched, because 'constant time' is a comprehensible claim that
can be refuted and because that is how some will interpret O(1) (see
below for proof;-).

If one takes O(1) to mean bounded, which I believe is the usual
technical meaning, then all Python built-in sequence operations take
bounded time because of the hard size limit. If sequences were not
bounded in length, then access time would not be bounded either.

My most specific point is that O(1), interpreted as more-or-less
constant time across a range of problem sizes, can be either a virtue or
a vice depending on whether the constancy is a result of speeding up large
problems or slowing down small problems. I furthermore contend that
Python sequences on current hardware exhibit both virtue and vice and
that it would be absurd to reject a system that kept the virtue without
the vice and that such absurdity should not be built into the language
definition.

My fourth point is that we can meet the reasonable goal of helping some
people make better use of current Python/CPython on current hardware
without big-O controversy and without screwing around with the language
definition and locking out the future.
 

John Bokma

Terry Reedy said:
On 9/1/2010 5:40 PM, John Bokma wrote:
[..]

Yes, I switched, because 'constant time' is a comprehensible claim
that can be refuted and because that is how some will interpret O(1)
(see below for proof;-).

You make it now sound as if this interpretation is not correct or out of
place. People who have bothered to read ItA will use O(1) and constant
time interchangeably while talking of the order of growth of the running
time of algorithms, and most of those are aware that 'big oh' hides a
constant, and that in the real world a O(log n) algorithm can outperform
an O(1) algorithm for small values of n.
 

John Bokma

Robert Kern said:
[...]
Uhm, O(1) /is/ constant time, see page 45 of Introduction to Algorithms,
2nd edition.

While we often use the term "constant time" as a synonym for O(1)
complexity of an algorithm, Arnaud and Terry are using the term here
to mean "an implementation takes roughly the same amount of wall-clock
time every time".

Now that's confusing in a discussion that earlier on provided a link to
a page using big O notation. At least for people following this
partially, like I do.
 

rurpy

There probably ought to be a HOWTO or FAQ entry on algorithmic complexity
that covers classes and functions where the algorithms are interesting.
That will concentrate the knowledge in one place where performance is a
main theme and where the various alternatives can be compared and contrasted.

I think most users of sets rarely read the docs for sets. The few lines
in the tutorial are enough so that most folks "just get it" and don't read
more detail unless they're attempting something exotic.

I think that attitude is very dangerous. There is
a long history in this world of one group of people
presuming what another group of people does or does
not do or think. This seems to be a characteristic
of human beings and is often used to promote one's
own ideology. And even if you have hard evidence
for what you say, why should 60% of people who don't
read docs justify providing poor quality docs to
the 40% that do?

So while you may "think" most people rarely read
the docs for basic language features and objects
(I presume you don't mean to restrict your statement
to only sets), I and most people I know *do* read
them. And when I read them I expect them, as any good
reference documentation does, to completely and
accurately describe the behavior of the item I am
reading about. If big-O performance is deemed an
intrinsic behavior of an (operation of) an object,
it should be described in the documentation for
that object.

Your use of the word "exotic" is also suspect.
I learned long ago to always click the "advanced
options" box on dialogs because most developers/designers
really don't have a clue about what users need access to.
Our docs have gotten
somewhat voluminous,

No they haven't (relative to what they attempt to
describe). The biggest problem with the docs is
that they are too terse. They often appear to have
been written by people playing a game of "who can
describe X in the minimum number of words that can
still be defended as correct." While that may be
fun, good docs are produced by considering how to
describe something to the reader, completely and
accurately, as effectively as possible. The test
is not how few words were used, but how quickly
the reader can understand the object or find the
information being sought about the object.
so it's unlikely that adding that particular
needle to the haystack would have cured your colleague's "brain-fart"
unless he had been focused on a single document talking about the
performance characteristics of various data structures.

I don't know the colleague any more than you do, so I
feel comfortable saying that having it documented very likely
*would* have cured that brain-fart. That is, he
or she very likely would have needed to check some
behavior of sets at some point and would have either
noted the big-O characteristics in passing, or would
have noted that such information was available, and
would have returned to the documentation when the
need for that information arose. The reference
description of sets is the *one* canonical place to
look for information about sets.

There are people who don't read documentation, but
one has to be very careful not to use the existence
of such people as an excuse to justify sub-standard
documentation.

So I think relegating algorithmic complexity information
to some remote document far from the description of the
object it pertains to, is exactly the wrong approach.
This is not to say that a performance HOWTO or FAQ
in addition to the reference manual would not be good.
 

Terry Reedy

So while you may "think" most people rarely read
the docs for basic language features and objects
(I presume you don't mean to restrict your statement
to only sets), I and most people I know *do* read
them. And when read them I expect them, as any good
reference documentation does, to completely and
accurately describe the behavior of the item I am
reading about. If big-O performance is deemed an
intrinsic behavior of an (operation of) an object,
it should be described in the documentation for
that object.

However, big-O performance is intentionally NOT so deemed. And I have
and would continue to argue that it should not be, for multiple reasons.

Performance is a feature of implementations, and I think implementation
performance should be documented as such.
This is not to say that a performance HOWTO or FAQ
in addition to the reference manual would not be good.

I have writing a draft of such for CPython on my personal todo list.
 

Terry Reedy

Terry Reedy said:
On 9/1/2010 5:40 PM, John Bokma wrote:
[..]

Yes, I switched, because 'constant time' is a comprehensible claim
that can be refuted and because that is how some will interpret O(1)
(see below for proof;-).

You make it now sound as if this interpretation is not correct or out of
place.

Saying that an interpretation is comprehensible and common is hardly a
putdown ;-). It simply is not unique. I also said that the other
interpretation is not coherent for size-limited problems. However, if
you mean 'constant', whatever you mean by that, why not say so?

Here is the technical definition given in
https://secure.wikimedia.org/wikipedia/en/wiki/Big_O_notation
I assure you that it is the standard one invented by a mathematician
over a century ago and used by mathematicians ever since. It was
popularized in computer science by Donald Knuth in 1976 and used in The
Art of Computer Programming. It is related to the definition of limits.
'''
Formal definition

Let f(x) and g(x) be two functions defined on some subset of the real
numbers. One writes

    f(x) = O(g(x)) as x → ∞

if and only if, for sufficiently large values of x, f(x) is at most a
constant multiplied by g(x) in absolute value. That is, f(x) = O(g(x))
if and only if there exists a positive real number M and a real number
x0 such that

    |f(x)| ≤ M·|g(x)| for all x > x0.
'''
For g(x) == 1, this reduces to |f(x)| < M for some M (and large x),
which is to say, f(x) is bounded (at least for large x).

Hmmm. By that definition, 1/x is O(1). Let x0 be any positive number and
M = 1/x0. Some might find that disturbing. But it is the nature of the
notation, as with limit notation, that it says *nothing* about the
behavior of f for small x.
People who have bothered to read ItA will use O(1) and constant
time interchangeably while talking of the order of growth of the running
time of algorithms

People who have bothered to read the Bible will use terms the way they
are used in the Bible. So what? In other words, I do not regard ItA as
scripture in the way you seem to. (I presume you mean the Cormen et al.
book. I partially read a library copy over 20 years ago but never
bothered to buy a copy. I see it is now into its third edition.) If they
disagree with Knuth and most others (which I would not claim without
seeing their actual text) then so much the worse for them.
and most of those are aware that 'big oh' hides a
constant, and that in the real world a O(log n) algorithm can outperform
an O(1) algorithm for small values of n.

Right. And if 'small values of n' include all possible values, then
rejecting a particular O(log n) algorithm as 'unacceptable' relative to
all O(1) algorithms is pretty absurd.

Given that the Wikipedia article references that section also, I wonder
if it really disagrees with the definition above.
 

Arnaud Delobelle

Terry Reedy said:
On 9/1/2010 8:11 PM, John Bokma wrote: [...]
Given that the Wikipedia article references that section also, I
wonder if it really disagrees with the definition above.

In simple terms, O(1) is *bounded* time.

This is not the same as constant, even though it is true that the
misleading expression "constant time" is widely used. I emphasized the
difference as you seemed to say that describing list element access as
O(1) was misleading because for example accessing contiguous elements
would be significantly faster than accessing distant ones. This may or
may not be the case, but it remains true that this time is bounded and
this is what O(1) really means.
 

rurpy

However, big-O performance is intentionally NOT so deemed.

The discussion, as I understood it, was about whether
or not it *should* be so deemed.
And I have
and would continue to argue that it should not be, for multiple reasons.

Yes, you have. And others have argued the opposite.
Personally, I did not find your arguments very convincing,
particularly that it would be misleading or that the
limits necessarily imposed by a real implementation
somehow invalidate the usefulness of O() documentation.
But I acknowledged that there was not universal agreement
that O() behavior should be documented in the reference
docs by qualifying my statement with the word "if".

But mostly my comments were directed towards some of the
side comments in Raymond's post I thought should not pass
unchallenged. I think that some of the attitudes expressed
(and shared by others) are likely the direct cause of many
of the faults I find in the current documentation.
 

John Bokma

Terry Reedy said:
On 9/1/2010 8:11 PM, John Bokma wrote:
[...]

Right. And if 'small values of n' include all possible values, then
rejecting a particular O(log n) algorithm as 'unacceptable' relative
to all O(1) algorithms is pretty absurd.

I have little time, but want to reply to this one: anyone using big-Oh
and claiming that an O(log n) algorithm is 'unacceptable' relative to
all O(1) algorithms has no clue what he/she is talking about.

big-Oh says something about the order of /growth/ of the running time of
an algorithm. And since 1 is a constant O(1) means that the order of
/growth/ of the running time is constant (independend of the input
size. Since "the growth of the running time is constant" is quite a
mouth full, it's often shortened to 'constant time' since from the
context it's clear what's being meant. But this doesn't mean
that if the algorithm gets an input size of 100000 versus 1 that it
takes the same number of seconds for the latter to process.
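The point above, that O(1) speaks to growth rather than to an absolute number of seconds, can be sketched as follows (illustrative only; numbers vary by machine):

```python
import timeit

# Membership time in a set should stay roughly flat as the set
# grows 10,000x -- that flatness, not any particular number of
# seconds, is what the O(1) claim promises.
small = set(range(100))
large = set(range(1_000_000))

t_small = timeit.timeit(lambda: 99 in small, number=100_000)
t_large = timeit.timeit(lambda: 99 in large, number=100_000)

print(f"n=100: {t_small:.4f}s   n=1,000,000: {t_large:.4f}s")
```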
 

Robert Kern

Terry Reedy said:
On 9/1/2010 8:11 PM, John Bokma wrote:
[...]

Right. And if 'small values of n' include all possible values, then
rejecting a particular O(log n) algorithm as 'unacceptable' relative
to all O(1) algorithms is pretty absurd.

I have little time, but want to reply to this one: anyone using big-Oh
and claiming that an O(log n) algorithm is 'unacceptable' relative to
all O(1) algorithms has no clue what he/she is talking about.

big-Oh says something about the order of /growth/ of the running time of
an algorithm. And since 1 is a constant, O(1) means that the order of
/growth/ of the running time is constant (independent of the input
size).

That's an ambiguous wording that is likely to confuse people. It seems like you
are saying that the O() behavior is the order of the first derivative of the
running time as a function of some interesting parameter of the problem, which
it is not. O() notation *describes* the growth, but it *is not* the order of the
growth itself.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 
