Performance: sets vs dicts.

Discussion in 'Python' started by John Nagle, Aug 29, 2010.

  1. John Nagle

    John Nagle Guest

    Is the "in" test faster for a dict or a set?
    Is "frozenset" faster than "set"? Use case is
    for things like applying "in" on a list of 500 or so words
    while checking a large body of text.

    John Nagle
    John Nagle, Aug 29, 2010
    #1

  2. John Nagle <> writes:

    > Is the "in" test faster for a dict or a set?
    > Is "frozenset" faster than "set"? Use case is
    > for things like applying "in" on a list of 500 or so words
    > while checking a large body of text.
    >
    > John Nagle


    IIRC, frozensets are implemented more or less as sets plus a hash
    function and immutability, so I would expect "in" to perform exactly
    the same as for sets. For dicts, I would think that the set
    implementation is very close to the dict one.

    So I wouldn't expect any significant difference between any of them.

    --
    Arnaud
    Arnaud Delobelle, Aug 29, 2010
    #2

  3. John Nagle

    Peter Otten Guest

    John Nagle wrote:

    > Is the "in" test faster for a dict or a set?
    > Is "frozenset" faster than "set"? Use case is
    > for things like applying "in" on a list of 500 or so words
    > while checking a large body of text.


    As Arnaud suspects: no significant difference:

    $ python dictperf.py
    dict --> 0.210289001465
    set --> 0.202902793884
    frozenset --> 0.198950052261

    $ cat dictperf.py
    import random
    import timeit

    with open("/usr/share/dict/words") as instream:
        words = [line.strip() for line in instream]

    #random.seed(42)
    sample = random.sample(words, 501)

    n = sample.pop()
    y = random.choice(sample)

    d = dict.fromkeys(sample)
    s = set(sample)
    f = frozenset(sample)


    for lookup in d, s, f:
        print type(lookup).__name__, "-->", timeit.timeit(
            "n in lookup; y in lookup",
            "from __main__ import lookup, n, y")

    Peter
    Peter Otten, Aug 29, 2010
    #3
  4. John Nagle

    Seth Rees Guest

    On 08/29/10 14:43, Peter Otten wrote:
    > John Nagle wrote:
    >
    >> Is the "in" test faster for a dict or a set?
    >> Is "frozenset" faster than "set"? Use case is
    >> for things like applying "in" on a list of 500 or so words
    >> while checking a large body of text.

    >
    > As Arnaud suspects: no significant difference:
    >
    > $ python dictperf.py
    > dict --> 0.210289001465
    > set --> 0.202902793884
    > frozenset --> 0.198950052261
    >
    > [script snipped]
    >
    > Peter

    What about lists versus tuples?
    Seth Rees, Aug 30, 2010
    #4
  5. On Aug 29, 12:12 pm, John Nagle <> wrote:
    >     Is the "in" test faster for a dict or a set?
    > Is "frozenset" faster than "set"?  Use case is
    > for things like applying "in" on a list of 500 or so words
    > while checking a large body of text.


    There is no significant difference.
    All three are implemented using substantially the same code.


    Raymond
    Raymond Hettinger, Aug 30, 2010
    #5
  6. John Nagle

    Peter Otten Guest

    Seth Rees wrote:

    > On 08/29/10 14:43, Peter Otten wrote:
    >> John Nagle wrote:
    >>
    >>> Is the "in" test faster for a dict or a set?
    >>> Is "frozenset" faster than "set"? Use case is
    >>> for things like applying "in" on a list of 500 or so words
    >>> while checking a large body of text.

    >>
    >> As Arnaud suspects: no significant difference:
    >>
    >> $ python dictperf.py
    >> dict --> 0.210289001465
    >> set --> 0.202902793884
    >> frozenset --> 0.198950052261
    >>
    >> [script snipped]
    >>
    >> Peter

    > What about lists versus tuples?


    You trade O(1) for O(N) with both, a bad idea for all but the smallest
    lists/tuples.

    $ python dictperf.py
    dict --> 0.221934080124
    set --> 0.220952987671
    frozenset --> 0.225043058395
    list --> 21.5041530132
    tuple --> 20.8655071259

    By the way, the script will run on your computer, too ;)

    Peter
    Peter Otten, Aug 30, 2010
    #6
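Peter's benchmark, extended with list and tuple as in the timings above, can be sketched in modern Python 3 roughly like this (a hedged port, not his exact script: a synthetic word list stands in for /usr/share/dict/words, which may not exist on every system, and a lambda replaces the "from __main__ import ..." setup string):

```python
# Hedged Python 3 sketch of Peter's benchmark, extended with list and tuple.
import random
import timeit

random.seed(42)
words = ["word%d" % i for i in range(10000)]
sample = random.sample(words, 501)

n = sample.pop()           # absent from the containers: worst case for list/tuple
y = random.choice(sample)  # present in all containers

containers = {
    "dict": dict.fromkeys(sample),
    "set": set(sample),
    "frozenset": frozenset(sample),
    "list": list(sample),
    "tuple": tuple(sample),
}

results = {}
for name, lookup in containers.items():
    # A lambda avoids the "from __main__ import ..." setup string, so the
    # snippet also works when it is not run as a top-level script.
    results[name] = timeit.timeit(lambda: (n in lookup, y in lookup),
                                  number=10_000)
    print(name, "-->", results[name])
```

On a typical run the three hash-based containers come out closely matched while list and tuple are far slower, in line with Peter's numbers.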
  7. John Nagle

    Aahz Guest

    In article <>,
    Raymond Hettinger <> wrote:
    >On Aug 29, 12:12 pm, John Nagle <> wrote:
    >>
    >> Is the "in" test faster for a dict or a set? Is "frozenset" faster
    >> than "set"? Use case is for things like applying "in" on a list of
    >> 500 or so words while checking a large body of text.

    >
    >There is no significant difference. All three are implemented using
    >substantially the same code.


    That reminds me: one co-worker (who really should have known better ;-)
    had the impression that sets were O(N) rather than O(1). Although
    writing that off as a brain-fart seems appropriate, it's also the case
    that the docs don't really make that clear, it's implied from requiring
    elements to be hashable. Do you agree that there should be a comment?
    --
    Aahz () <*> http://www.pythoncraft.com/

    "...if I were on life-support, I'd rather have it run by a Gameboy than a
    Windows box." --Cliff Wells
    Aahz, Aug 30, 2010
    #7
  8. John Nagle

    Paul Rubin Guest

    (Aahz) writes:
    > That reminds me: one co-worker (who really should have known better ;-)
    > had the impression that sets were O(N) rather than O(1). Although
    > writing that off as a brain-fart seems appropriate, it's also the case
    > that the docs don't really make that clear, it's implied from requiring
    > elements to be hashable. Do you agree that there should be a comment?


    It's O(1) with reasonable input distributions but can be O(N) for
    adverse distributions. The docs should say something like that, and
    include this link:

    http://www.cs.rice.edu/~scrosby/hash/CrosbyWallach_UsenixSec2003/index.html
    Paul Rubin, Aug 30, 2010
    #8
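Paul's point about adverse distributions can be demonstrated with a deliberately degenerate hash function (BadHash is a hypothetical class written only for illustration): when every key collides, each probe degrades toward a linear scan, so building the set costs O(N**2) instead of O(N).

```python
class BadHash:
    """Hashable wrapper whose hash is constant, so every instance collides."""
    def __init__(self, n):
        self.n = n
    def __hash__(self):
        return 0  # adversarial: all keys land in the same hash bucket
    def __eq__(self, other):
        return isinstance(other, BadHash) and self.n == other.n

# Inserting the k-th element must probe the k-1 earlier ones, so this
# build is quadratic; with a well-distributed hash it would be linear.
colliding = {BadHash(i) for i in range(200)}
assert BadHash(150) in colliding      # found, but only after a long probe chain
assert BadHash(999) not in colliding
```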
  9. John Nagle

    Aahz Guest

    In article <>,
    Ben Finney <> wrote:
    > (Aahz) writes:
    >>
    >> That reminds me: one co-worker (who really should have known better ;-)
    >> had the impression that sets were O(N) rather than O(1).

    >
    >For settling exactly this kind of confusion, Python's standard library
    >comes with a module, the timeit module. Your co-worker should have
    >known better: don't guess about timing performance, measure it.
    >
    >Or am I missing something here?


    Possibly; IMO, people should not need to run timeit to determine basic
    algorithmic speed for standard Python datatypes.
    --
    Aahz () <*> http://www.pythoncraft.com/

    "...if I were on life-support, I'd rather have it run by a Gameboy than a
    Windows box." --Cliff Wells
    Aahz, Aug 31, 2010
    #9
  10. John Nagle

    Paul Rubin Guest

    (Aahz) writes:
    > Possibly; IMO, people should not need to run timeit to determine basic
    > algorithmic speed for standard Python datatypes.


    Indeed. Alex Stepanov (designer of C++ Standard Template Library) was
    emphatic that algorithm complexity assertions should be part of the
    interface of any STL function:

    http://www.sgi.com/tech/stl/drdobbs-interview.html

    He also said it should be part of the "unwritten contract" between the
    module and its user, but I don't understand why he said "unwritten",
    since in the C++ STL the complexity statements are part of the written
    spec.
    Paul Rubin, Aug 31, 2010
    #10
  11. Paul Rubin <> writes:

    > (Aahz) writes:
    >> Possibly; IMO, people should not need to run timeit to determine basic
    >> algorithmic speed for standard Python datatypes.

    >
    > Indeed. Alex Stepanov (designer of C++ Standard Template Library) was
    > emphatic that algorithm complexity assertions should be part of the
    > interface of any STL function:
    >
    > http://www.sgi.com/tech/stl/drdobbs-interview.html
    >
    > He also said it should be part of the "unwritten contract" between the
    > module and its user, but I don't understand why he said "unwritten",
    > since in the C++ STL the complexity statements are part of the written
    > spec.


    The spec is written, not the contract :)

    --
    Arnaud
    Arnaud Delobelle, Aug 31, 2010
    #11
  12. John Nagle

    Jerry Hill Guest

    On Mon, Aug 30, 2010 at 7:42 PM, Aahz <> wrote:
    > Possibly; IMO, people should not need to run timeit to determine basic
    > algorithmic speed for standard Python datatypes.


    http://wiki.python.org/moin/TimeComplexity takes a stab at it. IIRC,
    last time this came up, there was some resistance to making promises
    about time complexity in the official docs, since that would make
    those numbers part of the language, and thus binding on other
    implementations.

    --
    Jerry
    Jerry Hill, Aug 31, 2010
    #12
  13. John Nagle

    Aahz Guest

    In article <>,
    Jerry Hill <> wrote:
    >On Mon, Aug 30, 2010 at 7:42 PM, Aahz <> wrote:
    >>
    >> Possibly; IMO, people should not need to run timeit to determine basic
    >> algorithmic speed for standard Python datatypes.

    >
    >http://wiki.python.org/moin/TimeComplexity takes a stab at it. IIRC,
    >last time this came up, there was some resistance to making promises
    >about time complexity in the official docs, since that would make
    >those numbers part of the language, and thus binding on other
    >implementations.


    I'm thoroughly aware of that page and updated it yesterday to make it
    easier to find. ;-)

    However, I think there are some rock-bottom basic guarantees we can make
    regardless of implementation. Does anyone seriously think that an
    implementation would be accepted that had anything other than O(1) for
    index access into tuples and lists? Dicts that were not O(1) for access
    with non-pathological hashing? That we would accept sets having O()
    performance worse than dicts?

    I suggest that we should agree on these guarantees and document them in
    the core.
    --
    Aahz () <*> http://www.pythoncraft.com/

    "...if I were on life-support, I'd rather have it run by a Gameboy than a
    Windows box." --Cliff Wells
    Aahz, Aug 31, 2010
    #13
  14. Jerry Hill, 31.08.2010 14:24:
    > On Mon, Aug 30, 2010 at 7:42 PM, Aahz wrote:
    >> Possibly; IMO, people should not need to run timeit to determine basic
    >> algorithmic speed for standard Python datatypes.

    >
    > http://wiki.python.org/moin/TimeComplexity takes a stab at it. IIRC,
    > last time this came up, there was some resistance to making promises
    > about time complexity in the official docs, since that would make
    > those numbers part of the language, and thus binding on other
    > implementations.


    The docs are starting to get clearer about what's language specification
    and what's implementation-specific to CPython, so this would just add
    another couple of CPython-only bits to what's there already. Although I
    do consider facts like list indexing being O(1) to be pretty much at the
    language-rather-than-implementation level. Most Python code simply
    expects that, and it would degrade a lot of code if that ever changed in
    whatever implementation.

    Stefan
    Stefan Behnel, Aug 31, 2010
    #14
  15. John Nagle

    Ian Guest

    On 31/08/2010 15:09, Aahz wrote:
    > I suggest that we should agree on these guarantees and document them in
    > the core.

    I suspect that documenting them will be the easy part ;)
    Ian, Aug 31, 2010
    #15
  16. John Nagle

    Jerry Hill Guest

    On Tue, Aug 31, 2010 at 10:09 AM, Aahz <> wrote:
    > I suggest that we should agree on these guarantees and document them in
    > the core.


    I can't get to the online python-dev archives from work (stupid
    filter!) so I can't give you a link to the archives, but the original
    thread that resulted in the creation of that wiki page was started on
    March 9th, 2008 and was titled "Complexity documentation request".

    At the time, opposition to formally documenting this seemed pretty
    widespread, including from yourself and Guido. You've obviously
    changed your mind on the subject, so maybe it's something that would
    be worth revisiting, assuming someone wants to write the doc change.

    --
    Jerry
    Jerry Hill, Aug 31, 2010
    #16
  17. John Nagle

    Terry Reedy Guest

    On 8/31/2010 10:09 AM, Aahz wrote:
    > In article<>,
    > Jerry Hill<> wrote:
    >> On Mon, Aug 30, 2010 at 7:42 PM, Aahz<> wrote:
    >>>
    >>> Possibly; IMO, people should not need to run timeit to determine basic
    >>> algorithmic speed for standard Python datatypes.

    >>
    >> http://wiki.python.org/moin/TimeComplexity takes a stab at it. IIRC,
    >> last time this came up, there was some resistance to making promises
    >> about time complexity in the official docs, since that would make
    >> those numbers part of the language, and thus binding on other
    >> implementations.

    >
    > I'm thoroughly aware of that page and updated it yesterday to make it
    > easier to find. ;-)
    >
    > However, I think there are some rock-bottom basic guarantees we can make
    > regardless of implementation. Does anyone seriously think that an
    > implementation would be accepted that had anything other than O(1) for
    > index access into tuples and lists?


    Does anyone seriously think that an implementation should be rejected as
    an implementation if it intelligently did seq[n] lookups in log2(n)/31
    time units for all n (as humans would do), instead of stupidly taking 1
    time unit for all n < 2**31 and rejecting all larger values (as 32-bit
    CPython does)?

    > Dicts that were not O(1) for access with non-pathological hashing?


    You would actually be unhappy if small dicts ran even faster than they
    do now (while large dicts ran the same)? Not me!

    > I suggest that we should agree on these guarantees and document them in
    > the core.


    I disagree for several reasons.

    1. It would be a category mistake. Python is an abstract algorithm
    language. The concept of speed is not applicable. I happen to think it
    one of the best, which is why I once dubbed it 'executable pseudocode',
    to compare it to similar but inferior, non-executable algorithm
    languages still too often used today. To human readers, as readers, the
    speed of CPython or anything else is not relevant.

    2. A Python interpreter is an abstract algorithm. Algorithms can be
    characterized by operation counts, but not by physical universe time.

    3. Algorithms, including Python interpreters, can be embodied in
    anything that can compute, including humans, von Neumann machines of
    various types, and possible future non-von Neumann machines. To limit
    'Python interpreter' to current von Neumann machines would be
    short-sighted. Human interpretation of Python code (and other
    specifications) is part of development, testing, and debugging.

    4. Guarantee? Who will pay what to whom if what is not performed ;-?
    The Python license intentionally disclaims any guarantee of
    performance. If you are using 'guarantee' metaphorically, perhaps try
    another word.

    5. Accepted? If you want to claim that an interpreter that is useless
    for your purposes is not really an interpreter, you are free to, I
    suppose. Asking the PSF to impose your view on me would, to me, violate
    its diversity policy.

    6. O-notation was developed for characterizing the asymptotic (large-N)
    behavior of abstract algorithms in terms of abstract operation counts.
    Left out are differences between operations, lower order terms,
    multiplicative constants, how large N must be to get large-N behavior,
    and what happens for smaller N. These and other factors make the
    translation of O(f(n)) into real time extremely mushy or even wrong.

    I already suggested that O(1) lookup is a symptom of simple-minded
    arithmetic stupidity, of doing unnecessary work when adding small
    numbers. When it is a result of doing most additions slower than
    necessary, it is a bad thing, not a good thing, and a bad criterion for
    an acceptable algorithm and acceptable interpreter. Similarly, books say
    that O(n*logn) sorting is 'better' than O(n**2) sorting. However, Tim
    Peters verified that the opposite is true for real times for small n,
    and hence the current list.sort smartly uses a fast O(n**2) algorithm
    for small lists (len < 64) and small runs in larger lists. Rejecting
    list.sort for this reason would be stupid.

    --
    Terry Jan Reedy
    Terry Reedy, Sep 1, 2010
    #17
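Terry's list.sort point, that small-n O(n**2) beats O(n*logn) in real time, can be sketched as a toy hybrid sort (a hypothetical illustration, not CPython's actual Timsort): a merge sort that falls back to insertion sort below a cutoff.

```python
def hybrid_sort(items, cutoff=64):
    """Merge sort that switches to insertion sort for small inputs."""
    if len(items) <= cutoff:
        # O(n**2) insertion sort: low constant factors win for small n.
        result = []
        for x in items:
            i = len(result)
            while i > 0 and result[i - 1] > x:
                i -= 1
            result.insert(i, x)
        return result
    mid = len(items) // 2
    left = hybrid_sort(items[:mid], cutoff)
    right = hybrid_sort(items[mid:], cutoff)
    # Merge the two sorted halves.
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

print(hybrid_sort([3, 1, 4, 1, 5, 9, 2, 6]))  # [1, 1, 2, 3, 4, 5, 6, 9]
```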
  18. John Nagle

    Paul Rubin Guest

    Terry Reedy <> writes:
    > Does anyone seriously think that an implementation should be rejected
    > as an implementation if it intelligently did seq[n] lookups in
    > log2(n)/31 time units for all n (as humans would do), instead of
    > stupidly taking 1 time unit for all n < 2**31 and rejecting all larger
    > values (as 32-bit CPython does)?


    Er, how can one handle n > 2**31 at all, in 32-bit CPython?

    >> Dicts that were not O(1) for access with non-pathological hashing?

    >
    > You would actually be unhappy if small dicts ran even faster than they
    > do now (while large dicts ran the same)? Not me!


    Not sure what you are referring to by that. I'd actually be ok with
    using O(log(n)) dicts to eliminate the O(n) behavior of the hash-based
    implementation on adverse input sets.

    > Similarly, books say that O(n*logn) sorting is 'better' that O(n**2)
    > sorting. However, Tim Peters verified that the opposite is true for
    > real times for small n, and hence the current list.sort smartly uses a
    > fast O(n**2) algorithm for small lists (len < 64) and small runs in
    > larger lists. Rejecting list.sort for this reason would be stupid.


    That algorithm is still O(n log n) since the n**2 behavior up to
    some constant n is effectively an additive constant that asymptotically
    is within the specified big-O. 64 actually sounds suspiciously large
    for the cutover, but I'm sure Tim tuned it carefully.
    Paul Rubin, Sep 1, 2010
    #18
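Paul's O(log(n)) alternative can be pictured with a sorted list plus the stdlib bisect module (a hypothetical sketch, not a drop-in dict replacement): binary search needs no hashing at all, so adversarial key distributions cannot push lookups past O(log n).

```python
import bisect

class SortedMembership:
    """Membership testing in O(log n) via binary search; no hashing involved."""
    def __init__(self, items):
        self._items = sorted(items)

    def __contains__(self, x):
        # bisect_left finds where x would be inserted; x is present
        # exactly when that slot already holds an equal element.
        i = bisect.bisect_left(self._items, x)
        return i < len(self._items) and self._items[i] == x

members = SortedMembership(["apple", "banana", "cherry"])
assert "banana" in members
assert "durian" not in members
```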
  19. Lie Ryan, 01.09.2010 15:46:
    > On 09/01/10 00:09, Aahz wrote:
    >> However, I think there are some rock-bottom basic guarantees we can make
    >> regardless of implementation. Does anyone seriously think that an
    >> implementation would be accepted that had anything other than O(1) for
    >> index access into tuples and lists? Dicts that were not O(1) for access
    >> with non-pathological hashing? That we would accept sets having O()
    >> performance worse than dicts?
    >>
    >> I suggest that we should agree on these guarantees and document them in
    >> the core.

    >
    > While I think documenting them would be great for all programmers that
    > care about practical and theoretical execution speed; I think including
    > these implementation details in core documentation as a "guarantee"
    > would be a bad idea for the reasons Terry outlined.
    >
    > One way of resolving that is by having two documentations (or two
    > separate sections in the documentation) for:
    > - Python -- the language -- documenting Python as an abstract language,
    > this is the documentation which can be shared across all Python
    > implementations. This will also be the specification for the Python
    > language against which other implementations will be measured.
    > - CPython -- the Python interpreter -- documents implementation details
    > and performance metrics. It should be properly noted that these are not
    > part of the language per se. This will be the playground for CPython
    > experts that need to fine tune their applications to the last drop of
    > blood and don't mind their application going nuts with the next release
    > of CPython.


    I disagree. I think putting the "obvious" guarantees right into the normal
    documentation will actually make programmers aware that there *are*
    different implementations (and differences between implementations), simply
    because it wouldn't just say "O(1)" but "the CPython implementation of this
    method has an algorithmic complexity of O(1), other Python implementations
    are known to perform alike at the time of this writing". Maybe without the
    last half of the sentence if we really don't know how other implementations
    work here, or if we expect that there may well be a reason they may choose
    to behave differently, but in most cases it shouldn't be hard to make
    that complete statement.

    After all, we basically know what other implementations there are, and we
    also know that they tend to match the algorithmic complexities at least for
    the major builtin types. It seems quite clear to me as a developer that the
    set of builtin types and "collections" types was chosen in order to cover a
    certain set of algorithmic complexities and not just arbitrary interfaces.

    Stefan
    Stefan Behnel, Sep 1, 2010
    #19
  20. John Nagle

    Lie Ryan Guest

    On 09/01/10 00:09, Aahz wrote:
    > In article <>,
    > Jerry Hill <> wrote:
    >> On Mon, Aug 30, 2010 at 7:42 PM, Aahz <> wrote:
    >>>
    >>> Possibly; IMO, people should not need to run timeit to determine basic
    >>> algorithmic speed for standard Python datatypes.

    >>
    >> http://wiki.python.org/moin/TimeComplexity takes a stab at it. IIRC,
    >> last time this came up, there was some resistance to making promises
    >> about time complexity in the official docs, since that would make
    >> those numbers part of the language, and thus binding on other
    >> implementations.

    >
    > I'm thoroughly aware of that page and updated it yesterday to make it
    > easier to find. ;-)
    >
    > However, I think there are some rock-bottom basic guarantees we can make
    > regardless of implementation. Does anyone seriously think that an
    > implementation would be accepted that had anything other than O(1) for
    > index access into tuples and lists? Dicts that were not O(1) for access
    > with non-pathological hashing? That we would accept sets having O()
    > performance worse than dicts?
    >
    > I suggest that we should agree on these guarantees and document them in
    > the core.


    While I think documenting them would be great for all programmers that
    care about practical and theoretical execution speed; I think including
    these implementation details in core documentation as a "guarantee"
    would be a bad idea for the reasons Terry outlined.

    One way of resolving that is by having two documentations (or two
    separate sections in the documentation) for:
    - Python -- the language -- documenting Python as an abstract language,
    this is the documentation which can be shared across all Python
    implementations. This will also be the specification for the Python
    language against which other implementations will be measured.
    - CPython -- the Python interpreter -- documents implementation details
    and performance metrics. It should be properly noted that these are not
    part of the language per se. This will be the playground for CPython
    experts that need to fine tune their applications to the last drop of
    blood and don't mind their application going nuts with the next release
    of CPython.
    Lie Ryan, Sep 1, 2010
    #20