item access time: sets v. lists

D

David Isaac

Is it expected for access to set elements to be much
slower than access to list elements? Explanation?
Thanks,
Alan Isaac
3.9823075279120701
 
D

Dennis Benzinger

Is it expected for access to set elements to be much
slower than access to list elements? Explanation?
Thanks,
Alan Isaac
[...]

You're measuring the time for creating the xrange and the set/list too.
Create them before you call Timer() and repeat your timing.


Dennis Benzinger
 
P

Paul McGuire

David Isaac said:
Is it expected for access to set elements to be much
slower than access to list elements? Explanation?
Thanks,
Alan Isaac

3.9823075279120701
A couple of points:
1. Your use of timeit is a bit flawed. You are not only timing the access
into the set or list, but also the construction of the set/list, *every
time*. Use the second string arg (the one you are passing as "") for
one-time initialization, so that you measure only access, which is what you
say your are interested in.
2. Ooops, it turns out you're not really *accessing* the set/list, you are
*iterating* over it. Given that sets are implemented internally using a
tree structure, I would expect that iterating over a list could be faster,
although only marginally so.
3. Here are some stats for a little more correct timeits:

Construct list/set
set -> 1.89524618352
list -> 0.299499796762

Iterate over list/set
set -> 0.646887523414
list -> 0.552001162159

Random access to first item in list/set
set -> 0.000189409547861
list -> 0.000160076210804

Random access to item in list/set when item exists
set -> 0.000241650824337
list -> 0.0245168031132

Random access to item in list/set when item does not exist
set -> 0.000187733357172
list -> 0.522086186932


The code:
import timeit

print "\nConstruct list/set"
t1=timeit.Timer("z = set(xrange(10000))","")
t2=timeit.Timer("z = list(xrange(10000))","")
print "set ->", t1.timeit(1000)
print "list ->", t2.timeit(1000)

print "\nIterate over list/set"
t1=timeit.Timer("for i in z: pass","z = set(xrange(10000))")
t2=timeit.Timer("for i in z: pass", "z = list(xrange(10000))")
print "set ->", t1.timeit(1000)
print "list ->", t2.timeit(1000)

print "\nRandom access to first item in list/set"
t1=timeit.Timer("0 in z","z = set(xrange(10000))")
t2=timeit.Timer("0 in z", "z = list(xrange(10000))")
print "set ->", t1.timeit(1000)
print "list ->", t2.timeit(1000)

print "\nRandom access to item in list/set when item exists"
t1=timeit.Timer("500 in z","z = set(xrange(10000))")
t2=timeit.Timer("500 in z", "z = list(xrange(10000))")
print "set ->", t1.timeit(1000)
print "list ->", t2.timeit(1000)

print "\nRandom access to item in list/set when item does not exist"
t1=timeit.Timer("10500 in z","z = set(xrange(10000))")
t2=timeit.Timer("10500 in z", "z = list(xrange(10000))")
print "set ->", t1.timeit(1000)
print "list ->", t2.timeit(1000)


-- Paul
 
D

David Isaac

Paul said:
Random access to item in list/set when item exists
set -> 0.000241650824337
list -> 0.0245168031132

Random access to item in list/set when item does not exist
set -> 0.000187733357172
list -> 0.522086186932


OK, that's a much better set of answers
including to questions I did not
even know I wanted to ask until
I saw your post. But, how to
explain the above??

Thanks,
Alan Isaac
 
N

Neil Cerutti

OK, that's a much better set of answers
including to questions I did not
even know I wanted to ask until
I saw your post. But, how to
explain the above??

Look at the code again. It's not testing what it says it's
testing.

print "\nRandom access to first item in list/set"
t1=timeit.Timer("0 in z","z = set(xrange(10000))")
t2=timeit.Timer("0 in z", "z = list(xrange(10000))")
print "set ->", t1.timeit(1000)
print "list ->", t2.timeit(1000)

print "\nRandom access to item in list/set when item exists"
t1=timeit.Timer("500 in z","z = set(xrange(10000))")
t2=timeit.Timer("500 in z", "z = list(xrange(10000))")
print "set ->", t1.timeit(1000)
print "list ->", t2.timeit(1000)
 
P

Paul McGuire

Look at the code again. It's not testing what it says it's
testing.

It isnt?

The only quibble I can see is that there really is no "first" element in a
set. I picked the "0 in set" and "0 in list" to pick the fastest case for
list, which does a linear search through the list elements.

Where did I go wrong on the test descriptions?

-- Paul
 
P

Paul McGuire

David Isaac said:
OK, that's a much better set of answers
including to questions I did not
even know I wanted to ask until
I saw your post. But, how to
explain the above??

Thanks,
Alan Isaac

Searching for an item in a list is a linear search, and all 10000 items must
be checked before determining that a non-existent item is not in the list,
and sure enough, our list takes about 20 times as long to find that 10500 is
not in the range 10000 as it does to find 500 in the range 10000. By
contrast, a set (and also the keys in a dict) use a tree structure to index
more quickly into the list of items, so this access (and determination of
non-existence) will be much faster, especially so for a large list.

-- Paul
 
D

Duncan Booth

Paul McGuire said:
By contrast, a set (and also the keys in a dict) use a tree structure
to index more quickly into the list of items

'dict' and I believe also 'set' use a hash table, not a tree structure.
 
N

Neil Cerutti

It isnt?

The only quibble I can see is that there really is no "first"
element in a set. I picked the "0 in set" and "0 in list" to
pick the fastest case for list, which does a linear search
through the list elements.

Where did I go wrong on the test descriptions?

It seems to be timing "testing for membership", not "random
access". Random access is just seq[n]; at least, that's the
assumption I made.

Perhaps I read the test description out of context and got
confused.
 
S

sjdevnull

David said:
OK, that's a much better set of answers
including to questions I did not
even know I wanted to ask until
I saw your post. But, how to
explain the above??

The code was flawed, but it illustrates a point. To find an arbitrary
item in a list takes time proportional to the length of the list--on
average, you have to scan n/2 items in a list of length n to find it.
"x in list" really has to check item 0, item 1, item 2, etc until it
finds x or reaches the end of the list.

Obviously in a sorted list like this, finding 0 will be much faster
than finding 9999 since you have to scan almost the whole list to find
the latter. Of course, if you know the list is sorted you could use a
binary search (which will find the item in log(n) comparisons on
average).

So really to find a random item that is in the list is going to take
more like 0.250 seconds on average, give or take a little.

When the item doesn't occur in the list, you wind up having to check
every element to see if it's there. (binary search would work for this
case as well for a sorted list).

A set, on the other hand, uses a hash table so finding an element takes
constant time (it's one hash lookup, independent of the size of the
set)--and determining an item isn't there is likewise constant time.

(if your data is degenerate and you have many things in the set that
all hash to the same value then the lookup could take longer, but
that's pretty unlikely to make a big difference in practice with python
sets and user data).
 
P

Paul McGuire

Neil Cerutti said:
It seems to be timing "testing for membership", not "random
access". Random access is just seq[n]; at least, that's the
assumption I made.

Perhaps I read the test description out of context and got
confused.

Oh, poor wording on my part, sorry.

You got me wondering what seq[n] would give me in the case of a set:
Traceback (most recent call last):

It really doesn't make any sense since sets are unordered. So we couldn't
compare lists and sets in this manner anyway.

I *did* learn that dicts and sets use a hash table, not a tree, which also
helps explain why the initial construction of the 10000-element set takes so
much longer than just simple assembly of the range into a list.

-- Paul
 
P

Paul McGuire

Duncan Booth said:
'dict' and I believe also 'set' use a hash table, not a tree structure.

Thanks, I stand corrected. How do they know how big a hash table to use? I
think this is an interesting problem.

-- Paul
 
D

Duncan Booth

A set, on the other hand, uses a hash table so finding an element
takes constant time (it's one hash lookup, independent of the size of
the set)--and determining an item isn't there is likewise constant
time.
One hash calculation, but there could be collisions. Worst case for an n
element dict/set could be that it takes n attempts to find a value,
although unless you try to do it deliberately by ensuring that the keys
have identical hash values this won't happen in practice.

Paul said:
Thanks, I stand corrected. How do they know how big a hash table to
use? I think this is an interesting problem.

If I read the code correctly:

Minimum size is 8. Whenever more than 2/3 of the slots are in use
(including slots where the element has been deleted) the dictionary is
resized to the smallest power of 2 (greater than 8) which is greater than 4
times the number of currently occupied slots (or 2 times the number of
occupied slots when more than 50000 slots are occupied). This can reduce
the size if a lot of elements have been deleted.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,744
Messages
2,569,484
Members
44,903
Latest member
orderPeak8CBDGummies

Latest Threads

Top