Unexpected comparisons in dict lookup

S

Steven D'Aprano

I stumbled across this unexpected behaviour with Python 2.7 and 3.3. When
you look up a key in a dict, the key is sometimes compared against other
keys twice instead of just once.

First, a class that reports when it is being tested for equality, with a
fixed hash so that we get collisions inside the dict:

class Spam:
def __init__(self, label):
self.label = label
def __str__(self):
return self.label
def __hash__(self):
return 100
def __eq__(self, other):
print("comparing %s with %s" % (self, other))
return self is other


Create a dict and prepare for collisions:

x = Spam("x")
y = Spam("y")
d = {100: 1} # hash(100) == 100


When we add x to the dict, it collides with the existing key 100, so I
expect one call to Spam.__eq__, and that's exactly what I get:

py> d[x] = 200
comparing x with 100


But when I try adding y to the dict, it collides with 100, then it
collides with x, then it apparently collides with x a second time:

py> d[y] = 300
comparing y with 100
comparing x with y
comparing x with y


I expected only two calls to __eq__, not three: first comparing with 100,
then comparing with x. Checking for keys gives the same result:

py> x in d
comparing x with 100
True
py> y in d
comparing y with 100
comparing x with y
comparing x with y
True


What's more, checking for a key which is not present also compares three
times instead of twice:

py> Spam("z") in d
comparing z with 100
comparing x with z
comparing x with z
comparing y with z
False


I expected it to compare z with 100, then x *once*, then y, then return
False.

Why is the dict lookup comparing against x twice? It doesn't seem to be
the fault of x. If I create a slightly different dict, with the
collisions in a different order:

py> e = {x: 100}
py> e[100] = 200
comparing x with 100
py> e[y] = 300
comparing x with y
comparing y with 100
comparing y with 100
 
I

Ian Kelly

I stumbled across this unexpected behaviour with Python 2.7 and 3.3. When
you look up a key in a dict, the key is sometimes compared against other
keys twice instead of just once.
From what I can see in the code, it adds a perturbation based on the
upper bits of the hash value to the probing scheme, to reduce
collisions for keys with unequal hashes. On the downside, this
cancels the guarantee that each bucket can only be checked at most
once. The perturbation gradually shifts to 0 after a few iterations,
so every bucket can still be reached within O(n) iterations.

See the comments starting at "Major subtleties ahead":
http://hg.python.org/cpython/file/f8b40d33e45d/Objects/dictobject.c#l106
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,755
Messages
2,569,534
Members
45,007
Latest member
obedient dusk

Latest Threads

Top