Why is dictionary.keys() a list and not a set?

  • Thread starter Christoph Zwerschke
  • Start date
M

Mike Meyer

A related issue, from the doc :

Set elements are like dictionary keys; they need to define both
__hash__ and __eq__ methods.

This is wrong. The objects must have a hash value to be used as a set
element or a dictionary key. Objects without a __hash__ use their id
as a hash value if they don't define their own comparisons. This makes
sense, because objects that don't define their own comparisons use id
testing for equality testing.

The __eq__ comment is even worse - __eq__ is unused on set elements or
dictionary keys. However, the *semantics* of sets and keys are such
that if hash(x) == hash(y), then x == y should also be true. If that's
false, then lots of the statements about dictionaries and sets are
going to be wrong.

The behavior of tuples is a different issue: hashing a non-hashable
object raises a TypeError. So does hashing a tuple that contains a
non-hashable object. This doesn't matter for most containers: lists
aren't hashable in any case; strings, unicode strings and xrange's, as
they contain only hashable objects. Buffers are poorly documented.

The dictionary doc comment is closer to right:

A dictionary's keys are almost arbitrary values. Only values
containing lists, dictionaries or other mutable types (that are
compared by value rather than by object identity) may not be used as
keys.

Proposed new wording, for both places:

A dictionaries keys (set element) can be any hashable value. Immutable
objects are hashable, except that a tuple containing a non-hashable
object isn't hashable. Mutable objects are hashable if they define
__hash__ or if they don't define comparison. A class or type that
defines both comparison and __hash__ should define them so that two
objects that have the same hash compare equal, otherwise using them as
dictionary keys (set elements) will not have the expected behavior.

What do the rest of you think?

<mike
 
B

bonono

Duncan said:
As for the (k,v) vs (v,k), I still don't think it is a good example. I
can always use index to access the tuple elements and most other
functions expect the first element to be the key. For example :

a=d.items()
do something about a
b = dict(a)

But using the imaginary example :

a = zip(d.values(), d.keys())
do something about a
b = dict(a) # probably not what I want.

The typical use case for the (v,k) tuple would be when you want sort the
contents of a dictionary by value. e.g. counting the number of occurrences
of each word in a document.

a = zip(d.values(), d.keys())
a.sort()
a.reverse()
print "Today's top 10:"
print str.join(',', [ k for (v,k) in a[:10]])

Ok, so today you can do that another way, but before there was a key
parameter to sort you had to build the tuple somehow:

print str.join(',', sorted(a.iterkeys(),
key=a.__getitem__, reverse=True)[:10])

and the advantage of the latter is of course that it works even when you've
been counting the number of occurences of complex numbers in a document :)

Thanks, so for newer python(2.3+?, don't know when the sort has been
enhanced), this (v,k) is again there for historical reason but not a
real use case any more.
 
C

Christoph Zwerschke

And this, again from the doc(about mapping objects):
A mapping object maps immutable values to arbitrary objects.
Seems that is questionable too.
a=(1,[])
d={}
d[a]=1
again would give TypeError, list object are unhashable.

That's a good example showing that the term 'mutable' is not so
well-defined as it may seem.

If you set b=[]; a=(1,b); should a be considered mutable (because you
can change its value by changing b), or should it be considered
immutable (because it is a tuple)?

-- Christoph
 
C

Christoph Zwerschke

As a general note, I think it would be good to place the exact
description in a footnote, since speaking about hashable objects,
__hash__ and __cmp__ will certainly frighten off newbies and make it
hard to read even for experienced users. The main text may lie/simplyfy
a little bit.

Or as Donald Knuth once recommended:

"Simplify. Lie, if it helps. You can add the correct details later on,
but it is essential to present the reader with something straightforward
to start off with. So don’t be afraid to bend the facts initially where
this leads to a useful simplification and then pay back the debt to
truth later by gradual elaborations."

However, since the Python docs are a reference and not a textbook or
manual, I think the debt should not be payed back "later", but
immediately in a footnote.

The docs already follow this principle very well, e.g.:
http://docs.python.org/lib/typesmapping.html

-- Christoph
 
M

Mike Meyer

Christoph Zwerschke said:
As a general note, I think it would be good to place the exact
description in a footnote, since speaking about hashable objects,
__hash__ and __cmp__ will certainly frighten off newbies and make it
hard to read even for experienced users. The main text may
lie/simplyfy a little bit.

Good point.
However, since the Python docs are a reference and not a textbook or
manual, I think the debt should not be payed back "later", but
immediately in a footnote.

Ok, how about this for dictionaries/sets:

Any hashable object can be used as a dictionary key (set member). Immutable
objects, except for tuples that contain a non-hashable object, are
hashable. Python classes are normally hashable(1).

And the footnote is:

Instances of Python classes are hashable if they define a __hash__
method, or if they don't define either __cmp__ or __eq__.

<mike
 
C

Christoph Zwerschke

Mike said:
Ok, how about this for dictionaries/sets:

Any hashable object can be used as a dictionary key (set member). Immutable
objects, except for tuples that contain a non-hashable object, are
hashable. Python classes are normally hashable(1).

I would be even more simple here, like that:

The keys can be arbitrary immutable objects. Thus lists, sets or
dictionaries are not allowed as keys. Tuples are allowed if they do not
directly or indirectly contain mutable objects. More exactly, the keys
are required to be hashable (1).

And then go on and define "hashable" in the footnot.

I think that is easy and understandable and covers the basic cases.
And the footnote is:

Instances of Python classes are hashable if they define a __hash__
method, or if they don't define either __cmp__ or __eq__.

I think that's not exact enough. For instance, ordinary lists also
define a __hash__ method but are not hashable since that method just
raises an exception. Also, as Martin pointed out, if both is there
(__hash__ and __cmp__/__eq__) you would also require of a "proper"
__hash__ method that equal objects hash the same. Otherwise, semantics
would suffer, you could have dicts with non-unique keys (i.e. keys which
are only distinct as objects, but not different as values).

-- Christoph
 
M

Mike Meyer

Christoph Zwerschke said:
I would be even more simple here, like that:
The keys can be arbitrary immutable objects. Thus lists, sets or
dictionaries are not allowed as keys. Tuples are allowed if they do
not directly or indirectly contain mutable objects. More exactly, the
keys are required to be hashable (1).
And then go on and define "hashable" in the footnot.
I think that is easy and understandable and covers the basic cases.

You're shifting the emphasis from hashable back to mutable, which is
the problem that I'm trying to fix. The mutability - or immutability -
of an object is irrelevant to the question of whether or not they can
be a dictionary key/set element. The critical property is that they
are hashable. *Most* immutable builtin types are hashable, and no
builtin mutable types are hashable, which deserves a mention. Once you
get away from the builtins, mutability simply stops mattering. Giving
mutability top billing is simply wrong.
I think that's not exact enough. For instance, ordinary lists also
define a __hash__ method but are not hashable since that method just
raises an exception.

Ordinary lists aren't Python classes.
Also, as Martin pointed out, if both is there
(__hash__ and __cmp__/__eq__) you would also require of a "proper"
__hash__ method that equal objects hash the same. Otherwise, semantics
would suffer, you could have dicts with non-unique keys (i.e. keys
which are only distinct as objects, but not different as values).

Um, actually, I pointed that out (possibly as well), and included it
in one of the earlier versions. I was contemplating moving it
elsewhere, and accidently left it out. It certainly deserves mention
somewhere; I'm just not sure that here is the right place. Along with
__hash__ seems more appropriate.

<mike
 
C

Christoph Zwerschke

Mike said:
> The mutability - or immutability - of an object is irrelevant to
> the question of whether or not they can be a dictionary key/set
> element. The critical property is that they are hashable.
> *Most* immutable builtin types are hashable, and no builtin
> mutable types are hashable, which deserves a mention.

I think the problem we are facing is that you must distinguish between
hashable/mutable *types* and hashable/mutable objects. You can have
*types* like tuple which are hashable/immutable themselves, but can have
*instances* that are mutable and not hashable, such as ([]).

For the built-in types I think objects are hashable if and only if they
are immutable (not 100% sure, but can you give a counterexample?).
That's why I got back to talk about mutability first. But I agree one
should also explain the problem of non-built-in types.

Maybe one can split the explanation and talk about the built-in
datatypes first and then explain the situation for user-defined types.
>
> Ordinary lists aren't Python classes.

In your first sentence, it was unclear whether "they" refered to the
class or the instance. But anyway, what about the following class:

class mylist(list):
pass

This class definitely has a __hash__ method. But instances of this class
are not hashable.

-- Christoph
 
?

=?ISO-8859-15?Q?=22Martin_v=2E_L=F6wis=22?=

Christoph said:
You need to know a third thing, namely that as an iterable object, a
dictionary returns the keys only, not the key/value tuples which would
also make some sense. If it had been designed that way, you could write

Right. I forgot to say that, but that is also explained in PEP 234.

Regards,
Martin
 
?

=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=

Mike said:
I would have phrased it as "identical" instead of equal. Sets should
hold unique elements, where "unique" depends on the use at hand. For
some uses, "==" might be appropriate. For others, "is" might be
appropriate.

Sets, as currently defined, require hashable objects, i.e. objects
which has same all the time and hash same as all other equal objects.
Phrasing it as "identical" would be incorrect: hashes currently operate
on equality, not identity.
Would you care to suggest an improved wording?

No - I have learned that I cannot write documentation that others can
understand. I usually write documentation which only attempts to be
correct, and others than make it understandable.

Regards,
Martin
 
M

Mike Meyer

Christoph Zwerschke said:
Mike said:
The mutability - or immutability - of an object is irrelevant to
the question of whether or not they can be a dictionary key/set
element. The critical property is that they are hashable.
*Most* immutable builtin types are hashable, and no builtin
mutable types are hashable, which deserves a mention.
I think the problem we are facing is that you must distinguish between
hashable/mutable *types* and hashable/mutable objects. You can have
*types* like tuple which are hashable/immutable themselves, but can
have *instances* that are mutable and not hashable, such as ([]).

I think you're trying to tweak the wrong definition. Types are
immutable if their instances are immutable. The problem is that python
defines the value of container objects to depend on their contents, so
the value of a container object can change even if the container
itself can't change. Tuples can't change - a tuple will always refer
to the objects it refers to when created. However, those objects can
change, which changes the value of the tuple without changing the
tuple. So whether or not a tuple containing a mutable object is
immutable depends on whether you define a immutable as "the object
can't change" or "the objects value can't change".
For the built-in types I think objects are hashable if and only if
they are immutable (not 100% sure, but can you give a
counterexample?). That's why I got back to talk about mutability
first. But I agree one should also explain the problem of non-built-in
types.

If you define immutable as "the object doesn't change", then the
counterexample is the one you misspelled above: ([],). If you define
immutable as "the value of the object can't change", then I don't have
a counterexample.
class mylist(list):
pass

This class definitely has a __hash__ method. But instances of this
class are not hashable.

Do you think it's really necessary to specify "has a __hash__ method
that returns a value", as opposed to just "has a __hash__ method"?

<mike
 
C

Christoph Zwerschke

Mike said:
Ok, how about this for dictionaries/sets:

Any hashable object can be used as a dictionary key (set member). Immutable
objects, except for tuples that contain a non-hashable object, are
hashable. Python classes are normally hashable(1).

The problem with the last sentence is that can be misunderstood (classes
themselves as objects are always hashable) and has little information
otherwise (what is "normal"?).

-- Christoph
 
C

Christoph Zwerschke

Mike said:
> I think you're trying to tweak the wrong definition. Types are
> immutable if their instances are immutable.

I'm not trying to tweak it, but to gain more clarity about what is
actually meant when you talk about "mutable" types and objects and
whether it suffices to use these terms as long as you speak about
built-in datatypes.

Tuples have been considered an immutable type since generations, becaue
the elements of a tuple cannot be changed (i.e. the elements themselves,
not their values). Likewise, they should be considered a hashable type
because they have a proper hash function that can be used if all
elements are hashable. Still, you can create instances of this
immutable/hashable data type which are mutable/not hashable. So in the
docs it will be important to be clear and emphasize that the *objects*
have to be immutable/hashable, not only their types.
> So whether or not a tuple containing a mutable object is
> immutable depends on whether you define a immutable as "the object
> can't change" or "the objects value can't change".

Right - if you have an actual tuple instance like (a,) and a is a list,
you can argue whether it should be considered mutable or not. But you
cannot argue whether it is hashable or not. It's simply not hashable. In
so far, speaking about whether an object is hashable is more precise
(well-defined) than asking whether it is mutable, you're right.
> Do you think it's really necessary to specify "has a __hash__ method
> that returns a value", as opposed to just "has a __hash__ method"?

I think it is necessary to require that it has a "proper" __hash__
function. As far as I understand you can always add a __hash__ method.
Not only invalid __hash__ methods like in this case:

class mylist0(list):
def __hash__(self): return None

But you can also easily add valid __hash__ methods:

class mylist1(list):
def __hash__(self): return 0815

class mylist2(list):
def __hash__(self): return id(self)

In the case of mylist1, everything is ok including semantics, but
performance suffers dramatically. In mylist2, performance is great, but
semantics suffers greatly.

Which of these user-defined types would you call "hashable"? Technically
speaking, both probably are, so you can indeed use list objects of both
types as keys of a dictionary - but I think there should be a hint in
the docs that you need to have a "proper" hash function if you want your
dictionary to be performant and show the usually expected behavior.

-- Christoph
 
M

Mike Meyer

Christoph Zwerschke said:
I'm not trying to tweak it, but to gain more clarity about what is
actually meant when you talk about "mutable" types and objects and
whether it suffices to use these terms as long as you speak about
built-in datatypes.

I think we both agree that mutability isn't adequate, because it's
really irrelevant to the question. What it seems to provide is a
convenient and usually well-understood way of describing which builtin
types can be used: "Mutable builtin types cannot be used as dictionary
keys." "Immutable builtin types can be used as dictionary keys,
except for certain tuples." That exception creates headaches. Then you
have to ponder things like "Is a file instance immutable?" because
they are hashable. Maybe we should just skip the idea completely.
I think it is necessary to require that it has a "proper" __hash__
function. As far as I understand you can always add a __hash__
method. Not only invalid __hash__ methods like in this case:
class mylist0(list):
def __hash__(self): return None
But you can also easily add valid __hash__ methods:

class mylist1(list):
def __hash__(self): return 0815

class mylist2(list):
def __hash__(self): return id(self)

In the case of mylist1, everything is ok including semantics, but
performance suffers dramatically. In mylist2, performance is great,
but semantics suffers greatly.
Which of these user-defined types would you call "hashable"?

The latter two, as the given __hash__ meets the specifications for
__hash__. __hash__ itself isn't enough to decide the question of
semantics. You can make mylist2 have the right semantics by adding
def __eq__(self, other): return self is other
to it.
Technically speaking, both probably are, so you can indeed use list
objects of both types as keys of a dictionary - but I think there
should be a hint in the docs that you need to have a "proper" hash
function if you want your dictionary to be performant and show the
usually expected behavior.

I agree - but where does that hint belong? I think it belongs in the
documentation for __hash__, not the documentation for sets and
dictionaries.

How about we document it in the spirit of Python's "try it and see"
philosophy:

Any hashable object can be used as a dictionary key/set element. An
object is hashable if calling hash on that object returns an integer.

That's 100% correct. However, is it to little information to be
useful? If so, then going back to the mutable/immutable thing seems to
be right:

Any hashable object can be used as a dictionary key/set
element. Instances of mutable builtins are not hashable. Instances of
immutable builtins are hashable, except for tuples that contain a
non-hashable value. Instances of classes are usually hashable(*).

Footnote *: An instance of a class is hashable if it has a __hash__
method, or does not have either a __cmp__ or __eq__ method. If you're
not sure, call hash() on the instance, and if it returns an integer,
the instance is hashable.

Either way, over in the documentation for __hash__, add a note that:

For instances of a class to work properly with builtin container
types, two instances that compare as equal must have the same hash
value. For best performance, hash values should be different as often
as possible while honoring the equality requirement. (or is that
wrong?)

<mike
 
?

=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=

Mike said:
The latter two, as the given __hash__ meets the specifications for
__hash__.

This is not true. The second definition of __hash__ does not meet
the specifications:

http://docs.python.org/ref/customization.html

"The only required property is that objects which compare equal have the
same hash value"

However, I can have two instances of mylist2 which compare equal,
yet have different hash values.

Regards,
Martin
 
C

Christoph Zwerschke

class mylist1(list):
> Mike Meyer wrote:
This is not true. The second definition of __hash__ does not meet
the specifications:
"The only required property is that objects which compare equal have the
same hash value"

As Mike has written in his last posting, you could easily "fix" that by
tweaking the equality relation as well. So technically speaking, Mike is
probably right. It would completely break common-sense semantics,
because for mylist2, if I have a=[1] and b=[1] then I will have a!=b.
This would not be very reasonable, but probably allowed by Python.

-- Christoph
 
?

=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=

Christoph said:
As Mike has written in his last posting, you could easily "fix" that by
tweaking the equality relation as well. So technically speaking, Mike is
probably right.

No. If you define both __hash__ and __eq__ consistently, then __hash__
would meet the specification. As posted in the example, __hash__ does
not meet the specification, contrary to Mike's claim that it does.
It would completely break common-sense semantics,
because for mylist2, if I have a=[1] and b=[1] then I will have a!=b.
This would not be very reasonable, but probably allowed by Python.

If you have a=[1] and b=[1], then *always* a==b; you cannot
change the semantics of list displays. To achieve !=, you
would need to have a=mylist2([1]), b=mylist([2]). Whether or not
it is common sense that classes named "mylist2" compare equal
if they consist of the same items, I don't know - I personally don't
have any intuition as to what a class named "mylist2" should or
should not do.

Just that it has "list" in its name provides not sufficient clue: for
example, IdentityDictionary has "dictionary" in its name, yet I would
*not* expect that equality is used to compare keys, but instead I
would expect that identity is used.

Regards,
Martin
 
M

Mike Meyer

Martin v. Löwis said:
This is not true. The second definition of __hash__ does not meet
the specifications:

The specification isn't on the __hash__ method, but on objects.
Instances of mylist2 that aren't otherwise tweaked won't meet the
specification.
http://docs.python.org/ref/customization.html
"The only required property is that objects which compare equal have the
same hash value"

That's pretty flaky. The same docs say:

Should return a 32-bit integer usable as a hash value for dictionary
operations.

the dictionary implementation requires that a key's hash value is
immutable

Maybe those are only "recommended" properties, but any class that
doesn't meet them is going to have instances with interesting
behaviors in sets and as dictionary keys.

FWIW, the mutable builtins have __hash__ methods that don't meet that
requirement. Instances of them don't have has values at all.
However, I can have two instances of mylist2 which compare equal,
yet have different hash values.

That's a problem with mylist2 instances, and may or may not be a
problem in the mylist2 __hash__ method. You can't tell just by
examining the __hash__ method whether or not it meets this
specification.

<mike
 
?

=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=

Mike said:
The specification isn't on the __hash__ method, but on objects.

What does it mean for a specification to be "on" something? The
specification I quoted is listed "under" the __hash__ heading
of the reference manual.
Instances of mylist2 that aren't otherwise tweaked won't meet the
specification.

Right, that's what I meant.
That's pretty flaky. The same docs say:

Should return a 32-bit integer usable as a hash value for dictionary
operations.

What is flaky about this?
the dictionary implementation requires that a key's hash value is
immutable

Indeed, the question here is: what does "an object is immutable"
mean in this context? You appear to read it as "an object whose
state does not change", however, what it really means is "an object
which will always use the same state in __hash__ and __eq__".
Maybe those are only "recommended" properties, but any class that
doesn't meet them is going to have instances with interesting
behaviors in sets and as dictionary keys.

Indeed, you can anything return you want in __hash__, or always
raise an exception. What precisely the result of dictionary
operations is when you do so is undefined in Python.
That's a problem with mylist2 instances, and may or may not be a
problem in the mylist2 __hash__ method. You can't tell just by
examining the __hash__ method whether or not it meets this
specification.

Correct. I would expect that this is the case for any specification
of __hash__ you can come up with: __hash__ would typically refer
to the state of the object, and whether or not the results meets
some specified requirement would also depend on what the state is.
(unless either the specification is trivial (e.g. "may or may
not return a value") or the implementation is trivial (e.g.
"return 0"))

Regards,
Martin
 
M

Mike Meyer

Martin v. Löwis said:
What does it mean for a specification to be "on" something? The
specification I quoted is listed "under" the __hash__ heading
of the reference manual.

"applies to"
What is flaky about this?

What's flaky is the claim in the docs that "the *only* requirement is
....". There are clearly other requirements, and they're listed in the
same paragraph as that requirement.
Indeed, the question here is: what does "an object is immutable"
mean in this context? You appear to read it as "an object whose
state does not change", however, what it really means is "an object
which will always use the same state in __hash__ and __eq__".

I've given up on trying to say what "immutable" means, because common
usage in the CS and Python communities is ambiguous when it comes to
container objects. You give an unambiguous definition, but it's
specific to python, and clashes with common usage in the CS and Python
communities. Every old-style class that doesn't define __hash__,
__eq__ or __cmp__ is immutable by that definition, even if it has
mutator methods. Similar comments apply to new-style classes, but
require more work to pin down.

That's the reason this thread went the direction it did - because I
started discussing fixing the docs to not use "immutable" as the
defining property.
Indeed, you can anything return you want in __hash__, or always
raise an exception. What precisely the result of dictionary
operations is when you do so is undefined in Python.

I certainly hope that if __hash__ raises an exception, dictionary
operations pass the exception on to their caller. For the other cases,
undefined behavior is perfectly reasonable.

Back onto topic:

If you want to use your definition of "immutable" in the docs for
dictionary keys and set elements, I think that the docs should
explicitly include that definition. We can probably come up with a
less ambiguous (does "same state" mean "same state variables" or
"variables that don't change value"?) wording as well.

Personally, I think we'd be better off to come up with a term for this
property that doesn't have a commonly understood meaning that has such
broad areas of disagreement with the property. I've been using
"hashable", which I would currently define as "has a __hash__ method
with the properties described in the __hash__ documentation, or does
not have either a __cmp__ or a __eq__ method." Listing which builtin
types are hashable and which aren't (and which vary on an
instance-by-instance basis) would seem to be worthwhile as well.

<mike
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,774
Messages
2,569,598
Members
45,152
Latest member
LorettaGur
Top