Dictionary .keys() and .values() should return a set [with Python 3000 in mind]

V

vatamane

This has been bothering me for a while. Just want to find out if it
just me or perhaps others have thought of this too: Why shouldn't the
keyset of a dictionary be represented as a set instead of a list? I
know that sets were introduced a lot later and lists/dictionaries were
used instead but I think "the only correct way" now is for the
dictionary keys and values to be sets. Presently {1:0,2:0,3:0}.keys()
will produce [1,2,3] but it could also produce [3,1,2] or [3,2,1] on a
different machine architecture. The Python documentation states that:
"""
Keys and values are listed in an _arbitrary_(my emphasis) order which
is non-random, varies across Python implementations, and depends on the
dictionary's history of insertions and deletions.
"""

So on the same machine it will be the case that: {1:0, 2:0,
3:0}.keys() == {3:0, 2:0, 1:0}.keys() is True. But if there are 2 lists
of keys of the same dictionary, one is pickled on another machine, with
a different "arbitrary non-random" ordering, then the keys will not be
equal to each other. It seems like the problem could be solved by
returning a set instead.

The same thing goes for the values(). Here most people will argue that
values are not necessarily unique, so a list is more appropriate, but
in fact the values are unique it is just that more than one key could
map to the same value. What is 'seen' in dictionary is the mapping
rule. Also the values are not ordered and should not be indexable --
they should be a set. Just as the keys(), one ordered list of values
from a dictionary on one machine will not necessarily be equal to
another list of values an on another machine, while in fact, they
should be.

On a more fundamental and general level, a dictionary is actually an
explicit function, also called a 'map'. A set of keys, traditionally
called a 'domain' are mapped to a set of values, traditionally called a
'range'. This mapping produces at most a surjective function (i.e. two
or more keys can map to same value and all the values are mapped to by
some keys). Notice that the traditional counterparts to keys and
values are sets and not just lists. This seems like theory babble, but
in the case of Python staying 'true' to the theory is usually a
GoodThing(tm).

I love Python primarily because it does something in only one, correct,
and reasonable way. The same principle should probably apply to Python
itself including to its built-in data structures.

Of course the compatibilty will be broken. Any code relying on a
certain ordering of keys() or values() returned would need to be
updated. One could argue though that such code was not entirely correct
in the first place to asssume that, and should be fixed anyway.

Obviously this fix should not be in Python 2.X but perhaps would be
worth considering for Python 3000. What does everyone think?
 
C

cmdrrickhunter

There's a few good reasons.
1 - golden handcuffs. Breaking old code is bad 90% of the time
2 - creating a set MAY be slower.

Python's sets seem to imply to that they will always be a hash map. in
this case, some creative hash map "mapping" could allow one to create a
set without calculating hash codes (make the set hashmap have the same
dimentions and rules as the dictionary one).
If there was intent to allow Python implementations to use trees for
the set, then a list is far faster to create (O(n) time instead of
O(nlogn)).

3 - using a set is sometimes slower (just as using a list is sometimes
slower)
I can't speak for your code, but this is the most common use of keys in
my coding:
# d is some dictionary
keys = d.keys()
keys.sort()
for k in keys:
#blah

sets cannot be sorted, while lists can. If keys() returned a set, then
I'd have to turn it into a list every time.

There's potential to add "views" to python (a key view of a dictionary
being a frozenset containing the dictionary's keys, which is updated
whenever the dictionary updates), but I think thats annother topic
which is out of the scope of your origional idea.
 
S

Scott David Daniels

I can't speak for your code, but this is the most common use of keys in
my coding:
# d is some dictionary
keys = d.keys()
keys.sort()
for k in keys:
#blah

This you can rewrite quite effectively as:

for k in sorted(d):
#blah

--Scott David Daniels
(e-mail address removed)
 
N

Nick Vatamaniuc

1 - golden handcuffs. Breaking old code is bad 90% of the time
I agree with you on that, mostly for code that counted on list methods
of result of keys() - like you later show with sort. But there is a
side note: old code that assumed a particular ordering of the keys or
values is broken anyway. So even if ks={1:0,2:0,3:0}.keys() returns
[1,2,3] on my machine I should not do something like
'my_savings_account + ks[0]' That code should be fixed anyway, since on
a different machine it might produce different values for ks[0].
2 - creating a set MAY be slower.
Creating a set from the dictionary's keys should not be a lot slower
because the keys are already unique, there is no need to check each key
against the other keys just return them as a set. I assume this is
what you meant by "make the set hashmap have the same dimensions and
rules as the dictionary one". Perhaps a dictionary would internally
just copy its keys to the set and return it rather than construct as
set from scratch (with duplication checks and all).
3 - using a set is sometimes slower
Again, depending how it is used. In your case you argue that you
usually sort the keys anyway so a list is more convinient. But
different use cases can call for differnent operations on the keys()
after they have been retrieved. What if someone wants to do an
intersection to find common keys with another dictionary, then a set
would be more appropriate. The intent of the set() type was to not temp
anyone into assuming an ordering of keys() just because a list is
indexable. And eliminate the need for a long footnote in the
documentation of the dict type that talks about 'an arbitrary
non-random ordering' - it takes while just to grasp what that means...

In general I believe that a small performance penalty is acceptable in
order to have a correct and consistent data type, especially for Python
i.e. I might not argue the same for Perl or C.

-Nick V.
 
R

Robert Kern

The same thing goes for the values(). Here most people will argue that
values are not necessarily unique, so a list is more appropriate, but
in fact the values are unique it is just that more than one key could
map to the same value. What is 'seen' in dictionary is the mapping
rule. Also the values are not ordered and should not be indexable --
they should be a set. Just as the keys(), one ordered list of values
from a dictionary on one machine will not necessarily be equal to
another list of values an on another machine, while in fact, they
should be.

This part is pretty much a non-starter. Not all Python objects are hashable.

In [1]: d = {}

In [2]: for i in range(1, 10):
...: d = range(i)
...:
...:

In [3]: set(d.values())
---------------------------------------------------------------------------
exceptions.TypeError Traceback (most recent call
last)

/Users/kern/<ipython console>

TypeError: list objects are unhashable


Also, I may need keys to map to different objects that happen to be equal.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 
J

John Machin

This has been bothering me for a while.
[snip]

Summary of OP's post: d.keys() and d.values() should return sets in
Python 3.0.

Observations:
(1) My code [some of which dates back to Python 1.5.1] uses about 12 x
d.items() and 11 x d.keys() for each 1 x d.values()
(2) Most cases of d.X() don't need/want the untrammelled list and could
be replaced by d.iterX(). Example: max_frequency = max(tally.values())

Opinion: d.X() could be deprecated, but I'd rather see a
consciousness-raising for the d.iterX() methods, and for the construct
for key in d:

Cheers,
John
 
R

Roy Smith

Nick Vatamaniuc said:
But there is a side note: old code that assumed a particular ordering
of the keys or values is broken anyway.

From a testing point of view, it would be interesting if there was a flag
which said, "Deliberately change everything which isn't guaranteed to be a
specific way". So, for example, dict.keys() would return a list in reverse
order of how it normally does it (even if it cost more to do it that way).
An alternate hash key generator would be slipped into place. Floating
point math would get a little noise added to the least significant bits.
And so on. Might be interesting to see what sorts of bugs that would shake
out from user code.
 
N

Nick Vatamaniuc

You are correct I should have thought of that. I still think the keys()
method should return a set() though.

Robert said:
The same thing goes for the values(). Here most people will argue that
values are not necessarily unique, so a list is more appropriate, but
in fact the values are unique it is just that more than one key could
map to the same value. What is 'seen' in dictionary is the mapping
rule. Also the values are not ordered and should not be indexable --
they should be a set. Just as the keys(), one ordered list of values
from a dictionary on one machine will not necessarily be equal to
another list of values an on another machine, while in fact, they
should be.

This part is pretty much a non-starter. Not all Python objects are hashable.

In [1]: d = {}

In [2]: for i in range(1, 10):
...: d = range(i)
...:
...:

In [3]: set(d.values())
---------------------------------------------------------------------------
exceptions.TypeError Traceback (most recent call
last)

/Users/kern/<ipython console>

TypeError: list objects are unhashable


Also, I may need keys to map to different objects that happen to be equal.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 
P

Paul McGuire

This has been bothering me for a while. Just want to find out if it
just me or perhaps others have thought of this too: Why shouldn't the
keyset of a dictionary be represented as a set instead of a list?

I think this is an interesting suggestion. Of course, the current situation
is as much a product of historical progression as anything: lists and dicts
pre-date sets, so the collection of keys had to be returned as a list.
Since lists also carry some concept of order to them, the documentation for
the list returned by dict.keys() went out of its way to say that the order
of the elements in the dict.keys() list had no bearing on the dict, the
insertion order of entries, or anything else, that the order of keys was
purely arbitrary.

In fact, there is not a little irony to this proposal, since it seems it was
just a few months ago that c.l.py had just about weekly requests for how to
create an "ordered dict," with various ideas of how a dict should be
ordered, but most intended to preserve the order of insertion of items into
the dict. And now here we have just about the opposite proposal - dicts
should not only *not* be ordered, they should revel in their disorderliness.

I liked the example in which the OP (of this thread) wanted to compare the
keys from two different dicts, for equality of keys. Since the keys()
method returns a set of indeterminate order, we can't simply perform
dictA.keys() == dictB.keys(). But how often does this really happen? In
practice, I think the keys of a dict, when this collection is used at all,
are usually sorted and then iterated over, usually to prettyprint the keys
and values in the dict. Internally, this set of items shouldn't even exist
as a separate data structure - the dict's keys are merely labels on nodes in
some sort of hash tree.

Now that we really do have sets in Python, if one were to design dicts all
over again, it does seem like set would be a better choice than list for the
type returned by the keys() method. To preserve compatibility, would a
second method, keyset(), do the trick? The OP used this term himself,
referring to "the keyset of the dictionary". Still given the few cases
where I would access this as a true set, using "set(dictA.keys())" doesn't
seem to be that big a hardship, and its obviousness will probably undercut
any push for a separate keyset() method.

-- Paul
 
P

Paddy

This has been bothering me for a while. Just want to find out if it
just me or perhaps others have thought of this too: Why shouldn't the
keyset of a dictionary be represented as a set instead of a list?

I think the order of the items returned by keys() and values() are
related. I decided on a short empirical test:
import random
n=50
d = dict((i,random.randint(0,n-1)) for i in range(n))
k,v = d.keys(), d.values()
[d[k] == v for i in range(n)]

[True, True, True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True, True, True, True,
True, True, True]
## for larger n try
# False in [d[k] == v for i in range(n)]


The order of keys() and the order of values() are related, so a change
to returning sets would also loose this functionality.

Mind you, Never rely on that implied ordering. Always use items().

And so *if* sets were returned for keys() and values() then it should
for items() too.

- Paddy.
 
B

bearophileHUGS

Paddy:
Mind you, Never rely on that implied ordering. Always use items().

Using dict.items() is probably better, but the manual says:
If items(), keys(), values(), iteritems(), iterkeys(), and itervalues() are called with no intervening modifications to the dictionary, the lists will directly correspond. This allows the creation of (value, key) pairs using zip(): "pairs = zip(a.values(), a.keys())". The same relationship holds for the iterkeys() and itervalues() methods:<

Is this going to change?


dict.keyset() seems nice, but you usually don't want to make a too much
big API.
Keeping APIs small is very important in Python, otherwise you need the
manual to write code.
I think a better solution to solve such key set problems is to optimize
Python itself, so Python computes set(dict) really fast (it can just
"copies" the hash of the dict).

Bye,
bearophile
 
S

Simon Forman

So, values() can't return a set because of (at least) the two reasons
given above. And since, as Scott David Daniels pointed out, dicts
support the iterator protocol, you can ask for a set of the keys easily
enough if you want it:
set(['a', 'c', 'b'])

So, IMHO, there's not much point to having keys() return a set.


Peace,
~Simon
 
T

Terry Reedy

The meaning of dict.keys, etc, will not change for the 2.x series. For
3.0, I believe that Guido already intends that .keys() no longer return a
separate list. For one thing, a major, if not the main use, of the method
is for iteration, as in 'for keys in d.keys():'. For this, creating and
them discarding a separate list is inefficient and, in a sense, silly.

One option is to make .keys be what .iterkeys is today. Another, more
recent, is to make .keys return a new iterable dict view object, details
not yet worked out. Either way, one would make an independent set or list
with set(d.keys()) or list(d.keys)).

Terry Jan Reedy
 
P

Piet van Oostrum

P> I think the order of the items returned by keys() and values() are
P> related. I decided on a short empirical test:

yes, it is documented that their order is related. In fact
d.items() == zip(d.keys(), d.values())
This wouldn't work with sets instead of lists.
 
A

Antoon Pardon

There's a few good reasons.
1 - golden handcuffs. Breaking old code is bad 90% of the time
2 - creating a set MAY be slower.

Python's sets seem to imply to that they will always be a hash map. in
this case, some creative hash map "mapping" could allow one to create a
set without calculating hash codes (make the set hashmap have the same
dimentions and rules as the dictionary one).
If there was intent to allow Python implementations to use trees for
the set, then a list is far faster to create (O(n) time instead of
O(nlogn)).

3 - using a set is sometimes slower (just as using a list is sometimes
slower)
I can't speak for your code, but this is the most common use of keys in
my coding:
# d is some dictionary
keys = d.keys()
keys.sort()
for k in keys:
#blah

Wouldn't you be better of with a tree instead of dictionary? Maybe
there are other reasons to prefer a dict, but have a look at:

http://www.pardon-sleeuwaegen.be/antoon/avltree.html

Suppose t is a tree implementing the same mapping a your dictionary
d above, the code would be:

# t is some tree
for k in t:
#blah

And the keys will be treated in order.


If you try it, let me know what you think of it.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,744
Messages
2,569,483
Members
44,903
Latest member
orderPeak8CBDGummies

Latest Threads

Top