Why is dictionary.keys() a list and not a set?

  • Thread starter Christoph Zwerschke
  • Start date

Fredrik Lundh

If I need two parallel lists (one key, one value), I can still use the
same [(k, v)] tuple, just access it as x[0], x[1].

that's a single list containing tuples, not two parallel lists.

this is two parallel lists:

k = d.keys()
v = d.values()
assert isinstance(k, list)
assert isinstance(v, list)
use(k, v)

</F>
 

bonono

Fredrik said:
If I need two parallel lists (one key, one value), I can still use the
same [(k, v)] tuple, just access it as x[0], x[1].

that's a single list containing tuples, not two parallel lists.

this is two parallel lists:

k = d.keys()
v = d.values()
assert isinstance(k, list)
assert isinstance(v, list)
use(k, v)
I know that it is a single list of tuples; I mean that it can be used
as well.

for k, _ in d.items(): print k
for _, v in d.items(): print v
for k in d.keys(): print k
for v in d.values(): print v

Would there be a noticeable performance difference?
 

Fredrik Lundh

I know that it is a single list of tuples; I mean that it can be used
as well.

for k, _ in d.items(): print k
for _, v in d.items(): print v
for k in d.keys(): print k
for v in d.values(): print v

Would there be a noticeable performance difference?

Sloppy use of print statements is a great way to produce misleading
benchmarks [1], since the time it takes to print lots of stuff tends to
overshadow the actual algorithm (for bonus points, print to a terminal
window, and use "time" to measure performance, so you get startup
times as well). If you don't necessarily want to print all the stuff, my
original assertion still holds:

"creating a list full of tuples just to pull them apart and throw
them all away isn't exactly the most efficient way to do things"

Consider a dictionary with one million items. The following operations

k = d.keys()
v = d.values()

create two list objects, while

i = d.items()

creates just over one million objects. In your "equivalent" example,
you're calling d.items() twice to produce two million objects, none
of which you really care about.

</F>

1) careful use of timeit and generator expressions is another one:
http://online.effbot.org/2005_01_01_archive.htm#20050125 (code)
http://online.effbot.org/2005_01_01_archive.htm#20050127 (explanation)
 

Duncan Booth

As for the (k,v) vs (v,k), I still don't think it is a good example. I
can always use an index to access the tuple elements, and most other
functions expect the first element to be the key. For example:

a = d.items()
# do something with a
b = dict(a)

But using the imaginary example:

a = zip(d.values(), d.keys())
# do something with a
b = dict(a)  # probably not what I want.

The typical use case for the (v,k) tuple would be when you want to sort the
contents of a dictionary by value. e.g. counting the number of occurrences
of each word in a document.

a = zip(d.values(), d.keys())
a.sort()
a.reverse()
print "Today's top 10:"
print str.join(',', [ k for (v,k) in a[:10]])

Ok, so today you can do that another way, but before there was a key
parameter to sort you had to build the tuple somehow:

print str.join(',', sorted(d.iterkeys(),
key=d.__getitem__, reverse=True)[:10])

and the advantage of the latter is of course that it works even when you've
been counting the number of occurrences of complex numbers in a document :)
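
For concreteness, here is a complete, runnable version of the word-count
example above (a sketch, Python 2; the sample text and the top-3 cutoff
are made up for illustration):

# Count word occurrences, then report the most frequent words.
text = "the quick brown fox jumps over the lazy dog the fox"
d = {}
for word in text.split():
    d[word] = d.get(word, 0) + 1

# Old style: build (value, key) tuples so tuple comparison sorts by count.
a = zip(d.values(), d.keys())
a.sort()
a.reverse()
print "Top 3:", str.join(',', [k for (v, k) in a[:3]])

# Newer style: let sorted() look the counts up via a key function.
# (Ties among equal counts may come out in a different order.)
print "Top 3:", str.join(',', sorted(d.iterkeys(),
                                     key=d.__getitem__, reverse=True)[:3])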
 

bonono

Fredrik said:
I know that it is a single list of tuples; I mean that it can be used
as well.

for k, _ in d.items(): print k
for _, v in d.items(): print v
for k in d.keys(): print k
for v in d.values(): print v

Would there be a noticeable performance difference?

Sloppy use of print statements is a great way to produce misleading
benchmarks [1], since the time it takes to print lots of stuff tends to
overshadow the actual algorithm (for bonus points, print to a terminal
window, and use "time" to measure performance, so you get startup
times as well). If you don't necessarily want to print all the stuff, my
original assertion still holds:

"creating a list full of tuples just to pull them apart and throw
them all away isn't exactly the most efficient way to do things"

Consider a dictionary with one million items. The following operations

k = d.keys()
v = d.values()

create two list objects, while

i = d.items()

creates just over one million objects. In your "equivalent" example,

Sorry, you lost me here. Could you explain in detail what you mean by
"two list objects" vs. one million objects?
 

bonono

Fredrik said:
I know that it is a single list of tuples; I mean that it can be used
as well.

for k, _ in d.items(): print k
for _, v in d.items(): print v
for k in d.keys(): print k
for v in d.values(): print v

Would there be a noticeable performance difference?

Sloppy use of print statements is a great way to produce misleading
benchmarks [1], since the time it takes to print lots of stuff tends to
overshadow the actual algorithm (for bonus points, print to a terminal
window, and use "time" to measure performance, so you get startup
times as well). If you don't necessarily want to print all the stuff, my
original assertion still holds:
BTW, it was not intended as a speed comparison as such, but more about
the question of whether the four "for" loops differ in speed. The print
is just there to complete the statement; I believe it could also be
"pass" for performance-testing purposes.
 

bonono

Fredrik said:
Consider a dictionary with one million items. The following operations

k = d.keys()
v = d.values()

create two list objects, while

i = d.items()

creates just over one million objects. In your "equivalent" example,
you're calling d.items() twice to produce two million objects, none
of which you really care about.

This is what I get from the doc:

    a.items()     a copy of a's list of (key, value) pairs    (3)
    a.keys()      a copy of a's list of keys                  (3)
    a.values()    a copy of a's list of values

I can't derive what you mean by "two list objects" vs "million
objects".
 

Duncan Booth

Sorry, you lost me here. Could you explain in detail what you mean by
"two list objects" vs. one million objects?

It should be fairly clear. 'd.keys()' creates one new object (a list).
'd.items()' creates a list of tuples, so that is one new list and one
million new tuples.
 

Fredrik Lundh

This is what I get from the doc:

    a.items()     a copy of a's list of (key, value) pairs    (3)
    a.keys()      a copy of a's list of keys                  (3)
    a.values()    a copy of a's list of values

I can't derive what you mean by "two list objects" vs "million
objects".

The "copy of a's list of" is a list object.

The "pairs" are tuples, which are objects too. For a dictionary with
one million items, the "items" method has to create one million "pair"
tuples before it can return them to you... (the difference between
"items" and "iteritems" is that the latter doesn't create all of them
up front; it still has to create them, though).

In Python, everything that you can put in a variable or otherwise save
for a later time is an object. All objects must be created. Creating a
new object is often very cheap, but the cost is never zero.

</F>
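
A small sketch of the items/iteritems difference described above
(Python 2; the five-item dict is made up for illustration):

d = dict.fromkeys(range(5))

pairs = d.items()      # a new list plus five new pair tuples, built up front
it = d.iteritems()     # an iterator: no pair tuples created yet
first = it.next()      # one pair tuple created on demand
print first            # e.g. (0, None)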
 

bonono

Fredrik said:
The "copy of a's list of" is a list object.

The "pairs" are tuples, which are objects too. For a dictionary with
one million items, the "items" method has to create one million "pair"
tuples before it can return them to you... (the difference between
"items" and "iteritems" is that the latter doesn't create all of them
up front; it still has to create them, though).

In Python, everything that you can put in a variable or otherwise save
for a later time is an object. All objects must be created. Creating a
new object is often very cheap, but the cost is never zero.

I have redone the timeit test:

D:\Python24\Lib>..\python timeit.py -s "d=dict.fromkeys(range(100000))" "zip(d.keys(),d.values())"
10 loops, best of 3: 158 msec per loop

D:\Python24\Lib>..\python timeit.py -s "d=dict.fromkeys(range(100000))" "d.items()"
10 loops, best of 3: 129 msec per loop

D:\Python24\Lib>..\python timeit.py -s "d=dict.fromkeys(range(100000))" "d.keys()"
10 loops, best of 3: 33.2 msec per loop

D:\Python24\Lib>..\python timeit.py -s "d=dict.fromkeys(range(100000))" "d.values()"
100 loops, best of 3: 19.6 msec per loop

These results make more sense. However, I am still puzzled:

1. Why do d.keys()/d.values() count as creating only one list object
each? Isn't the result a list of 1M elements (the keys or the values),
just as with items()?

2. Back to the original question of the guaranteed correspondence of
keys/values:

If I don't need them in sync, the guarantee is a moot point, and in that
case I would use keys/values when appropriate (or the iter versions).

If I need them in sync (meaning I would like to zip() them later), the
zip() time is longer than just taking items(), so I don't gain any
performance advantage.

So only if I use keys()/values() as separate k and v lists, that is,
keeping my own index, would I get the speed (and probably memory)
advantage. This, however, seems more difficult to maintain than just
using the item tuples.

And would it be best to just use iteritems() in 90% of cases, unless of
course I need to update the dict or need a snapshot? iteritems() doesn't
create the list and only produces the pairs when needed. But then, there
may be a performance penalty compared with creating the list up front if
I need them all. And of course, I cannot slice it with [:].
 

Fredrik Lundh

These results make more sense. However, I am still puzzled:

1. Why do d.keys()/d.values() count as creating only one list object
each? Isn't the result a list of 1M elements (the keys or the values),
just as with items()?

"keys" returns a single list object which contains references to existing
key objects. "items" has to create one million new pair objects (which
in turn hold references to existing key and value objects).

creating a new object is a lot more expensive than adding another
reference to an existing object

(in a fully GC'ed implementation, the cost is zero. if you have a pointer
to an object, you have a reference to it. in CPython, you need to adjust
the reference count, so the cost is a couple of machine instructions)

</F>
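
A quick interpreter check of this point (a sketch, assuming CPython 2
semantics; the toy dict is made up for illustration):

d = {'a': 1, 'b': 2}

k1, k2 = d.keys(), d.keys()
assert k1[0] is k2[0]        # the same existing key object, re-referenced

i1, i2 = d.items(), d.items()
assert i1[0] == i2[0]        # equal pairs...
assert i1[0] is not i2[0]    # ...but each items() call built fresh tuples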
 

bonono

Fredrik said:
"keys" returns a single list object which contains references to existing
key objects. "items" has to create one million new pair objects (which
in turn hold references to existing key and value objects).
Ah, that is clearer now. So keys()/values() create a list of 1M
references (object identities?) to existing objects; items() also
contains 1M references, but to 1M newly created tuples.

How about the other puzzle of whether the separate keys()/values() pair
has a real performance gain in real use?
 

Christoph Zwerschke

Mike said:
Are you sure dict.values() should be a set? After all, values aren't
guaranteed to be unique, so dict(a = 1, b = 1).values() currently
returns [1, 1], but would return set([1]) under your proposal.

Good point. Unlike the keys, the values are not unique. Still, it
would make sense to only return the set (the value set of the mapping)
because this is usually what you are interested in and you'd think the
information about which value belongs to which key is lost anyway.

However (see other postings in this thread) this last statement is
wrong: If you call keys() and values() consecutively, then they are
guaranteed to correspond to each other. If values() returned a set,
this would be completely broken.
What about dict.items()?

Actually, if keys() were a set, this should return a set, too (the
set of key/value tuples).
"in" already works for dictionaries:

I know. I wanted to use it as an example where set operations have
already been made available for dictionaries.

I know, 'in' has existed for lists and tuples long before sets, but
actually it is a native set operation.

From a mathematical point of view, it would have been nice if Python
had defined "set" as a basic data type in the beginning, with lists,
tuples, dictionaries as derived data types.

By the way, I wonder why the common mathematical notation { 1,2,3 } was
not allowed for set((1,2,3)). It would not clash with the dictionary
notation which requires additional colons.

-- Christoph
 

Christoph Zwerschke

Martin said:
You answer the question whether it would be possible (no, because of
backwards compatibility); you don't attempt to answer the question
whether it would be desirable. Why do you think that would be
a good idea?

It was meant simply as an open question to be discussed here; I did not
say that I think it would be a good idea. Sometimes it's good to
question why certain things are set up the way they are.

I considered the question not completely absurd, because if a language
supports sets and dictionaries, it would be a natural consequence that
the keys of a dictionary are a set. (We also discussed "ordered
dictionaries" here, where it is obvious that the keys are a list.) At
least from a mathematical/aesthetic point of view.

If you consider compatibility and usability aspects, making them sets
would probably not be a good idea.
If you want the set of keys of a dictionary d, then do set(d):

>>> d={1:2,3:4}
>>> set(d)
set([1, 3])

I know. But this does not answer the question: If the keys *are* already
a set according to their semantics, why aren't they returned as a set
from the beginning?

Better answers:

- Compatibility issues, particularly the keys()/values() matching trick
- Because lists are still more native/pythonic objects than sets
As Mike Meyer explains, the same is meaningless for .values():
they might not be unique. For .items(), conversion into a set
would appear to be meaningful, but doesn't actually work:
if the values contain unhashable objects, the set of tuples
cannot be constructed (as set members must be immutable).

I would not say that a values() set would be meaningless; it is an
interesting object in itself. It would however lose the information
about the "frequency" of the values.

But your second objection is a valid one. So I can add to the list of
reasons why sets are not used for values() and items():

- Because sets can only contain immutable values

This, of course, in turn raises the question ;-) Would it be desirable
to have an additional, more general set datatype that can contain
mutable objects? I know about the performance drawbacks. And I think I
know the answer: it would not make much sense, since the implementation
would actually be no different from a list. So you could take a list
anyway. Which gives the answer why values() and items() return a list.

Probably the only use of such a general set type would be that it could
be considered as an abstract supertype for list and set. And in cases
where the order does not matter, this could be made clearer by using
sets. (Similar to the reason why you use False and True when you could
use 0 and 1 instead.)

-- Christoph
 

Christoph Zwerschke

Martin said:
If you want the set of keys of a dictionary d, then do set(d):

>>> d={1:2,3:4}
>>> set(d)
set([1, 3])

I know. But this does not answer the question: If the keys *are* already
a set according to their semantics, why aren't they returned as a set
from the beginning?

Sorry. Your answer was good; I missed the point and thought you wrote
set(d.keys()). Is it documented anywhere that set(d) == set(d.keys())? I
think this should go into the Python Doco where keys() is explained.

I would have expected that set(d) returns set(d.items()), but as has
been discussed, this would cause problems with mutable values.
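
A one-line sanity check of that equivalence (a sketch; Python 2.4+,
where set is built in and iterating a dict yields its keys):

d = {1: 2, 3: 4}
assert set(d) == set(d.keys())   # iterating a dict yields its keys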

-- Christoph
 

Christoph Zwerschke

You can already get a set from a dictionary's keys in an efficient manner:

Set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Good point. I expected that set(d) == set(d.items()) and not
set(d.keys()), but the former would not work with mutable values. See
discussion with Martin.

-- Christoph
 

Mike Meyer

Christoph Zwerschke said:
- Because sets can only contain immutable values

Not true. Sets can only contain *hashable* objects, which isn't the
same thing.
This, of course, in turn raises the question ;-) Would it be desirable
to have an additional, more general set datatype that can contain
mutable objects?

You can do that now:

>>> class HashableList(list):
...     def __hash__(self): return id(self)
...
>>> s = set([HashableList(), HashableList()])
>>> s
set([[], []])
>>> for x in s:
...     x.append(3)
...
>>> s
set([[3], [3]])

This also illustrates the danger of this approach, in that I've got
two lists that will compare equal in the same set:

True

Of course, in this case the two objects aren't the same, which is what
is required for the mathematical definition of a set:

False

<mike
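
A self-contained script version of Mike's session (a sketch in Python 2;
HashableList follows his definition, hashing by identity):

# A list subclass that hashes by identity, so mutable instances can be
# set members.
class HashableList(list):
    def __hash__(self):
        return id(self)

a, b = HashableList(), HashableList()
s = set([a, b])
for x in s:
    x.append(3)

print s        # set([[3], [3]])
print a == b   # True: the two members compare equal...
print a is b   # False: ...yet remain distinct objects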
 

Alex Martelli

Christoph Zwerschke said:
Mike said:
Are you sure dict.values() should be a set? After all, values aren't
guaranteed to be unique, so dict(a = 1, b = 1).values() currently
returns [1, 1], but would return set([1]) under your proposal.

Good point. Unlike the keys, the values are not unique. Still, it
would make sense to only return the set (the value set of the mapping)
because this is usually what you are interested in and you'd think the

Absolutely not in my use cases. The typical reason I'm asking for
.values() is because each value is a count (e.g. of how many times the
key has occurred) and I want the total, so the typical idiom is
sum(d.values()) -- getting a result set would be an utter disaster, and
should I ever want it, set(d.values()) is absolutely trivial to code.

Note that in both cases I don't need a LIST, just an ITERATOR, so
itervalues would do just as well (and probably faster) except it looks
more cluttered and by a tiny margin less readable -- "the sum of the
values" rolls off the tongue, "the sum of the itervalues" doesn't!-)
So, the direction of change for Python 3000, when backwards
compatibility can be broken, is to return iterators rather than lists.
At that time, should you ever want a list, you'll say so explicitly, as
you do now when you want a set, i.e. list(d.values())
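
A minimal sketch of the counting idiom described here (Python 2; the
sample words are made up):

# Count occurrences per key, then total them.
counts = {}
for word in ['spam', 'egg', 'spam']:
    counts[word] = counts.get(word, 0) + 1

print sum(counts.values())      # 3 -- "the sum of the values"
print sum(counts.itervalues())  # same total, without building a list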

I know. I wanted to use it as an example where set operations have
already been made available for dictionaries.

I know, 'in' has existed for lists and tuples long before sets, but
actually it is a native set operation.

Historically, the 'in' operator was added to dictionaries (and the
special method __contains__ introduced, to let you support containment
checking in your own types) well before sets were added to Python (first
as a standard library module, only later as a built-in).

In Python today 'in' doesn't necessarily mean set membership, but some
fuzzier notion of "presence in container"; e.g., you can code

'zap' in 'bazapper'

and get the result True (while 'paz' in 'bazapper' would be False, even
though, if you thought of the strings as SETS rather than SEQUENCES of
characters, that would be absurd). So far, strings (plain and unicode)
are the only containers that read 'in' this way (presence of subsequence
rather than of single item), but one example should be enough to show
that "set membership" isn't exactly what the 'in' operator means.
From a mathematical point of view, it would have been nice if Python
had defined "set" as a basic data type in the beginning, with lists,
tuples, dictionaries as derived data types.

You appear to have a strange notion of "derived data type". In what
sense, for example, would a list BE-A set? It breaks all kinds of class
invariants, e.g. "number of items is identical to number of distinct
items", commutativity of addition, etc.


Alex
 

Mike Meyer

Christoph Zwerschke said:
I trusted the doco which defines a set as "an unordered collection of
immutable values" (http://docs.python.org/lib/types-set.html).

If the hash changes, you've screwed up the set. What it really should
say is "collection of objects with fixed hashes", or words to that
effect. Do you want to file a PR on this?
Anyway, the problem stays the same.

How so? As I demonstrated, you can subclass any class that doesn't
have a hash to add one, and then use the subclass, which - except for
having a hash - will have exactly the same behavior as your original
class.

<mike
 
