set, dict and other structures

B

bearophileHUGS

I'm frequently using Py2.4 sets, I find them quite useful, and I like
them, even if they seem a little slower than dicts. Sets also need the
same memory of dicts (can they be made to use less memory, not storing
values? Maybe this requires too much code rewriting).
I presume such sets are like this because they are kind of dicts. If
this is true, then converting a dict to a set (that means converting
just the keys; this is often useful for a successive processing with
set operators like intersection, etc) can be a very quick operation;
but this little code shows me that this isn't true, set-ify a dict is
even slower than doing it on a list. Can this operation made faster (by
people that writes Python interpreter)?

..from time import clock
..for p in xrange(16, 20):
.. n = 2**p
.. l = range(n)
.. t = clock()
.. set(l)
.. print n, round(clock()-t,3),
.. d = {}.fromkeys(xrange(n))
.. t = clock()
.. set(d)
.. print round(clock()-t,3)

==>
65536 0.082 0.1
131072 0.205 0.218
262144 0.45 0.562
524288 1.014 1.029

----------------------

Dicts and sets are two different data structures, but they share some
similarities. So instead of always converting dicts to sets, maybe some
fast set-like operations/methods can be added to dicts, like
intersection, mass removal, etc. (The result of *some* of such
operations between two dicts can be a regular set, and not a dict. This
is maybe not very "clear" from an formal/theoretical point of view).

-------------

I've studied the nice sets python module from Py2.3, and I am...
well... writing something similar, a (python) graph module for sparse
graphs. I've found 3 modules like that around on the Net, but I think
they can be improved (even if I'm not an expert of Python, and I know,
my code is probably quite rough/naive compared to "sets" module one...)
Graphs are complex data structures, so probably one person needs from a
graph data structure are different from other people needs, so it's
probably not easy to define a "standard" graph module fit for everyone.
Still, I think it can be useful to have a standard module to manage
them.
Do you think it be useful to add a (well made) standard module like
that?

Bear hugs,
Bearophile
 
R

Raymond Hettinger

I'm frequently using Py2.4 sets, I find them quite useful, and I like
them, even if they seem a little slower than dicts.

Glad to hear that you find them to be a useful abstraction.

Since sets are internally implemented as dicts, they should have almost
identical performance (net of a small difference for the time to forward the
calls).

Sets also need the
same memory of dicts (can they be made to use less memory, not storing
values? Maybe this requires too much code rewriting).

Yes, an alternate implementation could be written that would take less memory
(by about 1/3) and be a bit faster. The approach would be to copy relevant
sections of the dictionary implementation and then remove anything pertaining to
dict values. If the need arises and I find the time, this will probably happen
someday.

In the meantime, the current implementation is rock solid and gives great
performance. Have you encountered situations where a 1/3 memory savings and a
slight speed improvement would have made a critical difference in a real
application?


I presume such sets are like this because they are kind of dicts. If
this is true, then converting a dict to a set (that means converting
just the keys; this is often useful for a successive processing with
set operators like intersection, etc) can be a very quick operation;
but this little code shows me that this isn't true, set-ify a dict is
even slower than doing it on a list. Can this operation made faster (by
people that writes Python interpreter)?

Dicts take longer than lists because dict iteration takes longer than for lists.

If set-ifying a dictionary turns out to be an essential and time-critical
operation, it is possible to add some special case code for this. The question
is whether it is worth it. If you find the need is pressing, file a feature
request explaining why it is essential. Then, I'll implement it for you. On
the other hand, if this is a theoretical nice-to-have, then why clutter up the
code (with pre-sizing and forcing all values to True).


I've studied the nice sets python module from Py2.3, and I am...
well... writing something similar, a (python) graph module for sparse
graphs. I've found 3 modules like that around on the Net, but I think
they can be improved (even if I'm not an expert of Python, and I know,
my code is probably quite rough/naive compared to "sets" module one...)
Graphs are complex data structures, so probably one person needs from a
graph data structure are different from other people needs, so it's
probably not easy to define a "standard" graph module fit for everyone.
Still, I think it can be useful to have a standard module to manage
them.
Do you think it be useful to add a (well made) standard module like
that?

Start by posting a module outside the standard library; gathering user feedback;
and, if popular, propose it for inclusion in the standard library.

My guess is that there will be two issues. One is that no one implementation of
graphs seems to satisfy all users. The second is that only a small fraction of
Python users need for graph support (there is probably a much greater demand for
combinatorics, numeric/numarray, or financial modules). If the appeal is not
broad, it has little chance of making it into the standard library.


Raymond Hettinger
 
L

Leif K-Brooks

I'm frequently using Py2.4 sets, I find them quite useful, and I like
them, even if they seem a little slower than dicts.

They look exactly the same speed-wise to me:
3.0590860843658447
 
B

bearophileHUGS

You are really gentle Raymond Hettinger, but surely I'm not asking
you/anyone to write some code just for me; I don't have "real
applications", I'm just learning/playing.
Your words are quite good answers to most of my questions.
The only left small topic is about the group/set-like
operations/methods added to dicts; maybe some other person can comment
that idea too. The semantic of such operations can be a little tricky,
but sometimes they can avoid the use of sets (and the set-ify of
dicts), sometimes they can carry the dict values with them (even if
they work just on the dict keys), etc.

A bear hug,
Bearophile
 
G

Giovanni Bajo

Raymond said:
If set-ifying a dictionary turns out to be an essential and
time-critical operation, it is possible to add some special case code
for this. The question is whether it is worth it. If you find the
need is pressing, file a feature request explaining why it is
essential. Then, I'll implement it for you. On the other hand, if
this is a theoretical nice-to-have, then why clutter up the code
(with pre-sizing and forcing all values to True).


Just today I was writing some code where I wanted to use sets for the
abstraction (intersection, etc.), but also carry some values with them to
process. So, yes, I believe that having set-like abstraction for dictionaries
would be great. In fact, for a moment I wondered if dict.keys() was already of
type set in 2.4, because it could have made sense.

My concrete situation was a routine that was updating a current snapshot of
data (stored in a dictionary) with a new snapshot of data (another dictionary).
With 'updating' I mean some kind of algorithm where you have to do different
things for:

- items that were present in the current snapshot but not in the new snapshot
anymore (obsolete items)
- items that were not present in the current snapshot but are presente in the
new snapshot (new items)
- items that were present in both the current and the new snapshot (possibly
modified items)

So, this was a perfect suit for sets. Given my dictionaries d1 and d2, I would
have liked to be able to do:

for k,v in (d1 - d2).iteritems():
...

for k,v in (d2 - d1).iteritems():
...

for k,v in (d1 & d2).iteritems(): # uhm, not fully correct
...

Of course, there are some nitpicks. For instance, in the last case the
dictionary at the intersection should probably hold a tuple of two values, one
coming from each dictionary (since the intersection finds the items whose key
is present in both dictionaries, but the value can obviously be different). So
you'd probably need something:

for k,(v1,v2) in (d1 & d2).iteritems():
...

Another solution would be to say that set-like operations, like (d1 & d2),
return an iterator to a sequence of keys *only*, in this case the sequence of
keys available in both dictionaries.

I also tried a different approach, that is putting my data in some structure
that was suitable for a set (with __hash__ and __key__ methods that were caring
of the key part only):

class FakeForSet:
def __init__(self, k, v):
self.key = key
self.value = value
def __hash__(self):
return hash(self.key)
def __cmp__(self, other):
return cmp(self.key, other.key)

but this looked immediatly weird because I was lying to Python somehow. It then
became clear that it's not going to work, because when you go through (s1 & s2)
you do not know if the items you get are coming from s1 or s2, and that is
important because the value member is going to be different (even if you're
lying to Python saying that those items are equal anyway).

I know of course there are other ways of doing the same thing, like:

# intersection
for k in d1:
if k not in d2:
continue
v1, v2 = d1[k], d2[k]
...

But I don't think there is anything wrong in providing an abstraction for this,
especially since we already decided that set-like abstractions are useful.

So, FWIW, I would find set-like operations on dictionaries an useful addition
to Python. Besides, I guess they can be easily experimented and studied by
subclassing dict.
 
J

Jeremy Bowers

My guess is that there will be two issues. One is that no one
implementation of graphs seems to satisfy all users. The second is that
only a small fraction of Python users need for graph support (there is
probably a much greater demand for combinatorics, numeric/numarray, or
financial modules). If the appeal is not broad, it has little chance of
making it into the standard library.

This reminded me of a group of people that were supposed to have started
in on this:

http://www.python.org/moin/PythonGraphApi

It's too soon to call it dead, but I don't see a success pattern for it;
I'd say they foundered on trying to shoot for the moon in advance, instead
of starting to write code and incrementally improving it. (Hint, hint, if
any of you are reading this.)

Trying to create a graph library that meets any significant percentage of
the needs of the people who would use it is quite non-trivial. It's not
impossible, but it's a lot lot lot lot lot of work. I remember counting at
least 48 distinct types of graphs when I enumerated my concerns to that
project and I'm sure there are a few more dimensions that didn't make it
into that list (the Wiki talks about "hierachial" graphs which adds
another dimension I didn't count, for instance, and I'm sure there are
more...).
 
B

bearophileHUGS

Leif K-Brooks:
They look exactly the same speed-wise to me:

There's a little overhead, you can see it with a bigger test:

..from time import clock
..n = 1*10**6
..
..t1 = clock()
..d = set()
..for i in xrange(n):
.. d.add(i)
..t2 = clock()
..for i in xrange(n):
.. d.remove(i)
..t3 = clock()
..d = set(xrange(n))
..t4 = clock()
..print round(t2-t1,2), round(t3-t2,2), round(t4-t3,2), round(t4-t1,2)
..
..t1 = clock()
..d = {}
..for i in xrange(n):
.. d = None
..t2 = clock()
..for i in xrange(n):
.. del d
..t3 = clock()
..d = {}.fromkeys(xrange(n))
..t4 = clock()
..print round(t2-t1,2), round(t3-t2,2), round(t4-t3,2), round(t4-t1,2)

----------------------

Jeremy Bowers:
I don't see a success pattern for it; I'd say they foundered on trying
to shoot for the moon in advance, instead of starting to write code and
incrementally improving it.<

I agree, their ideas seem too much abstract and big (and the resulting
code looks sloow); I am aiming to something quite simpler, like the
"sets" python module of Py2.3 (but the after the python version
development & testing, a C version can be useful, just like the set
objects in Py2.4).
My (simple) graph module is already nearly done ^_^
Trying to create a graph library that meets any significant percentage
of
the needs of the people who would use it is quite non-trivial.<

This is right. But:
1) I've found some implementations of graphs:
- One inside Gato: http://www.zpr.uni-koeln.de/~gato/About/
- http://pygraphlib.sourceforge.net/dist/
- http://pygraphlib.sourceforge.net/doc/public/pygraphlib-module.html
- http://www.ics.uci.edu/~eppstein/PADS/
Some modules from Julien Burdy can be found, etc. I think I can find 5
different graph python modules, this number tells me that there's some
people that need/want them.
2) A graph data structure written in C probably cannot be the right one
for every person, because there are so many kinds of graphs... but if
it's fast and general enough and it's already there (and simple
enough!), then most people will start using it, adjusting their needs
to it...

----------------------

Giovanni Bajo:
In fact, for a moment I wondered if dict.keys() was already of type
set in 2.4, because it could have made sense.<

Well, I presume sometimes you need a list, so maybe something like
dict.keysset() can be a better idea.

Of course, there are some nitpicks.<
Yes.


For instance, in the last case the dictionary at the intersection
should probably hold a tuple of two values, one coming from each
dictionary (since the intersection finds the items whose key is present
in both dictionaries, but the value can obviously be different).<

There are other possibilities.

Another solution would be to say that set-like operations, like (d1 &
d2),
return an iterator to a sequence of keys *only*, in this case the
sequence of
keys available in both dictionaries.<

Right ^_^

Then let's see something:

a = {}.fromkeys(range(10),0)
b = {}.fromkeys(range(5,13),1)

1) a - b ==> dict
Its meaning can be something like:
r = {}
for k in set(a).difference(b): r[k] = a[k]

For functional programming people, it can even mean something like:
r = {}
scan(curry(r.__setitem__(Blank, a[k])), set(a).difference(b))
(Unlike map, scan does not build up a new expression to return.)


2) a + b ==> dict
Its meaning:
r = a.copy()
r.update(b)


3) The intersection is a bit tricky. I don't like the idea of resulting
a dict with tuple values, with both the values of the shared keys. The
other idea is nicer:
a.intersection(b)
Its possible meaning:
3a) set(a).intersection(b)
Alternative meaning:
3b) list(set(a).intersection(b))
(Instead of adding this intersection to dicts, it can just be made a
quite faster dict=>set conversion to make set(a).intersection(b) fast.)


4) Another possibility is less common, but easy to implement, so this
is optional:
a.xor(b)
Its meaning:
r = {}
for x in a: if x not in b: r[x] = r[a]
for x in b: if x not in a: r[x] = r

(Probably there are better ways to implement them, here I've just used
python as a kind of pseudocode).

Bear hugs,
Bearophile
 
B

bearophileHUGS

r = {}
for x in a: if x not in b: r[x] = r[a]
for x in b: if x not in a: r[x] = r

I know, this is wrong :-]
This looks better:
r = {}
for x in a: if x not in b: r[x] = a[x]
for x in b: if x not in a: r[x] = b[x]

Bearophile
 
S

Simo Melenius

Giovanni Bajo said:
Just today I was writing some code where I wanted to use sets for
the abstraction (intersection, etc.), but also carry some values
with them to process. So, yes, I believe that having set-like
abstraction for dictionaries would be great. In fact, for a moment I
wondered if dict.keys() was already of type set in 2.4, because it
could have made sense.

Which sets you need depends on each case, so a generic set-ified
extension would be both limited and potentially confusing. For
example, a set of (key,value) tuples is the simplest and unambiguous.
On the other hand, if the values differ you really just need a set of
keys and then you can do the value lookup and collision handling
yourself using the result set as keys to both of your dictionaries.
You could subclass dict to provide aliases such as:

dict.itemset ()
dict.keyset ()
dict.valueset ()

for the following if these look overly bloated to you:

set (dict.iteritems ())
set (dict.iterkeys ())
set (dict.itervalues ())

For example:

py> for k,v in set (d1.iteritems ()) - set (d2.iteritems ()):
py> print k, v

looks quite concise and self-explanatory to me already. A _notable_
performance gain would probably justify native set-ifying getters,
especially if there indeed were implementation-level data structures
to share.


br,
S
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Similar Threads


Members online

No members online now.

Forum statistics

Threads
473,770
Messages
2,569,583
Members
45,075
Latest member
MakersCBDBloodSupport

Latest Threads

Top