dict.reserve and other tricks

B

bearophileHUGS

I have started doing practice creating C extensions for CPython, so
here are two ideas I have had, possibly useless.

If you keep adding elements to a CPython dict/set, it periodically
rebuilds itself. So maybe dict.reserve(n) and a set.reserve(n) methods
may help, reserving enough (empty) memory for about n *distinct* keys
the programmer wants to add to the dict/set in a short future. I have
seen that the the C API of the dicts doesn't allow this, and I don't
know if this can be implemented modifying the dicts a bit. Do you think
this may be useful?

-------------------------------

Sometime I need to remove some elements from dicts/sets. If I know a
rule that tells what elements have to be removed I can use code like
this:

import itertools

def filterset(predicate, inset):
return set(itertools.ifilter(predicate, inset))

print filterset(lambda x: x & 1, set(range(10)))

Output:
set([1, 3, 9, 5, 7])

And a similar one to filter dicts that works with a predicate(key,
val).
Most of the times such functions are good enough, but sometimes the
dicts are big, so to reduce memory used I remove keys in place:

def filterdict(pred, indict):
todel = [k for k,v in indict.iteritems() if not pred(k,v)]
for key in todel:
del indict[key]

d = dict.fromkeys(xrange(8), 0)
print d
filterdict(lambda k,v: k & 1, d)
print d

# Output:
# {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0}
# {1: 0, 3: 0, 5: 0, 7: 0}


But doing the same thing while iterating on the dict may be faster and
use even less memory.
This iteration&deletion capability is probably not safe enough to be
used inside Python programs, but a compiled C module with a function
that works like that filterdict (and deletes while iterating) may be
created, and its use is safe from Python programs.

The C++ STL API of maps allows to delete items while iterating on the
map (but probably it has to be done only when necessary, because it may
be a source for bugs):

typedef map<int, string> isMap;
typedef isMap::value_type isValType;
typedef isMap::iterator isMapItor;

void main() {
isMap m;

... // items added to m

for(isMapItor itr = m.begin(); itr != m.end();) {
if(itr->second == "somestring")
m.erase(itr++);
else
++itr;
}


The Python/C API, 7.4.1 Dictionary Objects, says that's not possible
with CPython dicts:
The dictionary p should not be mutated during iteration. It is safe (since Python 2.1) to modify the values of the keys as you iterate over the dictionary, but only so long as the set of keys does not change.<

Do you think it may be useful to create to create such C filterdict
function that deletes while iterating? (To create such function it
probably has to bypass the CPython dict API).

Bye,
bearophile
 
T

Terry Reedy

If you keep adding elements to a CPython dict/set, it periodically
rebuilds itself. So maybe dict.reserve(n) and a set.reserve(n) methods
may help, reserving enough (empty) memory for about n *distinct* keys
the programmer wants to add to the dict/set in a short future. I have
seen that the the C API of the dicts doesn't allow this, and I don't
know if this can be implemented modifying the dicts a bit. Do you think
this may be useful?

Ideas for user-controlled 'performance tuning' have been proposed on the
development list. GVR has rejected such as more trouble than they are
worth. Instead, effort has gone into making the implementation as good as
he and others know how.

Since you are writing extensions, you can create a built-in subclass of
dict to experiment with. I presume the 2.5 default dict should be a model.

Terry Jan Reedy
 
K

Klaas

I have started doing practice creating C extensions for CPython, so
here are two ideas I have had, possibly useless.

If you keep adding elements to a CPython dict/set, it periodically
rebuilds itself. So maybe dict.reserve(n) and a set.reserve(n) methods
may help, reserving enough (empty) memory for about n *distinct* keys
the programmer wants to add to the dict/set in a short future. I have
seen that the the C API of the dicts doesn't allow this, and I don't
know if this can be implemented modifying the dicts a bit. Do you think
this may be useful?

It has been proposed before and rejected. How often is dict creation a
bottleneck in python apps? I'd guess not often. Do you know of any
examples

Optimize space use also isn't terribly compelling, as a "tight" dict
can be created from a "loose" dict d using dict(d).

Most of the times such functions are good enough, but sometimes the
dicts are big, so to reduce memory used I remove keys in place:

def filterdict(pred, indict):
todel = [k for k,v in indict.iteritems() if not pred(k,v)]
for key in todel:
del indict[key]

But doing the same thing while iterating on the dict may be faster and
use even less memory.

Well, you can reduce the memory usage to virtually nothing by using a
generator expression rather than list comprehension.
This iteration&deletion capability is probably not safe enough to be
used inside Python programs, but a compiled C module with a function
that works like that filterdict (and deletes while iterating) may be
created, and its use is safe from Python programs.


Do you think it may be useful to create to create such C filterdict
function that deletes while iterating? (To create such function it
probably has to bypass the CPython dict API).

Such a beast would be fiendish to write, I think. Remember, arbitrary
python code can be executed by __hash__ and deleting (DECREF) python
objects.

-Mike
 
B

bearophileHUGS

Thank you for the answers Terry Reedy and Klaas.
Since you are writing extensions, you can create a built-in subclass of
dict to experiment with. I presume the 2.5 default dict should be a model.

That way it's doable, but I think it's of limited use too; I'd like to
remove elements from arbitrary (normal) dicts.


Klaas:
Well, you can reduce the memory usage to virtually nothing by using a
generator expression rather than list comprehension.

Are you sure? I don't think so. Can you show a little example?
As you know this can't work:

def filterdict(pred, indict):
todel = (k for k,v in indict.iteritems() if not pred(k,v))
for key in todel:
del indict[key]

d = dict.fromkeys(xrange(8), 0)
print d
filterdict(lambda k,v: k & 1, d)
print d

(You can remove elements from the dict and put them into a list, and
then put back the elements into the dict, this is probably the only way
I know of *theoretically* keeping the memory used about the same with
Python. In practice the less memory consuming version may be the
filterdict I have shown).

arbitrary python code can be executed by __hash__
and deleting (DECREF) python objects.

I am starting to understand. I'll think more about this.

Thank you,
bye,
bearophile
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,580
Members
45,054
Latest member
TrimKetoBoost

Latest Threads

Top