Request for comments on a design


TomF

I have a program that manipulates lots of very large indices, which I
implement as bit vectors (via the bitarray module). These are too
large to keep all of them in memory so I have to come up with a way to
cache and load them from disk as necessary. I've been reading about
weak references and it looks like they may be what I want.

My idea is to use a WeakValueDictionary to hold references to these
bitarrays, so Python can decide when to garbage collect them. I then
keep a key-value database of them (via bsddb) on disk and load them
when necessary. The basic idea for accessing one of these indexes is:

_idx_to_bitvector_dict = weakref.WeakValueDictionary()

def retrieve_index(idx):
    if idx in _idx_to_bitvector_dict and _idx_to_bitvector_dict[idx] is not None:
        return _idx_to_bitvector_dict[idx]
    else:                                    # it's been gc'd
        bv_str = bitvector_from_db[idx]      # Load from bsddb
        bv = cPickle.loads(bv_str)           # Deserialize the string
        _idx_to_bitvector_dict[idx] = bv     # Re-initialize the weak dict element
        return bv
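
For what it's worth, the write side is just the inverse of this; the file
name and the store_index helper below are only illustrative:

import bsddb
import cPickle

# On-disk key/value store holding the pickled bitarrays
bitvector_from_db = bsddb.hashopen("indexes.db", "c")

def store_index(idx, bv):
    # bsddb keys and values must be strings, so idx is assumed to be a
    # string key and the bitarray is pickled before being written out
    bitvector_from_db[idx] = cPickle.dumps(bv, 2)
    bitvector_from_db.sync()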

Hopefully that's not too confusing. Comments on this approach? I'm
wondering whether the weakref stuff isn't duplicating some of the
caching that bsddb might be doing.

Thanks,
-Tom
 

Peter Otten

TomF said:
I have a program that manipulates lots of very large indices, which I
implement as bit vectors (via the bitarray module). These are too
large to keep all of them in memory so I have to come up with a way to
cache and load them from disk as necessary. I've been reading about
weak references and it looks like they may be what I want.

My idea is to use a WeakValueDictionary to hold references to these
bitarrays, so Python can decide when to garbage collect them. I then
keep a key-value database of them (via bsddb) on disk and load them
when necessary. The basic idea for accessing one of these indexes is:

_idx_to_bitvector_dict = weakref.WeakValueDictionary()

In a well written script that cache will be almost empty. You should compare
the weakref approach against a least-recently-used caching strategy. In
newer Pythons you can use collections.OrderedDict to implement an LRU cache
or use the functools.lru_cache decorator.
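Untested, but an OrderedDict-based LRU cache could look roughly like this:

import collections

class LRUCache(object):
    """Keep strong references to at most `capacity` values."""

    def __init__(self, capacity, load):
        self.capacity = capacity
        self.load = load                        # called on a cache miss
        self._data = collections.OrderedDict()

    def __getitem__(self, key):
        try:
            value = self._data.pop(key)         # hit: re-insert as most recent
        except KeyError:
            value = self.load(key)              # miss: fetch from disk
            if len(self._data) >= self.capacity:
                self._data.popitem(last=False)  # evict least recently used
        self._data[key] = value
        return value

With that, your retrieve_index() reduces to a lookup in something like
LRUCache(100, lambda idx: cPickle.loads(bitvector_from_db[idx])).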

def retrieve_index(idx):
    if idx in _idx_to_bitvector_dict and _idx_to_bitvector_dict[idx] is not None:
        return _idx_to_bitvector_dict[idx]

In a multi-threaded environment the above may still return None or raise a
KeyError.
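
Fetching the value just once closes that window (untested):

def retrieve_index(idx):
    bv = _idx_to_bitvector_dict.get(idx)    # take a strong reference once
    if bv is None:                           # not cached, or already collected
        bv = cPickle.loads(bitvector_from_db[idx])
        _idx_to_bitvector_dict[idx] = bv
    return bv

Two threads may still both load the same index from the database, but
neither will see None or a stray KeyError.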

    else:                                    # it's been gc'd
        bv_str = bitvector_from_db[idx]      # Load from bsddb
        bv = cPickle.loads(bv_str)           # Deserialize the string
        _idx_to_bitvector_dict[idx] = bv     # Re-initialize the weak dict element
        return bv

Hopefully that's not too confusing. Comments on this approach? I'm
wondering whether the weakref stuff isn't duplicating some of the
caching that bsddb might be doing.

Even if the raw value is cached somewhere you save the overhead of
deserialisation.

Peter
 

TomF

In a well written script that cache will be almost empty. You should compare
the weakref approach against a least-recently-used caching strategy. In
newer Pythons you can use collections.OrderedDict to implement an LRU cache
or use the functools.lru_cache decorator.

I don't know what your first sentence means, but thanks for pointers to
the LRU stuff. Maintaining my own LRU cache might be a better way to
go. At least I'll have more control.

Thanks,
-Tom
 
