Request for comments on a design


TomF

I have a program that manipulates lots of very large indices, which I
implement as bit vectors (via the bitarray module). These are too
large to keep all of them in memory so I have to come up with a way to
cache and load them from disk as necessary. I've been reading about
weak references and it looks like they may be what I want.

My idea is to use a WeakValueDictionary to hold references to these
bitarrays, so Python can decide when to garbage collect them. I then
keep a key-value database of them (via bsddb) on disk and load them
when necessary. The basic idea for accessing one of these indexes is:

_idx_to_bitvector_dict = weakref.WeakValueDictionary()

def retrieve_index(idx):
    if idx in _idx_to_bitvector_dict and _idx_to_bitvector_dict[idx] is not None:
        return _idx_to_bitvector_dict[idx]
    else:                                    # it's been gc'd
        bv_str = bitvector_from_db[idx]      # Load from bsddb
        bv = cPickle.loads(bv_str)           # Deserialize the string
        _idx_to_bitvector_dict[idx] = bv     # Re-initialize the weak dict element
        return bv
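
For what it's worth, the write side is just the inverse of this; the file
name and the store_index helper below are only illustrative:

import bsddb
import cPickle

# On-disk key/value store holding the pickled bitarrays
bitvector_from_db = bsddb.hashopen("indexes.db", "c")

def store_index(idx, bv):
    # bsddb keys and values must be strings, so idx is assumed to be a
    # string key and the bitarray is pickled before being written out
    bitvector_from_db[idx] = cPickle.dumps(bv, 2)
    bitvector_from_db.sync()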

Hopefully that's not too confusing. Comments on this approach? I'm
wondering whether the weakref stuff isn't duplicating some of the
caching that bsddb might be doing.

Thanks,
-Tom
 

Peter Otten

TomF said:
I have a program that manipulates lots of very large indices, which I
implement as bit vectors (via the bitarray module). These are too
large to keep all of them in memory so I have to come up with a way to
cache and load them from disk as necessary. I've been reading about
weak references and it looks like they may be what I want.

My idea is to use a WeakValueDictionary to hold references to these
bitarrays, so Python can decide when to garbage collect them. I then
keep a key-value database of them (via bsddb) on disk and load them
when necessary. The basic idea for accessing one of these indexes is:

_idx_to_bitvector_dict = weakref.WeakValueDictionary()

In a well written script that cache will be almost empty. You should compare
the weakref approach against a least-recently-used caching strategy. In
newer Pythons you can use collections.OrderedDict to implement an LRU cache
or use the functools.lru_cache decorator.
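Untested, but an OrderedDict-based LRU cache could look roughly like this:

import collections

class LRUCache(object):
    """Keep strong references to at most `capacity` values."""

    def __init__(self, capacity, load):
        self.capacity = capacity
        self.load = load                        # called on a cache miss
        self._data = collections.OrderedDict()

    def __getitem__(self, key):
        try:
            value = self._data.pop(key)         # hit: re-insert as most recent
        except KeyError:
            value = self.load(key)              # miss: fetch from disk
            if len(self._data) >= self.capacity:
                self._data.popitem(last=False)  # evict least recently used
        self._data[key] = value
        return value

With that, your retrieve_index() reduces to a lookup in something like
LRUCache(100, lambda idx: cPickle.loads(bitvector_from_db[idx])).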

def retrieve_index(idx):
    if idx in _idx_to_bitvector_dict and _idx_to_bitvector_dict[idx] is not None:
        return _idx_to_bitvector_dict[idx]

In a multi-threaded environment the above may still return None or raise a
KeyError.
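
Fetching the value just once closes that window (untested):

def retrieve_index(idx):
    bv = _idx_to_bitvector_dict.get(idx)    # take a strong reference once
    if bv is None:                           # not cached, or already collected
        bv = cPickle.loads(bitvector_from_db[idx])
        _idx_to_bitvector_dict[idx] = bv
    return bv

Two threads may still both load the same index from the database, but
neither will see None or a stray KeyError.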

    else:                                    # it's been gc'd
        bv_str = bitvector_from_db[idx]      # Load from bsddb
        bv = cPickle.loads(bv_str)           # Deserialize the string
        _idx_to_bitvector_dict[idx] = bv     # Re-initialize the weak dict element
        return bv

Hopefully that's not too confusing. Comments on this approach? I'm
wondering whether the weakref stuff isn't duplicating some of the
caching that bsddb might be doing.

Even if the raw value is cached somewhere you save the overhead of
deserialisation.

Peter
 

TomF

In a well written script that cache will be almost empty. You should compare
the weakref approach against a least-recently-used caching strategy. In
newer Pythons you can use collections.OrderedDict to implement an LRU cache
or use the functools.lru_cache decorator.

I don't know what your first sentence means, but thanks for pointers to
the LRU stuff. Maintaining my own LRU cache might be a better way to
go. At least I'll have more control.

Thanks,
-Tom
 
