Persistent objects

Paul Rubin

I've had this recurring half-baked desire for long enough that I
thought I'd post about it, even though I don't have any concrete
proposals and the whole idea is fraught with hazards.

Basically I wish there was a way to have persistent in-memory objects
in a Python app, maybe a multi-process one. So you could have a
persistent dictionary d, and if you say
d[x] = Frob(foo=9, bar=23)
that creates a Frob instance and stores it in d[x]. Then if you
exit the app and restart it later, there'd be a way to bring d back
into the process and have that Frob instance be there.

Please don't suggest using a pickle or shelve; I know about those
already. I'm after something higher-performance. Basically d would
live in a region of memory that could be mmap'd to a disk file as well
as shared with other processes. Once d was rooted into that region,
any entries created in it would also be in that region, and any
objects assigned to the entries would also get moved to that region.

There'd probably have to be a way to lock the region for update, using
semaphores. Ordinary subscript assignments would lock automatically,
but there might be times when you want to update several structures in
a single transaction.
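
For concreteness, a purely hypothetical sketch of that multi-structure
case -- none of these names exist, and a multiprocessing.Lock stands in
for the named semaphore:

from multiprocessing import Lock

region_lock = Lock()                # stand-in for a named, cross-process semaphore

def transfer(d, src, dst, amount):
    with region_lock:               # several updates as one transaction
        d[src] = d[src] - amount
        d[dst] = d[dst] + amount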

A thing like this could save a heck of a lot of SQL traffic in a busy
server app. There are all kinds of bogus limitations you see on web
sites, where you can't see more than 10 items per html page or
whatever, because they didn't want loading a page to cause too many
database hits. With the in-memory approach, all that data could be
right there in the process, no TCP messages needed and no context
switches needed, just ordinary in-memory dictionary references. Lots
of machines now have multi-GB of physical memory which is enough to
hold all the stuff from all but the largest sites. A site like
Slashdot, for example, might get 100,000 logins and 10,000 message
posts per day. At a 1k bytes per login (way too much) and 10k bytes
per message post (also way too much), that's still just 200 megabytes
for a full day of activity. Even a low-end laptop these days comes
with more ram than that, and multi-GB workstations are no big deal any
more. Occasionally someone might look at a several-day-old thread and
that might cause some disk traffic, but even that can be left in
memory (the paging system can handle it).
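
A quick back-of-envelope check of that estimate:

logins = 100_000 * 1_000           # 1 KB per login  -> 100 MB/day
posts = 10_000 * 10_000            # 10 KB per post  -> 100 MB/day
print((logins + posts) / 10**6)    # -> 200.0 megabytes per day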

On the other hand, there'd either have to be interpreter hair to
separate the persistent objects from the non-persistent ones, or else
make everything persistent and then have some way to keep processes
sharing memory from stepping on each other. Maybe the abstraction
machinery in PyPy can make this easy.

Well, as you can see, this idea leaves a lot of details not yet
thought out. But it's alluring enough that I thought I'd ask if
anyone else sees something to pursue here.
 
Max M

Paul said:
Basically I wish there was a way to have persistent in-memory objects
in a Python app, maybe a multi-process one. So you could have a
persistent dictionary d, and if you say
d[x] = Frob(foo=9, bar=23)
that creates a Frob instance and stores it in d[x]. Then if you
exit the app and restart it later, there'd be a way to bring d back
into the process and have that Frob instance be there.

Have you considered using the standalone ZODB from Zope?


--

hilsen/regards Max M, Denmark

http://www.mxm.dk/
IT's Mad Science
 
Paul Rubin

Max M said:
Basically I wish there was a way to have persistent in-memory objects
in a Python app, maybe a multi-process one. So you could have a
persistent dictionary d, and if you say d[x] = Frob(foo=9, bar=23)
that creates a Frob instance and stores it in d[x]. Then if you
exit the app and restart it later, there'd be a way to bring d back
into the process and have that Frob instance be there.

Have you considered using the standalone ZODB from Zope?

No. I've heard that it's quite slow, and works sort of the way shelve
does. Am I mistaken? I want the objects to never leave memory except
through mmap.
 
Duncan Booth

Paul said:
Well, as you can see, this idea leaves a lot of details not yet
thought out. But it's alluring enough that I thought I'd ask if
anyone else sees something to pursue here.

Have you looked at ZODB and ZEO? It does most of what you ask for, although
not necessarily in the way you suggest.

It doesn't attempt to hold everything in memory, but so long as most of
your object accesses are cache hits, this shouldn't matter. Nor does it use shared
memory: using ZEO you can have a client server approach so you aren't
restricted to a single machine.

Instead of a locking scheme, each thread works within a transaction, and
only when the transaction is committed do you find out whether your changes
are accepted or rejected. If they are rejected, you simply try again.
So long as most of your accesses are read-only and transactions are
committed quickly, this scheme can work better than locking.
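
For concreteness, the commit/retry pattern looks roughly like this in
ZODB (file name illustrative; the standalone transaction module is the
modern spelling of the API):

import ZODB, ZODB.FileStorage
import transaction
from ZODB.POSException import ConflictError

db = ZODB.DB(ZODB.FileStorage.FileStorage('Data.fs'))
root = db.open().root()

for attempt in range(3):
    try:
        root['hits'] = root.get('hits', 0) + 1
        transaction.commit()        # changes accepted here, or...
        break
    except ConflictError:
        transaction.abort()         # ...rejected: roll back and retry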
 
Paul Rubin

Duncan Booth said:
Have you looked at ZODB and ZEO? It does most of what you ask for,
although not necessarily in the way you suggest.

You're the second person to mention these, so maybe I should check into
them more. But I thought they were garden-variety persistent object
schemes that wrote pickles into disk files. That's orders of magnitude
slower than what I had in mind.
It doesn't attempt to hold everything in memory, but so long as most of
your objects are cache hits this shouldn't matter. Nor does it use shared
memory: using ZEO you can have a client server approach so you aren't
restricted to a single machine.

Well, if it doesn't use shared memory, what does it do instead? If
every access has to go through the TCP stack, you're going to get
creamed speed-wise. The mmap scheme should be able to do millions of
operations per second. Are there any measurements of how many
ops/second you can get through ZODB?
 
Nigel Rowe

Paul said:
I've had this recurring half-baked desire for long enough that I
thought I'd post about it, even though I don't have any concrete
proposals and the whole idea is fraught with hazards.

Basically I wish there was a way to have persistent in-memory objects
in a Python app, maybe a multi-process one.
<<snip>>

Maybe POSH (http://poshmodule.sourceforge.net/) is what you want.

From the "About POSH"

Python Object Sharing, or POSH for short, is an extension module to Python
that allows objects to be placed in shared memory. Objects in shared memory
can be accessed transparently, and most types of objects, including
instances of user-defined classes, can be shared. POSH allows concurrent
processes to communicate simply by assigning objects to shared container
objects.
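
For a feel of the proxy approach, here is a rough analogue using the
(much later) standard-library multiprocessing.Manager -- this is not
POSH's actual API, and a manager forwards operations to a server
process rather than using raw shared memory the way POSH does:

from multiprocessing import Manager, Process

def worker(shared, key):
    shared[key] = key * key         # transparent access through a proxy

if __name__ == "__main__":
    with Manager() as manager:
        d = manager.dict()          # proxy to a dict in the manager process
        procs = [Process(target=worker, args=(d, i)) for i in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print(dict(d))              # {0: 0, 1: 1, 2: 4, 3: 9}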
 
Paul Rubin

Nigel Rowe said:

Thanks, that is great. The motivation was somewhat different but it's
clear that the authors faced and dealt with most of the same issues
that were bugging me. I had hoped to avoid the use of those proxy
objects but I guess there's no decent way around them in a
multi-process setting. The authors similarly had to reimplement the
basic Python container types, which I'd also hoped could be avoided,
but I guess what they did was straightforward if messier than I'd like.

POSH also makes no attempt to implement persistence, but maybe that's
a fairly simple matter of mmap'ing the shared memory region and
storing some serialized representation of the proxy objects. If I
correctly understand how POSH works, the number of proxies active at
any moment should be fairly low.
 
Alan Kennedy

Hi Paul,

[Paul Rubin]
> Basically I wish there was a way to have persistent in-memory objects
> in a Python app, maybe a multi-process one. So you could have a
> persistent dictionary d, and if you say
> d[x] = Frob(foo=9, bar=23)
> that creates a Frob instance and stores it in d[x]. Then if you
> exit the app and restart it later, there'd be a way to bring d back
> into the process and have that Frob instance be there.

Have you looked at Ian Bicking's SQLObject?

http://sqlobject.org/

To define a class

class MyPersistentObj(SQLObject):
    foo = IntCol()
    bar = IntCol()

To instantiate a new object

my_new_object = MyPersistentObj(foo=9, bar=23)

Once the new object has been created, it has already been persisted into
an RDBMS table automatically. To reload it from the table/database, e.g.
after a system restart, simply supply its id.

my_existing_object = MyPersistentObj.get(id=42)

Select a subset of your persistent objects using SQL-style queries

my_foo_9_objects = MyPersistentObj.select(MyPersistentObj.q.foo == 9)
for o in my_foo_9_objects:
    process(o)

SQLObject also takes care of caching, in that objects are optionally
cached, associated with a specific connection to the database. (This
means it is possible to have different versions of the same object
cached with different connections, but that's easy to solve with good
application architecture.) So in your case, if your (web?) app is
persistent/long-running, then you can simply have SQLObject cache all
your objects, assuming you've got enough memory. (Hmm, I wonder if
SQLObject could be made to work with weak references?) Lastly, caching
can be disabled.

I've found performance of SQLObject to be pretty good, but since you
haven't specified particular requirements for performance, it's not
possible to say if it meets your criteria. Although I feel comfortable
in saying that SQLObject combined with an SQLite in-memory database
should give pretty good performance, if you've got the memory to spare
for the large databases you describe.

Other nice features include

1. RDBMS independent: currently supported are Postgres, Firebird, MySQL,
SQLite, Oracle, Sybase, DBM. SQLServer support is in the pipeline.
SQLObject code should be completely portable between such backend stores.

2. Full support for ACID transactional updates to data.

3. A nice facility for building SQL queries using python syntax.

4. Automated creation of tables and databases. Table structure
modification supported on most databases.

5. Full support for one-to-one, one-to-many and many-to-many
relationships between objects.

All in all, a great little package. I recommend that you take a close look.

Regards,
 
Paul Rubin

Alan Kennedy said:
Have you looked at Ian Bicking's SQLObject?

http://sqlobject.org/

That sounds like Python object wrappers around SQL transactions.
That's the opposite of what I want. I'm imagining a future version of
Python with native compilation. A snippet like

user_history[username].append(time())

where user_history is an ordinary Python dict, would take a few dozen
machine instructions. If user_history is a shared memory object of
the type I'm imagining, there might be a few dozen additional
instructions of overhead dealing with the proxy objects. But if SQL
databases are involved, that's thousands of instructions, context
switches, TCP messages, and whatever. That's orders of magnitude
difference.
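
A crude way to see the gap, with the stdlib sqlite3 standing in for "an
SQL database" (absolute numbers vary by machine, and an out-of-process
server would add network and context-switch costs on top):

import sqlite3, time, timeit

user_history = {"alice": []}

def dict_append():
    user_history["alice"].append(time.time())

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE history (username TEXT, ts REAL)")

def sql_insert():
    conn.execute("INSERT INTO history VALUES (?, ?)",
                 ("alice", time.time()))
    conn.commit()

n = 10_000
print("dict append:  ", timeit.timeit(dict_append, number=n) / n)
print("sqlite insert:", timeit.timeit(sql_insert, number=n) / n)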
 
Irmen de Jong

Paul said:
Basically I wish there was a way to have persistent in-memory objects
in a Python app, maybe a multi-process one. So you could have a
persistent dictionary d, and if you say
d[x] = Frob(foo=9, bar=23)
that creates a Frob instance and stores it in d[x]. Then if you
exit the app and restart it later, there'd be a way to bring d back
into the process and have that Frob instance be there.

If I'm not mistaken, PyPerSyst (http://www.pypersyst.org/) is
something like this. Haven't used it, though...

--Irmen
 
Bengt Richter

Paul said:
I've had this recurring half-baked desire for long enough that I
thought I'd post about it, even though I don't have any concrete
proposals and the whole idea is fraught with hazards.

Basically I wish there was a way to have persistent in-memory objects
in a Python app, maybe a multi-process one. So you could have a
persistent dictionary d, and if you say
d[x] = Frob(foo=9, bar=23)
that creates a Frob instance and stores it in d[x]. Then if you
exit the app and restart it later, there'd be a way to bring d back
into the process and have that Frob instance be there.

I've had similar thoughts. Various related topics have come up on c.l.py.
I speculated on getting a fast-start capability by putting the entire
heap etc. state of Python in an mmap'd region and having a checkpoint
function you could call, sort of like a yield that goes under the hood
and writes the state to a file. Then you would restart it via
python -resume savedstate and, instead of loading normally, Python would
load its state from savedstate and appear to continue from the "yield"
that caused the checkpointing.

Of course that has a lot of handwaving content, but the idea of image-wise
saving state is similar to what you want to do I think.

But I think you have to think about what id(Frob(foo=9, bar=23)) means,
because that is basically what you are passing to d (along with id(x) above).

For speed you really don't want d to be copying immutable Frob representations
from heap memory to mmap memory; you want Frob instances to be created in mmap
memory to start with (as I think you were saying). But this requires specifying
that Frob should behave that way, one way or another. If we get class decorators,
maybe we could write

@persistent(mmapname)
class Frob(object):
    ...

but in the mean while we could write
class Frob(object):
    ...
Frob = persistent(mmapname)(Frob)

The other way is to modify the Frob code to inherit from Persistent(mmapname)
or such. BTW, mmapname IMO should not name a file directly, or you get horrible
coupling of the code to a particular site. I think there should be a standard
place to store the mapping from names to files, similar to sys.modules, so that
mmapnames can be used as abstract mmap space specifiers. In fact, the mapping
could use sys.modules by being in a standard module for registering mmapnames,
backed by a persistent config file (or we get recursion ?;-)
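
A hypothetical sketch of that registry -- none of this is a real API,
and the table would be loaded from a persistent config file in practice:

import mmap

_mmapnames = {}                     # mmapname -> backing file path

def register(mmapname, path):
    _mmapnames[mmapname] = path

def open_region(mmapname):
    f = open(_mmapnames[mmapname], "r+b")
    return mmap.mmap(f.fileno(), 0) # map the whole backing file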

Thinking out loud here ...

So what could Frob = persistent(mmapname)(Frob) do with Frob to make Frob
objects persist, and what will foo = Frob(...) mean?
foo is an ordinary name bound to the Frob instance.
But the binding here is between a transient name in the local name space
and a persistent Frob instance. So I think we need a Frob instance proxy
that implements an indirect reference to the persistent data using a persistent
data id, which could be an offset into the mmap file where the representation
is stored, analogous to id's being memory addresses. But the persistent representation
has to include type info also, which can't have RAM memory references in it, if it's
to be shared -- unless maybe you do extreme magic for sharing, like debugger code
and relocating loaders etc.

Now if you wrote

@persistent(mmapname)
class PD(dict): pass
...
d = PD()
d[x] = frobinst

then if persistent was smart enough to recognize some useful basic types like dict,
d would be a transient binding to a persistent dict proxy which could recognize
persistent object proxies as values and just get the persistent id and use that instead
of creating a new persistent representation; or if the value was a reference to an ordinary
immutable, a persistent copy could be made in mmap space, and the offset/id of _that_ would
be used as the value ref in the persistent representation of d. Similarly with the key.

d.__setitem__(key, value) would not accept a reference to an ordinary mutable value object
unless maybe it had a single reference count, indicating that it was only constructed to
pass as an argument. In that case, if persistent(mmapname)(type(themutable)) succeeded, then
a representation could be created in the mmapname space and the mmap offset/id could be
used as the value ref in the d hash association, with the key handled similarly again.

I feel like this is doable, if you don't get too ambitious to start ;-)
The tricky parts will be getting performance with proxies checking in-RAM cached
representations vs in-mmap-RAM representations, and designing representations to make
that happen.

If it's worth it ;-) Don't good OS file systems already have LRU caching of hot info,
so how much is there to gain over a lightweight database's performance?

Paul said:
Please don't suggest using a pickle or shelve; I know about those
already. I'm after something higher-performance. Basically d would
live in a region of memory that could be mmap'd to a disk file as well
as shared with other processes. Once d was rooted into that region,
any entries created in it would also be in that region, and any
objects assigned to the entries would also get moved to that region.

UIAM heap objects would be hard to move unless they had ref counts of 1
-- and that only if a ref count of 1 was implemented to identify the
referrer. Or totally rework garbage collection etc. And as mentioned,
direct references from the mmap region to ordinary RAM locations wouldn't
fly, since the latter are not persistent, but can't be moved unless other
references are updated. For checkpointing it would be different, because
it's not sharing.

Paul said:
There'd probably have to be a way to lock the region for update, using
semaphores. Ordinary subscript assignments would lock automatically,
but there might be times when you want to update several structures in
a single transaction.

Definitely there would have to be a mutex, and one that could be accessed
by name between programs.

Paul said:
A thing like this could save a heck of a lot of SQL traffic in a busy
server app. There are all kinds of bogus limitations you see on web
sites, where you can't see more than 10 items per html page or
whatever, because they didn't want loading a page to cause too many
database hits. With the in-memory approach, all that data could be
right there in the process, no TCP messages needed and no context
switches needed, just ordinary in-memory dictionary references. Lots
of machines now have multi-GB of physical memory which is enough to
hold all the stuff from all but the largest sites. A site like
Slashdot, for example, might get 100,000 logins and 10,000 message
posts per day. At a 1k bytes per login (way too much) and 10k bytes
per message post (also way too much), that's still just 200 megabytes
for a full day of activity. Even a low-end laptop these days comes
with more ram than that, and multi-GB workstations are no big deal any
more. Occasionally someone might look at a several-day-old thread and
that might cause some disk traffic, but even that can be left in
memory (the paging system can handle it).

OTOH I think the danger of premature optimization is ever present. What info
do you have re actual causes of overhead? And are you looking at mostly
read-only or a lot of r/w activity?

If there is a use for this, do you really need the generality of Frob,
or would a d[x]=y that only allowed x and y as strings, but was fast,
be useful? I think the latter would not be that hard to implement. Basically
a string repository plus some persistent representation of a hash table
associating key strings with value strings, and locking provisions.
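
For concreteness, a rough cut at such a strings-only store -- bytes-only
keys and values, fixed-size file, and no cross-process locking shown:

import mmap
import os
import struct

class StringStore:
    HEADER = struct.Struct("<II")           # key length, value length

    def __init__(self, path, size=1 << 20):
        if not os.path.exists(path):
            with open(path, "wb") as f:
                f.truncate(size)            # pre-size the file with zeros
        self.f = open(path, "r+b")
        self.mm = mmap.mmap(self.f.fileno(), 0)
        self.index = {}                     # key -> (offset, length) of value
        self.end = self._scan()             # rebuild the index from the log

    def _scan(self):
        pos = 0
        while pos + self.HEADER.size <= len(self.mm):
            klen, vlen = self.HEADER.unpack_from(self.mm, pos)
            if klen == 0 and vlen == 0:     # zeroed tail: end of the log
                break
            off = pos + self.HEADER.size
            self.index[bytes(self.mm[off:off + klen])] = (off + klen, vlen)
            pos = off + klen + vlen
        return pos

    def __setitem__(self, key, value):      # key and value are bytes
        rec = self.HEADER.pack(len(key), len(value)) + key + value
        self.mm[self.end:self.end + len(rec)] = rec
        off = self.end + self.HEADER.size
        self.index[key] = (off + len(key), len(value))
        self.end += len(rec)

    def __getitem__(self, key):
        off, length = self.index[key]
        return bytes(self.mm[off:off + length])

Reopening StringStore on the same path rebuilds the index from the
mapped file, so the data survives a restart.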

Paul said:
On the other hand, there'd either have to be interpreter hair to
separate the persistent objects from the non-persistent ones, or else
make everything persistent and then have some way to keep processes
sharing memory from stepping on each other. Maybe the abstraction
machinery in PyPy can make this easy.

Well, as you can see, this idea leaves a lot of details not yet
thought out. But it's alluring enough that I thought I'd ask if
anyone else sees something to pursue here.

The strings-only version would let you build various pickling schemes on top
of that for other objects, and there wouldn't be so much re-inventing to do ;-)
That seems like an evening's project, to get a prototype. But no time now...

Regards,
Bengt Richter
 
Dan Perl

Paul Rubin said:
I've had this recurring half-baked desire for long enough that I
thought I'd post about it, even though I don't have any concrete
proposals and the whole idea is fraught with hazards.

Basically I wish there was a way to have persistent in-memory objects
in a Python app, maybe a multi-process one. So you could have a
persistent dictionary d, and if you say
d[x] = Frob(foo=9, bar=23)
that creates a Frob instance and stores it in d[x]. Then if you
exit the app and restart it later, there'd be a way to bring d back
into the process and have that Frob instance be there.

I haven't used it myself, but this sounds to me a lot like Metakit
(http://www.equi4.com/metakit.html). Have you considered that, and does it
fit your needs?

Dan
 
Keith Dart

Paul said:
I've had this recurring half-baked desire for long enough that I
thought I'd post about it, even though I don't have any concrete
proposals and the whole idea is fraught with hazards.

Basically I wish there was a way to have persistent in-memory objects
in a Python app, maybe a multi-process one. So you could have a
persistent dictionary d, and if you say
d[x] = Frob(foo=9, bar=23)
that creates a Frob instance and stores it in d[x]. Then if you
exit the app and restart it later, there'd be a way to bring d back
into the process and have that Frob instance be there.

Check out the Durus project.

http://www.mems-exchange.org/software/durus/
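
Typical usage, if memory serves (file name illustrative):

from durus.file_storage import FileStorage
from durus.connection import Connection

connection = Connection(FileStorage("frobs.durus"))
root = connection.get_root()        # a persistent dict-like root object
root["d"] = {"x": (9, 23)}
connection.commit()                 # flush the change to the file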



--
\/ \/
(O O)
-- --------------------oOOo~(_)~oOOo----------------------------------------
Keith Dart <[email protected]>
vcard: <http://www.kdart.com/~kdart/kdart.vcf>
public key: ID: F3D288E4 URL: <http://www.kdart.com/~kdart/public.key>
============================================================================
 
deelan

Paul said:
I've had this recurring half-baked desire for long enough that I
thought I'd post about it, even though I don't have any concrete
proposals and the whole idea is fraught with hazards.

Basically I wish there was a way to have persistent in-memory objects
in a Python app, maybe a multi-process one. So you could have a
persistent dictionary d, and if you say
d[x] = Frob(foo=9, bar=23)
that creates a Frob instance and stores it in d[x]. Then if you
exit the app and restart it later, there'd be a way to bring d back
into the process and have that Frob instance be there.

this sounds like the "orthogonal persistence" of the
Unununium Python project:

Orthogonal Persistence

''A system in which things persist only until they are no longer needed
is said to be orthogonally persistent. In the context of an OS, this
means that should the computer be turned off, its state will persist
until it is on again. Some popular operating systems implement non-fault
tolerant persistence by allowing the user to explicitly save the state
of the machine to disk. However, Unununium will implement fault tolerant
persistence in which state will be saved even in the case of abnormal
shutdown such as by power loss.''

from:
<http://unununium.org/introduction>

check also this thread:
<http://unununium.org/pipermail/uuu-devel/2004-September/000218.html>

''Such a solution isn't orthogonal persistence. The "orthogonal" means
how the data is manipulated is independent of how it is stored. Contrast
this with the usual way of doing things that requires huge amounts of
code, time, and developer sanity to shuffle data in and out of
databases.''

''In fact, we do plan to write RAM to disk every 5 minutes or so. But, it's
not slow. In fact, it's faster in many cases. Only dirty pages need to
be written to the drive, which is typically a very small fraction of all
RAM. Because there is no filesystem thus the drive spends little time
seeking. Furthermore, serializing all data to some database is not
required, which saves CPU cycles and code, resulting in reduced memory
usage and higher quality in software design.''

''The resulting system is fully fault tolerant (a power outage will
never hose your work), requires no network, and doesn't introduce any
new points of failure or complexity.''


You may want to investigate how the UUU guys plan to
implement this in their project.


bye.
 
Paul Rubin

deelan said:
this sounds like the "orthogonal persistence" of
unununium python project:

Thanks, I had something a little more conventional in mind, but I
really enjoyed reading the unununium page. It's not often that I see
someone doing something that interesting and far-out, that's simple
and low level at the same time.

I think POSH is the closest thing mentioned in this thread to what I
had in mind. I could imagine adding some simple kernel hacks to
support it on the x86 architecture, specifically using the x86
segmentation registers to allow using raw pointers (instead of
handles) in the shared memory objects. You'd just set the ES register
appropriately for each process depending on where the shared memory
region was mapped for that process. Alternatively, simply have some
way to give the shared region the same address in every process.
That's most attractive on a 64-bit machine. You'd just reserve some
32-bit (or larger) block of address space for the shared region, that
starts at some sufficiently high address in each process.
 
