low-end persistence strategies?


Paul Rubin

I've started a few threads before on object persistence in medium to
high end server apps. This one is about low end apps, for example, a
simple cgi on a personal web site that might get a dozen hits a day.
The idea is you just want to keep a few pieces of data around that the
cgi can update.

Immediately, typical strategies like using a MySQL database become too
big a pain. Any kind of compiled and installed 3rd party module (e.g.
Metakit) is also too big a pain. But there still has to be some kind
of concurrency strategy, even if it's something like crude file
locking, or else two people running the cgi simultaneously can wipe
out the data store. But you don't want crashing the app to leave a
lock around if you can help it.

Anyway, something like dbm or shelve coupled with flock-style file
locking and a version of dbmopen that automatically retries after 1
second if the file is locked would do the job nicely, plus there could
be a cleanup mechanism for detecting stale locks.
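
Something like this untested sketch is what I have in mind (Unix-only
flock; file names and timings are made up):

    import fcntl, shelve, time

    def locked_shelf(path, retries=30, delay=1.0):
        # Sidecar lock file; a flock disappears when the process exits
        # or crashes, so a crash can't leave a stale lock behind.
        lock = open(path + '.lock', 'w')
        for attempt in range(retries):
            try:
                fcntl.flock(lock.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
                return shelve.open(path), lock
            except IOError:
                time.sleep(delay)   # someone else has it, retry shortly
        lock.close()
        raise RuntimeError('could not lock ' + path)

    # typical cgi usage:
    #   db, lock = locked_shelf('/home/me/data')
    #   db['hits'] = db.get('hits', 0) + 1
    #   db.close()
    #   lock.close()   # closing the lock file releases the flock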

Is there a standard approach to something like that, or should I just
code it the obvious way?

Thanks.
 

Thomas Guettler

On Tue, 15 Feb 2005 18:57:47 -0800, Paul Rubin wrote:
I've started a few threads before on object persistence in medium to
high end server apps. This one is about low end apps, for example, a
simple cgi on a personal web site that might get a dozen hits a day.
The idea is you just want to keep a few pieces of data around that the
cgi can update. [cut]
Anyway, something like dbm or shelve coupled with flock-style file
locking and a version of dbmopen that automatically retries after 1
second if the file is locked would do the job nicely, plus there could
be a cleanup mechanism for detecting stale locks.

Is there a standard approach to something like that, or should I just
code it the obvious way?

Hi,

I would use the pickle module; access to the pickle files could be
serialized with file locking (only one process at a time is allowed
to read or write).

This means your cgi application can only serve one request at a time.
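
A rough sketch of what I mean (the file name is made up; the blocking
flock is what serializes the requests):

    import fcntl, os, pickle

    STATE = 'state.pickle'                 # made-up data file name

    def update(modify):
        # Exclusive, blocking lock on the data file itself: each request
        # simply waits until the previous one has finished.
        fd = os.open(STATE, os.O_RDWR | os.O_CREAT)
        f = os.fdopen(fd, 'r+b')
        try:
            fcntl.flock(f.fileno(), fcntl.LOCK_EX)
            data = f.read()
            if data:
                state = pickle.loads(data)
            else:
                state = {}
            modify(state)                  # caller mutates the dict
            f.seek(0)
            f.truncate()
            f.write(pickle.dumps(state))
        finally:
            f.close()                      # closing releases the lock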

HTH,
Thomas
 

Paul Rubin

Diez B. Roggisch said:
Maybe ZODB helps.

I think it's way too heavyweight for what I'm envisioning, but I
haven't used it yet. I'm less concerned about object persistence
(just saving strings is good enough) than finding the simplest
possible approach to dealing with concurrent update attempts.
 

Diez B. Roggisch

Paul said:
I think it's way too heavyweight for what I'm envisioning, but I
haven't used it yet. I'm less concerned about object persistence
(just saving strings is good enough) than finding the simplest
possible approach to dealing with concurrent update attempts.

And that's exactly where ZODB comes into play. It has full ACID support.
Opening a ZODB is a matter of three lines of code - no comparison to an RDBMS.
And apart from some standard subclassing, you don't have to do
anything to make your objects persistent. Just check the tutorial.
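
Roughly what that looks like (an untested sketch; the exact transaction
API differs a bit between ZODB versions - older releases use
get_transaction().commit()):

    from ZODB import FileStorage, DB
    import transaction

    storage = FileStorage.FileStorage('Data.fs')   # just a file, no server
    db = DB(storage)
    conn = db.open()
    root = conn.root()                             # a persistent mapping

    root['hits'] = root.get('hits', 0) + 1         # any picklable objects work
    transaction.commit()                           # ACID commit
    conn.close()
    db.close()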
 

Paul Rubin

Diez B. Roggisch said:
And that's exactly where ZODB comes into play. It has full ACID support.
Opening a ZODB is a matter of three lines of code - no comparison to an
RDBMS.

The issue with using an rdbms is not with the small amount of code
needed to connect to it and query it, but in the overhead of
installing the huge piece of software (the rdbms) itself, and keeping
the rdbms server running all the time so the infrequently used app can
connect to it. ZODB is also a big piece of software to install. Is
it at least 100% Python with no C modules required? Does it need a
separate server process? If it needs either C modules or a separate
server, it really can't be called a low-end strategy.
 

Diez B. Roggisch

Paul said:
The issue with using an rdbms is not with the small amount of code
needed to connect to it and query it, but in the overhead of

It's not only connecting - it's creating (automatically, if necessary) and
"connecting", which is actually only opening.
installing the huge piece of software (the rdbms) itself, and keeping
the rdbms server running all the time so the infrequently used app can
connect to it. ZODB is also a big piece of software to install. Is
it at least 100% Python with no C modules required? Does it need a
separate server process? If it needs either C modules or a separate
server, it really can't be called a low-end strategy.

It has to be installed. And it has C modules - but I don't see that as a
problem. Of course this is my personal opinion - but it's certainly easier
to install than to cough up your own transaction-isolated persistence layer.
I started using it over pickle when my multi-threaded app caused pickle to
crash.

ZODB does not have a server process and needs no external setup beyond the
installation of the module itself.

Even if you consider installing it as too heavy for your current needs, you
should skim over the tutorial to get a grasp of how it works.
 

Tom Willis

Sounds like you want pickle or cPickle.


Paul Rubin wrote:
I've started a few threads before on object persistence in medium to
high end server apps. This one is about low end apps, for example, a
simple cgi on a personal web site that might get a dozen hits a day.
The idea is you just want to keep a few pieces of data around that the
cgi can update.

Immediately, typical strategies like using a MySQL database become too
big a pain. Any kind of compiled and installed 3rd party module (e.g.
Metakit) is also too big a pain. But there still has to be some kind
of concurrency strategy, even if it's something like crude file
locking, or else two people running the cgi simultaneously can wipe
out the data store. But you don't want crashing the app to leave a
lock around if you can help it.

Anyway, something like dbm or shelve coupled with flock-style file
locking and a version of dbmopen that automatically retries after 1
second if the file is locked would do the job nicely, plus there could
be a cleanup mechanism for detecting stale locks.

Is there a standard approach to something like that, or should I just
code it the obvious way?

Thanks.
 

Paul Rubin

Diez B. Roggisch said:
It has to be installed. And it has C modules - but I don't see that
as a problem. Of course this is my personal opinion - but it's
certainly easier to install than to cough up your own
transaction-isolated persistence layer. I started using it over
pickle when my multi-threaded app caused pickle to crash.

I don't feel that I need ACID since, as mentioned, I'm willing to lock
the entire database for the duration of each transaction. I just want
a simple way to handle locking, retries, and making sure the locks are
cleaned up.
ZODB does not have a server process and needs no external setup beyond
the installation of the module itself.

That helps, thanks.
Even if you consider installing it as too heavy for your current needs, you
should skim over the tutorial to get a grasp of how it works.

Yes, I've been wanting to look at it sometime.
 

Chris Cioffi

I'd like to second this one...ZODB is *extremely* easy to use. I use
it in projects with anything from a couple dozen simple objects all
the way up to a moderately complex system with several hundred
thousand stored custom objects. (I would use it for very complex
systems as well, but I'm not working on any right now...)

There are a few quirks to using ZODB, and the documentation sometimes
feels light, but mostly that's because ZODB is so easy to use.

Chris
 

Dave Brueck

Chris said:
I'd like to second this one...ZODB is *extremely* easy to use. I use
it in projects with anything from a couple dozen simple objects all
the way up to a moderately complex system with several hundred
thousand stored custom objects. (I would use it for very complex
systems as well, but I'm not working on any right now...)

Chris (or anyone else), could you comment on ZODB's performance? I've Googled
around a bit and haven't been able to find anything concrete, so I'm really
curious to know how ZODB does with a few hundred thousand objects.

Specifically, what level of complexity do your ZODB queries/searches have? Any
idea on how purely ad hoc searches perform? Obviously it will be affected by the
nature of the objects, but any insight into ZODB's performance on large data
sets would be helpful. What's the general ratio of reads to writes in your
application?

I'm starting on a project in which we'll do completely dynamic (generated on the
fly) queries into the database (mostly of the form of "from the set of all
objects, give me all that have property A AND have property B AND property B's
value is between 10 and 100, ..."). The objects themselves are fairly dynamic as
well, so building it on top of an RDBMS will require many joins across property
and value tables, so in the end there might not be any performance advantage in
an RDBMS (and it would certainly be a lot less work to use an object database - a
huge portion of the work is in the object-relational layer).

Anyway, thanks for any info you can give me,
-Dave
 

Michele Simionato

What about bsddb? On most Unix systems it should be
already installed and on Windows it comes with the
ActiveState distribution of Python, so it should fulfill
your requirements.
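
For example, a minimal sketch (the file name is made up; note that the
stdlib wrapper does no locking by itself, so you'd still need the file
locking you describe):

    import bsddb

    db = bsddb.hashopen('/home/me/counter.db', 'c')   # 'c' = create if missing
    if db.has_key('hits'):
        db['hits'] = str(int(db['hits']) + 1)          # values are strings
    else:
        db['hits'] = '1'
    db.close()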
 

Paul Rubin

Michele Simionato said:
What about bsddb? On most Unix systems it should be already
installed and on Windows it comes with the ActiveState distribution
of Python, so it should fulfill your requirements.

As I understand it, bsddb doesn't expose the underlying Sleepycat APIs
for concurrent db updates, nor does it appear to make any attempt at
locking, based on looking at the Python lib doc for it. There's an
external module called pybsddb that includes this stuff. Maybe the
stdlib maintainers ought to consider including it, if it's considered
stable enough.
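
For reference, here's a rough, untested sketch of what that lower-level
bsddb.db (pybsddb) interface looks like using Berkeley DB's "Concurrent
Data Store" mode - the flag choices and paths are just illustrative:

    import os
    from bsddb import db          # the pybsddb wrapper, exposed as bsddb.db

    envdir = '/home/me/dbenv'     # the environment directory must exist
    if not os.path.isdir(envdir):
        os.makedirs(envdir)

    # Concurrent Data Store mode: Berkeley DB handles the locking
    # (many readers, one writer) without full transactions.
    env = db.DBEnv()
    env.open(envdir, db.DB_CREATE | db.DB_INIT_CDB | db.DB_INIT_MPOOL)

    d = db.DB(env)
    d.open('counter.db', db.DB_HASH, db.DB_CREATE)
    d.put('hits', str(int(d.get('hits') or '0') + 1))
    d.close()
    env.close()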
 

Diez B. Roggisch

Chris (or anyone else), could you comment on ZODB's performance? I've
Googled around a bit and haven't been able to find anything concrete, so
I'm really curious to know how ZODB does with a few hundred thousand
objects.
Specifically, what level of complexity do your ZODB queries/searches have?
Any idea on how purely ad hoc searches perform? Obviously it will be
affected by the nature of the objects, but any insight into ZODB's
performance on large data sets would be helpful. What's the general ratio
of reads to writes in your application?

This is a somewhat weak point of ZODB. ZODB simply lets you store arbitrary
object graphs. There are no indices created to access these, and no query
language either. You can of course create indices yourself - and store them
as simply as all other objects. But you've got to hand-tailor these to the
objects you use, and create your querying code yourself - no 4GL like SQL
is available.

Of course writing queries as simple predicates evaluated against your whole
object graph is straightforward - but unoptimized.
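
For example, such a hand-rolled query is just an unindexed linear scan
with a Python predicate (root['items'] and the attribute names here are
made up):

    hits = [obj for obj in root['items'].values()
            if getattr(obj, 'a', None) is not None
            and 10 <= getattr(obj, 'b', 0) <= 100]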

The retrieval of objects themselves is very fast - I didn't compare to an
RDBMS, but as there is no networking involved it should be faster. And of
course no joins are needed.

So in the end, if you always have the same kind of queries, which you only
parameterize, and you create appropriate indices and hand-written "execution
plans", things are nice.

But I want to stress another point that can cause trouble when using ZODB,
and that I didn't mention in replies to Paul so far, as he explicitly
didn't want to use an RDBMS:


For an RDBMS, a well-defined textual representation of the entities stored
in the db is available. So while you have to put some effort into creating an
OR mapping (if you want to deal with objects) that will most likely evolve
over time, migrating the underlying data usually is pretty straightforward,
and even tool support is available. Basically, you're only dealing with
CSV data that can be easily manipulated and stored back.

ZODB, on the other hand, is way easier to code for - but the hard times begin
when you have a rolled-out application with a bunch of objects inside
ZODB that have to be migrated to newer versions and possibly changed
object-graph layouts. This made me create elaborate YAML/XML serializations
to allow for imports and exports and use with XSLT, and currently I'm
investigating a switch to Postgres.

This point is important, and future developments of mine will take it into
consideration more than they have so far.
 

pyguy2

People sometimes run to complicated systems when there is a solution right
in front of them. In this case, it is the filesystem itself.

It turns out mkdir is an atomic operation (at least on the filesystems I've
encountered). And from that simple thing you can build something
reasonable, as long as you don't need high performance and space isn't
an issue.

You need a two-layer lock (make two directories) and you need to keep two
data files around, plus a third temporary file.

The reader reads from the newer of the two data files.

The writer makes the locks, deletes the older data file, and renames
its temporary file to be the new data file. You could
have the locks expire after 10 minutes, to take care of failures to
clean up. Ultimately, the writer is responsible for keeping the locks
alive. The writer knows a lock is its own because it carries its timestamp.
If the writer dies, no big deal, since it only affected a temporary
file and the locks will expire.

Renaming the temporary file takes advantage of the fact that a rename
is essentially immediate. Whatever does the reading only reads
from the newer of the two files (if both are available). Once the
writer's rename of the temporary file is complete, any future
reads will hit the newest data. And deleting the older file
doesn't matter, since the reader never looks at it.
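
A stripped-down, untested sketch of the idea (a single lock directory
rather than two, and made-up file names):

    import os, time

    LOCK = 'store.lock'              # the lock: mkdir/rmdir of this directory
    FILES = ['store.0', 'store.1']   # the two data files
    TMP = 'store.tmp'
    EXPIRY = 600                     # steal locks older than 10 minutes

    def newest():
        have = [f for f in FILES if os.path.exists(f)]
        have.sort(key=os.path.getmtime)
        return have[-1] if have else None

    def acquire():
        while True:
            try:
                os.mkdir(LOCK)       # atomic: only one writer can win
                return
            except OSError:
                # lock already exists (or mkdir failed); expire it if stale
                try:
                    if time.time() - os.path.getmtime(LOCK) > EXPIRY:
                        os.rmdir(LOCK)
                except OSError:
                    pass
                time.sleep(1)

    def write(data):
        acquire()
        try:
            f = open(TMP, 'w')
            f.write(data)
            f.close()
            cur = newest()
            # overwrite the older file; readers keep using the newer one
            target = FILES[1] if cur == FILES[0] else FILES[0]
            os.rename(TMP, target)   # rename is essentially immediate
        finally:
            os.rmdir(LOCK)

    def read():
        cur = newest()
        return open(cur).read() if cur else None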

If you want more specifics let me know.

john
 

Michele Simionato

The documentation hides this fact (I missed it), but Python 2.3+ actually
ships with the pybsddb module, which has all the functionality you allude to.
Check the test directory for bsddb.

Michele Simionato
 

Cameron Laird

Paul Rubin wrote:
I've started a few threads before on object persistence in medium to
high end server apps. This one is about low end apps, for example, a
simple cgi on a personal web site that might get a dozen hits a day.
The idea is you just want to keep a few pieces of data around that the
cgi can update.

Immediately, typical strategies like using a MySQL database become too
big a pain. Any kind of compiled and installed 3rd party module (e.g.
Metakit) is also too big a pain. But there still has to be some kind
of concurrency strategy, even if it's something like crude file
locking, or else two people running the cgi simultaneously can wipe
out the data store. But you don't want crashing the app to leave a
lock around if you can help it.

Anyway, something like dbm or shelve coupled with flock-style file
locking and a version of dbmopen that automatically retries after 1
second if the file is locked would do the job nicely, plus there could
be a cleanup mechanism for detecting stale locks.

Is there a standard approach to something like that, or should I just
code it the obvious way?

Thanks.

I have a couple of oblique, barely-helpful reactions; I
wish I knew better solutions.

First: I'm using Metakit and SQLite; they give me more
confidence and fewer surprises than dbm.

Second: Locking indeed is a problem, and I haven't
found a good global solution for it yet. I end up with
local fixes, that is, rather project-specific locking
schemes that exploit knowledge that, for example, there
are no symbolic links to worry about, or NFS mounts, or
....

Good luck.
 

Paul Rubin

Michele Simionato said:
The documentation hides this fact (I missed it), but Python 2.3+ actually
ships with the pybsddb module, which has all the functionality you allude
to. Check the test directory for bsddb.

Thanks, this is very interesting. It's important functionality that
should be documented, if it works reliably. Have you had any probs
with it?
 
