Looking for a Fast Persistent Store


Bob Hutchison

Hi,

I'm looking for a persistent store, where:
* I can use it from Ruby
* it is fast
* it is transactional
* it can update more than one store in a single transaction
* simple String -> String mappings in the store are sufficient
* ideally uses files in the filesystem

These kinds of stores are normally implemented as persistent hashes
or BTrees (or both). I know that Sleepycat's Berkeley DB
<http://www.sleepycat.com/> does this, and I've used Java based systems
that do this. I also know of some C based things but they don't have
Ruby wrappers. I can't find anything in the Ruby world, and I don't
care too much if it isn't pure Ruby.

I know about Purple <http://purple.rubyforge.org/> and the QDBM Ruby
wrapper <http://qdbm.sourceforge.net/>. Neither does the multiple
hash/BTree in a transaction thing.

The trick appears to be with transactions.
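
To make that requirement concrete, the kind of usage I'm after looks
something like this (a purely hypothetical API, just to illustrate --
nothing that actually exists):

users    = Store.open('users.db')       # two independent stores...
sessions = Store.open('sessions.db')

Store.transaction(users, sessions) do   # ...updated in one transaction
  users['bob'] = 'session:42'           # simple String -> String
  sessions['session:42'] = 'active'
end # both writes commit together, or neither does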

I know this can be done easily enough in a relational DB, but I also
know that, for example, the JDBC/MySQL combination is significantly
slower at what I need to do than Perst (a pure Java thing that's
startlingly fast). I've taken a look at SQLite, which satisfies all of
the requirements, but I can't shake the feeling that things can be
better.

Does anyone have any ideas?

Thanks,
Bob

----
Bob Hutchison -- blogs at <http://www.recursive.ca/hutch/>
Recursive Design Inc. -- <http://www.recursive.ca/>
Raconteur -- <http://www.raconteur.info/>
xampl for Ruby -- <http://rubyforge.org/projects/xampl/>
 

Harold Hausman

Hi,

I'm looking for a persistent store, where:
* I can use it from Ruby
* it is fast
* it is transactional
* it can update more than one store in a single transaction
* simple String -> String mappings in the store are sufficient
* ideally uses files in the filesystem
...

Does anyone have any ideas?

Thanks,
Bob

Hi Bob,

If you're the fun and adventurous type you might want to look into
Mongoose ( http://rubyforge.org/projects/mongoose/ ). It's new, and
claims to be fast. In the past I wrote some small apps that used the
same author's other system, KirbyBase, which was very nice. I haven't
tried Mongoose yet, but I'm happily awaiting a problem to use it on.
Though now that I'm thinking about it, I don't know if it has built-in
"transaction" support. I suppose it depends on what you want it to do.

Have you tried SQLite and found it to be too slow? It sounds to me
like you might be prematurely optimizing. We have several medium to
medium-large sized apps that make heavy use of SQLite, and speed has
not been a problem at all.

What domain are you working in that requires this kind of speed
anyway? Just curious.

Hope that helps,
-Harold
 

Robert Klemme

Hi,

I'm looking for a persistent store, where:
* I can use it from Ruby
* it is fast
* it is transactional
* it can update more than one store in a single transaction
* simple String -> String mappings in the store are sufficient
* ideally uses files in the filesystem

What about PStore?
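
For reference, a minimal PStore session (it's in the standard library;
the filename is made up):

require 'pstore'

store = PStore.new('data.pstore')  # one file on disk

store.transaction do               # commits atomically at block exit
  store['alpha'] = 'one'
  store['beta']  = 'two'
end

store.transaction(true) do         # true => read-only transaction
  puts store['alpha']
end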

robert
 

ara.t.howard

Hi,

I'm looking for a persistent store, where:
* I can use it from Ruby
* it is fast
* it is transactional
* it can update more than one store in a single transaction
* simple String -> String mappings in the store are sufficient
* ideally uses files in the filesystem

These kinds of stores are normally implemented as persistent hashes or
BTrees (or both). I know that Sleepycat's Berkeley DB
<http://www.sleepycat.com/> does this, and I've used Java based systems that
do this. I also know of some C based things but they don't have Ruby
wrappers. I can't find anything in the Ruby world, and I don't care too much
if it isn't pure Ruby.

I know about Purple <http://purple.rubyforge.org/> and the QDBM Ruby wrapper
<http://qdbm.sourceforge.net/>. Neither does the multiple hash/BTree in a
transaction thing.

The trick appears to be with transactions.

check out joel's fsdb - it's very nice if you want to go pure ruby. i've used
it on several projects.
I know this can be done easily enough in a Relational DB, but I know that,
for example, JDBC/MySQL combination is significantly slower at what I need
to do than Perst (a pure Java thing that's startlingly fast). I've taken a
look at sqlite, which satisfies all requirements but I can't shake this
feeling that things can be better.

Does anyone have any ideas?

sqlite is hard to beat - i use it in at least 20 production systems and have
never had a single error. ruby queue uses it under the hood for the cluster
job store and this is used over nfs - we've been through disk failures and
power failures and come through unscathed.

it's also very fast, esp if you use in memory tables, but on linux it's just
as easy to copy the db to /dev/shm and then go...

if you end up using it try with my arrayfields package - james' sqlite binding
detects it automatically and the tuples will come back as arrays with named
field access - it's faster and requires less memory than a hash, and is also
really convenient to code with.
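
and since a single sqlite db file can hold any number of tables, the
update-more-than-one-store-in-a-transaction requirement falls out for
free. a rough sketch with the sqlite3 bindings (schema invented just
for illustration):

require 'sqlite3'

db = SQLite3::Database.new('stores.db')
db.execute("CREATE TABLE IF NOT EXISTS store_a (key TEXT PRIMARY KEY, value TEXT)")
db.execute("CREATE TABLE IF NOT EXISTS store_b (key TEXT PRIMARY KEY, value TEXT)")

# both writes commit together; an exception inside the block rolls both back
db.transaction do
  db.execute("INSERT OR REPLACE INTO store_a VALUES (?, ?)", ['k1', 'v1'])
  db.execute("INSERT OR REPLACE INTO store_b VALUES (?, ?)", ['k1', 'v2'])
end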

cheers.

-a
 

M. Edward (Ed) Borasky

Harold said:
Have you tried SQLite and found it to be too slow? It sounds to me
like you might be prematurely optimizing. We have several medium to
medium-large sized apps that make heavy use of SQLite, and speed has
not been a problem at all.
On Linux, at any rate, one ought to be able to make SQLite extremely
fast by adding RAM and tuning the I/O subsystem and the kernel, and
using tricks like memory mapped files, assuming the database in question
isn't too large. I'm assuming it's small, otherwise he'd pretty much
need a humongous database server and an industrial strength database
like Oracle or PostgreSQL.
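
Some of that tuning is also available from inside SQLite itself via
pragmas; for instance (values are illustrative only):

require 'sqlite3'

db = SQLite3::Database.new('big.db')
db.execute('PRAGMA cache_size = 100000') # keep more pages cached in RAM
db.execute('PRAGMA synchronous = OFF')   # faster, at some durability risk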
 

Bob Hutchison

Hi Bob,

If you're the fun and adventurous type you might want to look into
Mongoose ( http://rubyforge.org/projects/mongoose/ ). It's new, and
claims to be fast. In the past I wrote some small apps that used the
same author's other system, KirbyBase, which was very nice. I haven't
tried Mongoose yet, but I'm happily awaiting a problem to use it on.
Though now that I'm thinking about it, I don't know if it has built-in
"transaction" support. I suppose it depends on what you want it to do.

I have had a look at Mongoose, and quite like it. I'm sure to find a
use for it. However, for this particular situation Mongoose's
transactions are too weak.
Have you tried SQLite and found it to be too slow? It sounds to me
like you might be prematurely optimizing. We have several medium to
medium-large sized apps that make heavy use of SQLite, and speed has
not been a problem at all.

In the Java world things like Perst and JDBM were very fast. SQLite
was problematic for some reason that is a bit foggy just now (I'll
think of it).
What domain are you working in that requires this kind of speed
anyway? Just curious.

I write web based applications. The situations I'm thinking of are
either financial or CMS-like applications. In the financial cases
I've got situations where there might be a few million (probably not
more than ten million, well maybe 20 million) small things tucked
away. In the case of CMS-like stuff there are far fewer things, let's
say 25k at most and usually more like a thousand or so, but they are
bigger things.

The key factor (pardon the pun) is that the *vast* majority of the
time (and I know how to make it 100% of the time) queries are as
you'd have in a hash table (i.e. a single key where the key is
usually a String).

This is for the persistent store associated with xampl (see URL in my
signature). Xampl already has a few persistent stores but I now need
one a little fancier.

Hope that helps,
-Harold

Thanks Harold.

Cheers,
Bob

----
Bob Hutchison -- blogs at <http://www.recursive.ca/hutch/>
Recursive Design Inc. -- <http://www.recursive.ca/>
Raconteur -- <http://www.raconteur.info/>
xampl for Ruby -- <http://rubyforge.org/projects/xampl/>
 

Bob Hutchison

What about PStore?

Oh. Right. PStore never registered with me before.

I just had a look at it and it seems to be along the lines of QDBM
and Purple, transactional on a single store and each store being one
hash/tree.

Thanks for pointing that out.

Cheers,
Bob

----
Bob Hutchison -- blogs at <http://www.recursive.ca/hutch/>
Recursive Design Inc. -- <http://www.recursive.ca/>
Raconteur -- <http://www.raconteur.info/>
xampl for Ruby -- <http://rubyforge.org/projects/xampl/>
 

Bob Hutchison

Hi Ara,

check out joel's fsdb - it's very nice if you want to go pure ruby. i've
used it on several projects.

I'm already using that (version 0.5 -- I can't get to RAA right now
for some reason, and there are no files on Rubyforge for fsdb, so I
don't know if there is a more recent version). In version 0.5 the
transactions were not sufficient, I think (it would be nice if I was
wrong).
sqlite is hard to beat - i use it in at least 20 production systems and
have never had a single error. ruby queue uses it under the hood for the
cluster job store and this is used over nfs - we've been through disk
failures and power failures and come through unscathed.

That's good to hear. Don't misunderstand, I *like* SQLite but I don't
know that it is suitable. If I can't find anything else I'll use it.

The thing that makes me nervous is that I can do what I need with
just two tables. One that has (key, value-kind, value), and another
that has (index-name, index-value, key, value-kind). Each column is a
String. The vast majority of queries would be based on key and
value-kind on the first table. The remaining queries would be a
select/join kind of thing. And I can very easily jam the key and
value-kind into the same field (but that would be an optimisation that
may not be necessary). The trouble is that the first table might get
to be very very large. I don't know how SQLite behaves with a single
huge table. I suppose I'm going to have to find out.
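
If I go the SQLite route, it would presumably look something like this
(a sketch only; all names are provisional):

require 'sqlite3'

db = SQLite3::Database.new('xampl.db')
db.execute_batch <<-SQL
  CREATE TABLE IF NOT EXISTS things (
    key        TEXT,
    value_kind TEXT,
    value      TEXT,
    PRIMARY KEY (key, value_kind)
  );
  CREATE TABLE IF NOT EXISTS indexes (
    index_name  TEXT,
    index_value TEXT,
    key         TEXT,
    value_kind  TEXT
  );
SQL

# the common case: a single-key lookup, just as in a hash table
row = db.get_first_row(
  'SELECT value FROM things WHERE key = ? AND value_kind = ?',
  ['some-key', 'some-kind'])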
it's also very fast, esp if you use in memory tables, but on linux it's
just as easy to copy the db to /dev/shm and then go...

That's interesting.
if you end up using it try with my arrayfields package - james' sqlite
binding detects it automatically and the tuples will come back as arrays
with named field access - it's faster and requires less memory than a
hash, and is also really convenient to code with.

I noticed that the other day. Nice.

Cheers,
Bob
cheers.

-a
--
to foster inner awareness, introspection, and reasoning is more
efficient than meditation and prayer.
- h.h. the 14th dalai lama

----
Bob Hutchison -- blogs at <http://www.recursive.ca/hutch/>
Recursive Design Inc. -- <http://www.recursive.ca/>
Raconteur -- <http://www.raconteur.info/>
xampl for Ruby -- <http://rubyforge.org/projects/xampl/>
 

Bob Hutchison

On Linux, at any rate, one ought to be able to make SQLite
extremely fast by adding RAM and tuning the I/O subsystem and the
kernel, and using tricks like memory mapped files, assuming the
database in question isn't too large. I'm assuming it's small,
otherwise he'd pretty much need a humongous database server and an
industrial strength database like Oracle or PostgreSQL.

A few GB max for one set of applications, far far less for others.

Cheers,
Bob

----
Bob Hutchison -- blogs at <http://www.recursive.ca/hutch/>
Recursive Design Inc. -- <http://www.recursive.ca/>
Raconteur -- <http://www.raconteur.info/>
xampl for Ruby -- <http://rubyforge.org/projects/xampl/>
 

ara.t.howard

The thing that makes me nervous is that I can do what I need with just two
tables. One that has (key, value-kind, value), and another that has
(index-name, index-value, key, value-kind). Each column is a String. The
vast majority of queries would be based on key and value-kind on the first
table. The remaining queries would be a select/join kind of thing. And I
can very easily jam the key and value-kind into the same field (but that
would be an optimisation that may not be necessary). The trouble is that the
first table might get to be very very large. I don't know how SQLite behaves
with a single huge table. I suppose I'm going to have to find out.

please post a summary of your findings when you run your tests - i'd be
interested to read them.

cheers.

-a
 

Kroeger, Simon (ext)

From: Bob Hutchison [mailto:[email protected]]
Sent: Wednesday, August 09, 2006 4:39 PM

Have you tried SQLite and found it to be too slow? It sounds to me
like you might be prematurely optimizing. We have several medium to
medium-large sized apps that make heavy use of SQLite, and speed has
not been a problem at all.

In the Java world things like Perst and JDBM were very fast. SQLite
was problematic for some reason that is a bit foggy just now (I'll
think of it).

All I can say is SQLite3 was amazingly fast at solving every problem
I threw at it (I can't speak for the Ruby bindings, but they appear
to be relatively thin).

I have to admit though that I never had more than a few hundred
thousand items in a single table.

Being blessed with the power and speed of complex SQL statements is
invaluable, especially in a scripting language.

cheers

Simon
 

M. Edward (Ed) Borasky

Bob said:
A few GB max for one set of applications, far far less for others.
A few GB ... I'd be looking at a "real database" for that. "Pay me now
or pay me later", as the saying goes. :)
 

snacktime

Hi,

I'm looking for a persistent store, where:
* I can use it from Ruby
* it is fast
* it is transactional
* it can update more than one store in a single transaction
* simple String -> String mappings in the store are sufficient
* ideally uses files in the filesystem

These kinds of stores are normally implemented as persistent hashes
or BTrees (or both). I know that Sleepycat's Berkeley DB
<http://www.sleepycat.com/> does this, and I've used Java based systems
that do this. I also know of some C based things but they don't have
Ruby wrappers. I can't find anything in the Ruby world, and I don't
care too much if it isn't pure Ruby.

Why not use berkeleydb? There are ruby bindings for it.
 

Bob Hutchison

A few GB ... I'd be looking at a "real database" for that. "Pay me
now or pay me later", as the saying goes. :)

Well, my concern probably comes from the same place as your thinking
of a 'real database' :) To make it worse, in a relational DB I've got
this urge to stick all of that into only two tables... makes me
nervous. Anyway, in the Java world, Perst has proven itself very
capable, jaw-droppingly fast, and reliable.


Cheers,
Bob

----
Bob Hutchison -- blogs at <http://www.recursive.ca/hutch/>
Recursive Design Inc. -- <http://www.recursive.ca/>
Raconteur -- <http://www.raconteur.info/>
xampl for Ruby -- <http://rubyforge.org/projects/xampl/>
 

khaines

On Fri, 11 Aug 2006, Bob Hutchison wrote:

How fast is fast enough, and what kind of transaction support do you need?

I have a class that is part of the suite of caches I provide in IOWA that
I have recently been working on.

It's called a DiskCache because it implements an LRU cache that operates
exclusively through persistent storage, but the expiration of elements
can be turned off, turning it into a simple persistent storage mechanism.
Because it operates exclusively through persistent storage, it requires
no in-memory indexes or other data structures. This was important to me
because when using PStore with many small transactions on any data set
that was not tiny (more than a thousandish elements), PStore's index
reading/writing overhead becomes troublesome, and I didn't want to incur
the RAM hit of an in-memory index for a very large data set, either.

It's relatively fast (it is much faster than PStore for non-trivial data
sets since PStore requires reading an index into memory for each
transaction), platform independent (well, I have tested it on Linux and
Win XP) and it is multiprocess/threadsafe.

It supports transactions as well as nested transactions.

i.e.

@cache.transaction do
  @cache[:a] = 1
  @cache.transaction do
    @cache[:b] = 2
    @cache.transaction do
      @cache[:c] = 3
      rollback
    end
    commit
  end
  rollback
end

At the end of those transactions, @cache[:b] == 2, and the two other items
have been rolled back.

I'm currently expanding the test cases for it and cleaning up some bugs,
and will go ahead and release it as a standalone package this weekend.
Maybe it will help you.


Kirk Haines
 

Francis Cianfrocca

Kirk, how did you do this? Are you storing immutable objects and
turning older versions into garbage somehow? If you have no indexing,
then how do you enforce a consistent view across processes without
incurring disk i/o (at least having to read the inode)? Or are you
managing an arena that is backed by shared memory and you convert the
key values deterministically into offsets?

Sorry for all the questions, but I've just had to fight with this
problem too many times and it would be great if you've come up with a
better mousetrap.
 

Justin Collins

Kirk said:
How fast is fast enough, and what kind of transaction support do you
need?
I have a class that is part of the suite of caches I provide in IOWA
that I have recently been working on.
I'm currently expanding the test cases for it and cleaning up some
bugs, and will go ahead and release it as a standalone package this
weekend. Maybe it will help you.
Francis said:
Kirk, how did you do this? Are you storing immutable objects and
turning older versions into garbage somehow? If you have no indexing,
then how do you enforce a consistent view across processes without
incurring disk i/o (at least having to read the inode)? Or are you
managing an arena that is backed by shared memory and you convert the
key values deterministically into offsets?

Sorry for all the questions, but I've just had to fight with this
problem too many times and it would be great if you've come up with a
better mousetrap.

I'm interested in this, too... please make an announcement on the list
when you release it? :)

-Justin
 

khaines

Kirk, how did you do this? Are you storing immutable objects and
turning older versions into garbage somehow? If you have no indexing,
then how do you enforce a consistent view across processes without
incurring disk i/o (at least having to read the inode)? Or are you
managing an arena that is backed by shared memory and you convert the
key values deterministically into offsets?

Sorry for all the questions, but I've just had to fight with this
problem too many times and it would be great if you've come up with a
better mousetrap.

I don't think it is necessarily a better mousetrap. In this case, I'm
just willing to make the filesystem do some of the work for me instead of
keeping a separate index in memory.

I'm using a modified version of Ara Howard's lockfile.rb to handle locking
between processes. It works over NFS, and I have modified it so that it
also works on Windows by automatically falling back to a Windows safe
locking method.

I create a hash (Digest::SHA512 based) out of the key, and use that hash
as the filename for the data. There is also a second file written that
contains some metadata (the structure on disk is actually a linked list,
to more easily support some of the LRU aspects and element expiration).
There are a couple of metadata files that also maintain some information
on the data store as a whole.

The code automatically breaks the file storage into a tree of directories
to help avoid any filesystem problems with having too many files in a
single directory. The size of this tree is tuneable so that a data store
that may have a million entries may have more nodes (directories) than one
that is only expected to have 10000.

So, retrieving an element from the data store is reduced to hashing the
key, finding the file, and reading it. It's disk i/o, but less than
PStore generates most of the time.
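
In outline, the scheme looks something like this (a simplified sketch,
not the actual IOWA code; the names and the fan-out details are
illustrative):

require 'digest/sha2'
require 'fileutils'

class FileStore
  def initialize(root, depth = 2)
    @root, @depth = root, depth
  end

  # hash the key, then fan files out into subdirectories so that no
  # single directory accumulates too many entries
  def path_for(key)
    digest = Digest::SHA512.hexdigest(key.to_s)
    dirs = (0...@depth).map { |i| digest[i * 2, 2] }
    File.join(@root, *dirs, digest)
  end

  def []=(key, value)
    path = path_for(key)
    FileUtils.mkdir_p(File.dirname(path))
    File.open(path, 'wb') { |f| f.write(value) }
  end

  def [](key)
    path = path_for(key)
    File.read(path) if File.exist?(path)
  end
end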

Since my need was for something that could do this while also being
treated as an LRU cache, there is some extra work mixed in there to
maintain the metadata and expire elements from the cache (it can also
expire them based on age), and that's really where most of the
performance hit comes from.

I'm working on creating a second version that strips out all of the LRU
cache code to provide a simple, fast data persistence implementation
without any of that extra overhead, and will be better able to report on
just how much of a performance hit that overhead brings with it soon.


Thanks,

Kirk Haines
 

Francis Cianfrocca

I don't think it is necessarily a better mousetrap. In this case, I'm
just willing to make the filesystem do some of the work for me instead
of keeping a separate index in memory.
...


With all the times I've reinvented this wheel, I've never tried
storing millions of data elements each in its own file. Do you have
performance metrics you can share? Did you tune it for a particular
filesystem?
 
