Looking for a Fast Persistent Store

khaines

With all the times I've reinvented this wheel, I've never tried
storing millions of data elements each in its own file. Do you have
performance metrics you can share? Did you tune it for a particular
filesystem?

Not really. I have not stored the results of any tests to date. The
largest test, however, created a million-item cache, and I had no
problems.

The main consideration, if one is planning on storing large numbers of
elements, is making sure that the filesystem is built with large numbers
of small files in mind.

However, a lot of files can fit into a surprisingly non-messy directory
structure.

Consider an SHA-512 hash for a key whose first few characters are:

dd232b224979c0e

If I am running a cache with a bucket depth of 2 and a bucket width of 2,
the file for that entry is going to go into the directory

dd/23/

Given an even distribution of hashes, with a million records, one should
have 256 directories at the top level, 256 at the second level, and 15 or
16 files under each of those.

That's pretty easy to handle efficiently.
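
In Ruby, the path computation is just a matter of slicing the hash. This is
a minimal sketch rather than the actual cache code; the method name, the
defaults, and the example paths are illustrative only:

require 'digest/sha2'

# Map a key to its bucket directory and file name.
# depth = number of directory levels, width = hex characters per level.
def bucket_path(root, key, depth = 2, width = 2)
  hash = Digest::SHA512.hexdigest(key)
  dirs = (0...depth).map { |i| hash[i * width, width] }
  File.join(root, dirs.join("/"), hash)
end

bucket_path("/var/cache/sessions", "some-session-id")
# => "/var/cache/sessions/dd/23/dd232b224979c0e..." for a hash
#    beginning with dd232b...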


Kirk Haines
 
Bob Hutchison

I'm currently expanding the test cases for it and cleaning up some
bugs, and will go ahead and release it as a standalone package this
weekend. Maybe it will help you.

Oh, this sounds very cool! Even if it doesn't help with this
particular situation, I have a pretty good idea where it will come in
handy.

Looking forward to it.

Cheers,
Bob

----
Bob Hutchison -- blogs at <http://www.recursive.ca/hutch/>
Recursive Design Inc. -- <http://www.recursive.ca/>
Raconteur -- <http://www.raconteur.info/>
xampl for Ruby -- <http://rubyforge.org/projects/xampl/>
 
Bob Hutchison

I was so interested in this idea that I cobbled up a quick test on a
workstation machine with a medium-bandwidth IO channel. 100,000 1K files
brought the shared-memory filesystem to its knees almost immediately. On an
oxide-based journalling filesystem, it seemed to do 10 to 20 thousand files
a second pretty consistently. That's a decent but not huge fraction of the
available channel bandwidth. Haven't tried a million files. I'm still
interested though.

In the Java thing we wrote we had hundreds of thousands of files in a
directory, nothing shared though. No problems until you did an 'ls' by
mistake :) Backups are also a problem. External fragmentation can be
a problem. I know we experimented with the kind of thing that I think
Kirk was suggesting. By building a hierarchy you can keep the numbers
of files/directories down, and this avoids the backup (and 'ls')
problems.

Cheers,
Bob

----
Bob Hutchison -- blogs at <http://www.recursive.ca/hutch/>
Recursive Design Inc. -- <http://www.recursive.ca/>
Raconteur -- <http://www.raconteur.info/>
xampl for Ruby -- <http://rubyforge.org/projects/xampl/>
 
Bob Hutchison

Clearly a directory hierarchy is the only way to make this work, and Kirk's
idea of partitioning the space by segmenting the hash seems as reasonable as
any. So far I'm very surprised to find that performance may be reasonable.

So was I, but it actually is quite good (I'll publish some unscientific
benchmarks in the next couple of days).

But what do you mean by "external fragmentation"? If you're thinking of NFS,
I'd rather cut myself than attempt something like this on NFS, even though
Kirk says it's supported.

Sorry, jargon from 3 decades ago. External fragmentation, in this
case, is the disk space wasted when small files are put into larger
disk segments (e.g. a 1k file in a 4k disk segment is a 3k loss of
space). Things like Perst, qdbm, Purple, and SQLite-as-persistent-store
will avoid almost all external fragmentation, at the cost of internal
fragmentation. In the case of the file system, there isn't any space
wasted *within* the 1k file, but the persistent stores all have this
problem (and you'll see them talking about free space management,
best-fit, first-fit, garbage collection, and so on).

Cheers,
Bob

----
Bob Hutchison -- blogs at <http://www.recursive.ca/hutch/>
Recursive Design Inc. -- <http://www.recursive.ca/>
Raconteur -- <http://www.raconteur.info/>
xampl for Ruby -- <http://rubyforge.org/projects/xampl/>
 
Jon Smirl

Hi,

I'm looking for a persistent store, where:
* I can use it from Ruby
* it is fast
* it is transactional
* it can update more than one store in a single transaction
* simple String -> String mappings in the store are sufficient
* ideally uses files in the filesystem

These kinds of stores are normally implemented as persistent hashes
or BTrees (or both). I know that Sleepycat's Berkeley DB <http://
www.sleepycat.com/> does this, and I've used Java based systems that
do this. I also know of some C based things but they don't have Ruby
wrappers. I can't find anything in the Ruby world, and I don't care
too much if it isn't pure Ruby.

SQLite has full Ruby support and is free.

http://sqlite-ruby.rubyforge.org/
 
Bill Kelly

From: "Francis Cianfrocca said:
I was so interested in this idea that I cobbled up a quick test on a
workstation machine with a medium-bandwidth IO channel. 100,000 1K files
brought the shared-memory filesystem to its knees almost immediately. On an
oxide-based journalling filesystem, it seemed to do 10 to 20 thousand files
a second pretty consistently. That's a decent but not huge fraction of the
available channel bandwidth. Haven't tried a million files. I'm still
interested though.

Hi,

What is being measured? Access time for files that already exist?
Creation of new files? Scanning the directory structure for a list
of existing files?

At a prior gig, we used to split a couple hundred thousand
encyclopedia articles up as 12/34/56.xxx sort of format. It worked
adequately for our needs--our batch-oriented processing was expected
to run overnight anyway--but my impression was that as long as the
filename was known, accessing file 12/34/56.xxx seemed quick,
whereas directory scans to enumerate the existing filenames were
pretty slow.

(This was on SunOS/Solaris systems circa 2000... I don't know if
the filesystem was journalled or not.)


Regards,

Bill
 
Matt Todd

What about writing a thin C extension for interfacing with Perst?
Going back and forth from C to Java may be tricky, though... But,
then, what about any of the non-Java or -Ruby options? Just write a
quick C extension/wrapper for Ruby, or use SWIG to do it for you.

M.T.
 
Francis Cianfrocca

Hi,

What is being measured? Access time for files that already exist?
Creation of new files? Scanning the directory structure for a list
of existing files?

At a prior gig, we used to split a couple hundred thousand
encyclopedia articles up as 12/34/56.xxx sort of format. It worked
adequately for our needs--our batch-oriented processing was expected
to run overnight anyway--but my impression was that as long as the
filename was known, accessing file 12/34/56.xxx seemed quick,
whereas directory scans to enumerate the existing filenames were
pretty slow.

I wanted to put a rough lower bound on the performance of the
approach, just to decide if it's worth pursuing at all. (Kirk
obviously thinks it is, but I don't know if the class of applications
that interests him is anything like the class that interests me.) So I
measured creation of 1k files and re-writing of 1k files. (The prior
case involves creating an inode and data pages, the second involves
touching an inode and creating data pages.) I didn't test reads of the
files (which would involve touching an inode) because I didn't feel
like optimizing the test (by remounting my filesystem with the
inode-touch for last-access turned off). The test box was a
medium-powered Linux workstation with a 2.6 kernel, a single SATA
drive, and ext3 filesystems.
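
For the record, the test was nothing fancier than something like the
following sketch (this is a reconstruction, not the actual script; the
file count, the directory, and the 1K payload are the only things that
matter):

require 'benchmark'
require 'fileutils'

# Time the creation and then the re-writing of N 1K files.
N    = 10_000
DATA = "x" * 1024
DIR  = "/tmp/filestore-test"    # illustrative path

FileUtils.mkdir_p(DIR)

create = Benchmark.realtime do
  N.times { |i| File.open("#{DIR}/#{i}", "w") { |f| f.write(DATA) } }
end

rewrite = Benchmark.realtime do
  N.times { |i| File.open("#{DIR}/#{i}", "w") { |f| f.write(DATA) } }
end

puts "create:  %.2fs" % create
puts "rewrite: %.2fs" % rewrite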

I'd expect under normal conditions to get maybe 15 megabytes per
second of disk-write bandwidth from this system, although with tuning
I could probably get a lot more. But again, I was going for a smell
test here. This whole approach is attractive because it's easy to
code, so I'd want to use it for light-duty applications on non-tuned
hardware. For an application with more stringent requirements, I'd
make a different tradeoff in development time and probably use a
different approach.

Anyway, I first tried it on /dev/shm. That worked really nice for
10,000 files, took about 0.9 seconds consistently to create new files
and 0.6 seconds to re-write them. The same test with 100,000 files
totally de-stabilized the machine. I didn't want to reboot it so I
waited. Fifteen minutes later it was back. But what a strange journey
that must have been. Obviously this approach doesn't make a lot of
sense on a shm device anyway, but I had to know.

With a disk-based filesystem, things were a lot better. For 10,000
files, about 1.6 seconds to create them and 1.2 to rewrite them. Those
numbers were consistent across many trials. Similar results at 30,000
and 60,000 files, just scale upward. At 100,000 files things got
screwy. The create-time got variable, ranging from 5 seconds to almost
15 seconds from run to run. During all the runs, the machine didn't
appear to destabilize and remained responsive. Obviously processor
loads were very low. I didn't make a thorough study of page faults and
swapping activity though. But notice that the implied throughput is an
interestingly high fraction of my notional channel bandwidth of 15
megabytes/sec. And the journalling FS means that I don't even think
about the movement of the R/W head inside the disk drive anymore. (Of
course that may matter a great deal on Windows, but if I'm using
Windows then *everything* about the project costs vastly more anyway,
so who cares?)

So I'm claiming without further study (and without trying to explain
the results) that the lower bound on performance is in the region of
5000 writes a second. That's just at the fuzzy edge of being worth
doing. I usually think in terms of a "budget" for any operation that
must be repeated for a continuously-running server application. In
general, I want to be able to do a bare minimum of 1000 "useful
things" per second on a sustained basis, on untuned hardware. ("Useful
things" might be dynamic web-pages generated, or guaranteed-delivery
messages processed, etc.) So this approach uses nearly 20% of my
budget. It's a big number. (Just to show how I apply this kind of
analysis: I never worry about adding a SHA-1 hash calculation to any
critical code path, because I know I can do 250,000 of those per
second without breaking a sweat.)
 
Bob Hutchison

Hi,

That's a pretty constant tradeoff, speed for space, and it shows up in
many different ways. I'm convinced that's why there's no single answer
to this problem- every application will need different optimizations.
However, this is somewhat less true than it once was, for two reasons:
journaling filesystems, and the fact that disk storage is still getting
cheaper every year at a rapid rate- it's the only element in the
computing chain that still is. So my inclination (subject to change
depending on the particular case) is to waste space if it will save
time (development time and/or runtime).

I absolutely agree, which is why I've implemented the filesystem route
first. There is *no* performance problem with it. The problem is that
I've got to write a bunch of files out, and if for some reason even
one file failed to write I'd have a consistency problem. There are
easy ways to mitigate these risks, but they can take an excruciatingly
long time to run. Moving to a transactional store (like Perst in
Java, or SQLite in Ruby) gives me the guarantees of consistency that
I'm looking for, but SQLite in Ruby is 3-5 times slower than going
direct to the file system on OS X with journaling on (unless there's
something wrong in my testing... silly benchmarks forthcoming). In
Java, Perst is *much* faster than the filesystem.

Now, I've thought of implementing the same kind of techniques that
some persistent stores use to assure consistency, but directly in the
filesystem. The only quick-enough techniques that I can think of
would require a recovery step after failure, but that shouldn't be too
big a problem (especially as the frequency of failure will be low,
judging from my current experience, where I've not had any failures yet).
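
One common technique along these lines (just a sketch of the general
idea, not necessarily what I have in mind) is to write each file under a
temporary name and rename it into place; a crash then never leaves a
half-written file visible, and recovery is just a sweep for leftover
temporaries:

# Write data to path "atomically": write to a temp file in the same
# directory, fsync, then rename. On POSIX filesystems the rename is
# atomic, so readers see either the old contents or the new contents,
# never a partial write.
def atomic_write(path, data)
  tmp = "#{path}.#{Process.pid}.tmp"
  File.open(tmp, "w") do |f|
    f.write(data)
    f.fsync                            # make sure the data hits the disk first
  end
  File.rename(tmp, path)
ensure
  File.delete(tmp) if File.exist?(tmp) # error recovery: drop the temp file
end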

----
Bob Hutchison -- blogs at <http://www.recursive.ca/hutch/>
Recursive Design Inc. -- <http://www.recursive.ca/>
Raconteur -- <http://www.raconteur.info/>
xampl for Ruby -- <http://rubyforge.org/projects/xampl/>
 
Bob Hutchison

What about writing a thin C extension for interfacing with Perst?
Going back and forth from C to Java may be tricky, though... But,
then, what about any of the non-Java or -Ruby options? Just write a
quick C extension/wrapper for Ruby, or use SWIG to do it for you.

M.T.

That's a good point. I don't think I want to involve Java in this if
I can help it. The fellow who wrote Perst, Konstantin Knizhnik
<http://www.garret.ru/~knizhnik/databases.html>, has also written
GOODS, FastDB, and GigaBASE. If I were going to write a C extension
I'd go with one of those. Konstantin has also written something
called DyBASE which is specifically for dynamic languages, like Ruby
and Python, and comes with bindings for Ruby 1.6.x. I've asked
Konstantin about the state of DyBASE and am trying to work out if
it is worth updating to Ruby 1.8.4.

Cheers,
Bob

----
Bob Hutchison -- blogs at <http://www.recursive.ca/hutch/>
Recursive Design Inc. -- <http://www.recursive.ca/>
Raconteur -- <http://www.raconteur.info/>
xampl for Ruby -- <http://rubyforge.org/projects/xampl/>
 
Francis Cianfrocca

I absolutely agree which is why I've implemented the filesystem route
first. There is *no* performance problem with it.

The fastest thing I ever wrote was a transactional persistent store
for a guaranteed-delivery messaging system. It used a decidedly
space-wasting design, and profiled approximately 100 times faster than
the filesystem on all supported platforms (Windows, Linux, Solaris) at
the time it was written (about eight years ago), assuming enough I/O
and memory bandwidth was available to make the comparison meaningful.
Most of the speed came by taking advantage of the tunings in the
kernel's VMM.

But I can't imagine writing such a thing in Java. What does Perst do
to be so fast?
 
Francis Cianfrocca

something wrong in my testing... silly benchmarks forthcoming). In
Java, Perst is *much* faster than the filesystem.

Ok, did some reading up on Perst- no real magic there. They simply put
together a lot of the good techniques you would expect to find in a
virtual arena manager, and they made some guesses about the handful of
tough questions you have to answer. (Allocation-grain size, page size,
etc.) I used the same kind of bitmap strategy for the persistent
GD-message system I mentioned upthread, but Perst makes a different
choice: they sequentially search the bitmap pages for a big-enough
free space when storing an object. Even though they use the obvious
optimization (storing a max-available-size on each page), I remember
being phobic about that approach and did something much more elaborate
(but reliably O(1)). But given your comments about Perst's speed, they
have obviously made the right choices overall!
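
In toy form, the scheme looks roughly like the following (a sketch of the
general technique, not Perst's code or mine; the grain count, page count,
and class names are all made up):

# Free space is tracked as a bitmap of fixed-size grains. Each bitmap
# page caches the size of its largest free run, so the allocator can
# skip pages that cannot possibly satisfy a request.
class BitmapPage
  GRAINS = 1024                            # grains per page (made up)

  attr_reader :max_free

  def initialize
    @bits     = Array.new(GRAINS, false)   # false = free grain
    @max_free = GRAINS
  end

  # First-fit search for `count` contiguous free grains; returns the
  # starting grain index, or nil if no run is big enough.
  def allocate(count)
    run_start, run_len = nil, 0
    @bits.each_with_index do |used, i|
      if used
        run_start, run_len = nil, 0
      else
        run_start ||= i
        run_len += 1
        if run_len == count
          run_start.upto(run_start + count - 1) { |j| @bits[j] = true }
          recompute_max_free
          return run_start
        end
      end
    end
    nil
  end

  private

  def recompute_max_free
    best = cur = 0
    @bits.each do |used|
      if used
        cur = 0
      else
        cur += 1
        best = cur if cur > best
      end
    end
    @max_free = best
  end
end

class Allocator
  def initialize(page_count = 16)
    @pages = Array.new(page_count) { BitmapPage.new }
  end

  # Sequentially scan the pages, using the cached max-free hint to skip
  # pages whose largest free run is too small; returns [page, offset].
  def allocate(count)
    @pages.each_with_index do |page, idx|
      next if page.max_free < count
      offset = page.allocate(count)
      return [idx, offset] if offset
    end
    nil
  end
end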

I think something like Perst could be done in Ruby, possibly augmented
with a tiny C extension to access virtual-memory features.
 
Jon Smirl

I wanted to put a rough lower bound on the performance of the
approach, just to decide if it's worth pursuing at all. (Kirk
obviously thinks it is, but I don't know if the class of applications
that interests him is anything like the class that interests me.) So I
measured creation of 1k files and re-writing of 1k files. (The prior
case involves creating an inode and data pages, the second involves
touching an inode and creating data pages.) I didn't test reads of the
files (which would involve touching an inode) because I didn't feel
like optimizing the test (by remounting my filesystem with the
inode-touch for last-access turned off). The test box was a
medium-powered Linux workstation with a 2.6 kernel, a single SATA
drive, and ext3 filesystems.

You are rebuilding many of the features of git.
http://git.or.cz/index.html

Git is the SCM designed by Linus to replace BitKeeper. C source code
is freely available.

The git file store hashes files into an array of 256 directories and
keeps them initially as small files. Later on you can run a pass which
transparently converts them into single-file packs. A pack may have
many thousands of files that can be accessed via their hash names. Packs
recover the space lost to block fragmentation. Git directory objects
are used to map hash names to human-readable strings.

When packs are created there is code that searches for similarities
between files in the pack. If similar files are found they are stored
as deltas from the original file. All objects stored in the system,
files or packs, are zlib compressed.
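
The loose-object half of that scheme is simple enough to sketch in a few
lines of Ruby (this is an illustration of the idea, not git's actual
on-disk format, which also hashes a type/length header along with the
content):

require 'digest/sha1'
require 'zlib'
require 'fileutils'

# Store content addressed by its SHA-1: zlib-compress it and file it
# under a directory named for the first two hex digits of the hash.
def store_object(root, content)
  sha  = Digest::SHA1.hexdigest(content)
  dir  = File.join(root, sha[0, 2])
  path = File.join(dir, sha[2..-1])
  unless File.exist?(path)             # objects are immutable: write once
    FileUtils.mkdir_p(dir)
    File.open(path, "wb") { |f| f.write(Zlib::Deflate.deflate(content)) }
  end
  sha                                  # the key used to fetch it back later
end

def load_object(root, sha)
  path = File.join(root, sha[0, 2], sha[2..-1])
  Zlib::Inflate.inflate(File.open(path, "rb") { |f| f.read })
end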

It wouldn't be a lot of trouble to take the indexing code out of
cLucene and adjust it to use the git file store.

Transactions are achieved by first adding all of your files to the
file store, then a single 'commit' makes them become visible. If you
never do the commit there are tools for pruning dangling objects out
of the db. But the dangling objects aren't visible so it's just a
cleanup step.

Git is an archival system: objects in the database cannot be changed,
they can only be superseded by a later version. Cryptographic hashes
will immediately detect if something in the db has been altered.

I have multi-gigabyte git file stores containing millions of objects.
Performance is pretty constant until your working set won't fit into
memory. The killer feature of these file stores is support for
distributed replication with multiple people adding objects to their
distributed copies.
 
Francis Cianfrocca

You are rebuilding many of the features of git.
http://git.or.cz/index.html


Write-once/read-many is the optimization profile for archival storage,
but not for a persistent session cache. You're proving yet again that
there are almost as many valid and useful approaches to this problem
as there are applications.
 
khaines

Write-once/read-many is the optimization profile for archival storage,
but not for a persistent session cache. You're proving yet again that

A persistent session cache was actually my original motivation when I
started working on the code.

IOWA's default way of managing sessions is via a cache. For a busy
server, this obviously imposes some memory constraints on just how many
sessions one can keep around, or for just how long.

So I created a bilevel cache that uses an in-memory LRU cache for the L1
cache, and when an element is expired from the L1 cache, it is inserted
into the L2 cache. If the L2 cache is a persistent store, then that
session can be retrieved again when required.

By making that persistent store itself an LRU cache, I can still let it
self-manage its ultimate size or the ultimate age of elements within it,
I maintain reasonably fast access to elements that have gone out to the
persistent store, and I can potentially do things like have the server
flush its L1 cache to L2 and shut down. When it's started again, all of
the sessions which were available when it was shut down are still
available. Etc....
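
Stripped to its essentials, the arrangement looks something like the
following sketch (illustrative only, not the actual IOWA code; the L2 can
be anything that responds to [] and []=, such as the persistent
file-backed cache described above):

# A two-level cache: an in-memory L1 with LRU eviction that spills
# evicted entries into a (presumably persistent) L2.
class BilevelCache
  def initialize(l2, max_l1 = 1000)
    @l1     = {}       # hash insertion order doubles as LRU order (Ruby 1.9+)
    @l2     = l2
    @max_l1 = max_l1
  end

  def []=(key, value)
    @l1.delete(key)                  # re-inserting marks it most recent
    @l1[key] = value
    evict_oldest if @l1.size > @max_l1
  end

  def [](key)
    if @l1.key?(key)
      value = @l1.delete(key)        # touch: move to most-recent position
      @l1[key] = value
    elsif (value = @l2[key])
      self[key] = value              # promote back into L1
      value
    end
  end

  private

  def evict_oldest
    key, value = @l1.first           # least recently used entry
    @l1.delete(key)
    @l2[key] = value                 # spill into the persistent L2
  end
end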


Kirk Haines
 
Jon Smirl

Write-once/read-many is the optimization profile for archival storage,
but not for a persistent session cache. You're proving yet again that
there are almost as many valid and useful approaches to this problem
as there are applications.

Active sessions would stay as files and generate a new key each time
they are changed. Once a day you would sweep them into a pack. You
will need to make some small modifications for the git code to do this
but all of the hard code is already written.

With git's compression you will be able to store hundreds of millions
of session snapshots in a couple of gigs of disk. It would provide
interesting data for later analysis.
 
Francis Cianfrocca

Active sessions would stay as files and generate a new key each time
they are changed. Once a day you would sweep them into a pack. You
will need to make some small modifications for the git code to do this
but all of the hard code is already written.

With git's compression you will be able to store hundreds of millions
of session snapshots in a couple of gigs of disk. It would provide
interesting data for later analysis.


Once again, you're solving a rather different problem (or perhaps
you're redefining the problem to fit your solution): permanently
storing ephemeral session caches for later analysis is interesting but
different from just having a session cache that survives process
crashes. Yes, the "hard work" is done, but if you didn't need to do
the hard work in the first place, it's wasteful. Again, this whole
subthread is about Kirk's filesystem approach, which is attractive for
some applications (not including yours) because it's dead easy and it
may be "fast enough."
 
Joel VanderWerf

Bob said:
Hi Ara,



I'm already using that (version 0.5 -- I can't get to RAA right now for
some reason and there are no files on Rubyforge for fsdb so I don't know
if there is a more recent version). In version 0.5 the transactions were
not sufficient I think (it would be nice if I was wrong).

I apologize for the lack of fsdb stuff on the RubyForge page. I've never
found a comfortable way to automate gem uploads, so I still use my old
scripts for building this page:

http://redshift.sourceforge.net/

A quick scan there shows that fsdb-0.5 is the latest.

It's on my list to figure out how to automate the RubyForge dance (and I
think Ara or someone did this a while ago, so maybe it's a solved
problem now).

Now, on to your question...

Transactions in fsdb can be nested (and the transactions can be on
different dbs)--is that sufficient? This may not be what you mean by
"single transaction", though.

But, like PStore, FSDB isn't very fast--it's pure Ruby, it marshals
objects (by default--but you _can_ tell fsdb to write the value strings
directly rather than via marshal), and it pays the cost of thread- and
process-safety. One advantage over PStore is finer granularity.

Anyway, I hope your search is fruitful, whether it lands at bdb or
sqlite or ...
 
Francis Cianfrocca

I apologize for the lack of fsdb stuff on the RubyForge page. I've never
found a comfortable way to automate gem uploads, so I still use my old
scripts for building this page:

Austin Ziegler has automated the "Rubyforge dance"- look at
PDF::Writer or post to him on this list.
 
