Memory-mapped persistent hash?

Bill Kelly

Hi,

Essentially I'm looking for something that acts like a ruby Hash,
is nearly as fast as a ruby Hash, but is persistent. It should
also be able to remember gigabytes of information on disk, while
not using a ton of resident RAM in the process.

Also, when I restart the ruby process, the overhead when
initially "connecting to or opening" this persistent hash should
be very minimal before I can start using it.

Does such a beast exist?

From what I've read, it seems like the Berkeley DB might fit the
bill; although I'm a little worried I couldn't find the cost of
a commercial license listed anywhere on the website.

Are there other options besides BDB?


Thanks,

Regards,

Bill
 
Sven Suska

Bill said:
Essentially I'm looking for something that acts like a ruby Hash,
is nearly as fast as a ruby Hash, but is persistent. It should
also be able to remember gigabytes of information on disk, while
not using a ton of resident RAM in the process.

Also, when I restart the ruby process, the overhead when
initially "connecting to or opening" this persistent hash should
be very minimal before I can start using it.


Yes, I would love such a thing, too.

Sven
 
Farrel Lifson

Hi,

Essentially I'm looking for something that acts like a ruby Hash,
is nearly as fast as a ruby Hash, but is persistent. It should
also be able to remember gigabytes of information on disk, while
not using a ton of resident RAM in the process.

Also, when I restart the ruby process, the overhead when
initially "connecting to or opening" this persistent hash should
be very minimal before I can start using it.

Does such a beast exist?

From what I've read, it seems like the Berkeley DB might fit the
bill; although I'm a little worried I couldn't find the cost of
a commercial license listed anywhere on the website.

Are there other options besides BDB?


Thanks,

Regards,

Bill

SQLite?

Farrel
 
Brian Candler

From what I've read, it seems like the Berkeley DB might fit the
bill; although I'm a little worried I couldn't find the cost of
a commercial license listed anywhere on the website.

If you need it for a software product you will sell, then ask them - they'll
happily discuss pricing. BDB is used in many embedded applications, so I
don't expect it would be prohibitively expensive.

Sure, BDB is now owned by Oracle - but then again, even a full Oracle
install is free (for a single process with up to 1GB of RAM and 4GB of
tablespace).

The suggestion someone else made of sqlite3 is a good one, and you can
always put an OR mapping like ActiveRecord in front of it, but clearly
there's an overhead of converting your DB requests into SQL and back again.
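
Untested sketch: a Hash-like layer over a single key/value table is only a
few lines with the sqlite3 gem (the kv table and Marshal'ing the values are
just my guesses at what you'd want; check the gem docs for the exact
bind-variable forms):

require 'sqlite3'

# Hash-like key/value store backed by one SQLite table.
class SQLiteHash
  def initialize(path)
    @db = SQLite3::Database.new(path)
    @db.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v BLOB)")
  end

  def [](key)
    row = @db.get_first_row("SELECT v FROM kv WHERE k = ?", key.to_s)
    row && Marshal.load(row[0])
  end

  def []=(key, value)
    blob = SQLite3::Blob.new(Marshal.dump(value))
    @db.execute("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)",
                [key.to_s, blob])
    value
  end
end

h = SQLiteHash.new("kv.db")
h["answer"] = { "n" => 42 }
p h["answer"]    # => {"n"=>42}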

Regards,

Brian.
 
Bill Kelly

From: "Farrel Lifson said:

Thanks; I'll explain more below...


From: "Brian Candler said:
If you need it for a software product you will sell, then ask them - they'll
happily discuss pricing. BDB is used in many embedded applications, so I
don't expect it would be prohibitively expensive.

It can't hurt to ask... I saw a (rather large) figure reported on
the web, which was way out of my price range; but it may not have
been accurate or current.

The suggestion someone else made of sqlite3 is a good one, and you can
always put an OR mapping like ActiveRecord in front of it, but clearly
there's an overhead of converting your DB requests into SQL and back again.

Indeed; actually I'm already using sqlite3 for relational data.
I like it a lot.

It just seemed to me that using memory-mapped I/O, it might be
possible to get maybe an order of magnitude faster performance
than sqlite3 for raw key/value style fetches and stores, as long
as lots of the relevant pages were being cached by the VMM.

Certainly if the VMM were able to cache *all* the pages representing
the hash database file, I'd expect performance could approach an
in-memory hash--which appears to be around two orders of magnitude
faster than sqlite3. Although, once the hash database file
significantly exceeds RAM size, one would expect performance to
degrade to I/O speeds given totally random access patterns; however
if just a subset of the total keys were being read repeatedly, I'd
expect performance to stay high. (The latter being a case I'm
interested in.)

So I figured there might conceivably already be several existing
implementations of such a thing. (But had trouble finding them
with web searches, if they exist--apart from BDB.)


Regards,

Bill
 
Brian Candler

It just seemed to me that using memory-mapped I/O, it might be
possible to get maybe an order of magnitude faster performance
than sqlite3 for raw key/value style fetches and stores, as long
as lots of the relevant pages were being cached by the VMM.

Certainly if the VMM were able to cache *all* the pages representing
the hash database file, I'd expect performance could approach an
in-memory hash--which appears to be around two orders of magnitude
faster than sqlite3. Although, once the hash database file
significantly exceeds RAM size, one would expect performance to
degrade to I/O speeds given totally random access patterns; however
if just a subset of the total keys were being read repeatedly, I'd
expect performance to stay high. (The latter being a case I'm
interested in.)

So I figured there might conceivably already be several existing
implementations of such a thing. (But had trouble finding them
with web searches, if they exist--apart from BDB.)

Well, there's ruby-mmap. Looking at the docs, this effectively gives you a
very large string, which is not what you're looking for, but maybe you could
layer something on top of this.

If your keys are small but your objects are large, you could keep all the
keys in RAM (e.g. using something like Madeleine) and just store pointers to
the data. Then you only(*) have to worry about fragmentation and garbage
collection within your mmap'd file.

Or you could just keep keys in RAM and use separate disk files for each
object.
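
A back-of-the-envelope sketch of the keys-in-RAM-plus-pointers idea, using a
plain append-only file rather than an mmap, and punting entirely on
fragmentation, garbage collection, and on persisting the index itself
(Madeleine, or Marshal'ing the Hash at shutdown, would have to cover that):

# In-memory index of key => [offset, length]; values appended to a data file.
class OffsetStore
  def initialize(path)
    @file = File.open(path, File::CREAT | File::RDWR)
    @file.binmode
    @index = {}                       # keys stay in RAM
  end

  def []=(key, value)
    data = Marshal.dump(value)
    @file.seek(0, IO::SEEK_END)
    offset = @file.pos
    @file.write(data)
    @index[key] = [offset, data.size]
    value
  end

  def [](key)
    offset, length = @index[key]
    return nil unless offset
    @file.seek(offset)
    Marshal.load(@file.read(length))
  end

  def delete(key)
    @index.delete(key)                # leaves a hole in the file
  end
end

store = OffsetStore.new("data.bin")
store["greeting"] = "hello"
p store["greeting"]   # => "hello"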

Regards,

Brian.

(*) :)
 
Joel VanderWerf

Brian Candler wrote:
...
Or you could just keep keys in RAM and use separate disk files for each
object.

That's sounding sorta like fsdb[1]. I'm not sure I can recommend it in
this case, though, since you pay for Marshal.dump every time you write
an object into the database, plus locking, plus IO. I'd expect
FSDB::Database::[]= will be orders of magnitude slower than Hash#[]= .

It will be interesting to hear any solution using mmap....

[1] http://redshift.sourceforge.net/fsdb/
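
For reference, basic usage is just path-shaped keys against a directory, with
each value Marshal'd into its own file (roughly):

require 'fsdb'

db = FSDB::Database.new('/tmp/my-fsdb')   # a directory of one file per key
db['users/bill'] = { :name => 'Bill' }    # Marshal.dump + locking + write
p db['users/bill']                        # read + Marshal.load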
 
Tim Pease

Brian Candler wrote:
...
Or you could just keep keys in RAM and use separate disk files for each
object.

That's sounding sorta like fsdb[1]. I'm not sure I can recommend it in
this case, though, since you pay for Marshal.dump every time you write
an object into the database, plus locking, plus IO. I'd expect
FSDB::Database::[]= will be orders of magnitude slower than Hash#[]= .

It will be interesting to hear any solution using mmap....

You could do a mmap solution. Modify Hash such that []= does a
Marshal.dump of your object, stores the object into the mmap, and then
that memory location is stored in the Hash instead of the object.

[] must also be modified to take the memory location, Marshal.load the
object from the mmap, and then return the object.

The hard part of this is doing the memory management of the mmap --
when an object is deleted from the hash, removing it from the mmap;
consolidating unused mmap regions; etc. All the standard MMU stuff
you normally don't have to deal with in Ruby.

It would be much easier to implement if all the objects being stored
in the Hash were guaranteed to be the same size. Then you would just
need a free/allocated array to keep track of what can go where in the
mmap.
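
A rough sketch of that fixed-size case, with an ordinary file and seek/write
standing in for the mmap region (and nothing here persists the key => slot
index between runs):

# Fixed-size records; an in-memory Hash maps keys to slot numbers and a
# free list recycles slots vacated by deletes.
class SlotStore
  SLOT = 4096   # arbitrary record size; values must marshal smaller than this

  def initialize(path)
    @file  = File.open(path, File::CREAT | File::RDWR)
    @file.binmode
    @slots = {}     # key => slot number
    @free  = []     # recycled slot numbers
    @next  = 0      # next never-used slot
  end

  def []=(key, value)
    data = Marshal.dump(value)
    raise ArgumentError, "value too large for slot" if data.size > SLOT
    slot = @slots[key] || @free.pop || (@next += 1) - 1
    @file.seek(slot * SLOT)
    @file.write(data.ljust(SLOT, "\0"))   # pad; Marshal.load ignores the tail
    @slots[key] = slot
    value
  end

  def [](key)
    slot = @slots[key] or return nil
    @file.seek(slot * SLOT)
    Marshal.load(@file.read(SLOT))
  end

  def delete(key)
    slot = @slots.delete(key)
    @free << slot if slot
  end
end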

Blessings,
TwP
 
Bill Kelly

From: "Tim Pease said:
You could do a mmap solution. Modify Hash such that []= does a
Marshal.dump of your object, stores the object into the mmap, and then
that memory location is stored in the Hash instead of the object.

[] must also be modified to take the memory location, Marshal.load the
object from the mmap, and then return the object.

The hard part of this is doing the memory management of the mmap --
when an object is deleted from the hash, removing it from the mmap;
consolidating unused mmap regions; etc. All the standard MMU stuff
you normally don't have to deal with in Ruby.

It would be much easier to implement if all the objects being stored
in the Hash were guaranteed to be the same size. Then you would just
need a free/allocated array to keep track of what can go where in the
mmap.

Agreed. But ironically, what gets me is that with a modern
VMM, this is exactly what is already going on with Ruby's hash
in memory. Except that the backing store is the system swap
file, and so, not persistent.

In principle, I just want to change the backing store to a
memory-mapped file instead. :)

I've wondered what would happen if one took a nice malloc
implementation, made it operate inside a heap that was
memory-mapped onto a file, and then took something like the
STL hash_map (or ruby's hash) and wired it to the malloc
allocating from the memory-mapped file.

Intuitively, it seems it would have no choice but to perform
fantastically, as long as the whole file could be mapped
into memory.

However, once the file size exceeded available memory, I
can imagine that it might degrade to sub-optimal performance.

Along these lines, I've also wondered if one could get a
ruby application to persist similarly (in principle!)
by wiring ruby's memory allocation functions to a malloc
that was allocating from a memory-mapped file. Of course
the tricky part would be dealing with all the objects
containing system resources that can't be persisted,
such as file handles, etc. etc. Probably a nightmare in
practical terms, unless the language and its libraries
were designed that way from the start...

Ah, well. In talking about this, it seems there are really
two scenarios for memory-mapped persistent hashes. One when
all the pages can fit in RAM; and the other when the filesize
would greatly exceed available RAM (and even worse, when
the filesize exceeds the maximum contiguous block that
can even be mapped into the process address space at all).

Hmm...


Regards,

Bill
 
ara.t.howard

Hi,

Essentially I'm looking for something that acts like a ruby Hash,
is nearly as fast as a ruby Hash, but is persistent. It should
also be able to remember gigabytes of information on disk, while
not using a ton of resident RAM in the process.

Also, when I restart the ruby process, the overhead when
initially "connecting to or opening" this persistent hash should
be very minimal before I can start using it.

Does such a beast exist?

From what I've read, it seems like the Berkeley DB might fit the
bill; although I'm a little worried I couldn't find the cost of
a commercial license listed anywhere on the website.

Are there other options besides BDB?


Thanks,

Regards,

Bill

cfp:~ > cat a.rb
require 'yaml'
require 'time'

### *nix AND windblows. built-in to ruby
DB =
  begin
    require 'sdbm'
    SDBM
  rescue LoadError
    require 'dbm'
    DBM
  end

db = DB.new 'db'

last_time = db['last_time']
this_time = Time.now.iso8601(2)

y 'last_time' => last_time, 'this_time' => this_time

db['last_time'] = this_time
cfp:~ > ruby a.rb
---
this_time: "2007-05-17T13:56:13.97-06:00"
last_time:
cfp:~ > ruby a.rb
---
this_time: "2007-05-17T13:56:15.59-06:00"
last_time: "2007-05-17T13:56:13.97-06:00"
cfp:~ > ruby a.rb
---
this_time: "2007-05-17T13:56:16.91-06:00"
last_time: "2007-05-17T13:56:15.59-06:00"


-a
 
Bill Kelly

From: "ara.t.howard said:
cfp:~ > cat a.rb
require 'yaml'
require 'time'

### *nix AND windblows. built-in to ruby
DB =
  begin
    require 'sdbm'
    SDBM
  rescue LoadError
    require 'dbm'
    DBM
  end

db = DB.new 'db'

Hi ara,

Thanks, I've never tried 'dbm'. I'll give it a shot.

I've tried 'sdbm' on both Perl and Ruby, and it was horribly
horribly bad. Even with about 20,000 keys, using small-sized
keys and values, the database would bloat up to several
gigabytes and start failing to store new keys. When using
larger sized values, it would fail more rapidly. And the
max-sized values that would work at all seemed to be around
a couple KBytes. (These issues occurred on both windows and
unix... however it seemed to have additional serious issues
on windows. This was a few years ago...)

Hopefully 'dbm' is better. I'll test it out tonight.


Thanks,

Bill
 
ara.t.howard

Hi ara,

Thanks, I've never tried 'dbm'. I'll give it a shot.

on linux it's basically just a tiny layer on bdb - but it gets built
by ruby automatically
I've tried 'sdbm' on both Perl and Ruby, and it was horribly
horribly bad. Even with about 20,000 keys, using small-sized
keys and values, the database would bloat up to several
gigabytes and start failing to store new keys. When using
larger sized values, it would fail more rapidly. And the
max-sized values that would work at all seemed to be around
a couple KBytes. (These issues occurred on both windows and
unix... however it seemed to have additional serious issues
on windows. This was a few years ago...)

Hopefully 'dbm' is better. I'll test it out tonight.

oh. it has to work!? picky!

at least it has cross-platform failure going for it!

;-)


-a
 
Bill Kelly

Thanks for the tip, ara!

In looking at ruby's dbm extension extconf.rb, I learned about
qdbm:

http://qdbm.sourceforge.net/

This appears to be an amazing library. I think its author,
Mikio Hirabayashi, is going to be my hero of 2007.

QDBM runs on *nix and windows, handles large databases,
is LGPL, has outstanding documentation, comes with
ruby bindings. And it's fast.
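
From the docs, the hash-database (Depot) API looks roughly like this --
untested on my end, so treat the class and flag names below as what I think
I read rather than gospel:

require 'depot'   # QDBM's hash-database binding (name as I understood it)

db = Depot.new('casket.qdbm', Depot::OWRITER | Depot::OCREAT)
db.put('last_time', Time.now.to_s)
puts db.get('last_time')
db.close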


Thanks,

Regards,

Bill
 
ara.t.howard

Thanks for the tip, ara!

In looking at ruby's dbm extension extconf.rb, I learned about
qdbm:

http://qdbm.sourceforge.net/

This appears to be an amazing library. I think its author,
Mikio Hirabayashi, is going to be my hero of 2007.

QDBM runs on *nix and windows, handles large databases,
is LGPL, has outstanding documentation, comes with
ruby bindings. And it's fast.

there are so dang many of the xxxDBM libs i'd forgotten about that
one, but i figured one of them would be the key to the realm. this
whole thing did not support my first rule of software development
which is that

"it's written. it's free. and it's the first link on google."

i guess this time it's more like

"it's written. it's free. and will require several iterations of
posting to the mailing list of a cutting edge language and source
reading."

at least the first two were true!

cheers.

-a
 
