Large disc-based "hash"?

Robert Feldt · Oct 5, 2004

Hi,

I need a simple way to have a potentially very large hash mapping
Strings to Ruby objects. It needs to scale to on the order of 1 million
entries with up to 1kb of data per entry (even though it might end up
with maybe 2-300 bytes per entry for starters). Access time is not
critical but low-to-medium mem usage is. Deletion of (and insert of new)
keys/values will be fairly frequent so should be supported.

I've been thinking of doing it with some simple files/dir structure and
a "front end" object which hides the details (of marshal/load of the
objects and mapping of keys to the marshaled data etc) but maybe there
is something simpler that might be of use?

SDBM does not seem to cut it; a simple test with insertion of random
strings grows non-linearly in the file size and thus do not scale. Could
sqlite or something similar handle this better?

However, I would rather go for something simple than having to bundle a
"full" db.

If it simplifies you can assume the keys and/or values are of fixed
sizes since that is very likely but I'm not sure it will really be
possible to keep that invariant so more flexbile solutions might be
needed. Solution ideas which support caching is nice but not required
(mostly since I can add that later if the need is there).

Any ideas, experience, code or pointers that can help me design/choose
this is of interest.

Thanks,

Robert

Richard Kilmer · Oct 5, 2004

Berkely DB is a good solution, and Guy Decoux has a great Ruby wrapper for
accessing it.

Bdb link:

http://moulon.inra.fr/ruby/bdbxml.html

Ruby library:

http://moulon.inra.fr/ruby/bdb.html

-rich

Richard Kilmer · Oct 5, 2004

Berkely DB is a good solution, and Guy Decoux has a great Ruby wrapper for
accessing it.

Bdb link:

Opps... http://www.sleepycat.com/

Ara.T.Howard · Oct 5, 2004

Hi,

I need a simple way to have a potentially very large hash mapping
Strings to Ruby objects. It needs to scale to on the order of 1 million
entries with up to 1kb of data per entry (even though it might end up
with maybe 2-300 bytes per entry for starters). Access time is not
critical but low-to-medium mem usage is. Deletion of (and insert of new)
keys/values will be fairly frequent so should be supported.

if 'fairly frequent' is once per week look at CDB too - it's blindingly fast
and updates (atomic replacements actually) are suprisingly fast. reads are
absolutely insanely fast.

bdb is also great.

sqlite with adequate indexing may also work - but probably not.

-a
--
===============================================================================
| EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
| PHONE :: 303.497.6469
| A flower falls, even though we love it;
| and a weed grows, even though we do not love it.
| --Dogen
===============================================================================

Robert Feldt · Oct 5, 2004

if 'fairly frequent' is once per week look at CDB too - it's
blindingly fast
and updates (atomic replacements actually) are suprisingly fast.
reads are
absolutely insanely fast.

fairly frequent is more like "1-5% of keys per hour" but thanks for the
pointer, very interesting. Seems it is unix only but I need windows also
so maybe not.

bdb is also great.

Yes, seems it can handle this load. I tried with a BTree and the
db-file-size-to-added-key+value-bytes ratio seems to stay at 2.5 at
least up to 1e5 entries. Running a test with 1e6 added entries now;
insertion time does not seem to scale linearly but we'll see about file
size.

Thanks to both Rich and you for the pointers.

Downside of going with something like bdb is ease of install/support on
multiple platforms (ie windows since everything works on linux...

)
but I guess bdb should be well supported also on Windows so maybe that's
what it'll be.

sqlite with adequate indexing may also work - but probably not.

Ok, thanks for that info.

/R

rcoder · Oct 5, 2004

If you guys are talking about the same CDB format Bernstein created,
then I don't know that it would be a good match for this kind of task.

CDB has one major weakness: you can make updates in memory quickly, but
persisting them to disk requires time proportional to the total dataset
size. Note that's dataset size, not number of keys -- CDB requires you
to copy the entire old DB to a new file to persist changes.

It *is* rediculously fast for lookups of mostly constant data (hence
the initial 'C'), but probably isn't a good match for any sort of
regularly-updated dataset.

I'd second the recommendation for BDB -- it's reasonably fast, very
scalable, and gets hammered on by all kinds of applications day in and
out.

Gavin Sinclair · Oct 5, 2004

OT: Can someone explain to me the difference between "Berkeley DB" and
("DBM", "GDBM", "SDBM")?

Gavin

rcoder · Oct 5, 2004

Berkeley DB is an embedded database system developed and supported by
Sleepycat Software (http://www.sleepycat.com/). It is designed for
applications which need transactional data storage, but don't need (or
want) the additional structure that relational databases provide.

It allows simple numeric or string-valued keys to be associated with a
chunk of data, much like a persistent version of Ruby's native hash
tables. However, unlike a hash, only a small amount of the data has to
be in RAM at once -- Berkeley DB storages can be scaled into the
many-GB range easily. Like most relational databases, it also supports
atomic transactions and data recovery after crashes, as well as
concurrent (more than one process or thread) access to a database.

DBM, GDBM, and SDBM are all database formats (and libraries to access
them) which, like Berkeley DB, allow simple key/value mappings to be
saved to disk, and updated incrementally. However, each has various
limitations as to data size per key, total database size, or
performance on large datasets which Berkeley DB does not. None of them
natively offer transaction support, or handle concurrent access to the
same database gracefully.

The advantage of the *DBM libraries is their simplicity; if your
application uses less than 100,000 total keys, and doesn't need to
store much data in each one, they may offer plenty of functionality,
without requiring you to link in a large, complex library.

More specifically to Ruby, it is worth noting that the *DBM libraries
all have bindings included in the standard Ruby 1.8 distribution, while
Berkeley DB requires you to install BDB, then compile and install the
bindings. This is fine for some applications, but for quick
"download-and-go" tools, it may be more dependencies than you want to
introduce.

Markus · Oct 6, 2004

Hi,

I need a simple way to have a potentially very large hash mapping
Strings to Ruby objects. It needs to scale to on the order of 1 million
entries with up to 1kb of data per entry (even though it might end up
with maybe 2-300 bytes per entry for starters). Access time is not
critical but low-to-medium mem usage is. Deletion of (and insert of new)
keys/values will be fairly frequent so should be supported.

I've been thinking of doing it with some simple files/dir structure and
a "front end" object which hides the details (of marshal/load of the
objects and mapping of keys to the marshaled data etc) but maybe there
is something simpler that might be of use?

Following the road less traveled a short way to see where it
leads... You can fairly easily make something that looks something like
a hash but has different internal structure:

class Hash_of_hashes
def initialize(pc=Hash,cc=Hash)
@parent_class,@child_class = uc,lc
@storage = @parent_class.new
@fanout = 100
end
def sub_level(k)
@storage[k.hash.modulo(@fanout)] ||= @child_class.new
end
def [](k)
sub_level(k)[k]
end
def []=(k,v)
sub_level(k)[k] = v
end
end

It also seems like it would be fairly easy to make something that
looked sort of like a hash but used a directory as its storage.

class Hash_on_disk
def initialize(path='.')
@path=path
File.mkdir(@path) unles File.directory? @path
@fanout = 100
end
def sub_key(k)
k.hash.modulo(@fanout)
end
def [](k)
File.open(@path+"/"+sub_key(k),'r') { |f| #read v from f }
end
def []=(k,v)
File.open(@path+"/"+sub_key(k),'w') { |f| #write v to f }
end
end

These are just sketches (you'd want to be able to delete keys/value
pairs, for example), but they are pretty darned simple. Thus it seems
like you should (in about 50 lines of ruby or so) be able to implement
something that looks like a hash but stores the data in a directory tree
that "terminates" in files based on some criterion. You might, for
example, want to store each value in its own file, or you might want to
store the bottom-most hash in a file (or you may want to decide
dynamically). (The reason for going with the tree structure is to avoid
killing your OS.)
The only gotcha's I can see are 1) intra-value references and 2)
hash-key degeneracy. Both should be reasonably easy to deal with if you
are on the lookout for them.

-- Markus

Ara.T.Howard · Oct 6, 2004

Following the road less traveled a short way to see where it
leads... You can fairly easily make something that looks something like
a hash but has different internal structure:

class Hash_of_hashes
def initialize(pc=Hash,cc=Hash)
@parent_class,@child_class = uc,lc
@storage = @parent_class.new
@fanout = 100
end
def sub_level(k)
@storage[k.hash.modulo(@fanout)] ||= @child_class.new
end
def [](k)
sub_level(k)[k]
end
def []=(k,v)
sub_level(k)[k] = v
end
end

It also seems like it would be fairly easy to make something that
looked sort of like a hash but used a directory as its storage.

class Hash_on_disk
def initialize(path='.')
@path=path
File.mkdir(@path) unles File.directory? @path
@fanout = 100
end
def sub_key(k)
k.hash.modulo(@fanout)
end
def [](k)
File.open(@path+"/"+sub_key(k),'r') { |f| #read v from f }
end
def []=(k,v)
File.open(@path+"/"+sub_key(k),'w') { |f| #write v to f }
end
end

These are just sketches (you'd want to be able to delete keys/value
pairs, for example), but they are pretty darned simple. Thus it seems
like you should (in about 50 lines of ruby or so) be able to implement
something that looks like a hash but stores the data in a directory tree
that "terminates" in files based on some criterion. You might, for
example, want to store each value in its own file, or you might want to
store the bottom-most hash in a file (or you may want to decide
dynamically). (The reason for going with the tree structure is to avoid
killing your OS.)
The only gotcha's I can see are 1) intra-value references and 2)
hash-key degeneracy. Both should be reasonably easy to deal with if you
are on the lookout for them.

check out fsdb on the raa - it might actually be suitable for OP's post too.

-a
--
===============================================================================
| EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
| PHONE :: 303.497.6469
| A flower falls, even though we love it;
| and a weed grows, even though we do not love it.
| --Dogen
===============================================================================

Joe Cheng · Oct 6, 2004

Robert said:
Downside of going with something like bdb is ease of install/support on
multiple platforms (ie windows since everything works on linux... )
but I guess bdb should be well supported also on Windows so maybe that's
what it'll be.

If I'm not mistaken, bdb is also not free if you want to redistribute it
with a non-open-source application. (I get the impression the licensing
fees are pretty steep--maybe on the order of tens of thousands of USD?)

http://www.sleepycat.com/download/licensinginfo.shtml

Mauricio Fernández · Oct 6, 2004

DBM, GDBM, and SDBM are all database formats (and libraries to access
them) which, like Berkeley DB, allow simple key/value mappings to be
saved to disk, and updated incrementally. However, each has various
limitations as to data size per key, total database size, or
performance on large datasets which Berkeley DB does not. None of them

dbm, sdbm and ndbm have rather low limits for the data size IIRC.
gdbm had no such restrictions.

natively offer transaction support, or handle concurrent access to the
same database gracefully.

gdbm (at least) allows concurrent reader access to the db; however
when you have a writer process no other simultaneous access is possible.
In practice, gdbm is often adequate if you don't need transactions and
your db is not too large. I've heard it is often faster than bdb due to
the lack of transactions.

Seeking co-founders for my company.	3	Sep 8, 2024
hash table - simple implementation	16	Oct 17, 2012
Time based loop limiting	7	Sep 8, 2009
update variables based on own value - critique please?	5	May 1, 2011
whether large hash might leak?	5	Jul 26, 2007
Hash 256 function	0	Jun 14, 2014
Optimize write of large file	6	May 12, 2011
Persistent Hash	7	May 9, 2008

Large disc-based "hash"?

Robert Feldt

Richard Kilmer

Richard Kilmer

Ara.T.Howard

Robert Feldt

rcoder

Gavin Sinclair

rcoder

Markus

Ara.T.Howard

Joe Cheng

Mauricio Fernández

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads