key/value store optimized for disk storage


Steve Howell

This is slightly off topic, but I'm hoping folks can point me in the
right direction.

I'm looking for a fairly lightweight key/value store that works for
this type of problem:

- ideally plays nice with the Python ecosystem
- the data set is static, and written infrequently enough that I
  definitely want *read* performance to trump all
- there is too much data to keep it all in memory (so no memcache)
- users will access keys with fairly uniform, random probability
- the key/value pairs are fairly homogeneous in nature:
  - keys are <= 16 chars
  - values are between 1k and 4k bytes generally
  - approx 3 million key/value pairs
  - total amount of data == 6Gb
- needs to work on relatively recent versions of FreeBSD and Linux

My current solution works like this:

- keys are file paths
- directories are 2 levels deep (30 dirs w/100k files each)
- values are file contents
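
In rough terms the layout looks something like this (an illustrative
sketch only; the hashing and splitting scheme here is made up, not my
actual code):

import hashlib
import os

def path_for_key(key, root='/data/store'):
    # Illustrative: hash the key and use hex prefixes to pick the two
    # directory levels, so each leaf directory holds a bounded number of files.
    digest = hashlib.md5(key).hexdigest()
    return os.path.join(root, digest[:2], digest[2:4], key)

def read_value(key):
    with open(path_for_key(key), 'rb') as f:
        return f.read()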

The current solution isn't horrible, but I'm trying to squeeze a little
performance/robustness out of it. A minor nuisance is that I waste a
fair amount of disk space, since the values are generally less than 4k
in size. A larger concern is that I'm not convinced that file systems
are optimized for dealing with lots of little files in a shallow
directory structure.

To deal with the latter issue, a minor refinement would be to deepen
the directory structure, but I want to do due diligence on other
options first.

I'm looking for something a little lighter than a full-on database
(either SQL or no-SQL), although I'm not completely ruling out any
alternatives yet.

As I mention up top, I'm mostly hoping folks can point me toward
sources they trust, whether it be other mailing lists, good tools,
etc. To the extent that this is on topic and folks don't mind
discussing this here, I'm happy to follow up on any questions.

Thanks,

Steve

P.S. I've already found some good information via Google, but there's
a lot of noise out there.
 

Paul Rubin

Steve Howell said:
keys are file paths
directories are 2 levels deep (30 dirs w/100k files each)
values are file contents
The current solution isn't horrible,

Yes it is ;-)
As I mention up top, I'm mostly hoping folks can point me toward
sources they trust, whether it be other mailing lists, good tools,

cdb sounds reasonable for your purposes. I'm sure there are python
bindings for it.

http://cr.yp.to/cdb.html mentions a 4gb limit (2**32) but I
half-remember something about a 64 bit version.
 

Steve Howell

Yes it is ;-)

cdb sounds reasonable for your purposes.  I'm sure there are python
bindings for it.

http://cr.yp.to/cdb.html mentions a 4gb limit (2**32) but I
half-remember something about a 64 bit version.

Thanks. That's definitely in the spirit of what I'm looking for,
although the non-64 bit version is obviously geared toward a slightly
smaller data set. My reading of cdb is that it has essentially 64k
hash buckets, so for 3 million keys, you're still scanning through an
average of 45 records per read, which is about 90k of data for my
record size. That seems actually inferior to a btree-based file
system, unless I'm missing something.
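
For reference, the rough arithmetic behind that estimate (assuming my
records average about 2k each):

keys = 3 * 10**6
buckets = 2 ** 16                    # ~64k hash buckets
per_read = keys / float(buckets)     # ~45.8 records scanned per lookup
print(per_read * 2048 / 1024.0)      # roughly 90k of data per lookup at ~2k per record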

I did find this as follow up to your lead:

http://thomas.mangin.com/data/source/cdb.py

Unfortunately, it looks like you have to first build the whole thing
in memory.
 

Paul Rubin

Steve Howell said:
Thanks. That's definitely in the spirit of what I'm looking for,
although the non-64 bit version is obviously geared toward a slightly
smaller data set. My reading of cdb is that it has essentially 64k
hash buckets, so for 3 million keys, you're still scanning through an
average of 45 records per read, which is about 90k of data for my
record size. That seems actually inferior to a btree-based file
system, unless I'm missing something.

1) presumably you can use more buckets in a 64 bit version; 2) scanning
90k probably still takes far less time than a disk seek, even a "seek"
(several microseconds in practice) with a solid state disk.
http://thomas.mangin.com/data/source/cdb.py
Unfortunately, it looks like you have to first build the whole thing
in memory.

It's probably fixable, but I'd guess you could just use Bernstein's
cdbdump program instead.

Alternatively maybe you could use one of the *dbm libraries,
which burn a little more disk space, but support online update.
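
For example, with the stdlib's anydbm wrapper (an untested sketch;
anydbm just delegates to whichever dbm implementation is available, and
the filename is a placeholder):

import anydbm

# Keys and values must be byte strings; 'c' creates the file if needed.
db = anydbm.open('values.db', 'c')
db['some-key'] = 'some value bytes'
print(db['some-key'])
db.close()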
 

Steve Howell

1) presumably you can use more buckets in a 64 bit version; 2) scanning
90k probably still takes far less time than a disk seek, even a "seek"
(several microseconds in practice) with a solid state disk.

Doesn't cdb do at least one disk seek as well? In the diagram on this
page, it seems you would need to do a seek based on the value of the
initial pointer (from the 256 possible values):

http://cr.yp.to/cdb/cdb.txt
It's probably fixable, but I'd guess you could just use Bernstein's
cdbdump program instead.

Alternatively maybe you could use one of the *dbm libraries,
which burn a little more disk space, but support online update.

Yup, I don't think I want to incur the extra overhead. Do you have
any first hand experience pushing dbm to the scale of 6Gb or so? My
take on dbm is that its niche is more in the 10,000-record range.
 

Paul Rubin

Steve Howell said:
Doesn't cdb do at least one disk seek as well? In the diagram on this
page, it seems you would need to do a seek based on the value of the
initial pointer (from the 256 possible values):

Yes, of course it has to seek if there is too much data to fit in
memory. All I'm saying is that if you're spending milliseconds on the
seek, that may dominate the lookup time even if you scan the 90k.

Actually, looking at the spec more closely, there are 256 hash tables in
the file, but each table can be of any size. So there can be far more
than 2**16 hash slots. Uncharacteristically for Bernstein, the document
is pretty unclear, so maybe you have to look at the code to be sure of
the layout. Note also that empty hash buckets have just 2 words of
overhead. So with 3M keys and 75% load factor, you get 4M buckets and
relatively few collisions. The extra 1M buckets in a 64 bit
implementation is just 16MB in the file, which isn't much at all even
considering that you want it to stay resident in memory to avoid some
seeks, assuming you're on a PC and not some smaller device like a phone.
(A phone will have a solid state disk eliminating most seek time, so
you're ok in that situation too).
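
Back-of-the-envelope, assuming 16 bytes per slot (an 8-byte hash plus an
8-byte pointer in a 64-bit layout):

keys = 3 * 10**6
slots = int(keys / 0.75)         # ~4M slots at a 75% load factor
extra_slots = slots - keys       # the "extra" ~1M slots
print(extra_slots * 16 / 1e6)    # ~16 MB of additional table space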
Yup, I don't think I want to incur the extra overhead. Do you have
any first hand experience pushing dbm to the scale of 6Gb or so? My
take on dbm is that its niche is more in the 10,000-record range.

There are a bunch of different variants. I'm trying to remember what
I've personally done with it and I'm sure I've used much more than 10k
records, but maybe not millions. Unix dbm was originally designed to
handle millions of records back when that was a lot. I'd expect gdbm,
bsddb and so forth can handle it easily. The naive Python dbm module
might not be able to.

The amount of data you're talking about (a few million keys, a few gb of
data) is fairly modest by today's standards, so I would think fancy
methods aren't really needed.
 

Paul Rubin

Paul Rubin said:
looking at the spec more closely, there are 256 hash tables.. ...

You know, there is a much simpler way to do this, if you can afford to
use a few hundred MB of memory and you don't mind some load time when
the program first starts. Just dump all the data sequentially into a
file. Then scan through the file, building up a Python dictionary
mapping data keys to byte offsets in the file (this is a few hundred MB
if you have 3M keys). Then dump the dictionary as a Python pickle and
read it back in when you start the program.

You may want to turn off the cyclic garbage collector when building or
loading the dictionary, as it can badly slow down the construction of
big lists and maybe dicts (I'm not sure of the latter).
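
Here is a rough, untested sketch of what I mean; the on-disk layout and
file names are only placeholders:

import pickle

def build_store(records, data_path='values.dat', index_path='index.pickle'):
    # records yields (key, value) pairs; values are byte strings.
    index = {}
    with open(data_path, 'wb') as data:
        for key, value in records:
            index[key] = (data.tell(), len(value))   # remember where each value lives
            data.write(value)
    with open(index_path, 'wb') as f:
        pickle.dump(index, f, pickle.HIGHEST_PROTOCOL)

def lookup(key, index, data_file):
    # index is the unpickled dictionary; data_file is the open values file.
    offset, length = index[key]
    data_file.seek(offset)
    return data_file.read(length)

At startup you unpickle the index once and keep the data file open; each
lookup is then a single seek plus a single read.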
 

Steve Howell

You know, there is a much simpler way to do this, if you can afford to
use a few hundred MB of memory and you don't mind some load time when
the program first starts.  Just dump all the data sequentially into a
file.  Then scan through the file, building up a Python dictionary
mapping data keys to byte offsets in the file (this is a few hundred MB
if you have 3M keys).  Then dump the dictionary as a Python pickle and
read it back in when you start the program.

You may want to turn off the cyclic garbage collector when building or
loading the dictionary, as it can badly slow down the construction of
big lists and maybe dicts (I'm not sure of the latter).

I'm starting to lean toward the file-offset/seek approach. I am
writing some benchmarks on it, comparing it to a more file-system
based approach like I mentioned in my original post. I'll report back
when I get results, but it's already way past my bedtime for tonight.

Thanks for all your help and suggestions.
 

Miki Tebeka

I'm looking for a fairly lightweight key/value store that works for
this type of problem:
I'd start with a benchmark and try some of the things that are already in the standard library:
- bsddb
- sqlite3 (table of key, value, index key)
- shelve (though I doubt this one)

You might find that for a little effort you get enough out of one of these.

Another module, which is not in the standard library, is hdf5/PyTables; in my experience it is very fast.

HTH,
 

Steve Howell

I'm starting to lean toward the file-offset/seek approach.  I am
writing some benchmarks on it, comparing it to a more file-system
based approach like I mentioned in my original post.  I'll report back
when I get results, but it's already way past my bedtime for tonight.

Thanks for all your help and suggestions.


I ended up going with the approach that Paul suggested (except I used
JSON instead of pickle for persisting the hash). I like it for its
simplicity and ease of troubleshooting.

My test was to write roughly 4GB of data, with 2 million keys of 2k
bytes each.

The nicest thing was how quickly I was able to write the file.
Writing tons of small files bogs down the file system, whereas the
one-big-file approach finishes in under three minutes.

Here's the code I used for testing:

https://github.com/showell/KeyValue/blob/master/test_key_value.py

Here are the results:

~/WORKSPACE/KeyValue > ls -l values.txt hash.txt
-rw-r--r-- 1 steve staff 44334161 May 3 18:53 hash.txt
-rw-r--r-- 1 steve staff 4006000000 May 3 18:53 values.txt

2000000 out of 2000000 records yielded (2k each)
Begin READING test
num trials 100000
time spent 39.8048191071
avg delay 0.000398048191071

real 2m46.887s
user 1m35.232s
sys 0m19.723s
 

Paul Rubin

Steve Howell said:
My test was to write roughly 4GB of data, with 2 million keys of 2k
bytes each.

If the records are something like english text, you can compress
them with zlib and get some compression gain by pre-initializing
a zlib dictionary from a fixed english corpus, then cloning it.
That is, if your messages are a couple paragraphs, you might
say something like:

  iv = (some fixed 20k or so of records concatenated together)
  compressor = zlib(iv).clone()  # I forget what this
                                 # operation is actually called

  # I forget what this is called too, but the idea is you throw
  # away the output of compressing the fixed text, and sync
  # to a byte boundary
  compressor.sync()

  zout = compressor.compress(your_record).sync()
  ...

i.e. the part you save in the file is just the difference
between compress(corpus) and compress(corpus_record). To
decompress, you initialize a compressor the same way, etc.

It's been a while since I used that trick but for json records of a few
hundred bytes, I remember getting around 2:1 compression, while starting
with an unprepared compressor gave almost no compression.
 

Steve Howell

If the records are something like english text, you can compress
them with zlib and get some compression gain by pre-initializing
a zlib dictionary from a fixed english corpus, then cloning it.
That is, if your messages are a couple paragraphs, you might
say something like:

  iv = (some fixed 20k or so of records concatenated together)
  compressor = zlib(iv).clone()  # I forget what this
                                 # operation is actually called

  # I forget what this is called too, but the idea is you throw
  # away the output of compressing the fixed text, and sync
  # to a byte boundary
  compressor.sync()

  zout = compressor.compress(your_record).sync()
  ...

i.e. the part you save in the file is just the difference
between compress(corpus) and compress(corpus_record).  To
decompress, you initialize a compressor the same way, etc.

It's been a while since I used that trick but for json records of a few
hundred bytes, I remember getting around 2:1 compression, while starting
with an unprepared compressor gave almost no compression.

Sounds like a useful technique. The text snippets that I'm
compressing are indeed mostly English words, and 7-bit ascii, so it
would be practical to use a compression library that just uses the
same good-enough encodings every time, so that you don't have to write
the encoding dictionary as part of every small payload.

Sort of as you suggest, you could build a Huffman encoding for a
representative run of data, save that tree off somewhere, and then use
it for all your future encoding/decoding.

Is there a name to describe this technique?
 

Paul Rubin

Steve Howell said:
Sounds like a useful technique. The text snippets that I'm
compressing are indeed mostly English words, and 7-bit ascii, so it
would be practical to use a compression library that just uses the
same good-enough encodings every time, so that you don't have to write
the encoding dictionary as part of every small payload.

Zlib stays adaptive; the idea is just to start with some ready-made
compression state that reflects the statistics of your data.
Sort of as you suggest, you could build a Huffman encoding for a
representative run of data, save that tree off somewhere, and then use
it for all your future encoding/decoding.

Zlib is better than Huffman in my experience, and Python's zlib module
already has the right entry points. Looking at the docs,
Compress.flush(Z_SYNC_FLUSH) is the important one. I did something like
this before and it was around 20 lines of code. I don't have it around
any more but maybe I can write something else like it sometime.
Is there a name to describe this technique?

Incremental compression maybe?
 

Steve Howell

Zlib stays adaptive; the idea is just to start with some ready-made
compression state that reflects the statistics of your data.


Zlib is better than Huffman in my experience, and Python's zlib module
already has the right entry points.  Looking at the docs,
Compress.flush(Z_SYNC_FLUSH) is the important one.  I did something like
this before and it was around 20 lines of code.  I don't have it around
any more but maybe I can write something else like it sometime.


Incremental compression maybe?

Many thanks, this is getting me on the right path:

compressor = zlib.compressobj()
s = compressor.compress("foobar")
s += compressor.flush(zlib.Z_SYNC_FLUSH)

s_start = s
compressor2 = compressor.copy()

s += compressor.compress("baz")
s += compressor.flush(zlib.Z_FINISH)
print zlib.decompress(s)

s = s_start
s += compressor2.compress("spam")
s += compressor2.flush(zlib.Z_FINISH)
print zlib.decompress(s)
 

Paul Rubin

Steve Howell said:
compressor = zlib.compressobj()
s = compressor.compress("foobar")
s += compressor.flush(zlib.Z_SYNC_FLUSH)

s_start = s
compressor2 = compressor.copy()

I think you also want to make a decompressor here, and initialize it
with s and then clone it. Then you don't have to reinitialize every
time you want to decompress something.
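
Untested, but roughly like this, reusing the shape of your example (the
variable names are just for illustration):

import zlib

c = zlib.compressobj()
s_start = c.compress("foobar") + c.flush(zlib.Z_SYNC_FLUSH)   # shared prefix
c2 = c.copy()
tail = c2.compress("spam") + c2.flush(zlib.Z_FINISH)          # record-specific bytes

base = zlib.decompressobj()
base.decompress(s_start)      # prime with the prefix output; discard "foobar"
d = base.copy()               # clone this per record
print(d.decompress(tail))     # -> "spam", without re-feeding the prefix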

I also seem to remember that the first few bytes of compressed output
are always some fixed string or checksum, that you can strip out after
compression and put back before decompression, giving further savings in
output size when you have millions of records.
 

Steve Howell

I think you also want to make a decompressor here, and initialize it
with s and then clone it.  Then you don't have to reinitialize every
time you want to decompress something.

Makes sense. I believe I got that part correct:

https://github.com/showell/KeyValue/blob/master/salted_compressor.py
I also seem to remember that the first few bytes of compressed output
are always some fixed string or checksum, that you can strip out after
compression and put back before decompression, giving further savings in
output size when you have millions of records.

I'm pretty sure this happens for free as long as the salt is large
enough, but maybe I'm misunderstanding.
 

Paul Rubin

Steve Howell said:

The API looks nice, but your compress method makes no sense. Why do you
include s.prefix in s and then strip it off? Why do you save the prefix
and salt in the instance, and have self.salt2 and s[len(self.salt):]
in the decompress? You should be able to just get the incremental bit.
I'm pretty sure this happens for free as long as the salt is large
enough, but maybe I'm misunderstanding.

No I mean there is some fixed overhead (a few bytes) in the compressor
output, to identify it as such. That's fine when the input and output
are both large, but when there's a huge number of small compressed
strings, it adds up.
 

Steve Howell

Steve Howell said:
Makes sense.  I believe I got that part correct:

The API looks nice, but your compress method makes no sense.  Why do you
include s.prefix in s and then strip it off?  Why do you save the prefix
and salt in the instance, and have self.salt2 and s[len(self.salt):]
in the decompress?  You should be able to just get the incremental bit.

This is fixed now.

https://github.com/showell/KeyValue...ab4f3ca73fcaf4ec0e7f22b4#salted_compressor.py

No I mean there is some fixed overhead (a few bytes) in the compressor
output, to identify it as such.  That's fine when the input and output
are both large, but when there's a huge number of small compressed
strings, it adds up.

If it's in the header, wouldn't it be part of the output that comes
before Z_SYNC_FLUSH?
 

Paul Rubin

Steve Howell said:
This is fixed now.
Nice.

If it's in the header, wouldn't it be part of the output that comes
before Z_SYNC_FLUSH?

Hmm, maybe you are right. My version was several years ago and I don't
remember it well, but I half-remember spending some time diddling around
with this issue.
 

Steve Howell

I'd start with a benchmark and try some of the things that are already in the standard library:
- bsddb
- sqlite3 (table of key, value, index key)
- shelve (though I doubt this one)

Thanks. I think I'm ruling out bsddb, since it's recently deprecated:

http://www.gossamer-threads.com/lists/python/python/106494

I'll give sqlite3 a spin. Has anybody out there wrapped sqlite3
behind a hash interface already? I know it's simple to do
conceptually, but there are some minor details to work out for large
amounts of data (like creating the index after all the inserts), so if
somebody's already tackled this, it would be useful to see their
code.
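
Something along these lines is what I have in mind (untested sketch; the
table and column names are made up):

import sqlite3

class SqliteKV(object):
    """Minimal dict-style wrapper around a single two-column table."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute('CREATE TABLE IF NOT EXISTS kv (k TEXT, v BLOB)')

    def load(self, pairs):
        # Bulk-load first, then build the index, per the detail above.
        # (Under Python 2, wrap binary values in buffer() to store real BLOBs.)
        self.conn.executemany('INSERT INTO kv (k, v) VALUES (?, ?)', pairs)
        self.conn.execute('CREATE UNIQUE INDEX IF NOT EXISTS kv_k ON kv (k)')
        self.conn.commit()

    def __getitem__(self, key):
        row = self.conn.execute('SELECT v FROM kv WHERE k = ?', (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]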
You might find that for a little effort you get enough out of one of these.

Another module, which is not in the standard library, is hdf5/PyTables; in my experience it is very fast.

Thanks.
 
