Ping Jim Janney - sizing your snippet store

Tom Anderson · Apr 2, 2010

Jim,

Since i spouted off on cljp about how key-value stores were better than
RDBMs, i thought i ought to put my money where my mouth is. I'm going to
put together a simple demo/benchmark for storing key-value data,
implemented on top of Tokyo Cabinet and JDBC, which will hopefully
demonstrate that TC is easier to use and faster than a database (i can try
Derby, H2 and a leading commercial database whose license prohibits the
posting of performance comparisons, and so which will have to go nameless,
lest the bearded, megalomaniacal CEO of its manufacturer get upset).

So, it would be good to know what sort of problem i'm actually trying to
show that KV stores are good for. Jim, you said:

I need to maintain a data base of small text snippets keyed by arbitrary
strings, without the overhead of a full SQL relational database. We
will have several people putting data into it so it needs to support
concurrent access over a network.

Could i trouble you to expand a little on sizing? In particular:

- How many entries are there?
- How big are the keys? What sort of things are they?
- How big are the values?
- What's the workload mix? Read vs write vs delete?
- Are writes mostly of new entries, or overwrites of existing ones?
- How skewed is the access towards a few hot entries?
- How many users are using it in parallel?
- What is the request rate like?
- How much RAM can be used? How much disk space?

Answers to any of those, no matter how rough, would be useful.

I'll post to cljp if/when i get this done.

tom

Jim Janney · Apr 3, 2010

Tom Anderson said:
Jim,

Since i spouted off on cljp about how key-value stores were better
than RDBMs, i thought i ought to put my money where my mouth is. I'm
going to put together a simple demo/benchmark for storing key-value
data, implemented on top of Tokyo Cabinet and JDBC, which will
hopefully demonstrate that TC is easier to use and faster than a
database (i can try Derby, H2 and a leading commercial database whose
license prohibits the posting of performance comparisons, and so which
will have to go nameless, lest the bearded, megalomaniacal CEO of its
manufacturer get upset).

So, it would be good to know what sort of problem i'm actually trying
to show that KV stores are good for. Jim, you said:

Could i trouble you to expand a little on sizing? In particular:

- How many entries are there?
- How big are the keys? What sort of things are they?
- How big are the values?
- What's the workload mix? Read vs write vs delete?
- Are writes mostly of new entries, or overwrites of existing ones?
- How skewed is the access towards a few hot entries?
- How many users are using it in parallel?
- What is the request rate like?
- How much RAM can be used? How much disk space?

Answers to any of those, no matter how rough, would be useful.

I'll post to cljp if/when i get this done.

tom

We want our customers to be able to add their own annotations to
fields on our application's screens, sort of like sticky notes. (This
is why I was fooling with tooltips earlier). Which means that every
time the app displays a screen it needs to check for an annotation on
pretty much every field on the screen. Most of the time there won't
be any, or only very few, but it has to check anyway, and we have some
very complicated screens with nested tabbed panes. Many fields appear
in more than one screen. For fields tied directly to a column in a
database table the key is the table name plus the column name. For
derived fields we assign a more or less arbitrary key; these can't be
allowed to change or the sticky notes get mixed up.

The app is in Java, started through JNLP, communicates with an AS/400
through a LAN (each customer has its own). Some customers have
branches attached through slow connections.

So:
Number of entries probably < 1000 (but who knows what people will do?)
Keys are strings, 20 to 30 characters.
Values are strings, maybe 50 to 500 characters.
Frequent reads, occasional writes, very few deletes.
Maybe 5 to 20 users.
Every time someone displays a screen, 20 to 100 reads need to happen
in well under a second. Most of these will be misses.

As I said in another message, the real problem here is probably
network latency, and switching databases won't help that. I think
I'll have to abandon the brute force approach and try something more
clever.

Tom Anderson · Apr 3, 2010

So:
Number of entries probably < 1000 (but who knows what people will do?)
Keys are strings, 20 to 30 characters.
Values are strings, maybe 50 to 500 characters.
Frequent reads, occasional writes, very few deletes.
Maybe 5 to 20 users.
Every time someone displays a screen, 20 to 100 reads need to happen
in well under a second. Most of these will be misses.

My gut feeling is that this is not a big ask, and that an RDBMS will be
able to deal with this without any trouble. The first performance-related
thing i'd do (possibly even upfront, before i had hard data that
performance was actually a problem) would be to batch the reads: collect
together all the keys you're going to need, send them over the wire all at
once, and have the far end do a query like:

SELECT key, value
FROM annotation
WHERE key IN (?, ?, ?, ?)

Plugging each key into one of the parameters in the IN-set.

There's a question over how many parameters you put in that IN-set, and
what you do if your number of keys is different. I think i'd construct the
PreparedStatement on the fly with exactly the right number of parameters,
so the question is moot. Since there will only be a small number of
different numbers of parameters (not more than one per form in your app),
the database's PreparedStatement cache should happily hold all of them,
and you won't be incurring a lot of preparation overhead.

As I said in another message, the real problem here is probably network
latency, and switching databases won't help that. I think I'll have to
abandon the brute force approach and try something more clever.

Probably. Robert Klemme's suggestion of building a toy version to get some
data on the performance numbers is an excellent one; in my part of the
world, we call this a 'spike':

http://www.extremeprogramming.org/rules/spike.html

tom

Jim Janney · Apr 3, 2010

Tom Anderson said:
My gut feeling is that this is not a big ask, and that an RDBMS will
be able to deal with this without any trouble. The first
performance-related thing i'd do (possibly even upfront, before i had
hard data that performance was actually a problem) would be to batch
the reads: collect together all the keys you're going to need, send
them over the wire all at once, and have the far end do a query like:

SELECT key, value
FROM annotation
WHERE key IN (?, ?, ?, ?)

Plugging each key into one of the parameters in the IN-set.

There's a question over how many parameters you put in that IN-set,
and what you do if your number of keys is different. I think i'd
construct the PreparedStatement on the fly with exactly the right
number of parameters, so the question is moot. Since there will only
be a small number of different numbers of parameters (not more than
one per form in your app), the database's PreparedStatement cache
should happily hold all of them, and you won't be incurring a lot of
preparation overhead.

Probably. Robert Klemme's suggestion of building a toy version to get
some data on the performance numbers is an excellent one; in my part
of the world, we call this a 'spike':

http://www.extremeprogramming.org/rules/spike.html

It occurred to me this morning that if I put a timestamp in each
record, I could

SELECT * FROM annotation WHERE timestamp > ?

to find any changes not in the local cache. Most of the time that
would return nothing, presumably fairly quickly. It would miss
deletes, but that's probably acceptable.

Tom Anderson · Apr 5, 2010

It occurred to me this morning that if I put a timestamp in each
record, I could

SELECT * FROM annotation WHERE timestamp > ?

to find any changes not in the local cache. Most of the time that would
return nothing, presumably fairly quickly. It would miss deletes, but
that's probably acceptable.

Good idea. And you could do deletes by setting the value to NULL and
leaving the record in the database.

This starts to sound like what you really want is some sort of file
replication, though.

tom

A good data structure to store INI files.	24	Feb 10, 2009
Is your computer hazardous to your kids' health?	0	Jul 13, 2009
How to map your neighborhood or any USA neighborhood	1	Dec 20, 2006
Developer Environment Setup Opinions // Seeking your opinion	11	Mar 20, 2008
What happens when your program crashes?	87	Mar 18, 2008
SQL Data driven GUI	15	Apr 5, 2006
Fundamentals of Financial Management Concise 7e Brigham Houston	0	May 1, 2011
high traffic/availability application and gnu_cxx::hash_map problem -better to use tr1/unordered_map	1	Jan 5, 2011

Ping Jim Janney - sizing your snippet store

Tom Anderson

Jim Janney

Tom Anderson

Jim Janney

Tom Anderson

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads