Ping Jim Janney - sizing your snippet store

T

Tom Anderson

Jim,

Since i spouted off on cljp about how key-value stores were better than
RDBMs, i thought i ought to put my money where my mouth is. I'm going to
put together a simple demo/benchmark for storing key-value data,
implemented on top of Tokyo Cabinet and JDBC, which will hopefully
demonstrate that TC is easier to use and faster than a database (i can try
Derby, H2 and a leading commercial database whose license prohibits the
posting of performance comparisons, and so which will have to go nameless,
lest the bearded, megalomaniacal CEO of its manufacturer get upset).

So, it would be good to know what sort of problem i'm actually trying to
show that KV stores are good for. Jim, you said:
I need to maintain a data base of small text snippets keyed by arbitrary
strings, without the overhead of a full SQL relational database. We
will have several people putting data into it so it needs to support
concurrent access over a network.

Could i trouble you to expand a little on sizing? In particular:

- How many entries are there?
- How big are the keys? What sort of things are they?
- How big are the values?
- What's the workload mix? Read vs write vs delete?
- Are writes mostly of new entries, or overwrites of existing ones?
- How skewed is the access towards a few hot entries?
- How many users are using it in parallel?
- What is the request rate like?
- How much RAM can be used? How much disk space?

Answers to any of those, no matter how rough, would be useful.

I'll post to cljp if/when i get this done.

tom
 
J

Jim Janney

Tom Anderson said:
Jim,

Since i spouted off on cljp about how key-value stores were better
than RDBMs, i thought i ought to put my money where my mouth is. I'm
going to put together a simple demo/benchmark for storing key-value
data, implemented on top of Tokyo Cabinet and JDBC, which will
hopefully demonstrate that TC is easier to use and faster than a
database (i can try Derby, H2 and a leading commercial database whose
license prohibits the posting of performance comparisons, and so which
will have to go nameless, lest the bearded, megalomaniacal CEO of its
manufacturer get upset).

So, it would be good to know what sort of problem i'm actually trying
to show that KV stores are good for. Jim, you said:


Could i trouble you to expand a little on sizing? In particular:

- How many entries are there?
- How big are the keys? What sort of things are they?
- How big are the values?
- What's the workload mix? Read vs write vs delete?
- Are writes mostly of new entries, or overwrites of existing ones?
- How skewed is the access towards a few hot entries?
- How many users are using it in parallel?
- What is the request rate like?
- How much RAM can be used? How much disk space?

Answers to any of those, no matter how rough, would be useful.

I'll post to cljp if/when i get this done.

tom

We want our customers to be able to add their own annotations to
fields on our application's screens, sort of like sticky notes. (This
is why I was fooling with tooltips earlier). Which means that every
time the app displays a screen it needs to check for an annotation on
pretty much every field on the screen. Most of the time there won't
be any, or only very few, but it has to check anyway, and we have some
very complicated screens with nested tabbed panes. Many fields appear
in more than one screen. For fields tied directly to a column in a
database table the key is the table name plus the column name. For
derived fields we assign a more or less arbitrary key; these can't be
allowed to change or the sticky notes get mixed up.

The app is in Java, started through JNLP, communicates with an AS/400
through a LAN (each customer has its own). Some customers have
branches attached through slow connections.

So:
Number of entries probably < 1000 (but who knows what people will do?)
Keys are strings, 20 to 30 characters.
Values are strings, maybe 50 to 500 characters.
Frequent reads, occasional writes, very few deletes.
Maybe 5 to 20 users.
Every time someone displays a screen, 20 to 100 reads need to happen
in well under a second. Most of these will be misses.

As I said in another message, the real problem here is probably
network latency, and switching databases won't help that. I think
I'll have to abandon the brute force approach and try something more
clever.
 
T

Tom Anderson

So:
Number of entries probably < 1000 (but who knows what people will do?)
Keys are strings, 20 to 30 characters.
Values are strings, maybe 50 to 500 characters.
Frequent reads, occasional writes, very few deletes.
Maybe 5 to 20 users.
Every time someone displays a screen, 20 to 100 reads need to happen
in well under a second. Most of these will be misses.

My gut feeling is that this is not a big ask, and that an RDBMS will be
able to deal with this without any trouble. The first performance-related
thing i'd do (possibly even upfront, before i had hard data that
performance was actually a problem) would be to batch the reads: collect
together all the keys you're going to need, send them over the wire all at
once, and have the far end do a query like:

SELECT key, value
FROM annotation
WHERE key IN (?, ?, ?, ?)

Plugging each key into one of the parameters in the IN-set.

There's a question over how many parameters you put in that IN-set, and
what you do if your number of keys is different. I think i'd construct the
PreparedStatement on the fly with exactly the right number of parameters,
so the question is moot. Since there will only be a small number of
different numbers of parameters (not more than one per form in your app),
the database's PreparedStatement cache should happily hold all of them,
and you won't be incurring a lot of preparation overhead.
As I said in another message, the real problem here is probably network
latency, and switching databases won't help that. I think I'll have to
abandon the brute force approach and try something more clever.

Probably. Robert Klemme's suggestion of building a toy version to get some
data on the performance numbers is an excellent one; in my part of the
world, we call this a 'spike':

http://www.extremeprogramming.org/rules/spike.html

tom
 
J

Jim Janney

Tom Anderson said:
My gut feeling is that this is not a big ask, and that an RDBMS will
be able to deal with this without any trouble. The first
performance-related thing i'd do (possibly even upfront, before i had
hard data that performance was actually a problem) would be to batch
the reads: collect together all the keys you're going to need, send
them over the wire all at once, and have the far end do a query like:

SELECT key, value
FROM annotation
WHERE key IN (?, ?, ?, ?)

Plugging each key into one of the parameters in the IN-set.

There's a question over how many parameters you put in that IN-set,
and what you do if your number of keys is different. I think i'd
construct the PreparedStatement on the fly with exactly the right
number of parameters, so the question is moot. Since there will only
be a small number of different numbers of parameters (not more than
one per form in your app), the database's PreparedStatement cache
should happily hold all of them, and you won't be incurring a lot of
preparation overhead.


Probably. Robert Klemme's suggestion of building a toy version to get
some data on the performance numbers is an excellent one; in my part
of the world, we call this a 'spike':

http://www.extremeprogramming.org/rules/spike.html

It occurred to me this morning that if I put a timestamp in each
record, I could

SELECT * FROM annotation WHERE timestamp > ?

to find any changes not in the local cache. Most of the time that
would return nothing, presumably fairly quickly. It would miss
deletes, but that's probably acceptable.
 
T

Tom Anderson

It occurred to me this morning that if I put a timestamp in each
record, I could

SELECT * FROM annotation WHERE timestamp > ?

to find any changes not in the local cache. Most of the time that would
return nothing, presumably fairly quickly. It would miss deletes, but
that's probably acceptable.

Good idea. And you could do deletes by setting the value to NULL and
leaving the record in the database.

This starts to sound like what you really want is some sort of file
replication, though.

tom
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,769
Messages
2,569,582
Members
45,057
Latest member
KetoBeezACVGummies

Latest Threads

Top