How would you design scalable solution?

B

Bryan

I'm designing a system and wanted to get some feedback on a potential
performance problem down the road while it is still cheap to fix.

The system is similar to an accounting system where a system tracks
"Things"
which move between different "Buckets". The system answers these
questions:
- How many Things are in this Bucket today?
- Move Things from Bucket1 to Bucket2...
- Now how many Things are in each Bucket?

So each time a Thing is moved between Buckets, I imagine a DB row like
this:
| id | thingId | toBucket | fromBucket | qty |

Then to find how many Things are in a certain Bucket:
1. Start with initial qty in the bucket at the beginning of time
2. *Add* all qty moved *to* the bucket since beginning of time
3. *Subtract* all qty moved *from* the bucket since beginning of time

Simple system to "account" for Things.

My problem is this. This design will inherantly get slower as time
goes on.
As the number of rows recording a transfer between buckets increases,
the query
to see how many Things are in a Bucket will get slower. I experience
this when
I use gnucash (which I love). I don't do "closing entries" at the end
of the
year, so each account has every transaction I have every made. I see
it getting
slower. It is nothing I am going to do anything about, because it is
still fast
enough for me. But I have to wonder how big companies with thousands
of
transactions a day do this?

One solution would be to do a "closing entry" at certain periods in
time, so old
info would be archived. Each bucket would start the new time period
with a
balance of Things equal to what it was at the point in time we
"closed".

How else to keep a record of every transaction, but not have the speed
of the
question "How many Things in Bucket x" depend on looking @ every
transaction
record ever made?
 
J

Jonathan Gardner

How else to keep a record of every transaction, but not have the speed
of the
question "How many Things in Bucket x" depend on looking @ every
transaction
record ever made?

You can have three different tables in your database:

(1) The transaction log. (You described it above.)

(2) What the current state of the entire system is---where everything
is.

(3) A materialized view of the total of what is in each bucket.

Note that you didn't specify that the system should answer the
question "What is the history of this item?" or "What is the
historical contents of this bucket?" Because of this, you really don't
need (1). Gnucash has this requirement since people would like to go
back in time and see what happened historically.

Your requirements also don't specify that the system should answer the
question, "What is in this bucket right now?" Because of that, you
really don't need to expose the data in (2). However, I assume you'll
need some way of saying, "Can't move X from A to B because X isn't in
A!" so (2) is necessary.

(3) gives you exactly what you need, with linear look up by bucket ID
if you use a hash index for any number of buckets. You can also obtain
the value of all buckets linearly.

Do some research into what a "materialized view" is and how to keep it
in sync with the actual state of the data so that you can do this
properly. I believe that should solve your problem.

If you intend to implement this in memory by yourself, please
familiarize yourself on the various issues you will run into with
transactional and persistent data, especially with multi-threading if
you will allow it.

If you use a proper database to manage transactions for you, there is
no reason that the transaction log, current contents, and current
summary materialized view should ever get out of sync with each other,
as long as your transactions are all correct. For best results, write
a database function that does what you need and have all the clients
update the tables through it.
 
J

Jon Clements

I'm designing a system and wanted to get some feedback on a potential
performance problem down the road while it is still cheap to fix.

The system is similar to an accounting system where a system tracks
"Things"
which move between different "Buckets".  The system answers these
questions:
- How many Things are in this Bucket today?
- Move Things from Bucket1 to Bucket2...
- Now how many Things are in each Bucket?

So each time a Thing is moved between Buckets, I imagine a DB row like
this:
 | id | thingId | toBucket | fromBucket | qty |

Then to find how many Things are in a certain Bucket:
1. Start with initial qty in the bucket at the beginning of time
2. *Add* all qty moved *to* the bucket since beginning of time
3. *Subtract* all qty moved *from* the bucket since beginning of time

Simple system to "account" for Things.

My problem is this.  This design will inherantly get slower as time
goes on.
As the number of rows recording a transfer between buckets increases,
the query
to see how many Things are in a Bucket will get slower.  I experience
this when
I use gnucash (which I love).  I don't do "closing entries" at the end
of the
year, so each account has every transaction I have every made.  I see
it getting
slower.  It is nothing I am going to do anything about, because it is
still fast
enough for me.  But I have to wonder how big companies with thousands
of
transactions a day do this?

One solution would be to do a "closing entry" at certain periods in
time, so old
info would be archived.  Each bucket would start the new time period
with a
balance of Things equal to what it was at the point in time we
"closed".

How else to keep a record of every transaction, but not have the speed
of the
question "How many Things in Bucket x" depend on looking @ every
transaction
record ever made?

As well as what Jonathan has said, the following might be worth
researching:

() "Table Partitioning" - a specific example for postgres is here:
http://www.postgresql.org/docs/8.1/static/ddl-partitioning.html.
(Although other DB's support similar techniques)
() Shared Nothing Architecture (aka. 'Sharding') with appropriate
'buckets' which is massively scalable, but I'm guessing overkill :)
http://en.wikipedia.org/wiki/Shared_nothing_architecture - IIRC
SQLAlchemy has a basic implementation of this in its examples.

Cheers,

Jon.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,755
Messages
2,569,536
Members
45,019
Latest member
RoxannaSta

Latest Threads

Top