Tree ADT vs Database

djcredo · Apr 8, 2006

Hi all,

I have an application that currently works with two very large n-ary
trees. A lot of time is spent doing depth-first searches on these
trees.

I'm thinking of moving away from using trees to using a database
backend, and using SQL to query the database. Would I get any
performance improvement this way? What sort of datastructure does a
database use? What sort of Big-Oh times would I be likely to get?

Thanks in advance!
Matt

Rhino · Apr 8, 2006

Hi all,

I have an application that currently works with two very large n-ary
trees. A lot of time is spent doing depth-first searches on these
trees.

I'm thinking of moving away from using trees to using a database
backend, and using SQL to query the database. Would I get any
performance improvement this way?

That's impossible to answer: we don't know what performance you are getting
with your current code! Database queries can run very quickly under ideal
circumstances but a lot depends on:
- what sort of database you are using - flat file, hierarchical, relational,
etc. Since you mention SQL, I assume you want a relational database but
there are some non-relational databases that use some form of SQL.
- what specific database you are using, e.g. DB2, Oracle, MySQL
- how well you have normalized and de-normalized your data
- how fast your hardware is
- how your data is clustered
- how efficient your SQL is
- etc. etc.

What sort of datastructure does a database use?

I'm not sure what you mean by that. Logically, you always think of
relational database data being stored as rows in tables; that's the paradigm
underlying SQL. Internally, however, a relational database may use other
structures to make it easier to find your data. For instance, DB2 uses
b-trees very heavily "under the covers", although you don't need to worry
about that when you are writing your SQL. Other vendors use other approaches
and it's not unusual to see a mix of approaches in each vendor's product.

What sort of Big-Oh times would I be likely to get?

I've been working with database for 25+ years and I've never heard the term
"Big-Oh": what is it? Assuming that it is some kind of performance
measurement, refer to my earlier remarks.

In short, databases have many virtues and can have excellent performance but
that doesn't mean they will _inevitably_ perform better than whatever you're
doing now.

If you're new to databases, you will need to allow time for you - and the
rest of your development team if you are not working alone - to get up to
speed. You'll need to learn proper database design and all of the various
issues surrounding databases, like installation, administration,
maintenance, etc. It might be best for you to attempt a smallish prototype
first so that you can get familiar with all of the implications of using a
database; you can do some benchmarking with your prototype and see if you
are getting the performance you want without having to invest an excessive
amount of time and money on databases.

djcredo · Apr 8, 2006

Hi,

Thanks for your reply. I guess I could of included a few more
information in my original post. I mention Big-O notation because it
provides a good performance estimate for common operations, without
needing to know details such as hardware performance, etc.

If you search for "Big O notation" in wikipedia, you can look for
yourself at what i means. To put it simply, it mathematically gives you
time for operations. For example, accessing an array takes O(1) time,
which is constant time. Searching an array takes O(n) time, which is
proportional to the number of entries "n" in the array. This is also
known as "Linear time".

Big O notation gives a good comparison between data structures, because
you don't need to know anything about the implementation. For example,
comparing arrays vs Hash Tables, searching requires O(n) vs O(1)
respectively. You can also state the best case time (say, sorting an
array that is already sorted), and the worst case time (eg, accessing
elements in a binary tree that is poorly constructed, with all nodes on
the left side).

I've spent a long time working with relational databases, so I don't
think I'll have any problem there. If my data is normalized to the 3rd
normal form, would you expect searching to be quicker than a simple
tree? Thats kind of why I was interested in the underlying data
structures of a standard database (MySQL for example), to get an idea
as to whether it would be faster or not.

djcredo · Apr 8, 2006

Also, lets assume for sake of simplicity, that we are working in the
Best case scenario - when the data is clustered best possible way, when
the database is most efficient etc.

I'm trying to see using a database would be (in the best case), any
faster than a n-ary tree (also in the best case)

Cheers!

Patricia Shanahan · Apr 8, 2006

Rhino said:
I'm not sure what you mean by that. Logically, you always think of
relational database data being stored as rows in tables; that's the paradigm
underlying SQL. Internally, however, a relational database may use other
structures to make it easier to find your data. For instance, DB2 uses
b-trees very heavily "under the covers", although you don't need to worry
about that when you are writing your SQL. Other vendors use other approaches
and it's not unusual to see a mix of approaches in each vendor's product.

According to Cormen et. al, "Introduction to Algorithms", the number of
disk pages accessed in a B-tree search is:

\Theta(h) = \Theta(log_t n)

where h is the height of the tree, n is the number of keys, and t is the
minimum degree parameter.

The total CPU time for the search is:

O(th) = O(t log_t n)

I've been working with database for 25+ years and I've never heard the term
"Big-Oh": what is it? Assuming that it is some kind of performance
measurement, refer to my earlier remarks.

It's a standard way of looking at algorithm performance. See e.g.
http://www.nist.gov/dads/HTML/bigOnotation.html.

The good news is that it can be calculated without getting into a lot of
details about the hardware running the algorithm. It is a way of
discussing the broad picture, without getting into too many details.
The bad news is that, in some cases, the constant multipliers matter,
and big-O loses them.

\Theta (capital Greek letter theta) is part of the same family of
complexity measures. It sets a lower bound, as well as the upper bound
implied by big-O.

Patricia

.Jesper. · Apr 8, 2006

I think the concepts you need is in-memory versus disk access. All
other things equal, reads from a harddisk are slower than if you have
your whole data structure in memory. If you chose the database approach
your design will determine how slow or fast things are. Relational
databases are an excellent way of housing your data, but in some cases,
it makes sense to have things in memory.

Big-Oh is the asymptotic complexity. It is asymptotic because we want
to isolate the most dominant term in a function, and then show its
growth rate based on the number of items, n.
A quadratic algorithm can be notated with O(n^2) ("order n squared").
To determine big-oh for a database read you'd have to know a whole lot
about how the database is implemented. A common appproach is to use
query analysis tools. They will not say anything in terms of Big-oh,
but they will give you an idea of how the database plans on performing
your sql statment, algorithmically. They will give you a "query
execution plan" for your statement.

Best regards - Jesper

Patricia Shanahan · Apr 8, 2006

Also, lets assume for sake of simplicity, that we are working in the
Best case scenario - when the data is clustered best possible way, when
the database is most efficient etc.

I'm trying to see using a database would be (in the best case), any
faster than a n-ary tree (also in the best case)

Cheers!

I suspect you would lose slightly going to database, at least compared
to the best you could do tuning your own data structures. Database
structures tend to be trade-offs between search speed and insert/delete
speed. A B-tree, for example, tolerates unused slots in nodes.

It sounds as though you have a situation in which search is very much
more frequent than insert or delete.

Patricia

Missaka Wijekoon · Apr 8, 2006

Hi all,

I have an application that currently works with two very large n-ary
trees. A lot of time is spent doing depth-first searches on these
trees.

I'm thinking of moving away from using trees to using a database
backend, and using SQL to query the database. Would I get any
performance improvement this way? What sort of datastructure does a
database use? What sort of Big-Oh times would I be likely to get?

Thanks in advance!
Matt

Matt,

As many of the posters have told you, whether a database using SQL is
faster (or better) is dependent on how you plan to use and access that
data. I have written memory based trees for speed and used SQL for
longer term storage, etc.

Remember, that in order to "talk" to a database the db library has to
transfer and parse/unparse SQL back and forth. In addition, there may
be IPC, network and/or i/o latencies in communicating with a relational
database like BD2 or MySQL. The extra overhead involved in doing SQL
may be more time consuming than direct memory accesses involved in your
own tree.

How many nodes are you keeping in memory? How big is a node? Is it
read much more often than it is updated? Does it need to be persistant?
How many nodes per second does your application access?

From my own personal experience, a relational database is slower than
even a slow memory based tree. Also, normalization helps to compress
the data. In accessing the data, the time required to de-normalize the
data may require more time.

-Misk

Roedy Green · Apr 8, 2006

It's a standard way of looking at algorithm performance. See e.g.
http://www.nist.gov/dads/HTML/bigOnotation.html.

also see http://mindprod.com/jgloss/onsquared.html for a more informal
explanation.

Roedy Green · Apr 8, 2006

I think the concepts you need is in-memory versus disk access. All
other things equal, reads from a harddisk are slower than if you have
your whole data structure in memory.

Another factor is algorithm analysis tends to count node accesses, key
comparisons etc. These may not be where the algorithm spends all its
time, especially for pure in-RAM access and integer keys.

new/delete vs. vector implementing n-ary tree	4	Apr 14, 2008
ANN: eGenix mxODBC Django Database Engine - Django ODBC Adapter 1.2.0	0	Jun 18, 2013
Asp.Net session vs. Database Queries.	3	Nov 29, 2007
SQLite logic error or missing database	1	Jul 30, 2013
TransactionScope vs SqlTransaction vs (BEGIN TRAN/COMMIT TRAN) on a SINGLE 2005 database	2	Aug 26, 2008
Monoliths vs. "Microlibraries"	5	May 14, 2011
[ANN] Struqtural: High level database interface library	3	Jul 17, 2010
Large Database System	4	Oct 19, 2007

Tree ADT vs Database

djcredo

Rhino

djcredo

djcredo

Patricia Shanahan

.Jesper.

Patricia Shanahan

Missaka Wijekoon

Roedy Green

Roedy Green

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads