B+-trees


Rico

This is not a Java-specific question, but it is definitely one for
programmers, so I'm thinking I'm at least half on-topic here.

I'm trying to get a firmer grasp of B+-trees as used for indexing.
I have an input text file with around a thousand lines as follows:

0|Item65| leaguers rebuked realizing southwest haply <....> |81.44|1
1|Item55| Portugal movings grooms backstitching octane <.....> |86.8|3

Each line has a unique line number, an item label, a list of words
associated with the item, a price, and an inventory level.
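
Here is roughly how I read each line into a little record object; the class
and field names below are just my own, not anything from the assignment:

// Rough sketch of parsing one pipe-delimited line of the input file.
// The field names are my own labels for the columns described above.
import java.util.Arrays;
import java.util.List;

public class Record {
    final int lineNumber;
    final String itemLabel;
    final List<String> keywords;
    final double price;
    final int inventory;

    Record(int lineNumber, String itemLabel, List<String> keywords,
           double price, int inventory) {
        this.lineNumber = lineNumber;
        this.itemLabel = itemLabel;
        this.keywords = keywords;
        this.price = price;
        this.inventory = inventory;
    }

    static Record parse(String line) {
        String[] f = line.split("\\|");
        return new Record(
            Integer.parseInt(f[0].trim()),
            f[1].trim(),
            Arrays.asList(f[2].trim().split("\\s+")),
            Double.parseDouble(f[3].trim()),
            Integer.parseInt(f[4].trim()));
    }
}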

http://cs.ecs.baylor.edu/~speegle/5335/indexes.html

I think that a purely RAM implementation means that I'd have my own node
structures into which I insert the data from the input file.
What if I need to make each node its own file and the pointers are
supposed to be filenames?
Am I still supposed to load the data into node structures or is the point
to just rely on the O/S taking care of loading the page as a disk block?

Also, the specifications just say that each node can have from 50 to 200
pointers. I think that makes sense for non-leaf nodes: with a disk block
size of 4096 bytes, and 20 bytes being enough to store a keyword/pointer
pair from which I can derive the filename, 200 pointers fit in one block.
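
Concretely, something like this sketch is what I have in mind for a non-leaf
node, with child pointers stored as filenames (the 20-byte estimate and the
class name are my own, not from the spec):

// Sketch of an internal (non-leaf) node where each child pointer is the
// filename of another node file. With a 4096-byte block and roughly
// 20 bytes per key/filename pair, 4096 / 20 = 204, so 200 pointers fit.
import java.util.ArrayList;
import java.util.List;

public class InternalNode {
    static final int MIN_POINTERS = 50;   // lower bound from the specification
    static final int MAX_POINTERS = 200;  // upper bound from the specification

    // keys.get(i) separates children.get(i) and children.get(i + 1)
    final List<String> keys = new ArrayList<>();
    // each entry is the filename of a child node
    final List<String> children = new ArrayList<>();

    boolean isFull() {
        return children.size() >= MAX_POINTERS;
    }
}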

But I'm not sure whether that variable number of pointers applies to
leaf nodes, or bucket files, as well. After analyzing the data, e.g. the
highest number of distinct lines any one keyword occurs on and the longest
keyword, I'm thinking that it doesn't, because 50 keywords together with
the list of line numbers on which each occurs wouldn't fit in one disk
block, let alone 200 keywords.

What I would like to know is: am I doing this right?
The quick analysis of the data was done programmatically using
hashtables. Is that "cheating", or is making such assumptions about the
data necessary when designing the B+-tree?
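
For what it's worth, the analysis was nothing fancy, roughly along these
lines (the input file name is made up, and it reuses the Record sketch
from above):

// Rough sketch of the analysis: for each keyword, collect the set of
// line numbers it occurs on, then report the largest such set and the
// longest keyword seen.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class Analyze {
    public static void main(String[] args) throws IOException {
        Map<String, Set<Integer>> occurrences = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get("input.txt"))) {
            Record r = Record.parse(line);
            for (String word : r.keywords) {
                occurrences.computeIfAbsent(word, k -> new TreeSet<>())
                           .add(r.lineNumber);
            }
        }
        int maxDistinctLines = 0;
        int longestKeyword = 0;
        for (Map.Entry<String, Set<Integer>> e : occurrences.entrySet()) {
            maxDistinctLines = Math.max(maxDistinctLines, e.getValue().size());
            longestKeyword = Math.max(longestKeyword, e.getKey().length());
        }
        System.out.println("max distinct lines for one keyword: " + maxDistinctLines);
        System.out.println("longest keyword: " + longestKeyword + " characters");
    }
}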

Thanks.
Rico.
 

Michael Borgwardt

Rico said:
I'm trying to get a firmer grasp of B+-trees as used for indexing.
I think that a purely RAM implementation means that I'd have my own node
structures into which I insert the data from the input file.

True, but a pure RAM implementation of a B-tree is pretty pointless
except as a learning exercise. The main (only) advantage of B-trees is
that they are very well suited to being kept on hard disk, because they
minimize the speed penalties of disk access. In RAM, they are relatively slow.
What if I need to make each node its own file and the pointers are
supposed to be filenames?

That would be a good next step, allowing you to experiment and see how
much faster the B-tree is than other tree structures for very large,
and therefore disk-based, data. In a "professional" implementation,
you would not use files but access disk blocks directly.
Am I still supposed to load the data into node structures or is the point
to just rely on the O/S taking care of loading the page as a disk block?

That's not the point; the point is that you somehow have to process
the data to find the correct child node. The easiest way of implementing
this is to convert the node into an object structure, but of course
that's pretty bad for performance.
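
As a sketch of what I mean, assuming the node-object and filename-pointer
scheme you described (the class and method names are made up):

// Sketch of descending one level: the node's keys are kept sorted, so a
// binary search tells you which child filename to load and read next.
import java.util.Collections;

public class Descend {
    // returns the filename of the child that may contain searchKey
    static String childFor(InternalNode node, String searchKey) {
        int pos = Collections.binarySearch(node.keys, searchKey);
        // binarySearch returns (-(insertion point) - 1) when the key is absent
        int childIndex = pos >= 0 ? pos + 1 : -(pos + 1);
        return node.children.get(childIndex);
    }
}
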
Also, the specifications just say that each node can have from 50 to 200
pointers.

Then the specifications do not in fact describe a B-tree, because by
definition a node in such a tree must never be less than half full: with
a maximum of 200 pointers, the minimum would have to be 100, not 50.
But I'm not sure whether that variable number of pointers applies to
leaf nodes, or bucket files, as well. After analyzing the data, e.g
highest occurrence of a keyword on distinct lines and longest keyword,
I'm thinking that it doesn't because 50 keywords and the list of line
numbers on which each occurs wouldn't fit in one disk block, let alone
then 200 keywords.

I think it would be perfectly OK for leaf nodes to be bigger than
a disk block. In real databases, a far more complex technique is used,
and data items (records) are placed separately, with the B-tree just
containing a pointer to them (there can be several indices for the
same data, after all).
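
As a sketch, a leaf entry in that scheme is nothing more than a key plus
some kind of record locator; the names here are made up:

// Sketch: in the "pointer to the record" scheme a leaf holds no record
// data at all, only (key, locator) pairs. The locator could be a byte
// offset into a data file, a block number, or in your exercise simply
// the line number of the record.
public class LeafEntry {
    final String key;        // e.g. the keyword being indexed
    final long recordOffset; // where the actual record lives on disk

    LeafEntry(String key, long recordOffset) {
        this.key = key;
        this.recordOffset = recordOffset;
    }
}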

What I would like to know is, am I doing it right?
The quick analysis of the data was done programmatically using
hashtables. Is that "cheating" or is making such assumptions about the
data necessary in designing the B+-tree ?

I'm not quite sure what you're asking.
The main point of balanced trees in general is that you do NOT have to
make any assumptions about the data, since the data structure is designed
to avoid degeneration (and thus guarantee fast access) no matter what
kind of data you put into it.
 

Rico

Hello Michael.

Michael Borgwardt wrote:
True, but a pure RAM implementation of a B-tree is pretty pointless
except as a learning exercise. The main (only) advantage of B-trees is
that they are very well suited to being kept on hard disk, because they
minimize the speed penalties of disk access. In RAM, they are relatively slow.

I think it would be perfectly OK for leaf nodes to be bigger than
a disk block. In real databases, a far more complex technique is used,
and data items (records) are placed separately, with the B-tree just
containing a pointer to them (there can be several indices for the
same data, after all).

Is it right to say that the B-tree minimizes the speed penalties of HD
access because it is designed so as to minimize page faults?

http://www.cs.umd.edu/~meesh/kmconroy/cmsc420/BPTree.html

Have a look at the part where the above URL talks about a "post-it".
Would that be relevant in the case of database indexing?
E.g. when the B-tree just contains a pointer to the records, aren't we
setting ourselves up for page-faulting like crazy here?

I am probably missing something, because replicating the records for
every index is a spooky idea, but how do we get around the 'post-it'
causing page faults and hurting performance?

<Rest of a very useful post snipped for brevity.>

Thanks for the insight.

Rico.
 

Michael Borgwardt

Rico said:
Is it right to say that the B-tree minimizes the speed penalties of HD
access because it is designed so as to minimize page faults?
http://www.cs.umd.edu/~meesh/kmconroy/cmsc420/BPTree.html

Basically, yes. The core characteristic is the extremely large branching
degree, which means that you need to traverse very few nodes to get to a
leaf, no more than 4 or 5 (at least 2 of which are probably cached
in main memory), even if you have billions of elements. On the
downside, you have to do a lot more work in each node, but that's OK
because all of that data is in one place, and working on it is much faster
than skipping across nodes that have to be fetched from the hard disk.
Hard disk vs. RAM is the main issue. CPU cache vs. main memory is a much
smaller speed difference, and thinking about it is usually pointless when
you're using a language like Java. Some of the details asserted in that
discussion may in fact vary between different VMs.
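
A quick back-of-the-envelope calculation shows why, using the fanout of 200
from your specification:

// With a fanout of f, a tree of height h can reach roughly f^h leaf
// entries, so even billions of keys need only 4 or 5 node reads.
public class Fanout {
    public static void main(String[] args) {
        long fanout = 200;
        long reachable = 1;
        for (int height = 1; height <= 5; height++) {
            reachable *= fanout;
            System.out.println("height " + height + ": about " + reachable + " keys");
        }
    }
}
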
Have a look at the part where the above URL talks about a "post-it".
Would that be relevant in the case of database indexing?
E.g when the B-tree just contains a pointer to the records, aren't we
setting ourselves up for page-faulting like crazy here?

Possibly, but a DB server is doing that most of the time anyway.
You just do everything you can to reduce the number of disk
accesses.
I am probably missing something because replicating the records for
every index is a spooky idea,

I'd say it's absolutely unfeasible. DBs need to deal with a lot of
write access, and think of the synchronization overhead it would cause!
but how do we get around the 'post-it'
causing page faults and affecting performance then?

Probably we don't, at least not all the time.

A big relief might be the fact that you often don't need to access all of
the records you find in the index, because the important fact is whether
they're in it at all. E.g. when you want to find all records that match
two criteria and you have indices on both relevant fields, you only fetch
those records that occur in *both* indices.
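
In sketch form, assuming each index lookup hands you a set of record
locators:

// Sketch of answering "field A = x AND field B = y" with two indices:
// intersect the two sets of record locators first, then fetch from disk
// only the records that appear in the intersection.
import java.util.HashSet;
import java.util.Set;

public class IndexIntersection {
    static Set<Long> matchingRecords(Set<Long> fromIndexA, Set<Long> fromIndexB) {
        Set<Long> result = new HashSet<>(fromIndexA);
        result.retainAll(fromIndexB); // keep only locators present in both
        return result;                // only these records need to be read
    }
}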
 

Roedy Green

Is it right to say that the B-tree minimizes the speed penalties of HD
access because it is designed so as to minimize page faults?

When you write a B-tree implementation, you control precisely where
everything goes on the hard disk. You place things likely to be fetched
together, e.g. the next record by key, in the same disk block.

You keep higher order indexes in blocks by themselves.

You periodically balance the tree to try to keep a certain proportion
of free space in each data block. That way you can quickly add a
record without splitting the block and having to fix up pointers.

If you leave too much free space, you end up doing physical I/O on
emptiness.

Writing any linked structure on disk is at least an order of magnitude
more difficult than doing it in RAM. You are constantly trying to
avoid moving things. You can't just allocate space; you have to keep
track yourself of where you can get some, and also recycle space from
deleted records. Records can grow or shrink, which often means blocks
must be split or recombined.

The one I did was crashproof, which added even more difficulty. The
goal there is that you want the database to be logically consistent even
if the power fails at almost any time. That means you must never update
records in place, but write new ones, and only when everything is in place
flip an index pointer over to the new index records.
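
In today's Java terms the ordering might look something like this; it is
just a sketch of the idea, not what I actually wrote back then:

// Sketch of the write ordering for crash safety: write the new version
// to a fresh file, force it to disk, then atomically swap the "pointer"
// (here a rename) so readers see either the old state or the new one.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

public class CrashSafeUpdate {
    static void publish(byte[] newNodeBytes, Path liveFile) throws IOException {
        Path temp = liveFile.resolveSibling(liveFile.getFileName() + ".new");
        try (FileChannel ch = FileChannel.open(temp,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ch.write(ByteBuffer.wrap(newNodeBytes));
            ch.force(true); // the new version must be on disk before the switch
        }
        // only now flip the pointer: an atomic rename over the old file
        Files.move(temp, liveFile, StandardCopyOption.ATOMIC_MOVE);
    }
}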

In the one I did, indexing information grew up in each block and data
information grew down. Sorry, I don't have the source code. I did it
for BC Hydro for use in real-time photogrammetry. I had to be able to
handle 30 inserts a second on a PDP-11, quite a clip in those days.
This meant there was manually managed caching as well, all done in
Pascal, which did not have a proper concept of a container.
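
In rough modern Java terms, that block layout (what database texts now call
a slotted page) looks something like this sketch:

// Sketch of a block where the slot directory grows forward from the
// start and record data grows backward from the end; the gap in the
// middle is the free space.
import java.nio.ByteBuffer;

public class SlottedBlock {
    static final int BLOCK_SIZE = 4096;

    private final ByteBuffer block = ByteBuffer.allocate(BLOCK_SIZE);
    private int slotCount = 0;        // slot count is also stored at offset 0
    private int freeEnd = BLOCK_SIZE; // record data grows down from here

    // each slot is a 4-byte int holding the offset of its record's bytes
    boolean insert(byte[] record) {
        int slotOffset = 4 + slotCount * 4;
        if (slotOffset + 4 > freeEnd - record.length) {
            return false; // block is full, the caller must split it
        }
        freeEnd -= record.length;
        block.position(freeEnd);
        block.put(record);                 // data grows down
        block.putInt(slotOffset, freeEnd); // slot directory grows up
        slotCount++;
        block.putInt(0, slotCount);
        return true;
    }
}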

When you do this sort of thing purely in RAM, the automatic memory
allocation, the GC, and the OS caching do much of the work for you.
Further, there are no head motions to worry about. You can hop about
all over the place pretty much without penalty, at least compared with
what happens when you do that directly on a hard disk.
 

Roedy Green

I'm trying to get a firmer grasp of B+-trees as used for indexing.
I have an input text file with around a thousand lines as follows:

For my Abundance work, I use Pervasive Btrieve, a Btree
implementation.
 

Roedy Green

For my Abundance work, I use Pervasive Btrieve, a Btree
implementation.

Inside SQL engines is some sort of disk-based B-tree lookup. The SQL
layer hides the mechanics of it from you, but with the open-source ones
you could go in and look at how it works. It is extremely tedious code.
The key really is understanding the disk structure.
 

Michael Borgwardt

Roedy said:
The one I did was crashproof, which added even more difficulty. The
goal there is that you want the database to be logically consistent even
if the power fails at almost any time. That means you must never update
records in place, but write new ones, and only when everything is in place
flip an index pointer over to the new index records.

In the one I did, indexing information grew up in each block and data
information grew down.

This sounds a lot like what I was taught as standard technique in my
database design lecture a few years ago.
 

Roedy Green

This sounds a lot like what I was taught as standard technique in my
database design lecture a few years ago.

This was the early 80s. B-trees have been around a long time. There were
some pretty inept implementations, e.g. IBM ISAM, in the early 70s.
 

Rico

This was the early 80s. B-trees have been around a long time. There were
some pretty inept implementations, e.g. IBM ISAM, in the early 70s.

Thanks for the input Roedy and Michael. Well worth the read :)

Rico.
 

Eric Sosman

Roedy said:
This was the early 80s. B-trees have been around a long time. There were
some pretty inept implementations, e.g. IBM ISAM, in the early 70s.

ISAM wasn't a B-tree, nor any variation thereof.
You may be thinking of VSAM.
 
