B-trees


rsprawls

I found a disk with a b-tree algorithm that I purchased back in '93 or
so. I'd hoped to find this, and now I'd like to know: how worthwhile
are b-trees given today's advancements? This is old C code using pascal
and cdecl declarations. It would be a minor project to bring it up to
date in C, but I was thinking it might be a good project to convert it
over to C++, especially since I don't know C++ very well and would
like a good sorting algorithm for medium-sized datafiles.

Are b-trees still a valued data structure today? It's been so long
since I last set foot in programming; I'm itching to start again, but
I need a starting point.
 

Erik Wikström

I found a disk with a b-tree algorithm that I purchased back in '93 or
so. I'd hoped to find this, and now I'd like to know: how worthwhile
are b-trees given today's advancements? This is old C code using pascal
and cdecl declarations. It would be a minor project to bring it up to
date in C, but I was thinking it might be a good project to convert it
over to C++, especially since I don't know C++ very well and would
like a good sorting algorithm for medium-sized datafiles.

Are b-trees still a valued data structure today? It's been so long
since I last set foot in programming; I'm itching to start again, but
I need a starting point.

Not really topical for this group; you might want to ask in a general
programming group like comp.programming.

AFAIK B-Trees or one of their derivatives are used in most modern
filesystems, since the time it takes to read in a node from the disk is
so high that you want to have as few reads as possible. I would also
suspect that they are commonly used in databases and other places which
deal with on-disk storage. If the data structure is in RAM, other
structures are generally better.
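
To make the disk-block point concrete, here is a rough C++ sketch of an
on-disk B-Tree node sized to fill one block, so that examining a node
costs exactly one read. The 4 KB block size and the 64-bit keys and
child addresses are assumptions for illustration, not any particular
filesystem's layout:

    #include <cstddef>
    #include <cstdint>

    constexpr std::size_t kBlockSize = 4096;   // assumed disk-block size

    // Each key slot pairs with a child pointer; one extra child at the end.
    constexpr std::size_t kMaxKeys =
        (kBlockSize - sizeof(std::uint16_t) - sizeof(std::uint64_t))
        / (2 * sizeof(std::uint64_t));

    struct BTreeNode {
        std::uint16_t keyCount;                // keys currently in use
        std::uint64_t keys[kMaxKeys];          // sorted keys
        std::uint64_t children[kMaxKeys + 1];  // disk addresses of children
    };

    static_assert(sizeof(BTreeNode) <= kBlockSize,
                  "a node must fit in one disk-block");

With roughly 255 keys per node under these assumptions, the fan-out is
large enough that even huge trees stay only a few reads deep.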
 

James Kanze

On 2008-08-03 00:02, rsprawls wrote:
Not really topical for this group, you might want to ask in a
general programming group like comp.programming.
AFAIK B-Trees or one of their derivatives are used in most
modern filesystems, since the time it takes to read in a node
from the disk is so high that you want to have as few reads as
possible.

I'm not sure what you're considering a B-Tree here. I don't
think that the OS does anything special with directories,
however: the general philosophy is that if your directory
structure is well organized, no individual directory will
contain enough entries to make anything other than linear search
worth the effort. Also, the main use of B-Trees is to keep
sorted data; the OS's I know don't maintain the directory
entries sorted, so a B-Tree wouldn't improve access in any way.
I would also suspect that they are commonly used in databases
and other places which deal with on-disk storage.

In a relational database, it probably depends on the types of
indexes, and the use which is made of them. I think modern
databases dynamically tune the organization according to the
actual accesses they see, which means that if sorted access
isn't useful, they won't use a B-Tree. (Sorted access is useful
not only when you request sorted output, but also when you have
index criteria like "where x > a and x < b". On the other hand,
a hash table is generally better for "where x = a", and off
hand, I can't think of any "optimizing" structure for things
like "where x like '...%'"---I've also noticed that when I use
such requests, my access times go up significantly:).)
If the data structure is in RAM, other structures are generally
better.

That used to be true, but I'm not sure now, with paging, etc.
If you need random access to sorted data, using a B-Tree with
each node filling an entire page (and page aligned) might be
worth it (but you'd need one awfully big table to see the
difference).
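
To put a rough number on "awfully big" (assuming 4 KB pages holding
around 255 keys each, i.e. a fan-out of about 256): a tree of depth 2
covers about 65,000 keys, and depth 3 about 16.7 million (256^3 - 1),
so even a table with millions of entries needs only three page
accesses per lookup.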
 

Erik Wikström

I'm not sure what you're considering a B-Tree here. I don't
think that the OS does anything special with directories,
however: the general philosophy is that if your directory
structure is well organized, no individual directory will
contain enough entries to make anything other than linear search
worth the effort. Also, the main use of B-Trees is to keep
sorted data; the OS's I know don't maintain the directory
entries sorted, so a B-Tree wouldn't improve access in any way.

The nice thing about B-Trees is that they keep the depth of the tree
very low, which means that the number of nodes that have to be examined
before you find the data you want is also low. By making each node in
the tree the size of (or a multiple of) a disk-block you can make the
number of disk-reads equal to the number of nodes examined. And since
disk-reads are far more expensive than almost any operation you might do
on the data, this is a Good Thing.
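
In numbers: with branching factor b, a lookup touches roughly log_b(N)
nodes, so with b = 256 even a billion records need at most
ceil(log_256(10^9)) = 4 disk-reads, versus around 30 node visits for a
binary tree.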

An example of how one might organise a filesystem would be to keep four
things on disk: a directory (containing directories and files), a B-Tree
of inodes (where an inode contains the data about a file, including
information about where the actual file data is stored), file data, and
a B-Tree with free-space information.

Each entry in the directory represents a file in the system and contains
the inode number of the file. The directory can probably be made small
enough to keep at least part of it in cache.

To read the contents of a file you first look it up in the directory,
which gives you the inode number, which is used to find the inode in the
B-Tree. The inode then contains data about the file and in which
disk-blocks the file data is stored. For large files (or sparse files)
you can once again use a B-Tree for the disk-blocks to speed up random
access.

The free space tree is much simpler; it just contains the address of
each free disk-block.
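
As a rough C++ sketch of that lookup path (all names invented, with
in-memory std::maps standing in for the on-disk trees):

    #include <cstdint>
    #include <map>
    #include <string>

    struct Inode {
        std::uint64_t size;                             // file size in bytes
        std::map<std::uint64_t, std::uint64_t> blocks;  // file block -> disk block
    };

    std::map<std::string, std::uint64_t> directory;  // name -> inode number
    std::map<std::uint64_t, Inode>       inodeTree;  // inode number -> inode

    // Disk block holding byte `offset` of the named file.
    std::uint64_t blockOf(const std::string& name, std::uint64_t offset) {
        constexpr std::uint64_t kBlockSize = 4096;    // assumed
        std::uint64_t ino = directory.at(name);       // 1: directory lookup
        const Inode& inode = inodeTree.at(ino);       // 2: inode-tree lookup
        return inode.blocks.at(offset / kBlockSize);  // 3: block-map lookup
    }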
 

Jerry Coffin


[ ... Warning: this post is long, and topicality is marginal at best: ]
I'm not sure what you're considering a B-Tree here. I don't
think that the OS does anything special with directories,
however: the general philosophy is that if your directory
structure is well organized, no individual directory will
contain enough entries to make anything other than linear search
worth the effort. Also, the main use of B-Trees is to keep
sorted data; the OS's I know don't maintain the directory
entries sorted, so a B-Tree wouldn't improve access in any way.

Windows NTFS does use B-trees for directories. Small directories (up to
something like 4K of total data) are stored entirely in the directory's
MFT entry as essentially a vector. Large directories (anything where the
directory entries won't fit entirely in the MFT) are stored as B-trees.

Most Unix file systems use hashing instead. E.g. ext2 and ext3 are quite
common on Linux. There's a pretty decent description of their hashing
at: http://www.nongnu.org/ext2-doc/ext2.html#INDEXED-DIRECTORY

Interestingly, though this uses hashing, it's not "flat" hashing -- the
hash is used as the key in (you've probably guessed by now) a balanced
tree -- pretty much a B-tree.

Though the decision as to when to use this indexing is made a bit
differently than it is in NTFS, the same basic idea applies: only large
directories are indexed in this fashion.
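
A loose sketch of that "hash as tree key" idea (not the real ext2/ext3
on-disk format; the hash function and the types are placeholders):

    #include <cstdint>
    #include <map>
    #include <string>

    std::uint32_t nameHash(const std::string& name) {  // placeholder hash
        std::uint32_t h = 5381;
        for (unsigned char c : name) h = h * 33u + c;
        return h;
    }

    // The directory index uses the hash of a file name as the key in a
    // sorted tree (standing in for the on-disk B-tree), mapping to the
    // directory block that holds the matching entries.
    std::map<std::uint32_t, std::uint64_t> dirIndex;

    std::uint64_t entryBlock(const std::string& name) {
        return dirIndex.at(nameHash(name));  // then scan that block linearly
    }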

There are certainly a few primitive file systems (e.g. FAT) that don't
support indexing directories at all, but at this point, I'd consider
them more the exception than the rule. Oddly, one of the primary uses of
FAT at this point is in things like cards in digital cameras -- which
typically organize your pictures as a single directory containing _lots_
of files (hundreds or thousands of files is quite common). IOW, one of
the most common uses of FAT also closely resembles its worst case.

[ ... ]
In a relational database, it probably depends on the types of
indexes, and the use which is made of them. I think modern
databases dynamically tune the organization according to the
actual accesses they see, which means that if sorted access
isn't useful, they won't use a B-Tree. (Sorted access is useful
not only when you request sorted output, but also when you have
index criteria like "where x > a and x < b". On the other hand,
a hash table is generally better for "where x = a", and off
hand, I can't think of any "optimizing" structure for things
like "where x like '...%'"---I've also noticed that when I use
such requests, my access times go up significantly:).)

If you didn't mind a huge index, you could create an index that pointed
to a particular record with the string starting from every possible
offset in a field. For example, given two records with keys of "abcd"
and "xyz", the index would look something like:

key     record number
abcd    0
bcd     0
cd      0
d       0
xyz     1
yz      1
z       1

Given its (almost inevitably large) size, you'd probably want to
organize that index as a b-tree.

Then, matching a search with a leading wildcard would be essentially
similar to matching something without the leading wildcard. Given the
larger index size, you might easily have an extra level or two in the
tree, so access might be somewhat slower, but not the orders of
magnitude slower that we expect with current SQL servers.

Of course, on its own, this design only optimizes access in the case of
a leading wildcard -- but IME, that's a very common scenario. OTOH, with
only a small amount of extra data, this index could also be used for
embedded wildcards: you'd also store the offset into the string for each
entry. In this case, searching for something like 'ab%f' would consist
of searching for 'ab' with an offset of 0 and 'f' with an offset >= 2.
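
Here's a small in-memory mock-up of that index (a std::multimap
standing in for the on-disk b-tree; all names are made up). Every
suffix of every key is indexed along with its record number and
offset, and a pattern with a leading wildcard becomes an ordinary
prefix scan:

    #include <cstddef>
    #include <iostream>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    // suffix -> (record number, offset of the suffix within the key)
    std::multimap<std::string, std::pair<int, std::size_t>> index;

    void addKey(const std::string& key, int record) {
        for (std::size_t off = 0; off < key.size(); ++off)
            index.emplace(key.substr(off), std::make_pair(record, off));
    }

    // Records matching '%pat%': every indexed suffix starting with pat.
    std::vector<int> matchAnywhere(const std::string& pat) {
        std::vector<int> hits;
        for (auto it = index.lower_bound(pat);
             it != index.end() &&
             it->first.compare(0, pat.size(), pat) == 0;
             ++it)
            hits.push_back(it->second.first);
        return hits;
    }

    int main() {
        addKey("abcd", 0);
        addKey("xyz", 1);
        for (int r : matchAnywhere("cd"))
            std::cout << r << '\n';   // prints 0
    }

The stored offset is what makes the embedded-wildcard case work: for
'ab%f' you'd intersect the hits for "ab" at offset 0 with those for
"f" at offset >= 2.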

[ ... ]
That used to be true, but I'm not sure now, with paging, etc.
If you need random access to sorted data, using a B-Tree with
each node filling an entire page (and page aligned) might be
worth it (but you'd need one awfully big table to see the
difference).

I wouldn't be surprised if it did pretty well -- like paging, the
performance of caching also tends to depend heavily upon locality of
reference -- and that's a factor B-trees were designed to maximize to a
large degree.
 

James Kanze

On 2008-08-03 10:08, James Kanze wrote:

[...]
The nice thing about B-Trees is that they keep the depth of
the tree very low, which means that the number of nodes that
have to be examined before you find the data you want is also
low.

A well-designed hash table will typically require even fewer
nodes to be examined, at least with larger quantities of data.
I'm pretty sure that B-Trees are only advantageous when ordering
is also required.
An example of how one might organise a filesystem would be to
keep four things on disk: a directory (containing directories
and files), a B-Tree of inodes (where an inode contains the
data about a file, including information about where the actual
file data is stored), file data, and a B-Tree with free-space
information.

I rather suspect that most systems keep the mapping inode number
to where the inode is located on disk in memory, initializing it
once, when the disk is mounted. (The original Unix kept the
inodes in a separate contiguous sub-partition, with the inode
number being used as a direct index. A very clever scheme to
maximize head movement:). I'm not sure what is being used
today, but for large volumes, at least, a hash table would seem
most appropriate.
Each entry in the directory represents a file in the system
and contains the inode number of the file. The directory can
probably be made small enough to keep at least part of it in
cache.

The size of the directory depends on the number of files it
contains, and possibly (likely) on the length of each filename.
Again, I've not had the occasion to measure recently, but on
older systems (Sun 3, in this case), performance would degrade
radically if you put too many files in a directory. (I've seen
a simple open take close to five minutes.) The response from
system support at the time was: don't put so many files in a
directory. On the whole, I think that they're right.
To read the contents of a file you first look it up in the
directory, which gives you the inode number, which is used to
find the inode in the B-Tree. The inode then contains data
about the file and in which disk-blocks the file data is
stored. For large files (or sparse files) you can once again
use a B-Tree for the disk-blocks to speed up random access.

A full implementation of a B-Tree would probably slow things
down, compared to what most filesystems actually implement. You
can't insert bytes into the middle of a file, so the system can
systematically keep all nodes to a specific level full (and know
the exact limits from the size of the file, which is stored in
the inode). Thus, to access a given address, it can go
immediately to the correct node, without any comparisons, with
just a little modulo arithmetic. (And until recently, the depth
was seriously limited, since you don't need much to cover 2GB.)
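
A sketch of that "just a little modulo arithmetic" in C++ (the fan-out
and block size are invented numbers): in a block-map tree whose nodes
are kept full at every level, the index to follow at each level can be
computed directly from the file offset, with no key comparisons:

    #include <cstdint>
    #include <vector>

    constexpr std::uint64_t kBlockSize = 4096;  // assumed
    constexpr std::uint64_t kFanout    = 512;   // block pointers per node, assumed

    // Child indices to follow, root first, to reach the data block
    // containing `offset` in a block-map tree of the given depth.
    std::vector<std::uint64_t> pathTo(std::uint64_t offset, int depth) {
        std::uint64_t block = offset / kBlockSize;
        std::vector<std::uint64_t> path(depth);
        for (int level = depth - 1; level >= 0; --level) {
            path[level] = block % kFanout;  // index within this level's node
            block /= kFanout;
        }
        return path;
    }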
The free space tree is much simpler; it just contains the
address of each free disk-block.

Since you don't have to access the free space list randomly, nor
maintain any order, there's absolutely no need for any sort of
tree structure; a simple linked list will do.
 

Jerry Coffin

On 2008-08-03 10:08, James Kanze wrote:
[...]
The nice thing about B-Trees is that they keep the depth of
the tree very low, which means that the number of nodes that
have to be examined before you find the data you want is also
low.

A well-designed hash table will typically require even fewer
nodes to be examined, at least with larger quantities of data.
I'm pretty sure that B-Trees are only advantageous when ordering
is also required.

Not so -- B-trees provide much stronger locality of reference than
hashing, as a rule.

[ ... ]
The size of the directory depends on the number of files it
contains, and possibly (likely) on the length of each filename.
Again, I've not had the occasion to measure recently, but on
older systems (Sun 3, in this case), performance would degrade
radically if you put too many files in a directory. (I've seen
a simple open take close to five minutes.) The response from
system support at the time was: don't put so many files in a
directory. On the whole, I think that they're right.

I haven't looked at Solaris recently to see what it does, but most
current Unix systems have much better support for directories that
contain a lot of files.

[ ... ]
Since you don't have to access the free space list randomly, nor
maintain any order, there's absolutely no need for any sort of
tree structure; a simple linked list will do.

I haven't studied this particular problem enough to be really sure, but
at least at first glance, it seems to me that you'd like a free list
that supported random access. Offhand, I can think of two situations in
which you're going to access the free list: either you're creating a new
file, or you're extending an existing file.

When you're extending an existing file, you almost certainly want to
pick a new block that's as close as possible to the current end of file.

When you create a new file, it'll most often be in a directory where
there are already some files -- and it's fair to guess that files in the
same directory will tend to be used together, so to minimize head
movement you'd also like to allocate them close together.

When you create the first file in a directory, you probably want to
allocate from a relatively large block of free space, to support
creating more files nearby as above, since it's _fairly_ unusual to
create a new directory that'll only ever contain one file.

The directory structure gives at least some clue of usage patterns, and
it would be best to take advantage of that information in allocating
file space.
 

Erik Wikström

I rather suspect that most systems keep the mapping inode number
to where the inode is located on disk in memory, initializing it
once, when the disk is mounted. (The original Unix kept the
inodes in a separate contiguous sub-partition, with the inode
number being used as a direct index. A very clever scheme to
maximize head movement:). I'm not sure what is being used
today, but for large volumes, at least, a hash table would seem
most appropriate.

With old FSs it was common to pre-allocate a certain number of inodes
when the disk was formatted, which had two problems: either you had more
inodes than you needed or you did not have enough. In the first case you
wasted space, in the second you were out of luck. New FSs allocate
inodes on the fly, and the trend seems to be to put as much as possible
in the B-Tree (or Hash-tree) nodes and only keep the directory and file
data separate.
The size of the directory depends on the number of files it
contains, and possibly (likely) on the length of each filename.
Again, I've not had the occasion to measure recently, but on
older systems (Sun 3, in this case), performance would degrade
radically if you put too many files in a directory. (I've seen
a simple open take close to five minutes.) The response from
system support at the time was: don't put so many files in a
directory. On the whole, I think that they're right.

One possible solution to this is to hash the path and filename and use
that as the key into the tree.
A full implementation of a B-Tree would probably slow things
down, compared to what most filesystems actually implement.

Of course, no one uses a plain B-Tree (or B+Tree); there is always some
form of adaptation (such as linking the leaf-nodes into a list, etc.).
Since you don't have to access the free space list randomly, nor
maintain any order, there's absolutely no need for any sort of
tree structure; a simple linked list will do.

A linked list would be horrible; imagine how many reads you might have
to do before you find enough free space to create the file, or space
which is close enough to the file you are appending to. XFS used two
synced B-Trees to manage free space: one was sorted by logical address,
which is handy when trying to find free space close to some already
allocated file (such as when appending data). The other tree was sorted
by the size of the free space, which is nice when trying to find the
best place for a file (as when creating).
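
A toy version of that two-index scheme, with in-memory maps standing in
for the two on-disk B-Trees (extents are (start, length) in blocks;
both maps must of course be updated together on every allocation and
free, which is what "synced" means here):

    #include <cstdint>
    #include <map>
    #include <optional>

    std::map<std::uint64_t, std::uint64_t>      byAddr; // start -> length
    std::multimap<std::uint64_t, std::uint64_t> bySize; // length -> start

    // Appending: the free extent starting nearest to `hint`
    // (e.g. the current end of the file).
    std::optional<std::uint64_t> allocNear(std::uint64_t hint) {
        if (byAddr.empty()) return std::nullopt;
        auto it = byAddr.lower_bound(hint);
        if (it == byAddr.end()) --it;  // nothing above the hint: take the last
        return it->first;
    }

    // Creating: best fit, the smallest extent of at least `len` blocks.
    std::optional<std::uint64_t> allocBestFit(std::uint64_t len) {
        auto it = bySize.lower_bound(len);
        if (it == bySize.end()) return std::nullopt;
        return it->second;
    }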

Anyway, this is moving far off topic. What I wanted to demonstrate to
the OP was that B-Trees are still in use today, though they might not be
as common as some other data structures.
 

James Kanze

On 2008-08-03 10:08, James Kanze wrote:
[...]
The nice thing about B-Trees is that they keep the depth
of the tree very low, which means that the number of nodes
that have to be examined before you find the data you want
is also low.
A well-designed hash table will typically require even fewer
nodes to be examined, at least with larger quantities of
data. I'm pretty sure that B-Trees are only advantageous
when ordering is also required.
Not so -- B-trees provide much stronger locality of reference
than hashing, as a rule.

I don't really see why. Wouldn't it depend on the allocation
policy for the hash table?
[ ... ]
The size of the directory depends on the number of files it
contains, and possibly (likely) on the length of each
filename. Again, I've not had the occasion to measure
recently, but on older systems (Sun 3, in this case),
performance would degrade radically if you put too many
files in a directory. (I've seen a simple open take close
to five minutes.) The response from system support at the
time was: don't put so many files in a directory. On the
whole, I think that they're right.
I haven't looked at Solaris recently to see what it does, but
most current Unix systems have much better support for
directories that contain a lot of files.

One would hope so. But the couple of times I've had to deal
with directories with a lot of files under Solaris 8, it was
very slow as well. Of course, each time has been the result of
a programming error generating millions of files in the
directory; presumably, optimizations would still be tuned to a
somewhat smaller number.

What I can say is that the entries in a directory are not stored
in any particular order, whereas ls and the expansion of * in
the shell return alphabetical order. Which means that the
entire directory must be loaded into memory, which in turn means
thrashing if it is too big, regardless of the structure on
disk. (But of course, this means that there's not much
advantage in optimizing that structure for very large numbers of
files.)
[ ... ]
Since you don't have to access the free space list randomly, nor
maintain any order, there's absolutely no need for any sort of
tree structure; a simple linked list will do.
I haven't studied this particular problem enough to be really
sure, but at least at first glance, it seems to me that you'd
like a free list that supported random access. Offhand, I can
think of two situations in which you're going to access the
free list: either you're creating a new file, or you're
extending an existing file.
When you're extending an existing file, you almost certainly
want to pick a new block that's as close as possible to the
current end of file.
When you create a new file, it'll most often be in a directory
where there are already some files -- and it's fair to guess
that files in the same directory will tend to be used
together, so to minimize head movement you'd also like to
allocate them close together.
When you create the first file in a directory, you probably
want to allocate from a relatively large block of free space,
to support creating more files nearby as above, since it's
_fairly_ unusual to create a new directory that'll only ever
contain one file.

(You don't know the Unix culture. I've seen empty directories
created for locking purposes:).)

More generally, those both sound like interesting heuristics.
But somehow, given the evolution of systems (if it ain't broke,
don't fix it, etc.), I doubt that they're widespread in current
Unix or Windows. (Maybe Linux... where the philosophy seems at
times to be "if it ain't broke, it ain't complicated enough".)
The directory structure gives at least some clue of usage
patterns, and it would be best to take advantage of that
information in allocating file space.

I'd actually be interested in knowing if work has been done in
these directions, and whether it has been incorporated into
modern systems. They sound interesting, but from what little I
know of actual systems, it would surprise me if such ideas had
been incorporated, even if research had proven that they were
effective.

(There's also the question of what happens in a multiuser
environment. There's no gain in having the sectors in your file
adjacent, if between your accesses, another user has accessed a
sector somewhere completely different.)
 

James Kanze

With old FSs it was common to pre-allocate a certain number of
inodes when the disk was formatted, which had two problems:
either you had more inodes than you needed or you did not have
enough. In the first case you wasted space, in the second you
were out of luck. New FSs allocate inodes on the fly, and the
trend seems to be to put as much as possible in the B-Tree (or
Hash-tree) nodes and only keep the directory and file data
separate.

New, being at least everything since the Berkeley 4.2 kernel,
right:). Except that I've never heard of inodes being kept in a
B-Tree, and I don't see what that would gain (or even how it
would be possible, since you don't even have any reasonable
ordering criterion).
One possible solution to this is to hash the path and filename
and use that as the key into the tree.

There are lots of possible solutions. What I do know is that,
under Solaris at least, the files in a directory are in no
particular order. So I can't see what a B-Tree would be doing
in there.
A linked list would be horrible; imagine how many reads you
might have to do before you find enough free space to create the
file, or space which is close enough to the file you are
appending to.

One read per allocation block. (Supposing you don't have
anything cached. I would imagine that a good system would have
at least some of the data cached.)
XFS used two synced B-Trees to manage free space: one was
sorted by logical address, which is handy when trying to find
free space close to some already allocated file (such as when
appending data). The other tree was sorted by the size of the
free space, which is nice when trying to find the best place for
a file (as when creating).

If the file system supports contiguous files, this is definitely
an advantage. The filesystems I'm familiar with today don't.
:-(.
Anyway, this is moving far off topic. What I wanted to
demonstrate to the OP was that B-Trees are still in use today,
though they might not be as common as some other data
structures.

I'm sure that they have some uses. Any time you need both
random and sorted access, disk based.
 

Erik Wikström

New, being at least everything since the Berkeley 4.2 kernel,
right:). Except that I've never heard of inodes being kept in a
B-Tree, and I don't see what that would gain (or even how it
would be possible, since you don't even have any reasonable
ordering criterion).

New as in the last few years. I might have confused you by using the
term inode; perhaps meta-data block or some other term would be better,
to avoid associations with UFS. One ordering criterion would be to use
the inode (or meta-data block) number, another to hash the path.

As for the gain, while I'm not an expert on the area, there are at least
two FSs that I know of (one in early development) which use unified
B-Trees to store everything but the actual file data (and perhaps the
free blocks, not too sure about how that works).
There are lots of possible solutions. What I do know is that,
under Solaris at least, the files in a directory are in no
particular order. So I can't see what a B-Tree would be doing
in there.

Solaris (AFAIK) uses UFS, which is hardly a modern FS, but it is old and
proven, which is usually more important than features and performance in
the enterprise market (especially when both features and performance can
be added in the form of extra software and RAID arrays). Actually UFS
(or rather FFS) was built around an idea which is no longer possible
(optimisation based on disk layout) since disks hide their actual
geometry.

And where UFS is not used, it's some other enterprise storage solution.
But the fact is that few if any commodity FSs were designed for multi-TB
disks, multi-GB files, and the number of files you can put on a modern
disk. As TB disks become commodity products, new FSs to handle them are
being developed, and the sizes of the new disks (coupled with
requirements such as virtually instant crash-recovery, e.g. no need for
fsck) require new designs; B-Trees (or derivatives) seem to be the
structure of choice.

If the file system supports contiguous files, this is definitely
an advantage. The filesystems I'm familiar with today don't.
:-(.

No FS I know of does, but that does not mean that trying to keep up
locality of reference is a bad idea.
 

Jerry Coffin


[ ... ]
(You don't know the Unix culture. I've seen empty directories
created for locking purposes:).)

That's not a problem at all -- an empty directory (or an empty file)
only involves creating a directory entry. In addition, you're only
really concerned when you have difficulty allocating free space on the
disk because it has all been fragmented -- so something created for
locking purposes isn't likely to last long enough to make any real
difference anyway.
More generally, those both sound like interesting heuristics.
But somehow, given the evolution of systems (if it ain't broke,
don't fix it, etc.), I doubt that they're widespread in current
Unix or Windows. (Maybe Linux... where the philosophy seems at
times to be "if it ain't broke, it ain't complicated enough".)

:)

[ I don't have time to write more now, but I'll probably post another
follow-up to your post when I can...]
 

James Kanze

New as in the last few years. I might have confused you by
using the term inode; perhaps meta-data block or some other
term would be better, to avoid associations with UFS. One
ordering criterion would be to use the inode (or meta-data
block) number, another to hash the path.

But why? That's what I'm trying to figure out. I'm familiar
with what B-Trees are classically used for (ISAM), and they seem
very much optimized for that: rapid direct access, but also
sequential access according to an ordering criterion. I don't
see where the second would ever apply to inodes or meta-blocks.
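
Those two access patterns are easy to show with std::map (an in-memory
ordered tree standing in for the on-disk ISAM B-Tree):

    #include <iostream>
    #include <map>
    #include <string>

    int main() {
        std::map<int, std::string> isam{
            {10, "alpha"}, {20, "beta"}, {30, "gamma"}, {40, "delta"}};

        // Rapid direct access by key:
        std::cout << isam.at(30) << '\n';  // gamma

        // Sequential access in key order, from a starting key:
        for (auto it = isam.lower_bound(20);
             it != isam.end() && it->first < 40; ++it)
            std::cout << it->first << ' ' << it->second << '\n';
    }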
As for the gain, while I'm not an expert on the area, there are
at least two FSs that I know of (one in early development)
which use unified B-Trees to store everything but the actual
file data (and perhaps the free blocks, not too sure about how
that works).

OK. I can sort of see why one might want a unified method, and
if B-Trees have distinct advantages in some cases, why not use
them systematically. Jokingly: if all you have is a hammer,
everything looks like a nail. Realistically, however: you have
finite development time, and it's probably preferable to spend
it getting one structure 100% right, rather than three different
structures only more or less right. Once you've got the B-Tree,
you use it unless there is a distinct disadvantage in doing so.

[...]
Solaris (AFAIK) uses UFS, which is hardly a modern FS,

I didn't say it was modern:).
but it is old and proven, which is usually more important than
features and performance in the enterprise market (especially
when both features and performance can be added in the form of
extra software and RAID arrays).

That's sort of what I hinted at in my reply to Jerry: if it
ain't broke, don't fix it (vs if it ain't broke, it ain't
complicated enough). There's also the fact that there doesn't
seem to be as much money to be made making your system more
efficient as there is to be made selling new, more powerful
hardware, and that I've never heard of anyone benchmarking this
sort of thing in procurement procedures. Solaris is by far
the system I'm most familiar with, and it was definitely my
impression that what I'd learned back when I was working on
kernel software (some 20 or more years ago) hadn't really
changed that much.

[...]
No FS I know of does, but that does not mean that trying to keep
up locality of reference is a bad idea.

It depends. If you're going to read sector by sector anyway, on
a multi-user system, it probably doesn't matter. Someone else
is going to access elsewhere between your accesses anyway.
 
