Builtin classes list, set, dict reimplemented via B-trees

B

barnesc

B-trees are specifically designed for disk storage. Seeking to a
page takes much longer than reading a page of several kilobytes.
B-trees read the keys for a many-way comparison in one seek.

For in-memory comparison-based dictionaries, Red-Black trees,
AVL trees, 2-3 trees, or skip-lists, are a better choice.

---------------------------------------------------------------------
Memory Usage
---------------------------------------------------------------------

Here overhead is compared to a C array of > 1 million PyObject *s.

The B-tree is assumed to have a branch factor of ~1024.

A B-tree has an average of 1.5x overhead (2x at worst, 1x at best,
using malloc). This assumes that B-tree nodes contain a fixed
size C array for storing values (which will be half-full on
average), and that the child array is only allocated for internal
nodes as needed.
Any balanced binary tree with left and right child pointers has
a 6x overhead (using malloc, measured via a gcc win32 test program),
or with a free list, a 3x overhead (though memory will not
be returned until the data structure's destructor is called).
I haven't tried using PyObject_Malloc(), but it should be in
between these two factors.
A skip list has n values and n next pointers on the bottom level.
In the ideal case (with respect to memory usage), we make the
probability sufficiently small so that the higher levels take up a
negligible amount of memory. Thus we have a 4x overhead (using
malloc, measured via a gcc win32 test program), or with a free list,
an overhead slightly greater than 2x (again, memory will only be
returned when the destructor is called). I didn't test
PyObject_Malloc(), but it should give an overhead between 2x and 4x.

Thus, on average, a > 1 million element B-tree uses 25% less memory
than any other balanced data structure which I am aware of, and 50%
more memory than a raw C array.


---------------------------------------------------------------------
Speed
---------------------------------------------------------------------

I have no idea how B-trees compare to skip lists (the likely
contender) in terms of speed.

To do this, one would need two high performance C implementations.
From this, the Python performance can presumably be deduced (although
caching causes unpredictable effects, the faster C algorithm should be
faster in Python).

You could start with optimized parallel C implementations, such as
the ones by John-Mark Gurney:

* http://resnet.uoregon.edu/~gurney_j/jmpc/btree.html
* http://resnet.uoregon.edu/~gurney_j/jmpc/skiplist.html

He indicates that the skip list is slower than the B-tree, but doesn't
give any solid numbers. Of course you'd have to check his code and
optimize it to death (Perhaps representing a sorted set of integers
using each respective data structure would be a good test).

Also note that my B-tree code currently has optimized special cases
for block insertions, deletions, and retrievals, since this is easy
(er, well, relatively speaking) to do in the B-tree case. In the skip
list case, block retrievals are easy (though not as fast as B-tree
block retrievals due to pointer indirection and cache problems). I
have no idea how to do a fast block insert or delete on a skip list.

----------------------------------------------------------------------

Well, enough ivory tower silliness and academic blustering.

<on_topic>

Are there any *practical* applications for in-memory balanced data
structures (e.g. skip list, AVL tree, RB tree, B tree) ?

</on_topic>

- Connelly Barnes
 
S

Szabolcs Nagy

IMO sorted dict implementation can be useful, eg. one can get an
interval:
L = D['A' : 'K']

other useful data types:
linkedlist
queue, stack (well deque can do it efficiently in py 2.4)
prioritydict (for graph algorithms)
multimap, multiset (i've never used it but it's in the c++ stl)
mutable string (kind of list/array of chars, but with string functions)

nsz
 
B

Bryan Olson

> Here overhead is compared to a C array of > 1 million PyObject *s.
>
> Thus, on average, a > 1 million element B-tree uses 25% less memory
> than any other balanced data structure which I am aware of, and 50%
> more memory than a raw C array.

That's overhead of indexing; it doesn't consider the space
already used to store the keys and values. The B-tree can get by
with modestly fewer pointers, because it has fewer internal
nodes that need to be referenced by other internal pointers.
That assumes that the B-tree nodes are kept as linear arrays,
which means that either inserting into them will time
proportional to the fan-out, or searching them will.

[...]
> I have no idea how B-trees compare to skip lists (the likely
> contender) in terms of speed.

I'd expect Red-Black trees to be at least as good a contender.
> Are there any *practical* applications for in-memory balanced data
> structures (e.g. skip list, AVL tree, RB tree, B tree) ?

Yes, absolutely. Efficient ordered sub-ranges; efficient rank
queries; sustainable performance with adversarial inputs.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,755
Messages
2,569,536
Members
45,007
Latest member
obedient dusk

Latest Threads

Top