Sorted indexes in Python

Glenn Linderman

So for a particular project, the "design center" is handling in-memory
databases with 10,000 rows or so... but we might have a
few dozen tables, so we want to be efficient anyway.

One of my "hare-brained" ideas is to not save the indexes on
disk, but to rebuild them when reading in the database. One benefit is
that the files are smaller, another is that the indexes shouldn't get
too fragmented (always a problem to address with btree indexes).
To make this viable, we have to be able to rebuild the indexes quickly.

There are two kinds of indexes: hash and sorted. Hash indexes are
generally faster, but are not sorted, require unique key values, and
cannot be used to find ranges of values. Sorted indexes are generally
slower, allow duplicate key values, and can find ranges of values.
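
To make that trade-off concrete, here is a toy illustration (made-up
sample data, nothing from the real project): a dict gives constant-time
exact lookups, while a sorted list plus the bisect module handles range
scans.

    import bisect

    # toy rows: (key, row_id) -- keys happen to be unique here
    rows = [("pear", 0), ("apple", 1), ("fig", 2), ("kiwi", 3)]

    # hash index: dict from key to row id; exact-match lookups only
    hash_index = dict(rows)
    print(hash_index["fig"])            # -> 2

    # sorted index: the same pairs kept in key order
    sorted_index = sorted(rows)

    # range lookup: every row with "b" <= key < "l"
    lo = bisect.bisect_left(sorted_index, ("b",))
    hi = bisect.bisect_left(sorted_index, ("l",))
    print(sorted_index[lo:hi])          # -> [("fig", 2), ("kiwi", 3)]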

There are five types of operations that can be done on indexes: load
index, insert row, delete row, delete index, lookup row. I've been
benchmarking the first two.

Load is whatever way one can dream up to create the index fast, in
bulk, when the complete set of data already exists, whereas insert
must be able to take an existing index, and add a new entry. One can
load the index using insert operations, but it generally is slower
than other techniques.

First, I measured how long it takes to generate a table of random or
calculated values. I determined that with 10,000 rows, all the
techniques I tried are reasonably fast. So I then tried with 100,000
rows, and with 1,000,000 rows. These are big enough to start to
measure the timings more accurately, and reflect the differences in
the algorithms.

The numbers below are in operations per second; the first column is
for 100,000 rows, the second for 1,000,000 rows.

                          100,000 rows   1,000,000 rows
create rows:                     61821            39215
hash creation:                  217371            41078
red-black tree:                  14168              <35
insertion sort:                  18100             1060
loaded sort vector:              90278            32464
Glenn's special:                 46557            19381
Glenn's special load:            88569            30480
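
For reference, these figures come from straightforward wall-clock
timing; a minimal sketch of that kind of harness looks like this (the
row-creating lambda is just an illustration, not the actual benchmark
code):

    import time

    def ops_per_second(operation, count):
        """Time `count` calls of operation(i) and return the rate."""
        start = time.time()
        for i in range(count):
            operation(i)
        elapsed = time.time() - start
        return count / elapsed

    # example: raw row creation into a plain list
    table = []
    rate = ops_per_second(lambda i: table.append((i, "row %d" % i)), 100000)
    print("create rows: %d ops/sec" % rate)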

Note that the more rows there are, the longer it takes to create each
one. This might seem surprising, until you realize that the list
itself (even if not the data contained within it) has to keep being
reallocated and copied, and it is not clear that Python's algorithm
for that is particularly brilliant... in any case, the copy operation
keeps taking longer. One would hope that the allocation of additional
space for a list would be some form of doubling of size, but it is hard
to tell.
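
For what it's worth, the reallocation behaviour can be watched directly
with a little probe like this (idle curiosity code, not part of the
benchmark); each line printed marks a point where the list's allocation
actually grew:

    import sys

    lst = []
    last = sys.getsizeof(lst)
    for i in range(100000):
        lst.append(i)
        size = sys.getsizeof(lst)
        if size != last:
            # the list object was reallocated (and its pointers copied)
            print("len %6d: %d bytes" % (len(lst), size))
            last = size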

The next thing to notice is that hashes, a structure built into
Python (as dicts), seem to slow down significantly with large numbers of
entries. They are extremely fast up to 100,000 entries, but then slow
down. I don't know if that is due to some limit in the
implementation, or whether the data I generated eventually reached a
point where more buckets didn't help, because the hash values started
colliding. I could research that by doing some statistics on the hash
values, likely. But it seems that for our expected row counts, we
aren't going to have a problem with hashes.
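
If I do get around to checking, something rough like the following
would do; the keys here are stand-ins for the generated data, and the
bucket count is a guess rather than CPython's real internal table size:

    # rough collision check on the generated key values
    keys = ["key-%d" % i for i in range(1000000)]
    nbuckets = 1 << 21            # assumed power-of-two table size
    buckets = {}
    for k in keys:
        b = hash(k) % nbuckets
        buckets[b] = buckets.get(b, 0) + 1
    worst = max(buckets.values())
    print("buckets used: %d of %d, worst bucket holds %d keys"
          % (len(buckets), nbuckets, worst))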

Not all indexes can usefully be expressed as hashes, however. The
typical database sorted index is based around btrees or some
variant. Btrees are designed for efficiency when not all the data
can be kept in memory. Since we intend to keep data in memory, other
tree structures seemed interesting: binary trees, AVL trees, red-black
trees. The C++ language STL implements red-black trees. Curiously,
there is not an implementation of _any_ tree structure in the Python
standard library (including the recently released 2.6).

Of course, in using an interpreted language, it is helpful to find
some of the complex algorithms "built in", and thus coded in the
faster implementation language. I found a Python implementation of
red-black trees, and some implementations of AVL trees, one of which
was coded in C, but not ported to all the platforms we are interested
in (it may be a trivial port, but... you don't know for sure until you
try). So I tried only the Python implementation, not expecting it to
complete well with hashes, but I expected, with sufficient numbers of
rows, for it to compete well with an insertion sort. It didn't. In
fact, the million row creation didn't even complete in 8 hours (I was
sleeping), so it is unknown how much worse than the above reported
number it would really be.

With the insertion sort, each row was created, and added to the right
position in a sorted list. The Python standard library bisect module,
coded in C, implements a binary search (similar in performance to a
binary tree, but based on a sorted vector) to quickly determine the
appropriate insertion point. The insertion itself, even though handled
in C, has to shift every element after the insertion point, so building
the whole index this way is O(n-squared). That is why you see such a
significant difference between the 100,000 and million row cases.
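
In code, that insertion-sort approach boils down to something like this
(a sketch with bare keys paired with row ids):

    import bisect

    index = []                    # sorted list of (key, row_id) pairs

    def insert_row(key, row_id):
        # the binary search is fast, but list.insert still has to
        # shift every element after the insertion point
        bisect.insort_right(index, (key, row_id))

    for row_id, key in enumerate([17, 3, 42, 3, 8]):
        insert_row(key, row_id)

    print(index)   # [(3, 1), (3, 3), (8, 4), (17, 0), (42, 2)]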

The loaded sort vector case really speeds up the initial creation of
the sorted vector by creating an unsorted vector, and then sorting
it. Clever, huh? The sort algorithm (Timsort, essentially an adaptive
merge sort) is faster than the insertion sort business. But that only
works for the first
time through, because it would be slower to resort the data for each
newly inserted row. The benefit is obtained by having whole bunches
of rows to start with, when doing the sort. But it would make the
load fast, and the other operations are distributed among user
interactions. At 10,000 rows, it might actually be a workable
solution.
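
The load path is about as simple as it sounds, again sketched with bare
keys:

    def load_index(keys):
        """Bulk-build the sorted index: collect everything, sort once."""
        index = [(key, row_id) for row_id, key in enumerate(keys)]
        index.sort()              # Timsort, running in C
        return index

    index = load_index([17, 3, 42, 3, 8])
    # later single-row inserts can still go through bisect.insort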

But, I thought, perhaps there is some way to be faster. Maybe
finding, porting, or coding one of the tree structures in C would be
faster... but it could be a lot of work to get that right. Such would
be an alternative that we could attempt in the future, perhaps. And
then, the old "aha" feeling hit me! What about taking the big sorted
vector, and splitting it into little sorted vector-lets, each a subset
of the whole, and having another vector that points to each of the
vector-lets? That is what Glenn's special is: a custom sort object.
Conveniently, the main vector and each vector-let can be manipulated
by the provided bisect module. When inserting, only one vector-let
needs to be expanded. To keep the vector-lets about the same size, I
defined a maximum and a minimum length for them (the user can set these
when constructing the sort object; empirically, the square root of the
total number of rows seems to be a good maximum size for a vector-let,
which keeps the cost of the binary searches nearly identical to the
single-vector case).
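
I won't post the whole class here, but the shape of the idea is roughly
the following. This is a simplified sketch rather than the actual
implementation: bare keys only, a fixed maximum size, and none of the
minimum-size merging or deletion handling:

    import bisect

    class TwoLevelIndex(object):
        """Sorted index kept as a main vector of sorted vector-lets."""

        def __init__(self, maxlen=100):       # maxlen ~ sqrt(expected rows)
            self.maxlen = maxlen
            self.chunks = [[]]                # the vector-lets, each sorted
            self.firsts = []                  # first key of each vector-let

        def insert(self, key):
            if not self.firsts:               # very first insertion
                self.chunks[0].append(key)
                self.firsts.append(key)
                return
            # top-level binary search picks the vector-let...
            i = bisect.bisect_right(self.firsts, key) - 1
            if i < 0:
                i = 0
            chunk = self.chunks[i]
            # ...and only that vector-let has to shift its elements
            bisect.insort_right(chunk, key)
            self.firsts[i] = chunk[0]
            if len(chunk) > self.maxlen:      # split an oversized vector-let
                half = len(chunk) // 2
                self.chunks[i:i + 1] = [chunk[:half], chunk[half:]]
                self.firsts[i:i + 1] = [chunk[0], chunk[half]]

        def __iter__(self):                   # keys in sorted order
            for chunk in self.chunks:
                for key in chunk:
                    yield key

Lookups and range scans bisect the firsts vector the same way insert
does, so the two shorter binary searches together cost about the same as
one search over a single big vector.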

You'll note that the insertion rate for Glenn's special is
significantly better than that of the plain insertion sort, especially
for larger data sets. On the other hand, due to the more complex data
structure, the best "load" operation I could come up with is to create
the single vector, sort it, and then divide it into vector-lets.
Doing so is clearly a bit more costly than the loaded sort vector
case, because it is doing more work. Fortunately, it is not a drastic
slowdown, and is probably worth the extra cost at startup time.

I think I'll wrap the bisect module into a custom sort object with the
same interface as Glenn's special; it should also be an adequate
interface for a future sorted technique, should an appropriate search-
tree structure be added to the Python standard library, or become
available elsewhere. That'll give me an easy way to try out algorithms
that might be suggested here, as time permits... just wrap the algorithm
with the minimally necessary API, and start measuring!
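
The wrapper itself would be nothing more than something like this (the
method names are just what I happen to like, not a settled API):

    import bisect

    class BisectIndex(object):
        """Single sorted vector, behind the same interface as the
        two-level version."""

        def __init__(self):
            self._keys = []

        def load(self, keys):                 # bulk build: sort once
            self._keys = sorted(keys)

        def insert(self, key):                # single-row insert
            bisect.insort_right(self._keys, key)

        def __iter__(self):
            return iter(self._keys)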

Comments and suggestions welcome on any of the related subtopics!
 
