Data structure recommendation?

S

Steven Clark

Hi all-

I'm looking for a data structure that is a bit like a dictionary or a
hash map. In particular, I want a mapping of floats to objects.
However, I want to map a RANGE of floats to an object.

This will be used for timestamped storage / lookup, where the float
represents the timestamp.
get(x) should return the object with the "newest" (biggest) timestamp
y <= x, if it exists.
Example:

foo = Foo()

foo.get(1.5)
-> None
foo.put(1.3, 'a')
foo.put(2.6, 'b')
foo.get(1.5)
-> 'a'
foo.get(7.8)
-> 'b'
foo.put(5.0, 'c')
foo.get(7.8)
-> 'c'

In otherwords, by the end here, for foo.get(x),
x < 1.3 maps to None,
1.3 <= x < 2.6 maps to 'a',
2.6 <= x < 5.0 maps to 'b',
5.0 <= x maps to 'c'.

I know that foo.get() will be called many times for each foo.put(). Is
there any way to achieve O(1) performance for foo.get(), maybe via
some kind of hash function? Or is the best thing to use some kind of
binary search?

Thanks for any advice.

-Steven
 
M

Martin v. Löwis

I know that foo.get() will be called many times for each foo.put(). Is
there any way to achieve O(1) performance for foo.get(), maybe via
some kind of hash function? Or is the best thing to use some kind of
binary search?

If you know something about the density of the input values, O(1) is
possible. E.g if there is a guarantee that there will be between 1
and 10 values per unit of input, then truncate the "event time" to
the next-lower int, and use that as an index k into a list; the list
item will be a list of events between k and k+1. As you know that
there is an upper bound to the number of such events (10), you
know that searching the list will take bounded (i.e. constant) time.
Likewise, as you know that there will be atleast one event per
(k,k+1) interval, you know that you have to scan only one list.

If you know no such thing, you'll have to use a binary search.

Regards,
Martin
 
M

Martin v. Löwis

I know that foo.get() will be called many times for each foo.put(). Is
there any way to achieve O(1) performance for foo.get(), maybe via
some kind of hash function? Or is the best thing to use some kind of
binary search?

If you know something about the density of the input values, O(1) is
possible. E.g if there is a guarantee that there will be between 1
and 10 values per unit of input, then truncate the "event time" to
the next-lower int, and use that as an index k into a list; the list
item will be a list of events between k and k+1. As you know that
there is an upper bound to the number of such events (10), you
know that searching the list will take bounded (i.e. constant) time.
Likewise, as you know that there will be atleast one event per
(k,k+1) interval, you know that you have to scan only one list.

If you know no such thing, you'll have to use a binary search.

Regards,
Martin
 
J

Jochen Schulz

* Steven Clark:
Hi all-

I'm looking for a data structure that is a bit like a dictionary or a
hash map. In particular, I want a mapping of floats to objects.
However, I want to map a RANGE of floats to an object.

This solution may be more than you actually need, but I implemented two
metric space indexes which would do what you want (and I wanted to plug
it anyway :)):

http://well-adjusted.de/mspace.py/

You can do arbitrary range searches and nearest-neighbour search in
these indexes provided you have a metric distance function for the
objects you want to index. The distance function in your case would
simply be the absolute difference between your floats.

If every log message was contained in objects of this type:

class LogMessage(object)
def __init__(self, timestamp, object):
self.timestamp = timestamp
self.message = message

you could write a distance function for these objects like this:

def log_distance(log1, log2):
return abs(log1.timestamp - log2.timestamp)

and then you would create and search an index like this:

from mspace import VPTree
index = VPTree(all_log_messages, log_distance)
index.search(single_msg, 10)

The last line would yield all log messages within ten seconds (or
whatever the unit of your timestamps is) of the given LogMessage
single_msg. You could also use

index.nn_search(single_msg, 3)

To find the three messages closest to the given single_msg.

Searching should be quite fast (O(log(n)), but indexing might take some
time (O(n*log(n))). The problem is that VPTrees have to be constructed
from the complete data set initially. They cannot grow. The alternative
from mspace.py, BKTrees, may grow over time but they don't work with
non-discrete distances. If you want to use them, you'd have to convert
your timestamps to int/long first.

J.
 
B

bearophileHUGS

Jochen Schulz:
This solution may be more than you actually need, but I implemented two
metric space indexes which would do what you want (and I wanted to plug
it anyway :)):

Please plug such good things. It seems the Python community is an
endless source of interesting modules I didn't know about. Your
(single) module looks very nice. I'll take a better look later.

Few notes:
Use xrange instead of range, for Psyco they are often the same, but if
you don't have Psyco the range may be slower and usually requires more
memory.
Java is faster than Python, but if you have some care and you know how
things are implemented in Python, you can often write ""fast"" Python
code.

In the levenshtein function you have this:
for i in range(1, m+1):
prev, cur = cur, + [0]*n
You can speed up that levenshtein, using nearly constant (heap)
memory, avoiding that re-creation of prev and cur (see below for D
code you can back-translate to Python).

You can translate this module to D (I may even do it myself, or I may
help you do it), this has several advantages:
- If you are careful it may become (quite) faster than your Java
version.
- The resulting code may be probably as long as the Python one, or not
too much longer (but it may have or not have some extra limitations. D
is flexible but it's a statically typed language, unlike Python);
- Coding in D may be similar enough to Java coding for you, quite
differently from coding it in C;
- You can then use Pyd (http://pyd.dsource.org ) to create in a very
simple way a single compiled module with the functions/classes that
can be called by python (no intermediate python needed). Using Pyd is
quite simpler than using C + Swig or writing a C module for Python.
Later you can release the D code plus the already compiled module for
Win too.
- Disadvantages: you don't know D, you don't know how to use Pyd, and
most people out there may prefer to maintain/modify C code, even if
it's 3/4 times longer than the D code.
- The following is to show you an example of D code (that uses my D
libs for few bits, like that range(), the min(), but it's not too much
difficult to avoid using them) that you may compare to the Python one.
Note that this code is generic (it's a function template), and it
takes as input any array, that means Unicode Strings too (utf8, 16 bit
too, 32 bit too), it runs about 105-125 times faster than a quite
similar Python version (Psyco is able to speed up this Python code
about 20-25 times, so this is 4-5 times faster than Psyco):

int editDistance(T)(T[] s1, T[] s2) {
if (len(s1) > len(s2)) { auto sa = s1; s1 = s2; s2 = sa; }
auto r1 = range(len(s2) + 1);
auto r2 = new int[len(r1)];
foreach (i, c1; s1) {
r2[0] = i + 1;
foreach (j, c2; s2)
r2[j+1] = c1 == c2 ? r1[j] : min(r2[j], r1[j], r1[j+1]) +
1;
auto ra = r1; r1 = r2; r2 = ra; // swap lines
}
return r1[$ - 1];
}

If you have questions feel free to ask :)

Bye,
bearophile
 
B

bearophileHUGS

Few more notes on the code:

You may use the @property in such situations (or you may just use
attributes, dropping the property). Note that Python doesn't inline
functions calls like Java HotSpot does quite often.
def __children(self):
raise NotImplementedError()
children = property(__children)
==> (not tested!)
@property
def __children(self):
raise NotImplementedError()


If the profiling shows this is s slow, and it's used on lot of items,
there are faster O(1) algorithms for this, like this one:
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/466330
@staticmethod
def determine_median(numbers):
return sorted(numbers)[ len(numbers) / 2 ]

I have seen you use lot of double underscores (I think one may often
suffice, but usually two aren't a problem):
def __children(self):

Stripped of comments, docstrings, and empty lines the code is just 280
lines long (it's 1498 with them!), so a translation doesn't seem too
much work (expecially if the original version was Java). The code seem
really well commented, organized, etc. And I like it, it looks better
than 90+% of the commercial Python code I see :)

Bye,
bearophile
 
B

bearophileHUGS

More bits from your code:

neighbours = list()
==>
neighbours = []

If you have a recent enough version of Python you can use:
candidate_is_neighbour = any(distance < n[1] for n in neighbours)
Instead of:
candidate_is_neighbour = bool([1 for n in neighbours if distance <
n[1]])

It's shorter & simpler, and it stops as soon as it finds a true
condition.

Bye,
bearophile
 
J

Jochen Schulz

* (e-mail address removed):
Please plug such good things. It seems the Python community is an
endless source of interesting modules I didn't know about. Your
(single) module looks very nice. I'll take a better look later.

Could you please send me an email with an existing From: address? I
tried to reply to you but apparently your From: is forged.

J.
 
B

bearophileHUGS

Jochen Schulz:
Could you please send me an email with an existing From: address? I
tried to reply to you but apparently your From: is forged.

Sorry for the delay, I'll send you an email.
In the meantime I have created a fast BK-tree implementation for
Psyco, on my PC it's about 26+ times faster than mspace (that uses
psyco still), but it's less complete. You can find it here:
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/572156

(I have created a D version too, that's about 4.5 times faster than my
Python version, so it's about 126 times faster than the mspace with
Psyco).

Enjoy,
bearophile
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,774
Messages
2,569,596
Members
45,142
Latest member
arinsharma
Top