efficient interval containment lookup


Per Freem

hello,

suppose I have two lists of intervals, one significantly larger than
the other. For example, listA = [(10, 30), (5, 25), (100, 200), ...]
might contain thousands of elements while listB (of the same form)
might contain hundreds of thousands or millions of elements.

I want to count how many intervals in listB are contained within every
interval in listA. For example, if listA = [(10, 30), (600, 800)] and
listB = [(20, 25), (12, 18)] is the input, then the output should be
that (10, 30) has 2 intervals from listB contained within it, while
(600, 800) has 0. (Elements of listB can be contained within many
intervals in listA, not just one.)

What is an efficient way to do this? One simple way is:

for a_range in listA:
    for b_range in listB:
        if is_within(b_range, a_range):
            # accumulate a counter here

where is_within simply checks if the first argument is within the
second.

I'm not sure whether it's more efficient to have the iteration over
listA on the outside or listB. But perhaps there's a way to index this
that makes things more efficient? I.e., a smart way of indexing listA
such that I can instantly get all of its elements that contain some
element of listB, maybe? Something like a hash, where this lookup can
be close to constant time rather than an iteration over the whole
list... if there are any built-in library functions that can help with
this, it would be great.

any suggestions on this would be awesome. thank you.
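For reference, a minimal runnable version of the naive approach sketched
above, assuming is_within means inclusive containment (a1 <= b1 and
b2 <= a2, which the poster confirms later in the thread):

def is_within(b_range, a_range):
    # inclusive containment: b fits entirely inside a
    return a_range[0] <= b_range[0] and b_range[1] <= a_range[1]

def count_contained(listA, listB):
    counts = dict((a, 0) for a in listA)
    for a_range in listA:
        for b_range in listB:
            if is_within(b_range, a_range):
                counts[a_range] += 1
    return counts

print(count_contained([(10, 30), (600, 800)], [(20, 25), (12, 18)]))
# {(10, 30): 2, (600, 800): 0}

This is O(len(listA) * len(listB)), which is exactly the cost the rest of
the thread tries to beat.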
 

Tim Chase

suppose I have two lists of intervals, one significantly larger than
the other. For example, listA = [(10, 30), (5, 25), (100, 200), ...]
might contain thousands of elements while listB (of the same form)
might contain hundreds of thousands or millions of elements.

I want to count how many intervals in listB are contained within every
interval in listA. For example, if listA = [(10, 30), (600, 800)] and
listB = [(20, 25), (12, 18)] is the input, then the output should be
that (10, 30) has 2 intervals from listB contained within it, while
(600, 800) has 0. (Elements of listB can be contained within many
intervals in listA, not just one.)

What is an efficient way to do this? One simple way is:

for a_range in listA:
    for b_range in listB:
        if is_within(b_range, a_range):
            # accumulate a counter here

Could you detail the is_within() function? Most importantly, is it
inclusive containment, such that "a1 <= b1 <= b2 <= a2", or is it
overlap, such that "a1 <= b2 and b1 <= a2"? Additionally, do you
want to count how many intervals of A overlap with intervals
of B, or do you just want a count of how many intervals in B have
*any* overlap within A?

My leaning would be to make a custom "is this in A" function,
iterate over B and test against an "appropriate subset" of A:

listA = sorted([...])
min_a1 = min(a1 for (a1, a2) in listA)
max_a2 = max(a2 for (a1, a2) in listA)

def in_a(b1, b2):
    for a1, a2 in smartly_chosen_subset(listA, b1, b2):
        if is_within((b1, b2), (a1, a2)):
            return True
    return False

i = 0
for b1, b2 in sorted(listB):
    if b2 < min_a1: continue
    if b1 > max_a2: break
    if in_a(b1, b2): i += 1

The in_a() function can be optimized to terminate early if a1 > b2
or a2 < b1 (perhaps even choosing a smart starting point for
iterating). Short-circuiting by returning True as soon as you
know there's a match will trim some of the time.
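A hedged sketch of one way to realize that "appropriate subset" (not
Tim's actual code; it assumes listA is sorted by start point and that
is_within means inclusive containment): bisect on the start points so
that only a-intervals starting at or before b1 are ever examined.

from bisect import bisect_right

def count_b_in_any_a(listA, listB):
    a_sorted = sorted(listA)                  # sorted by (a1, a2)
    a_starts = [a1 for a1, _ in a_sorted]     # parallel list of start points
    hits = 0
    for b1, b2 in listB:
        # only a-intervals with a1 <= b1 can possibly contain (b1, b2)
        hi = bisect_right(a_starts, b1)
        if any(a_sorted[i][1] >= b2 for i in range(hi)):
            hits += 1
    return hits

Like Tim's sketch, this counts how many listB intervals fall inside at
least one listA interval; the worst case is still a full scan of listA
per query, but the bisect cheaply discards every a-interval that starts
after b1.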

How distributed are the endpoints? If there's a lot of
commonality in the data, you might be able to cache results to
cut even further.

Just a few ideas, and questions that might help develop better
solutions.

-tkc
 

Robert Kern

[Apologies for piggybacking, but I think GMane had a hiccup today and missed the
original post]

[Somebody wrote]:
suppose I have two lists of intervals, one significantly larger than
the other. For example, listA = [(10, 30), (5, 25), (100, 200), ...]
might contain thousands of elements while listB (of the same form)
might contain hundreds of thousands or millions of elements.

I want to count how many intervals in listB are contained within every
interval in listA. For example, if listA = [(10, 30), (600, 800)] and
listB = [(20, 25), (12, 18)] is the input, then the output should be
that (10, 30) has 2 intervals from listB contained within it, while
(600, 800) has 0. (Elements of listB can be contained within many
intervals in listA, not just one.)

Interval trees.

http://en.wikipedia.org/wiki/Interval_tree

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 

Steven D'Aprano

hello,

suppose I have two lists of intervals, one significantly larger than the
other. [...]
What is an efficient way to do this? One simple way is:


for a_range in listA:
    for b_range in listB:
        if is_within(b_range, a_range):
            # accumulate a counter here

where is_within simply checks if the first argument is within the
second.

This will be O(m*n) where m is the size of the smaller list and n is the
size of the big one. It shouldn't matter which order you do the loops;
you still have to test every interval in each list against every
interval in the other list.

Using an interval tree may let you decrease the time to O(m*log n), at
the cost of extra overhead, but for millions of items the saving should
be large.
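Because the query here is containment rather than general overlap, there
is also a tree-free route: sort and sweep, counting with a Fenwick
(binary indexed) tree. The following is only a sketch under stated
assumptions (numeric endpoints, inclusive containment: b1 >= a1 and
b2 <= a2); it is not from the thread, just one way to reach roughly
O((m + n) log n):

from bisect import bisect_right

def contained_counts(listA, listB):
    # coordinate-compress the listB end points for the Fenwick tree
    ends = sorted(set(b2 for _, b2 in listB))
    rank = dict((e, i + 1) for i, e in enumerate(ends))    # 1-based ranks
    tree = [0] * (len(ends) + 1)

    def add(i):                      # Fenwick point update
        while i <= len(ends):
            tree[i] += 1
            i += i & -i

    def count_le(x):                 # how many inserted B ends are <= x
        i = bisect_right(ends, x)
        total = 0
        while i > 0:
            total += tree[i]
            i -= i & -i
        return total

    # sweep by decreasing start; insert B intervals before answering A
    # queries at the same start, so b1 >= a1 holds for everything inserted
    events = [(b1, 0, b2) for b1, b2 in listB]
    events += [(a1, 1, k) for k, (a1, _) in enumerate(listA)]
    events.sort(key=lambda e: (-e[0], e[1]))

    result = [0] * len(listA)
    for _, kind, payload in events:
        if kind == 0:
            add(rank[payload])       # record this B interval's end point
        else:
            result[payload] = count_le(listA[payload][1])
    return result

print(contained_counts([(10, 30), (600, 800)], [(20, 25), (12, 18)]))
# [2, 0]

The sweep visits each interval once; the log factor comes from the sort
and the Fenwick updates and queries.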
 

Per Freem

thanks for your replies -- a few clarifications and questions. the
is_within operation is containment, i.e. (a, b) is within (c, d) iff
a >= c and b <= d. Note that I am not looking for intervals that
overlap... this is why interval trees seem to me to not be relevant,
as the overlapping interval problem is way harder than what I am
trying to do. Please correct me if I'm wrong on this...

Scott Daniels, I was hoping you could elaborate on your comment about
bisect. I am trying to use it as follows: I try to grid my space
(since my intervals have an upper and lower bound) into segments (e.g.
of width 100), and then I take these "bins" and put them into a bisect
list, so that it is sorted. Then when a new interval comes in, I try to
place it within one of those bins. But this is getting messy: I don't
know if I should place it there by its beginning number or end
number. Also, if I have an interval that overlaps my boundaries --
i.e. (900, 1010) when my first bin is (0, 1000), I may miss some
items from listB when I make my count. Is there an elegant solution
to this? Gridding like you said seemed straightforward but now it
seems complicated...

I'd like to add that this is *not* a homework problem, by the way.
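One hedged way around that boundary worry (a sketch, not from the
thread; it assumes integer endpoints, inclusive containment, and an
arbitrary bin width of 100): bin each listB interval by its start point
only. Containment of (b1, b2) in (a1, a2) requires a1 <= b1 <= a2, so
every candidate's start lies in a bin touched by [a1, a2], and an
interval that straddles a bin boundary is never lost, because only its
start is used for binning.

from collections import defaultdict

BIN = 100    # arbitrary bin width for this sketch

def build_bins(listB):
    bins = defaultdict(list)
    for b1, b2 in listB:
        bins[b1 // BIN].append((b1, b2))
    return bins

def count_contained_in(a1, a2, bins):
    hits = 0
    for k in range(a1 // BIN, a2 // BIN + 1):    # every bin touching [a1, a2]
        for b1, b2 in bins.get(k, ()):
            if a1 <= b1 and b2 <= a2:
                hits += 1
    return hits

The edge bins can contain starts just outside [a1, a2], which is why the
explicit a1 <= b1 and b2 <= a2 test stays inside the loop.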

 

Steven D'Aprano

thanks for your replies -- a few clarifications and questions. the
is_within operation is containment, i.e. (a, b) is within (c, d) iff
a >= c and b <= d. Note that I am not looking for intervals that
overlap... this is why interval trees seem to me to not be relevant, as
the overlapping interval problem is way harder than what I am trying to
do. Please correct me if I'm wrong on this...


To test for contained intervals:
a >= c and b <= d

To test for overlapping intervals:

not (b < c or a > d)

Not exactly what I would call "way harder".
 

Per Freem

To test for contained intervals:
a >= c and b <= d

To test for overlapping intervals:

not (b < c or a > d)

Not exactly what I would call "way harder".

hi Steven,

i found an implementation (which is exactly how i'd write it based on
the description) here: http://hackmap.blogspot.com/2008/11/python-interval-tree.html

when i use this however, it comes out either significantly slower or
equal to a naive search. my naive search just iterates through a
smallish list of intervals and for each one says whether they overlap
with each of a large set of intervals.

here is the exact code i used to make the comparison, plus the code at
the link i have above:

class Interval():
    def __init__(self, start, stop):
        self.start = start
        self.stop = stop

import random
import time

num_ints = 30000
init_intervals = []
for n in range(0, num_ints):
    start = int(round(random.random() * 1000))
    end = start + int(round(random.random() * 500 + 1))
    init_intervals.append(Interval(start, end))

num_ranges = 900
ranges = []
for n in range(0, num_ranges):
    start = int(round(random.random() * 1000))
    end = start + int(round(random.random() * 500 + 1))
    ranges.append((start, end))

#print init_intervals
tree = IntervalTree(init_intervals)

t1 = time.time()
for r in ranges:
    tree.find(r[0], r[1])
t2 = time.time()
print "interval tree: %.3f" % ((t2 - t1) * 1000.0)

t1 = time.time()
for r in ranges:
    naive_find(init_intervals, r[0], r[1])
t2 = time.time()
print "brute force: %.3f" % ((t2 - t1) * 1000.0)

on one run, i get:
interval tree: 8584.682
brute force: 8201.644

is there anything wrong with this implementation? it seems very right
to me but i am no expert. any help on this would be really helpful.
 

Per Freem

i forgot to add, my naive_find is:

def naive_find(intervals, start, stop):
    results = []
    for interval in intervals:
        if interval.start >= start and interval.stop <= stop:
            results.append(interval)
    return results

 

brent


hi, the tree is inefficient when the interval is large. as the size of
the interval shrinks to much less than the expanse of the tree, the
tree will be faster. changing 500 to 50 in both cases in your script,
i get:
interval tree: 3233.404
brute force: 9807.787

so the tree will work for limited cases. but it's quite simple. check
the tree in bx-python:
http://bx-python.trac.bx.psu.edu/browser/trunk/lib/bx/intervals/operations/quicksect.py
for a more robust implementation.
-brentp
 

Per Freem

hi brent, thanks very much for your informative reply -- didn't
realize this about the size of the interval.

thanks for the bx-python link. could you (or someone else) explain
why the size of the interval makes such a big difference? i don't
understand why it affects efficiency so much...

thanks.

 

Steven D'Aprano

on one run, i get:
interval tree: 8584.682
brute force: 8201.644

Here are three runs on my computer:


interval tree: 10132.682
brute force: 12054.988

interval tree: 10355.058
brute force: 12119.227

interval tree: 9975.414
brute force: 12052.174


I find that the interval tree is consistently faster than the brute force
method, by at least 16%. I'd expect that benefit to increase
dramatically as the number of intervals gets bigger.
 

brent

hi brent, thanks very much for your informative reply -- didn't
realize this about the size of the interval.

thanks for the bx-python link.  could you (or someone else) explain
why the size of the interval makes such a big difference? i don't
understand why it affects efficiency so much...

thanks.


well, i think if your search interval covers the entire span of the
tree, you can't do better than naive search, as the tree just adds
overhead. as the length of the search interval decreases relative to the
span of the tree, the tree performs better.
if all the intervals in the tree itself are long and overlapping, the
tree just ... ugh ... doesn't work, because it has to do all the extra
checks in the find() method anyway. the bx-python tree probably does a
better job, but you set up a pretty rough test there.
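A rough, hedged way to see brent's point on the benchmark's own
parameters (a sketch with made-up query counts; it counts overlaps,
since that is what a find()-style query reports): estimate how many
stored intervals a typical query touches at each span setting. Any
correct search has to visit every reported match, so when that number is
a large fraction of the 30,000 stored intervals, no tree can beat a
plain linear scan by much.

import random

def avg_matches(span, n_stored=30000, n_queries=100):
    # intervals and queries built roughly the same way as in the benchmark
    stored = [(s, s + random.randint(1, span))
              for s in (random.randint(0, 1000) for _ in range(n_stored))]
    queries = [(s, s + random.randint(1, span))
               for s in (random.randint(0, 1000) for _ in range(n_queries))]
    total = 0
    for q1, q2 in queries:
        total += sum(1 for s, e in stored if s <= q2 and q1 <= e)
    return total / float(n_queries)

print("average overlaps per query, span 500: %.0f" % avg_matches(500))
print("average overlaps per query, span 50:  %.0f" % avg_matches(50))

With span 500 the average query overlaps a large share of the stored
intervals, so the output size alone dwarfs any savings from the tree's
search; with span 50 the output per query is much smaller and the tree's
pruning starts to pay off.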
 

Gabriel Genellina

i found an implementation (which is exactly how i'd write it based on
the description) here:
http://hackmap.blogspot.com/2008/11/python-interval-tree.html

when i use this however, it comes out either significantly slower or
equal to a naive search. my naive search just iterates through a
smallish list of intervals and for each one says whether they overlap
with each of a large set of intervals.

I think there is an algorithm by Baeza-Yates that handles efficiently what
you want; but I can't find it. Perhaps Google works better for you. It
might be only available in Spanish, though.
 

Tim Chase

Per said:
i forgot to add, my naive_find is:

def naive_find(intervals, start, stop):
    results = []
    for interval in intervals:
        if interval.start >= start and interval.stop <= stop:
            results.append(interval)
    return results

I don't know if using a list-comprehension here is a better
choice, but your code looks very much like

def naive_find(intervals, start, stop):
    return [interval for interval in intervals
            if interval.start >= start and interval.stop <= stop]

which may even be simple enough to include in-line rather than as
a sub-function (with any associated call overhead).

I've found that usually Python's smart/efficient handling of
list-comprehensions can make an appreciable difference within
critical-loop constructs.
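A quick, hedged way to check that claim on data shaped like the examples
in this thread (the sizes and bounds here are made up for illustration):

import random
import timeit

intervals = [(random.randint(0, 1000), random.randint(0, 1000))
             for _ in range(100000)]

def with_loop():
    results = []
    for a, b in intervals:
        if a >= 200 and b <= 800:
            results.append((a, b))
    return results

def with_listcomp():
    return [(a, b) for a, b in intervals if a >= 200 and b <= 800]

print("loop:     %.3f s" % timeit.timeit(with_loop, number=10))
print("listcomp: %.3f s" % timeit.timeit(with_listcomp, number=10))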

-tkc
 
