efficient interval containment lookup


Per Freem

hello,

suppose I have two lists of intervals, one significantly larger than
the other. For example, listA = [(10, 30), (5, 25), (100, 200), ...]
might contain thousands of elements while listB (of the same form)
might contain hundreds of thousands or millions of elements.

I want to count how many intervals in listB are contained within every
interval in listA. For example, if listA = [(10, 30), (600, 800)] and
listB = [(20, 25), (12, 18)] is the input, then the output should be
that (10, 30) has 2 intervals from listB contained within it, while
(600, 800) has 0. (Elements of listB can be contained within many
intervals in listA, not just one.)

What is an efficient way to do this? One simple way is:

for a_range in listA:
    for b_range in listB:
        if is_within(b_range, a_range):
            # accumulate a counter here

where is_within simply checks if the first argument is within the
second.

I'm not sure whether it's more efficient to have the iteration over
listA on the outside or listB. But perhaps there's a way to index this
that makes things more efficient? I.e., a smart way of indexing listA
such that I can instantly get all of its elements that contain some
element of listB, maybe? Something like a hash, where this lookup can
be close to constant time rather than an iteration over the whole
list... if there are any built-in library functions that can help with
this, it would be great.

any suggestions on this would be awesome. thank you.
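For reference, a minimal runnable version of the naive approach sketched
above, assuming is_within means inclusive containment (a1 <= b1 and
b2 <= a2, which the poster confirms later in the thread):

def is_within(b_range, a_range):
    # inclusive containment: b fits entirely inside a
    return a_range[0] <= b_range[0] and b_range[1] <= a_range[1]

def count_contained(listA, listB):
    counts = dict((a, 0) for a in listA)
    for a_range in listA:
        for b_range in listB:
            if is_within(b_range, a_range):
                counts[a_range] += 1
    return counts

print(count_contained([(10, 30), (600, 800)], [(20, 25), (12, 18)]))
# {(10, 30): 2, (600, 800): 0}

This is O(len(listA) * len(listB)), which is exactly the cost the rest of
the thread tries to beat.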
 

Tim Chase

suppose I have two lists of intervals, one significantly larger than
the other. For example, listA = [(10, 30), (5, 25), (100, 200), ...]
might contain thousands of elements while listB (of the same form)
might contain hundreds of thousands or millions of elements.

I want to count how many intervals in listB are contained within every
interval in listA. For example, if listA = [(10, 30), (600, 800)] and
listB = [(20, 25), (12, 18)] is the input, then the output should be
that (10, 30) has 2 intervals from listB contained within it, while
(600, 800) has 0. (Elements of listB can be contained within many
intervals in listA, not just one.)

What is an efficient way to do this? One simple way is:

for a_range in listA:
    for b_range in listB:
        if is_within(b_range, a_range):
            # accumulate a counter here

Could you detail the is_within() function? Most importantly, is it
inclusive containment, such that "a1 <= b1 <= b2 <= a2", or is it
overlap, such that "a1 <= b2 and b1 <= a2"? Additionally, do you
want to count how many intervals of A overlap with intervals
of B, or do you just want a count of how many intervals in B have
*any* overlap within A?

My leaning would be to make a custom "is this in A" function,
iterate over B and test against an "appropriate subset" of A:

listA = sorted([...])
min_a1 = min(a1 for (a1, a2) in listA)
max_a2 = max(a2 for (a1, a2) in listA)

def in_a(b1, b2):
    for a1, a2 in smartly_chosen_subset(listA, b1, b2):
        if is_within((b1, b2), (a1, a2)):
            return True
    return False

i = 0
for b1, b2 in sorted(listB):
    if b2 < min_a1: continue
    if b1 > max_a2: break
    if in_a(b1, b2): i += 1

The in_a() function can be optimized to terminate early if a1 > b2
or a2 < b1 (perhaps even choosing a smart starting point for
iterating). Short-circuiting by returning True as soon as you
know there's a match will trim some of the time.
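A hedged sketch of one way to realize that "appropriate subset" (not
Tim's actual code; it assumes listA is sorted by start point and that
is_within means inclusive containment): bisect on the start points so
that only a-intervals starting at or before b1 are ever examined.

from bisect import bisect_right

def count_b_in_any_a(listA, listB):
    a_sorted = sorted(listA)                  # sorted by (a1, a2)
    a_starts = [a1 for a1, _ in a_sorted]     # parallel list of start points
    hits = 0
    for b1, b2 in listB:
        # only a-intervals with a1 <= b1 can possibly contain (b1, b2)
        hi = bisect_right(a_starts, b1)
        if any(a_sorted[i][1] >= b2 for i in range(hi)):
            hits += 1
    return hits

Like Tim's sketch, this counts how many listB intervals fall inside at
least one listA interval; the worst case is still a full scan of listA
per query, but the bisect cheaply discards every a-interval that starts
after b1.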

How distributed are the endpoints? If there's a lot of
commonality in the data, you might be able to cache results to
cut even further.

Just a few ideas, and questions that might help develop better
solutions.

-tkc
 

Robert Kern

[Apologies for piggybacking, but I think GMane had a hiccup today and missed the
original post]

[Somebody wrote]:
suppose I have two lists of intervals, one significantly larger than
the other. For example, listA = [(10, 30), (5, 25), (100, 200), ...]
might contain thousands of elements while listB (of the same form)
might contain hundreds of thousands or millions of elements.

I want to count how many intervals in listB are contained within every
interval in listA. For example, if listA = [(10, 30), (600, 800)] and
listB = [(20, 25), (12, 18)] is the input, then the output should be
that (10, 30) has 2 intervals from listB contained within it, while
(600, 800) has 0. (Elements of listB can be contained within many
intervals in listA, not just one.)

Interval trees.

http://en.wikipedia.org/wiki/Interval_tree

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 

Steven D'Aprano

hello,

suppose I have two lists of intervals, one significantly larger than the
other. [...]
What is an efficient way to do this? One simple way is:


for a_range in listA:
    for b_range in listB:
        if is_within(b_range, a_range):
            # accumulate a counter here

where is_within simply checks if the first argument is within the
second.

This will be O(m*n) where m is the size of the smaller list and n is the
size of the big one. It shouldn't matter which order you do the loops;
you still have to test every interval in each list against every
interval in the other list.

Using an interval tree may let you decrease the time to O(m*log n), at
the cost of extra overhead, but for millions of items the saving should
be large.
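Because the query here is containment rather than general overlap, there
is also a tree-free route: sort and sweep, counting with a Fenwick
(binary indexed) tree. The following is only a sketch under stated
assumptions (numeric endpoints, inclusive containment: b1 >= a1 and
b2 <= a2); it is not from the thread, just one way to reach roughly
O((m + n) log n):

from bisect import bisect_right

def contained_counts(listA, listB):
    # coordinate-compress the listB end points for the Fenwick tree
    ends = sorted(set(b2 for _, b2 in listB))
    rank = dict((e, i + 1) for i, e in enumerate(ends))    # 1-based ranks
    tree = [0] * (len(ends) + 1)

    def add(i):                      # Fenwick point update
        while i <= len(ends):
            tree[i] += 1
            i += i & -i

    def count_le(x):                 # how many inserted B ends are <= x
        i = bisect_right(ends, x)
        total = 0
        while i > 0:
            total += tree[i]
            i -= i & -i
        return total

    # sweep by decreasing start; insert B intervals before answering A
    # queries at the same start, so b1 >= a1 holds for everything inserted
    events = [(b1, 0, b2) for b1, b2 in listB]
    events += [(a1, 1, k) for k, (a1, _) in enumerate(listA)]
    events.sort(key=lambda e: (-e[0], e[1]))

    result = [0] * len(listA)
    for _, kind, payload in events:
        if kind == 0:
            add(rank[payload])       # record this B interval's end point
        else:
            result[payload] = count_le(listA[payload][1])
    return result

print(contained_counts([(10, 30), (600, 800)], [(20, 25), (12, 18)]))
# [2, 0]

The sweep visits each interval once; the log factor comes from the sort
and the Fenwick updates and queries.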
 

Per Freem

thanks for your replies -- a few clarifications and questions. the
is_within operation is containment, i.e. (a, b) is within (c, d) iff
a >= c and b <= d. Note that I am not looking for intervals that
overlap... this is why interval trees seem to me to not be relevant,
as the overlapping interval problem is way harder than what I am
trying to do. Please correct me if I'm wrong on this...

Scott Daniels, I was hoping you could elaborate on your comment about
bisect. I am trying to use it as follows: I try to grid my space
(since my intervals have an upper and lower bound) into segments (e.g.
of width 100), and then I take these "bins" and put them into a bisect
list, so that it is sorted. Then when a new interval comes in, I try to
place it within one of those bins. But this is getting messy: I don't
know if I should place it there by its beginning number or end
number. Also, if I have an interval that overlaps my boundaries --
i.e. (900, 1010) when my first bin is (0, 1000), I may miss some
items from listB when I make my count. Is there an elegant solution
to this? Gridding like you said seemed straightforward but now it
seems complicated...

I'd like to add that this is *not* a homework problem, by the way.
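One hedged way around that boundary worry (a sketch, not from the
thread; it assumes integer endpoints, inclusive containment, and an
arbitrary bin width of 100): bin each listB interval by its start point
only. Containment of (b1, b2) in (a1, a2) requires a1 <= b1 <= a2, so
every candidate's start lies in a bin touched by [a1, a2], and an
interval that straddles a bin boundary is never lost, because only its
start is used for binning.

from collections import defaultdict

BIN = 100    # arbitrary bin width for this sketch

def build_bins(listB):
    bins = defaultdict(list)
    for b1, b2 in listB:
        bins[b1 // BIN].append((b1, b2))
    return bins

def count_contained_in(a1, a2, bins):
    hits = 0
    for k in range(a1 // BIN, a2 // BIN + 1):    # every bin touching [a1, a2]
        for b1, b2 in bins.get(k, ()):
            if a1 <= b1 and b2 <= a2:
                hits += 1
    return hits

The edge bins can contain starts just outside [a1, a2], which is why the
explicit a1 <= b1 and b2 <= a2 test stays inside the loop.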

 

Steven D'Aprano

thanks for your replies -- a few clarifications and questions. the
is_within operation is containment, i.e. (a, b) is within (c, d) iff
a >= c and b <= d. Note that I am not looking for intervals that
overlap... this is why interval trees seem to me to not be relevant, as
the overlapping interval problem is way harder than what I am trying to
do. Please correct me if I'm wrong on this...


To test for contained intervals:
a >= c and b <= d

To test for overlapping intervals:

not (b < c or a > d)

Not exactly what I would call "way harder".
 

Per Freem

To test for contained intervals:
a >= c and b <= d

To test for overlapping intervals:

not (b < c or a > d)

Not exactly what I would call "way harder".

hi Steven,

i found an implementation (which is exactly how i'd write it based on
the description) here: http://hackmap.blogspot.com/2008/11/python-interval-tree.html

when i use this however, it comes out either significantly slower or
equal to a naive search. my naive search just iterates through a
smallish list of intervals and for each one says whether they overlap
with each of a large set of intervals.

here is the exact code i used to make the comparison, plus the code at
the link i have above:

class Interval():
    def __init__(self, start, stop):
        self.start = start
        self.stop = stop

import random
import time

num_ints = 30000
init_intervals = []
for n in range(0, num_ints):
    start = int(round(random.random() * 1000))
    end = start + int(round(random.random() * 500 + 1))
    init_intervals.append(Interval(start, end))

num_ranges = 900
ranges = []
for n in range(0, num_ranges):
    start = int(round(random.random() * 1000))
    end = start + int(round(random.random() * 500 + 1))
    ranges.append((start, end))

#print init_intervals
tree = IntervalTree(init_intervals)

t1 = time.time()
for r in ranges:
    tree.find(r[0], r[1])
t2 = time.time()
print "interval tree: %.3f" % ((t2 - t1) * 1000.0)

t1 = time.time()
for r in ranges:
    naive_find(init_intervals, r[0], r[1])
t2 = time.time()
print "brute force: %.3f" % ((t2 - t1) * 1000.0)

on one run, i get:
interval tree: 8584.682
brute force: 8201.644

is there anything wrong with this implementation? it seems very right
to me but i am no expert. any help on this would be really helpful.
 

Per Freem

i forgot to add, my naive_find is:

def naive_find(intervals, start, stop):
    results = []
    for interval in intervals:
        if interval.start >= start and interval.stop <= stop:
            results.append(interval)
    return results

 

brent


hi, the tree is inefficient when the interval is large. as the size of
the interval shrinks to much less than the expanse of the tree, the
tree will be faster. changing 500 to 50 in both cases in your script,
i get:
interval tree: 3233.404
brute force: 9807.787

so the tree will work for limited cases. but it's quite simple. check
the tree in bx-python:
http://bx-python.trac.bx.psu.edu/browser/trunk/lib/bx/intervals/operations/quicksect.py
for a more robust implementation.
-brentp
 

Per Freem

hi brent, thanks very much for your informative reply -- didn't
realize this about the size of the interval.

thanks for the bx-python link. could you (or someone else) explain
why the size of the interval makes such a big difference? i don't
understand why it affects efficiency so much...

thanks.

 

Steven D'Aprano

on one run, i get:
interval tree: 8584.682
brute force: 8201.644

Here are three runs on my computer:


interval tree: 10132.682
brute force: 12054.988

interval tree: 10355.058
brute force: 12119.227

interval tree: 9975.414
brute force: 12052.174


I find that the interval tree is consistently faster than the brute force
method, by at least 16%. I'd expect that benefit to increase
dramatically as the number of intervals gets bigger.
 

brent

hi brent, thanks very much for your informative reply -- didn't
realize this about the size of the interval.

thanks for the bx-python link.  could you (or someone else) explain
why the size of the interval makes such a big difference? i don't
understand why it affects efficiency so much...

thanks.


well, i think if your search interval covers the entire span of the
tree, you can't do better than naive search, as the tree just adds
overhead. as the length of the search interval decreases relative to the
span of the tree, the tree performs better.
if all the intervals in the tree itself are long and overlapping, the
tree just ... ugh ... doesn't work, because it has to do all the extra
checks in the find() method anyway. the bx-python tree probably does a
better job, but you set up a pretty rough test there.
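A rough, hedged way to see brent's point on the benchmark's own
parameters (a sketch with made-up query counts; it counts overlaps,
since that is what a find()-style query reports): estimate how many
stored intervals a typical query touches at each span setting. Any
correct search has to visit every reported match, so when that number is
a large fraction of the 30,000 stored intervals, no tree can beat a
plain linear scan by much.

import random

def avg_matches(span, n_stored=30000, n_queries=100):
    # intervals and queries built roughly the same way as in the benchmark
    stored = [(s, s + random.randint(1, span))
              for s in (random.randint(0, 1000) for _ in range(n_stored))]
    queries = [(s, s + random.randint(1, span))
               for s in (random.randint(0, 1000) for _ in range(n_queries))]
    total = 0
    for q1, q2 in queries:
        total += sum(1 for s, e in stored if s <= q2 and q1 <= e)
    return total / float(n_queries)

print("average overlaps per query, span 500: %.0f" % avg_matches(500))
print("average overlaps per query, span 50:  %.0f" % avg_matches(50))

With span 500 the average query overlaps a large share of the stored
intervals, so the output size alone dwarfs any savings from the tree's
search; with span 50 the output per query is much smaller and the tree's
pruning starts to pay off.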
 

Gabriel Genellina

i found an implementation (which is exactly how i'd write it based on
the description) here:
http://hackmap.blogspot.com/2008/11/python-interval-tree.html

when i use this however, it comes out either significantly slower or
equal to a naive search. my naive search just iterates through a
smallish list of intervals and for each one says whether they overlap
with each of a large set of intervals.

I think there is an algorithm by Baeza-Yates that handles efficiently what
you want; but I can't find it. Perhaps Google works better for you. It
might be only available in Spanish, though.
 

Tim Chase

Per said:
i forgot to add, my naive_find is:

def naive_find(intervals, start, stop):
    results = []
    for interval in intervals:
        if interval.start >= start and interval.stop <= stop:
            results.append(interval)
    return results

I don't know if using a list-comprehension here is a better
choice, but your code looks very much like

def naive_find(intervals, start, stop):
    return [interval for interval in intervals
            if interval.start >= start and interval.stop <= stop]

which may even be simple enough to include in-line rather than as
a sub-function (with any associated call overhead).

I've found that usually Python's smart/efficient handling of
list-comprehensions can make an appreciable difference within
critical-loop constructs.
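A quick, hedged way to check that claim on data shaped like the examples
in this thread (the sizes and bounds here are made up for illustration):

import random
import timeit

intervals = [(random.randint(0, 1000), random.randint(0, 1000))
             for _ in range(100000)]

def with_loop():
    results = []
    for a, b in intervals:
        if a >= 200 and b <= 800:
            results.append((a, b))
    return results

def with_listcomp():
    return [(a, b) for a, b in intervals if a >= 200 and b <= 800]

print("loop:     %.3f s" % timeit.timeit(with_loop, number=10))
print("listcomp: %.3f s" % timeit.timeit(with_listcomp, number=10))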

-tkc
 
