# merging intervals repeatedly

Discussion in 'Python' started by Magdoll, Mar 11, 2008.

1. ### MagdollGuest

Hi,

I have to read through a file that will give me a bunch of intervals.
My ultimate goal is to produce a final set of intervals such that not
two intervals overlap by more than N, where N is a predetermined
length.

For example, I could read through this input:
(1,10), (3,15), (20,30),(29,40),(51,65),(62,100),(50,66)

btw, the input is not guaranteed to be in any sorted order.

say N = 5, so the final set should be
(1,15), (20, 30), (29, 40), (50, 100)

Is there already some existing code in Python that I can easily take
advantage of to produce this? Right now I've written my own simple
solution, which is just to maintain a list of the intervals. I can use
the Interval module, but it doesn't really affect much. I read one
interval from the input file at a time, and use bisect to insert it in
order. The problem comes with merging, which sometimes can be

ex:
read (51,65) ==> put (51,65) in list
read (62,100) ==> put (62,100) in list (overlap only be 4 <= N)
read (50,66) ==> merge with (51,65) to become (50,66) ==> now can
merge with (62,100)

Of course I can check for cascading merges by pivoting around the
position where the insertion would've taken place...but just
wondering, is there some other nice way to do this? I also intuitively
don't sense a need for using trees, unless someone's already written
it, the interface is easy to use, and it won't result in more insert/
delete/structure changes that nullifies its beauty...(also I'm not
using the output list for searching)

Magdoll, Mar 11, 2008

2. ### MagdollGuest

Correct. I meant the final should be
(1,30), (29,40), (50,100)

On Mar 11, 3:41 pm, Magdoll <> wrote:
> Hi,
>
> I have to read through a file that will give me a bunch of intervals.
> My ultimate goal is to produce a final set of intervals such that not
> two intervals overlap by more than N, where N is a predetermined
> length.
>
> For example, I could read through this input:
> (1,10), (3,15), (20,30),(29,40),(51,65),(62,100),(50,66)
>
> btw, the input is not guaranteed to be in any sorted order.
>
> say N = 5, so the final set should be
> (1,15), (20, 30), (29, 40), (50, 100)
>
> Is there already some existing code in Python that I can easily take
> advantage of to produce this? Right now I've written my own simple
> solution, which is just to maintain a list of the intervals. I can use
> the Interval module, but it doesn't really affect much. I read one
> interval from the input file at a time, and use bisect to insert it in
> order. The problem comes with merging, which sometimes can be
>
> ex:
> read (51,65) ==> put (51,65) in list
> read (62,100) ==> put (62,100) in list (overlap only be 4 <= N)
> read (50,66) ==> merge with (51,65) to become (50,66) ==> now can
> merge with (62,100)
>
> Of course I can check for cascading merges by pivoting around the
> position where the insertion would've taken place...but just
> wondering, is there some other nice way to do this? I also intuitively
> don't sense a need for using trees, unless someone's already written
> it, the interface is easy to use, and it won't result in more insert/
> delete/structure changes that nullifies its beauty...(also I'm not
> using the output list for searching)
>

Magdoll, Mar 11, 2008

3. ### Dennis Lee BieberGuest

On Tue, 11 Mar 2008 15:43:40 -0700 (PDT), Magdoll <>
declaimed the following in comp.lang.python:

> Correct. I meant the final should be
> (1,30), (29,40), (50,100)
>

Actually, even that is incorrect -- note that ,30 overlaps 29,

Using an extended data file (note: I found parsing the data file
took me longer than coding the merging) containing (the first part was
from your sample -- I extended starting at 200 to test all
combinations):

(1,10)
(3,15)
(20,30)
(29,40)
(51,65)
(62,100)
(50,66)
(200, 210)
(199, 211)
(250, 260)
(251, 259)
(300, 310)
(301, 311)
(320, 330)
(319, 329)
(350, 360)
(360, 370)
(340, 350)

>pythonw -u "Intervals.py"

Final list of intervals:
(1, 15)
(20, 40)
(50, 100)
(199, 211)
(250, 260)
(300, 311)
(319, 330)
(340, 370)
>Exit code: 0

Note the first three result sets are correct for your list.

Since this feels like a homework assignment, I won't be posting my
code...

I only state that, 1) I did not bother sorting the interval list --
unless you have very many intervals to process, a simple sequential
search is probably faster than using bisection/insert algorithm; 2) the
algorithm inserts one interval at a time; 3) merge is accomplished by
removing/returning the first candidate interval to the insert method,
computing the min/max of the candidate and input intervals, and
recursively inserting that new interval -- if there is no candidate
overlap interval, no interval is returned, and the insert just appends
the input interval to the list.
--
Wulfraed Dennis Lee Bieber KD6MOG

HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: )
HTTP://www.bestiaria.com/

Dennis Lee Bieber, Mar 12, 2008
4. ### Guest

> > Correct. I meant the final should be
> > (1,30), (29,40), (50,100)

>
>         Actually, even that is incorrect -- note that ,30 overlaps 29,

Actually, given the specification, (overlaps > N count), (1,15), (20,
30), (29, 40), (50, 66), (62,100) is right, since 66-62= 4<= 5. [1]

>         Since this feels like a homework assignment, I won't be posting my
> code...

What are the pros and cons, Mr. Bieber? Is cl.py a nanny or a
resource? If someone asks for help, there's probably a reason, and if
they cheat, then they don't learn Python. Besides, two or three in a
row and that's justified. Besides, business programmers get plenty of
answers here-- that's cheating too. Besides, just 'cause you help
somebody else, doesn't mean everyone will. Besides, your solution
doesn't work for floats.

[1] > My ultimate goal is to produce a final set of intervals such
that not
> two intervals overlap by more than N, where N is a predetermined
> length.

, Mar 12, 2008
5. ### MagdollGuest

On Mar 11, 11:01 pm, Dennis Lee Bieber <> wrote:
> On Tue, 11 Mar 2008 15:43:40 -0700 (PDT), Magdoll <>
> declaimed the following in comp.lang.python:
>
> > Correct. I meant the final should be
> > (1,30), (29,40), (50,100)

>
> Actually, even that is incorrect -- note that ,30 overlaps 29,

As someone else pointed out, I can't merge (1,30), (29,40) because N =
5. If N were always 1, I could've easily used IntervalSet to do this
and I would not have to write more than 5 lines total to get this
done. I'm asking precisely because N can be varied and maybe in the
future will be more complex than just an integer.

>
> Since this feels like a homework assignment, I won't be posting my
> code...

It isn't an homework assignment. I'm writing it for my research. Or
more precisely, I've already written a solution and wanted to see if
any people here know more salient solutions than mine

>
> I only state that, 1) I did not bother sorting the interval list --
> unless you have very many intervals to process, a simple sequential
> search is probably faster than using bisection/insert algorithm;

Unfortunately that is the case. My input can be more than 1 million
intervals, although they can be further subdivided into smaller lists,
but still, I would be looking at a list of intervals on a size of at
least thousands, so binary search would definitely win over sequential
search.

2) the
> algorithm inserts one interval at a time; 3) merge is accomplished by
> removing/returning the first candidate interval to the insert method,
> computing the min/max of the candidate and input intervals, and
> recursively inserting that new interval -- if there is no candidate
> overlap interval, no interval is returned, and the insert just appends
> the input interval to the list.

This looks like pretty much what my current solution looks like,
except that it's unsorted.

Magdoll, Mar 12, 2008
6. ### Robert BossyGuest

Magdoll wrote:
> Hi,
>
> I have to read through a file that will give me a bunch of intervals.
> My ultimate goal is to produce a final set of intervals such that not
> two intervals overlap by more than N, where N is a predetermined
> length.
>
> For example, I could read through this input:
> (1,10), (3,15), (20,30),(29,40),(51,65),(62,100),(50,66)
>
> btw, the input is not guaranteed to be in any sorted order.
>
> say N = 5, so the final set should be
> (1,15), (20, 30), (29, 40), (50, 100)
>

Hi,

The problem, as stated here, may have several solutions. For instance
the following set of intervals also satisfies the constraint:
(1,15), (20,40), (50,100)

One question you should ask yourself is: do you want all solutions? or
just one?
If you want just one, there's another question: which one? the one with
the most intervals? any one?
If you want all of them, then I suggest using prolog rather than python
(I hope I won't be flamed for advocating another language here).

> Is there already some existing code in Python that I can easily take
> advantage of to produce this? Right now I've written my own simple
> solution, which is just to maintain a list of the intervals. I can use
> the Interval module, but it doesn't really affect much. I read one
> interval from the input file at a time, and use bisect to insert it in
> order. The problem comes with merging, which sometimes can be
>
> ex:
> read (51,65) ==> put (51,65) in list
> read (62,100) ==> put (62,100) in list (overlap only be 4 <= N)
> read (50,66) ==> merge with (51,65) to become (50,66) ==> now can
> merge with (62,100)

The way this algorithm is presented suggests an additional constraint:
you cannot merge two intervals if their overlap <= N. In that case,
there is a single solution indeed...
Nitpick: you cannot merge (50,66) and (62,100) since their overlap is
still <= 5.

If you have a reasonable number of intervals, you're algorithm seems
fine. But it is O(n**2), so in the case you read a lot of intervals and
you observe unsatisfying performances, you will have to store the
intervals in a cleverer data structure, see one of these:
http://en.wikipedia.org/wiki/Interval_tree
http://en.wikipedia.org/wiki/Segment_tree

Cheers,
RB

Robert Bossy, Mar 12, 2008
7. ### MagdollGuest

> Hi,
>
> The problem, as stated here, may have several solutions. For instance
> the following set of intervals also satisfies the constraint:
> (1,15), (20,40), (50,100)
>
> One question you should ask yourself is: do you want all solutions? or
> just one?
> If you want just one, there's another question: which one? the one with
> the most intervals? any one?

I actually don't know which solution I want, and that's why I keep
trying different solutions

> If you want all of them, then I suggest using prolog rather than python
> (I hope I won't be flamed for advocating another language here).

Will I be able to switch between using prolog & python back and forth
though? Cuz the bulk of my code will still be written in python and
this is just a very small part of it.

>
> > Is there already some existing code in Python that I can easily take
> > advantage of to produce this? Right now I've written my own simple
> > solution, which is just to maintain a list of the intervals. I can use
> > the Interval module, but it doesn't really affect much. I read one
> > interval from the input file at a time, and use bisect to insert it in
> > order. The problem comes with merging, which sometimes can be

>
> > ex:
> > read (51,65) ==> put (51,65) in list
> > read (62,100) ==> put (62,100) in list (overlap only be 4 <= N)
> > read (50,66) ==> merge with (51,65) to become (50,66) ==> now can
> > merge with (62,100)

>
> The way this algorithm is presented suggests an additional constraint:
> you cannot merge two intervals if their overlap <= N. In that case,
> there is a single solution indeed...
> Nitpick: you cannot merge (50,66) and (62,100) since their overlap is
> still <= 5.

Sorry I keep messing up my example. I meant, I would merge them if
they overlap by >= N....but anyways.

>
> If you have a reasonable number of intervals, you're algorithm seems
> fine. But it is O(n**2), so in the case you read a lot of intervals and
> you observe unsatisfying performances, you will have to store the
> intervals in a cleverer data structure, see one of these:http://en.wikipedia.org/wiki/Interval_treehttp://en.wikipedia.org/wiki/Segment_tree

Thanks! Both of these look interesting and potentially useful

Magdoll, Mar 12, 2008
8. ### Dennis Lee BieberGuest

On Tue, 11 Mar 2008 23:30:16 -0700 (PDT), Magdoll <>
declaimed the following in comp.lang.python:

> As someone else pointed out, I can't merge (1,30), (29,40) because N =
> 5. If N were always 1, I could've easily used IntervalSet to do this
> and I would not have to write more than 5 lines total to get this
> done. I'm asking precisely because N can be varied and maybe in the
> future will be more complex than just an integer.
>

Somehow I'd missed that particular constraint...

> >
> > Since this feels like a homework assignment, I won't be posting my
> > code...

>
> It isn't an homework assignment. I'm writing it for my research. Or
> more precisely, I've already written a solution and wanted to see if
> any people here know more salient solutions than mine
>

Okay -- it just came across as the type of problem one might find as
homework. In the absence of code and a discrete request for "what is
wrong" I felt it better to avoid posting my own code...

A few minutes did get my code to work with a "length of overlap"
constraint... Note that a literal reading of that requirement means that
a short interval, totally contained within a longer interval, will NOT
be merged if it does not have the required number of elements.

>pythonw -u "Intervals.py"

Testing with required overlap of 0
Final list of intervals:
(1, 15)
(20, 40)
(50, 100)
(110, 120)
(250, 260)
(300, 311)
(319, 330)
(340, 370)
Testing with required overlap of 1
Final list of intervals:
(1, 15)
(20, 40)
(50, 100)
(110, 120)
(250, 260)
(300, 311)
(319, 330)
(340, 370)
Testing with required overlap of 2
Final list of intervals:
(1, 15)
(20, 40)
(50, 100)
(110, 120)
(250, 260)
(300, 311)
(319, 330)
(340, 350)
(350, 360)
(360, 370)
Testing with required overlap of 3
Final list of intervals:
(1, 15)
(20, 30)
(29, 40)
(50, 100)
(110, 120)
(250, 260)
(300, 311)
(319, 330)
(340, 350)
(350, 360)
(360, 370)
Testing with required overlap of 4
Final list of intervals:
(1, 15)
(20, 30)
(29, 40)
(50, 100)
(110, 120)
(113, 115)
(250, 260)
(300, 311)
(319, 330)
(340, 350)
(350, 360)
(360, 370)
Testing with required overlap of 5
Final list of intervals:
(1, 15)
(20, 30)
(29, 40)
(50, 100)
(110, 120)
(113, 115)
(250, 260)
(300, 311)
(319, 330)
(340, 350)
(350, 360)
(360, 370)
>Exit code: 0

--
Wulfraed Dennis Lee Bieber KD6MOG

HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: )
HTTP://www.bestiaria.com/

Dennis Lee Bieber, Mar 13, 2008
9. ### Robert BossyGuest

Magdoll wrote:
>> One question you should ask yourself is: do you want all solutions? or
>> just one?
>> If you want just one, there's another question: which one? the one with
>> the most intervals? any one?
>>

>
> I actually don't know which solution I want, and that's why I keep
> trying different solutions
>

You should think about what is your data and what is probably the "best"
solution.

>> If you want all of them, then I suggest using prolog rather than python
>> (I hope I won't be flamed for advocating another language here).
>>

>
> Will I be able to switch between using prolog & python back and forth
> though? Cuz the bulk of my code will still be written in python and
> this is just a very small part of it.
>

You'll have to popen a prolog interpreter and parse its output. Not very
sexy.
Moreover if you've never done prolog, well, you should be warned it's a
"different" language (but still beautiful) with an important learning
curve. Maybe not worth it for just one single problem.

>> If you have a reasonable number of intervals, you're algorithm seems
>> fine. But it is O(n**2), so in the case you read a lot of intervals and
>> you observe unsatisfying performances, you will have to store the
>> intervals in a cleverer data structure, see one of these:http://en.wikipedia.org/wiki/Interval_treehttp://en.wikipedia.org/wiki/Segment_tree
>>

>
> Thanks! Both of these look interesting and potentially useful
>

Indeed. However these structures are clearly heavyweight if the number
of intervals is moderate. I would consider them only if I expected more
than several thousands of intervals.

Cheers,
RB

Robert Bossy, Mar 14, 2008