# merging intervals repeatedly

Discussion in 'Python' started by Magdoll, Mar 11, 2008.

1. ### MagdollGuest

Hi,

I have to read through a file that will give me a bunch of intervals.
My ultimate goal is to produce a final set of intervals such that not
two intervals overlap by more than N, where N is a predetermined
length.

For example, I could read through this input:
(1,10), (3,15), (20,30),(29,40),(51,65),(62,100),(50,66)

btw, the input is not guaranteed to be in any sorted order.

say N = 5, so the final set should be
(1,15), (20, 30), (29, 40), (50, 100)

Is there already some existing code in Python that I can easily take
advantage of to produce this? Right now I've written my own simple
solution, which is just to maintain a list of the intervals. I can use
the Interval module, but it doesn't really affect much. I read one
interval from the input file at a time, and use bisect to insert it in
order. The problem comes with merging, which sometimes can be

ex:
read (51,65) ==> put (51,65) in list
read (62,100) ==> put (62,100) in list (overlap only be 4 <= N)
read (50,66) ==> merge with (51,65) to become (50,66) ==> now can
merge with (62,100)

Of course I can check for cascading merges by pivoting around the
position where the insertion would've taken place...but just
wondering, is there some other nice way to do this? I also intuitively
don't sense a need for using trees, unless someone's already written
it, the interface is easy to use, and it won't result in more insert/
delete/structure changes that nullifies its beauty...(also I'm not
using the output list for searching)

Magdoll, Mar 11, 2008

2. ### MagdollGuest

Correct. I meant the final should be
(1,30), (29,40), (50,100)

Magdoll, Mar 11, 2008

3. ### Dennis Lee BieberGuest

Actually, even that is incorrect -- note that ,30 overlaps 29,

Using an extended data file (note: I found parsing the data file
took me longer than coding the merging) containing (the first part was
from your sample -- I extended starting at 200 to test all
combinations):

(1,10)
(3,15)
(20,30)
(29,40)
(51,65)
(62,100)
(50,66)
(200, 210)
(199, 211)
(250, 260)
(251, 259)
(300, 310)
(301, 311)
(320, 330)
(319, 329)
(350, 360)
(360, 370)
(340, 350)

Final list of intervals:
(1, 15)
(20, 40)
(50, 100)
(199, 211)
(250, 260)
(300, 311)
(319, 330)
(340, 370)
Note the first three result sets are correct for your list.

Since this feels like a homework assignment, I won't be posting my
code...

I only state that, 1) I did not bother sorting the interval list --
unless you have very many intervals to process, a simple sequential
search is probably faster than using bisection/insert algorithm; 2) the
algorithm inserts one interval at a time; 3) merge is accomplished by
removing/returning the first candidate interval to the insert method,
computing the min/max of the candidate and input intervals, and
recursively inserting that new interval -- if there is no candidate
overlap interval, no interval is returned, and the insert just appends
the input interval to the list.
--
Wulfraed Dennis Lee Bieber KD6MOG

HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: )
HTTP://www.bestiaria.com/

Dennis Lee Bieber, Mar 12, 2008
4. ### castironpiGuest

Correct. I meant the final should be
Actually, given the specification, (overlaps > N count), (1,15), (20,
30), (29, 40), (50, 66), (62,100) is right, since 66-62= 4<= 5. 
What are the pros and cons, Mr. Bieber? Is cl.py a nanny or a
resource? If someone asks for help, there's probably a reason, and if
they cheat, then they don't learn Python. Besides, two or three in a
row and that's justified. Besides, business programmers get plenty of
answers here-- that's cheating too. Besides, just 'cause you help
somebody else, doesn't mean everyone will. Besides, your solution
doesn't work for floats.

 > My ultimate goal is to produce a final set of intervals such
that not

castironpi, Mar 12, 2008
5. ### MagdollGuest

As someone else pointed out, I can't merge (1,30), (29,40) because N =
5. If N were always 1, I could've easily used IntervalSet to do this
and I would not have to write more than 5 lines total to get this
done. I'm asking precisely because N can be varied and maybe in the
future will be more complex than just an integer.
It isn't an homework assignment. I'm writing it for my research. Or
more precisely, I've already written a solution and wanted to see if
any people here know more salient solutions than mine
Unfortunately that is the case. My input can be more than 1 million
intervals, although they can be further subdivided into smaller lists,
but still, I would be looking at a list of intervals on a size of at
least thousands, so binary search would definitely win over sequential
search.

2) the
This looks like pretty much what my current solution looks like,
except that it's unsorted.

Magdoll, Mar 12, 2008
6. ### Robert BossyGuest

Hi,

The problem, as stated here, may have several solutions. For instance
the following set of intervals also satisfies the constraint:
(1,15), (20,40), (50,100)

One question you should ask yourself is: do you want all solutions? or
just one?
If you want just one, there's another question: which one? the one with
the most intervals? any one?
If you want all of them, then I suggest using prolog rather than python
(I hope I won't be flamed for advocating another language here).

The way this algorithm is presented suggests an additional constraint:
you cannot merge two intervals if their overlap <= N. In that case,
there is a single solution indeed...
Nitpick: you cannot merge (50,66) and (62,100) since their overlap is
still <= 5.

If you have a reasonable number of intervals, you're algorithm seems
fine. But it is O(n**2), so in the case you read a lot of intervals and
you observe unsatisfying performances, you will have to store the
intervals in a cleverer data structure, see one of these:
http://en.wikipedia.org/wiki/Interval_tree
http://en.wikipedia.org/wiki/Segment_tree

Cheers,
RB

Robert Bossy, Mar 12, 2008
7. ### MagdollGuest

I actually don't know which solution I want, and that's why I keep
trying different solutions Will I be able to switch between using prolog & python back and forth
though? Cuz the bulk of my code will still be written in python and
this is just a very small part of it.
Sorry I keep messing up my example. I meant, I would merge them if
they overlap by >= N....but anyways.
Thanks! Both of these look interesting and potentially useful Magdoll, Mar 12, 2008
8. ### Dennis Lee BieberGuest

Somehow I'd missed that particular constraint...
Okay -- it just came across as the type of problem one might find as
homework. In the absence of code and a discrete request for "what is
wrong" I felt it better to avoid posting my own code...

A few minutes did get my code to work with a "length of overlap"
constraint... Note that a literal reading of that requirement means that
a short interval, totally contained within a longer interval, will NOT
be merged if it does not have the required number of elements.
Testing with required overlap of 0
Final list of intervals:
(1, 15)
(20, 40)
(50, 100)
(110, 120)
(250, 260)
(300, 311)
(319, 330)
(340, 370)
Testing with required overlap of 1
Final list of intervals:
(1, 15)
(20, 40)
(50, 100)
(110, 120)
(250, 260)
(300, 311)
(319, 330)
(340, 370)
Testing with required overlap of 2
Final list of intervals:
(1, 15)
(20, 40)
(50, 100)
(110, 120)
(250, 260)
(300, 311)
(319, 330)
(340, 350)
(350, 360)
(360, 370)
Testing with required overlap of 3
Final list of intervals:
(1, 15)
(20, 30)
(29, 40)
(50, 100)
(110, 120)
(250, 260)
(300, 311)
(319, 330)
(340, 350)
(350, 360)
(360, 370)
Testing with required overlap of 4
Final list of intervals:
(1, 15)
(20, 30)
(29, 40)
(50, 100)
(110, 120)
(113, 115)
(250, 260)
(300, 311)
(319, 330)
(340, 350)
(350, 360)
(360, 370)
Testing with required overlap of 5
Final list of intervals:
(1, 15)
(20, 30)
(29, 40)
(50, 100)
(110, 120)
(113, 115)
(250, 260)
(300, 311)
(319, 330)
(340, 350)
(350, 360)
(360, 370)
--
Wulfraed Dennis Lee Bieber KD6MOG

HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: )
HTTP://www.bestiaria.com/

Dennis Lee Bieber, Mar 13, 2008
9. ### Robert BossyGuest

You should think about what is your data and what is probably the "best"
solution.

You'll have to popen a prolog interpreter and parse its output. Not very
sexy.
Moreover if you've never done prolog, well, you should be warned it's a
"different" language (but still beautiful) with an important learning
curve. Maybe not worth it for just one single problem.

Indeed. However these structures are clearly heavyweight if the number
of intervals is moderate. I would consider them only if I expected more
than several thousands of intervals.

Cheers,
RB

Robert Bossy, Mar 14, 2008