merging intervals repeatedly

M

Magdoll

Hi,

I have to read through a file that will give me a bunch of intervals.
My ultimate goal is to produce a final set of intervals such that not
two intervals overlap by more than N, where N is a predetermined
length.

For example, I could read through this input:
(1,10), (3,15), (20,30),(29,40),(51,65),(62,100),(50,66)

btw, the input is not guaranteed to be in any sorted order.

say N = 5, so the final set should be
(1,15), (20, 30), (29, 40), (50, 100)

Is there already some existing code in Python that I can easily take
advantage of to produce this? Right now I've written my own simple
solution, which is just to maintain a list of the intervals. I can use
the Interval module, but it doesn't really affect much. I read one
interval from the input file at a time, and use bisect to insert it in
order. The problem comes with merging, which sometimes can be
cascading.

ex:
read (51,65) ==> put (51,65) in list
read (62,100) ==> put (62,100) in list (overlap only be 4 <= N)
read (50,66) ==> merge with (51,65) to become (50,66) ==> now can
merge with (62,100)

Of course I can check for cascading merges by pivoting around the
position where the insertion would've taken place...but just
wondering, is there some other nice way to do this? I also intuitively
don't sense a need for using trees, unless someone's already written
it, the interface is easy to use, and it won't result in more insert/
delete/structure changes that nullifies its beauty...(also I'm not
using the output list for searching)


Thanks in advance.
 
D

Dennis Lee Bieber

Correct. I meant the final should be
(1,30), (29,40), (50,100)
Actually, even that is incorrect -- note that ,30 overlaps 29,

Using an extended data file (note: I found parsing the data file
took me longer than coding the merging) containing (the first part was
from your sample -- I extended starting at 200 to test all
combinations):

(1,10)
(3,15)
(20,30)
(29,40)
(51,65)
(62,100)
(50,66)
(200, 210)
(199, 211)
(250, 260)
(251, 259)
(300, 310)
(301, 311)
(320, 330)
(319, 329)
(350, 360)
(360, 370)
(340, 350)

pythonw -u "Intervals.py"
Final list of intervals:
(1, 15)
(20, 40)
(50, 100)
(199, 211)
(250, 260)
(300, 311)
(319, 330)
(340, 370)
Exit code: 0

Note the first three result sets are correct for your list.


Since this feels like a homework assignment, I won't be posting my
code...

I only state that, 1) I did not bother sorting the interval list --
unless you have very many intervals to process, a simple sequential
search is probably faster than using bisection/insert algorithm; 2) the
algorithm inserts one interval at a time; 3) merge is accomplished by
removing/returning the first candidate interval to the insert method,
computing the min/max of the candidate and input intervals, and
recursively inserting that new interval -- if there is no candidate
overlap interval, no interval is returned, and the insert just appends
the input interval to the list.
--
Wulfraed Dennis Lee Bieber KD6MOG
(e-mail address removed) (e-mail address removed)
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: (e-mail address removed))
HTTP://www.bestiaria.com/
 
C

castironpi

Correct. I meant the final should be
        Actually, even that is incorrect -- note that ,30 overlaps 29,

Actually, given the specification, (overlaps > N count), (1,15), (20,
30), (29, 40), (50, 66), (62,100) is right, since 66-62= 4<= 5. [1]
        Since this feels like a homework assignment, I won't be posting my
code...

What are the pros and cons, Mr. Bieber? Is cl.py a nanny or a
resource? If someone asks for help, there's probably a reason, and if
they cheat, then they don't learn Python. Besides, two or three in a
row and that's justified. Besides, business programmers get plenty of
answers here-- that's cheating too. Besides, just 'cause you help
somebody else, doesn't mean everyone will. Besides, your solution
doesn't work for floats.

[1] > My ultimate goal is to produce a final set of intervals such
that not
 
M

Magdoll

Actually, even that is incorrect -- note that ,30 overlaps 29,

As someone else pointed out, I can't merge (1,30), (29,40) because N =
5. If N were always 1, I could've easily used IntervalSet to do this
and I would not have to write more than 5 lines total to get this
done. I'm asking precisely because N can be varied and maybe in the
future will be more complex than just an integer.
Since this feels like a homework assignment, I won't be posting my
code...

It isn't an homework assignment. I'm writing it for my research. Or
more precisely, I've already written a solution and wanted to see if
any people here know more salient solutions than mine
I only state that, 1) I did not bother sorting the interval list --
unless you have very many intervals to process, a simple sequential
search is probably faster than using bisection/insert algorithm;

Unfortunately that is the case. My input can be more than 1 million
intervals, although they can be further subdivided into smaller lists,
but still, I would be looking at a list of intervals on a size of at
least thousands, so binary search would definitely win over sequential
search.

2) the
algorithm inserts one interval at a time; 3) merge is accomplished by
removing/returning the first candidate interval to the insert method,
computing the min/max of the candidate and input intervals, and
recursively inserting that new interval -- if there is no candidate
overlap interval, no interval is returned, and the insert just appends
the input interval to the list.

This looks like pretty much what my current solution looks like,
except that it's unsorted.
 
R

Robert Bossy

Magdoll said:
Hi,

I have to read through a file that will give me a bunch of intervals.
My ultimate goal is to produce a final set of intervals such that not
two intervals overlap by more than N, where N is a predetermined
length.

For example, I could read through this input:
(1,10), (3,15), (20,30),(29,40),(51,65),(62,100),(50,66)

btw, the input is not guaranteed to be in any sorted order.

say N = 5, so the final set should be
(1,15), (20, 30), (29, 40), (50, 100)
Hi,

The problem, as stated here, may have several solutions. For instance
the following set of intervals also satisfies the constraint:
(1,15), (20,40), (50,100)

One question you should ask yourself is: do you want all solutions? or
just one?
If you want just one, there's another question: which one? the one with
the most intervals? any one?
If you want all of them, then I suggest using prolog rather than python
(I hope I won't be flamed for advocating another language here).

Is there already some existing code in Python that I can easily take
advantage of to produce this? Right now I've written my own simple
solution, which is just to maintain a list of the intervals. I can use
the Interval module, but it doesn't really affect much. I read one
interval from the input file at a time, and use bisect to insert it in
order. The problem comes with merging, which sometimes can be
cascading.

ex:
read (51,65) ==> put (51,65) in list
read (62,100) ==> put (62,100) in list (overlap only be 4 <= N)
read (50,66) ==> merge with (51,65) to become (50,66) ==> now can
merge with (62,100)
The way this algorithm is presented suggests an additional constraint:
you cannot merge two intervals if their overlap <= N. In that case,
there is a single solution indeed...
Nitpick: you cannot merge (50,66) and (62,100) since their overlap is
still <= 5.

If you have a reasonable number of intervals, you're algorithm seems
fine. But it is O(n**2), so in the case you read a lot of intervals and
you observe unsatisfying performances, you will have to store the
intervals in a cleverer data structure, see one of these:
http://en.wikipedia.org/wiki/Interval_tree
http://en.wikipedia.org/wiki/Segment_tree


Cheers,
RB
 
M

Magdoll

Hi,

The problem, as stated here, may have several solutions. For instance
the following set of intervals also satisfies the constraint:
(1,15), (20,40), (50,100)

One question you should ask yourself is: do you want all solutions? or
just one?
If you want just one, there's another question: which one? the one with
the most intervals? any one?

I actually don't know which solution I want, and that's why I keep
trying different solutions :p
If you want all of them, then I suggest using prolog rather than python
(I hope I won't be flamed for advocating another language here).

Will I be able to switch between using prolog & python back and forth
though? Cuz the bulk of my code will still be written in python and
this is just a very small part of it.
The way this algorithm is presented suggests an additional constraint:
you cannot merge two intervals if their overlap <= N. In that case,
there is a single solution indeed...
Nitpick: you cannot merge (50,66) and (62,100) since their overlap is
still <= 5.

Sorry I keep messing up my example. I meant, I would merge them if
they overlap by >= N....but anyways.
If you have a reasonable number of intervals, you're algorithm seems
fine. But it is O(n**2), so in the case you read a lot of intervals and
you observe unsatisfying performances, you will have to store the
intervals in a cleverer data structure, see one of these:http://en.wikipedia.org/wiki/Interval_treehttp://en.wikipedia.org/wiki/Segment_tree

Thanks! Both of these look interesting and potentially useful :)
 
D

Dennis Lee Bieber

As someone else pointed out, I can't merge (1,30), (29,40) because N =
5. If N were always 1, I could've easily used IntervalSet to do this
and I would not have to write more than 5 lines total to get this
done. I'm asking precisely because N can be varied and maybe in the
future will be more complex than just an integer.
Somehow I'd missed that particular constraint...
It isn't an homework assignment. I'm writing it for my research. Or
more precisely, I've already written a solution and wanted to see if
any people here know more salient solutions than mine
Okay -- it just came across as the type of problem one might find as
homework. In the absence of code and a discrete request for "what is
wrong" I felt it better to avoid posting my own code...


A few minutes did get my code to work with a "length of overlap"
constraint... Note that a literal reading of that requirement means that
a short interval, totally contained within a longer interval, will NOT
be merged if it does not have the required number of elements.
pythonw -u "Intervals.py"
Testing with required overlap of 0
Final list of intervals:
(1, 15)
(20, 40)
(50, 100)
(110, 120)
(250, 260)
(300, 311)
(319, 330)
(340, 370)
Testing with required overlap of 1
Final list of intervals:
(1, 15)
(20, 40)
(50, 100)
(110, 120)
(250, 260)
(300, 311)
(319, 330)
(340, 370)
Testing with required overlap of 2
Final list of intervals:
(1, 15)
(20, 40)
(50, 100)
(110, 120)
(250, 260)
(300, 311)
(319, 330)
(340, 350)
(350, 360)
(360, 370)
Testing with required overlap of 3
Final list of intervals:
(1, 15)
(20, 30)
(29, 40)
(50, 100)
(110, 120)
(250, 260)
(300, 311)
(319, 330)
(340, 350)
(350, 360)
(360, 370)
Testing with required overlap of 4
Final list of intervals:
(1, 15)
(20, 30)
(29, 40)
(50, 100)
(110, 120)
(113, 115)
(250, 260)
(300, 311)
(319, 330)
(340, 350)
(350, 360)
(360, 370)
Testing with required overlap of 5
Final list of intervals:
(1, 15)
(20, 30)
(29, 40)
(50, 100)
(110, 120)
(113, 115)
(250, 260)
(300, 311)
(319, 330)
(340, 350)
(350, 360)
(360, 370)
Exit code: 0
--
Wulfraed Dennis Lee Bieber KD6MOG
(e-mail address removed) (e-mail address removed)
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: (e-mail address removed))
HTTP://www.bestiaria.com/
 
R

Robert Bossy

Magdoll said:
I actually don't know which solution I want, and that's why I keep
trying different solutions :p
You should think about what is your data and what is probably the "best"
solution.

Will I be able to switch between using prolog & python back and forth
though? Cuz the bulk of my code will still be written in python and
this is just a very small part of it.
You'll have to popen a prolog interpreter and parse its output. Not very
sexy.
Moreover if you've never done prolog, well, you should be warned it's a
"different" language (but still beautiful) with an important learning
curve. Maybe not worth it for just one single problem.

Thanks! Both of these look interesting and potentially useful :)
Indeed. However these structures are clearly heavyweight if the number
of intervals is moderate. I would consider them only if I expected more
than several thousands of intervals.

Cheers,
RB
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,743
Messages
2,569,478
Members
44,898
Latest member
BlairH7607

Latest Threads

Top