merging intervals repeatedly

Discussion in 'Python' started by Magdoll, Mar 11, 2008.

  1. Magdoll

    Magdoll Guest

    Hi,

    I have to read through a file that will give me a bunch of intervals.
    My ultimate goal is to produce a final set of intervals such that not
    two intervals overlap by more than N, where N is a predetermined
    length.

    For example, I could read through this input:
    (1,10), (3,15), (20,30),(29,40),(51,65),(62,100),(50,66)

    btw, the input is not guaranteed to be in any sorted order.

    say N = 5, so the final set should be
    (1,15), (20, 30), (29, 40), (50, 100)

    Is there already some existing code in Python that I can easily take
    advantage of to produce this? Right now I've written my own simple
    solution, which is just to maintain a list of the intervals. I can use
    the Interval module, but it doesn't really affect much. I read one
    interval from the input file at a time, and use bisect to insert it in
    order. The problem comes with merging, which sometimes can be
    cascading.

    ex:
    read (51,65) ==> put (51,65) in list
    read (62,100) ==> put (62,100) in list (overlap only be 4 <= N)
    read (50,66) ==> merge with (51,65) to become (50,66) ==> now can
    merge with (62,100)

    Of course I can check for cascading merges by pivoting around the
    position where the insertion would've taken place...but just
    wondering, is there some other nice way to do this? I also intuitively
    don't sense a need for using trees, unless someone's already written
    it, the interface is easy to use, and it won't result in more insert/
    delete/structure changes that nullifies its beauty...(also I'm not
    using the output list for searching)


    Thanks in advance.
     
    Magdoll, Mar 11, 2008
    #1
    1. Advertising

  2. Magdoll

    Magdoll Guest

    Correct. I meant the final should be
    (1,30), (29,40), (50,100)

    On Mar 11, 3:41 pm, Magdoll <> wrote:
    > Hi,
    >
    > I have to read through a file that will give me a bunch of intervals.
    > My ultimate goal is to produce a final set of intervals such that not
    > two intervals overlap by more than N, where N is a predetermined
    > length.
    >
    > For example, I could read through this input:
    > (1,10), (3,15), (20,30),(29,40),(51,65),(62,100),(50,66)
    >
    > btw, the input is not guaranteed to be in any sorted order.
    >
    > say N = 5, so the final set should be
    > (1,15), (20, 30), (29, 40), (50, 100)
    >
    > Is there already some existing code in Python that I can easily take
    > advantage of to produce this? Right now I've written my own simple
    > solution, which is just to maintain a list of the intervals. I can use
    > the Interval module, but it doesn't really affect much. I read one
    > interval from the input file at a time, and use bisect to insert it in
    > order. The problem comes with merging, which sometimes can be
    > cascading.
    >
    > ex:
    > read (51,65) ==> put (51,65) in list
    > read (62,100) ==> put (62,100) in list (overlap only be 4 <= N)
    > read (50,66) ==> merge with (51,65) to become (50,66) ==> now can
    > merge with (62,100)
    >
    > Of course I can check for cascading merges by pivoting around the
    > position where the insertion would've taken place...but just
    > wondering, is there some other nice way to do this? I also intuitively
    > don't sense a need for using trees, unless someone's already written
    > it, the interface is easy to use, and it won't result in more insert/
    > delete/structure changes that nullifies its beauty...(also I'm not
    > using the output list for searching)
    >
    > Thanks in advance.
     
    Magdoll, Mar 11, 2008
    #2
    1. Advertising

  3. On Tue, 11 Mar 2008 15:43:40 -0700 (PDT), Magdoll <>
    declaimed the following in comp.lang.python:

    > Correct. I meant the final should be
    > (1,30), (29,40), (50,100)
    >

    Actually, even that is incorrect -- note that ,30 overlaps 29,

    Using an extended data file (note: I found parsing the data file
    took me longer than coding the merging) containing (the first part was
    from your sample -- I extended starting at 200 to test all
    combinations):

    (1,10)
    (3,15)
    (20,30)
    (29,40)
    (51,65)
    (62,100)
    (50,66)
    (200, 210)
    (199, 211)
    (250, 260)
    (251, 259)
    (300, 310)
    (301, 311)
    (320, 330)
    (319, 329)
    (350, 360)
    (360, 370)
    (340, 350)


    >pythonw -u "Intervals.py"

    Final list of intervals:
    (1, 15)
    (20, 40)
    (50, 100)
    (199, 211)
    (250, 260)
    (300, 311)
    (319, 330)
    (340, 370)
    >Exit code: 0


    Note the first three result sets are correct for your list.


    Since this feels like a homework assignment, I won't be posting my
    code...

    I only state that, 1) I did not bother sorting the interval list --
    unless you have very many intervals to process, a simple sequential
    search is probably faster than using bisection/insert algorithm; 2) the
    algorithm inserts one interval at a time; 3) merge is accomplished by
    removing/returning the first candidate interval to the insert method,
    computing the min/max of the candidate and input intervals, and
    recursively inserting that new interval -- if there is no candidate
    overlap interval, no interval is returned, and the insert just appends
    the input interval to the list.
    --
    Wulfraed Dennis Lee Bieber KD6MOG

    HTTP://wlfraed.home.netcom.com/
    (Bestiaria Support Staff: )
    HTTP://www.bestiaria.com/
     
    Dennis Lee Bieber, Mar 12, 2008
    #3
  4. Magdoll

    Guest

    > > Correct. I meant the final should be
    > > (1,30), (29,40), (50,100)

    >
    >         Actually, even that is incorrect -- note that ,30 overlaps 29,


    Actually, given the specification, (overlaps > N count), (1,15), (20,
    30), (29, 40), (50, 66), (62,100) is right, since 66-62= 4<= 5. [1]

    >         Since this feels like a homework assignment, I won't be posting my
    > code...


    What are the pros and cons, Mr. Bieber? Is cl.py a nanny or a
    resource? If someone asks for help, there's probably a reason, and if
    they cheat, then they don't learn Python. Besides, two or three in a
    row and that's justified. Besides, business programmers get plenty of
    answers here-- that's cheating too. Besides, just 'cause you help
    somebody else, doesn't mean everyone will. Besides, your solution
    doesn't work for floats.

    [1] > My ultimate goal is to produce a final set of intervals such
    that not
    > two intervals overlap by more than N, where N is a predetermined
    > length.
     
    , Mar 12, 2008
    #4
  5. Magdoll

    Magdoll Guest

    On Mar 11, 11:01 pm, Dennis Lee Bieber <> wrote:
    > On Tue, 11 Mar 2008 15:43:40 -0700 (PDT), Magdoll <>
    > declaimed the following in comp.lang.python:
    >
    > > Correct. I meant the final should be
    > > (1,30), (29,40), (50,100)

    >
    > Actually, even that is incorrect -- note that ,30 overlaps 29,


    As someone else pointed out, I can't merge (1,30), (29,40) because N =
    5. If N were always 1, I could've easily used IntervalSet to do this
    and I would not have to write more than 5 lines total to get this
    done. I'm asking precisely because N can be varied and maybe in the
    future will be more complex than just an integer.

    >
    > Since this feels like a homework assignment, I won't be posting my
    > code...


    It isn't an homework assignment. I'm writing it for my research. Or
    more precisely, I've already written a solution and wanted to see if
    any people here know more salient solutions than mine

    >
    > I only state that, 1) I did not bother sorting the interval list --
    > unless you have very many intervals to process, a simple sequential
    > search is probably faster than using bisection/insert algorithm;


    Unfortunately that is the case. My input can be more than 1 million
    intervals, although they can be further subdivided into smaller lists,
    but still, I would be looking at a list of intervals on a size of at
    least thousands, so binary search would definitely win over sequential
    search.

    2) the
    > algorithm inserts one interval at a time; 3) merge is accomplished by
    > removing/returning the first candidate interval to the insert method,
    > computing the min/max of the candidate and input intervals, and
    > recursively inserting that new interval -- if there is no candidate
    > overlap interval, no interval is returned, and the insert just appends
    > the input interval to the list.


    This looks like pretty much what my current solution looks like,
    except that it's unsorted.
     
    Magdoll, Mar 12, 2008
    #5
  6. Magdoll

    Robert Bossy Guest

    Magdoll wrote:
    > Hi,
    >
    > I have to read through a file that will give me a bunch of intervals.
    > My ultimate goal is to produce a final set of intervals such that not
    > two intervals overlap by more than N, where N is a predetermined
    > length.
    >
    > For example, I could read through this input:
    > (1,10), (3,15), (20,30),(29,40),(51,65),(62,100),(50,66)
    >
    > btw, the input is not guaranteed to be in any sorted order.
    >
    > say N = 5, so the final set should be
    > (1,15), (20, 30), (29, 40), (50, 100)
    >

    Hi,

    The problem, as stated here, may have several solutions. For instance
    the following set of intervals also satisfies the constraint:
    (1,15), (20,40), (50,100)

    One question you should ask yourself is: do you want all solutions? or
    just one?
    If you want just one, there's another question: which one? the one with
    the most intervals? any one?
    If you want all of them, then I suggest using prolog rather than python
    (I hope I won't be flamed for advocating another language here).


    > Is there already some existing code in Python that I can easily take
    > advantage of to produce this? Right now I've written my own simple
    > solution, which is just to maintain a list of the intervals. I can use
    > the Interval module, but it doesn't really affect much. I read one
    > interval from the input file at a time, and use bisect to insert it in
    > order. The problem comes with merging, which sometimes can be
    > cascading.
    >
    > ex:
    > read (51,65) ==> put (51,65) in list
    > read (62,100) ==> put (62,100) in list (overlap only be 4 <= N)
    > read (50,66) ==> merge with (51,65) to become (50,66) ==> now can
    > merge with (62,100)

    The way this algorithm is presented suggests an additional constraint:
    you cannot merge two intervals if their overlap <= N. In that case,
    there is a single solution indeed...
    Nitpick: you cannot merge (50,66) and (62,100) since their overlap is
    still <= 5.

    If you have a reasonable number of intervals, you're algorithm seems
    fine. But it is O(n**2), so in the case you read a lot of intervals and
    you observe unsatisfying performances, you will have to store the
    intervals in a cleverer data structure, see one of these:
    http://en.wikipedia.org/wiki/Interval_tree
    http://en.wikipedia.org/wiki/Segment_tree


    Cheers,
    RB
     
    Robert Bossy, Mar 12, 2008
    #6
  7. Magdoll

    Magdoll Guest


    > Hi,
    >
    > The problem, as stated here, may have several solutions. For instance
    > the following set of intervals also satisfies the constraint:
    > (1,15), (20,40), (50,100)
    >
    > One question you should ask yourself is: do you want all solutions? or
    > just one?
    > If you want just one, there's another question: which one? the one with
    > the most intervals? any one?


    I actually don't know which solution I want, and that's why I keep
    trying different solutions :p

    > If you want all of them, then I suggest using prolog rather than python
    > (I hope I won't be flamed for advocating another language here).


    Will I be able to switch between using prolog & python back and forth
    though? Cuz the bulk of my code will still be written in python and
    this is just a very small part of it.

    >
    > > Is there already some existing code in Python that I can easily take
    > > advantage of to produce this? Right now I've written my own simple
    > > solution, which is just to maintain a list of the intervals. I can use
    > > the Interval module, but it doesn't really affect much. I read one
    > > interval from the input file at a time, and use bisect to insert it in
    > > order. The problem comes with merging, which sometimes can be
    > > cascading.

    >
    > > ex:
    > > read (51,65) ==> put (51,65) in list
    > > read (62,100) ==> put (62,100) in list (overlap only be 4 <= N)
    > > read (50,66) ==> merge with (51,65) to become (50,66) ==> now can
    > > merge with (62,100)

    >
    > The way this algorithm is presented suggests an additional constraint:
    > you cannot merge two intervals if their overlap <= N. In that case,
    > there is a single solution indeed...
    > Nitpick: you cannot merge (50,66) and (62,100) since their overlap is
    > still <= 5.


    Sorry I keep messing up my example. I meant, I would merge them if
    they overlap by >= N....but anyways.

    >
    > If you have a reasonable number of intervals, you're algorithm seems
    > fine. But it is O(n**2), so in the case you read a lot of intervals and
    > you observe unsatisfying performances, you will have to store the
    > intervals in a cleverer data structure, see one of these:http://en.wikipedia.org/wiki/Interval_treehttp://en.wikipedia.org/wiki/Segment_tree


    Thanks! Both of these look interesting and potentially useful :)
     
    Magdoll, Mar 12, 2008
    #7
  8. On Tue, 11 Mar 2008 23:30:16 -0700 (PDT), Magdoll <>
    declaimed the following in comp.lang.python:


    > As someone else pointed out, I can't merge (1,30), (29,40) because N =
    > 5. If N were always 1, I could've easily used IntervalSet to do this
    > and I would not have to write more than 5 lines total to get this
    > done. I'm asking precisely because N can be varied and maybe in the
    > future will be more complex than just an integer.
    >

    Somehow I'd missed that particular constraint...

    > >
    > > Since this feels like a homework assignment, I won't be posting my
    > > code...

    >
    > It isn't an homework assignment. I'm writing it for my research. Or
    > more precisely, I've already written a solution and wanted to see if
    > any people here know more salient solutions than mine
    >

    Okay -- it just came across as the type of problem one might find as
    homework. In the absence of code and a discrete request for "what is
    wrong" I felt it better to avoid posting my own code...


    A few minutes did get my code to work with a "length of overlap"
    constraint... Note that a literal reading of that requirement means that
    a short interval, totally contained within a longer interval, will NOT
    be merged if it does not have the required number of elements.

    >pythonw -u "Intervals.py"

    Testing with required overlap of 0
    Final list of intervals:
    (1, 15)
    (20, 40)
    (50, 100)
    (110, 120)
    (250, 260)
    (300, 311)
    (319, 330)
    (340, 370)
    Testing with required overlap of 1
    Final list of intervals:
    (1, 15)
    (20, 40)
    (50, 100)
    (110, 120)
    (250, 260)
    (300, 311)
    (319, 330)
    (340, 370)
    Testing with required overlap of 2
    Final list of intervals:
    (1, 15)
    (20, 40)
    (50, 100)
    (110, 120)
    (250, 260)
    (300, 311)
    (319, 330)
    (340, 350)
    (350, 360)
    (360, 370)
    Testing with required overlap of 3
    Final list of intervals:
    (1, 15)
    (20, 30)
    (29, 40)
    (50, 100)
    (110, 120)
    (250, 260)
    (300, 311)
    (319, 330)
    (340, 350)
    (350, 360)
    (360, 370)
    Testing with required overlap of 4
    Final list of intervals:
    (1, 15)
    (20, 30)
    (29, 40)
    (50, 100)
    (110, 120)
    (113, 115)
    (250, 260)
    (300, 311)
    (319, 330)
    (340, 350)
    (350, 360)
    (360, 370)
    Testing with required overlap of 5
    Final list of intervals:
    (1, 15)
    (20, 30)
    (29, 40)
    (50, 100)
    (110, 120)
    (113, 115)
    (250, 260)
    (300, 311)
    (319, 330)
    (340, 350)
    (350, 360)
    (360, 370)
    >Exit code: 0

    --
    Wulfraed Dennis Lee Bieber KD6MOG

    HTTP://wlfraed.home.netcom.com/
    (Bestiaria Support Staff: )
    HTTP://www.bestiaria.com/
     
    Dennis Lee Bieber, Mar 13, 2008
    #8
  9. Magdoll

    Robert Bossy Guest

    Magdoll wrote:
    >> One question you should ask yourself is: do you want all solutions? or
    >> just one?
    >> If you want just one, there's another question: which one? the one with
    >> the most intervals? any one?
    >>

    >
    > I actually don't know which solution I want, and that's why I keep
    > trying different solutions :p
    >

    You should think about what is your data and what is probably the "best"
    solution.


    >> If you want all of them, then I suggest using prolog rather than python
    >> (I hope I won't be flamed for advocating another language here).
    >>

    >
    > Will I be able to switch between using prolog & python back and forth
    > though? Cuz the bulk of my code will still be written in python and
    > this is just a very small part of it.
    >

    You'll have to popen a prolog interpreter and parse its output. Not very
    sexy.
    Moreover if you've never done prolog, well, you should be warned it's a
    "different" language (but still beautiful) with an important learning
    curve. Maybe not worth it for just one single problem.


    >> If you have a reasonable number of intervals, you're algorithm seems
    >> fine. But it is O(n**2), so in the case you read a lot of intervals and
    >> you observe unsatisfying performances, you will have to store the
    >> intervals in a cleverer data structure, see one of these:http://en.wikipedia.org/wiki/Interval_treehttp://en.wikipedia.org/wiki/Segment_tree
    >>

    >
    > Thanks! Both of these look interesting and potentially useful :)
    >

    Indeed. However these structures are clearly heavyweight if the number
    of intervals is moderate. I would consider them only if I expected more
    than several thousands of intervals.

    Cheers,
    RB
     
    Robert Bossy, Mar 14, 2008
    #9
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. SteveLB
    Replies:
    0
    Views:
    337
    SteveLB
    Aug 8, 2003
  2. Jarrod Hermer
    Replies:
    2
    Views:
    2,236
    kahlawat
    Jul 13, 2007
  3. Dan
    Replies:
    4
    Views:
    485
    Mr Newbie
    Dec 30, 2005
  4. =?Utf-8?B?TWljaGFlbCBFLiBPLiBCb3JjaGVydA==?=

    Integrated Security Problem? - User must log in repeatedly

    =?Utf-8?B?TWljaGFlbCBFLiBPLiBCb3JjaGVydA==?=, Jan 18, 2006, in forum: ASP .Net
    Replies:
    1
    Views:
    365
    Mark Milley
    Jan 19, 2006
  5. Replies:
    24
    Views:
    1,737
    alchemist
    Aug 4, 2005
Loading...

Share This Page