merging intervals repeatedly

Discussion in 'Python' started by Magdoll, Mar 11, 2008.

  1. Magdoll

    Magdoll Guest

    Hi,

    I have to read through a file that will give me a bunch of intervals.
    My ultimate goal is to produce a final set of intervals such that not
    two intervals overlap by more than N, where N is a predetermined
    length.

    For example, I could read through this input:
    (1,10), (3,15), (20,30),(29,40),(51,65),(62,100),(50,66)

    btw, the input is not guaranteed to be in any sorted order.

    say N = 5, so the final set should be
    (1,15), (20, 30), (29, 40), (50, 100)

    Is there already some existing code in Python that I can easily take
    advantage of to produce this? Right now I've written my own simple
    solution, which is just to maintain a list of the intervals. I can use
    the Interval module, but it doesn't really affect much. I read one
    interval from the input file at a time, and use bisect to insert it in
    order. The problem comes with merging, which sometimes can be
    cascading.

    ex:
    read (51,65) ==> put (51,65) in list
    read (62,100) ==> put (62,100) in list (overlap only be 4 <= N)
    read (50,66) ==> merge with (51,65) to become (50,66) ==> now can
    merge with (62,100)

    Of course I can check for cascading merges by pivoting around the
    position where the insertion would've taken place...but just
    wondering, is there some other nice way to do this? I also intuitively
    don't sense a need for using trees, unless someone's already written
    it, the interface is easy to use, and it won't result in more insert/
    delete/structure changes that nullifies its beauty...(also I'm not
    using the output list for searching)


    Thanks in advance.
     
    Magdoll, Mar 11, 2008
    #1
    1. Advertisements

  2. Magdoll

    Magdoll Guest

    Correct. I meant the final should be
    (1,30), (29,40), (50,100)
     
    Magdoll, Mar 11, 2008
    #2
    1. Advertisements

  3. Actually, even that is incorrect -- note that ,30 overlaps 29,

    Using an extended data file (note: I found parsing the data file
    took me longer than coding the merging) containing (the first part was
    from your sample -- I extended starting at 200 to test all
    combinations):

    (1,10)
    (3,15)
    (20,30)
    (29,40)
    (51,65)
    (62,100)
    (50,66)
    (200, 210)
    (199, 211)
    (250, 260)
    (251, 259)
    (300, 310)
    (301, 311)
    (320, 330)
    (319, 329)
    (350, 360)
    (360, 370)
    (340, 350)

    Final list of intervals:
    (1, 15)
    (20, 40)
    (50, 100)
    (199, 211)
    (250, 260)
    (300, 311)
    (319, 330)
    (340, 370)
    Note the first three result sets are correct for your list.


    Since this feels like a homework assignment, I won't be posting my
    code...

    I only state that, 1) I did not bother sorting the interval list --
    unless you have very many intervals to process, a simple sequential
    search is probably faster than using bisection/insert algorithm; 2) the
    algorithm inserts one interval at a time; 3) merge is accomplished by
    removing/returning the first candidate interval to the insert method,
    computing the min/max of the candidate and input intervals, and
    recursively inserting that new interval -- if there is no candidate
    overlap interval, no interval is returned, and the insert just appends
    the input interval to the list.
    --
    Wulfraed Dennis Lee Bieber KD6MOG

    HTTP://wlfraed.home.netcom.com/
    (Bestiaria Support Staff: )
    HTTP://www.bestiaria.com/
     
    Dennis Lee Bieber, Mar 12, 2008
    #3
  4. Magdoll

    castironpi Guest

    Correct. I meant the final should be
    Actually, given the specification, (overlaps > N count), (1,15), (20,
    30), (29, 40), (50, 66), (62,100) is right, since 66-62= 4<= 5. [1]
    What are the pros and cons, Mr. Bieber? Is cl.py a nanny or a
    resource? If someone asks for help, there's probably a reason, and if
    they cheat, then they don't learn Python. Besides, two or three in a
    row and that's justified. Besides, business programmers get plenty of
    answers here-- that's cheating too. Besides, just 'cause you help
    somebody else, doesn't mean everyone will. Besides, your solution
    doesn't work for floats.

    [1] > My ultimate goal is to produce a final set of intervals such
    that not
     
    castironpi, Mar 12, 2008
    #4
  5. Magdoll

    Magdoll Guest

    As someone else pointed out, I can't merge (1,30), (29,40) because N =
    5. If N were always 1, I could've easily used IntervalSet to do this
    and I would not have to write more than 5 lines total to get this
    done. I'm asking precisely because N can be varied and maybe in the
    future will be more complex than just an integer.
    It isn't an homework assignment. I'm writing it for my research. Or
    more precisely, I've already written a solution and wanted to see if
    any people here know more salient solutions than mine
    Unfortunately that is the case. My input can be more than 1 million
    intervals, although they can be further subdivided into smaller lists,
    but still, I would be looking at a list of intervals on a size of at
    least thousands, so binary search would definitely win over sequential
    search.

    2) the
    This looks like pretty much what my current solution looks like,
    except that it's unsorted.
     
    Magdoll, Mar 12, 2008
    #5
  6. Magdoll

    Robert Bossy Guest

    Hi,

    The problem, as stated here, may have several solutions. For instance
    the following set of intervals also satisfies the constraint:
    (1,15), (20,40), (50,100)

    One question you should ask yourself is: do you want all solutions? or
    just one?
    If you want just one, there's another question: which one? the one with
    the most intervals? any one?
    If you want all of them, then I suggest using prolog rather than python
    (I hope I won't be flamed for advocating another language here).

    The way this algorithm is presented suggests an additional constraint:
    you cannot merge two intervals if their overlap <= N. In that case,
    there is a single solution indeed...
    Nitpick: you cannot merge (50,66) and (62,100) since their overlap is
    still <= 5.

    If you have a reasonable number of intervals, you're algorithm seems
    fine. But it is O(n**2), so in the case you read a lot of intervals and
    you observe unsatisfying performances, you will have to store the
    intervals in a cleverer data structure, see one of these:
    http://en.wikipedia.org/wiki/Interval_tree
    http://en.wikipedia.org/wiki/Segment_tree


    Cheers,
    RB
     
    Robert Bossy, Mar 12, 2008
    #6
  7. Magdoll

    Magdoll Guest

    I actually don't know which solution I want, and that's why I keep
    trying different solutions :p
    Will I be able to switch between using prolog & python back and forth
    though? Cuz the bulk of my code will still be written in python and
    this is just a very small part of it.
    Sorry I keep messing up my example. I meant, I would merge them if
    they overlap by >= N....but anyways.
    Thanks! Both of these look interesting and potentially useful :)
     
    Magdoll, Mar 12, 2008
    #7
  8. Somehow I'd missed that particular constraint...
    Okay -- it just came across as the type of problem one might find as
    homework. In the absence of code and a discrete request for "what is
    wrong" I felt it better to avoid posting my own code...


    A few minutes did get my code to work with a "length of overlap"
    constraint... Note that a literal reading of that requirement means that
    a short interval, totally contained within a longer interval, will NOT
    be merged if it does not have the required number of elements.
    Testing with required overlap of 0
    Final list of intervals:
    (1, 15)
    (20, 40)
    (50, 100)
    (110, 120)
    (250, 260)
    (300, 311)
    (319, 330)
    (340, 370)
    Testing with required overlap of 1
    Final list of intervals:
    (1, 15)
    (20, 40)
    (50, 100)
    (110, 120)
    (250, 260)
    (300, 311)
    (319, 330)
    (340, 370)
    Testing with required overlap of 2
    Final list of intervals:
    (1, 15)
    (20, 40)
    (50, 100)
    (110, 120)
    (250, 260)
    (300, 311)
    (319, 330)
    (340, 350)
    (350, 360)
    (360, 370)
    Testing with required overlap of 3
    Final list of intervals:
    (1, 15)
    (20, 30)
    (29, 40)
    (50, 100)
    (110, 120)
    (250, 260)
    (300, 311)
    (319, 330)
    (340, 350)
    (350, 360)
    (360, 370)
    Testing with required overlap of 4
    Final list of intervals:
    (1, 15)
    (20, 30)
    (29, 40)
    (50, 100)
    (110, 120)
    (113, 115)
    (250, 260)
    (300, 311)
    (319, 330)
    (340, 350)
    (350, 360)
    (360, 370)
    Testing with required overlap of 5
    Final list of intervals:
    (1, 15)
    (20, 30)
    (29, 40)
    (50, 100)
    (110, 120)
    (113, 115)
    (250, 260)
    (300, 311)
    (319, 330)
    (340, 350)
    (350, 360)
    (360, 370)
    --
    Wulfraed Dennis Lee Bieber KD6MOG

    HTTP://wlfraed.home.netcom.com/
    (Bestiaria Support Staff: )
    HTTP://www.bestiaria.com/
     
    Dennis Lee Bieber, Mar 13, 2008
    #8
  9. Magdoll

    Robert Bossy Guest

    You should think about what is your data and what is probably the "best"
    solution.

    You'll have to popen a prolog interpreter and parse its output. Not very
    sexy.
    Moreover if you've never done prolog, well, you should be warned it's a
    "different" language (but still beautiful) with an important learning
    curve. Maybe not worth it for just one single problem.

    Indeed. However these structures are clearly heavyweight if the number
    of intervals is moderate. I would consider them only if I expected more
    than several thousands of intervals.

    Cheers,
    RB
     
    Robert Bossy, Mar 14, 2008
    #9
    1. Advertisements

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments (here). After that, you can post your question and our members will help you out.