(e-mail address removed) writes:
[...]
Well since I attracted a couple people's attention I will describe the
problem in more detail. Describing the problem properly is probably as
hard as solving it, so excuse me if I struggle a bit.
The problem is for a health insurance company and involves the period
of time a person is covered. Most insurance companies allow not only
for the main member to be insured but his family: the spouse and the
dependents (children). This additional coverage costs extra but less
than a full new insurance. So for example if Alice buys an insurance
worth at 100 dollars a month, she can insure her husband Bob for an
additional 50 dollars. Under certain circumstances Alice may go off
the insurance and only Bob stays. In that case the price goes back to
100 dollars or maybe there is a deal for 80 or something like that. In
other words the cost of the insurance is dependent on the combination
of family members that participate in it. Additionally not only do we
have different family compositions but also different insurance
products. So you can get medical, dental and vision insurance.
All that data is stored in a database that is not very tidy and looks
something like this:
First Day of Coverage, Last Day of Coverage, Relationship, Product
5/3/2005, 5/3/2005, D, M
9/10/2005, 10/10/2005, S, V
3/15/2005, 7/15/2005, M, M
3/1/2005, 6/1/2005, S, M
5/15/2005, 7/20/2005, D, D
9/10/2005, 1/1/2140, M, V
2/1/2005, 5/3/2005, M, M
Where
Relationship: M=Main Member, S=Spouse, D=Dependent
Product: M=Medical, D=Dental, V=Vision
As far as I know at the present time there are no deals based on
combination of products purchased so we will handle each product
independently. What I want is a simple algorithm that allows me to
calculate something like this out of it (hopefully I didn’t make a
mistake):
Medical:
2/1/2005, 2/28/2005, M
3/1/2005, 5/2/2005, M&S
5/3/2005, 5/3/2005, M&S&D
5/4/2005, 6/1/2005, M&S
6/2/2005, 7/15/2005, M
Dental:
5/15/2005, 7/20/2005, D
Vision:
9/10/2005, 10/10/2005, M&S
10/11/2005, 1/1/2140, M
OK the approach I describe in my previous email works fine for this
particular problem. I implement it below - the function walk_ivals is at
the heart of it, I've made it as simple as possible and it's pretty
clear that it is O(nlogn).
The function that actually sorts the data is union(), and it's just a
call to walk_ivals with callback the function acc() which is constant
time, therefore union() itself has the same complexity as walk_ivals.
There are no comments - I don't have the time to add any, sorry!
--------------------------------------------------
import datetime
from collections import defaultdict
def walk_ivals(ivals, callback, endvals=(-1, 1)):
endpoints = [(x, data) for ival, data in ivals for x in zip(ival, endvals)]
endpoints.sort()
for (endpoint, endval), data in endpoints:
callback(endpoint, endval, data)
def union(ivals):
timelines = defaultdict(list)
mult = defaultdict(lambda: defaultdict(int))
def acc(date, step, data):
rel, prod = data
old_mult = mult[prod][rel]
mult[prod][rel] -= step
if not(old_mult and mult[prod][rel]):
rels = [rel for rel, m in mult[prod].iteritems() if m]
if timelines[prod] and timelines[prod][-1][0] == date:
timelines[prod][-1][1] = rels
else:
timelines[prod].append([date, rels])
walk_ivals(ivals, acc)
return timelines
test_data = """5/3/2005, 5/3/2005, D, M
9/10/2005, 10/10/2005, S, V
3/15/2005, 7/15/2005, M, M
3/1/2005, 6/1/2005, S, M
5/15/2005, 7/20/2005, D, D
9/10/2005, 1/1/2140, M, V
2/1/2005, 5/3/2005, M, M"""
def parse_date(date_string):
month, day, year = map(int, date_string.split('/'))
return datetime.date(year, month, day)
def parse_data(data_string):
data = []
for line in data_string.split("\n"):
start, end, rel, prod = line.split(",")
start, end = map(parse_date, (start + datetime.timedelta(1), end))
rel, prod = rel.strip(), prod.strip()
data.append([(start, end), (rel, prod)])
return data
def test():
ivals = parse_data(test_data)
timelines = union(ivals)
for prod, timeline in timelines.iteritems():
print "-"*20
print "Product", prod
for date, covers in timeline:
print date, ' & '.join(covers)
if __name__ == '__main__':
test()
--------------------------------------------------
Here is what it outputs:
marigold:junk arno$ python intervals2.py
--------------------
Product M
2005-02-01 M
2005-03-01 S & M
2005-05-03 S & M & D
2005-05-04 S & M
2005-06-02 M
2005-07-16
--------------------
Product D
2005-05-15 D
2005-07-21
--------------------
Product V
2005-09-10 S & M
2005-10-11 M
2140-01-02
--------------------------------------------------
The date output is slightly different from yours - I didn't realise you
had time intervals and now I don't have the time to change it, although
it's only a cosmetic change (the end-date of each interval is the
start-date of the next one minus 1 day).
HTH
PS: warning - I have done some minor editing of the code after pasting
it, so it might break (although it should be fine).