[SUMMARY] Preferable Pairs (#165)

Discussion in 'Ruby' started by Matthew Moss, Jun 12, 2008.

  1. Matthew Moss

    Matthew Moss Guest

    [Note: parts of this message were removed to make it a legal post.]

    Apologies for being a little late today... Took a little bit to understand
    some of what was going on in the solutions for the Preferable Pairs quiz,
    and I'm not certain I explained it well enough, but... here it is.


    This "Preferable Pairs" quiz is, as was noted, very similar to the [Stable
    Roommates][1] problem, though the goal as originally stated is slightly
    different from the typical presentation of the Stable Roommates problem.
    Still, the minor goal difference does not change that this is an
    [NP-complete][2] problem, and as such, large input sets would require some
    sort of approximation, heuristics or other techniques to keep the run time
    reasonable.

    Fortunately, I had no intention of testing the solutions provided on large
    sets. My test involved a set of 20 people, to be grouped into 10 pairs
    according to their preferences. Here are the results of my test on the
    solutions provided.

    Solution        Time    Score
    Andrea Fazzi    0.159   438
    Dustin Barker   0.020   589
    Eric Ivancich   4.671   311
    Steven Hahn     -DNF-
    Eric Mahurin    0.211   311
    Matthew Moss    0.022   589
    Thomas ML       0.114   311

    I neglected to post my own solution, but it's nearly identical to Dustin's
    in algorithm, results and performance (though mine is much uglier). Also,
    apologies to Steven... I tried to wait, really I did, but it just kept
    going, and going...

    Andrea, Dustin and I made straightforward attempts that were fast but
    not optimal. While the numbers presented above don't show a huge disparity
    in performance, I suspect that would be a different story if we were
    attempting to solve larger datasets, of thousands of people as opposed to
    twenty. For the tested dataset, I believe there might be a few grumbling
    people, forced to play with someone they didn't like very much, but the
    tournament would go on.

    Eric Ivancich provided a genetic algorithm that is notably slower (at this
    sample size) and isn't guaranteed to find the optimal answer, but it
    happened to do so here. It would be interesting to compare its performance
    against the optimal solution algorithms for larger data sets.

    Eric Mahurin and Thomas ML provided algorithms that find the optimal
    solution. There is a fair bit of setup code in both to get the data into a
    convenient form for the optimization algorithm to work. I'm going to skip
    over that and jump right into Thomas' `optimize` method. As input to
    `optimize`, the `pairings` argument looks like this for the sample data:

    [
      [["David", "Helen"], 0],
      [["David", "Vicki"], 1],
      [["Helen", "Vicki"], 2],
      [["Joseph", "Vicki"], 4],
      [["Helen", "Joseph"], 5],
      [["David", "Joseph"], 8]
    ]

    Basically, the input is an ordered list of all pairs with the corresponding
    score (calculated according to the metric described by the quiz). It
    is a combined score, and reflects the preferences of both people in the
    pair. For example, the pair of Helen and Joseph has a score of 5, because
    Helen scores Joseph as 4, while Joseph scores Helen as 1: so, 4 + 1 == 5.
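
    Those combined scores can be reproduced with a small scoring function. The
    metric sketched here (square each person's ranking of the other, where 0 is
    a first choice, then sum the two squares) and the preference lists below
    are reverse-engineered guesses, not data given in this thread; they do,
    however, match all six sample scores above.

    ```ruby
    # Hypothetical preference lists, reverse-engineered to be consistent
    # with the six sample pair scores above.
    prefs = {
      "David"  => %w[Helen Vicki Joseph],
      "Helen"  => %w[David Vicki Joseph],
      "Joseph" => %w[Vicki Helen David],
      "Vicki"  => %w[David Helen Joseph]
    }

    # A pair's combined score: each partner's rank of the other (0 = first
    # choice), squared, then summed.
    def pair_score(prefs, a, b)
      prefs[a].index(b) ** 2 + prefs[b].index(a) ** 2
    end

    pair_score(prefs, "Helen", "Joseph")  # => 5  (Helen: 2**2, Joseph: 1**2)
    pair_score(prefs, "David", "Helen")   # => 0  (mutual first choices)
    ```

    Under this reading, "Helen scores Joseph as 4" means Joseph is Helen's
    last choice of three (rank 2, squared).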

    Before we conquer the whole of the algorithm, let's look at a simplified
    version of `optimize` to get a feel for the general structure. The input
    here (for argument `pairings`) is similar to the above, but without the
    scores.

    def optimize(pairings, names, pos, posmax)
      bestpairs = nil
      while pos < posmax
        pair = pairings[pos]
        pos += 1
        if names & pair == pair
          names1 = names - pair
          if names1.size < 2
            bestpairs = [pair]
            bestpairs << names1 unless names1.empty?
            return bestpairs
          elsif (rv = optimize(pairings, names1, pos, posmax))
            bestpairs = rv
            bestpairs << pair
          end
        end
      end
      return bestpairs
    end

    `bestpairs` starts off as `nil`, but by the end should be an array of pairs
    chosen as the solution. Each iteration of the loop grabs the next pair from
    `pairings` and checks, via set intersection, whether it is still usable:

    if names & pair == pair

    `names` will contain the names of people still available; that is, those
    that have not already been chosen previously. The `&` operator treats the
    two arrays as sets and performs set intersection. If the result of
    intersection is equal to one of the arguments, that argument must be a
    subset of the other, and in this context, is safe to choose for the
    solution.
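
    In isolation, the subset check behaves like this (the names here are just
    illustrative):

    ```ruby
    # Array#& returns the elements common to both arrays, ordered as in the
    # receiver, with duplicates removed.
    names = %w[David Helen Joseph Vicki]

    names & %w[Helen Joseph]                        # => ["Helen", "Joseph"]
    names & %w[Helen Xavier]                        # => ["Helen"]

    (names & %w[Helen Joseph]) == %w[Helen Joseph]  # => true:  pair usable
    (names & %w[Helen Xavier]) == %w[Helen Xavier]  # => false: Xavier gone
    ```

    One subtlety: the result follows the receiver's order, so the `== pair`
    comparison only works because every pair is stored in the same sorted
    order as `names`.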

    When a pair is chosen, those names are removed by Array difference (not
    quite the same as set difference, but close enough for this case). So it is
    with:

    names1 = names - pair

    that we remove the chosen pair from the list of remaining people. If that
    was the last possible pair to be made (fewer than 2 names remain), we
    finish up by setting and returning the `bestpairs` array:

    bestpairs = [pair]
    bestpairs << names1 unless names1.empty?
    return bestpairs

    In the case that there is at least one more pair remaining, we continue onto
    the recursive part of this solution:

    elsif (rv = optimize(pairings, names1, pos, posmax))
      bestpairs = rv
      bestpairs << pair

    We know that `bestpairs`, as returned by the recursive call to `optimize`,
    contains the best pairs from the subset of `names1`, which we got above
    after removing `pair`. So we make sure to concatenate `pair` onto
    `bestpairs` before it is returned to the caller.

    As simplified, this amounts to a greedy algorithm, since the pairs were
    initially sorted according to score, and the simplified algorithm, at each
    level, simply takes the first pair possible.
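
    To see that structure in action, here is the simplified method run
    end-to-end on the four-person sample (the method is repeated so the
    snippet stands alone; `pairings` is the score-ordered list shown earlier,
    minus the scores):

    ```ruby
    # The simplified optimize, as discussed above.
    def optimize(pairings, names, pos, posmax)
      bestpairs = nil
      while pos < posmax
        pair = pairings[pos]
        pos += 1
        if names & pair == pair
          names1 = names - pair
          if names1.size < 2
            bestpairs = [pair]
            bestpairs << names1 unless names1.empty?
            return bestpairs
          elsif (rv = optimize(pairings, names1, pos, posmax))
            bestpairs = rv
            bestpairs << pair
          end
        end
      end
      return bestpairs
    end

    # The sample pairings, ordered by score, without the scores.
    pairings = [
      %w[David Helen], %w[David Vicki], %w[Helen Vicki],
      %w[Joseph Vicki], %w[Helen Joseph], %w[David Joseph]
    ]
    names = %w[David Helen Joseph Vicki]

    result = optimize(pairings, names, 0, pairings.size)
    # result holds two pairs covering all four names
    ```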

    Now let's go back to the complete `optimize` method.

    def optimize(pairings, names, pos, posmax, maxweight=nil)
      bestpairs = nil
      maxweight ||= pairings.size ** 2 + 1
      while pos < posmax
        pair, weight1 = pairings[pos]
        break unless weight1 * (names.size / 2).floor < maxweight
        pos += 1
        if names & pair == pair
          names1 = names - pair
          if names1.size < 2
            bestpairs = [pair]
            bestpairs << names1 unless names1.empty?
            return [weight1, bestpairs]
          elsif (rv = optimize(pairings, names1, pos, posmax,
                               maxweight - weight1))
            maxweight, bestpairs = rv
            maxweight += weight1
            bestpairs << pair
          end
        end
      end
      return bestpairs && [maxweight, bestpairs]
    end

    The algorithm structure is basically the same: recursive, greedily selecting
    pairs and removing those names from the set of names available... The major
    difference here is the inclusion of the weights (i.e. scores) and the
    possible rejection of pairs with respect to those weights.

    As recursive calls to `optimize` are made, the `maxweight` value is updated
    via this code:

    maxweight, bestpairs = rv
    maxweight += weight1

    and checked via this code:

    break unless weight1 * (names.size / 2).floor < maxweight

    So a pair may be skipped if its weight (multiplied by the number of pairs
    still to be formed) would push the total past the current `maxweight`, as
    determined by the best set of choices found so far.

    When a possible solution is found, the `maxweight` variable represents the
    total score for that solution. But the algorithm does not stop immediately;
    it keeps checking pairs and combinations of pairs, rejecting those solutions
    and partial solutions whose total score (i.e. `maxweight`) would meet or
    exceed the best `maxweight` found so far.

    In the end, the solution reported is the collection of pairs with the lowest
    total score, and is returned via the original invocation of `optimize`.
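
    Run on the sample data, the full method returns the minimal total score
    together with the pairs that achieve it (the method is repeated here so
    the snippet stands alone):

    ```ruby
    # Thomas's full optimize, as discussed above.
    def optimize(pairings, names, pos, posmax, maxweight = nil)
      bestpairs = nil
      maxweight ||= pairings.size ** 2 + 1
      while pos < posmax
        pair, weight1 = pairings[pos]
        break unless weight1 * (names.size / 2).floor < maxweight
        pos += 1
        if names & pair == pair
          names1 = names - pair
          if names1.size < 2
            bestpairs = [pair]
            bestpairs << names1 unless names1.empty?
            return [weight1, bestpairs]
          elsif (rv = optimize(pairings, names1, pos, posmax,
                               maxweight - weight1))
            maxweight, bestpairs = rv
            maxweight += weight1
            bestpairs << pair
          end
        end
      end
      return bestpairs && [maxweight, bestpairs]
    end

    # The sample pairings, ordered by score.
    pairings = [
      [%w[David Helen], 0], [%w[David Vicki], 1], [%w[Helen Vicki], 2],
      [%w[Joseph Vicki], 4], [%w[Helen Joseph], 5], [%w[David Joseph], 8]
    ]
    names = %w[David Helen Joseph Vicki]

    score, pairs = optimize(pairings, names, 0, pairings.size)
    # => score 4, from pairing David with Helen (0) and Joseph with Vicki (4)
    ```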

    [1]: http://en.wikipedia.org/wiki/Stable_roommates_problem
    [2]: http://en.wikipedia.org/wiki/Np_complete



    --

    Matthew Moss <>
     
    Matthew Moss, Jun 12, 2008
    #1

  2. Eric Mahurin

    Eric Mahurin Guest


    On 6/12/08, Matthew Moss <> wrote:

    > Solution        Time    Score
    > Andrea Fazzi    0.159   438
    > Dustin Barker   0.020   589
    > Eric Ivancich   4.671   311
    > Steven Hahn     -DNF-
    > Eric Mahurin    0.211   311
    > Matthew Moss    0.022   589
    > Thomas ML       0.114   311



    Did anybody try to analyze the complexity of the final solution from Thomas?

    The problem with analyzing his solution is this line:

    break unless weight1 * (names.size / 2).floor < maxweight


    This makes the run time data-dependent. It seems like this line should
    only be needed to improve the run time, but commenting it out gives
    non-optimal results. Don't get it. If you ignore this line, the
    analysis is easier. There are O(n**2) pairs and each recursive call
    decrements n by 2. When you get to a single pair, it is O(1). So,
    O(n**2*(n-2)**2*(n-4)**2*...*2**2*1) or O(n!**2). It is probably more, but
    most of this is hidden in the C code (for these size inputs, we can assume
    every C method call is O(1)).

    My solution is easy to analyze because it doesn't have any data-dependent
    variation. It should need to fill an O(2**n) memoization cache, which takes
    O(n) for each entry - so O(n*2**n).
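
    The memoized approach Eric describes can be sketched as follows. This is
    not his submission, just a minimal illustration of caching the optimal
    cost per subset of unpaired names (keyed here by the sorted name list
    rather than a bitmask):

    ```ruby
    # Minimum-cost perfect matching via memoized recursion: fix the first
    # unpaired name, try every partner, and cache the optimal cost for each
    # subset of names so each subset is solved only once.
    # `names` must stay sorted so [first, partner] matches the cost keys.
    def best_cost(names, cost, memo = {})
      return 0 if names.empty?
      memo[names] ||= begin
        first, *rest = names
        rest.each_with_index.map { |partner, i|
          others = rest[0...i] + rest[i + 1..-1]
          cost[[first, partner]] + best_cost(others, cost, memo)
        }.min
      end
    end

    # Combined pair scores from the sample data earlier in the thread.
    cost = {
      %w[David Helen]  => 0, %w[David Vicki]  => 1,
      %w[Helen Vicki]  => 2, %w[Joseph Vicki] => 4,
      %w[Helen Joseph] => 5, %w[David Joseph] => 8
    }

    best_cost(%w[David Helen Joseph Vicki], cost)  # => 4
    ```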

    From the random runs I've done, the ThomML algorithm doesn't look like
    O(n!**2) at all; on random input its behavior looks similar to that of my
    O(n*2**n) algorithm.

    I was wondering if a worst case scenario could make his algorithm behave
    more like O(n!**2). I tried two extreme scenarios: a) love-hate: the more
    one player loves another, the more the second player hates the first, b)
    popularity: each player has a popularity ranking and every player will
    prefer in the popularity order.

    I found some interesting results with these extremes. The love-hate case
    allowed the ThomML algorithm to go up to hundreds of players (400 players
    in 30 seconds). Probably O(n**2) or O(n**3) for this scenario. The popularity
    case was quite bad though - 14 players: 19s, 12 players: 1.8s, 10 players:
    0.2s. This looks to be O(n!).

    My algorithm doesn't have these extreme variations. For all cases, 24
    players: 3.8s, 26 players: 10s, 28 players: 30s.

    The script I used to generate input for these cases is below. The first arg
    is the # players. Here's what you can use for the second arg:
    nothing/default/-2: the popularity case; -1: the love-hate case; 0: random
    w/o seeding; >0: random w/ seeding.

    Eric

    #!/usr/bin/env ruby

    NAMES = %w(Adam Anna Bob Betty Chuck David Daria Evan Ellen Fred Faye
    Greg Gail Hank Helen Irene John Janet Laura Matt Maria Nadine Oleg
    Olivia Peter Pamela Rick Rosa Steve Susan Theo Tracy Vicki Walter)

    def sample(n, seed)
      srand(seed) if seed > 0
      names = NAMES.clone
      i = 1
      while names.size < n
        i += 1
        NAMES.each { |name| names << "#{name}#{i}" }
      end
      names = names[0, n].sort!
      prefs = Hash.new { |h, k| h[k] = [] }
      names.each { |name1|
        i = 0
        names2 = (seed >= 0) ? names.sort_by { rand } : names
        names2.each { |name2|
          next if name2 == name1
          next if prefs[name1].include?(name2)
          # find the next free slot in name1's preference list (a slot may
          # already be taken by the mirroring below)
          while prefs[name1][i]
            i += 1
          end
          prefs[name1][i] = name2
          if seed == -1
            # love-hate: mirror the ranking, so the more name1 likes name2,
            # the less name2 likes name1
            !prefs[name2][names.size - 2 - i] or
              raise("#{name2} #{i} #{prefs.inspect}")
            prefs[name2][names.size - 2 - i] = name1
          end
        }
      }
      names.each { |name|
        puts("#{name}: #{prefs[name].join(' ')}")
      }
    end

    sample(Integer(ARGV[0] || 10), Integer(ARGV[1] || -2))
     
    Eric Mahurin, Jun 13, 2008
    #2

  3. ThoML

    ThoML Guest

    Re: Preferable Pairs (#165)

    Hi,

    > I found some interesting results with these extremes. The love-hate case
    > allowed the ThomML algorithm to go up to hundreds of players (400 players in
    > 30 seconds). Probably O(n**2) or O(n**3) for this scenario. The popularity
    > case was quite bad though - 14 players: 19s, 12 players: 1.8s, 10 players:
    > 0.2s. This looks to be O(n!).


    Thanks for the script for generating specific types of preferences.
    The performance depends on the adequacy of the underlying greedy
    algorithm for the data set. In the "love-hate" case, a simple greedy
    algorithm already returns optimal pairings. In the "popularity" case,
    such an approach doesn't work well.

    But the solution submitted is really simple and doesn't use too much
    memory. I experimented with memoizing subresults but came to the
    conclusion that it isn't worth the trouble for small random data sets
    because the cached data is hardly ever used. I assume memoization
    could help to avoid those variations you pointed out.

    Regards,
    Thomas.
     
    ThoML, Jun 13, 2008
    #3
  4. Eric Mahurin

    Eric Mahurin Guest

    Re: Preferable Pairs (#165)


    On 6/13/08, ThoML <> wrote:
    >
    > Hi,
    >
    > > I found some interesting results with these extremes. The love-hate
    > > case allowed the ThomML algorithm to go up to hundreds of players (400
    > > players in 30 seconds). Probably O(n**2) or O(n**3) for this scenario.
    > > The popularity case was quite bad though - 14 players: 19s, 12 players:
    > > 1.8s, 10 players: 0.2s. This looks to be O(n!).
    >
    > Thanks for the script for generating specific types of preferences.
    > The performance depends on the adequacy of the underlying greedy
    > algorithm for the data set. In the "love-hate" case, a simple greedy
    > algorithm already returns optimal pairings. In the "popularity" case,
    > such an approach doesn't work well.
    >
    > But the solution submitted is really simple and doesn't use too much
    > memory. I experimented with memoizing subresults but came to the
    > conclusion that it isn't worth the trouble for small random data sets
    > because the cached data is hardly ever used. I assume memoization
    > could help to avoid those variations you pointed out.



    I think to reasonably memoize in this problem and keep the memory low, you
    need to represent each player as a bit. With players <= 31 (or <= 63 on a
    64-bit machine), you can represent any set of players as a Fixnum. A key
    in the memoization cache can be a set of players (a Fixnum) and the value
    can be the optimal cost (also a Fixnum). With 28 players (30 seconds), my
    attempt only uses 26MB.
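
    A minimal sketch of that bit-per-player representation (the names, and the
    cached cost value, are made up for illustration, not taken from Eric's
    code):

    ```ruby
    # Each player gets one bit; an Integer then encodes any set of players,
    # and removing a pair from the set becomes two bit operations.
    names = %w[David Helen Joseph Vicki]
    index = {}
    names.each_with_index { |n, i| index[n] = i }

    def set_of(index, players)
      players.inject(0) { |mask, p| mask | (1 << index[p]) }
    end

    all  = set_of(index, names)            # 0b1111 -- everyone unpaired
    pair = set_of(index, %w[Helen Vicki])  # 0b1010
    rest = all & ~pair                     # 0b0101 -- David and Joseph left

    memo = {}        # cache: set-of-players (Integer) => optimal cost
    memo[rest] = 9   # illustrative value, not a real computed cost
    ```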

    Here are a few ideas to improve what you have further:

    1. Represent each player as a bit. Translate going in and out of the
    algorithm. This won't change the complexity, but should give a speedup of
    several times because you are dealing with much simpler data (for the
    computer).
    2. Memoize costs. This should change the worst case complexity from
    factorial (squared?) to exponential (base 2).
    3. Try a more accurate lower bound on the total cost. Instead of cost*n/2,
    you might try half of the sum of the next n costs. Another possibility may
    be to take half of the sum of the lowest remaining costs for each player.
    These might not yield anything because your current estimate is O(1) and
    these are O(n).
    4. Instead of sorting simply by pair cost, you might consider sorting by a
    total estimated cost when a pair is chosen. Not sure this will help, since
    it may be valid for the first pair chosen but not necessarily good for the
    ones after that.
    5. Instead of maintaining the current best pair sequence when finding the
    lowest cost, this could be done in a separate lower complexity phase. With
    memoization this works well since you can easily use the costs in the cache
    to "follow the crumbs". Without, you may also be able to come up with some
    marker scheme to retrace your steps.

    The general scheme you are using is quite similar to the A* search
    algorithm. I also tried a more classic A* search on this problem, but
    wasn't able to improve from what I had. I think all of the extra overhead
    needed wasn't paying off.

    Eric
     
    Eric Mahurin, Jun 13, 2008
    #4
