Tough sorting problem: or, I'm confusing myself

Discussion in 'Python' started by david jensen, Apr 9, 2010.

  1. david jensen

    david jensen Guest

    Hi all,

    I'm trying to find a good way of doing the following:

    Each n-tuple in combinations( range( 2 ** m ), n ) has a corresponding
    value n-tuple (call them "scores" for clarity later). I'm currently
    storing them in a dictionary, by doing:

    ----
    res={}
    for i in itertools.combinations( range( 2**m ) , n):
    res[ i ] = getValues( i ) # getValues() is computationally
    expensive
    ----

    For each (n-1)-tuple, I need to find the two numbers that have the
    highest scores versus them. I know this isn't crystal clear, but
    hopefully an example will help: with m=n=3:

    Looking at only the (1, 3) case, assuming:
    getValues( (1, 2, 3) ) == ( -200, 125, 75 ) # this contains the
    highest "other" score, where 2 scores 125
    getValues( (1, 3, 4) ) == ( 50, -50, 0 )
    getValues( (1, 3, 5) ) == ( 25, 300, -325 )
    getValues( (1, 3, 6) ) == ( -100, 0, 100 ) # this contains the
    second-highest, where 6 scores 100
    getValues( (1, 3, 7) ) == ( 80, -90, 10 )
    getValues( (1, 3, 8) ) == ( 10, -5, -5 )

    I'd like to return ( (2, 125), (6, 100) ).

    The most obvious (to me) way to do this would be not to generate the
    res dictionary at the beginning, but just to go through each
    combinations( range( 2**m), n-1) and try every possibility... this
    will test each combination n times, however, and generating those
    values is expensive. [e.g. (1,2,3)'s scores will be generated when
    finding the best possibilities for (1,2), (1,3) and (2,3)]

    What I'm doing now is ugly, and i think is where i'm confusing myself:

    ----
    best2={}
    for i in itertools.combinations( range( 2**m), n-1):
    scorelist=[]
    for j in range( 2**m ):
    if j not in i:
    k=list(i)
    k.append(j)
    k=tuple(sorted(k)) #gets the key for looking up the
    scores in res
    scorelist.append((j,res[k][k.index(j)]))
    best2=sorted(scorelist,key=lambda x: -x[1])[:2]
    ----

    Am I missing an obviously better way?

    Many thanks!

    David
    david jensen, Apr 9, 2010
    #1
    1. Advertising

  2. On Apr 9, 8:03 am, david jensen <> wrote:
    > Hi all,
    >
    > I'm trying to find a good way of doing the following:
    >
    > Each n-tuple in combinations( range( 2 ** m ), n ) has a corresponding
    > value n-tuple (call them "scores" for clarity later). I'm currently
    > storing them in a dictionary, by doing:
    >
    > ----
    > res={}
    > for i in itertools.combinations( range( 2**m ) , n):
    >     res[ i ] = getValues( i )    # getValues() is computationally
    > expensive
    > ----
    >
    > For each (n-1)-tuple, I need to find the two numbers that have the
    > highest scores versus them. I know this isn't crystal clear, but
    > hopefully an example will help: with m=n=3:
    >
    > Looking at only the (1, 3) case, assuming:
    > getValues( (1, 2, 3) ) == ( -200, 125, 75 )    # this contains the
    > highest "other" score, where 2 scores 125
    > getValues( (1, 3, 4) ) == ( 50, -50, 0 )
    > getValues( (1, 3, 5) ) == ( 25, 300, -325 )
    > getValues( (1, 3, 6) ) == ( -100, 0, 100 )    # this contains the
    > second-highest, where 6 scores 100
    > getValues( (1, 3, 7) ) == ( 80, -90, 10  )
    > getValues( (1, 3, 8) ) == ( 10, -5, -5 )
    >
    > I'd like to return ( (2, 125), (6, 100) ).
    >
    > The most obvious (to me) way to do this would be not to generate the
    > res dictionary at the beginning, but just to go through each
    > combinations( range( 2**m), n-1) and try every possibility... this
    > will test each combination n times, however, and generating those
    > values is expensive. [e.g. (1,2,3)'s scores will be generated when
    > finding the best possibilities for (1,2), (1,3) and (2,3)]
    >
    > What I'm doing now is ugly, and i think is where i'm confusing myself:
    >
    > ----
    > best2={}
    > for i in itertools.combinations( range( 2**m), n-1):
    >     scorelist=[]
    >     for j in range( 2**m ):
    >         if j not in i:
    >             k=list(i)
    >             k.append(j)
    >             k=tuple(sorted(k))    #gets the key for looking up the
    > scores in res
    >             scorelist.append((j,res[k][k.index(j)]))
    >     best2=sorted(scorelist,key=lambda x: -x[1])[:2]
    > ----
    >
    > Am I missing an obviously better way?


    The overall algorithm looks about right.
    The inner-loop could be tighted-up a bit.
    And you could replace the outer sort with a heap.

    best2 = {}
    for i in itertools.combinations(range( 2**m), n-1):
    scorelist = []
    for j in range( 2**m ):
    if j not in i:
    k = tuple(sorted(i + (j,)))
    scorelist.append((j, res[k][k.index(j)]))
    best2 = heapq.nlargest(2, scorelist,
    key=operator.itemgetter(1))


    Raymond
    Raymond Hettinger, Apr 11, 2010
    #2
    1. Advertising

  3. david jensen

    Paul McGuire Guest

    On Apr 9, 10:03 am, david jensen <> wrote:
    > Hi all,
    >
    > I'm trying to find a good way of doing the following:
    >
    > Each n-tuple in combinations( range( 2 ** m ), n ) has a corresponding
    > value n-tuple (call them "scores" for clarity later). I'm currently
    > storing them in a dictionary, by doing:
    >
    > ----
    > res={}
    > for i in itertools.combinations( range( 2**m ) , n):
    >     res[ i ] = getValues( i )    # getValues() is computationally
    > expensive
    > ----
    >
    > For each (n-1)-tuple, I need to find the two numbers that have the
    > highest scores versus them. I know this isn't crystal clear, but
    > hopefully an example will help: with m=n=3:
    >
    > Looking at only the (1, 3) case, assuming:
    > getValues( (1, 2, 3) ) == ( -200, 125, 75 )    # this contains the
    > highest "other" score, where 2 scores 125
    > getValues( (1, 3, 4) ) == ( 50, -50, 0 )
    > getValues( (1, 3, 5) ) == ( 25, 300, -325 )
    > getValues( (1, 3, 6) ) == ( -100, 0, 100 )    # this contains the
    > second-highest, where 6 scores 100
    > getValues( (1, 3, 7) ) == ( 80, -90, 10  )
    > getValues( (1, 3, 8) ) == ( 10, -5, -5 )
    >
    > I'd like to return ( (2, 125), (6, 100) ).
    >
    > The most obvious (to me) way to do this would be not to generate the
    > res dictionary at the beginning, but just to go through each
    > combinations( range( 2**m), n-1) and try every possibility... this
    > will test each combination n times, however, and generating those
    > values is expensive. [e.g. (1,2,3)'s scores will be generated when
    > finding the best possibilities for (1,2), (1,3) and (2,3)]
    >


    Add a memoizing decorator to getValues, so that repeated calls will do
    lookups instead of repeated calculations.

    -- Paul
    Paul McGuire, Apr 12, 2010
    #3
  4. david jensen

    david jensen Guest

    On Apr 11, 9:39 pm, Raymond Hettinger <> wrote:

    > The overall algorithm looks about right.
    > The inner-loop could be tighted-up a bit.
    > And you could replace the outer sort with a heap.
    >
    > best2 = {}
    > for i in itertools.combinations(range( 2**m), n-1):
    >     scorelist = []
    >     for j in range( 2**m ):
    >         if j not in i:
    >             k = tuple(sorted(i + (j,)))
    >             scorelist.append((j, res[k][k.index(j)]))
    >     best2 = heapq.nlargest(2, scorelist,
    > key=operator.itemgetter(1))
    >
    > Raymond


    Thanks for the ideas... I should have seen the k = tuple(sorted(i +
    (j,))). I'm not sure a heap will help much, and at least to me,
    doesn't improve readability.

    Thanks for taking a look, I appreciate it!

    David
    david jensen, Apr 13, 2010
    #4
  5. david jensen

    david jensen Guest

    On Apr 12, 1:22 am, Paul McGuire <> wrote:
    > On Apr 9, 10:03 am, david jensen <> wrote:
    >
    >
    >
    > > Hi all,

    >
    > > I'm trying to find a good way of doing the following:

    >
    > > Each n-tuple in combinations( range( 2 ** m ), n ) has a corresponding
    > > value n-tuple (call them "scores" for clarity later). I'm currently
    > > storing them in a dictionary, by doing:

    >
    > > ----
    > > res={}
    > > for i in itertools.combinations( range( 2**m ) , n):
    > >     res[ i ] = getValues( i )    # getValues() is computationally
    > > expensive
    > > ----

    >
    > > For each (n-1)-tuple, I need to find the two numbers that have the
    > > highest scores versus them. I know this isn't crystal clear, but
    > > hopefully an example will help: with m=n=3:

    >
    > > Looking at only the (1, 3) case, assuming:
    > > getValues( (1, 2, 3) ) == ( -200, 125, 75 )    # this contains the
    > > highest "other" score, where 2 scores 125
    > > getValues( (1, 3, 4) ) == ( 50, -50, 0 )
    > > getValues( (1, 3, 5) ) == ( 25, 300, -325 )
    > > getValues( (1, 3, 6) ) == ( -100, 0, 100 )    # this contains the
    > > second-highest, where 6 scores 100
    > > getValues( (1, 3, 7) ) == ( 80, -90, 10  )
    > > getValues( (1, 3, 8) ) == ( 10, -5, -5 )

    >
    > > I'd like to return ( (2, 125), (6, 100) ).

    >
    > > The most obvious (to me) way to do this would be not to generate the
    > > res dictionary at the beginning, but just to go through each
    > > combinations( range( 2**m), n-1) and try every possibility... this
    > > will test each combination n times, however, and generating those
    > > values is expensive. [e.g. (1,2,3)'s scores will be generated when
    > > finding the best possibilities for (1,2), (1,3) and (2,3)]

    >
    > Add a memoizing decorator to getValues, so that repeated calls will do
    > lookups instead of repeated calculations.
    >
    > -- Paul


    Thanks for the idea... I'd thought of that, but didn't use it because
    I don't think it improves complexity or readability: just makes it a
    little more intuitive.

    But thanks for your time!

    David
    david jensen, Apr 13, 2010
    #5
  6. david jensen

    Paul Rubin Guest

    Raymond Hettinger <> writes:
    > Not sure what the readability issue is. The phrase "nlargest(2,
    > iterable)" does exactly what it says, finds the 2 largest elements
    > from an iterable. That makes the programmer's intent more clear than
    > the slower, but semanticly equivalent form: sorted(iterable)[:2].


    I think you meant

    sorted(iterable, reverse=True)[:2]
    Paul Rubin, Apr 14, 2010
    #6
  7. > I'm not sure a heap will help much, and at least to me,
    > doesn't improve readability.


    nlargest() should save quite a few comparisons and run much faster
    than sorted().

    Not sure what the readability issue is. The phrase "nlargest(2,
    iterable)" does exactly what it says, finds the 2 largest elements
    from an iterable. That makes the programmer's intent more clear than
    the slower, but semanticly equivalent form: sorted(iterable)[:2].

    The running time for your algorithm can be very long, depending on the
    inputs. Do you need an exact answer or can you approximate it with
    sampling random combinations (i.e. choose the best two scores out of
    10000 samples).

    Raymond
    Raymond Hettinger, Apr 14, 2010
    #7
  8. > > Not sure what the readability issue is.  The phrase "nlargest(2,
    > > iterable)" does exactly what it says, finds the 2 largest elements
    > > from an iterable.  That makes the programmer's intent more clear than
    > > the slower, but semanticly equivalent form:  sorted(iterable)[:2].

    >
    > I think you meant
    >
    >    sorted(iterable, reverse=True)[:2]


    :)


    Raymond
    Raymond Hettinger, Apr 14, 2010
    #8
  9. david jensen

    david jensen Guest

    On Apr 14, 6:06 pm, Raymond Hettinger <> wrote:
    > > I'm not sure a heap will help much, and at least to me,
    > > doesn't improve readability.

    >
    > nlargest() should save quite a few comparisons and run much faster
    > than sorted().
    >
    > Not sure what the readability issue is.  The phrase "nlargest(2,
    > iterable)" does exactly what it says, finds the 2 largest elements
    > from an iterable.  That makes the programmer's intent more clear than
    > the slower, but semanticly equivalent form:  sorted(iterable)[:2].
    >
    > The running time for your algorithm can be very long, depending on the
    > inputs.  Do you need an exact answer or can you approximate it with
    > sampling random combinations (i.e. choose the best two scores out of
    > 10000 samples).
    >
    > Raymond



    You are of course correct. I was running on lack of sleep. Thanks
    again for having taken the time to help!
    I need an exact answer, but don't anticipate 2**m or n to be huge so
    even with e.g. m=10 and n=7 it should still be feasible. I can see how
    sampling would become necessary if m is large though. Much larger, and
    i'd have to think about optimizing the getValues function first
    though.

    thank you,
    David
    david jensen, Apr 15, 2010
    #9
  10. david jensen

    Lie Ryan Guest

    On 04/15/10 02:03, Paul Rubin wrote:
    > Raymond Hettinger <> writes:
    >> Not sure what the readability issue is. The phrase "nlargest(2,
    >> iterable)" does exactly what it says, finds the 2 largest elements
    >> from an iterable. That makes the programmer's intent more clear than
    >> the slower, but semanticly equivalent form: sorted(iterable)[:2].

    >
    > I think you meant
    >
    > sorted(iterable, reverse=True)[:2]



    or sorted(iterable)[-2:]
    Lie Ryan, Apr 16, 2010
    #10
  11. david jensen

    Peter Otten Guest

    david jensen wrote:

    > Hi all,
    >
    > I'm trying to find a good way of doing the following:
    >
    > Each n-tuple in combinations( range( 2 ** m ), n ) has a corresponding
    > value n-tuple (call them "scores" for clarity later). I'm currently
    > storing them in a dictionary, by doing:
    >
    > ----
    > res={}
    > for i in itertools.combinations( range( 2**m ) , n):
    > res[ i ] = getValues( i ) # getValues() is computationally
    > expensive
    > ----
    >
    > For each (n-1)-tuple, I need to find the two numbers that have the
    > highest scores versus them. I know this isn't crystal clear, but
    > hopefully an example will help: with m=n=3:
    >
    > Looking at only the (1, 3) case, assuming:
    > getValues( (1, 2, 3) ) == ( -200, 125, 75 ) # this contains the
    > highest "other" score, where 2 scores 125
    > getValues( (1, 3, 4) ) == ( 50, -50, 0 )
    > getValues( (1, 3, 5) ) == ( 25, 300, -325 )
    > getValues( (1, 3, 6) ) == ( -100, 0, 100 ) # this contains the
    > second-highest, where 6 scores 100
    > getValues( (1, 3, 7) ) == ( 80, -90, 10 )
    > getValues( (1, 3, 8) ) == ( 10, -5, -5 )
    >
    > I'd like to return ( (2, 125), (6, 100) ).
    >
    > The most obvious (to me) way to do this would be not to generate the
    > res dictionary at the beginning, but just to go through each
    > combinations( range( 2**m), n-1) and try every possibility... this
    > will test each combination n times, however, and generating those
    > values is expensive. [e.g. (1,2,3)'s scores will be generated when
    > finding the best possibilities for (1,2), (1,3) and (2,3)]
    >
    > What I'm doing now is ugly, and i think is where i'm confusing myself:
    >
    > ----
    > best2={}
    > for i in itertools.combinations( range( 2**m), n-1):
    > scorelist=[]
    > for j in range( 2**m ):
    > if j not in i:
    > k=list(i)
    > k.append(j)
    > k=tuple(sorted(k)) #gets the key for looking up the
    > scores in res
    > scorelist.append((j,res[k][k.index(j)]))
    > best2=sorted(scorelist,key=lambda x: -x[1])[:2]
    > ----
    >
    > Am I missing an obviously better way?


    After some tinkering I came up with

    def calculate(getvalues):
    best2 = defaultdict(list)
    for t in combinations(range(2**m), n):
    values = getvalues(t)
    for i, k in enumerate(t):
    best2[t[:i] + t[i+1:]].append((k, values))
    return dict((k, nlargest(2, v, key=itemgetter(1))) for k, v in
    best2.iteritems())

    which is concise but slower than your code with Raymond's improvements. I
    then tried inlining the nlargest() operation:

    def calculate_inlined(getvalues):
    best2 = {}
    for t in combinations(range(2**m), n):
    values = getvalues(t)
    for i, k in enumerate(t):
    key = t[:i] + t[i+1:]
    value = values
    if key not in best2:
    best2[key] = [(k, value), (None, None)]
    else:
    a, b = best2[key]
    if value > a[1]:
    best2[key] = [(k, value), a]
    elif value > b[1]:
    best2[key] = [a, (k, value)]
    return best2

    This gives a speed boost (I measured with m=n=5) but is both messy and
    brittle. I've got a hunch that there is a better way; I just don't see it at
    the moment...

    Peter
    Peter Otten, Apr 16, 2010
    #11
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Prince

    Tough Problem

    Prince, Jan 26, 2004, in forum: ASP .Net
    Replies:
    2
    Views:
    398
    Michael Earls
    Jan 26, 2004
  2. W. Jordan
    Replies:
    6
    Views:
    570
    Patrick Olurotimi Ige
    May 17, 2005
  3. Bloody Viking

    tough XSLT problem

    Bloody Viking, Aug 23, 2005, in forum: XML
    Replies:
    6
    Views:
    461
    Peter Flynn
    Aug 24, 2005
  4. Tough Spawn Problem

    , Mar 6, 2005, in forum: Python
    Replies:
    1
    Views:
    369
    Jeff Epler
    Mar 6, 2005
  5. Replies:
    0
    Views:
    392
Loading...

Share This Page