# Tough sorting problem: or, I'm confusing myself

Discussion in 'Python' started by david jensen, Apr 9, 2010.

1. ### david jensenGuest

Hi all,

I'm trying to find a good way of doing the following:

Each n-tuple in combinations( range( 2 ** m ), n ) has a corresponding
value n-tuple (call them "scores" for clarity later). I'm currently
storing them in a dictionary, by doing:

----
res={}
for i in itertools.combinations( range( 2**m ) , n):
res[ i ] = getValues( i ) # getValues() is computationally
expensive
----

For each (n-1)-tuple, I need to find the two numbers that have the
highest scores versus them. I know this isn't crystal clear, but
hopefully an example will help: with m=n=3:

Looking at only the (1, 3) case, assuming:
getValues( (1, 2, 3) ) == ( -200, 125, 75 ) # this contains the
highest "other" score, where 2 scores 125
getValues( (1, 3, 4) ) == ( 50, -50, 0 )
getValues( (1, 3, 5) ) == ( 25, 300, -325 )
getValues( (1, 3, 6) ) == ( -100, 0, 100 ) # this contains the
second-highest, where 6 scores 100
getValues( (1, 3, 7) ) == ( 80, -90, 10 )
getValues( (1, 3, 8) ) == ( 10, -5, -5 )

I'd like to return ( (2, 125), (6, 100) ).

The most obvious (to me) way to do this would be not to generate the
res dictionary at the beginning, but just to go through each
combinations( range( 2**m), n-1) and try every possibility... this
will test each combination n times, however, and generating those
values is expensive. [e.g. (1,2,3)'s scores will be generated when
finding the best possibilities for (1,2), (1,3) and (2,3)]

What I'm doing now is ugly, and i think is where i'm confusing myself:

----
best2={}
for i in itertools.combinations( range( 2**m), n-1):
scorelist=[]
for j in range( 2**m ):
if j not in i:
k=list(i)
k.append(j)
k=tuple(sorted(k)) #gets the key for looking up the
scores in res
scorelist.append((j,res[k][k.index(j)]))
best2=sorted(scorelist,key=lambda x: -x[1])[:2]
----

Am I missing an obviously better way?

Many thanks!

David
david jensen, Apr 9, 2010

2. ### Raymond HettingerGuest

On Apr 9, 8:03 am, david jensen <> wrote:
> Hi all,
>
> I'm trying to find a good way of doing the following:
>
> Each n-tuple in combinations( range( 2 ** m ), n ) has a corresponding
> value n-tuple (call them "scores" for clarity later). I'm currently
> storing them in a dictionary, by doing:
>
> ----
> res={}
> for i in itertools.combinations( range( 2**m ) , n):
>     res[ i ] = getValues( i )    # getValues() is computationally
> expensive
> ----
>
> For each (n-1)-tuple, I need to find the two numbers that have the
> highest scores versus them. I know this isn't crystal clear, but
> hopefully an example will help: with m=n=3:
>
> Looking at only the (1, 3) case, assuming:
> getValues( (1, 2, 3) ) == ( -200, 125, 75 )    # this contains the
> highest "other" score, where 2 scores 125
> getValues( (1, 3, 4) ) == ( 50, -50, 0 )
> getValues( (1, 3, 5) ) == ( 25, 300, -325 )
> getValues( (1, 3, 6) ) == ( -100, 0, 100 )    # this contains the
> second-highest, where 6 scores 100
> getValues( (1, 3, 7) ) == ( 80, -90, 10  )
> getValues( (1, 3, 8) ) == ( 10, -5, -5 )
>
> I'd like to return ( (2, 125), (6, 100) ).
>
> The most obvious (to me) way to do this would be not to generate the
> res dictionary at the beginning, but just to go through each
> combinations( range( 2**m), n-1) and try every possibility... this
> will test each combination n times, however, and generating those
> values is expensive. [e.g. (1,2,3)'s scores will be generated when
> finding the best possibilities for (1,2), (1,3) and (2,3)]
>
> What I'm doing now is ugly, and i think is where i'm confusing myself:
>
> ----
> best2={}
> for i in itertools.combinations( range( 2**m), n-1):
>     scorelist=[]
>     for j in range( 2**m ):
>         if j not in i:
>             k=list(i)
>             k.append(j)
>             k=tuple(sorted(k))    #gets the key for looking up the
> scores in res
>             scorelist.append((j,res[k][k.index(j)]))
>     best2=sorted(scorelist,key=lambda x: -x[1])[:2]
> ----
>
> Am I missing an obviously better way?

The overall algorithm looks about right.
The inner-loop could be tighted-up a bit.
And you could replace the outer sort with a heap.

best2 = {}
for i in itertools.combinations(range( 2**m), n-1):
scorelist = []
for j in range( 2**m ):
if j not in i:
k = tuple(sorted(i + (j,)))
scorelist.append((j, res[k][k.index(j)]))
best2 = heapq.nlargest(2, scorelist,
key=operator.itemgetter(1))

Raymond
Raymond Hettinger, Apr 11, 2010

3. ### Paul McGuireGuest

On Apr 9, 10:03 am, david jensen <> wrote:
> Hi all,
>
> I'm trying to find a good way of doing the following:
>
> Each n-tuple in combinations( range( 2 ** m ), n ) has a corresponding
> value n-tuple (call them "scores" for clarity later). I'm currently
> storing them in a dictionary, by doing:
>
> ----
> res={}
> for i in itertools.combinations( range( 2**m ) , n):
>     res[ i ] = getValues( i )    # getValues() is computationally
> expensive
> ----
>
> For each (n-1)-tuple, I need to find the two numbers that have the
> highest scores versus them. I know this isn't crystal clear, but
> hopefully an example will help: with m=n=3:
>
> Looking at only the (1, 3) case, assuming:
> getValues( (1, 2, 3) ) == ( -200, 125, 75 )    # this contains the
> highest "other" score, where 2 scores 125
> getValues( (1, 3, 4) ) == ( 50, -50, 0 )
> getValues( (1, 3, 5) ) == ( 25, 300, -325 )
> getValues( (1, 3, 6) ) == ( -100, 0, 100 )    # this contains the
> second-highest, where 6 scores 100
> getValues( (1, 3, 7) ) == ( 80, -90, 10  )
> getValues( (1, 3, 8) ) == ( 10, -5, -5 )
>
> I'd like to return ( (2, 125), (6, 100) ).
>
> The most obvious (to me) way to do this would be not to generate the
> res dictionary at the beginning, but just to go through each
> combinations( range( 2**m), n-1) and try every possibility... this
> will test each combination n times, however, and generating those
> values is expensive. [e.g. (1,2,3)'s scores will be generated when
> finding the best possibilities for (1,2), (1,3) and (2,3)]
>

Add a memoizing decorator to getValues, so that repeated calls will do

-- Paul
Paul McGuire, Apr 12, 2010
4. ### david jensenGuest

On Apr 11, 9:39 pm, Raymond Hettinger <> wrote:

> The overall algorithm looks about right.
> The inner-loop could be tighted-up a bit.
> And you could replace the outer sort with a heap.
>
> best2 = {}
> for i in itertools.combinations(range( 2**m), n-1):
>     scorelist = []
>     for j in range( 2**m ):
>         if j not in i:
>             k = tuple(sorted(i + (j,)))
>             scorelist.append((j, res[k][k.index(j)]))
>     best2 = heapq.nlargest(2, scorelist,
> key=operator.itemgetter(1))
>
> Raymond

Thanks for the ideas... I should have seen the k = tuple(sorted(i +
(j,))). I'm not sure a heap will help much, and at least to me,

Thanks for taking a look, I appreciate it!

David
david jensen, Apr 13, 2010
5. ### david jensenGuest

On Apr 12, 1:22 am, Paul McGuire <> wrote:
> On Apr 9, 10:03 am, david jensen <> wrote:
>
>
>
> > Hi all,

>
> > I'm trying to find a good way of doing the following:

>
> > Each n-tuple in combinations( range( 2 ** m ), n ) has a corresponding
> > value n-tuple (call them "scores" for clarity later). I'm currently
> > storing them in a dictionary, by doing:

>
> > ----
> > res={}
> > for i in itertools.combinations( range( 2**m ) , n):
> >     res[ i ] = getValues( i )    # getValues() is computationally
> > expensive
> > ----

>
> > For each (n-1)-tuple, I need to find the two numbers that have the
> > highest scores versus them. I know this isn't crystal clear, but
> > hopefully an example will help: with m=n=3:

>
> > Looking at only the (1, 3) case, assuming:
> > getValues( (1, 2, 3) ) == ( -200, 125, 75 )    # this contains the
> > highest "other" score, where 2 scores 125
> > getValues( (1, 3, 4) ) == ( 50, -50, 0 )
> > getValues( (1, 3, 5) ) == ( 25, 300, -325 )
> > getValues( (1, 3, 6) ) == ( -100, 0, 100 )    # this contains the
> > second-highest, where 6 scores 100
> > getValues( (1, 3, 7) ) == ( 80, -90, 10  )
> > getValues( (1, 3, 8) ) == ( 10, -5, -5 )

>
> > I'd like to return ( (2, 125), (6, 100) ).

>
> > The most obvious (to me) way to do this would be not to generate the
> > res dictionary at the beginning, but just to go through each
> > combinations( range( 2**m), n-1) and try every possibility... this
> > will test each combination n times, however, and generating those
> > values is expensive. [e.g. (1,2,3)'s scores will be generated when
> > finding the best possibilities for (1,2), (1,3) and (2,3)]

>
> Add a memoizing decorator to getValues, so that repeated calls will do
> lookups instead of repeated calculations.
>
> -- Paul

Thanks for the idea... I'd thought of that, but didn't use it because
I don't think it improves complexity or readability: just makes it a
little more intuitive.

David
david jensen, Apr 13, 2010
6. ### Paul RubinGuest

Raymond Hettinger <> writes:
> Not sure what the readability issue is. The phrase "nlargest(2,
> iterable)" does exactly what it says, finds the 2 largest elements
> from an iterable. That makes the programmer's intent more clear than
> the slower, but semanticly equivalent form: sorted(iterable)[:2].

I think you meant

sorted(iterable, reverse=True)[:2]
Paul Rubin, Apr 14, 2010
7. ### Raymond HettingerGuest

> I'm not sure a heap will help much, and at least to me,

nlargest() should save quite a few comparisons and run much faster
than sorted().

Not sure what the readability issue is. The phrase "nlargest(2,
iterable)" does exactly what it says, finds the 2 largest elements
from an iterable. That makes the programmer's intent more clear than
the slower, but semanticly equivalent form: sorted(iterable)[:2].

The running time for your algorithm can be very long, depending on the
inputs. Do you need an exact answer or can you approximate it with
sampling random combinations (i.e. choose the best two scores out of
10000 samples).

Raymond
Raymond Hettinger, Apr 14, 2010
8. ### Raymond HettingerGuest

> > Not sure what the readability issue is.  The phrase "nlargest(2,
> > iterable)" does exactly what it says, finds the 2 largest elements
> > from an iterable.  That makes the programmer's intent more clear than
> > the slower, but semanticly equivalent form:  sorted(iterable)[:2].

>
> I think you meant
>
>    sorted(iterable, reverse=True)[:2]

Raymond
Raymond Hettinger, Apr 14, 2010
9. ### david jensenGuest

On Apr 14, 6:06 pm, Raymond Hettinger <> wrote:
> > I'm not sure a heap will help much, and at least to me,

>
> nlargest() should save quite a few comparisons and run much faster
> than sorted().
>
> Not sure what the readability issue is.  The phrase "nlargest(2,
> iterable)" does exactly what it says, finds the 2 largest elements
> from an iterable.  That makes the programmer's intent more clear than
> the slower, but semanticly equivalent form:  sorted(iterable)[:2].
>
> The running time for your algorithm can be very long, depending on the
> inputs.  Do you need an exact answer or can you approximate it with
> sampling random combinations (i.e. choose the best two scores out of
> 10000 samples).
>
> Raymond

You are of course correct. I was running on lack of sleep. Thanks
again for having taken the time to help!
I need an exact answer, but don't anticipate 2**m or n to be huge so
even with e.g. m=10 and n=7 it should still be feasible. I can see how
sampling would become necessary if m is large though. Much larger, and
i'd have to think about optimizing the getValues function first
though.

thank you,
David
david jensen, Apr 15, 2010
10. ### Lie RyanGuest

On 04/15/10 02:03, Paul Rubin wrote:
> Raymond Hettinger <> writes:
>> Not sure what the readability issue is. The phrase "nlargest(2,
>> iterable)" does exactly what it says, finds the 2 largest elements
>> from an iterable. That makes the programmer's intent more clear than
>> the slower, but semanticly equivalent form: sorted(iterable)[:2].

>
> I think you meant
>
> sorted(iterable, reverse=True)[:2]

or sorted(iterable)[-2:]
Lie Ryan, Apr 16, 2010
11. ### Peter OttenGuest

david jensen wrote:

> Hi all,
>
> I'm trying to find a good way of doing the following:
>
> Each n-tuple in combinations( range( 2 ** m ), n ) has a corresponding
> value n-tuple (call them "scores" for clarity later). I'm currently
> storing them in a dictionary, by doing:
>
> ----
> res={}
> for i in itertools.combinations( range( 2**m ) , n):
> res[ i ] = getValues( i ) # getValues() is computationally
> expensive
> ----
>
> For each (n-1)-tuple, I need to find the two numbers that have the
> highest scores versus them. I know this isn't crystal clear, but
> hopefully an example will help: with m=n=3:
>
> Looking at only the (1, 3) case, assuming:
> getValues( (1, 2, 3) ) == ( -200, 125, 75 ) # this contains the
> highest "other" score, where 2 scores 125
> getValues( (1, 3, 4) ) == ( 50, -50, 0 )
> getValues( (1, 3, 5) ) == ( 25, 300, -325 )
> getValues( (1, 3, 6) ) == ( -100, 0, 100 ) # this contains the
> second-highest, where 6 scores 100
> getValues( (1, 3, 7) ) == ( 80, -90, 10 )
> getValues( (1, 3, 8) ) == ( 10, -5, -5 )
>
> I'd like to return ( (2, 125), (6, 100) ).
>
> The most obvious (to me) way to do this would be not to generate the
> res dictionary at the beginning, but just to go through each
> combinations( range( 2**m), n-1) and try every possibility... this
> will test each combination n times, however, and generating those
> values is expensive. [e.g. (1,2,3)'s scores will be generated when
> finding the best possibilities for (1,2), (1,3) and (2,3)]
>
> What I'm doing now is ugly, and i think is where i'm confusing myself:
>
> ----
> best2={}
> for i in itertools.combinations( range( 2**m), n-1):
> scorelist=[]
> for j in range( 2**m ):
> if j not in i:
> k=list(i)
> k.append(j)
> k=tuple(sorted(k)) #gets the key for looking up the
> scores in res
> scorelist.append((j,res[k][k.index(j)]))
> best2=sorted(scorelist,key=lambda x: -x[1])[:2]
> ----
>
> Am I missing an obviously better way?

After some tinkering I came up with

def calculate(getvalues):
best2 = defaultdict(list)
for t in combinations(range(2**m), n):
values = getvalues(t)
for i, k in enumerate(t):
best2[t[:i] + t[i+1:]].append((k, values))
return dict((k, nlargest(2, v, key=itemgetter(1))) for k, v in
best2.iteritems())

which is concise but slower than your code with Raymond's improvements. I
then tried inlining the nlargest() operation:

def calculate_inlined(getvalues):
best2 = {}
for t in combinations(range(2**m), n):
values = getvalues(t)
for i, k in enumerate(t):
key = t[:i] + t[i+1:]
value = values
if key not in best2:
best2[key] = [(k, value), (None, None)]
else:
a, b = best2[key]
if value > a[1]:
best2[key] = [(k, value), a]
elif value > b[1]:
best2[key] = [a, (k, value)]
return best2

This gives a speed boost (I measured with m=n=5) but is both messy and
brittle. I've got a hunch that there is a better way; I just don't see it at
the moment...

Peter
Peter Otten, Apr 16, 2010