Tough sorting problem: or, I'm confusing myself

D

david jensen

Hi all,

I'm trying to find a good way of doing the following:

Each n-tuple in combinations( range( 2 ** m ), n ) has a corresponding
value n-tuple (call them "scores" for clarity later). I'm currently
storing them in a dictionary, by doing:

----
res={}
for i in itertools.combinations( range( 2**m ) , n):
res[ i ] = getValues( i ) # getValues() is computationally
expensive
----

For each (n-1)-tuple, I need to find the two numbers that have the
highest scores versus them. I know this isn't crystal clear, but
hopefully an example will help: with m=n=3:

Looking at only the (1, 3) case, assuming:
getValues( (1, 2, 3) ) == ( -200, 125, 75 ) # this contains the
highest "other" score, where 2 scores 125
getValues( (1, 3, 4) ) == ( 50, -50, 0 )
getValues( (1, 3, 5) ) == ( 25, 300, -325 )
getValues( (1, 3, 6) ) == ( -100, 0, 100 ) # this contains the
second-highest, where 6 scores 100
getValues( (1, 3, 7) ) == ( 80, -90, 10 )
getValues( (1, 3, 8) ) == ( 10, -5, -5 )

I'd like to return ( (2, 125), (6, 100) ).

The most obvious (to me) way to do this would be not to generate the
res dictionary at the beginning, but just to go through each
combinations( range( 2**m), n-1) and try every possibility... this
will test each combination n times, however, and generating those
values is expensive. [e.g. (1,2,3)'s scores will be generated when
finding the best possibilities for (1,2), (1,3) and (2,3)]

What I'm doing now is ugly, and i think is where i'm confusing myself:

----
best2={}
for i in itertools.combinations( range( 2**m), n-1):
scorelist=[]
for j in range( 2**m ):
if j not in i:
k=list(i)
k.append(j)
k=tuple(sorted(k)) #gets the key for looking up the
scores in res
scorelist.append((j,res[k][k.index(j)]))
best2=sorted(scorelist,key=lambda x: -x[1])[:2]
----

Am I missing an obviously better way?

Many thanks!

David
 
R

Raymond Hettinger

Hi all,

I'm trying to find a good way of doing the following:

Each n-tuple in combinations( range( 2 ** m ), n ) has a corresponding
value n-tuple (call them "scores" for clarity later). I'm currently
storing them in a dictionary, by doing:

----
res={}
for i in itertools.combinations( range( 2**m ) , n):
    res[ i ] = getValues( i )    # getValues() is computationally
expensive
----

For each (n-1)-tuple, I need to find the two numbers that have the
highest scores versus them. I know this isn't crystal clear, but
hopefully an example will help: with m=n=3:

Looking at only the (1, 3) case, assuming:
getValues( (1, 2, 3) ) == ( -200, 125, 75 )    # this contains the
highest "other" score, where 2 scores 125
getValues( (1, 3, 4) ) == ( 50, -50, 0 )
getValues( (1, 3, 5) ) == ( 25, 300, -325 )
getValues( (1, 3, 6) ) == ( -100, 0, 100 )    # this contains the
second-highest, where 6 scores 100
getValues( (1, 3, 7) ) == ( 80, -90, 10  )
getValues( (1, 3, 8) ) == ( 10, -5, -5 )

I'd like to return ( (2, 125), (6, 100) ).

The most obvious (to me) way to do this would be not to generate the
res dictionary at the beginning, but just to go through each
combinations( range( 2**m), n-1) and try every possibility... this
will test each combination n times, however, and generating those
values is expensive. [e.g. (1,2,3)'s scores will be generated when
finding the best possibilities for (1,2), (1,3) and (2,3)]

What I'm doing now is ugly, and i think is where i'm confusing myself:

----
best2={}
for i in itertools.combinations( range( 2**m), n-1):
    scorelist=[]
    for j in range( 2**m ):
        if j not in i:
            k=list(i)
            k.append(j)
            k=tuple(sorted(k))    #gets the key for looking up the
scores in res
            scorelist.append((j,res[k][k.index(j)]))
    best2=sorted(scorelist,key=lambda x: -x[1])[:2]


The overall algorithm looks about right.
The inner-loop could be tighted-up a bit.
And you could replace the outer sort with a heap.

best2 = {}
for i in itertools.combinations(range( 2**m), n-1):
scorelist = []
for j in range( 2**m ):
if j not in i:
k = tuple(sorted(i + (j,)))
scorelist.append((j, res[k][k.index(j)]))
best2 = heapq.nlargest(2, scorelist,
key=operator.itemgetter(1))


Raymond
 
P

Paul McGuire

Hi all,

I'm trying to find a good way of doing the following:

Each n-tuple in combinations( range( 2 ** m ), n ) has a corresponding
value n-tuple (call them "scores" for clarity later). I'm currently
storing them in a dictionary, by doing:

----
res={}
for i in itertools.combinations( range( 2**m ) , n):
    res[ i ] = getValues( i )    # getValues() is computationally
expensive
----

For each (n-1)-tuple, I need to find the two numbers that have the
highest scores versus them. I know this isn't crystal clear, but
hopefully an example will help: with m=n=3:

Looking at only the (1, 3) case, assuming:
getValues( (1, 2, 3) ) == ( -200, 125, 75 )    # this contains the
highest "other" score, where 2 scores 125
getValues( (1, 3, 4) ) == ( 50, -50, 0 )
getValues( (1, 3, 5) ) == ( 25, 300, -325 )
getValues( (1, 3, 6) ) == ( -100, 0, 100 )    # this contains the
second-highest, where 6 scores 100
getValues( (1, 3, 7) ) == ( 80, -90, 10  )
getValues( (1, 3, 8) ) == ( 10, -5, -5 )

I'd like to return ( (2, 125), (6, 100) ).

The most obvious (to me) way to do this would be not to generate the
res dictionary at the beginning, but just to go through each
combinations( range( 2**m), n-1) and try every possibility... this
will test each combination n times, however, and generating those
values is expensive. [e.g. (1,2,3)'s scores will be generated when
finding the best possibilities for (1,2), (1,3) and (2,3)]

Add a memoizing decorator to getValues, so that repeated calls will do
lookups instead of repeated calculations.

-- Paul
 
D

david jensen

The overall algorithm looks about right.
The inner-loop could be tighted-up a bit.
And you could replace the outer sort with a heap.

best2 = {}
for i in itertools.combinations(range( 2**m), n-1):
    scorelist = []
    for j in range( 2**m ):
        if j not in i:
            k = tuple(sorted(i + (j,)))
            scorelist.append((j, res[k][k.index(j)]))
    best2 = heapq.nlargest(2, scorelist,
key=operator.itemgetter(1))

Raymond


Thanks for the ideas... I should have seen the k = tuple(sorted(i +
(j,))). I'm not sure a heap will help much, and at least to me,
doesn't improve readability.

Thanks for taking a look, I appreciate it!

David
 
D

david jensen

I'm trying to find a good way of doing the following:
Each n-tuple in combinations( range( 2 ** m ), n ) has a corresponding
value n-tuple (call them "scores" for clarity later). I'm currently
storing them in a dictionary, by doing:
----
res={}
for i in itertools.combinations( range( 2**m ) , n):
    res[ i ] = getValues( i )    # getValues() is computationally
expensive
----
For each (n-1)-tuple, I need to find the two numbers that have the
highest scores versus them. I know this isn't crystal clear, but
hopefully an example will help: with m=n=3:
Looking at only the (1, 3) case, assuming:
getValues( (1, 2, 3) ) == ( -200, 125, 75 )    # this contains the
highest "other" score, where 2 scores 125
getValues( (1, 3, 4) ) == ( 50, -50, 0 )
getValues( (1, 3, 5) ) == ( 25, 300, -325 )
getValues( (1, 3, 6) ) == ( -100, 0, 100 )    # this contains the
second-highest, where 6 scores 100
getValues( (1, 3, 7) ) == ( 80, -90, 10  )
getValues( (1, 3, 8) ) == ( 10, -5, -5 )
I'd like to return ( (2, 125), (6, 100) ).
The most obvious (to me) way to do this would be not to generate the
res dictionary at the beginning, but just to go through each
combinations( range( 2**m), n-1) and try every possibility... this
will test each combination n times, however, and generating those
values is expensive. [e.g. (1,2,3)'s scores will be generated when
finding the best possibilities for (1,2), (1,3) and (2,3)]

Add a memoizing decorator to getValues, so that repeated calls will do
lookups instead of repeated calculations.

-- Paul

Thanks for the idea... I'd thought of that, but didn't use it because
I don't think it improves complexity or readability: just makes it a
little more intuitive.

But thanks for your time!

David
 
P

Paul Rubin

Raymond Hettinger said:
Not sure what the readability issue is. The phrase "nlargest(2,
iterable)" does exactly what it says, finds the 2 largest elements
from an iterable. That makes the programmer's intent more clear than
the slower, but semanticly equivalent form: sorted(iterable)[:2].

I think you meant

sorted(iterable, reverse=True)[:2]
 
R

Raymond Hettinger

I'm not sure a heap will help much, and at least to me,
doesn't improve readability.

nlargest() should save quite a few comparisons and run much faster
than sorted().

Not sure what the readability issue is. The phrase "nlargest(2,
iterable)" does exactly what it says, finds the 2 largest elements
from an iterable. That makes the programmer's intent more clear than
the slower, but semanticly equivalent form: sorted(iterable)[:2].

The running time for your algorithm can be very long, depending on the
inputs. Do you need an exact answer or can you approximate it with
sampling random combinations (i.e. choose the best two scores out of
10000 samples).

Raymond
 
R

Raymond Hettinger

Not sure what the readability issue is.  The phrase "nlargest(2,
iterable)" does exactly what it says, finds the 2 largest elements
from an iterable.  That makes the programmer's intent more clear than
the slower, but semanticly equivalent form:  sorted(iterable)[:2].

I think you meant

   sorted(iterable, reverse=True)[:2]

:)


Raymond
 
D

david jensen

I'm not sure a heap will help much, and at least to me,
doesn't improve readability.

nlargest() should save quite a few comparisons and run much faster
than sorted().

Not sure what the readability issue is.  The phrase "nlargest(2,
iterable)" does exactly what it says, finds the 2 largest elements
from an iterable.  That makes the programmer's intent more clear than
the slower, but semanticly equivalent form:  sorted(iterable)[:2].

The running time for your algorithm can be very long, depending on the
inputs.  Do you need an exact answer or can you approximate it with
sampling random combinations (i.e. choose the best two scores out of
10000 samples).

Raymond


You are of course correct. I was running on lack of sleep. Thanks
again for having taken the time to help!
I need an exact answer, but don't anticipate 2**m or n to be huge so
even with e.g. m=10 and n=7 it should still be feasible. I can see how
sampling would become necessary if m is large though. Much larger, and
i'd have to think about optimizing the getValues function first
though.

thank you,
David
 
L

Lie Ryan

Raymond Hettinger said:
Not sure what the readability issue is. The phrase "nlargest(2,
iterable)" does exactly what it says, finds the 2 largest elements
from an iterable. That makes the programmer's intent more clear than
the slower, but semanticly equivalent form: sorted(iterable)[:2].

I think you meant

sorted(iterable, reverse=True)[:2]


or sorted(iterable)[-2:]
 
P

Peter Otten

david said:
Hi all,

I'm trying to find a good way of doing the following:

Each n-tuple in combinations( range( 2 ** m ), n ) has a corresponding
value n-tuple (call them "scores" for clarity later). I'm currently
storing them in a dictionary, by doing:

----
res={}
for i in itertools.combinations( range( 2**m ) , n):
res[ i ] = getValues( i ) # getValues() is computationally
expensive
----

For each (n-1)-tuple, I need to find the two numbers that have the
highest scores versus them. I know this isn't crystal clear, but
hopefully an example will help: with m=n=3:

Looking at only the (1, 3) case, assuming:
getValues( (1, 2, 3) ) == ( -200, 125, 75 ) # this contains the
highest "other" score, where 2 scores 125
getValues( (1, 3, 4) ) == ( 50, -50, 0 )
getValues( (1, 3, 5) ) == ( 25, 300, -325 )
getValues( (1, 3, 6) ) == ( -100, 0, 100 ) # this contains the
second-highest, where 6 scores 100
getValues( (1, 3, 7) ) == ( 80, -90, 10 )
getValues( (1, 3, 8) ) == ( 10, -5, -5 )

I'd like to return ( (2, 125), (6, 100) ).

The most obvious (to me) way to do this would be not to generate the
res dictionary at the beginning, but just to go through each
combinations( range( 2**m), n-1) and try every possibility... this
will test each combination n times, however, and generating those
values is expensive. [e.g. (1,2,3)'s scores will be generated when
finding the best possibilities for (1,2), (1,3) and (2,3)]

What I'm doing now is ugly, and i think is where i'm confusing myself:

----
best2={}
for i in itertools.combinations( range( 2**m), n-1):
scorelist=[]
for j in range( 2**m ):
if j not in i:
k=list(i)
k.append(j)
k=tuple(sorted(k)) #gets the key for looking up the
scores in res
scorelist.append((j,res[k][k.index(j)]))
best2=sorted(scorelist,key=lambda x: -x[1])[:2]


After some tinkering I came up with

def calculate(getvalues):
best2 = defaultdict(list)
for t in combinations(range(2**m), n):
values = getvalues(t)
for i, k in enumerate(t):
best2[t[:i] + t[i+1:]].append((k, values))
return dict((k, nlargest(2, v, key=itemgetter(1))) for k, v in
best2.iteritems())

which is concise but slower than your code with Raymond's improvements. I
then tried inlining the nlargest() operation:

def calculate_inlined(getvalues):
best2 = {}
for t in combinations(range(2**m), n):
values = getvalues(t)
for i, k in enumerate(t):
key = t[:i] + t[i+1:]
value = values
if key not in best2:
best2[key] = [(k, value), (None, None)]
else:
a, b = best2[key]
if value > a[1]:
best2[key] = [(k, value), a]
elif value > b[1]:
best2[key] = [a, (k, value)]
return best2

This gives a speed boost (I measured with m=n=5) but is both messy and
brittle. I've got a hunch that there is a better way; I just don't see it at
the moment...

Peter
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,743
Messages
2,569,478
Members
44,898
Latest member
BlairH7607

Latest Threads

Top