selecting dictionaries to maximize counts

Steven Bethard · Feb 18, 2005

I have a list of dictionaries. Each dictionary holds counts of various
'words', e.g.:

py> countdicts = [
.... dict(a=9, b=9, c=9),
.... dict(a=8, b=7),
.... dict(a=4, b=5, c=12)]

I need to select dicts with the constraint that the number of each
'word' totalled over all selected dicts doesn't exceed a given
MAX_VALUE. Right now, I do this by:

py> def select(dicts, n):
.... result = []
.... counts = {}
.... def doadd(d):
.... for label, count in d.iteritems():
.... if counts.get(label, 0) + count > n:
.... return False
.... return True
.... for d in dicts:
.... if doadd(d):
.... result.append(d)
.... for label, count in d.iteritems():
.... counts[label] = counts.get(label, 0) + count
.... return result
....

Basically, I use a greedy approach -- adding a dict each time if I can.
This leads to some suboptimal solutions given that, while the total
counts must not exceed MAX_VALUE, I'd like them to be as close to
MAX_VALUE as possible. An example of data on which my select function
produces such a suboptimal solution:

py> countdicts = [
.... dict(a=9, b=9, c=9),
.... dict(a=8, b=7),
.... dict(a=4, b=5, c=12)]
py> select(countdicts, 12)
[{'a': 9, 'c': 9, 'b': 9}]

The better solution would have been to select the second two dicts --
then all 'words' would have count 12.

I don't need an optimal solution; in fact, the solution I have right now
is tolerable. But if anyone knows a simple way to improve my
performance, I'd appreciate it. Thanks!

STeVe

P.S. No, I'm not just making these problems up! I'm sick, but not that
sick.

It has to do with resampling a set of data... If you're
really interested, I can provide the full details, but they'll only make
the problem even _more_ complicated. =)

John Machin · Feb 18, 2005

Steven said:
I have a list of dictionaries. Each dictionary holds counts of various
'words', e.g.:

Basically, I use a greedy approach -- adding a dict each time if I can.
This leads to some suboptimal solutions given that, while the total

counts must not exceed MAX_VALUE, I'd like them to be as close to
MAX_VALUE as possible. An example of data on which my select function
produces such a suboptimal solution:

py> countdicts = [
... dict(a=9, b=9, c=9),
... dict(a=8, b=7),
... dict(a=4, b=5, c=12)]
py> select(countdicts, 12)
[{'a': 9, 'c': 9, 'b': 9}]

The better solution would have been to select the second two dicts --

then all 'words' would have count 12.

I don't need an optimal solution; in fact, the solution I have right now
is tolerable. But if anyone knows a simple way to improve my
performance, I'd appreciate it. Thanks!

This is more than a bit OT but you've sucked me in

You should specify a function that assigns a score to valid solutions;
it's not apparent from the wording above and the example exactly what
that function might be. Thinking through the alternatives might help
you solve the problem yourself

Steven Bethard · Feb 18, 2005

John said:
Steven said:

I have a list of dictionaries. Each dictionary holds counts of
various 'words'...
...
Basically, I use a greedy approach -- adding a dict each time if I
can.
...
This leads to some suboptimal solutions given that, while the total
counts must not exceed MAX_VALUE, I'd like them to be as close to
MAX_VALUE as possible. An example of data on which my select
function produces such a suboptimal solution:

py> countdicts = [
... dict(a=9, b=9, c=9),
... dict(a=8, b=7),
... dict(a=4, b=5, c=12)]
py> select(countdicts, 12)
[{'a': 9, 'c': 9, 'b': 9}]

The better solution would have been to select the second two dicts --
then all 'words' would have count 12.

I don't need an optimal solution; in fact, the solution I have right
now is tolerable. But if anyone knows a simple way to improve my
performance, I'd appreciate it. Thanks!

Click to expand...

You should specify a function that assigns a score to valid solutions;
it's not apparent from the wording above and the example exactly what
that function might be. Thinking through the alternatives might help
you solve the problem yourself

Yeah, I guess that's a good idea. Here's a scoring function that should
give the right range of scores (where 1.0 is perfect and 0.0 is failure):

py> from __future__ import division
py> def score(origdicts, selecteddicts, n):
.... origwords = set(word for d in origdicts for word in d)
.... counts = {}
.... for d in selecteddicts:
.... for word, count in d.iteritems():
.... counts[word] = counts.get(word, 0) + count
.... if counts[word] > n:
.... return 0.0
.... return sum(counts.itervalues())/len(origwords)/n
....
py> countdicts = [
.... dict(a=9, b=9, c=9),
.... dict(a=8, b=7),
.... dict(a=4, b=5, c=12)]
py> score(countdicts, countdicts, 12)
0.0
py> score(countdicts, select(countdicts, 12), 12)
0.75
py> score(countdicts, [dict(a=8, b=7), dict(a=4, b=5, c=12)], 12)
1.0

It's pretty easy to build a configuration that can't be solved with a
perfect score:

py> def subsets(lst):
.... if not lst:
.... yield []
.... else:
.... first = lst[:1]
.... for subset in subsets(lst[1:]):
.... yield subset
.... yield first + subset
....
py> countdicts = [
.... dict(a=1, b=2, c=3),
.... dict(a=4, b=5, c=6),
.... dict(a=7, b=8, c=9)]
py> for dicts in subsets(countdicts):
.... print score(countdicts, dicts, 9), dicts
....
0.0 []
0.222222222222 [{'a': 1, 'c': 3, 'b': 2}]
0.555555555556 [{'a': 4, 'c': 6, 'b': 5}]
0.777777777778 [{'a': 1, 'c': 3, 'b': 2}, {'a': 4, 'c': 6, 'b': 5}]
0.888888888889 [{'a': 7, 'c': 9, 'b': 8}]
0.0 [{'a': 1, 'c': 3, 'b': 2}, {'a': 7, 'c': 9, 'b': 8}]
0.0 [{'a': 4, 'c': 6, 'b': 5}, {'a': 7, 'c': 9, 'b': 8}]
0.0 [{'a': 1, 'c': 3, 'b': 2}, {'a': 4, 'c': 6, 'b': 5}, {'a': 7, 'c':
9, 'b': 8}]

Hmmm... I can't do an exhaustive search like this one here -- I have
~8000 dicts, so 2**N is definitely too large of a space to search. Now
that I have a scoring routine maybe I could do an A* search though...

Thanks for the input!

STeVe

Brian Beck · Feb 18, 2005

Steven said:
I have a list of dictionaries. Each dictionary holds counts of various
'words', e.g.:

py> countdicts = [
... dict(a=9, b=9, c=9),
... dict(a=8, b=7),
... dict(a=4, b=5, c=12)]

I need to select dicts with the constraint that the number of each
'word' totalled over all selected dicts doesn't exceed a given
MAX_VALUE. Right now, I do this by:

Not that you can't still improve performance of course, but this is an
NP-complete problem if you didn't know, so don't bang your head too hard...

Steven Bethard · Feb 19, 2005

Brian said:
Steven said:

I have a list of dictionaries. Each dictionary holds counts of
various 'words', e.g.:

py> countdicts = [
... dict(a=9, b=9, c=9),
... dict(a=8, b=7),
... dict(a=4, b=5, c=12)]

I need to select dicts with the constraint that the number of each
'word' totalled over all selected dicts doesn't exceed a given
MAX_VALUE. Right now, I do this by:

Click to expand...

Not that you can't still improve performance of course, but this is an
NP-complete problem if you didn't know, so don't bang your head too hard...

Yeah, it seemed a lot like one of those box-packing type problems, so if
it's NP-complete, it wouldn't surprise me. That's one of the reasons I
mentioned trying A* in one of my other responses. However, in
retrospect, I don't think I can apply A* because I don't really have a
"goal state". I could call a selection where all "words" had MAX_VALUE
counts a "goal state", but I'm not guaranteed that such a state always
exists (and if it doesn't A* just degenerates into an exhaustive search).

Anyway, do you know what name this problem is usually discussed under?
If I knew what to google for, I could probably find at least a few
simple heuristics to try...

Thanks again,

STeVe

Brian Beck · Feb 19, 2005

Steven said:
> Anyway, do you know what name this problem is usually discussed under?
If I knew what to google for, I could probably find at least a few
simple heuristics to try...

I think the closest thing would be the 'knapsack problem' or the 'subset
sum problem.'

http://en.wikipedia.org/wiki/Subset_sum_problem
http://en.wikipedia.org/wiki/Knapsack_problem

Debugging reference counts in python/C	0	Aug 30, 2006
Concise idiom to initialize dictionaries	8	Nov 9, 2004
Trouble with prediction code, for the life of me I can't figure out why it isnt running properly. Help would be appreciated.	0	Jul 8, 2023
Want to reduce steps of an operation with dictionaries	15	Oct 25, 2006
PEP 372 -- Adding an ordered directory to collections	28	Jun 16, 2008
Need help selecting items from a list box trigger by what selected in dropdown list	1	Dec 9, 2008
Best way to write this method?	12	Apr 22, 2010
Exception running GNU module "op25_grc.py" : AttributeError:'gr_hier_block2_sptr' object has no	3	Feb 15, 2013

selecting dictionaries to maximize counts

Steven Bethard

John Machin

Steven Bethard

Brian Beck

Steven Bethard

Brian Beck

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads