Recursive generator for combinations of a multiset?


John O'Hagan

Short story: the subject says it all, so if you have an answer already,
fire away. Below is the long story of what I'm using it for, and why I
think it needs to be recursive. It may even be of more general
interest in terms of filtering the results of generators.

I'm playing with an anagram-generating module which works like this:

1) Generate the integer partitions of the length of the input string.

2) For each partition, replace each of its elements with a list of all
dictionary words which are a) the same length as that element and b) a
sub-multiset of the input string.

eg: "cat in hat" -> ... , [3,5], ... -> [[act, ant, ...], [antic,
attic, ...]]

3) Pass each resulting list of wordlists to itertools.product, filtering
out any result which is not the same multiset of characters as the
input string.
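The "sub-multiset" test in step 2 (and the final filter in step 3) can be
expressed with collections.Counter; a minimal sketch, with a made-up helper
name is_submultiset:

```python
from collections import Counter

def is_submultiset(word, pool):
    """True if every letter of `word` occurs in `pool` at least as
    many times as it occurs in `word` (spaces ignored)."""
    need = Counter(word.replace(' ', ''))
    have = Counter(pool.replace(' ', ''))
    return all(have[c] >= n for c, n in need.items())

# Words usable in an anagram of "cat in hat":
print(is_submultiset('ant', 'cat in hat'))    # True
print(is_submultiset('attic', 'cat in hat'))  # True
print(is_submultiset('cabin', 'cat in hat'))  # False: no 'b' available
```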

This works but gets very slow for long strings. It spends most of its
time at the filtering stage because most of the results have too many
of some characters (and therefore not enough of others). I do some
optimising of the lists prior to running product, but this only shaves
off smallish percentages.

I got a big speedup (factor of five for medium-length inputs, much more
for longer strings) by replacing itertools.product with this recursive
generator:
def cartesian_product(args, input_str):
    if not args:
        yield (), input_str
        return
    for words, check_str in cartesian_product(args[:-1], input_str):
        for word in args[-1]:
            # this bit prevents bothering with anything useless:
            # abandon the branch as soon as a letter is unavailable
            new_str = check_str
            for letter in word:
                if letter in new_str:
                    new_str = new_str.replace(letter, '', 1)
                else:
                    break
            else:
                yield words + (word,), new_str

Despite being inherently much slower than itertools.product, it can
"prune" branches of the recursion as soon as it accumulates too many of
any character. This means it's much faster and produces correct results
without further filtering.
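A self-contained toy run of the pruning idea (the generator restated here
so the snippet runs on its own; the two-word "dictionary" is invented):

```python
def cartesian_product(args, input_str):
    # Yield (words, leftover) pairs; a branch is dropped as soon as a
    # word needs a letter that is no longer available in the leftover.
    if not args:
        yield (), input_str
        return
    for words, check_str in cartesian_product(args[:-1], input_str):
        for word in args[-1]:
            new_str = check_str
            for letter in word:
                if letter in new_str:
                    new_str = new_str.replace(letter, '', 1)
                else:
                    break
            else:
                yield words + (word,), new_str

# Anagrams of 'tacin' split as [3, 2]: only 'cat' + 'in' survives; the
# 'tin' branch is pruned because no 'i' is left over for 'in'.
wordlists = [['cat', 'tin'], ['in']]
results = [words for words, left in cartesian_product(wordlists, 'tacin')
           if not left]
print(results)  # [('cat', 'in')]
```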

But there is another problem. The initial partitions contain repeated
elements, so the corresponding wordlists are also repeated. This means
that any cartesian product will contain many redundant results - the
same words in a different order. For a medium-sized string, this is
most of them.

A solution for a repeated section of a partition's wordlists is to use
r-combinations of the wordlist (where r is the number of repeats). In
this scenario, though, some words may be usable more than once, and the
right number of copies of those words must be added to the list to
allow for this. This means I need the combinations of a multiset (so I
can't use itertools.combinations).
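For concreteness, the target behaviour can be pinned down with a naive
reference version (hypothetical name multicombs_ref) that simply
deduplicates ordinary combinations — correct, but exponentially wasteful:

```python
from itertools import combinations

def multicombs_ref(it, r):
    # Naive reference: generate every r-combination of positions, then
    # drop repeats of equal selections. Defines the desired output for
    # a multiset such as 'aab' without being usable at scale.
    seen = set()
    for combo in combinations(sorted(it), r):
        if combo not in seen:
            seen.add(combo)
            yield combo

print(list(multicombs_ref('aab', 2)))   # [('a', 'a'), ('a', 'b')]
```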

I found a verbal description of such an algorithm and came up with this:

def multicombs(it, r):
    # it must be sorted so that equal elements are adjacent
    result = it[:r]
    yield result
    while 1:
        # find, from the right, a position that can still be advanced
        for i in range(-1, -r - 1, -1):
            rep = result[i]
            if rep < it[i]:
                break
        else:
            break
        # advance it to the next larger element and refill the tail
        for j, n in enumerate(it):
            if n > rep:
                break
        result = result[:i] + it[j:j - i]
        yield result

I added a call to this generator in a branch of the main generator
above to deal with repeated elements. This eliminates the redundant
results, but with a substantial slowdown: the program now spends most
of its time inside multicombs.

I'm hoping that if I could find a recursive multiset combination
generator, I could speed it up using the same pruning approach.
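For what it's worth, one possible shape for such a generator — a sketch,
not a tuned implementation — recurses over a sorted sequence, picking one
distinct element per level, so a "too many of this character" test could
be added at the marked point, just as in cartesian_product:

```python
def rec_multicombs(seq, r):
    # seq must be sorted so that equal elements are adjacent.
    if r == 0:
        yield ()
        return
    prev = None
    for i, x in enumerate(seq):
        if x == prev:
            continue  # skip duplicate choices of the first element
        prev = x
        # a pruning test on x could go here to abandon this branch early
        for rest in rec_multicombs(seq[i + 1:], r - 1):
            yield (x,) + rest

print(list(rec_multicombs('aab', 2)))  # [('a', 'a'), ('a', 'b')]
```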

Any suggestions?
