Feedback on Sets, and Partitions

Steve

This post has two parts. First is my feedback on sets. (Hello? Last
summer called, they want their discussion thread back...) Second is some
questions about my implementation of a partition function for sets.

Part 1.
---
From: Raymond Hettinger ([email protected])
Subject: Py2.3: Feedback on Sets
Newsgroups: comp.lang.python
Date: 2003-08-11 23:02:18 PST

I've gotten lots of feedback on the itertools module
but have not heard a peep about the new sets module.

* Are you overjoyed/outraged by the choice of | and &
as set operators (instead of + and *)?

I like '|' and '&', because union and intersection are more like logical
or and and than they are like addition and multiplication.
* Is the support for sets of sets necessary for your work
and, if so, then is the implementation sufficiently
powerful?

Sets would be far less interesting and useful if we could not have sets
of sets. The implementation is sufficient, but not entirely intuitive. Of
course, there is no distinction between immutable and mutable sets in
normal set theory. I could live with only immutable sets, but it is nice
being able to use:
myset.add(x)

instead of:
newmyset = myset | Set([x])

I would like (I think) syntactic support where curly-braces could replace
the class constructor.
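To make the mutable/immutable split concrete, here is a minimal example
of current sets-module behavior (an illustration only, not proposed
syntax or API):

from sets import Set, ImmutableSet

inner = ImmutableSet([1, 2])      # inner sets must be hashable
outer = Set([inner])              # a mutable set of immutable sets
outer.add(ImmutableSet([3]))

# Adding a mutable Set also works: the module silently stores an
# ImmutableSet copy of it, which is part of what feels unintuitive.
outer.add(Set([4]))
print ImmutableSet([4]) in outer  # True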

Another unintuitive feature is that there can be multiple sets with the
same members. I.e., you can have sets A and B such that "A == B" is true
and "A is B" is false. I would like more information on the reasons for
this choice and whether there would be anything to gain or lose by having
"A == B" entail "A is B".
* Is there a compelling need for additional set methods like
Set.powerset() and Set.isdisjoint(s) or are the current
offerings sufficient?

I don't know if the need is compelling, but I consider the implementation
incomplete without powerset() and some other functions. However, please
see my discussion of a partition function below.
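For what it's worth, a powerset is easy to sketch as a generator over
bitmasks; this is only an illustration of what such a method could
yield, not a proposed implementation:

from sets import Set, ImmutableSet

def powerset(s):
    # one subset per bitmask; 2**len(s) subsets in total
    elems = list(s)
    for bits in xrange(2 ** len(elems)):
        yield ImmutableSet([elems[i] for i in range(len(elems))
                            if bits & (1 << i)])

for subset in powerset(Set("abc")):
    print subset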
* Does the performance meet your expectations?

I don't know. For memory usage, I did some tests and it looks like a set
that is the singleton of an integer takes about 200 bytes, maybe a little
less. On the one hand that isn't bad, on the other, it's kind of big. 50
bytes would be great. Does or could Set use __slots__?
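A crude way to estimate this, assuming a Linux /proc filesystem, is to
create many singleton sets and watch the resident set size; a sketch,
not a precise measurement:

from sets import Set

def rss_kb():
    # resident set size of this process, from /proc/self/status
    for line in open('/proc/self/status'):
        if line.startswith('VmRSS:'):
            return int(line.split()[1])

before = rss_kb()
singletons = [Set([i]) for i in xrange(100000)]
after = rss_kb()
print "approx bytes per singleton:", (after - before) * 1024 / len(singletons)

The list holding the sets adds a few bytes per entry, so this
overestimates a bit.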
As for speed, see below.
* Do you care that sets can only contain hashable elements?

Is this required for efficiently preventing duplicate elements, or at
least for having lookups be O(1)? If so, I can live with it. In general
though I would think that the more types of things that are eligible to
be elements, the better.
* How about the design constraint that the argument to most
set methods must be another Set (as opposed to any iterable)?

Fine with me. It is not obvious in the first place what you should expect
the union of a set and a tuple to be (in set theory, not Python).
* Are the docs clear? Can you suggest improvements?

5.13.1 Set Objects. The table of operations could list proper subset as
well as subset.
Also, I would be happy to see as much information as possible about O()
properties of sets. (Explicit, not "same as a dictionary".) Maybe that
doesn't belong in the documentation proper.
* Are sets helpful in your daily work or does the need arise
only rarely?

I am strongly in favor of moving sets into the core language with
supporting syntax and a C implementation.

Part 2
---
What is true of the performance of a partition function should, for the
most part, also be true of powerset, combination, and permutation (for
ordered subsets) functions.

In general, I don't know if a partition function has much use when you're
dealing with sets with moderately large numbers of members. The problem
with these functions is that the number of results grows quickly. I have ruled out
making one giant set of all the partitions because it is too big (1
million sets use 200 MB of memory). Instead I have a generator function that
yields a different partition on each iteration. The problem then is that
it takes too long. I fear that, despite what I said above about including
them in a complete implementation of sets for Python, there may not be
any practical use for these functions except in limited domains with
small numbers of elements.

So my question is for advice on improving my implementation (by a
constant factor would be great, by a factor of n would be amazing,
although maybe impossible), and also general advice on the best
performance I could ever hope for with these sorts of functions. Below
are some numbers, and then the code.

For those who don't know, a partition P of a set A is a set of mutually
exclusive and jointly exhaustive subsets of A. In other words, every
member of A is a member of exactly one member of P.
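That definition translates directly into a small checker; a sketch,
assuming the blocks are sets-module sets:

from sets import Set

def is_partition(p, s):
    # blocks must be nonempty; if the block sizes add up to len(s)
    # while the union equals s, every element is in exactly one block
    total = 0
    union = Set()
    for block in p:
        if len(block) == 0:
            return False
        total += len(block)
        union |= block
    return total == len(s) and union == s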

As I learned when working on this implementation, the number of
partitions for a set of size n is the Bell Number of n. The formula for
computing the Bell Number isn't really relevant at the moment, but here
are the Bell Numbers for 0,...,14:
1, 1, 2, 5, 15, 52, 203, 877, 4140, 21147, 115975, 678570, 4213597,
27644437, 190899322

This means that a set with 3 elements has 5 partitions. Here is an
example:
Partitions({a,b,c}) =
{
  { {a,b,c} },
  { {a,b}, {c} },
  { {a,c}, {b} },
  { {b,c}, {a} },
  { {a}, {b}, {c} }
}

This shows one big set of all the partitions, where each partition is a
set of sets of elements. The function below, being a generator, does not
return one big set of all partitions, but a different partition on each
iteration.

Now, my computer is a 1.8GHz Athlon with 512MB RAM (PC133 unfortunately).
I use Fedora Core 1 Linux and Python 2.3 RPM from python.org. For a
benchmark, the following code takes about 10 seconds to run:

i = 10000000  # ten million
while i > 0:
    i -= 1

So here are some times for generating the partitions of sets of
increasing sizes:

1, <1s
2, <1s
3, <1s
4, <1s
5, <1s
6, <1s
7, <1s
8, <1s
9, 1.6s
10, 9s
11, 53s
12, 5m25s
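Timings like these can be collected with a simple loop (this sketch
assumes the partitions generator defined at the end of this post):

import time
from sets import Set

for n in range(1, 13):
    s = Set(range(n))
    start = time.time()
    count = 0
    for p in partitions(s):   # defined below
        count += 1
    print "%2d  %8d partitions  %.1fs" % (n, count, time.time() - start)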

For my current project, I need to generate partitions of sets up to size
25. I have little hope that this will ever be feasible, and I have begun
looking for a different approach. Thanks for any replies!

#begin code

from sets import Set, ImmutableSet

#def singleton(s):
#    return Set([s])

def partitions(s):
    # yield each partition of s as a Set of sets (the sets module
    # stores mutable inner Sets as ImmutableSet copies)
    if len(s) <= 1:
        # a set with zero or one elements has exactly one partition
        yield Set([s])
    else:
        # split off one element x; for each partition t of the rest,
        # x either forms its own block or joins an existing block u
        x = Set([iter(s).next()])
        for t in partitions(s - x):
            yield t | Set([x])
            for u in t:
                yield (t - Set([u])) | Set([u | x])


s = Set("abcdefghij") #10 elements, which is 9s runtime
count = 0
for p in partitions(s):
    count += 1

print count
 
David Eppstein

There are 4638590332229999353 partitions of a 25-item set.
<http://www.research.att.com/projects/OEIS?Anum=A000110>

So, I think generating them all is a little out of the question.

BTW, here is some code for generating these numbers. I think it's a
little interesting that one can write the obvious one-line definition
for factorials, forgetting the base case, and then make it work anyway
via the magic of memoization. Something to think about in connection
with PEP 318...


def memoize(recurrence, base_case={}):
    # wrap a recurrence in a memo table, seeded with the base case;
    # keys are stored as argument tuples
    memo = {}
    for k, rk in base_case.items():
        if not isinstance(k, tuple):
            k = (k,)
        memo[k] = rk

    def memoized(*args):
        if args not in memo:
            memo[args] = recurrence(*args)
        return memo[args]

    return memoized

def factorial(n):
    return factorial(n-1)*n

factorial = memoize(factorial, base_case={0:1})

def choose(n,k):
    return factorial(n) // factorial(k) // factorial(n-k)

def bell(n):
    return sum([choose(n-1,k)*bell(n-1-k) for k in range(n)])

bell = memoize(bell, base_case={0:1})

print "There are", bell(25), "partitions of a 25 item set."
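With the decorator syntax discussed in PEP 318 (not yet in a released
Python when this was written), the same trick might be spelled as
follows; memoize_with is a hypothetical helper wrapping the memoize
above:

def memoize_with(base_case):
    # adapt memoize() to the proposed @-decorator syntax
    def decorate(recurrence):
        return memoize(recurrence, base_case)
    return decorate

@memoize_with({0: 1})
def factorial(n):
    # the recursive call goes through the module-level name, which the
    # decorator has already rebound to the memoized version, so the
    # missing base case is supplied by the memo
    return factorial(n-1)*n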
 
Ernie

Steve said:
This post has two parts. First is my feedback on sets. (Hello? Last
summer called, they want their discussion thread back...) Second is some
questions about my implementation of a partition function for sets.

Part 1.
--- ..... snipped
Part 2
--- ..... snipped
For my current project, I need to generate partitions of sets up to size
25. I have little hope that this will ever be feasible, and I have begun
looking for a different approach. Thanks for any replies!

For Part 2, Google on restricted growth functions if you are
interested in non-recursive generation of set partitions of n objects
with k classes.

I have an implementation in Python, not using sets but rather an
array of integers which can be used for indexing, for processing
equivalences and set partitions (see
http://www.geocities.com/eadorio/equiv.pdf).
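As a sketch of the idea (this is not the code from the paper): a
restricted growth string b of length n has b[0] = 0 and
b[i] <= max(b[0..i-1]) + 1, and each such string encodes one set
partition, with element i going to block b[i]. A minimal non-recursive
generator:

def rgs(n):
    # enumerate all restricted growth strings of length n
    # in lexicographic order
    b = [0] * n
    m = [0] * n           # m[i] = max(b[0..i])
    while True:
        yield b[:]
        i = n - 1
        while i > 0 and b[i] == m[i-1] + 1:
            i -= 1        # b[i] is at its maximum; move left
        if i == 0:
            return        # b[0] is fixed at 0, so we are done
        b[i] += 1
        m[i] = max(m[i-1], b[i])
        for j in range(i+1, n):
            b[j] = 0
            m[j] = m[i]

print len(list(rgs(3)))  # 5, the Bell number of 3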

Regards,

Ernie
 
Anton Vredegoor

David Eppstein said:
BTW, here is some code for generating these numbers. I think it's a
little interesting that one can write the obvious one-line definition
for factorials, forgetting the base case, and then make it work anyway
via the magic of memoization. Something to think about in connection
with PEP 318...

[snip some interesting code]

Here's a more tedious way to do the same, but this code is more
"graphic" in that it shows how one can build up a triangle of numbers
line by line according to some mutation prescription.

Depending on the kind of prescription, one ends up with Pascal's
triangle or the triangle for the Bell numbers.

def pascal_mutate(L):
    # next row of Pascal's triangle: new[j] = L[j-1] + L[j]
    R = [L[i] + L[i+1] for i in xrange(len(L)-1)]
    return [1] + R + [1]

def stirling_mutate(L):
    # next row of the Stirling (second kind) triangle:
    # S(n+1,k) = S(n,k-1) + k*S(n,k)
    R = [L[i] + (i+2)*L[i+1] for i in xrange(len(L)-1)]
    return [1] + R + [1]

def triangle(mutator):
    L = [1]
    while 1:
        yield L
        L = mutator(L)

def listify(gen):
    cache = []
    def func(n):
        while len(cache) <= n:
            cache.append(gen.next())
        return cache[n]
    return func

def triangle_function_generator(mutator):
    T = listify(triangle(mutator))
    def func(n, k):
        return T(n)[k]
    return func

pascal = triangle_function_generator(pascal_mutate)
stirling_raw = triangle_function_generator(stirling_mutate)

def noverk(n, k):
    return reduce(lambda a, b: a*(n-b)/(b+1), range(k), 1)

def stirling(n, k):
    return stirling_raw(n-1, k-1)

def bell(n):
    return sum([stirling(n, i) for i in range(1, n+1)])

def test():
    n, k = 10, 4
    assert pascal(n, k) == noverk(n, k)
    print bell(25)

if __name__=='__main__':
    test()

#prints 4638590332229999353

Anton
 
A. Lloyd Flanagan

Steve said:
Another unintuitive feature is that there can be multiple sets with the
same members. I.e., you can have sets A and B such that "A == B" is true
and "A is B" is false. I would like more information on the reasons for
this choice and whether there would be anything to gain or lose by having
"A == B" entail "A is B".

I can't speak for the designers, but I think there are two problems with
A == B => A is B. One, none of the other sequence types behave that way.
Two (and this is the real killer), I think it would be tough to
implement.

A = Set([1, 2, 3])
B = Set([1, 2])
# A is B clearly false, as is A == B
B.add(3)
# Now A == B is true

After every alteration of a set, such as the add() above, you'd have
to do an exhaustive comparison of every other set for equality. Then
you'd do something like B = A to drop the old B set and set up another
reference to A. Then if you altered B and not A, you'd have to do
something like copy A, make the change, and assign the result to B.

Problem with that is, it breaks the case where you want A and B to be
pointing to a single set which is modified so that A and B change.
The most common case would be passing a Set to a subroutine.
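That sharing case looks like this (a small illustration; drop_evens is
just a made-up helper):

from sets import Set

def drop_evens(s):
    # mutates the set passed in; iterate over a copy while mutating
    for x in list(s):
        if x % 2 == 0:
            s.discard(x)

A = Set([1, 2, 3])
B = A              # two names for one set
drop_evens(B)
print A            # Set([1, 3]) -- A sees the change, by design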

So I just don't think it's feasible.
 
Andrew Koenig

Steve said:
I could live with only immutable sets, but it is nice
being able to use:
myset.add(x)

instead of:
newmyset = myset | Set([x])

What about

myset |= Set([x])

Presumably that would work analogously to += for strings.
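As far as I know, the 2.3 sets module already supports the in-place
operators; |= maps to Set.union_update:

from sets import Set

myset = Set([1, 2])
myset |= Set([3])   # in-place union, updates myset itself
print myset         # Set([1, 2, 3])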
Another unintuitive feature is that there can be multiple sets with the
same members. I.e., you can have sets A and B such that "A == B" is true
and "A is B" is false.

Why is that any more unintuitive than the fact that there can be multiple
integers with the same value?
For example:

x = 12345
y = 12344+1

After these two statements, most Python implementations will evaluate
"x is y" as false, but of course x == y is true.
 
Anton Vredegoor

Ernie said:
I have an implementation in Python, not using sets but rather an
array of integers which can be used for indexing, for processing
equivalences and set partitions (see
http://www.geocities.com/eadorio/equiv.pdf).

Thanks.

Here's something I wrote in 2000. I just did some very superficial
inspection of the code and adapted a few names that would not be ok
anymore, now that we have sets, but very likely I would do things very
differently now.

It might spark other implementations though, so I'm providing it
below.

# SetPart.py : return a set divided into subsets. The possible ways
# this can be done are indexed. Algorithms are adapted from the book
# "Combinatorial Algorithms..." (1999) by Donald Kreher and
# Douglas Stinson.
# See also: http://www.math.mtu.edu/~kreher/cages.html
# Translated and adapted for Python by Anton Vredegoor:
# (e-mail address removed) 19 jun 2000
# last update: april 2004

class SetPart:

    def __init__(self, aset):
        self.aset = aset
        self.m = len(self.aset)
        self.d = self.GeneralizedRGF(self.m)
        self.count = self.d[self.m][0]

    def __getitem__(self, index):
        # Partition number index
        if index > self.count - 1:
            raise IndexError
        rgf = self.UnrankRGF(self.m, index)
        result = self.RGFToSetPart(self.m, rgf, self.aset)
        return result[1:]

    ## ** Algorithm 3.12
    def RGFToSetPart(self, m, b, aset):
        # Transform a restricted growth function list into a partition
        A = []
        n = 1
        for i in range(1, m+1):
            if b[i] > n:
                n = b[i]
        for i in range(n+1):
            A.append([])
        for i in range(1, m+1):
            A[b[i]].append(aset[i-1])
        return A

    ## ** Algorithm 3.14
    def GeneralizedRGF(self, m):
        # Compute the values d[i][j]
        d = [[1L] * (m+1) for i in range(m+1)]
        for i in range(1, m+1):
            for j in range(0, m-i+1):
                d[i][j] = j * d[i-1][j] + d[i-1][j+1]
        return d

    ## ** Algorithm 3.16
    def UnrankRGF(self, m, r):
        # Unrank a restricted growth function
        f = [1] * (m+1)
        j = 1
        d = self.d
        for i in range(2, m+1):
            v = d[m-i][j]
            cr = j * v
            if cr <= r:
                f[i] = j + 1
                r -= cr
                j += 1
            else:
                f[i] = r / v + 1
                r %= v
        return f


def test():
    aset = range(5)
    P = SetPart(aset)
    print P.count
    for x in P:
        print x

if __name__ == '__main__':
    test()

Anton
 
