This post has two parts. First is my feedback on sets. (Hello? Last
summer called, they want their discussion thread back...) Second is some
questions about my implementation of a partition function for sets.
Part 1.
---
From: Raymond Hettinger ([email protected])
Subject: Py2.3: Feedback on Sets
Newsgroups: comp.lang.python
Date: 2003-08-11 23:02:18 PST
I've gotten lots of feedback on the itertools module
but have not heard a peep about the new sets module.
* Are you overjoyed/outraged by the choice of | and &
as set operators (instead of + and *)?
I like '|' and '&', because union and intersection are more like 'or' and
'and' than they are like addition and multiplication.
* Is the support for sets of sets necessary for your work
and, if so, then is the implementation sufficiently
powerful?
Sets would be far less interesting and useful if we could not have sets
of sets. The implementation is sufficient, but not entirely intuitive. Of
course, there is no distinction between immutable and mutable sets in
normal set theory. I could live with only immutable sets, but it is nice
being able to use:
myset.add(x)
instead of:
newmyset = myset | Set([x])
I would like (I think) syntactic support where curly-braces could replace
the class constructor.
Another unintuitive feature is that there can be multiple sets with the
same members. I.e., you can have sets A and B such that "A == B" is true
and "A is B" is false. I would like more information on the reasons for
this choice and whether there would be anything to gain or lose by having
"A == B" entail "A is B".
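The behavior is easy to demonstrate (shown here with built-in frozenset
syntax for concreteness; sets.Set behaves the same way):

```python
# Two separately constructed sets with identical members:
a = frozenset([1, 2, 3])
b = frozenset([1, 2, 3])

print(a == b)  # compares by membership -> True
print(a is b)  # compares object identity -> False
```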
* Is there a compelling need for additional set methods like
Set.powerset() and Set.isdisjoint(s) or are the current
offerings sufficient?
I don't know if the need is compelling, but I consider the implementation
incomplete without powerset() and some other functions. However, please
see my discussion of a partition function below.
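To be concrete, here is the sort of thing I have in mind for powerset()
(a minimal sketch using built-in frozensets and itertools, not a proposal
for the actual API):

```python
from itertools import combinations

def powerset(s):
    """Return all 2**len(s) subsets of s, each as a frozenset."""
    items = list(s)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]

subsets = powerset({1, 2, 3})
print(len(subsets))  # -> 8
```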
* Does the performance meet your expectations?
I don't know. For memory usage, I did some tests and it looks like a set
that is the singleton of an integer takes about 200 bytes, maybe a little
less. On the one hand that isn't bad, on the other, it's kind of big. 50
bytes would be great. Does or could Set use __slots__?
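A rough way to check the per-object overhead (sketched with
sys.getsizeof, which reports only the shallow size of the set object,
not its elements, and whose exact numbers vary by interpreter build):

```python
import sys

singleton = set([1])
size = sys.getsizeof(singleton)  # shallow size of the set object, in bytes
print(size)
```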
As for speed, see below.
* Do you care that sets can only contain hashable elements?
Is this required for efficiently preventing duplicate elements, or at
least for having lookups be O(1)? If so, I can live with it. In general,
though, I would think that the more types of things that are eligible to
be elements, the better.
* How about the design constraint that the argument to most
set methods must be another Set (as opposed to any iterable)?
Fine with me. It is not obvious in the first place what you should expect
the union of a set and a tuple to be (in set theory, not Python).
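One middle ground worth considering, sketched below with built-in set
syntax: keep the operators strict (set-with-set only) but let the named
methods accept any iterable, since union with a tuple at least has an
unambiguous reading.

```python
s = {1, 2}

# Named methods can reasonably accept any iterable...
u = s.union((2, 3))
print(u)  # -> {1, 2, 3}

# ...while the operator form stays strict about types.
try:
    s | (2, 3)
    strict = False
except TypeError:
    strict = True
print(strict)  # -> True
```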
* Are the docs clear? Can you suggest improvements?
5.13.1 Set Objects. The table of operations could list proper subset as
well as subset.
Also, I would be happy to see as much information as possible about O()
properties of sets. (Explicit, not "same as a dictionary".) Maybe that
doesn't belong in the documentation proper.
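In particular, this is the distinction I would like the table to spell
out (illustrated with built-in set comparisons):

```python
a = {1, 2}
b = {1, 2, 3}

print(a <= b)  # subset                   -> True
print(a < b)   # proper subset            -> True
print(b <= b)  # a set is its own subset  -> True
print(b < b)   # but not a proper subset  -> False
```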
* Are sets helpful in your daily work or does the need arise
only rarely?
I am strongly in favor of moving sets into the core language with
supporting syntax and a C implementation.
Part 2
---
Most of what is true of the performance of a partition function should
also hold for powerset, combination, and permutation (for ordered
subsets) functions.
In general, I don't know if a partition function has much use when you're
dealing with sets with moderately large numbers of members. The problem
with these functions is that their size grows quickly. I have ruled out
making one giant set of all the partitions because it is too big (1
million sets uses 200MB memory). Instead I have a generator function that
yields a different partition on each iteration. The problem then is that
it takes too long. I fear that, despite what I said above about including
them in a complete implementation of sets for Python, there may not be
any practical use for these functions except in limited domains with
small numbers of elements.
So I am asking for advice on improving my implementation (a
constant-factor speedup would be great, a factor of n would be amazing,
although maybe impossible), and also for general advice on the best
performance I could ever hope for with these sorts of functions. Below
are some numbers, and then the code.
For those who don't know, a partition P of a set A is a set of mutually
exclusive and jointly exhaustive subsets of A. In other words, every
member of A is a member of exactly one member of P.
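That definition translates directly into a checker (a sketch using
built-in frozensets; is_partition is my name for it, not part of any
module):

```python
def is_partition(p, a):
    """True if p is a set of non-empty, mutually exclusive,
    jointly exhaustive subsets of a."""
    blocks = list(p)
    if any(len(b) == 0 for b in blocks):
        return False  # partitions contain no empty blocks
    # If the union covers a and the block sizes sum to len(a), the
    # blocks cannot overlap (mutually exclusive) and together they
    # cover every member of a (jointly exhaustive).
    return (sum(len(b) for b in blocks) == len(a)
            and frozenset().union(*blocks) == frozenset(a))

print(is_partition({frozenset('ab'), frozenset('c')}, set('abc')))  # -> True
```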
As I learned when working on this implementation, the number of
partitions for a set of size n is the Bell Number of n. The formula for
computing the Bell Number isn't really relevant at the moment, but here
are the Bell Numbers for 0,...,14:
1, 1, 2, 5, 15, 52, 203, 877, 4140, 21147, 115975, 678570, 4213597,
27644437, 190899322
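Those values can be reproduced with the Bell triangle: each row begins
with the last entry of the previous row, each later entry is the sum of
its left neighbor and the entry above that neighbor, and B(n) is the
first entry of row n. A short sketch:

```python
def bell_numbers(n):
    """Return [B(0), ..., B(n)] computed via the Bell triangle."""
    bells = [1]  # B(0)
    row = [1]
    for _ in range(n):
        new_row = [row[-1]]  # next row starts with previous row's last entry
        for above in row:
            new_row.append(new_row[-1] + above)
        row = new_row
        bells.append(row[0])  # B(k) is the first entry of row k
    return bells

print(bell_numbers(14)[-1])  # -> 190899322
```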
This means that a set with 3 elements has 5 partitions. Here is an
example:
Partitions({a,b,c}) =
{
  { {a,b,c} },
  { {a,b}, {c} },
  { {a,c}, {b} },
  { {b,c}, {a} },
  { {a}, {b}, {c} }
}
This shows one big set of all the partitions, where each partition is a
set of sets of elements. The function below, being a generator, does not
return one big set of all partitions, but a different partition on each
iteration.
Now, my computer is a 1.8GHz Athlon with 512MB RAM (PC133 unfortunately).
I use Fedora Core 1 Linux and Python 2.3 RPM from python.org. For a
benchmark, the following code takes about 10 seconds to run:
i = 10000000 # ten million
while i > 0:
    i -= 1
So here are some times for generating the partitions of sets of
increasing sizes:
1, <1s
2, <1s
3, <1s
4, <1s
5, <1s
6, <1s
7, <1s
8, <1s
9, 1.6s
10, 9s
11, 53s
12, 5m25s
For my current project, I need to generate partitions of sets up to size
25. At the roughly 13,000 partitions per second implied by the n=12
timing, Bell(25) (about 4.6 * 10^18) works out to something like ten
million years, so I have little hope it will ever be feasible, and I
have begun looking for a different approach. Thanks for any replies!
#begin code
from sets import Set, ImmutableSet

#def singleton(s):
#    return Set([ImmutableSet(s)])

def partitions(s):
    # Base case: a set of at most one element has exactly one
    # partition, namely the one whose only block is the set itself.
    if len(s) <= 1:
        yield Set([ImmutableSet(s)])
    else:
        # Pick an arbitrary element x and extend each partition
        # of the remaining elements.
        x = Set([iter(s).next()])
        for t in partitions(s - x):
            # Either x forms a block of its own...
            yield t | Set([x])
            # ...or x joins one of the existing blocks u.
            for u in t:
                yield (t - Set([u])) | Set([u | x])

s = Set("abcdefghij") #10 elements, which is 9s runtime
count = 0
for p in partitions(s):
    count += 1
print count
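For anyone who wants to try this on a current interpreter, here is the
same recursion restated with built-in frozensets (the algorithm is
unchanged, so the growth rate is the same):

```python
def partitions(s):
    """Yield every partition of s as a frozenset of frozenset blocks."""
    s = frozenset(s)
    if len(s) <= 1:
        # A set of at most one element has exactly one partition: itself.
        yield frozenset([s]) if s else frozenset()
        return
    x = next(iter(s))
    for t in partitions(s - {x}):
        # Either x forms a block of its own...
        yield t | {frozenset([x])}
        # ...or x joins one of the existing blocks.
        for u in t:
            yield (t - {u}) | {u | {x}}

parts = list(partitions('abcde'))
print(len(parts))  # -> 52, i.e. Bell(5)
```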