bags? 2.5.x?

D

Dan Stromberg

Is there a particular reason why bags didn't go into 2.5.x or 3000?

I keep wanting something like them - especially bags with something akin
to set union, intersection and difference.
 
D

Dan Stromberg

How about this recepie
<URL:http://www.ubookcase.com/book/Oreilly/ Python.Cookbook.2nd.edition/0596007973/pythoncook2-chp-18-sect-8.html>?

/W

I'd found this with google, but thank you for making sure I was aware of
it.

The author of the bag class said that he was planning to submit bags for
inclusion in 2.5 - is there a particular reason why they didn't go in?

I keep finding a need for bags. In the past, I've done this sort of
thing with dictionaries, but it's much nicer to have a bag class, and of
course it's better to have it in the standard library than to slurp it
into this, that and the other project.
 
W

Wildemar Wildenburger

Dan said:
The author of the bag class said that he was planning to submit bags for
inclusion in 2.5 - is there a particular reason why they didn't go in?
I wouldn't know. Not enough convincing use cases, I guess. Fools ;)

I keep finding a need for bags. In the past, I've done this sort of
thing with dictionaries, but it's much nicer to have a bag class, and of
course it's better to have it in the standard library than to slurp it
into this, that and the other project.
Then again, it is only one class. Also, if I may be so bold, why
wouldn't a simple list fit your needs (other than performance, of course)?


/W
 
R

Raymond Hettinger

I keep wanting something like them - especially bags with something
The author of the bag class said that he was planning to submit bags for
inclusion in 2.5 - is there a particular reason why they didn't go in?

Three reasons:

1. b=collections.defaultdict(int) went a long way
towards meeting the need to for a fast counter.

2. It's still not clear what the best API would be.
What should list(b) return for b.dict = {'a':3, 'b':0, 'c':-3}?
Perhaps, [('a', 3), ('b', 0), ('c', -3)]
or ['a', 'a', 'a']
or ['a']
or ['a', 'b', 'c']
or raise an Error for the negative entry.

3. I'm still working on it and am not done yet.


Raymond
 
D

Dan Stromberg

The author of the bag class said that he was planning to submit bags
for inclusion in 2.5 - is there a particular reason why they didn't go
in?

Three reasons:

1. b=collections.defaultdict(int) went a long way towards meeting the
need to for a fast counter.

2. It's still not clear what the best API would be. What should list(b)
return for b.dict = {'a':3, 'b':0, 'c':-3}? Perhaps, [('a', 3), ('b',
0), ('c', -3)] or ['a', 'a', 'a']
or ['a']
or ['a', 'b', 'c']
or raise an Error for the negative entry.

I'd suggest that .keys() return the unique list, and that list() return
the list of tuples. Then people can use list comprehensions or map() to
get what they really need.

It might not be a bad thing to have an optional parameter on __init__
that would allow the user to specify if they need negative counts or not;
so far, I've not needed them though.
3. I'm still working on it and am not done yet.

Decent reasons. :)

Thanks!

Here's a diff to bag.py that helped me. I'd like to think these meanings
are common, but not necessarily!

$ diff -b /tmp/bag.py.original /usr/local/lib/bag.py
18a19,58
def _difference(lst):
left = lst[0]
right = lst[1]
return max(left - right, 0)
_difference = staticmethod(_difference)

def _relationship(self, other, operator):
if isinstance(other, bag):
self_keys = set(self._data.keys())
other_keys = set(other._data.keys())
union_keys = self_keys | other_keys
#print 'union_keys is',union_keys
result = bag()
for element in list(union_keys):
temp = operator([ self[element], other [element] ])
#print 'operator is', operator
#print 'temp is', temp
result[element] += temp
return result
else:
raise NotImplemented

def union(self, other):
return self._relationship(other, sum)

__or__ = union

def intersection(self, other):
return self._relationship(other, min)

__and__ = intersection

def maxunion(self, other):
return self._relationship(other, max)

def difference(self, other):
return self._relationship(other, self._difference)

__sub__ = difference
 
M

MRAB

Three reasons:
1. b=collections.defaultdict(int) went a long way towards meeting the
need to for a fast counter.
2. It's still not clear what the best API would be. What should list(b)
return for b.dict = {'a':3, 'b':0, 'c':-3}? Perhaps, [('a', 3), ('b',
0), ('c', -3)] or ['a', 'a', 'a']
or ['a']
or ['a', 'b', 'c']
or raise an Error for the negative entry.

I'd suggest that .keys() return the unique list, and that list() return
the list of tuples. Then people can use list comprehensions or map() to
get what they really need.

I think that a bag is a cross between a dict (but the values are
always positive integers) and a set (but duplicates are permitted).

I agree that .keys() should the unique list, but that .items() should
return the tuples and list() should return the list of keys including
duplicates. bag() should accept an iterable and count the duplicates.

For example:
sentence = "the cat sat on the mat"
my_words = sentence.split()
print my_words ['the', 'cat', 'sat', 'on', 'the', 'mat']
my_bag = bag(my_words)
print my_bag
bag({'on': 1, 'the': 2, 'sat': 1, 'mat': 1, 'cat': 1})
my_list = list(my_bag)
['on', 'the', 'the', 'sat', 'mat', 'cat']

It should be easy to convert a bag to a dict and also a dict to a bag,
raising ValueError if it sees a value that's not a non-negative
integer (a value of zero just means "there isn't one of these in the
bag"!).
It might not be a bad thing to have an optional parameter on __init__
that would allow the user to specify if they need negative counts or not;
so far, I've not needed them though.
3. I'm still working on it and am not done yet.

Decent reasons. :)

Thanks!

Here's a diff to bag.py that helped me. I'd like to think these meanings
are common, but not necessarily!

$ diff -b /tmp/bag.py.original /usr/local/lib/bag.py
18a19,58
def _difference(lst):
left = lst[0]
right = lst[1]
return max(left - right, 0)
_difference = staticmethod(_difference)
def _relationship(self, other, operator):
if isinstance(other, bag):
self_keys = set(self._data.keys())
other_keys = set(other._data.keys())
union_keys = self_keys | other_keys
#print 'union_keys is',union_keys
result = bag()
for element in list(union_keys):
temp = operator([ self[element], other [element] ])
#print 'operator is', operator
#print 'temp is', temp
result[element] += temp
return result
else:
raise NotImplemented
def union(self, other):
return self._relationship(other, sum)
__or__ = union
def intersection(self, other):
return self._relationship(other, min)
__and__ = intersection
def maxunion(self, other):
return self._relationship(other, max)
def difference(self, other):
return self._relationship(other, self._difference)
__sub__ = difference
 
C

Christian Heimes

Dan said:
Is there a particular reason why bags didn't go into 2.5.x or 3000?

I keep wanting something like them - especially bags with something akin
to set union, intersection and difference.

Ask yourself the following questions:

* Is the feature useful for the broad mass?
* Has the feature been implemented and contributed for Python?
* Is the code well written, tested and documented?
* Is the code mature and used by lots of people?

Can you answer every question with yes?

Christian
 
D

Dan Stromberg

Ask yourself the following questions:

* Is the feature useful for the broad mass?

Yes, probably, at least if this kind of feature's inclusion in other
languages and my two recent needs for it are any indication. In other
languages, they are sometimes called bags or multisets.

* Has the feature been
implemented and contributed for Python?

Yes.

* Is the code well written,
tested and documented?

Yes, I believe so, though the patch I sent has only been used in
production a little.

* Is the code mature and used by lots of people?

This is hard to get a handle on. It's been on the web for a while, so
it's hard to say how many people have downloaded it and used it.
Can you answer every question with yes?

3 yesses and a "hard to say for sure". But maybe we could check the web
server log to get a download count for the maybe.
 
P

Paul Rubin

Dan Stromberg said:
Yes, probably, at least if this kind of feature's inclusion in other
languages and my two recent needs for it are any indication. In other
languages, they are sometimes called bags or multisets.

I use defaultdict(int) all the time, but it's been fairly rare
to want to do set operations (union, intersection, difference) on
these. I'm not even sure what the intersection should mean.
 
A

Arnaud Delobelle

I use defaultdict(int) all the time, but it's been fairly rare
to want to do set operations (union, intersection, difference) on
these.  I'm not even sure what the intersection should mean.

Let me write the multiset with 3 x's and 2y y' as {2x, 3y} (this is
not a suggested Python syntax!).

Note: in the following my multisets can have negative multiplicities,
whereas the usual definition only accepts positive multiplicities.

Let H be the collection of all python hashable objects.
Let Z be the set of integers.

So a multiset is a mapping from H to Z which maps cofinitely many
objects
to 0 (a sort of defaultdict(int))

=== For mathematically minded people ===

Well, if one thinks about the mathematical meaning of multisets, then
the collection of all multisets is the free Z-module over H.
Therefore the meaningful operations for multisets are addition,
subtraction and multiplication by a scalar.

=== For not so mathematically minded people ===

One way of thinking about a multiset can be as a vector whose
dimensions are explicitly named, instead of being implicit. For
example in 3D the vector [2, 3, 4] could instead be written as {2x,
3y, 4z}. Just like vectors, multisets can then be added, subtracted
and multiplied by as scalar:


{2x, 3y} + {4x, 7z} = {6x, 3y, 7z}
{4x, 2y, 1z} - {1x, 2y, 3z} = {3x, -2z}
{3x, 2y}*5 = {15x, 10y}

Now if one wants to extend union and intersection to multisets:

* For sets {x, y} union {y, z} = {x, y, z}. The natural way of
extending this to multisets is having the union operator take the
max of the multiplicities of each element, i.e.

{2x, 7y, 1z} union {3y, 1z, 5t} = {2x, 7y, 1z, 5t}
{3x, -5y} union {2z} = {3x, 2z} (!)

* Similarly, for intersection one would take the min of
multiplicities, i.e.

{2x, 7y, 1z} intersection {3y, 1z, 5t} = {3y, 1z}
{3x, -5y} intersection {2z} = {-5y} (!)

* difference and symmetric_difference can be defined in terms of +, -,
union, intersection:

A difference B = A - (A intersection B)
A symmetric_difference B = (A union B) - (A intersection B)

Note that union, intersection, difference and symmetric_difference
defined this way are fully compatible with the corresponding set
operations in the following sense:

If A and B are sets, then

- multiset(A).union(multiset(B)) == multiset(A.union(B))

- multiset(A).intersection(multiset(B)) == multiset(A.intersection(B))

(same for difference, symmetric_difference)

I think this is how I would like my multisets :)
 
P

Paul Rubin

Arnaud Delobelle said:
* For sets {x, y} union {y, z} = {x, y, z}. The natural way of
extending this to multisets is having the union operator take the
max of the multiplicities of each element, i.e.

That certainly doesn't fit the intuition of a bag of objects. I'd
think of the union of two bags as the result of dumping the contents
of both bags onto the table, i.e. you'd add the two vectors.
* Similarly, for intersection one would take the min of
multiplicities, i.e.

This is a reasonable interpretation I guess.
A difference B = A - (A intersection B)

I think difference would mean subtraction.
 
M

MRAB

That certainly doesn't fit the intuition of a bag of objects. I'd
think of the union of two bags as the result of dumping the contents
of both bags onto the table, i.e. you'd add the two vectors.
I could see uses for both types of union. You could have both A + B
which adds the multiplicities (the smallest bag which contains both
the bags) and A | B which takes the max of the multiplicities (the
smallest bag which contains either of the bags).
 
C

Carsten Haese

That certainly doesn't fit the intuition of a bag of objects. I'd
think of the union of two bags as the result of dumping the contents
of both bags onto the table, i.e. you'd add the two vectors.

In addition to possibly being more intuitive, the latter also agrees
with a use case of bags in complex analysis. If you consider the zeroes
of complex polynomials, let B1 be the zeroes of polynomial P1, and B2 be
the zeroes of polynomial P2, then B1+B2 are the zeroes of the product
P1*P2.

Simple example: P1(z)=z-1, P2(z)=(z-1)^2. P1 has a simple zero at z=1,
P2 has a double zero at z=1. The product is (z-1)^3, which has a triple
zero at z=1.
 
A

Arnaud Delobelle

In addition to possibly being more intuitive, the latter also agrees
with a use case of bags in complex analysis. If you consider the zeroes
of complex polynomials, let B1 be the zeroes of polynomial P1, and B2 be
the zeroes of polynomial P2, then B1+B2 are the zeroes of the product
P1*P2.

Note that I was proposing exactly this: B1+B2 would add
multiplicities. On top of this, B1.union(B2) would take the max
(maybe spelt B1|B2). IMHO this is more in the spirit of union:
conceptually, the union of two things is the smallest thing that
contains them both.
 
A

Arnaud Delobelle

I could see uses for both types of union. You could have both A + B
which adds the multiplicities (the smallest bag which contains both
the bags) and A | B which takes the max of the multiplicities (the
smallest bag which contains either of the bags).

I agree, and this is what I proposed in my post (except that I spelt A|
B as A.union(B))
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,768
Messages
2,569,574
Members
45,048
Latest member
verona

Latest Threads

Top