Candidate for a new itertool

  • Thread starter Raymond Hettinger

Raymond Hettinger

The existing groupby() itertool works great when every element in a
group has the same key, but it is not so handy when groups are
determined by boundary conditions.

For edge-triggered events, we need to convert a boundary-event
predicate to a groupby-style key function. The code below encapsulates
that process in a new itertool called split_on().

Would love you guys to experiment with it for a bit and confirm that
you find it useful. Suggestions are welcome.



Raymond

-----------------------------------------

from itertools import groupby

def split_on(iterable, event, start=True):
    'Split iterable on event boundaries (either start events or stop events).'
    # split_on('X1X23X456X', 'X'.__eq__, True)  --> X1 X23 X456 X
    # split_on('X1X23X456X', 'X'.__eq__, False) --> X 1X 23X 456X
    def transition_counter(x, start=start, cnt=[0]):
        before = cnt[0]
        if event(x):
            cnt[0] += 1
        after = cnt[0]
        return after if start else before
    return (g for k, g in groupby(iterable, transition_counter))

if __name__ == '__main__':
    for start in True, False:
        for g in split_on('X1X23X456X', 'X'.__eq__, start):
            print list(g)
        print

    from pprint import pprint
    boundary = '--===============2615450625767277916==\n'
    email = open('email.txt')
    for mime_section in split_on(email, boundary.__eq__):
        pprint(list(mime_section))
        print '= = ' * 30
 

bearophileHUGS

Raymond Hettinger, maybe it would be useful to add an optional flag
argument telling split_on whether to keep the separators or not? This
is the xsplit I usually use:


def xsplit(seq, key=bool, keepkeys=True):
    """xsplit(seq, key=bool, keepkeys=True): given an iterable seq and a
    predicate key, splits the iterable where key(item) is True and yields
    the parts as lists. If keepkeys is True then the splitting items are
    kept at the beginning of the sublists (but the first sublist may miss
    the key item).

    >>> list(xsplit([]))
    []
    >>> key = lambda x: 0x80 & x
    >>> l = [1,2,3,0xF0,4,5,6,0xF1,7,8,0xF2,9,10,11,12,13]
    >>> list(xsplit(l, key=key))
    [[1, 2, 3], [240, 4, 5, 6], [241, 7, 8], [242, 9, 10, 11, 12, 13]]
    >>> l = [0xF0,1,2,3,0xF0,4,5,6,0xF1,7,8,0xF2,9,10,11,12,13,0xF0,14,0xF1]
    >>> list(xsplit(l, key=key, keepkeys=False))
    [[1, 2, 3], [4, 5, 6], [7, 8], [9, 10, 11, 12, 13], [14]]
    >>> s1 = "100001000101100001000000010000"
    >>> ["".join(map(str, g)) for g in xsplit(s1, key=int)]
    ['10000', '1000', '10', '1', '10000', '10000000', '10000']
    >>> from itertools import groupby  # to compare against groupby
    >>> s2 = "1111100011111100011100101011111"
    >>> ["".join(map(str, g)) for h, g in groupby(s2, key=int)]
    ['11111', '000', '111111', '000', '111', '00', '1', '0', '1', '0', '11111']
    """
    group = []
    for el in seq:
        if key(el):
            if group:
                yield group
                group = []
            if keepkeys:
                group.append(el)
        else:
            group.append(el)
    if group:
        yield group


Maybe it's better to separate or denote the separators in some way?

A possibility:
"X1X23X456X" => "X", "1", "X", "23", "X", "456", "X"

Another possibility:
"X1X23X456X" => ("", "X"), ("1", "X"), (["2", "3"], "X"), (["4", "5", "6"], "X")

Another possibility (True == is a separator):
"X1X23X456X" => (True, "X"), (False, ["1"]), (True, "X"), (False, ["2", "3"]),
                (True, "X"), (False, ["4", "5", "6"]), (True, "X")

Is it useful to merge successive separators (notice the two Xs)?

"X1X23XX456X" => (True, ["X"]), (False, ["1"]), (True, ["X"]), (False, ["2", "3"]),
                 (True, ["X", "X"]), (False, ["4", "5", "6"]), (True, ["X"])

Oops, this is groupby :)
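
In fact that last merged form falls straight out of groupby(); a
minimal sketch (flagged_split is a name I just made up):

from itertools import groupby

def flagged_split(iterable, key):
    # yield (is_separator, items) pairs; consecutive separators stay
    # merged, exactly like the "X1X23XX456X" example above
    for is_sep, group in groupby(iterable, key):
        yield bool(is_sep), list(group)

list(flagged_split("X1X23XX456X", "X".__eq__)) reproduces the last
example above.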

Is a name like isplitter or splitter better for this itertool?

Bye,
bearophile
 

jay logan

The existing groupby() itertool works great when every element in a
group has the same key, but it is not so handy when groups are
determined by boundary conditions.

For edge-triggered events, we need to convert a boundary-event
predicate to a groupby-style key function.  The code below encapsulates
that process in a new itertool called split_on().

Would love you guys to experiment with it for a bit and confirm that
you find it useful.  Suggestions are welcome.

Raymond

-----------------------------------------

from itertools import groupby

def split_on(iterable, event, start=True):
    'Split iterable on event boundaries (either start events or stop events).'
    # split_on('X1X23X456X', 'X'.__eq__, True)  --> X1 X23 X456 X
    # split_on('X1X23X456X', 'X'.__eq__, False) --> X 1X 23X 456X
    def transition_counter(x, start=start, cnt=[0]):
        before = cnt[0]
        if event(x):
            cnt[0] += 1
        after = cnt[0]
        return after if start else before
    return (g for k, g in groupby(iterable, transition_counter))

if __name__ == '__main__':
    for start in True, False:
        for g in split_on('X1X23X456X', 'X'.__eq__, start):
            print list(g)
        print

    from pprint import pprint
    boundary = '--===============2615450625767277916==\n'
    email = open('email.txt')
    for mime_section in split_on(email, boundary.__eq__):
        pprint(list(mime_section))
        print '= = ' * 30

I've found this type of splitting quite useful when grouping sections
of a text file. I used the groupby function directly on the file, when
I would rather have used something like this.

However, I wonder if it would be helpful to break that function into
two instead of having the "start" flag (see the sketch after this
paragraph). The flag feels odd to me (maybe it's the name?), and the
documentation might read better for a newcomer. Also, it would be
cool if the function took keyword arguments; I wonder why most of the
other functions in the itertools module don't take keywords.
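
One way to read that two-function suggestion, assuming the split_on()
above is in scope (split_before and split_after are names I just made
up):

def split_before(iterable, event):
    # each group begins with an item for which event(item) is true
    return split_on(iterable, event, start=True)

def split_after(iterable, event):
    # each group ends with an item for which event(item) is true
    return split_on(iterable, event, start=False)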

I wouldn't split out the keys separately from the groups. But the idea
of a flag to exclude the keys sounds interesting to me.

Thank you for giving me the opportunity to use the nonlocal keyword
for the first time since trying out Python 3.0. I hope this is an
appropriate usage:


import itertools as it

def split_on(iterable, key=bool, start=True):
    'Split iterable on boundaries (either start events or stop events).'
    # split_on('X1X23X456X', 'X'.__eq__, True)  --> X1 X23 X456 X
    # split_on('X1X23X456X', 'X'.__eq__, False) --> X 1X 23X 456X
    flag = 0

    def event_marker(x, start_flag=start):
        nonlocal flag
        before = flag
        if key(x):
            flag += 1
        after = flag
        return after if start_flag else before

    return (g for k, g in it.groupby(iterable, key=event_marker))
 

Raymond Hettinger

Raymond Hettinger, maybe it would be useful to add an optional flag
argument telling split_on whether to keep the separators or not? This
is the xsplit I usually use:

In your experiences with xsplit(), do most use cases involve removing
the separators?

My thought is that if you know the separator is at the beginning, it
is easy to strip it off with a next() call on the subgroup iterator
(see the sketch below). If so, then there's no reason to complicate
the API with another set of options.
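
For example, a minimal sketch of that stripping step, assuming the
split_on() posted above (with start=True, each group leads with its
separator):

for group in split_on('X1X23X456X', 'X'.__eq__):
    next(group, None)    # discard the leading 'X' separator
    print list(group)    # --> ['1'], ['2', '3'], ['4', '5', '6'], []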

Also, separation events do not necessarily tie themselves to specific
values that can be thrown away. Perhaps the event detector is counting
the number of lines written on a page and generates a new-page event
every time a page is full.

Maybe it's better to separate or denote the separators in some way?

Probably not. Itertools work together best when a stream of tokens
all have the same meaning. Otherwise, you end up complicating the
consumer code with something like:

for val in split(...):
    if val == separator:
        do_onething(val)
    else:
        do_another(val)


Is it useful to merge successive separators (notice the two Xs)?

Probably depends on real-world use cases. Do you have any?


Is a name like isplitter or splitter better for this itertool?

Like groupbyer, filterer, teer, starmapper, and chainer ;-)



Raymond
 

bearophileHUGS

Raymond Hettinger:
In your experiences with xsplit(), do most use cases involve removing the separators?

Unfortunately I am not able to tell you how often I remove them. But
with strings I usually want the separators removed, just as str.split()
does:

>>> 'aXcdXfg'.split('X')
['a', 'cd', 'fg']

So sometimes I want to do the same with lists too. Maybe I am also a
bit spoiled by the D (V.1) language, where strings are dynamic arrays,
so the same templated functions work on both arrays and strings, and a
single xsplit() covers both.


Probably depends on real-world use cases. Do you have any?

I don't think so.

Bye and thank you for your work that keeps shrinking my bag of tricks
(about 100+ functions to go),
bearophile
 

pruebauno

The existing groupby() itertool works great when every element in a
group has the same key, but it is not so handy when groups are
determined by boundary conditions.

For edge-triggered events, we need to convert a boundary-event
predicate to a groupby-style key function.  The code below encapsulates
that process in a new itertool called split_on().

Would love you guys to experiment with it for a bit and confirm that
you find it useful.  Suggestions are welcome.

Raymond

-----------------------------------------

from itertools import groupby

def split_on(iterable, event, start=True):
    'Split iterable on event boundaries (either start events or stop events).'
    # split_on('X1X23X456X', 'X'.__eq__, True)  --> X1 X23 X456 X
    # split_on('X1X23X456X', 'X'.__eq__, False) --> X 1X 23X 456X
    def transition_counter(x, start=start, cnt=[0]):
        before = cnt[0]
        if event(x):
            cnt[0] += 1
        after = cnt[0]
        return after if start else before
    return (g for k, g in groupby(iterable, transition_counter))

if __name__ == '__main__':
    for start in True, False:
        for g in split_on('X1X23X456X', 'X'.__eq__, start):
            print list(g)
        print

    from pprint import pprint
    boundary = '--===============2615450625767277916==\n'
    email = open('email.txt')
    for mime_section in split_on(email, boundary.__eq__):
        pprint(list(mime_section))
        print '= = ' * 30

Sorry to hijack the thread, but I know that you have a knack for
finding good iterator patterns. I have noticed a pattern lately:
aggregation using a defaultdict. I quickly found two examples of
problems that could use this:

http://groups.google.com/group/comp.lang.python/browse_frm/thread/c8b3976ec3ceadfd

http://www.willmcgugan.com/blog/tech/2009/1/17/python-coder-test/

To show an example, using data like this:
data = [('red', 2, 'other data'), ('blue', 5, 'more data'),
        ('yellow', 3, 'lots of things'), ('blue', 1, 'data'),
        ('red', 2, 'random data')]
Then we can use groupby to do this:

>>> from itertools import groupby
>>> from operator import itemgetter
>>> [(el[0], sum(x[1] for x in el[1]))
...  for el in groupby(sorted(data, key=itemgetter(0)), itemgetter(0))]
[('blue', 6), ('red', 4), ('yellow', 3)]
>>> [(el[0], [x[1] for x in el[1]])
...  for el in groupby(sorted(data, key=itemgetter(0)), itemgetter(0))]
[('blue', [5, 1]), ('red', [2, 2]), ('yellow', [3])]
>>> [(el[0], set([x[1] for x in el[1]]))
...  for el in groupby(sorted(data, key=itemgetter(0)), itemgetter(0))]
[('blue', set([1, 5])), ('red', set([2])), ('yellow', set([3]))]

But this way seems to be more efficient (here key = itemgetter(0) and
agrcol = itemgetter(1), and each loop is meant to live inside a
function, hence the return):

from collections import defaultdict

dd = defaultdict(int)
for el in data:
    dd[key(el)] += agrcol(el)
return dd.items()
# [('blue', 6), ('yellow', 3), ('red', 4)]

dd = defaultdict(list)
for el in data:
    dd[key(el)].append(agrcol(el))
return dd.items()
# [('blue', [5, 1]), ('yellow', [3]), ('red', [2, 2])]

dd = defaultdict(set)
for el in data:
    dd[key(el)].add(agrcol(el))
return dd.items()
# [('blue', set([1, 5])), ('yellow', set([3])), ('red', set([2]))]


The data often contains objects with attributes instead of tuples, and
I expect the new namedtuple datatype to also show up as the element
type of the lists being processed (a sketch of that case follows).
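
For instance, a minimal sketch of the same pattern over namedtuples,
reusing the data list from above (Row is a hypothetical record type):

from collections import defaultdict, namedtuple

Row = namedtuple('Row', ['color', 'n', 'info'])
rows = [Row(*t) for t in data]

dd = defaultdict(int)
for r in rows:
    dd[r.color] += r.n    # attribute access instead of indexing
print dd.items()          # e.g. [('blue', 6), ('yellow', 3), ('red', 4)]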

But I haven't found a nice, generalized way to express that pattern
(aggregating from a list of one datatype to a list of key plus
aggregate pairs) that would make it practical and suitable for
inclusion in the standard library.
 

Raymond Hettinger

[prueba]
The data often contains objects with attributes instead of tuples, and
I expect the new namedtuple datatype to also show up as the element
type of the lists being processed.

But I haven't found a nice, generalized way to express that pattern
(aggregating from a list of one datatype to a list of key plus
aggregate pairs) that would make it practical and suitable for
inclusion in the standard library.

Looks like you've searched the possibilities thoroughly and no single
aggregation function seems to meet all needs. That's usually a cue
not to try to build one, and instead to let simple Python loops do the
work for you (that also saves the awkward itemgetter() calls in your
examples). To my eyes, all three examples look like straightforward,
easy-to-write, easy-to-read, fast plain Python:
>>> d = defaultdict(int)
>>> for color, n, info in data:
...     d[color] += n
>>> d.items()
[('blue', 6), ('yellow', 3), ('red', 4)]

>>> d = defaultdict(list)
>>> for color, n, info in data:
...     d[color].append(n)
>>> d.items()
[('blue', [5, 1]), ('yellow', [3]), ('red', [2, 2])]

>>> d = defaultdict(set)
>>> for color, n, info in data:
...     d[color].add(n)
>>> d.items()
[('blue', set([1, 5])), ('yellow', set([3])), ('red', set([2]))]


I don't think you can readily combine all three examples into a single
aggregator without the obfuscation and awkwardness that comes from
parameterizing all of the varying parts:

def aggregator(default_factory, adder, iterable, keyfunc, valuefunc):
    d = defaultdict(default_factory)
    for record in iterable:
        key = keyfunc(record)
        value = valuefunc(record)
        adder(d[key], value)
    return d.items()

>>> aggregator(list, list.append, data, itemgetter(0), itemgetter(1))
[('blue', [5, 1]), ('yellow', [3]), ('red', [2, 2])]
>>> aggregator(set, set.add, data, itemgetter(0), itemgetter(1))
[('blue', set([1, 5])), ('yellow', set([3])), ('red', set([2]))]


Yuck! Plain Python wins.


Raymond


P.S. The aggregator doesn't work so well for the summing case:

>>> aggregator(int, operator.iadd, data, itemgetter(0), itemgetter(1))
[('blue', 0), ('yellow', 0), ('red', 0)]

The problem is that operator.iadd() doesn't have a way to both
retrieve a value and store it back into the dictionary.
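
A variant that stores the adder's return value back into the
dictionary would cover the immutable case; a minimal sketch
(aggregator2 is a hypothetical name, and the adder must now return
the new accumulator):

from collections import defaultdict
from operator import add, itemgetter

def aggregator2(default_factory, adder, iterable, keyfunc, valuefunc):
    d = defaultdict(default_factory)
    for record in iterable:
        # store the result back, so plain operator.add works on ints
        key = keyfunc(record)
        d[key] = adder(d[key], valuefunc(record))
    return d.items()

>>> aggregator2(int, add, data, itemgetter(0), itemgetter(1))
[('blue', 6), ('yellow', 3), ('red', 4)]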
 

pruebauno

[prueba]
The data often contains objects with attributes instead of tuples, and
I expect the new namedtuple datatype to also show up as the element
type of the lists being processed.
But I haven't found a nice, generalized way to express that pattern
(aggregating from a list of one datatype to a list of key plus
aggregate pairs) that would make it practical and suitable for
inclusion in the standard library.

Looks like you've searched the possibilities thoroughly and no single
aggregation function seems to meet all needs. That's usually a cue
not to try to build one, and instead to let simple Python loops do the
work for you (that also saves the awkward itemgetter() calls in your
examples). To my eyes, all three examples look like straightforward,
easy-to-write, easy-to-read, fast plain Python:
>>> d = defaultdict(int)
>>> for color, n, info in data:
...     d[color] += n
>>> d.items()
[('blue', 6), ('yellow', 3), ('red', 4)]

>>> d = defaultdict(list)
>>> for color, n, info in data:
...     d[color].append(n)
>>> d.items()
[('blue', [5, 1]), ('yellow', [3]), ('red', [2, 2])]

>>> d = defaultdict(set)
>>> for color, n, info in data:
...     d[color].add(n)
>>> d.items()
[('blue', set([1, 5])), ('yellow', set([3])), ('red', set([2]))]

I don't think you can readily combine all three examples into a single
aggregator without the obfuscation and awkwardness that comes from
parameterizing all of the varying parts:

def aggregator(default_factory, adder, iterable, keyfunc, valuefunc):
    d = defaultdict(default_factory)
    for record in iterable:
        key = keyfunc(record)
        value = valuefunc(record)
        adder(d[key], value)
    return d.items()

>>> aggregator(list, list.append, data, itemgetter(0), itemgetter(1))
[('blue', [5, 1]), ('yellow', [3]), ('red', [2, 2])]
>>> aggregator(set, set.add, data, itemgetter(0), itemgetter(1))
[('blue', set([1, 5])), ('yellow', set([3])), ('red', set([2]))]

Yuck!  Plain Python wins.

Raymond

P.S. The aggregator doesn't work so well for the summing case:

>>> aggregator(int, operator.iadd, data, itemgetter(0), itemgetter(1))
[('blue', 0), ('yellow', 0), ('red', 0)]

The problem is that operator.iadd() doesn't have a way to both
retrieve a value and store it back into the dictionary.

Yes, thinking about this more, one probably needs two code paths,
depending on whether the type returned by default_factory is mutable
or immutable. But you are probably right that the ratio of redundancy
to variability is pretty low for such a function, and the plainly
written-out for loop is not too painful. The only redundancy is the
creation and manipulation of the dictionary and the explicit looping.
 

pruebauno

The existing groupby() itertool works great when every element in a
group has the same key, but it is not so handy when groups are
determined by boundary conditions.

For edge-triggered events, we need to convert a boundary-event
predicate to a groupby-style key function.  The code below encapsulates
that process in a new itertool called split_on().

Would love you guys to experiment with it for a bit and confirm that
you find it useful.  Suggestions are welcome.

Raymond

-----------------------------------------

from itertools import groupby

def split_on(iterable, event, start=True):
    'Split iterable on event boundaries (either start events or stop events).'
    # split_on('X1X23X456X', 'X'.__eq__, True)  --> X1 X23 X456 X
    # split_on('X1X23X456X', 'X'.__eq__, False) --> X 1X 23X 456X
    def transition_counter(x, start=start, cnt=[0]):
        before = cnt[0]
        if event(x):
            cnt[0] += 1
        after = cnt[0]
        return after if start else before
    return (g for k, g in groupby(iterable, transition_counter))

if __name__ == '__main__':
    for start in True, False:
        for g in split_on('X1X23X456X', 'X'.__eq__, start):
            print list(g)
        print

    from pprint import pprint
    boundary = '--===============2615450625767277916==\n'
    email = open('email.txt')
    for mime_section in split_on(email, boundary.__eq__):
        pprint(list(mime_section))
        print '= = ' * 30

For me your examples don't justify why you would need such a general
algorithm. A split function that works on iterables instead of just
strings seems straightforward, so maybe we should have that, and add
another function only when there are examples of problems where a
plain split does not work. Something like this should work for the
two examples you gave, where the boundaries are known constants (and
therefore there is really no need to keep them; I can always add them
back later):

def split_on(iterable, boundary):
    l = []
    for el in iterable:
        if el != boundary:
            l.append(el)
        else:
            yield l
            l = []
    yield l

def join_on(iterable, boundary):
    it = iter(iterable)
    firstel = it.next()
    for x in firstel:        # don't drop the first group's contents
        yield x
    for el in it:
        yield boundary
        for x in el:
            yield x

if __name__ == '__main__':
    lst = []
    for g in split_on('X1X23X456X', 'X'):
        print list(g)
        lst.append(g)
    print
    print list(join_on(lst, 'X'))
 

Aahz

For edge-triggered events, we need to convert a boundary-event
predicate to a groupby-style key function. The code below encapsulates
that process in a new itertool called split_on().

Would love you guys to experiment with it for a bit and confirm that
you find it useful. Suggestions are welcome.

It seems a little esoteric for the standard library. It was non-trivial
for me to make it work with what seems to me an obvious use-case
(although perhaps I'm missing something), and I'm not sure it has enough
general utility otherwise:

from math import log10

class C:
    def __init__(self, data):
        self.data = data
        self.index = 0
        self.sentinel = None

    def __iter__(self):
        return self

    def next(self):
        if self.index >= len(self.data):
            raise StopIteration
        value = self.data[self.index]
        self.index += 1
        return value

    def make_sentinel(self, value):
        return (int(value / 10) + 1) * 10

    def grouper(self, value):
        if self.sentinel is None:
            self.sentinel = self.make_sentinel(value)
            return False
        if value >= self.sentinel:
            self.sentinel = self.make_sentinel(value)
            return True
        return False

if __name__ == '__main__':
    for start in True, False:
        L_iter = C([11, 12, 13, 20, 32])
        for g in split_on(L_iter, L_iter.grouper, start):
            print list(g)
        print

[11, 12, 13]
[20]
[32]

[11, 12, 13, 20]
[32]
--
Aahz ([email protected]) <*> http://www.pythoncraft.com/

"Programming language design is not a rational science. Most reasoning
about it is at best rationalization of gut feelings, and at worst plain
wrong." --GvR, python-ideas, 2009-3-1
 

George Sakkis

The existing groupby() itertool works great when every element in a
group has the same key, but it is not so handy when groups are
determined by boundary conditions.

For edge-triggered events, we need to convert a boundary-event
predicate to a groupby-style key function.  The code below encapsulates
that process in a new itertool called split_on().

Would love you guys to experiment with it for a bit and confirm that
you find it useful.  Suggestions are welcome.

That's pretty close to a recipe [1] I had posted some time ago (and
you commented on improving it :)). As they stand, neither is a
generalization of the other; your version allows either start or stop
events (but not both), while mine requires the start event but takes
an optional stop event (plus an option to control whether separators
are returned). If we were to generalize them, the resulting signature
would be something like:

def split_on(iterable, **kwds):
    '''Split iterable on event boundaries.

    @keyword start, stop: The start and stop boundary predicates.
        Both are optional (but at least one of them must be given).
    @keyword yield_bounds: If True, also yield the boundary events.
    '''
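
A rough sketch of one way the generalized semantics could work; this
is only my guess at the behavior, not the recipe's actual code:

def split_on(iterable, start=None, stop=None, yield_bounds=False):
    # sketch only: start/stop are predicates; at least one must be given
    assert start is not None or stop is not None
    group = []
    for item in iterable:
        if start is not None and start(item):
            if group:
                yield group          # a start event closes the previous group
            group = [item] if yield_bounds else []
        elif stop is not None and stop(item):
            if yield_bounds:
                group.append(item)
            yield group              # a stop event closes the current group
            group = []
        else:
            group.append(item)
    if group:
        yield group

With start='X'.__eq__ and yield_bounds=True this reproduces the
X1 X23 X456 X grouping from Raymond's comment above.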


On a related note, I recently needed a grouper that I couldn't build
with either groupby() or the split_on() above. The reason is that,
instead of one, it needs two consecutive elements to decide whether to
make a split or not. An example would be to partition an iterable of
numbers (or any orderable objects for that matter) into increasing or
non-decreasing groups:
>>> from operator import gt, ge
>>> list(groupby_needsbettername([3, 4, 4, 2, 2, 5, 1], gt))
[[3, 4, 4], [2, 2, 5], [1]]
>>> list(groupby_needsbettername([3, 4, 4, 2, 2, 5, 1], ge))
[[3, 4], [4], [2], [2, 5], [1]]

def groupby_needsbettername(iterable, is_boundary):
    it = iter(iterable)
    try:
        cur = it.next()
    except StopIteration:
        return
    group = [cur]
    for next in it:
        if is_boundary(cur, next):
            yield group
            group = []
        group.append(next)
        cur = next
    yield group


George

[1] http://code.activestate.com/recipes/521877/
 
