Candidate for a new itertool

  • Thread starter Raymond Hettinger

Raymond Hettinger

The existing groupby() itertool works great when every element in a
group has the same key, but it is not so handy when groups are
determined by boundary conditions.

For edge-triggered events, we need to convert a boundary-event
predicate to a groupby-style key function. The code below encapsulates
that process in a new itertool called split_on().

Would love you guys to experiment with it for a bit and confirm that
you find it useful. Suggestions are welcome.



Raymond

-----------------------------------------

from itertools import groupby

def split_on(iterable, event, start=True):
    'Split iterable on event boundaries (either start events or stop events).'
    # split_on('X1X23X456X', 'X'.__eq__, True)  --> X1 X23 X456 X
    # split_on('X1X23X456X', 'X'.__eq__, False) --> X 1X 23X 456X
    def transition_counter(x, start=start, cnt=[0]):
        before = cnt[0]
        if event(x):
            cnt[0] += 1
        after = cnt[0]
        return after if start else before
    return (g for k, g in groupby(iterable, transition_counter))

if __name__ == '__main__':
    for start in True, False:
        for g in split_on('X1X23X456X', 'X'.__eq__, start):
            print list(g)
        print

    from pprint import pprint
    boundary = '--===============2615450625767277916==\n'
    email = open('email.txt')
    for mime_section in split_on(email, boundary.__eq__):
        pprint(list(mime_section))
        print '= = ' * 30
 

bearophileHUGS

Raymond Hettinger, maybe it would be useful to add an optional flag
argument telling split_on whether to keep the separators or not? This
is the xsplit I usually use:


def xsplit(seq, key=bool, keepkeys=True):
    """xsplit(seq, key=bool, keepkeys=True): given an iterable seq and a
    predicate key, splits the iterable where key(item) is True and yields
    the parts as lists. If keepkeys is True then the splitting items are
    kept at the beginning of the sublists (but the first sublist may miss
    the key item).

    >>> list(xsplit([]))
    []
    >>> key = lambda x: 0x80 & x
    >>> l = [1,2,3,0xF0,4,5,6,0xF1,7,8,0xF2,9,10,11,12,13]
    >>> list(xsplit(l, key=key))
    [[1, 2, 3], [240, 4, 5, 6], [241, 7, 8], [242, 9, 10, 11, 12, 13]]
    >>> l = [0xF0,1,2,3,0xF0,4,5,6,0xF1,7,8,0xF2,9,10,11,12,13,0xF0,14,0xF1]
    >>> list(xsplit(l, key=key, keepkeys=False))
    [[1, 2, 3], [4, 5, 6], [7, 8], [9, 10, 11, 12, 13], [14]]
    >>> s1 = "100001000101100001000000010000"
    >>> ["".join(map(str, g)) for g in xsplit(s1, key=int)]
    ['10000', '1000', '10', '1', '10000', '10000000', '10000']
    >>> from itertools import groupby  # to compare against groupby
    >>> s2 = "1111100011111100011100101011111"
    >>> ["".join(map(str, g)) for h, g in groupby(s2, key=int)]
    ['11111', '000', '111111', '000', '111', '00', '1', '0', '1', '0', '11111']
    """
    group = []
    for el in seq:
        if key(el):
            if group:
                yield group
                group = []
            if keepkeys:
                group.append(el)
        else:
            group.append(el)
    if group:
        yield group


Maybe it's better to separate or denote the separators in some way?

A possibility:
"X1X23X456X" => "X", "1", "X", "23", "X", "456", "X"

Another possibility:
"X1X23X456X" => ("", "X"), ("1", "X"), (["2", "3"], "X"), (["4", "5", "6"], "X")

Another possibility (True == is a separator):
"X1X23X456X" => (True, "X"), (False, ["1"]), (True, "X"), (False, ["2", "3"]),
                (True, "X"), (False, ["4", "5", "6"]), (True, "X")

Is it useful to merge successive separators (notice the two Xs)?

"X1X23XX456X" => (True, ["X"]), (False, ["1"]), (True, ["X"]), (False, ["2", "3"]),
                 (True, ["X", "X"]), (False, ["4", "5", "6"]), (True, ["X"])

Oops, this is groupby :)
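
In fact that last merged form falls straight out of groupby(); a
minimal sketch (flagged_split is a name I just made up):

from itertools import groupby

def flagged_split(iterable, key):
    # yield (is_separator, items) pairs; consecutive separators stay
    # merged, exactly like the "X1X23XX456X" example above
    for is_sep, group in groupby(iterable, key):
        yield bool(is_sep), list(group)

list(flagged_split("X1X23XX456X", "X".__eq__)) reproduces the last
example above.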

Is a name like isplitter or splitter better for this itertool?

Bye,
bearophile
 

jay logan

The existing groupby() itertool works great when every element in a
group has the same key, but it is not so handy when groups are
determined by boundary conditions.

For edge-triggered events, we need to convert a boundary-event
predicate to a groupby-style key function.  The code below encapsulates
that process in a new itertool called split_on().

Would love you guys to experiment with it for a bit and confirm that
you find it useful.  Suggestions are welcome.

Raymond

-----------------------------------------

from itertools import groupby

def split_on(iterable, event, start=True):
    'Split iterable on event boundaries (either start events or stop events).'
    # split_on('X1X23X456X', 'X'.__eq__, True)  --> X1 X23 X456 X
    # split_on('X1X23X456X', 'X'.__eq__, False) --> X 1X 23X 456X
    def transition_counter(x, start=start, cnt=[0]):
        before = cnt[0]
        if event(x):
            cnt[0] += 1
        after = cnt[0]
        return after if start else before
    return (g for k, g in groupby(iterable, transition_counter))

if __name__ == '__main__':
    for start in True, False:
        for g in split_on('X1X23X456X', 'X'.__eq__, start):
            print list(g)
        print

    from pprint import pprint
    boundary = '--===============2615450625767277916==\n'
    email = open('email.txt')
    for mime_section in split_on(email, boundary.__eq__):
        pprint(list(mime_section))
        print '= = ' * 30

I've found this type of splitting quite useful when grouping sections
of a text file. I used the groupby function directly on the file, when
I would rather have used something like this.

However, I wonder if it would be helpful to break that function into
two instead of having the "start" flag (see the sketch after this
paragraph). The flag feels odd to me (maybe it's the name?), and the
documentation might read better for a newcomer. Also, it would be
cool if the function took keyword arguments; I wonder why most of the
other functions in the itertools module don't take keywords.
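
One way to read that two-function suggestion, assuming the split_on()
above is in scope (split_before and split_after are names I just made
up):

def split_before(iterable, event):
    # each group begins with an item for which event(item) is true
    return split_on(iterable, event, start=True)

def split_after(iterable, event):
    # each group ends with an item for which event(item) is true
    return split_on(iterable, event, start=False)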

I wouldn't split out the keys separately from the groups. But the idea
of a flag to exclude the keys sounds interesting to me.

Thank you for giving me the opportunity to use the nonlocal keyword
for the first time since trying out Python 3.0. I hope this is an
appropriate usage:


import itertools as it

def split_on(iterable, key=bool, start=True):
    'Split iterable on boundaries (either start events or stop events).'
    # split_on('X1X23X456X', 'X'.__eq__, True)  --> X1 X23 X456 X
    # split_on('X1X23X456X', 'X'.__eq__, False) --> X 1X 23X 456X
    flag = 0

    def event_marker(x, start_flag=start):
        nonlocal flag
        before = flag
        if key(x):
            flag += 1
        after = flag
        return after if start_flag else before

    return (g for k, g in it.groupby(iterable, key=event_marker))
 

Raymond Hettinger

Raymond Hettinger, maybe it would be useful to add an optional flag
argument telling split_on whether to keep the separators or not? This
is the xsplit I usually use:

In your experiences with xsplit(), do most use cases involve removing
the separators?

My thought is that if you know the separator is at the beginning, it
is easy to strip it off with a next() call on the subgroup iterator
(see the sketch below). If so, then there's no reason to complicate
the API with another set of options.
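
For example, a minimal sketch of that stripping step, assuming the
split_on() posted above (with start=True, each group leads with its
separator):

for group in split_on('X1X23X456X', 'X'.__eq__):
    next(group, None)    # discard the leading 'X' separator
    print list(group)    # --> ['1'], ['2', '3'], ['4', '5', '6'], []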

Also, separation events do not necessarily tie themselves to specific
values that can be thrown away. Perhaps the event detector is counting
the number of lines written on a page and generates a new-page event
every time a page is full.

Maybe it's better to separate or denote the separators in some way?

Probably not. Itertools work together best when a stream of tokens
all have the same meaning. Otherwise, you end up complicating the
consumer code with something like:

for val in split(...):
    if val == separator:
        do_onething(val)
    else:
        do_another(val)


Is it useful to merge successive separators (notice the two Xs)?

Probably depends on real-world use cases. Do you have any?


Is a name like isplitter or splitter better for this itertool?

Like groupbyer, filterer, teer, starmapper, and chainer ;-)



Raymond
 

bearophileHUGS

Raymond Hettinger:
In your experiences with xsplit(), do most use cases involve removing the separators?

Unfortunately I am not able to tell you how often I remove them. But
with strings I usually want the separators removed, just as str.split()
does:

>>> 'aXcdXfg'.split('X')
['a', 'cd', 'fg']

So sometimes I want to do the same with lists too. Maybe I am also a
bit spoiled by the D (V.1) language, where strings are dynamic arrays,
so the same templated functions work on both arrays and strings, and a
single xsplit() covers both.


Probably depends on real-world use cases. Do you have any?

I don't think so.

Bye and thank you for your work that keeps shrinking my bag of tricks
(about 100+ functions to go),
bearophile
 

pruebauno

The existing groupby() itertool works great when every element in a
group has the same key, but it is not so handy when groups are
determined by boundary conditions.

For edge-triggered events, we need to convert a boundary-event
predicate to a groupby-style key function.  The code below encapsulates
that process in a new itertool called split_on().

Would love you guys to experiment with it for a bit and confirm that
you find it useful.  Suggestions are welcome.

Raymond

-----------------------------------------

from itertools import groupby

def split_on(iterable, event, start=True):
    'Split iterable on event boundaries (either start events or stop events).'
    # split_on('X1X23X456X', 'X'.__eq__, True)  --> X1 X23 X456 X
    # split_on('X1X23X456X', 'X'.__eq__, False) --> X 1X 23X 456X
    def transition_counter(x, start=start, cnt=[0]):
        before = cnt[0]
        if event(x):
            cnt[0] += 1
        after = cnt[0]
        return after if start else before
    return (g for k, g in groupby(iterable, transition_counter))

if __name__ == '__main__':
    for start in True, False:
        for g in split_on('X1X23X456X', 'X'.__eq__, start):
            print list(g)
        print

    from pprint import pprint
    boundary = '--===============2615450625767277916==\n'
    email = open('email.txt')
    for mime_section in split_on(email, boundary.__eq__):
        pprint(list(mime_section))
        print '= = ' * 30

Sorry to hijack the thread, but I know that you have a knack for
finding good iterator patterns. I have noticed a pattern lately:
aggregation using a defaultdict. I quickly found two examples of
problems that could use this:

http://groups.google.com/group/comp.lang.python/browse_frm/thread/c8b3976ec3ceadfd

http://www.willmcgugan.com/blog/tech/2009/1/17/python-coder-test/

To show an example, using data like this:
data = [('red', 2, 'other data'), ('blue', 5, 'more data'),
        ('yellow', 3, 'lots of things'), ('blue', 1, 'data'),
        ('red', 2, 'random data')]
Then we can use groupby to do this:

>>> from itertools import groupby
>>> from operator import itemgetter
>>> [(el[0], sum(x[1] for x in el[1]))
...  for el in groupby(sorted(data, key=itemgetter(0)), itemgetter(0))]
[('blue', 6), ('red', 4), ('yellow', 3)]
>>> [(el[0], [x[1] for x in el[1]])
...  for el in groupby(sorted(data, key=itemgetter(0)), itemgetter(0))]
[('blue', [5, 1]), ('red', [2, 2]), ('yellow', [3])]
>>> [(el[0], set([x[1] for x in el[1]]))
...  for el in groupby(sorted(data, key=itemgetter(0)), itemgetter(0))]
[('blue', set([1, 5])), ('red', set([2])), ('yellow', set([3]))]

But this way seems to be more efficient (here key = itemgetter(0) and
agrcol = itemgetter(1), and each loop is meant to live inside a
function, hence the return):

from collections import defaultdict

dd = defaultdict(int)
for el in data:
    dd[key(el)] += agrcol(el)
return dd.items()
# [('blue', 6), ('yellow', 3), ('red', 4)]

dd = defaultdict(list)
for el in data:
    dd[key(el)].append(agrcol(el))
return dd.items()
# [('blue', [5, 1]), ('yellow', [3]), ('red', [2, 2])]

dd = defaultdict(set)
for el in data:
    dd[key(el)].add(agrcol(el))
return dd.items()
# [('blue', set([1, 5])), ('yellow', set([3])), ('red', set([2]))]


The data often contains objects with attributes instead of tuples, and
I expect the new namedtuple datatype to also show up as the element
type of the lists being processed (a sketch of that case follows).
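
For instance, a minimal sketch of the same pattern over namedtuples,
reusing the data list from above (Row is a hypothetical record type):

from collections import defaultdict, namedtuple

Row = namedtuple('Row', ['color', 'n', 'info'])
rows = [Row(*t) for t in data]

dd = defaultdict(int)
for r in rows:
    dd[r.color] += r.n    # attribute access instead of indexing
print dd.items()          # e.g. [('blue', 6), ('yellow', 3), ('red', 4)]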

But I haven't found a nice, generalized way to express that pattern
(aggregating from a list of one datatype to a list of key plus
aggregate pairs) that would make it practical and suitable for
inclusion in the standard library.
 

Raymond Hettinger

[prueba]
The data often contains objects with attributes instead of tuples, and
I expect the new namedtuple datatype to also show up as the element
type of the lists being processed.

But I haven't found a nice, generalized way to express that pattern
(aggregating from a list of one datatype to a list of key plus
aggregate pairs) that would make it practical and suitable for
inclusion in the standard library.

Looks like you've searched the possibilities thoroughly and no single
aggregation function seems to meet all needs. That's usually a cue
not to try to build one, and instead to let simple Python loops do the
work for you (that also saves the awkward itemgetter() calls in your
examples). To my eyes, all three examples look like straightforward,
easy-to-write, easy-to-read, fast plain Python:
>>> d = defaultdict(int)
>>> for color, n, info in data:
...     d[color] += n
>>> d.items()
[('blue', 6), ('yellow', 3), ('red', 4)]

>>> d = defaultdict(list)
>>> for color, n, info in data:
...     d[color].append(n)
>>> d.items()
[('blue', [5, 1]), ('yellow', [3]), ('red', [2, 2])]

>>> d = defaultdict(set)
>>> for color, n, info in data:
...     d[color].add(n)
>>> d.items()
[('blue', set([1, 5])), ('yellow', set([3])), ('red', set([2]))]


I don't think you can readily combine all three examples into a single
aggregator without the obfuscation and awkwardness that comes from
parameterizing all of the varying parts:

def aggregator(default_factory, adder, iterable, keyfunc, valuefunc):
    d = defaultdict(default_factory)
    for record in iterable:
        key = keyfunc(record)
        value = valuefunc(record)
        adder(d[key], value)
    return d.items()

>>> aggregator(list, list.append, data, itemgetter(0), itemgetter(1))
[('blue', [5, 1]), ('yellow', [3]), ('red', [2, 2])]
>>> aggregator(set, set.add, data, itemgetter(0), itemgetter(1))
[('blue', set([1, 5])), ('yellow', set([3])), ('red', set([2]))]


Yuck! Plain Python wins.


Raymond


P.S. The aggregator doesn't work so well for the summing case:

>>> aggregator(int, operator.iadd, data, itemgetter(0), itemgetter(1))
[('blue', 0), ('yellow', 0), ('red', 0)]

The problem is that operator.iadd() doesn't have a way to both
retrieve a value and store it back into the dictionary.
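
A variant that stores the adder's return value back into the
dictionary would cover the immutable case; a minimal sketch
(aggregator2 is a hypothetical name, and the adder must now return
the new accumulator):

from collections import defaultdict
from operator import add, itemgetter

def aggregator2(default_factory, adder, iterable, keyfunc, valuefunc):
    d = defaultdict(default_factory)
    for record in iterable:
        # store the result back, so plain operator.add works on ints
        key = keyfunc(record)
        d[key] = adder(d[key], valuefunc(record))
    return d.items()

>>> aggregator2(int, add, data, itemgetter(0), itemgetter(1))
[('blue', 6), ('yellow', 3), ('red', 4)]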
 

pruebauno

[prueba]
The data often contains objects with attributes instead of tuples, and
I expect the new namedtuple datatype to also show up as the element
type of the lists being processed.
But I haven't found a nice, generalized way to express that pattern
(aggregating from a list of one datatype to a list of key plus
aggregate pairs) that would make it practical and suitable for
inclusion in the standard library.

Looks like you've searched the possibilities thoroughly and no single
aggregation function seems to meet all needs. That's usually a cue
not to try to build one, and instead to let simple Python loops do the
work for you (that also saves the awkward itemgetter() calls in your
examples). To my eyes, all three examples look like straightforward,
easy-to-write, easy-to-read, fast plain Python:
>>> d = defaultdict(int)
>>> for color, n, info in data:
...     d[color] += n
>>> d.items()
[('blue', 6), ('yellow', 3), ('red', 4)]

>>> d = defaultdict(list)
>>> for color, n, info in data:
...     d[color].append(n)
>>> d.items()
[('blue', [5, 1]), ('yellow', [3]), ('red', [2, 2])]

>>> d = defaultdict(set)
>>> for color, n, info in data:
...     d[color].add(n)
>>> d.items()
[('blue', set([1, 5])), ('yellow', set([3])), ('red', set([2]))]

I don't think you can readily combine all three examples into a single
aggregator without the obfuscation and awkwardness that comes from
parameterizing all of the varying parts:

def aggregator(default_factory, adder, iterable, keyfunc, valuefunc):
    d = defaultdict(default_factory)
    for record in iterable:
        key = keyfunc(record)
        value = valuefunc(record)
        adder(d[key], value)
    return d.items()

>>> aggregator(list, list.append, data, itemgetter(0), itemgetter(1))
[('blue', [5, 1]), ('yellow', [3]), ('red', [2, 2])]
>>> aggregator(set, set.add, data, itemgetter(0), itemgetter(1))
[('blue', set([1, 5])), ('yellow', set([3])), ('red', set([2]))]

Yuck!  Plain Python wins.

Raymond

P.S. The aggregator doesn't work so well for the summing case:

>>> aggregator(int, operator.iadd, data, itemgetter(0), itemgetter(1))
[('blue', 0), ('yellow', 0), ('red', 0)]

The problem is that operator.iadd() doesn't have a way to both
retrieve a value and store it back into the dictionary.

Yes, thinking about this more, one probably needs two code paths,
depending on whether the type returned by default_factory is mutable
or immutable. But you are probably right that the ratio of redundancy
to variability is pretty low for such a function, and the plainly
written-out for loop is not too painful. The only redundancy is the
creation and manipulation of the dictionary and the explicit looping.
 

pruebauno

The existing groupby() itertool works great when every element in a
group has the same key, but it is not so handy when groups are
determined by boundary conditions.

For edge-triggered events, we need to convert a boundary-event
predicate to a groupby-style key function.  The code below encapsulates
that process in a new itertool called split_on().

Would love you guys to experiment with it for a bit and confirm that
you find it useful.  Suggestions are welcome.

Raymond

-----------------------------------------

from itertools import groupby

def split_on(iterable, event, start=True):
    'Split iterable on event boundaries (either start events or stop events).'
    # split_on('X1X23X456X', 'X'.__eq__, True)  --> X1 X23 X456 X
    # split_on('X1X23X456X', 'X'.__eq__, False) --> X 1X 23X 456X
    def transition_counter(x, start=start, cnt=[0]):
        before = cnt[0]
        if event(x):
            cnt[0] += 1
        after = cnt[0]
        return after if start else before
    return (g for k, g in groupby(iterable, transition_counter))

if __name__ == '__main__':
    for start in True, False:
        for g in split_on('X1X23X456X', 'X'.__eq__, start):
            print list(g)
        print

    from pprint import pprint
    boundary = '--===============2615450625767277916==\n'
    email = open('email.txt')
    for mime_section in split_on(email, boundary.__eq__):
        pprint(list(mime_section))
        print '= = ' * 30

For me your examples don't justify why you would need such a general
algorithm. A split function that works on iterables instead of just
strings seems straightforward, so maybe we should have that, and add
another function only when there are examples of problems where a
plain split does not work. Something like this should work for the
two examples you gave, where the boundaries are known constants (and
therefore there is really no need to keep them; I can always add them
back later):

def split_on(iterable, boundary):
    l = []
    for el in iterable:
        if el != boundary:
            l.append(el)
        else:
            yield l
            l = []
    yield l

def join_on(iterable, boundary):
    it = iter(iterable)
    firstel = it.next()
    for x in firstel:        # don't drop the first group's contents
        yield x
    for el in it:
        yield boundary
        for x in el:
            yield x

if __name__ == '__main__':
    lst = []
    for g in split_on('X1X23X456X', 'X'):
        print list(g)
        lst.append(g)
    print
    print list(join_on(lst, 'X'))
 

Aahz

For edge-triggered events, we need to convert a boundary-event
predicate to a groupby-style key function. The code below encapsulates
that process in a new itertool called split_on().

Would love you guys to experiment with it for a bit and confirm that
you find it useful. Suggestions are welcome.

It seems a little esoteric for the standard library. It was non-trivial
for me to make it work with what seems to me an obvious use-case
(although perhaps I'm missing something), and I'm not sure it has enough
general utility otherwise:

from math import log10

class C:
    def __init__(self, data):
        self.data = data
        self.index = 0
        self.sentinel = None

    def __iter__(self):
        return self

    def next(self):
        if self.index >= len(self.data):
            raise StopIteration
        value = self.data[self.index]
        self.index += 1
        return value

    def make_sentinel(self, value):
        return (int(value / 10) + 1) * 10

    def grouper(self, value):
        if self.sentinel is None:
            self.sentinel = self.make_sentinel(value)
            return False
        if value >= self.sentinel:
            self.sentinel = self.make_sentinel(value)
            return True
        return False

if __name__ == '__main__':
    for start in True, False:
        L_iter = C([11, 12, 13, 20, 32])
        for g in split_on(L_iter, L_iter.grouper, start):
            print list(g)
        print

[11, 12, 13]
[20]
[32]

[11, 12, 13, 20]
[32]
--
Aahz ([email protected]) <*> http://www.pythoncraft.com/

"Programming language design is not a rational science. Most reasoning
about it is at best rationalization of gut feelings, and at worst plain
wrong." --GvR, python-ideas, 2009-3-1
 

George Sakkis

The existing groupby() itertool works great when every element in a
group has the same key, but it is not so handy when groups are
determined by boundary conditions.

For edge-triggered events, we need to convert a boundary-event
predicate to a groupby-style key function.  The code below encapsulates
that process in a new itertool called split_on().

Would love you guys to experiment with it for a bit and confirm that
you find it useful.  Suggestions are welcome.

That's pretty close to a recipe [1] I had posted some time ago (and
you commented on improving it :)). As they stand, neither is a
generalization of the other; your version allows either start or stop
events (but not both), while mine requires the start event but takes
an optional stop event (plus an option to control whether separators
are returned). If we were to generalize them, the resulting signature
would be something like:

def split_on(iterable, **kwds):
    '''Split iterable on event boundaries.

    @keyword start, stop: The start and stop boundary predicates.
        Both are optional (but at least one of them must be given).
    @keyword yield_bounds: If True, also yield the boundary events.
    '''
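
A rough sketch of one way the generalized semantics could work; this
is only my guess at the behavior, not the recipe's actual code:

def split_on(iterable, start=None, stop=None, yield_bounds=False):
    # sketch only: start/stop are predicates; at least one must be given
    assert start is not None or stop is not None
    group = []
    for item in iterable:
        if start is not None and start(item):
            if group:
                yield group          # a start event closes the previous group
            group = [item] if yield_bounds else []
        elif stop is not None and stop(item):
            if yield_bounds:
                group.append(item)
            yield group              # a stop event closes the current group
            group = []
        else:
            group.append(item)
    if group:
        yield group

With start='X'.__eq__ and yield_bounds=True this reproduces the
X1 X23 X456 X grouping from Raymond's comment above.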


On a related note, I recently needed a grouper that I couldn't build
with either groupby() or the split_on() above. The reason is that,
instead of one, it needs two consecutive elements to decide whether to
make a split or not. An example would be to partition an iterable of
numbers (or any orderable objects for that matter) into increasing or
non-decreasing groups:
>>> from operator import gt, ge
>>> list(groupby_needsbettername([3, 4, 4, 2, 2, 5, 1], gt))
[[3, 4, 4], [2, 2, 5], [1]]
>>> list(groupby_needsbettername([3, 4, 4, 2, 2, 5, 1], ge))
[[3, 4], [4], [2], [2, 5], [1]]

def groupby_needsbettername(iterable, is_boundary):
    it = iter(iterable)
    try:
        cur = it.next()
    except StopIteration:
        return
    group = [cur]
    for next in it:
        if is_boundary(cur, next):
            yield group
            group = []
        group.append(next)
        cur = next
    yield group


George

[1] http://code.activestate.com/recipes/521877/
 
