Sequence splitting


Paul Rubin

Steven D'Aprano said:
groupby() works on lists.
>>> a = [1, 3, 4, 6, 7]
>>> from itertools import groupby
>>> b = groupby(a, lambda x: x%2==1)  # split into even and odd
>>> c = list(b)
>>> print len(c)
3
>>> d = list(c[1][1])  # should be [4, 6]
>>> print d  # oops.
[]

The difference between what I'm suggesting and what groupby() does is
that my suggestion would collate *all* the elements with the same key,
not just runs of them. This (as far as I can tell) requires returning
lists rather than iterators.

I guess that is reasonable.
The most important difference between my suggestion and that of the OP is
that he limited the key function to something which returns a truth
value, while I'm looking for something more general which can split the
input into an arbitrary number of collated sublists.

Also ok.
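A sketch of that distinction: groupby() only collects consecutive runs of equal keys, so collating *all* elements per key requires sorting by the same key first (and hence materialized lists):

```python
from itertools import groupby

a = [1, 3, 4, 6, 7]
key = lambda x: x % 2 == 1  # odd -> True, even -> False

# groupby alone sees three runs: odd, even, odd again
runs = [(k, list(g)) for k, g in groupby(a, key)]
# runs == [(True, [1, 3]), (False, [4, 6]), (True, [7])]

# sorting by the key first collates all elements per key
collated = [(k, list(g)) for k, g in groupby(sorted(a, key=key), key)]
# collated == [(False, [4, 6]), (True, [1, 3, 7])]
```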
 

Steven D'Aprano

Steven D'Aprano said:
groupby() works on lists.
>>> a = [1, 3, 4, 6, 7]
>>> from itertools import groupby
>>> b = groupby(a, lambda x: x%2==1)  # split into even and odd
>>> c = list(b)
>>> print len(c)
3
>>> d = list(c[1][1])  # should be [4, 6]
>>> print d  # oops.
[]

I didn't say it worked properly *wink*

Seriously, this behaviour caught me out too. The problem isn't that the
input data is a list, the same problem occurs for arbitrary iterators.
From the docs:

The operation of groupby() is similar to the uniq filter in Unix. It
generates a break or new group every time the value of the key function
changes (which is why it is usually necessary to have sorted the data
using the same key function). That behavior differs from SQL’s GROUP BY
which aggregates common elements regardless of their input order.

The returned group is itself an iterator that shares the underlying
iterable with groupby(). Because the source is shared, when the groupby()
object is advanced, the previous group is no longer visible. So, if that
data is needed later, it should be stored as a list.
[end quote]

http://www.python.org/doc/2.6/library/itertools.html#itertools.groupby
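In other words, each group must be copied to a list *before* the groupby() object is advanced, or it is lost. A quick sketch of the caveat:

```python
from itertools import groupby

data = [1, 1, 2, 2, 3]

# Wrong: listing the outer iterator first advances past every group,
# so the saved group iterators are already spent.
stale = [(k, g) for k, g in groupby(data)]
# list(stale[0][1]) == []

# Right: copy each group while it is still the current one.
fresh = [(k, list(g)) for k, g in groupby(data)]
# fresh == [(1, [1, 1]), (2, [2, 2]), (3, [3])]
```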
 

Paul Moore

2009/7/3 Brad said:
Perhaps true, but it would be a nice convenience (for me) as a built-
in written in either Python or C. Although the default case of a bool
function would surely be faster.

The chance of getting this accepted as a builtin is essentially zero.
To be a builtin, as opposed to being in the standard library,
something has to have a very strong justification.

This suggestion may find a home in the standard library, although it's
not entirely clear where (maybe itertools, although it's not entirely
obvious that it's a fit there).

You'd have to justify this against the argument "not every 2-3 line
function needs to be built in". Personally, I'm not sure it passes
that test - sure, it's a useful function, but it's not that hard to
write when you need it. It may be better as a recipe in the cookbook.
Or if it's close enough to the spirit of the itertools, it may be
suitable as a sample in the itertools documentation (the "recipes"
section).

Paul.
 

Pablo Torres N.

The chance of getting this accepted as a builtin is essentially zero.
To be a builtin, as opposed to being in the standard library,
something has to have a very strong justification.

That's right. Mr. schickb, I think what you need is a few concrete
examples as to where this function would be beneficial, so it can be
judged objectively.
 

Lie Ryan

tsangpo said:
Just a shorter implementation:

from itertools import groupby

def split(lst, func):
    gs = groupby(lst, func)
    return list(gs[True]), list(gs[False])

As you're replying to my post, I assume you meant a shorter
implementation of my function. But it doesn't do the same thing. The idea
with my group() is similar to what Steven D'Aprano is describing in
another branch of this thread (i.e. splitting not only based on True and
False, but arbitrary groupings, e.g. 'tru', 'flash' or perhaps -1, 0, 1).

For example:

>>> def group(seq, func):
...     ret = {}
...     for item in seq:
...         fitem = func(item)
...         try:
...             ret[fitem].append(item)
...         except KeyError:
...             ret[fitem] = [item]
...     return ret
...
>>> group(range(10), lambda x: x%3)
{0: [0, 3, 6, 9], 1: [1, 4, 7], 2: [2, 5, 8]}
>>> # the next one is the OP's split
>>> group(['true', '', [], [''], 'false'], bool)
{False: ['', []], True: ['true', [''], 'false']}
 

Brad

I've never needed such a split function, and I don't like the name, and
the functionality isn't general enough. I'd prefer something which splits
the input sequence into as many sublists as necessary, according to the
output of the key function.

That's not a bad idea, I'll have to experiment with the alternatives.
My thought process for this, however, was that filter itself already
splits the sequence and it would have been more useful had it not
thrown away "half" of what it discovers. It could have been written to
return two sequences with very little perf hit for all but very
large input sequences, and been useful in more situations. What I
*really* wanted was a way to make filter itself more useful, since it
seems a bit silly to have two very similar functions.

Maybe this would be difficult to get into the core, but how about this
idea: Rename the current filter function to something like "split" or
"partition" (which I agree may be a better name) and modify it to
return the desired true and false sequences. Then recreate the
existing "filter" function with a wrapper that throws away the false
sequence.

Here are two simplified re-creations of situations where I could have
used partition (aka split):

words = ['this', 'is', 'a', 'bunch', 'of', 'words']
short, long = partition(words, lambda w: len(w) < 3)

d = {1 : 'w', 2 : 'x' ,3 : 'y' ,4 : 'z'}
keys = [1, 3, 4, 9]
found, missing = partition(keys, d.has_key)


There are probably a dozen other approaches, but the existing "filter"
is fast, clear, and *almost* good enough. So when is this useful in
general: Whenever filter itself is useful, but you want to use both
sides of the partitioning work it already does.
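A minimal eager version of the partition Brad describes might look like this (a sketch; the name and the two-list return shape are just one choice):

```python
def partition(seq, pred):
    """Split seq into (true_items, false_items) in a single pass."""
    true_items, false_items = [], []
    for item in seq:
        (true_items if pred(item) else false_items).append(item)
    return true_items, false_items

words = ['this', 'is', 'a', 'bunch', 'of', 'words']
short, rest = partition(words, lambda w: len(w) < 3)
# short == ['is', 'a', 'of'], rest == ['this', 'bunch', 'words']
```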


-Brad
 

Paul Rubin

Brad said:
Maybe this would be difficult to get into the core, but how about this
idea: Rename the current filter function to something like "split" or
"partition" (which I agree may be a better name) and modify it to
return the desired true and false sequences. Then recreate the
existing "filter" function with a wrapper that throws away the false
sequence.

This isn't so attractive, since filter takes a sequence input but
returns a list. So if I have an iterator that produces a billion
elements of which I expect three to satisfy some predicate, then
xs = filter(func, seq)
as currently implemented will build a 3-element list and return it.
Under your suggestion, it would also build and throw away an (almost)
billion element list.
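The lazy alternative is essentially the `partition` recipe from the itertools documentation's recipes section, built on `tee()` (a sketch using Python 3 names; Python 2 spells it `ifilterfalse`):

```python
from itertools import tee, filterfalse

def partition(pred, iterable):
    """Return two lazy iterators: items failing pred, then items passing it."""
    t1, t2 = tee(iterable)
    return filterfalse(pred, t1), filter(pred, t2)

evens, odds = partition(lambda x: x % 2, range(10))
# list(evens) == [0, 2, 4, 6, 8]; list(odds) == [1, 3, 5, 7, 9]
```

Note that `tee()` still buffers pending items internally, so the three-in-a-billion case stays cheap only if both outputs are consumed roughly in step.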
 

Lie Ryan

Brad said:
That's not a bad idea, I'll have to experiment with the alternatives.
My thought process for this, however, was that filter itself already
splits the sequence and it would have been more useful had it not
thrown away "half" of what it discovers. It could have been written to
return two sequences with very little perf hit for all but very
large input sequences, and been useful in more situations. What I
*really* wanted was a way to make filter itself more useful, since it
seems a bit silly to have two very similar functions.

It's not that easy. filter is nearly by definition a lazy function. The
various split/group functions cannot be made into efficient iterators,
since they must traverse the whole input before being sure that they have
finished a group/part of the split (or even to be sure that they have all
the group keys).

Maybe this would be difficult to get into the core, but how about this
idea: Rename the current filter function to something like "split" or
"partition" (which I agree may be a better name) and modify it to
return the desired true and false sequences. Then recreate the
existing "filter" function with a wrapper that throws away the false
sequence.

filter has a very deep root in functional programming, I doubt it will
change any time soon. Also, consider infinite iterators. With the
"filter as wrapper to partition" scheme, it is impossible for
split/group/partition to work on an infinite iterator.
 

Mel

Steven said:
The most important difference between my suggestion and that of the OP is
that he limited the key function to something which returns a truth
value, while I'm looking for something more general which can split the
input into an arbitrary number of collated sublists.

Generally along the lines of

def scatter(seq, criterion):
    lists = {}
    for s in seq:
        lists.setdefault(criterion(s), []).append(s)
    return lists

modulo a defaultdict or some such?
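Spelled out with the defaultdict Mel mentions, the sketch becomes:

```python
from collections import defaultdict

def scatter(seq, criterion):
    """Collate seq into lists keyed by criterion(item)."""
    lists = defaultdict(list)
    for s in seq:
        lists[criterion(s)].append(s)
    return dict(lists)

scatter(range(10), lambda x: x % 3)
# {0: [0, 3, 6, 9], 1: [1, 4, 7], 2: [2, 5, 8]}
```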

Mel.
 

Terry Reedy

Brad said:
I tried posting on python-ideas and received a "You are not allowed to
post to this mailing list" reply. Perhaps because I am posting through
Google groups? Or maybe one must be an approved member to post?

Spammers post through Google groups.

Either subscribe or access via news.gmane.org as gmane.comp.python.ideas.

As to your main question: this was discussed some years ago. There did not
seem to be enough use cases for 'bifilter' or 'distributor' to warrant
inclusion in the stdlib. It is easy enough to do on one's own for those few
cases. Just filter twice.

Usually, one wants to do one thing or another to each item with the
original scan

for item in collection:
    if pred(item):
        proc_true(item)
    else:
        proc_false(item)

Having proc_true and proc_false both be append(item) is a special case.

Note that filter has been changed from returning a list to returning an
iterator. So a proposal for a function to return two lists would seem
backwards. A function that returns two iterators would be possible.
Something like itertools.tee except that each item would go on one tee
or the other instead of both. It would have two queues (deques) instead
of one. And it should be lazy -- items should only be pulled from the
original iterator when an item is requested from an empty queue.

I am not sure it is worth it. In the worst case, all items are, say,
'True', you request 'False' items, and all items get copied to the
True queue before the False iterator raises StopIteration.
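A sketch of the two-queue splitter Terry describes (the name `split_tee` is hypothetical): each output buffers items destined for the other side, and the source is pulled only when a queue runs dry.

```python
from collections import deque

def split_tee(iterable, pred):
    """Lazily split iterable into (false_iter, true_iter) by pred."""
    source = iter(iterable)
    queues = {False: deque(), True: deque()}

    def side(want):
        q = queues[want]
        while True:
            if q:
                # Items already buffered for this side.
                yield q.popleft()
            else:
                # Pull from the source only when our queue is empty.
                try:
                    item = next(source)
                except StopIteration:
                    return
                if bool(pred(item)) == want:
                    yield item
                else:
                    queues[not want].append(item)

    return side(False), side(True)

evens, odds = split_tee(range(6), lambda x: x % 2)
# Consuming odds first buffers [0, 2, 4] on the even-side queue,
# which is Terry's worst-case memory concern.
```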

The way to avoid calling keyfunc(item) twice on each item is to make a
new list first: newlist = list(map(keyfunc,iterable)). Then scan newlist
twice.

If you write such a function, first in Python, then C, you can register
code on PyPI and see how much interest it gets.

Terry Jan Reedy
 

MRAB

Mel said:
Steven said:
The most important difference between my suggestion and that of the OP is
that he limited the key function to something which returns a truth
value, while I'm looking for something more general which can split the
input into an arbitrary number of collated sublists.

Generally along the lines of

def scatter(seq, criterion):
    lists = {}
    for s in seq:
        lists.setdefault(criterion(s), []).append(s)
    return lists

modulo a defaultdict or some such?
Somehow this reminds me of the new Counter class, except that it's
making lists instead of counts.
 
