Sequence splitting

S

schickb

Jul 3, 2009

#1

I have fairly often found the need to split a sequence into two groups
based on a function result. Much like the existing filter function,
but returning a tuple of true, false sequences. In Python, something
like:

def split(seq, func=None):
if func is None:
func = bool
t, f = [], []
for item in seq:
if func(item):
t.append(item)
else:
f.append(item)
return (t, f)

The discussion linked to below has various approaches for doing this
now, but most traverse the sequence twice and many don't apply a
function to spit the sequence.
http://stackoverflow.com/questions/949098/python-split-a-list-based-on-a-condition

Is there any interest in a C implementation of this? Seems too trivial
to write a PEP, so I'm just trying to measure interest before diving
in. This wouldn't really belong in intertool. Would it be best
implemented as a top level built-in?

-Brad

P

Pablo Torres N.

Jul 3, 2009

#2

I have fairly often found the need to split a sequence into two groups
based on a function result. Much like the existing filter function,
but returning a tuple of true, false sequences. In Python, something
like:

def split(seq, func=None):
Â Â if func is None:
Â Â Â Â func = bool
Â Â t, f = [], []
Â Â for item in seq:
Â Â Â Â if func(item):
Â Â Â Â Â Â t.append(item)
Â Â Â Â else:
Â Â Â Â Â Â f.append(item)
Â Â return (t, f)

The discussion linked to below has various approaches for doing this
now, but most traverse the sequence twice and many don't apply a
function to spit the sequence.
http://stackoverflow.com/questions/949098/python-split-a-list-based-on-a-condition

Is there any interest in a C implementation of this? Seems too trivial
to write a PEP, so I'm just trying to measure interest before diving
in. This wouldn't really belong in intertool. Would it be best
implemented as a top level built-in?

-Brad

This sounds like it belongs to the python-ideas list. I suggest
posting there for better feedback, since the core developers check
that list more often than this one.

P

Paul Rubin

Jul 3, 2009

#3

schickb said:
def split(seq, func=None):
if func is None:
func = bool
t, f = [], []
for item in seq:
if func(item):
t.append(item)
else:
f.append(item)
return (t, f)

untested:

def split(seq, func=bool):
xs = zip(seq, itertools.imap(func, seq))
t = list(x for (x,y) in xs if y)
f = list(x for (x,y) in xs if not y)
return (t, f)

P

Pablo Torres N.

Jul 3, 2009

#4

I have fairly often found the need to split a sequence into two groups
based on a function result. Much like the existing filter function,
but returning a tuple of true, false sequences. In Python, something
like:

def split(seq, func=None):
if func is None:
func = bool
t, f = [], []
for item in seq:
if func(item):
t.append(item)
else:
f.append(item)
return (t, f)

The discussion linked to below has various approaches for doing this
now, but most traverse the sequence twice and many don't apply a
function to spit the sequence.http://stackoverflow.com/questions/949098/python-split-a-list-based-o...

Is there any interest in a C implementation of this? Seems too trivial
to write a PEP, so I'm just trying to measure interest before diving
in. This wouldn't really belong in intertool. Would it be best
implemented as a top level built-in?

-Brad

This sounds like it belongs to the python-ideas list. I suggest
posting there for better feedback, since the core developers check
that list more often than this one.

B

Brad

Jul 3, 2009

#5

schickb said:
schickb said:

def split(seq, func=None):
if func is None:
func = bool
t, f = [], []
for item in seq:
if func(item):
t.append(item)
else:
f.append(item)
return (t, f)

Click to expand...

untested:

def split(seq, func=bool):
xs = zip(seq, itertools.imap(func, seq))
t = list(x for (x,y) in xs if y)
f = list(x for (x,y) in xs if not y)
return (t, f)

In my testing that is 3.5x slower than the original solution (and less
clear imo). I fixed my version to take a bool default. Either way, I'm
not really looking for additional ways to do this in Python unless
I've totally missed something. What I am considering is writing it in
C, much like filter.

-Brad

B

Brad

Jul 3, 2009

#6

This sounds like it belongs to the python-ideas list. I suggest
posting there for better feedback, since the core developers check
that list more often than this one.

Thanks, I didn't know about that list.

P

Paul Rubin

Jul 3, 2009

#7

Brad said:
schickb said:

def split(seq, func=None):
if func is None:
func = bool
t, f = [], []
for item in seq:
if func(item):
t.append(item)
else:
f.append(item)
return (t, f)

Click to expand...

untested:

def split(seq, func=bool):
xs = zip(seq, itertools.imap(func, seq))
t = list(x for (x,y) in xs if y)
f = list(x for (x,y) in xs if not y)
return (t, f)

Click to expand...

In my testing that is 3.5x slower than the original solution (and less
clear imo). I fixed my version to take a bool default. Either way, I'm
not really looking for additional ways to do this in Python unless
I've totally missed something. What I am considering is writing it in
C, much like filter.

I'm a little skeptical that the C version will help much, if it's
evaluating a python function at every list element. Here's a variant
of your version:

def split(seq, func=bool):
t, f = [], []
ta, fa = t.append, f.append
for item in seq:
(ta if func(item) else fa)(item)
return (t, f)

This avoids some dict lookups and copying. I wonder if that helps
significantly.

B

Brad

Jul 3, 2009

#8

Brad said:
Brad said:

def split(seq, func=None):
if func is None:
func = bool
t, f = [], []
for item in seq:
if func(item):
t.append(item)
else:
f.append(item)
return (t, f)
untested:
def split(seq, func=bool):
xs = zip(seq, itertools.imap(func, seq))
t = list(x for (x,y) in xs if y)
f = list(x for (x,y) in xs if not y)
return (t, f)

Click to expand...

Click to expand...

In my testing that is 3.5x slower than the original solution (and less
clear imo). I fixed my version to take a bool default. Either way, I'm
not really looking for additional ways to do this in Python unless
I've totally missed something. What I am considering is writing it in
C, much like filter.

Click to expand...

I'm a little skeptical that the C version will help much, if it's
evaluating a python function at every list element.

Perhaps true, but it would be a nice convenience (for me) as a built-
in written in either Python or C. Although the default case of a bool
function would surely be faster.

Here's a variant of your version:

def split(seq, func=bool):
   t, f = [], []
ta, fa = t.append, f.append
   for item in seq:
(ta if func(item) else fa)(item)
   return (t, f)

This avoids some dict lookups and copying. I wonder if that helps
significantly.

Faster, but in tests of a few short sequences only 1% so.

-Brad

P

Pablo Torres N.

Jul 3, 2009

#9

Brad said:
Brad said:

def split(seq, func=None):
Â Â if func is None:
Â Â Â Â func = bool
Â Â t, f = [], []
Â Â for item in seq:
Â Â Â Â if func(item):
Â Â Â Â Â Â t.append(item)
Â Â Â Â else:
Â Â Â Â Â Â f.append(item)
Â Â return (t, f)

untested:

Click to expand...

Â Â def split(seq, func=bool):
Â Â Â xs = zip(seq, itertools.imap(func, seq))
Â Â Â t = list(x for (x,y) in xs if y)
Â Â Â f = list(x for (x,y) in xs if not y)
Â Â Â return (t, f)

Click to expand...

In my testing that is 3.5x slower than the original solution (and less
clear imo). I fixed my version to take a bool default. Either way, I'm
not really looking for additional ways to do this in Python unless
I've totally missed something. What I am considering is writing it in
C, much like filter.

Click to expand...

I'm a little skeptical that the C version will help much, if it's
evaluating a python function at every list element.

Click to expand...

Perhaps true, but it would be a nice convenience (for me) as a built-
in written in either Python or C. Although the default case of a bool
function would surely be faster.

Here's a variant of your version:

Â def split(seq, func=bool):
Â Â Â t, f = [], []
Â Â Â ta, fa = t.append, f.append
Â Â Â for item in seq:
Â Â Â Â Â (ta if func(item) else fa)(item)
Â Â Â return (t, f)

This avoids some dict lookups and copying. Â I wonder if that helps
significantly.

Click to expand...

Faster, but in tests of a few short sequences only 1% so.

-Brad

If it is speed that we are after, it's my understanding that map and
filter are faster than iterating with the for statement (and also
faster than list comprehensions). So here is a rewrite:

def split(seq, func=bool):
t = filter(func, seq)
f = filter(lambda x: not func(x), seq)
return list(t), list(f)

The lambda thing is kinda ugly, but I can't think of anything else.
Also, is it ok to return lists? Py3k saw a lot of APIs changed to
return iterables instead of lists, so maybe my function should have
'return t, f' as it's last statement.

P

Paul Rubin

Jul 3, 2009

#10

Pablo Torres N. said:
def split(seq, func=bool):
t = filter(func, seq)
f = filter(lambda x: not func(x), seq)
return list(t), list(f)

That is icky--you're calling func (which might be slow) twice instead
of once on every element of the seq.

B

Brad

Jul 3, 2009

#11

If it is speed that we are after, it's my understanding that map and
filter are faster than iterating with the for statement (and also
faster than list comprehensions). So here is a rewrite:

def split(seq, func=bool):
t = filter(func, seq)
f = filter(lambda x: not func(x), seq)
return list(t), list(f)

In my simple tests, that takes 1.8x as long as the original solution.
Better than the itertools solution, when "func" is short and fast. I
think the solution here would worse if func was more complex.

Either way, what I am still wondering is if people would find a built-
in implementation useful?

-Brad

G

Gabriel Genellina

Jul 3, 2009

#12

En Fri said:
That is icky--you're calling func (which might be slow) twice instead
of once on every element of the seq.

In addition, this doesn't work if seq is an iterator instead of a sequence.

B

Brad

Jul 3, 2009

#13

This sounds like it belongs to the python-ideas list. I suggest
posting there for better feedback, since the core developers check
that list more often than this one.

I tried posting on python-ideas and received a "You are not allowed to
post to this mailing list" reply. Perhaps because I am posting through
Google groups? Or maybe one must be an approved member to post?

-Brad

R

Rickard Lindberg

Jul 3, 2009

#14

I tried posting on python-ideas and received a "You are not allowed to

post to this mailing list" reply. Perhaps because I am posting through
Google groups? Or maybe one must be an approved member to post?

If you got an "awaiting moderator approval" message you post might appear on
the list soon. The reason for getting those can be that it is a member only
list and you posted from another address. I am not sure if that was the message
you got.

L

Lie Ryan

Jul 3, 2009

#15

Brad said:
In my simple tests, that takes 1.8x as long as the original solution.
Better than the itertools solution, when "func" is short and fast. I
think the solution here would worse if func was more complex.

Either way, what I am still wondering is if people would find a built-
in implementation useful?

-Brad

A built-in/itertools should always try to provide the general solution
to be as useful as possible, something like this:

def group(seq, func=bool):
ret = {}
for item in seq:
fitem = func(item)
try:
ret[fitem].append(item)
except KeyError:
ret[fitem] = [item]
return ret

definitely won't be faster, but it is a much more general solution.
Basically, the function allows you to group sequences based on the
result of func(item). It is similar to itertools.groupby() except that
this also group non-contiguous items.

L

Lie Ryan

Jul 3, 2009

#16

Rickard said:
If you got an "awaiting moderator approval" message you post might appear on
the list soon. The reason for getting those can be that it is a member only
list and you posted from another address. I am not sure if that was the message
you got.

AFAIK, python-ideas is not moderated (I followed python-ideas). I've
never used Google Groups to access it though. Try subscribing to the
mailing list directly (instead of using Google Group's web-gateway)
here: http://mail.python.org/mailman/listinfo/python-ideas

S

Steven D'Aprano

Jul 3, 2009

#17

This sounds like it belongs to the python-ideas list. I suggest posting
there for better feedback, since the core developers check that list
more often than this one.

If you post to python-ideas, you'll probably be told to gather feedback
here first. The core-developers aren't hugely interested in arbitrary new
features unless they have significant community support.

I've never needed such a split function, and I don't like the name, and
the functionality isn't general enough. I'd prefer something which splits
the input sequence into as many sublists as necessary, according to the
output of the key function. Something like itertools.groupby(), except it
runs through the entire sequence and collates all the elements with
identical keys.

E.g.:

splitby(range(10), lambda n: n%3)
=> [ (0, [0, 3, 6, 9]),
(1, [1, 4, 7]),
(2, [2, 5, 8]) ]

Your split() would be nearly equivalent to this with a key function that
returns a Boolean.

P

Paul Rubin

Jul 3, 2009

#18

Steven D'Aprano said:
I've never needed such a split function, and I don't like the name, and
the functionality isn't general enough. I'd prefer something which splits
the input sequence into as many sublists as necessary, according to the
output of the key function. Something like itertools.groupby(), except it
runs through the entire sequence and collates all the elements with
identical keys.

No really, groupby makes iterators, not lists, and it you have to
develop quite a delicate sense of when you can use it without having
bugs caused by the different iterators it makes getting advanced at
the wrong times. The concept of a split function that actually works
on lists is useful. I'm neutral about whether it's worth having a C
version in the stdlib.

C

Chris Rebert

Jul 3, 2009

#19

In my simple tests, that takes 1.8x as long as the original solution.
Better than the itertools solution, when "func" is short and fast. I
think the solution here would worse if func was more complex.

Either way, what I am still wondering is if people would find a built-
in implementation useful?

FWIW, Ruby has Enumerable#partition, which does the same thing as
split() and has a better name IMHO.
http://www.ruby-doc.org/core/classes/Enumerable.html#M003130

Cheers,
Chris

S

Steven D'Aprano

Jul 3, 2009

#20

No really, groupby makes iterators, not lists, and it you have to
develop quite a delicate sense of when you can use it without having
bugs caused by the different iterators it makes getting advanced at the
wrong times. The concept of a split function that actually works on
lists is useful. I'm neutral about whether it's worth having a C
version in the stdlib.

groupby() works on lists.

The difference between what I'm suggesting and what groupby() does is
that my suggestion would collate *all* the elements with the same key,
not just runs of them. This (as far as I can tell) requires returning
lists rather than iterators.

The most important difference between my suggestion and that of the OP is
that he limited the key function to something which returns a truth
value, while I'm looking for something more general which can split the
input into an arbitrary number of collated sublists.

generate De Bruijn sequence memory and string vs lists	0	Jan 23, 2014
ChatBot	4	Jan 19, 2021
Splitting a line while keeping quoted items together	1	Nov 20, 2012
pivot() equivalent	2	Mar 11, 2010
Just curious - I thought dict keys would maintain sequence	0	Mar 18, 2013
Splitting URLs	4	Oct 21, 2007
Feature Proposal: Sequence .join method	10	Sep 30, 2005
Fwd: Issue with continous incrementing of unbroken sequence for aentire working day	1	Feb 28, 2013

schickb

Pablo Torres N.

Paul Rubin

Pablo Torres N.

Brad

Brad

Paul Rubin

Brad

Pablo Torres N.

Paul Rubin

Brad

Gabriel Genellina

Brad

Rickard Lindberg

Lie Ryan

Lie Ryan

Steven D'Aprano

Paul Rubin

Chris Rebert

Steven D'Aprano

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads