Sequence splitting

S

schickb

I have fairly often found the need to split a sequence into two groups
based on a function result. Much like the existing filter function,
but returning a tuple of true, false sequences. In Python, something
like:

def split(seq, func=None):
if func is None:
func = bool
t, f = [], []
for item in seq:
if func(item):
t.append(item)
else:
f.append(item)
return (t, f)

The discussion linked to below has various approaches for doing this
now, but most traverse the sequence twice and many don't apply a
function to spit the sequence.
http://stackoverflow.com/questions/949098/python-split-a-list-based-on-a-condition

Is there any interest in a C implementation of this? Seems too trivial
to write a PEP, so I'm just trying to measure interest before diving
in. This wouldn't really belong in intertool. Would it be best
implemented as a top level built-in?

-Brad
 
P

Pablo Torres N.

I have fairly often found the need to split a sequence into two groups
based on a function result. Much like the existing filter function,
but returning a tuple of true, false sequences. In Python, something
like:

def split(seq, func=None):
   if func is None:
       func = bool
   t, f = [], []
   for item in seq:
       if func(item):
           t.append(item)
       else:
           f.append(item)
   return (t, f)

The discussion linked to below has various approaches for doing this
now, but most traverse the sequence twice and many don't apply a
function to spit the sequence.
http://stackoverflow.com/questions/949098/python-split-a-list-based-on-a-condition

Is there any interest in a C implementation of this? Seems too trivial
to write a PEP, so I'm just trying to measure interest before diving
in. This wouldn't really belong in intertool. Would it be best
implemented as a top level built-in?

-Brad

This sounds like it belongs to the python-ideas list. I suggest
posting there for better feedback, since the core developers check
that list more often than this one.
 
P

Paul Rubin

schickb said:
def split(seq, func=None):
if func is None:
func = bool
t, f = [], []
for item in seq:
if func(item):
t.append(item)
else:
f.append(item)
return (t, f)

untested:

def split(seq, func=bool):
xs = zip(seq, itertools.imap(func, seq))
t = list(x for (x,y) in xs if y)
f = list(x for (x,y) in xs if not y)
return (t, f)
 
P

Pablo Torres N.

I have fairly often found the need to split a sequence into two groups
based on a function result. Much like the existing filter function,
but returning a tuple of true, false sequences. In Python, something
like:

def split(seq, func=None):
    if func is None:
        func = bool
    t, f = [], []
    for item in seq:
        if func(item):
            t.append(item)
        else:
            f.append(item)
    return (t, f)

The discussion linked to below has various approaches for doing this
now, but most traverse the sequence twice and many don't apply a
function to spit the sequence.http://stackoverflow.com/questions/949098/python-split-a-list-based-o...

Is there any interest in a C implementation of this? Seems too trivial
to write a PEP, so I'm just trying to measure interest before diving
in. This wouldn't really belong in intertool. Would it be best
implemented as a top level built-in?

-Brad

This sounds like it belongs to the python-ideas list. I suggest
posting there for better feedback, since the core developers check
that list more often than this one.
 
B

Brad

schickb said:
def split(seq, func=None):
    if func is None:
        func = bool
    t, f = [], []
    for item in seq:
        if func(item):
            t.append(item)
        else:
            f.append(item)
    return (t, f)

untested:

   def split(seq, func=bool):
      xs = zip(seq, itertools.imap(func, seq))
      t = list(x for (x,y) in xs if y)
      f = list(x for (x,y) in xs if not y)
      return (t, f)

In my testing that is 3.5x slower than the original solution (and less
clear imo). I fixed my version to take a bool default. Either way, I'm
not really looking for additional ways to do this in Python unless
I've totally missed something. What I am considering is writing it in
C, much like filter.

-Brad
 
B

Brad

This sounds like it belongs to the python-ideas list.  I suggest
posting there for better feedback, since the core developers check
that list more often than this one.

Thanks, I didn't know about that list.
 
P

Paul Rubin

Brad said:
schickb said:
def split(seq, func=None):
    if func is None:
        func = bool
    t, f = [], []
    for item in seq:
        if func(item):
            t.append(item)
        else:
            f.append(item)
    return (t, f)

untested:

   def split(seq, func=bool):
      xs = zip(seq, itertools.imap(func, seq))
      t = list(x for (x,y) in xs if y)
      f = list(x for (x,y) in xs if not y)
      return (t, f)

In my testing that is 3.5x slower than the original solution (and less
clear imo). I fixed my version to take a bool default. Either way, I'm
not really looking for additional ways to do this in Python unless
I've totally missed something. What I am considering is writing it in
C, much like filter.

I'm a little skeptical that the C version will help much, if it's
evaluating a python function at every list element. Here's a variant
of your version:

def split(seq, func=bool):
    t, f = [], []
ta, fa = t.append, f.append
    for item in seq:
(ta if func(item) else fa)(item)
    return (t, f)

This avoids some dict lookups and copying. I wonder if that helps
significantly.
 
B

Brad

Brad said:
def split(seq, func=None):
    if func is None:
        func = bool
    t, f = [], []
    for item in seq:
        if func(item):
            t.append(item)
        else:
            f.append(item)
    return (t, f)
untested:
   def split(seq, func=bool):
      xs = zip(seq, itertools.imap(func, seq))
      t = list(x for (x,y) in xs if y)
      f = list(x for (x,y) in xs if not y)
      return (t, f)
In my testing that is 3.5x slower than the original solution (and less
clear imo). I fixed my version to take a bool default. Either way, I'm
not really looking for additional ways to do this in Python unless
I've totally missed something. What I am considering is writing it in
C, much like filter.

I'm a little skeptical that the C version will help much, if it's
evaluating a python function at every list element.  

Perhaps true, but it would be a nice convenience (for me) as a built-
in written in either Python or C. Although the default case of a bool
function would surely be faster.
Here's a variant of your version:

 def split(seq, func=bool):
     t, f = [], []
     ta, fa = t.append, f.append
     for item in seq:
         (ta if func(item) else fa)(item)
     return (t, f)

This avoids some dict lookups and copying.  I wonder if that helps
significantly.

Faster, but in tests of a few short sequences only 1% so.

-Brad
 
P

Pablo Torres N.

Brad said:
def split(seq, func=None):
    if func is None:
        func = bool
    t, f = [], []
    for item in seq:
        if func(item):
            t.append(item)
        else:
            f.append(item)
    return (t, f)
untested:

   def split(seq, func=bool):
      xs = zip(seq, itertools.imap(func, seq))
      t = list(x for (x,y) in xs if y)
      f = list(x for (x,y) in xs if not y)
      return (t, f)
In my testing that is 3.5x slower than the original solution (and less
clear imo). I fixed my version to take a bool default. Either way, I'm
not really looking for additional ways to do this in Python unless
I've totally missed something. What I am considering is writing it in
C, much like filter.

I'm a little skeptical that the C version will help much, if it's
evaluating a python function at every list element.

Perhaps true, but it would be a nice convenience (for me) as a built-
in written in either Python or C. Although the default case of a bool
function would surely be faster.
Here's a variant of your version:

 def split(seq, func=bool):
     t, f = [], []
     ta, fa = t.append, f.append
     for item in seq:
         (ta if func(item) else fa)(item)
     return (t, f)

This avoids some dict lookups and copying.  I wonder if that helps
significantly.

Faster, but in tests of a few short sequences only 1% so.

-Brad

If it is speed that we are after, it's my understanding that map and
filter are faster than iterating with the for statement (and also
faster than list comprehensions). So here is a rewrite:

def split(seq, func=bool):
t = filter(func, seq)
f = filter(lambda x: not func(x), seq)
return list(t), list(f)

The lambda thing is kinda ugly, but I can't think of anything else.
Also, is it ok to return lists? Py3k saw a lot of APIs changed to
return iterables instead of lists, so maybe my function should have
'return t, f' as it's last statement.
 
P

Paul Rubin

Pablo Torres N. said:
def split(seq, func=bool):
t = filter(func, seq)
f = filter(lambda x: not func(x), seq)
return list(t), list(f)

That is icky--you're calling func (which might be slow) twice instead
of once on every element of the seq.
 
B

Brad

If it is speed that we are after, it's my understanding that map and
filter are faster than iterating with the for statement (and also
faster than list comprehensions).  So here is a rewrite:

def split(seq, func=bool):
        t = filter(func, seq)
        f = filter(lambda x: not func(x), seq)
        return list(t), list(f)

In my simple tests, that takes 1.8x as long as the original solution.
Better than the itertools solution, when "func" is short and fast. I
think the solution here would worse if func was more complex.

Either way, what I am still wondering is if people would find a built-
in implementation useful?

-Brad
 
B

Brad

This sounds like it belongs to the python-ideas list.  I suggest
posting there for better feedback, since the core developers check
that list more often than this one.

I tried posting on python-ideas and received a "You are not allowed to
post to this mailing list" reply. Perhaps because I am posting through
Google groups? Or maybe one must be an approved member to post?

-Brad
 
R

Rickard Lindberg

I tried posting on python-ideas and received a "You are not allowed to
post to this mailing list" reply. Perhaps because I am posting through
Google groups? Or maybe one must be an approved member to post?

If you got an "awaiting moderator approval" message you post might appear on
the list soon. The reason for getting those can be that it is a member only
list and you posted from another address. I am not sure if that was the message
you got.
 
L

Lie Ryan

Brad said:
In my simple tests, that takes 1.8x as long as the original solution.
Better than the itertools solution, when "func" is short and fast. I
think the solution here would worse if func was more complex.

Either way, what I am still wondering is if people would find a built-
in implementation useful?

-Brad

A built-in/itertools should always try to provide the general solution
to be as useful as possible, something like this:

def group(seq, func=bool):
ret = {}
for item in seq:
fitem = func(item)
try:
ret[fitem].append(item)
except KeyError:
ret[fitem] = [item]
return ret

definitely won't be faster, but it is a much more general solution.
Basically, the function allows you to group sequences based on the
result of func(item). It is similar to itertools.groupby() except that
this also group non-contiguous items.
 
L

Lie Ryan

Rickard said:
If you got an "awaiting moderator approval" message you post might appear on
the list soon. The reason for getting those can be that it is a member only
list and you posted from another address. I am not sure if that was the message
you got.

AFAIK, python-ideas is not moderated (I followed python-ideas). I've
never used Google Groups to access it though. Try subscribing to the
mailing list directly (instead of using Google Group's web-gateway)
here: http://mail.python.org/mailman/listinfo/python-ideas
 
S

Steven D'Aprano

This sounds like it belongs to the python-ideas list. I suggest posting
there for better feedback, since the core developers check that list
more often than this one.

If you post to python-ideas, you'll probably be told to gather feedback
here first. The core-developers aren't hugely interested in arbitrary new
features unless they have significant community support.

I've never needed such a split function, and I don't like the name, and
the functionality isn't general enough. I'd prefer something which splits
the input sequence into as many sublists as necessary, according to the
output of the key function. Something like itertools.groupby(), except it
runs through the entire sequence and collates all the elements with
identical keys.

E.g.:

splitby(range(10), lambda n: n%3)
=> [ (0, [0, 3, 6, 9]),
(1, [1, 4, 7]),
(2, [2, 5, 8]) ]

Your split() would be nearly equivalent to this with a key function that
returns a Boolean.
 
P

Paul Rubin

Steven D'Aprano said:
I've never needed such a split function, and I don't like the name, and
the functionality isn't general enough. I'd prefer something which splits
the input sequence into as many sublists as necessary, according to the
output of the key function. Something like itertools.groupby(), except it
runs through the entire sequence and collates all the elements with
identical keys.

No really, groupby makes iterators, not lists, and it you have to
develop quite a delicate sense of when you can use it without having
bugs caused by the different iterators it makes getting advanced at
the wrong times. The concept of a split function that actually works
on lists is useful. I'm neutral about whether it's worth having a C
version in the stdlib.
 
C

Chris Rebert

In my simple tests, that takes 1.8x as long as the original solution.
Better than the itertools solution, when "func" is short and fast. I
think the solution here would worse if func was more complex.

Either way, what I am still wondering is if people would find a built-
in implementation useful?

FWIW, Ruby has Enumerable#partition, which does the same thing as
split() and has a better name IMHO.
http://www.ruby-doc.org/core/classes/Enumerable.html#M003130

Cheers,
Chris
 
S

Steven D'Aprano

No really, groupby makes iterators, not lists, and it you have to
develop quite a delicate sense of when you can use it without having
bugs caused by the different iterators it makes getting advanced at the
wrong times. The concept of a split function that actually works on
lists is useful. I'm neutral about whether it's worth having a C
version in the stdlib.

groupby() works on lists.

The difference between what I'm suggesting and what groupby() does is
that my suggestion would collate *all* the elements with the same key,
not just runs of them. This (as far as I can tell) requires returning
lists rather than iterators.

The most important difference between my suggestion and that of the OP is
that he limited the key function to something which returns a truth
value, while I'm looking for something more general which can split the
input into an arbitrary number of collated sublists.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,582
Members
45,071
Latest member
MetabolicSolutionsKeto

Latest Threads

Top