Good use for itertools.dropwhile and itertools.takewhile

Nick Mellor

Hi,

I came across itertools.dropwhile only today, then shortly afterwards found Raymond Hettinger wondering, in 2007, whether to drop [sic] dropwhile and takewhile from the itertools module.

Fate of itertools.dropwhile() and itertools.takewhile() - Python
bytes.com
http://bit.ly/Vi2PqP

Almost nobody else of the 18 respondents seemed to be using them.

And then 2 hours later, a use case came along. I think. Anyone have any better solutions?

I have a file full of things like this:

"CAPSICUM RED fresh from Queensland"

Product names (all caps, at start of string) and descriptions (mixed case, to end of string) all muddled up in the same field. And I need to split them into two fields. Note that if the text had said:

"CAPSICUM RED fresh from QLD"

I would want QLD in the description, not shunted forwards and put in the product name. So (uncontrived) list comprehensions and regex's are out.

I want to split the above into:

("CAPSICUM RED", "fresh from QLD")

Enter dropwhile and takewhile. 6 lines later:

from itertools import takewhile, dropwhile
def split_product_itertools(s):
    words = s.split()
    allcaps = lambda word: word == word.upper()
    product, description = takewhile(allcaps, words), dropwhile(allcaps, words)
    return " ".join(product), " ".join(description)
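
To see why a trailing "QLD" stays in the description, here is a minimal sketch of what the two iterators yield for the example string. The names is_allcaps, product_words and description_words are only illustrative and are not from the post.

```python
from itertools import dropwhile, takewhile

def is_allcaps(word):
    # Same test as the allcaps lambda in the post.
    return word == word.upper()

words = "CAPSICUM RED fresh from QLD".split()

# takewhile yields items until the predicate first fails, then stops for good,
# so the later all-caps word "QLD" is never looked at.
product_words = list(takewhile(is_allcaps, words))

# dropwhile skips that same leading run and yields everything after it.
description_words = list(dropwhile(is_allcaps, words))

print(product_words)      # ['CAPSICUM', 'RED']
print(description_words)  # ['fresh', 'from', 'QLD']
```

Both calls walk the word list from the start, so the leading run is tested twice; for short product strings that cost is probably negligible.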


When I tried to refactor this code to use while or for loops, I couldn't find any way that felt shorter or more pythonic:

(9 lines: using for)

def split_product_1(s):
    words = s.split()
    product = []
    for word in words:
        if word == word.upper():
            product.append(word)
        else:
            break
    return " ".join(product), " ".join(words[len(product):])


(12 lines: using while)

def split_product_2(s):
    words = s.split()
    i = 0
    product = []
    while 1:
        word = words[i]
        if word == word.upper():
            product.append(word)
            i += 1
        else:
            break
    return " ".join(product), " ".join(words[i:])


Any thoughts?

Nick
 
Neil Cerutti

I'm really tempted to import re, and that means takewhile and
dropwhile need to stay. ;)

But seriously, this is a quick implementation of my first thought.

import string
description = s.lstrip(string.ascii_uppercase + ' ')
product = s[:-len(description)-1]
 
Nick Mellor

Hi Neil,

Nice! But fails if the first word of the description starts with a capital letter.

Nick
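
A small sketch of that failure, assuming Neil's two lines are wrapped in a helper (the name split_lstrip is made up here for illustration). Because lstrip removes every leading capital letter and space, it also eats the capital "F" of "Fresh":

```python
import string

def split_lstrip(s):
    # Neil's two-line idea, wrapped in a hypothetical helper for testing.
    description = s.lstrip(string.ascii_uppercase + ' ')
    product = s[:-len(description) - 1]
    return product, description

print(split_lstrip("CAPSICUM RED fresh from QLD"))
# ('CAPSICUM RED', 'fresh from QLD')

print(split_lstrip("CAPSICUM RED Fresh from Queensland"))
# ('CAPSICUM RED ', 'resh from Queensland')
```

In the second call the "F" ends up in neither field and the product keeps a trailing space.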


Nick Mellor

I love the way you guys can write a line of code that does the same as 20 of mine :)

I can turn up the heat on your regex by feeding it a null description or multiple white space (both in the original file.) I'm sure you'd adjust, but at the cost of a more complex regex.

Meanwhile takewhile and dropwhile are behaving themselves impeccably but my while loop has fallen over.

Best,

Nick

Vlastimil Brom said:

Hi,

the regex approach doesn't actually seem to be very complex, given the mentioned specification, e.g.

>>> import re
>>> re.findall(r"(?m)^([A-Z\s]+) (.+)$", "CAPSICUM RED fresh from QLD\nCAPSICUM RED fresh from Queensland")
[('CAPSICUM RED', 'fresh from QLD'), ('CAPSICUM RED', 'fresh from Queensland')]

(It might be necessary to account for some punctuation, whitespace etc. too.)

hth,

vbr
 
Neil Cerutti

Nick Mellor said:
I love the way you guys can write a line of code that does the
same as 20 of mine :)

I can turn up the heat on your regex by feeding it a null
description or multiple white space (both in the original
file.) I'm sure you'd adjust, but at the cost of a more complex
regex.

A re.split should be able to handle this without too much hassle.

The simplicity of my two-line version will evaporate pretty
quickly to compensate for edge cases.

Here's one that can handle one of the edge cases you mention, but
it's hardly any shorter than what you had, and it doesn't
preserve non-standard white space, like double spaces.

def prod_desc(s):
    """split s into product name and product description. Product
    name is a series of one or more capitalized words followed
    by white space. Everything after the trailing white space is
    the product description.

    >>> prod_desc("CAR FIFTY TWO Chrysler LeBaron.")
    ['CAR FIFTY TWO', 'Chrysler LeBaron.']
    """
    prod = []
    desc = []
    target = prod
    for word in s.split():
        if target is prod and not word.isupper():
            target = desc
        target.append(word)
    return [' '.join(prod), ' '.join(desc)]
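
One detail that may matter on real data: this version tests word.isupper(), while the original post tests word == word.upper(). The two disagree on words with no cased letters at all, such as a bare number. A quick sketch (the helper names allcaps_eq and allcaps_isupper are invented for the comparison):

```python
def allcaps_eq(word):
    # Test used in the original post: unchanged by upper().
    return word == word.upper()

def allcaps_isupper(word):
    # Test used in prod_desc: needs at least one cased character.
    return word.isupper()

for word in ["RED", "Red", "500", "2KG"]:
    print(word, allcaps_eq(word), allcaps_isupper(word))
# RED True True
# Red False False
# 500 True False
# 2KG True True
```

So a product name containing a purely numeric word would be split at different places by the two tests; which behaviour is wanted depends on the data.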

When str methods fail I'll usually write my own parser before
turning to re. The following is no longer nice looking at all.

def prod_desc(s):
    """split s into product name and product description. Product
    name is a series of one or more capitalized words followed
    by white space. Everything after the trailing white space is
    the product description.

    >>> prod_desc("CAR FIFTY TWO Chrysler LeBaron.")
    ['CAR FIFTY TWO', 'Chrysler LeBaron.']

    >>> prod_desc("MR. JONESEY Saskatchewan's finest")
    ['MR. JONESEY', "Saskatchewan's finest"]
    """
    i = 0
    while not s[i].islower():
        i += 1
    i -= 1
    while not s[i].isspace():
        i -= 1
    start_desc = i+1
    while s[i].isspace():
        i -= 1
    end_prod = i+1
    return [s[:end_prod], s[start_desc:]]
 
DJC

Another neat solution with a little help from

http://stackoverflow.com/questions/...st-element-of-a-list-which-makes-a-passed-fun

>>> def split_product(p):
...     w = p.split(" ")
...     j = (i for i,v in enumerate(w) if v.upper() != v).next()
...     return " ".join(w[:j]), " ".join(w[j:])

Python 2.7.3 (default, Sep 26 2012, 21:51:14)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> w1 = "CAPSICUM RED Fresh from Queensland"
>>> w1.split()
['CAPSICUM', 'RED', 'Fresh', 'from', 'Queensland']
>>> w = w1.split()
>>> (i for i,v in enumerate(w) if v.upper() != v).next()
2

Python 3.2.3 (default, Oct 19 2012, 19:53:16)
>>> (i for i,v in enumerate(w) if v.upper() != v).next()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'generator' object has no attribute 'next'
 
Alexander Blinne

On 04.12.2012 19:28, DJC wrote:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'generator' object has no attribute 'next'

Yeah, I saw this problem right after I sent the posting. It is now
supposed to read like this:

>>> def split_product(p):
...     w = p.split(" ")
...     j = next(i for i,v in enumerate(w) if v.upper() != v)
...     return " ".join(w[:j]), " ".join(w[j:])

Greetings
 
Ian Kelly

Alexander Blinne said:
>>> def split_product(p):
...     w = p.split(" ")
...     j = next(i for i,v in enumerate(w) if v.upper() != v)
...     return " ".join(w[:j]), " ".join(w[j:])

It still fails if the product description is empty.

>>> split_product("CAPSICUM RED")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in split_product
StopIteration

I'm not meaning to pick on you; some of the other solutions in this thread
also fail in that case.

>>> re.findall(r"(?m)^([A-Z\s]+) (.+)$", "CAPSICUM RED")
[('CAPSICUM', 'RED')]

>>> prod_desc("CAPSICUM RED")  # the second version from Neil's post
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 14, in prod_desc
IndexError: string index out of range
 
MRAB

>>> re.findall(r"(?m)^([A-Z\s]+) (.+)$", "CAPSICUM RED")
[('CAPSICUM', 'RED')]

That's easily fixed:

>>> re.findall(r"(?m)^([A-Z\s]+)(?: (.*))?$", "CAPSICUM RED")
[('CAPSICUM RED', '')]
 
Alexander Blinne

On 04.12.2012 20:37, Ian Kelly wrote:
>>> def split_product(p):
...     w = p.split(" ")
...     j = next(i for i,v in enumerate(w) if v.upper() != v)
...     return " ".join(w[:j]), " ".join(w[j:])

It still fails if the product description is empty.

That's true... let's see, next() takes a default value in case the
iterator is empty, and then we could use some special value and test for
it. But I think it would be more elegant to just handle the exception
ourselves, so:

>>> def split_product(p):
...     w = p.split(" ")
...     try:
...         j = next(i for i,v in enumerate(w) if v.upper() != v)
...     except StopIteration:
...         return p, ''
...     return " ".join(w[:j]), " ".join(w[j:])

I'm not meaning to pick on you; some of the other solutions in this
thread also fail in that case.

It's OK, opening the eye for edge cases is always a good idea :)
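
For comparison, the other route mentioned here (giving next() a default) might look like the sketch below. It is only one possible reading: it uses len(w) as the default, so that both slices still work when no word fails the test, instead of a special marker value that is checked afterwards.

```python
def split_product(p):
    w = p.split(" ")
    # If every word is all caps the generator is empty, so fall back to
    # len(w): the product is then the whole list and the description is empty.
    j = next((i for i, v in enumerate(w) if v.upper() != v), len(w))
    return " ".join(w[:j]), " ".join(w[j:])

print(split_product("CAPSICUM RED fresh from QLD"))  # ('CAPSICUM RED', 'fresh from QLD')
print(split_product("CAPSICUM RED"))                 # ('CAPSICUM RED', '')
```

Note the extra parentheses around the generator expression: they are required once next() is given a second argument.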

Greetings
 
Terry Reedy

If the original string has no excess whitespace, description is what
remains of s after product prefix is omitted. (Py 3 code)

from itertools import takewhile
def allcaps(word): return word == word.upper()

def split_product_itertools(s):
    product = ' '.join(takewhile(allcaps, s.split()))
    return product, s[len(product)+1:]

print(split_product_itertools("CAPSICUM RED fresh from QLD"))
('CAPSICUM RED', 'fresh from QLD')

Without that assumption, the same idea applies to the split list.

def split_product_itertools(s):
    words = s.split()
    product = list(takewhile(allcaps, words))
    return ' '.join(product), ' '.join(words[len(product):])
 
Vlastimil Brom

2012/12/4 Nick Mellor said:
I love the way you guys can write a line of code that does the same as 20 of mine :)
I can turn up the heat on your regex by feeding it a null description or multiple white space (both in the original file.) I'm sure you'd adjust, but at the cost of a more complex regex.
Meanwhile takewhile and dropwhile are behaving themselves impeccably but my while loop has fallen over.

Best,
Nick

Hi,
well, for what it is worth, both cases could be addressed quite
easily, with little added complexity - e.g.: make the description part
optional, allow multiple whitespace and enforce a word boundary after
the product name in order to get rid of the trailing whitespace in it:

>>> re.findall(r"(?m)^([A-Z\s]+\b)(?:\s+(.*))?$", "CAPSICUM RED fresh from QLD\nCAPSICUM RED fresh from Queensland\nCAPSICUM RED")
[('CAPSICUM RED', 'fresh from QLD'), ('CAPSICUM RED', 'fresh from Queensland'), ('CAPSICUM RED', '')]

However, it's certainly preferable to use a solution you are more
comfortable with, e.g. the itertools one...
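
For a single string rather than a whole file, the same pattern could be wrapped roughly as follows. This is a sketch only: the names PRODUCT_RE and split_product_regex are invented here, the (?m) flag and ^ are dropped because re.match already anchors at the start, and returning an empty product when nothing matches is an assumption about the desired behaviour.

```python
import re

# Same pattern as in the post, minus (?m) and ^ since match() anchors at the start.
PRODUCT_RE = re.compile(r"([A-Z\s]+\b)(?:\s+(.*))?$")

def split_product_regex(s):
    m = PRODUCT_RE.match(s)
    if m is None:
        # Assumption: no leading all-caps run means no product name.
        return "", s
    # group(2) is None when the optional description part did not take part.
    return m.group(1), m.group(2) or ""

print(split_product_regex("CAPSICUM RED  fresh from QLD"))  # ('CAPSICUM RED', 'fresh from QLD')
print(split_product_regex("CAPSICUM RED"))                  # ('CAPSICUM RED', '')
```

The first call has two spaces before the description, which the \s+ absorbs.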

regards,
vbr
 
Steven D'Aprano

Ian,

For the sanity of those of us reading this via Usenet using the Pan
newsreader, could you please turn off HTML emailing? It's very
distracting.

Thanks,

Steven


On Tue, 04 Dec 2012 12:37:38 -0700, Ian Kelly wrote:

[...]
<div class="gmail_quote">On Tue,
Dec 4, 2012 at 11:48 AM, Alexander Blinne <span dir="ltr">&lt;<a
href="mailto:[email protected]"
target="_blank">[email protected]</a>&gt;</span> wrote:<br><blockquote
class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc
solid;padding-left:1ex">

Am 04.12.2012 19:28, schrieb DJC:<br> <div class="im">&gt;&gt;&gt;&gt;
(i for i,v in enumerate(w) if v.upper() != v).next()<br> &gt; Traceback
(most recent call last):<br> &gt;   File &quot;&lt;stdin&gt;&quot;, line
1, in &lt;module&gt;<br> &gt; AttributeError: 'generator' object
has no attribute 'next'<br> <br>
</div>Yeah, i saw this problem right after i sent the posting. It now
is<br> supposed to read like this<br>
<div class="im"><br>
&gt;&gt;&gt; def split_product(p):<br> ...     w = p.split(&quot;
&quot;)<br> </div>...     j = next(i for i,v in enumerate(w) if
v.upper() != v)<br> <div class="im">...     return &quot;
&quot;.join(w[:j]), &quot;
&quot;.join(w[j:])<br></div></blockquote></div><br>It still fails if the
product description is empty.<br><br>&gt;&gt;&gt;
split_product(&quot;CAPSICUM RED&quot;)<br>

Traceback (most recent call last):<br>  File &quot;&lt;stdin&gt;&quot;,
line 1, in &lt;module&gt;<br>  File &quot;&lt;stdin&gt;&quot;, line 3,
in split_product<br>StopIteration<br><br>I'm not meaning to pick on
you; some of the other solutions in this thread also fail in that
case.<br>

<br>&gt;&gt;&gt; re.findall(r&quot;(?m)^([A-Z\s]+) (.+)$&quot;,
&quot;CAPSICUM RED&quot;)<br>[('CAPSICUM',
'RED')]<br><br>&gt;&gt;&gt; prod_desc(&quot;CAPSICUM RED&quot;) 
# the second version from Neil's post<br>

Traceback (most recent call last):<br>  File &quot;&lt;stdin&gt;&quot;,
line 1, in &lt;module&gt;<br>  File &quot;&lt;stdin&gt;&quot;, line 14,
in prod_desc<br>IndexError: string index out of range<br><br>
 
Terry Reedy

Because these slice rather than index, either works trivially on an
empty description.

print(split_product_itertools("CAPSICUM RED"))
('CAPSICUM RED', '')
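
The "no excess whitespace" assumption can be checked with a short sketch that puts the two versions side by side. The two functions are renamed here (split_by_slicing_string and split_by_slicing_words are illustrative names) only so both can live in one file.

```python
from itertools import takewhile

def allcaps(word):
    return word == word.upper()

def split_by_slicing_string(s):
    # First version: slice the original string after the joined product.
    product = ' '.join(takewhile(allcaps, s.split()))
    return product, s[len(product) + 1:]

def split_by_slicing_words(s):
    # Second version: slice the list of words instead.
    words = s.split()
    product = list(takewhile(allcaps, words))
    return ' '.join(product), ' '.join(words[len(product):])

messy = "CAPSICUM  RED fresh from QLD"  # two spaces after CAPSICUM
print(split_by_slicing_string(messy))  # ('CAPSICUM RED', ' fresh from QLD')
print(split_by_slicing_words(messy))   # ('CAPSICUM RED', 'fresh from QLD')
```

With the doubled space, the joined product is one character shorter than the text it came from, so the string slice starts one character early and the description picks up a leading space.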
 
Nick Mellor

Hi Terry,

For my money, and especially in your versions, despite several expert solutions using other features, itertools has it. It seems to me to need less nutting out than the other approaches. It's short, robust, has a minimum of symbols, uses simple expressions and is not overly clever. If we could just get used to using takewhile.

takewhile mines for gold at the start of a sequence, dropwhile drops the dross at the start of a sequence.

Thanks all for your interest and your help,

Best,

Nick

Terry's implementations:

from itertools import takewhile

def allcaps(word): return word == word.upper()

def split_product_itertools(s):
    product = ' '.join(takewhile(allcaps, s.split()))
    return product, s[len(product)+1:]

print(split_product_itertools("CAPSICUM RED fresh from QLD"))
('CAPSICUM RED', 'fresh from QLD')

[if there could be surplus whitespace], the same idea applies to the split list.

def split_product_itertools(s):
    words = s.split()
    product = list(takewhile(allcaps, words))
    return ' '.join(product), ' '.join(words[len(product):])
 
Neil Cerutti

The main reason most of the solutions posted failed is the lack of a
complete specification to work with, while simultaneously trying
to make as tiny and simplistic a solution as possible.

I'm struggling with the empty description bug right now. ;)
 
