A gnarly little python loop

Roy Smith

I'm trying to pull down tweets with one of the many Twitter APIs. The
particular one I'm using (python-twitter) has a call:

    data = api.GetSearch(term="foo", page=page)

The way it works, you start with page=1. It returns a list of tweets.
If the list is empty, there are no more tweets. If the list is not
empty, you can try to get more tweets by asking for page=2, page=3, etc.
I've got:

    page = 1
    while 1:
        r = api.GetSearch(term="foo", page=page)
        if not r:
            break
        for tweet in r:
            process(tweet)
        page += 1

It works, but it seems excessively fidgety. Is there some cleaner way
to refactor this?
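
For anyone who wants to try the loop without Twitter credentials, here is a minimal sketch with a stand-in for the API. `FakeApi`, its canned pages, and `collect_all` are hypothetical names made up for illustration; only the `GetSearch(term=..., page=...)` call shape and the "empty list means no more tweets" behaviour come from the post above. It is written for Python 3.

```python
class FakeApi:
    """Hypothetical stand-in for the python-twitter client described above."""

    def __init__(self, pages):
        self._pages = pages  # list of lists; pages[0] is served as page=1

    def GetSearch(self, term, page=1):
        # Pages are 1-based; anything past the last page is an empty list.
        if 1 <= page <= len(self._pages):
            return self._pages[page - 1]
        return []


def collect_all(api, term):
    """The loop from the post, collecting tweets instead of processing them."""
    seen = []
    page = 1
    while True:
        r = api.GetSearch(term=term, page=page)
        if not r:
            break
        for tweet in r:
            seen.append(tweet)
        page += 1
    return seen


api = FakeApi([["t1", "t2"], ["t3"], ["t4", "t5"]])
print(collect_all(api, "foo"))
```

Any of the refactorings discussed in this thread can be checked against the same stand-in, since they only rely on `GetSearch` returning an empty list once the pages run out.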
 
Ian Kelly

I'd do something like this:

    import itertools

    def get_tweets(term):
        # count(1) yields page numbers 1, 2, 3, ... without a manual counter
        for page in itertools.count(1):
            r = api.GetSearch(term=term, page=page)
            if not r:
                break
            for tweet in r:
                yield tweet

    for tweet in get_tweets("foo"):
        process(tweet)
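
One consequence of the generator form is that pages are fetched lazily: a page is only requested once the consumer has used up the previous one. The sketch below tries to show that under stated assumptions. `fetch_page` and its canned data are hypothetical stand-ins for `api.GetSearch(term="foo", page=page)`, and the generator takes the fetch function as a parameter so it can run without the real API (Python 3).

```python
import itertools


def get_tweets(fetch_page):
    """Yield items from successive 1-based pages until an empty page."""
    for page in itertools.count(1):
        r = fetch_page(page)
        if not r:
            break
        for tweet in r:
            yield tweet


calls = []  # records which pages were actually requested


def fetch_page(page):
    # Hypothetical stand-in for api.GetSearch(term="foo", page=page).
    calls.append(page)
    data = {1: ["a", "b"], 2: ["c", "d"], 3: ["e"]}
    return data.get(page, [])


# Take only the first three tweets; page 3 is never requested.
first_three = list(itertools.islice(get_tweets(fetch_page), 3))
print(first_three, calls)
```

If the caller stops early, the remaining pages are simply never fetched, which matters when each page costs a network round trip.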
 
Steven D'Aprano

Seems clean enough to me. It does exactly what you need: loop until there
are no more tweets, process each tweet.

If you're allergic to nested loops, move the inner for-loop into a
function. Also you could get rid of the "if not r: break".

    page = 1
    r = ["placeholder"]
    while r:
        r = api.GetSearch(term="foo", page=page)
        process_all(r)  # does nothing if r is empty
        page += 1


Another way would be to use a for loop for the outer loop.

    import sys

    for page in xrange(1, sys.maxint):
        r = api.GetSearch(term="foo", page=page)
        if not r:
            break
        process_all(r)
 
Steve Howell

I think your code is perfectly readable and clean, but you can flatten
it like so:

    import itertools

    def get_tweets(term, get_page):
        page_nums = itertools.count(1)
        # get_page is the page-fetching call, e.g. api.GetSearch
        pages = itertools.imap(lambda page: get_page(term=term, page=page),
                               page_nums)
        valid_pages = itertools.takewhile(bool, pages)
        tweets = itertools.chain.from_iterable(valid_pages)
        return tweets
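
For readers on Python 3, where `itertools.imap` is gone and the built-in `map` is already lazy, the same pipeline might look like the sketch below. `fake_get_page` is a hypothetical stand-in for the real search call, and passing the fetch function in as `get_page` is an assumption about what that parameter was meant for.

```python
import itertools


def get_tweets(term, get_page):
    page_nums = itertools.count(1)
    # map() is lazy in Python 3, so pages are only fetched on demand.
    pages = map(lambda page: get_page(term, page), page_nums)
    # takewhile(bool, ...) stops at the first empty (falsy) page.
    valid_pages = itertools.takewhile(bool, pages)
    return itertools.chain.from_iterable(valid_pages)


def fake_get_page(term, page):
    # Hypothetical stand-in for api.GetSearch: two pages, then empty.
    data = {1: [term + "-1", term + "-2"], 2: [term + "-3"]}
    return data.get(page, [])


print(list(get_tweets("foo", fake_get_page)))
```

Each step maps onto one part of the original loop: `count(1)` is the page counter, `takewhile(bool, ...)` is the `if not r: break`, and `chain.from_iterable` is the inner for-loop.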
 
Stefan Behnel

Steve Howell, 11.11.2012 04:03:
I think your code is perfectly readable and clean, but you can flatten
it like so:

    def get_tweets(term, get_page):
        page_nums = itertools.count(1)
        pages = itertools.imap(lambda page: get_page(term=term, page=page),
                               page_nums)
        valid_pages = itertools.takewhile(bool, pages)
        tweets = itertools.chain.from_iterable(valid_pages)
        return tweets

I'd prefer the original code ten times over this inaccessible beast.

Stefan
 
rusi

This is a classic problem -- structure clash of parallel loops -- and
Steve Howell has given the classic solution using the fact that
generators in python simulate/implement lazy lists.
As David Beazley (http://www.dabeaz.com/coroutines/) explains,
coroutines are more general than generators and you can use those if
you prefer.

The classic problem used to be stated like this:
There is an input on cards of 80 columns.
It needs to be copied onto a printer of 132 columns.

The structure clash arises because after reading 80 chars a new card
has to be read; after printing 132 chars a linefeed has to be given.

To pythonize the problem, let's replace the 80,132 by 3,4, i.e. take the
char-square
abc
def
ghi

and produce
abcd
efgh
i

The important difference (explained nicely by Beazley) is that with
generators the for-loop pulls values out of the generator, whereas with
coroutines the 'generator' pushes values into the consuming coroutines.


---------------
    from __future__ import print_function
    s = ["abc", "def", "ghi"]

    # Coroutine infrastructure from PEP 342
    def consumer(func):
        def wrapper(*args, **kw):
            gen = func(*args, **kw)
            gen.next()  # advance to the first yield so send() works
            return gen
        return wrapper

    @consumer
    def endStage():
        while True:
            for i in range(0, 4):
                print((yield), sep='', end='')
            print("\n", sep='', end='')


    def genStage(s, target):
        for line in s:
            for i in range(0, 3):
                target.send(line[i])  # push one character at a time


    if __name__ == '__main__':
        genStage(s, endStage())
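
For contrast, the same re-blocking can be written in the pull style, where the consumer's for-loop drives a generator. This is only a sketch for Python 3; `reblock` is a name made up here, not something from the thread.

```python
from itertools import chain, islice


def reblock(lines, width):
    """Yield strings of `width` characters drawn from the concatenated lines."""
    chars = chain.from_iterable(lines)  # flatten the cards into one char stream
    while True:
        chunk = "".join(islice(chars, width))  # pull up to `width` characters
        if not chunk:
            return  # input exhausted
        yield chunk


for row in reblock(["abc", "def", "ghi"], 4):
    print(row)
```

Here the input boundary (3 characters per card) disappears in `chain.from_iterable`, and the output boundary (4 characters per line) is handled by `islice`, so neither loop has to know about the other.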
 
rusi

This is a classic problem -- structure clash of parallel loops
<rest snipped>

Sorry wrong solution :D

The fidgetiness is entirely due to python not allowing C-style loops
like these:

    while ((c = getchar()) != EOF) { ... }

Putting it into coroutine form, it becomes something like the
following [untested, since I don't have the API]. Clearly the
fidgetiness is there as before, and now with extra coroutine plumbing:

    def genStage(term, target):
        page = 1
        while 1:
            r = api.GetSearch(term=term, page=page)
            if not r:
                break
            for tweet in r:
                target.send(tweet)
            page += 1


    @consumer
    def endStage():
        while True:
            process((yield))

    if __name__ == '__main__':
        genStage("foo", endStage())
 
Peter Otten

rusi said:
The fidgetiness is entirely due to python not allowing C-style loops
like these:

    while ((c = getchar()) != EOF) { ... }

    for c in iter(getchar, EOF):
        ...

rusi said:
Clearly the fidgetiness is there as before and now with extra coroutine
plumbing

Hmm, very funny...
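
The two-argument form of the built-in `iter(callable, sentinel)` calls `callable` repeatedly and stops as soon as it returns a value equal to `sentinel`. A small runnable sketch for Python 3 follows; `getchar`, `EOF`, and the in-memory `stream` are hypothetical stand-ins for their C namesakes.

```python
import io

stream = io.StringIO("hello")


def getchar():
    # Hypothetical stand-in for C's getchar(): one character per call,
    # and the empty string at end of input.
    return stream.read(1)


EOF = ""

chars = []
for c in iter(getchar, EOF):
    chars.append(c)

print(chars)
```

The same idiom fits the tweet loop whenever "no more data" is signalled by a specific return value, such as an empty list.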
 
Steve Howell

<rest snipped>

Sorry wrong solution :D

The fidgetiness is entirely due to python not allowing C-style loops
like these:
[...]

There are actually three fidgety things going on:

1. The API is 1-based instead of 0-based.
2. You don't know the number of pages in advance.
3. You want to process tweets, not pages of tweets.

Here's yet another take on the problem:

    from itertools import chain, count

    # wrap fidgety 1-based api
    def search(i):
        return api.GetSearch(term="foo", page=i+1)

    paged_tweets = (search(i) for i in count())

    # handle sentinel: stop at the first empty page
    paged_tweets = iter(paged_tweets.next, [])

    # flatten pages
    tweets = chain.from_iterable(paged_tweets)
    for tweet in tweets:
        process(tweet)
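
On Python 3, generators no longer have a `.next` method, so the sentinel step needs a small change. One way, sketched here under the assumption that `search` behaves like the wrapper above, is to bind the built-in `next` with `functools.partial`. The canned `data` inside `search` is hypothetical.

```python
from functools import partial
from itertools import chain, count


def search(i):
    # Hypothetical stand-in for the 0-based wrapper around api.GetSearch.
    data = [["a", "b"], ["c"], ["d", "e"]]
    return data[i] if i < len(data) else []


paged_tweets = (search(i) for i in count())

# Two-argument iter() calls next(paged_tweets) until it returns the sentinel [].
paged_tweets = iter(partial(next, paged_tweets), [])

# Flatten the pages into a single stream of tweets.
tweets = list(chain.from_iterable(paged_tweets))
print(tweets)
```

Using `paged_tweets.__next__` in place of `partial(next, paged_tweets)` would work as well; which one reads better is a matter of taste.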
 
rusi

On Nov 12, 12:09 pm, rusi wrote:
This is a classic problem -- structure clash of parallel loops
<rest snipped>
Sorry wrong solution :D
The fidgetiness is entirely due to python not allowing C-style loops
like these:

    while ((c = getchar()) != EOF) { ... }

[...]

[Steve Howell]
Nice on the whole -- thanks.
Could not the 1-based-ness be dealt with by using count(1)?
i.e. use

    paged_tweets = (api.GetSearch(term="foo", page=i) for i in count(1))

[Peter]
    for c in iter(getchar, EOF):
        ...

Thanks. Learnt something.
 
