getting n items at a time from a generator

Kugutsumen

I am relatively new to the Python language and I am afraid I am missing
some clever construct or built-in way equivalent to my 'chunk'
generator below.

def chunk(size, items):
    """generate N items from a generator."""
    chunk = []
    count = 0
    while True:
        try:
            item = items.next()
            count += 1
        except StopIteration:
            yield chunk
            break
        chunk.append(item)
        if not (count % size):
            yield chunk
            chunk = []
            count = 0
>>> for i in chunk(7, iter(range(30))):
...     print i
...
[0, 1, 2, 3, 4, 5, 6]
[7, 8, 9, 10, 11, 12, 13]
[14, 15, 16, 17, 18, 19, 20]
[21, 22, 23, 24, 25, 26, 27]
[28, 29]

In my real world project, I have over 250 million items that are too
big to fit in memory; they are processed and later used to update
records in a database... to minimize disk IO, I found it was more
efficient to process them by batch or "chunk" of 50,000 or so. Hence

Is this the proper way to do this?
 
Paul Hankin

The itertools module is always a good place to look when you've got a
complicated generator.

import itertools
import operator

def chunk(N, items):
    "Group items in chunks of N"
    def clump((n, _)):
        return n // N
    for _, group in itertools.groupby(enumerate(items), clump):
        yield itertools.imap(operator.itemgetter(1), group)

for ch in chunk(7, range(30)):
    print list(ch)


I've changed chunk to return a generator rather than building a list
which is probably only going to be iterated over. But if you prefer
the list version, replace 'itertools.imap' with 'map'.
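
For readers on Python 3, where tuple parameters such as `def clump((n, _))` and `itertools.imap` no longer exist, the same groupby idea might be sketched as follows. This is an adaptation rather than the code posted above, and it returns lists, which is the variant the original poster said he preferred.

```python
import itertools

def chunk(n, items):
    """Yield successive lists of up to n items from any iterable."""
    # Pair each item with its position, then group by position // n.
    indexed = enumerate(items)
    for _, group in itertools.groupby(indexed, key=lambda pair: pair[0] // n):
        # Build the list now: groupby invalidates a group once the
        # outer loop moves on to the next key.
        yield [item for _, item in group]

for ch in chunk(7, range(30)):
    print(ch)
```

Only one chunk is held in memory at a time, so the approach still suits input that is too large to load in full.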
 
Kugutsumen

Thanks, I am going to take a look at itertools.
I prefer the list version since I need to buffer that chunk in memory
at this point.
 
Steven D'Aprano

Try this instead:


import itertools

def chunk(iterator, size):
    # I prefer the argument order to be the reverse of yours.
    while True:
        chunk = list(itertools.islice(iterator, size))
        if chunk: yield chunk
        else: break


And in use:
>>> for L in chunk(iter(range(30)), 7):
...     print L
...
[0, 1, 2, 3, 4, 5, 6]
[7, 8, 9, 10, 11, 12, 13]
[14, 15, 16, 17, 18, 19, 20]
[21, 22, 23, 24, 25, 26, 27]
[28, 29]
 
Terry Jones

Kugutsumen> Thanks, I am going to take a look at itertools. I prefer the
Kugutsumen> list version since I need to buffer that chunk in memory at
Kugutsumen> this point.

Also consider this solution from O'Reilly's Python Cookbook (2nd Ed.) p705

    from itertools import izip

    def chop(iterable, length=2):
        return izip(*(iter(iterable),) * length)

Terry
 
Kugutsumen

Thanks Terry,

However, chop ignores the remainder of the data in the example.
>>> for ch in chop(range(30), 7):
...     print ch
...
(0, 1, 2, 3, 4, 5, 6)
(7, 8, 9, 10, 11, 12, 13)
(14, 15, 16, 17, 18, 19, 20)
(21, 22, 23, 24, 25, 26, 27)

k
 
Kugutsumen

Steven, I really like your version since I've managed to understand it
in one pass.
Paul's version works but is too obscure to read for me :)

Thanks a lot again.
 
Shane Geiger

# http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/496958

from itertools import *
def group(lst, n):
    """group([0,3,4,10,2,3], 2) => iterator

    Group an iterable into an n-tuples iterable. Incomplete tuples
    are padded with Nones e.g.
    [(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, None, None)]
    """
    iters = tee(lst, n)
    iters = [iters[0]] + [chain(iter, repeat(None))
                          for iter in iters[1:]]
    return izip(
        *[islice(iter, i, None, n) for i, iter
          in enumerate(iters)])

import string
for grp in list(group(string.letters,25)):
    print grp

"""
('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n',
'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y')
('z', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X')
('Y', 'Z', None, None, None, None, None, None, None, None, None, None,
None, None, None, None, None, None, None, None, None, None, None, None,
None)

"""



--
Shane Geiger
IT Director
National Council on Economic Education
(e-mail address removed) | 402-438-8958 | http://www.ncee.net

Leading the Campaign for Economic and Financial Literacy
 
Tim Roberts

Kugutsumen said:
I am relatively new to the Python language and I am afraid to be missing
some clever construct or built-in way equivalent to my 'chunk'
generator below.

I have to say that I have found this to be a surprisingly common need as
well. Would this be an appropriate construct to add to itertools?
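
For readers on current Python: as far as I know, `itertools.batched` was added in Python 3.12 and yields tuples of up to n items. A minimal sketch that defers to it when present and otherwise falls back to an islice loop is shown below. The name `batched_compat` is made up for this example and is not part of any library.

```python
import itertools

def batched_compat(iterable, n):
    """Yield tuples of up to n items (hypothetical helper name)."""
    if n < 1:
        raise ValueError("n must be at least one")
    if hasattr(itertools, "batched"):
        # Python 3.12+: let the standard library do the work.
        yield from itertools.batched(iterable, n)
        return
    # Older versions: take n items at a time from one shared iterator.
    iterator = iter(iterable)
    while True:
        batch = tuple(itertools.islice(iterator, n))
        if not batch:
            return
        yield batch

print(list(batched_compat(range(10), 4)))
# [(0, 1, 2, 3), (4, 5, 6, 7), (8, 9)]
```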
 
Igor V. Rafienko

[ Terry Jones ]

[ ... ]
Also consider this solution from O'Reilly's Python Cookbook (2nd Ed.) p705

    def chop(iterable, length=2):
        return izip(*(iter(iterable),) * length)


Is this *always* guaranteed by the language to work? Should the
iterator returned by izip() change the implementation and evaluate the
underlying iterators in, say, reverse order, the solution would no
longer function, would it? Or is it something the return value of
izip() would never do?

(I am just trying to understand the solution, not criticize it. Took a
while to parse the argument(s) to izip in the example).





ivr
 
Terry Jones

Hi Igor

Igor> Is this *always* guaranteed by the language to work? Should the
Igor> iterator returned by izip() change the implementation and evaluate
Igor> the underlying iterators in, say, reverse order, the solution would
Igor> no longer function, would it? Or is it something the return value of
Igor> izip() would never do?

Igor> (I am just trying to understand the solution, not criticize it. Took
Igor> a while to parse the argument(s) to izip in the example).

I had to look at it a bit too. I actually deleted the comment I wrote
about it in my own code before posting it here and decided to simply say
"consider" in the above instead :)

As far as I understand it, you're right. The docstring for izip doesn't
guarantee that it will pull items from the passed iterables in any order.
So an alternate implementation of izip might produce other results. If it
did them in reverse order you'd get each n-chunk reversed, etc.

Terry
 
Raymond Hettinger

    def chop(iterable, length=2):
Is this *always* guaranteed by the language to work?

Yes!

Users requested this guarantee, and I agreed. The docs now explicitly
guarantee this behavior.


Raymond
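
To make the trick easier to parse, here is a sketch in Python 3 spelling, where the built-in `zip` is lazy and takes the place of `izip`. The intermediate variable name is only for illustration. It shows that the argument tuple holds one iterator object repeated, so each output tuple is filled by consecutive `next()` calls taken left to right.

```python
def chop(iterable, length=2):
    """Python 3 spelling of the Cookbook recipe quoted in this thread."""
    iterator = iter(iterable)
    # The tuple contains the SAME iterator object `length` times.
    same_iterator_repeated = (iterator,) * length
    assert all(it is iterator for it in same_iterator_repeated)
    # zip pulls one item per argument, left to right, so every output
    # tuple is `length` consecutive items from the single iterator.
    return zip(*same_iterator_repeated)

print(list(chop(range(7), 3)))
# [(0, 1, 2), (3, 4, 5)]   (the trailing 6 is dropped)
```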
 
Raymond Hettinger

Also consider this solution from O'Reilly's Python Cookbook (2nd Ed.) p705
However, chop ignores the remainder of the data in the example.

There is a recipe in the itertools docs which handles the odd-length
data at the end:

from itertools import chain, izip, repeat

def grouper(n, iterable, padvalue=None):
    "grouper(3, 'abcdefg', 'x') --> ('a','b','c'), ('d','e','f'), ('g','x','x')"
    return izip(*[chain(iterable, repeat(padvalue, n-1))]*n)


Raymond
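
In Python 3 the same padding behaviour can be sketched with `itertools.zip_longest`. This is an adaptation of the recipe above, keeping its argument order, and not necessarily the wording of the current documentation.

```python
from itertools import zip_longest

def grouper(n, iterable, padvalue=None):
    """grouper(3, 'abcdefg', 'x') --> ('a','b','c'), ('d','e','f'), ('g','x','x')"""
    # n references to one iterator; zip_longest pads the last short tuple.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=padvalue)

print(list(grouper(3, "abcdefg", "x")))
# [('a', 'b', 'c'), ('d', 'e', 'f'), ('g', 'x', 'x')]
```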
 
NickC

To work with an arbitrary iterable, Steven's chunk needs an extra line
at the start so that the same iterator is consumed each time around
the loop. It may also be better to make the final chunk the same
length as the others. If you would rather know where the data really
ends, leave out the parts that add the padding object.

import itertools

def chunk(iterable, size, pad=None):
    iterator = iter(iterable)
    padding = [pad]
    while True:
        chunk = list(itertools.islice(iterator, size))
        if chunk:
            yield chunk + (padding*(size-len(chunk)))
        else:
            break

Cheers,
Nick.
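
A small Python 3 sketch of why the extra `iter()` line matters, using the unpadded variant described above. The name `chunk_no_iter` is made up for the comparison. Because that version never terminates when handed a list, the example only looks at its first three chunks.

```python
import itertools

def chunk_no_iter(iterable, size):
    """Without iter(): only correct when the argument is already an iterator."""
    while True:
        block = list(itertools.islice(iterable, size))
        if not block:
            return
        yield block

def chunk(iterable, size):
    """Unpadded variant with the extra iter() line."""
    iterator = iter(iterable)
    while True:
        block = list(itertools.islice(iterator, size))
        if not block:
            return
        yield block

data = [0, 1, 2, 3, 4]
# islice restarts from the front of a list on every call, so the first
# version repeats the same chunk forever; take just three of them.
print(list(itertools.islice(chunk_no_iter(data, 2), 3)))  # [[0, 1], [0, 1], [0, 1]]
print(list(chunk(data, 2)))                               # [[0, 1], [2, 3], [4]]
```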
 
Paul Rubin

Tim Roberts said:
I have to say that I have found this to be a surprisingly common need as
well. Would this be an appropriate construct to add to itertools?

I'm in favor.
 
Shane Geiger

Paul said:
I'm in favor.


I am ecstatic about the idea of getting n items at a time from a
generator! This would eliminate the use of less elegant functions to do
this sort of thing which I would do even more frequently if it were
easier.

Is it possible that this syntax for generator expressions could be adopted?
Traceback (most recent call last):


While on the topic of generators:

Something else I have longed for is assignment within a while loop. (I
realize this might be more controversial and might have been avoided on
purpose, but I wasn't around for that discussion.)

.... print a,b,c
....
This Is A
Sentence With PadValue



 
Dennis Lee Bieber

Shane Geiger said:
Something else I have longed for is assignment within a while loop. (I
realize this might be more controversial and might have been avoided on
purpose, but I wasn't around for that discussion.)

Assignment is a statement, not an expression, so I doubt you'll ever
see it...

However, the "for" loop might be able to support it IF the values
are returned as a tuple...

for (a, b, c) in generator.next(3, "padvalue"):
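
For later readers: Python 3.8 did add assignment expressions (PEP 572), so assignment inside a while condition is now possible. A minimal sketch of the chunking loop written that way, assuming Python 3.8 or newer:

```python
import itertools

def chunk(iterable, size):
    """Yield lists of up to `size` items, using an assignment expression."""
    iterator = iter(iterable)
    # `:=` assigns and tests in one step; an empty list ends the loop.
    while block := list(itertools.islice(iterator, size)):
        yield block

words = "This Is A Sentence With".split()
for block in chunk(words, 3):
    print(block)
# ['This', 'Is', 'A']
# ['Sentence', 'With']
```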
--
Wulfraed Dennis Lee Bieber KD6MOG
(e-mail address removed) (e-mail address removed)
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: (e-mail address removed))
HTTP://www.bestiaria.com/
 
