Pythonic infinite for loop?

Chris Angelico

Apologies for interrupting the vital off-topic discussion, but I have
a real Python question to ask.

I'm doing something that needs to scan a dictionary for elements that
have a particular beginning and a numeric tail, and turn them into a
single list with some processing. I have a function parse_kwdlist()
which takes a string (the dictionary's value) and returns the content
I want out of it, so I'm wondering what the most efficient and
Pythonic way to do this is.

My first draft looks something like this. The input dictionary is
called dct, the output list is lst.

lst=[]
for i in xrange(1,10000000): # arbitrary top, don't like this
    try:
        lst.append(parse_kwdlist(dct["Keyword%d"%i]))
    except KeyError:
        break

I'm wondering two things. One, is there a way to make an xrange object
and leave the top off? (Sounds like I'm risking the numbers
evaporating or something.) And two, can the entire thing be turned
into a list comprehension or something? Generally any construct with a
for loop that appends to a list is begging to become a list comp, but
I can't see how to do that when the input comes from a dictionary.

In the words of Adam Savage: "Am I about to feel really, really stupid?"

Thanks in advance for help... even if it is just "hey you idiot, you
forgot about X"!

Chris Angelico
 
Nobody

One, is there a way to make an xrange object and leave the top off?
itertools.count()
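
A minimal sketch of the original loop with itertools.count() in place of the capped xrange. The parse_kwdlist here is only a stand-in (the real one was not posted) and the sample dct is made up for illustration; it is written to run on Python 3 as well.

```python
from itertools import count

def parse_kwdlist(s):
    # Stand-in for the real parser, which is not shown in this thread.
    return s.split(",")

# Hypothetical sample row; note the gap at Keyword3.
dct = {"Keyword1": "a,b", "Keyword2": "c", "Keyword4": "d", "Other": "x,y"}

lst = []
for i in count(1):  # counts 1, 2, 3, ... with no upper bound
    try:
        lst.append(parse_kwdlist(dct["Keyword%d" % i]))
    except KeyError:
        break  # the first missing key ends the scan

print(lst)  # [['a', 'b'], ['c']]
```

Keyword4 is never reached because the scan stops at the first gap.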

And two, can the entire thing be turned into a list comprehension or
something? Generally any construct with a for loop that appends to a
list is begging to become a list comp, but I can't see how to do that
when the input comes from a dictionary.

The source of a list comprehension can be any iterable; it doesn't have to
be a list. So:

lst = [parse_kwdlist(dct[key]) for key in dct]
or:
lst = [parse_kwdlist(val) for val in dct.itervalues()]
 
Steven D'Aprano

Apologies for interrupting the vital off-topic discussion, but I have a
real Python question to ask.

Sorry, you're in the wrong forum for that.

*wink*

[...]
My first draft looks something like this. The input dictionary is called
dct, the output list is lst.

lst=[]
for i in xrange(1,10000000): # arbitrary top, don't like this
    try:
        lst.append(parse_kwdlist(dct["Keyword%d"%i]))
    except KeyError:
        break

I'm wondering two things. One, is there a way to make an xrange object
and leave the top off?

No. But you can use an itertools.count([start=0]) object, and then catch
the KeyError when you pass the end of the dict. But assuming keys are
consecutive, better to do this:

lst = [parse_kwdlist(dct["Keyword%d"%i]) for i in xrange(1, len(dct)+1)]

If you don't care about the order of the results:

lst = [parse_kwdlist(value) for value in dct.values()]
 
Chris Angelico

Thanks for the responses, all! In its strictest sense,
itertools.count() seems to be what I'm after, but may not be what I
need.

No. But you can use an itertools.count([start=0]) object, and then catch
the KeyError when you pass the end of the dict. But assuming keys are
consecutive, better to do this:

lst = [parse_kwdlist(dct["Keyword%d"%i]) for i in xrange(1, len(dct)+1)]

Ah, didn't think to use len(dct) as the top! And yes, the keys will be
consecutive (or rather, if someone omits Keyword5 then I can
justifiably ignore Keyword6).

If you don't care about the order of the results:

lst = [parse_kwdlist(value) for value in dct.values()]

No, order will matter (I have to pick up the first N that fit certain
conditions, after parsing).

The dictionary is potentially a lot larger than this particular set of
values (it's a mapping of header:value for a row of a user-provided
CSV file). Does this make a difference to the best option? (Currently
I'm looking at "likely" figures of 60+ keys in the dictionary and 3-8
postage options to pick up, but both of those could increase
significantly before production.)

Efficiency is important, though not king; this whole code will be
inside a loop. But readability is important too.

I can't just give all of dct.values() to parse_kwdlist; the function
parses a particular format of text string, and it's entirely possible
that other values would match that format (some of them are pure
free-form text). This has to get only the ones starting with Keyword,
and in order. Steven, the line you suggested:

lst = [parse_kwdlist(dct["Keyword%d"%i]) for i in xrange(1, len(dct)+1)]

will bomb with KeyError when it hits the first one that isn't present,
I assume. Is there an easy way to say "and when you reach any
exception, not just StopIteration, return the list"? (I could wrap the
whole "lst = " in a try/except, but then it won't set lst at all.)

If not, I think I'll go with:

for i in xrange(1,len(dct)+1):

and otherwise as per OP. Having a check for "if key%d in dct" going
all the way up seems like an odd waste of effort (or maybe I'm wrong
there).

Chris Angelico
 
Paul Rubin

That loop will exit at the first gap in the sequence. If that's what
you want, you could try (untested):

from itertools import count, takewhile

# keys Keyword1, Keyword2, ... up to the first one missing from dct
seq = takewhile(lambda k: k in dct, ('Keyword%d'%n for n in count(1)))
lst = map(dct.get, seq)

This does 2 lookups per key, which you could avoid by making the code
uglier (untested):

sentinel = object()
seq = (dct.get('Keyword%d'%i,sentinel) for i in count(1))
lst = list(takewhile(lambda x: x != sentinel, seq))
 
Chris Angelico

This does 2 lookups per key, which you could avoid by making the code
uglier (untested):

  sentinel = object()
  seq = (dct.get('Keyword%d'%i,sentinel) for i in count(1))
  lst = list(takewhile(lambda x: x != sentinel, seq))

If I understand this code correctly, that's creating generators,
right? It won't evaluate past the sentinel at all?

That might well be what I'm looking for. A bit ugly, but efficient and
compact. And I can bury some of the ugliness away.

Chris Angelico
 
Paul Rubin

Chris Angelico said:
If I understand this code correctly, that's creating generators,
right? It won't evaluate past the sentinel at all?

Right, it should stop after hitting the sentinel once.

That might well be what I'm looking for. A bit ugly, but efficient and
compact. And I can bury some of the ugliness away.

It occurs to me, operator.ne might be a little faster than the
interpreted lambda.
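
One way to check the laziness claim is to count the lookups. This is only a sketch: CountingDict and the sample data are invented for the demonstration, and the predicate uses an identity test because the sentinel is a unique object.

```python
from itertools import count, takewhile

class CountingDict(dict):
    """Dict that records how many times get() is called (illustration only)."""
    lookups = 0

    def get(self, key, default=None):
        self.lookups += 1
        return dict.get(self, key, default)

# Made-up data with a gap at Keyword3.
dct = CountingDict(Keyword1="a", Keyword2="b", Keyword4="d")

sentinel = object()
seq = (dct.get("Keyword%d" % i, sentinel) for i in count(1))
lst = list(takewhile(lambda x: x is not sentinel, seq))

print(lst)          # ['a', 'b']
print(dct.lookups)  # 3: Keyword1, Keyword2, then the miss on Keyword3
```

The generator is only advanced as far as takewhile asks, so nothing past the first miss is looked up.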
 
Peter Otten

Paul said:
Right, it should stop after hitting the sentinel once.


It occurs to me, operator.ne might be a little faster than the
interpreted lambda.

Or operator.is_not as you are dealing with a singleton. You also need
functools.partial:

$ python -m timeit -s'sentinel = object(); predicate = lambda x: x != sentinel' 'predicate(None)'
1000000 loops, best of 3: 0.369 usec per loop

$ python -m timeit -s'sentinel = object(); predicate = lambda x: x is not sentinel' 'predicate(None)'
1000000 loops, best of 3: 0.314 usec per loop

$ python -m timeit -s'from functools import partial; from operator import ne; sentinel = object(); predicate = partial(ne, sentinel)' 'predicate(None)'
1000000 loops, best of 3: 0.298 usec per loop

$ python -m timeit -s'from functools import partial; from operator import is_not; sentinel = object(); predicate = partial(is_not, sentinel)' 'predicate(None)'
1000000 loops, best of 3: 0.252 usec per loop
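
For reference, a sketch of how the partial(is_not, sentinel) predicate could slot into the takewhile version posted earlier in the thread. The sample dct is made up, and no timing claim is implied beyond the figures above.

```python
from functools import partial
from itertools import count, takewhile
from operator import is_not

# Made-up data with a gap at Keyword3.
dct = {"Keyword1": "a", "Keyword2": "b", "Keyword4": "d"}

sentinel = object()
# not_sentinel(x) evaluates is_not(sentinel, x), i.e. "sentinel is not x"
not_sentinel = partial(is_not, sentinel)

seq = (dct.get("Keyword%d" % i, sentinel) for i in count(1))
lst = list(takewhile(not_sentinel, seq))

print(lst)  # ['a', 'b']
```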
 
Peter Otten

Chris said:
Apologies for interrupting the vital off-topic discussion, but I have
a real Python question to ask.

I'm doing something that needs to scan a dictionary for elements that
have a particular beginning and a numeric tail, and turn them into a
single list with some processing. I have a function parse_kwdlist()
which takes a string (the dictionary's value) and returns the content
I want out of it, so I'm wondering what the most efficient and
Pythonic way to do this is.

My first draft looks something like this. The input dictionary is
called dct, the output list is lst.

lst=[]
for i in xrange(1,10000000): # arbitrary top, don't like this
    try:
        lst.append(parse_kwdlist(dct["Keyword%d"%i]))
    except KeyError:
        break

I'm wondering two things. One, is there a way to make an xrange object
and leave the top off? (Sounds like I'm risking the numbers
evaporating or something.) And two, can the entire thing be turned
into a list comprehension or something? Generally any construct with a
for loop that appends to a list is begging to become a list comp, but
I can't see how to do that when the input comes from a dictionary.

In the words of Adam Savage: "Am I about to feel really, really stupid?"

Thanks in advance for help... even if it is just "hey you idiot, you
forgot about X"!

The initial data structure seems less than ideal. You might be able to
replace it with a dictionary like

{"Keyword": [value_for_keyword_1, value_for_keyword_2, ...]}

if you try hard enough.
 
Peter Otten

Chris said:
The initial data structure seems less than ideal. You might be able to
replace it with a dictionary like

{"Keyword": [value_for_keyword_1, value_for_keyword_2, ...]}

if you try hard enough.

The initial data structure comes from a CSV file, and is not under my
control.

ChrisA

Here's some code that might give you an idea. You can ignore the chunk
before 'import csv'; it is there to make the demo self-contained.

from contextlib import contextmanager

@contextmanager
def open(filename):
    assert filename == "example.csv"
    from StringIO import StringIO
    yield StringIO("""\
beta3,alpha1,alpha2,beta1,beta2
b31,a11,a21,b11,b21
b32,a12,a22,b12,b22
b33,a13,a23,b13,b23
b34,a14,a24,b14,b24
""")

import csv
import re

def parse_name(s):
    name, index = re.match(r"(.+?)(\d+)$", s).groups()
    return name, int(index)-1

with open("example.csv") as instream:
    rows = csv.reader(instream)
    header = next(rows)
    dct = {}
    appends = []
    for h in header:
        name, index = parse_name(h)
        outer = dct.setdefault(name, {})
        inner = outer.setdefault(index, [])
        appends.append(inner.append)
    for row in rows:
        for value, append in zip(row, appends):
            append(value)
    print dct
 
Steven D'Aprano

The initial data structure seems less than ideal. You might be able to
replace it with a dictionary like

{"Keyword": [value_for_keyword_1, value_for_keyword_2, ...]}

if you try hard enough.

The initial data structure comes from a CSV file, and is not under my
control.

There's no reason to duplicate the CSV file's design in your own data
structures though.
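
As a rough illustration of that idea, a row dict could be regrouped once into the {"Keyword": [...]} shape suggested above. The helper name group_numbered and the sample row are hypothetical; the regular expression is the one from the parse_name example earlier in the thread.

```python
import re

def group_numbered(row):
    """Collect 'NameN' columns into {'Name': [v1, v2, ...]} in numeric order."""
    numbered = {}
    for key, value in row.items():
        m = re.match(r"(.+?)(\d+)$", key)
        if m:  # keys without a numeric tail are skipped
            name, index = m.group(1), int(m.group(2))
            numbered.setdefault(name, {})[index] = value
    return {name: [vals[i] for i in sorted(vals)]
            for name, vals in numbered.items()}

# Made-up row as it might come out of a CSV header:value mapping.
row = {"Keyword2": "b", "Title": "x", "Keyword1": "a", "Keyword10": "j"}
grouped = group_numbered(row)

print(grouped["Keyword"])  # ['a', 'b', 'j']
```

Unlike the loop in the original post, this sketch does not stop at a gap in the numbering; it simply sorts whatever indices are present.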
 
Steven D'Aprano

The dictionary is potentially a lot larger than this particular set of
values (it's a mapping of header:value for a row of a user-provided CSV
file). Does this make a difference to the best option? (Currently I'm
looking at "likely" figures of 60+ keys in the dictionary and 3-8
postage options to pick up, but both of those could increase
significantly before production.)

SIXTY keys?

When you get to sixty thousand keys, it might take a few seconds to
process.

Efficiency is important, though not king; this whole code will be inside
a loop. But readability is important too.

I can't just give all of dct.values() to parse_kwdlist; the function
parses a particular format of text string, and it's entirely possible
that other values would match that format (some of them are pure
free-form text). This has to get only the ones starting with Keyword,
and in order. Steven, the line you suggested:

lst = [parse_kwdlist(dct["Keyword%d"%i]) for i in xrange(1, len(dct)+1)]

will bomb with KeyError when it hits the first one that isn't present, I
assume. Is there an easy way to say "and when you reach any exception,
not just StopIteration, return the list"? (I could wrap the whole "lst =
" in a try/except, but then it won't set lst at all.)

No.

You could use the list comprehension form at the cost of running over the
dict twice:

maxkey = 0
for key in dct:
    if key.startswith("Keyword"):
        maxkey = max(maxkey, int(key[7:]))
lst = [parse_kwdlist(dct["Keyword%d"%i]) for i in xrange(1, maxkey+1)]


but quite frankly, at this point I'd say, forget the one-liner, do it the
old-fashioned way with a loop. Or change your data structure: often you
can simplify a task drastically just by changing the way you store the
data.
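
If the loop is kept, one possible way to tuck it away is a small generator, so the call site can still be a list comprehension. This is a sketch only: numbered_values is a hypothetical helper, parse_kwdlist is a stand-in for the real parser, and the data is made up.

```python
from itertools import count

def numbered_values(dct, prefix):
    """Yield dct[prefix + '1'], dct[prefix + '2'], ... until a key is missing."""
    for i in count(1):
        try:
            value = dct["%s%d" % (prefix, i)]
        except KeyError:
            return  # ends the generator; the consumer just sees the sequence end
        yield value

def parse_kwdlist(s):
    # Stand-in for the real parser, which is not shown in this thread.
    return s.split(",")

# Made-up data with a gap at Keyword3.
dct = {"Keyword1": "a,b", "Keyword2": "c", "Keyword4": "d"}
lst = [parse_kwdlist(v) for v in numbered_values(dct, "Keyword")]

print(lst)  # [['a', 'b'], ['c']]
```

The KeyError is caught inside the generator, so the comprehension still produces the values collected up to the first gap.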
 
Roy Smith

Steven D'Aprano said:
for key in dct:
    if key.startswith("Keyword"):
        maxkey = max(maxkey, int(key[7:]))

I would make that a little easier to read, and less prone to "Did I
count correctly?" bugs with something like:

prefix = "Keyword"
n = len(prefix)
for key in dct:
    name, value = key[:n], key[n:]
    if name == prefix:
        maxkey = max(maxkey, int(value))
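
Wrapped up as a function, that might look like the sketch below. The function name max_numbered_key and the isdigit() guard are additions for illustration; the guard is there because a header such as "Keywords" or a bare "Keyword" would otherwise make int() raise ValueError.

```python
def max_numbered_key(dct, prefix="Keyword"):
    """Return the largest N for which a key prefix+N exists, or 0 if none."""
    n = len(prefix)
    maxkey = 0
    for key in dct:
        name, tail = key[:n], key[n:]
        # Only accept keys whose tail is made of digits.
        if name == prefix and tail.isdigit():
            maxkey = max(maxkey, int(tail))
    return maxkey

# Made-up headers, including two that share the prefix but have no number.
dct = {"Keyword1": "a", "Keyword12": "b", "Keywords": "c", "Keyword": "d", "Other3": "e"}
print(max_numbered_key(dct))  # 12
```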
 
Chris Angelico

SIXTY keys?

When you get to sixty thousand keys, it might take a few seconds to
process.

This whole code is inside a loop that we took, in smoke testing, to a
couple hundred million rows (I think), with the intention of having no
limit at all. So this might only look at 60-100 headers, but it will
be doing so in a tight loop.

ChrisA
 
Paul Rubin

Chris Angelico said:
This whole code is inside a loop that we took, in smoke testing, to a
couple hundred million rows (I think), with the intention of having no
limit at all. So this might only look at 60-100 headers, but it will
be doing so in a tight loop.

If you're talking about data sets that large, you should rethink the
concept of using python dictionaries with a key for every row. Try
importing your CSV into an SQL database and working from there instead.
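
A rough sketch of what that could look like with the standard-library sqlite3 and csv modules (Python 3). The CSV text, the table name csv_rows and the column names are all made up, and the header names are trusted as-is when building the CREATE TABLE statement, which real user-provided files may not allow.

```python
import csv
import io
import sqlite3

# Made-up CSV content standing in for the user-provided file.
csv_text = "id,Keyword1,Keyword2\n1,a,b\n2,c,\n"

rows = csv.reader(io.StringIO(csv_text))
header = next(rows)

conn = sqlite3.connect(":memory:")  # a real job would likely use a file path
columns = ", ".join('"%s"' % name for name in header)
placeholders = ", ".join("?" for _ in header)
conn.execute("CREATE TABLE csv_rows (%s)" % columns)
# The remaining rows are streamed straight from the reader into the table.
conn.executemany("INSERT INTO csv_rows VALUES (%s)" % placeholders, rows)

result = conn.execute('SELECT "Keyword1" FROM csv_rows ORDER BY "id"').fetchall()
print(result)  # [('a',), ('c',)]
```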
 