Cleaner idiom for text processing?

Michael Ellis

Hi,
I have some data files with lines in space-delimited <name> <value>
format. There are multiple name-value pairs per line.

Is there a cleaner idiom than the following for reading each line into
an associative array for the purpose of accessing values by name?

for line in infile:
    tokens = line.split()
    dict = {}
    for i in range(0, len(tokens), 2): dict[tokens[i]] = tokens[i+1]
    do_something_with_values(dict['foo'], dict['bar'])

Thanks!
Mike Ellis
 
Peter Hansen

Michael said:
I have some data files with lines in space-delimited <name> <value>
format. There are multiple name-value pairs per line.

Is there a cleaner idiom than the following for reading each line into
an associative array for the purpose of accessing values by name?

for line in infile:
    tokens = line.split()
    dict = {}
    for i in range(0, len(tokens), 2): dict[tokens[i]] = tokens[i+1]
    do_something_with_values(dict['foo'], dict['bar'])


for line in infile:
    tokens = line.split()
    d = dict(zip(tokens[::2], tokens[1::2]))
    do_something_with_values(...)

By the way, don't use "dict" as a variable name. It's already
a builtin factory function to create dictionaries.
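
To see why the name matters, here is a small sketch in Python 3 (the helper name build_with_shadowed_name and the sample tokens are made up for illustration). Once dict is rebound to a dictionary instance, the dict(zip(...)) call above stops working:

```python
tokens = "foo 1 bar 2".split()

# Even-indexed tokens are names, odd-indexed tokens are values.
pairs = dict(zip(tokens[::2], tokens[1::2]))

def build_with_shadowed_name():
    dict = {}  # rebinds the name locally, hiding the builtin
    try:
        return dict(zip(tokens[::2], tokens[1::2]))
    except TypeError as exc:
        return exc  # a dict instance is not callable

shadowed_result = build_with_shadowed_name()
```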

-Peter
 
Jeff Epler

I'd move the logic that turns the file into the form you want to
process into its own generator, under the assumption that you'll use
this code from multiple places.

def process_tokens(f):
    for line in f:
        tokens = line.split()
        d = {}
        for i in range(0, len(tokens), 2): d[tokens[i]] = tokens[i+1]
        yield d

Then,

for d in process_tokens(infile):
    do_something_with_values(d['foo'], d['bar'])

If the specific keys you want from each line are constant for the loop,
have process_tokens yield those items in sequence:

def process_tokens2(f, keys):
    for line in f:
        tokens = line.split()
        d = {}
        for i in range(0, len(tokens), 2): d[tokens[i]] = tokens[i+1]
        yield [d[k] for k in keys]

for foo, bar in process_tokens2(infile, ["foo", "bar"]):
    do_something_with_values(foo, bar)

Jeff
 
Peter Otten

Michael said:
I have some data files with lines in space-delimited <name> <value>
format. There are multiple name-value pairs per line.

Is there a cleaner idiom than the following for reading each line into
an associative array for the purpose of accessing values by name?

for line in infile:
    tokens = line.split()
    dict = {}
    for i in range(0, len(tokens), 2): dict[tokens[i]] = tokens[i+1]
    do_something_with_values(dict['foo'], dict['bar'])


Yet another way to create the dictionary:
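
The snippet for this post is missing from this copy of the thread. Going by the replies below, which discuss passing the same iterator to izip() twice, it was presumably along these lines. This is only a sketch in Python 3, where the built-in zip() is lazy like itertools.izip(); the sample line is made up:

```python
line = "foo 1 bar 2 baz 3"

it = iter(line.split())
# zip() pulls one item from each argument in turn; both arguments are the
# same iterator, so each pull advances it and tokens come out as pairs.
d = dict(zip(it, it))
```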

Peter
 
Duncan Booth

Peter Otten said:
Yet another way to create the dictionary:

You can also do that without using itertools:

However, I'm not sure I trust either of these solutions. I know that
intuitively it would seem that both zip and izip should act in this way,
but is the order of consuming the inputs actually guaranteed anywhere?
 
Peter Otten

Duncan said:
You can also do that without using itertools:

The advantage of my solution is that it omits the intermediate list.

Duncan said:
However, I'm not sure I trust either of these solutions. I know that
intuitively it would seem that both zip and izip should act in this way,
but is the order of consuming the inputs actually guaranteed anywhere?

I think an optimization that changes the order assumed above would be
*really* weird. When passing around an iterator, you could never be sure
whether the previous consumer just read 10 items ahead for efficiency
reasons. Allowing such optimizations would in effect limit iterators to for
loops. Moreover, the calling function has no way of knowing whether that
would really be efficient as the first iterator might take a looong time to
yield the next value while the second could just throw a StopIteration. If
a way around this is ever found, checking izip()'s arguments for identity
is only a minor complication.

But if that lets you sleep better at night, change Peter Hansen's suggestion
to use islice():
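
The islice() code is also missing here. A sketch of what it might look like in Python 3 (zip standing in for izip, sample tokens invented), with each islice() given its own start offset so that nothing depends on the order in which zip() consumes its arguments:

```python
from itertools import islice

tokens = "foo 1 bar 2 baz 3".split()

names = islice(tokens, 0, None, 2)   # items 0, 2, 4, ...
values = islice(tokens, 1, None, 2)  # items 1, 3, 5, ...
d = dict(zip(names, values))
```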

However, this is less readable (probably slower too) than the original with
normal slices and therefore not worth the effort for small lists like (I
guess) those in the OP's problem.

Peter
 
Peter Hansen

Peter said:
But if that lets you sleep better at night, change Peter Hansen's suggestion
to use islice():

No! Yours is much more elegant! Wonderful... zero overhead.

-Peter
 
Duncan Booth

Peter Otten said:
I think an optimization that changes the order assumed above would be
*really* weird. When passing around an iterator, you could never be
sure whether the previous consumer just read 10 items ahead for
efficiency reasons. Allowing such optimizations would in effect limit
iterators to for loops. Moreover, the calling function has no way of
knowing whether that would really be efficient as the first iterator
might take a looong time to yield the next value while the second
could just throw a StopIteration. If a way around this is ever found,
checking izip()'s arguments for identity is only a minor complication.

What happens if someone works out that izip can be made much faster by
consuming its iterators from right to left instead of left to right? That
isn't nearly as far fetched as reading ahead.

Passing the same iterator multiple times to izip is a pretty neat idea, but
I would still be happier if the documentation explicitly stated that it
consumes its arguments left to right.
 
Peter Hansen

Duncan said:
What happens if someone works out that izip can be made much faster by
consuming its iterators from right to left instead of left to right? That
isn't nearly as far fetched as reading ahead.

Passing the same iterator multiple times to izip is a pretty neat idea, but
I would still be happier if the documentation explicitly stated that it
consumes its arguments left to right.

Or as an interim measure: write lots of elegant code using that
technique, and then if anyone suggests changing the way it
works the rest of the world will shout "no, it will break code!".
;-)

-Peter
 
Peter Otten

Duncan said:
What happens if someone works out that izip can be made much faster by
consuming its iterators from right to left instead of left to right? That
isn't nearly as far fetched as reading ahead.

This would also affect the calling code, when the arguments are iterators
(just swapping arguments to simulate the effect of the proposed
optimization):
>>> ia, ib = iter(range(3)), iter([])
>>> zip(ia, ib)
[]
>>> ia.next()
1
>>> ia, ib = iter(range(3)), iter([])
>>> zip(ib, ia)
[]
>>> ia.next()
0

Optimizations that are visible from the calling code always seem a bad idea
and against Python's philosophy. I admit the above reuse pattern is not
very likely, though.
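
The same experiment in Python 3, as a sketch: zip() is lazy there, so list() forces it, and next(ia) replaces ia.next(). Variable names other than ia and ib are made up:

```python
ia, ib = iter(range(3)), iter([])
first = list(zip(ia, ib))   # ia is asked first and gives up 0 before ib runs dry
after_first = next(ia)

ia, ib = iter(range(3)), iter([])
second = list(zip(ib, ia))  # ib runs dry immediately; ia is never touched
after_second = next(ia)
```

Both calls produce an empty list, but the first one costs ia an element and the second does not.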

Duncan said:
Passing the same iterator multiple times to izip is a pretty neat idea,
but I would still be happier if the documentation explicitly stated that
it consumes its arguments left to right.

From the itertools documentation:

"""
izip(*iterables)

Make an iterator that aggregates elements from each of the iterables. Like
zip() except that it returns an iterator instead of a list. Used for
lock-step iteration over several iterables at a time. Equivalent to:

def izip(*iterables):
    iterables = map(iter, iterables)
    while iterables:
        result = [i.next() for i in iterables]
        yield tuple(result)
"""

I'd say the "Equivalent to [reference implementation]" statement should meet
your request.
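
For anyone trying this on Python 3, a rough port of that reference code might look like the following. This is a sketch, not the library source: izip_sketch is a made-up name, and the explicit try/except is there because a StopIteration escaping a generator body is an error in current Python 3 versions (PEP 479).

```python
def izip_sketch(*iterables):
    iterators = [iter(it) for it in iterables]
    while iterators:
        result = []
        for it in iterators:  # arguments are consumed left to right
            try:
                result.append(next(it))
            except StopIteration:
                return  # stop as soon as any input is exhausted
        yield tuple(result)

it = iter("foo 1 bar 2".split())
d = dict(izip_sketch(it, it))
```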

Peter
 
Peter Otten

Peter said:
No! Yours is much more elegant! Wonderful... zero overhead.

I should mention I've picked up the trick on c.l.py (don't remember the
poster).

Peter
 
Paul Rubin

for line in infile:
    tokens = line.split()
    dict = {}
    for i in range(0, len(tokens), 2): dict[tokens[i]] = tokens[i+1]
    do_something_with_values(dict['foo'], dict['bar'])


Here's a pessimized version:

for line in infile:
    tokens = line.split()
    d = {}
    while tokens:
        name = tokens.pop(0)
        value = tokens.pop(0)
        d[name] = value
    do_something_with_values(d['foo'], d['bar'])
 
Peter Hansen

Paul said:
for line in infile:
    tokens = line.split()
    dict = {}
    for i in range(0, len(tokens), 2): dict[tokens[i]] = tokens[i+1]
    do_something_with_values(dict['foo'], dict['bar'])



Here's a pessimized version:

for line in infile:
    tokens = line.split()
    d = {}
    while tokens:
        name = tokens.pop(0)
        value = tokens.pop(0)
        d[name] = value
    do_something_with_values(d['foo'], d['bar'])


Paul, I don't understand why you say "pessimized".
The only potential flaw in the original and most
(all) of the other solutions seems to be present
in yours as well: if there are an odd number of
tokens on a line an exception will be raised.
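
One way to make the odd-token case explicit, sketched in Python 3. parse_pairs is a hypothetical helper, and raising ValueError is just one possible policy:

```python
def parse_pairs(line):
    tokens = line.split()
    if len(tokens) % 2:
        # A dangling name with no value: fail loudly instead of guessing.
        raise ValueError("odd number of tokens: %r" % line)
    return dict(zip(tokens[::2], tokens[1::2]))

good = parse_pairs("foo 1 bar 2")
try:
    parse_pairs("foo 1 bar")
    odd_error = None
except ValueError as exc:
    odd_error = exc
```

Without the length check, the zip-based variants would silently drop the dangling name rather than raise.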

-Peter
 
Michele Simionato

Peter Otten said:
Yet another way to create the dictionary:


Peter

BTW, the name I have seen for this kind of thing is chop:

>>> import itertools
>>> def chop(it, n):
...     tup = (iter(it),)*n
...     return itertools.izip(*tup)
...
>>> list(chop([1,2,3,4,5,6],3))
[(1, 2, 3), (4, 5, 6)]
>>> list(chop([1,2,3,4,5,6],2))
[(1, 2), (3, 4), (5, 6)]
>>> list(chop([1,2,3,4,5,6],1))
[(1,), (2,), (3,), (4,), (5,), (6,)]

(I don't remember if this is already in itertools ... )
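
A Python 3 sketch of the same idea, assuming the built-in zip() in place of itertools.izip(); the variable names are made up:

```python
def chop(it, n):
    # n references to one shared iterator: each output tuple
    # takes n consecutive items from it.
    shared = (iter(it),) * n
    return zip(*shared)

triples = list(chop([1, 2, 3, 4, 5, 6], 3))
leftover = list(chop([1, 2, 3, 4, 5], 2))
```

With zip(), an incomplete trailing group is dropped, so the second call yields only two pairs.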

Michele Simionato
 
has

Hi,
I have some data files with lines in space-delimited <name> <value>
format. There are multiple name-value pairs per line.

Is there a cleaner idiom than the following for reading each line into
an associative array for the purpose of accessing values by name?

If it's terseness you want:

import re

for line in infile:
    # \S+ rather than [^ ]+ so the trailing newline is not captured in the last value
    d = dict(re.findall(r'(\S+)\s+(\S+)', line))
    do_something_with_values(d['foo'], d['bar'])

Hardly worth worrying about though...
 
Peter Otten

Michele said:
>>> import itertools
>>> def chop(it, n):
...     tup = (iter(it),)*n
...     return itertools.izip(*tup)
...
>>> list(chop([1,2,3,4,5,6],3))
[(1, 2, 3), (4, 5, 6)]
>>> list(chop([1,2,3,4,5,6],2))
[(1, 2), (3, 4), (5, 6)]
>>> list(chop([1,2,3,4,5,6],1))
[(1,), (2,), (3,), (4,), (5,), (6,)]

(I don't remember if this is already in itertools ... )

I don't think so. IMO the itertools examples page would be a better place
for the above than the cookbook. If an example had to be removed in
exchange, that would be iteritems(). Raymond, are you looking?

Peter
 
