Cleaner idiom for text processing?

Michael Ellis · May 26, 2004

Hi,
I have some data files with lines in space-delimited <name> <value>
format. There are multiple name-value pairs per line.

Is there a cleaner idiom than the following for reading each line into
an associative array for the purpose of accessing values by name?

for line in infile:
    tokens = line.split()
    dict = {}
    for i in range(0, len(tokens), 2): dict[tokens[i]] = tokens[i+1]
    do_something_with_values(dict['foo'], dict['bar'])

Thanks!
Mike Ellis

Peter Hansen · May 26, 2004

Michael said:
I have some data files with lines in space-delimited <name> <value>
format. There are multiple name-value pairs per line.

Is there a cleaner idiom than the following for reading each line into
an associative array for the purpose of accessing values by name?

for line in infile:
    tokens = line.split()
    dict = {}
    for i in range(0, len(tokens), 2): dict[tokens[i]] = tokens[i+1]
    do_something_with_values(dict['foo'], dict['bar'])

for line in infile:
    tokens = line.split()
    d = dict(zip(tokens[::2], tokens[1::2]))
    do_something_with_values(...)

By the way, don't use "dict" as a variable name. It's already
a builtin factory function to create dictionaries.
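
To make the shadowing problem concrete, here is a small sketch in Python 3 syntax. The function names and the sample tokens are made up for illustration. Once `dict` is rebound to a dictionary instance, calling `dict(...)` in the same scope fails:

```python
def shadowed(tokens):
    dict = {}  # rebinding the name hides the builtin inside this function
    try:
        return dict(zip(tokens[::2], tokens[1::2]))
    except TypeError as exc:
        return "TypeError: %s" % exc


def not_shadowed(tokens):
    d = {}  # any other name leaves the builtin usable
    d.update(dict(zip(tokens[::2], tokens[1::2])))
    return d


tokens = "foo 1 bar 2".split()  # made-up sample line
print(shadowed(tokens))      # reports that a 'dict' object is not callable
print(not_shadowed(tokens))
```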

-Peter

Jeff Epler · May 26, 2004

I'd move the logic that turns the file into the form you want to
process, under the assumption that you'll use this code from multiple
places.
def process_tokens(f):
    for line in f:
        tokens = line.split()
        d = {}
        for i in range(0, len(tokens), 2): d[tokens[i]] = tokens[i+1]
        yield d
Then,
for d in process_tokens(infile):
    do_something_with_values(d['foo'], d['bar'])

If the specific keys you want from each line are constant for the loop,
have process_tokens yield those items in sequence:
def process_tokens2(f, keys):
    for line in f:
        tokens = line.split()
        d = {}
        for i in range(0, len(tokens), 2): d[tokens[i]] = tokens[i+1]
        yield [d[k] for k in keys]

for foo, bar in process_tokens2(infile, ("foo", "bar")):
    do_something_with_values(foo, bar)

Jeff

Peter Otten · May 26, 2004

Michael said:
I have some data files with lines in space-delimited <name> <value>
format. There are multiple name-value pairs per line.

Is there a cleaner idiom than the following for reading each line into
an associative array for the purpose of accessing values by name?

for line in infile:
    tokens = line.split()
    dict = {}
    for i in range(0, len(tokens), 2): dict[tokens[i]] = tokens[i+1]
    do_something_with_values(dict['foo'], dict['bar'])

Yet another way to create the dictionary:
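
The snippet itself is not preserved here. Going by the replies that follow, which discuss passing the same iterator to izip() more than once, the idea can be sketched roughly as below. This is a reconstruction, not necessarily the code as posted. It assumes Python 3, where the builtin zip() is lazy like itertools.izip(), and the function name and sample line are invented:

```python
def pairs_to_dict(line):
    """Build a dict from alternating name/value tokens on one line."""
    it = iter(line.split())
    # zip pulls one item for the first slot, then one for the second,
    # from the *same* iterator, so consecutive tokens end up paired.
    return dict(zip(it, it))


print(pairs_to_dict("foo 1 bar 2 baz 3\n"))
```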

Peter

Duncan Booth · May 26, 2004

Yet another way to create the dictionary:

You can also do that without using itertools:

However, I'm not sure I trust either of these solutions. I know that
intuitively it would seem that both zip and izip should act in this way,
but is the order of consuming the inputs actually guaranteed anywhere?

Peter Otten · May 26, 2004

Duncan said:
You can also do that without using itertools:

The advantage of my solution is that it omits the intermediate list.

However, I'm not sure I trust either of these solutions. I know that
intuitively it would seem that both zip and izip should act in this way,
but is the order of consuming the inputs actually guaranteed anywhere?

I think an optimization that changes the order assumed above would be
*really* weird. When passing around an iterator, you could never be sure
whether the previous consumer just read 10 items ahead for efficiency
reasons. Allowing such optimizations would in effect limit iterators to for
loops. Moreover, the calling function has no way of knowing whether that
would really be efficient as the first iterator might take a looong time to
yield the next value while the second could just throw a StopIteration. If
a way around this is ever found, checking izip()'s arguments for identity
is only a minor complication.

But if that lets you sleep better at night, change Peter Hansen's suggestion
to use islice():
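
The islice() snippet is not preserved either. A plausible sketch, assuming it simply replaced the two list slices in Peter Hansen's one-liner with lazy islice() calls over the token list (Python 3, so zip stands in for izip; the function name and sample line are invented):

```python
from itertools import islice


def pairs_to_dict_islice(line):
    tokens = line.split()
    names = islice(tokens, 0, None, 2)    # tokens[0], tokens[2], ...
    values = islice(tokens, 1, None, 2)   # tokens[1], tokens[3], ...
    # The two islice objects walk the list independently, so the result
    # does not depend on the order in which zip consumes its arguments.
    return dict(zip(names, values))


print(pairs_to_dict_islice("foo 1 bar 2 baz 3\n"))
```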

However, this is less readable (probably slower too) than the original with
normal slices and therefore not worth the effort for small lists like (I
guess) those in the OP's problem.

Peter

Peter Hansen · May 26, 2004

Peter said:
But if that lets you sleep better at night, change Peter Hansen's suggestion
to use islice():

No! Yours is much more elegant! Wonderful... zero overhead.

-Peter

Duncan Booth · May 26, 2004

Peter Otten said:
I think an optimization that changes the order assumed above would be
*really* weird. When passing around an iterator, you could never be
sure whether the previous consumer just read 10 items ahead for
efficiency reasons. Allowing such optimizations would in effect limit
iterators to for loops. Moreover, the calling function has no way of
knowing whether that would really be efficient as the first iterator
might take a looong time to yield the next value while the second
could just throw a StopIteration. If a way around this is ever found,
checking izip()'s arguments for identity is only a minor complication.

What happens if someone works out that izip can be made much faster by
consuming its iterators from right to left instead of left to right? That
isn't nearly as far fetched as reading ahead.

Passing the same iterator multiple times to izip is a pretty neat idea, but
I would still be happier if the documentation explicitly stated that it
consumes its arguments left to right.

Peter Hansen · May 26, 2004

Duncan said:
What happens if someone works out that izip can be made much faster by
consuming its iterators from right to left instead of left to right? That
isn't nearly as far fetched as reading ahead.

Passing the same iterator multiple times to izip is a pretty neat idea, but
I would still be happier if the documentation explicitly stated that it
consumes its arguments left to right.

Or as an interim measure: write lots of elegant code using that
technique, and then if anyone suggests changing the way it
works the rest of the world will shout "no, it will break code!".
;-)

-Peter

Peter Otten · May 26, 2004

Duncan said:
What happens if someone works out that izip can be made much faster by
consuming its iterators from right to left instead of left to right? That
isn't nearly as far fetched as reading ahead.

This would also affect the calling code, when the arguments are iterators
(just swapping arguments to simulate the effect of the proposed
optimization):

>>> ia, ib = iter(range(3)), iter([])
>>> zip(ia, ib)
[]
>>> ia.next()
1
>>> ia, ib = iter(range(3)), iter([])
>>> zip(ib, ia)
[]
>>> ia.next()
0

Optimizations that are visible from the calling code always seem like a bad
idea and against Python's philosophy. I admit the above reuse pattern is not
very likely, though.

Passing the same iterator multiple times to izip is a pretty neat idea,
but I would still be happier if the documentation explicitly stated that
it consumes its arguments left to right.

From the itertools documentation:

"""
izip(*iterables)

Make an iterator that aggregates elements from each of the iterables. Like
zip() except that it returns an iterator instead of a list. Used for
lock-step iteration over several iterables at a time. Equivalent to:

def izip(*iterables):
    iterables = map(iter, iterables)
    while iterables:
        result = [i.next() for i in iterables]
        yield tuple(result)
"""

I'd say the "Equivalent to [reference implementation]" statement should meet
your request.
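
For readers on Python 3, a rough equivalent of that reference implementation is sketched below. It is not the library's actual code: `izip_sketch` is a name invented here, and the explicit StopIteration handling is there because a StopIteration escaping a generator body is an error since Python 3.7. The part that matters for this discussion is the inner loop, which visits the iterators left to right:

```python
def izip_sketch(*iterables):
    iterators = [iter(it) for it in iterables]
    while iterators:
        result = []
        for it in iterators:           # left to right, one item each
            try:
                result.append(next(it))
            except StopIteration:
                return                 # the shortest input ends everything
        yield tuple(result)


nv = iter("foo 1 bar 2 baz 3".split())
print(dict(izip_sketch(nv, nv)))
```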

Peter

Peter Otten · May 26, 2004

Peter said:
No! Yours is much more elegant! Wonderful... zero overhead.

I should mention I've picked up the trick on c.l.py (don't remember the
poster).

Peter

Paul Rubin · May 27, 2004

for line in infile:
    tokens = line.split()
    dict = {}
    for i in range(0, len(tokens), 2): dict[tokens[i]] = tokens[i+1]
    do_something_with_values(dict['foo'], dict['bar'])

Here's a pessimized version:

for line in infile:
    tokens = line.split()
    d = {}
    while tokens:
        name = tokens.pop(0)
        value = tokens.pop(0)
        d[name] = value
    do_something_with_values(d['foo'], d['bar'])

Peter Hansen · May 27, 2004

Paul said:
for line in infile:
    tokens = line.split()
    dict = {}
    for i in range(0, len(tokens), 2): dict[tokens[i]] = tokens[i+1]
    do_something_with_values(dict['foo'], dict['bar'])

Here's a pessimized version:

for line in infile:
    tokens = line.split()
    d = {}
    while tokens:
        name = tokens.pop(0)
        value = tokens.pop(0)
        d[name] = value
    do_something_with_values(d['foo'], d['bar'])

Paul, I don't understand why you say "pessimized".
The only potential flaw in the original and most
(all) of the other solutions seems to be present
in yours as well: if there are an odd number of
tokens on a line an exception will be raised.
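
How each approach reacts to a dangling name is easy to check. In the sketch below (Python 3; the function names and the sample tokens are invented for illustration), the index-based loop raises IndexError, while the slice-and-zip version quietly drops the unpaired token. An explicit length check may therefore be preferable when malformed lines matter:

```python
def by_index(tokens):
    d = {}
    for i in range(0, len(tokens), 2):
        d[tokens[i]] = tokens[i + 1]   # IndexError on a trailing name
    return d


def by_slices(tokens):
    # zip stops at the shorter slice, so a trailing name is ignored
    return dict(zip(tokens[::2], tokens[1::2]))


def checked(tokens):
    if len(tokens) % 2:
        raise ValueError("unpaired token: %r" % tokens[-1])
    return by_slices(tokens)


odd = "foo 1 bar".split()
print(by_slices(odd))  # the lone 'bar' is dropped
for func in (by_index, checked):
    try:
        func(odd)
    except (IndexError, ValueError) as exc:
        print(func.__name__, "raised", type(exc).__name__)
```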

-Peter

Michele Simionato · May 27, 2004

Peter Otten said:
Yet another way to create the dictionary:

Peter

Cool! This should go in the Cookbook, in the shortcuts section.

Michele Simionato

Michele Simionato · May 27, 2004

Peter Otten said:
Yet another way to create the dictionary:

Peter

BTW, the name I have seen for this kind of thing is chop:

>>> import itertools
>>> def chop(it, n):
...     tup = (iter(it),)*n
...     return itertools.izip(*tup)
...
>>> list(chop([1,2,3,4,5,6],3))
[(1, 2, 3), (4, 5, 6)]
>>> list(chop([1,2,3,4,5,6],2))
[(1, 2), (3, 4), (5, 6)]
>>> list(chop([1,2,3,4,5,6],1))
[(1,), (2,), (3,), (4,), (5,), (6,)]

(I don't remember if this is already in itertools ... )
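
A Python 3 rendering of the same helper, offered as a sketch: the builtin zip() replaces itertools.izip(), and the remark about leftovers is an observation added here, not something stated in the post.

```python
def chop(iterable, n):
    """Yield successive n-tuples taken from iterable."""
    it = iter(iterable)
    # n references to one iterator: each output tuple takes n consecutive items.
    return zip(*[it] * n)


print(list(chop([1, 2, 3, 4, 5, 6], 3)))
print(list(chop([1, 2, 3, 4, 5, 6, 7], 3)))  # a trailing incomplete group is dropped
```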

Michele Simionato

has · May 27, 2004

Hi,
I have some data files with lines in space-delimited <name> <value>
format. There are multiple name-value pairs per line.

Is there a cleaner idiom than the following for reading each line into
an associative array for the purpose of accessing values by name?

If it's terseness you want:

import re

for line in infile:
    d = dict(re.findall('([^ ]+) ([^ ]+)', line))
    do_something_with_values(d['foo'], d['bar'])

Hardly worth worrying about though...
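
One thing to watch with that pattern: `[^ ]` also matches a newline, so the last value on each line keeps its trailing "\n" unless the line is stripped first. A variant using `\S+` avoids this. The following is a sketch in Python 3; the names `PAIR` and `parse` and the sample line are made up:

```python
import re

PAIR = re.compile(r"(\S+)\s+(\S+)")


def parse(line):
    """Return a dict of name/value pairs found on one line."""
    return dict(PAIR.findall(line))


line = "foo 1 bar 2\n"
print(dict(re.findall("([^ ]+) ([^ ]+)", line)))  # value for 'bar' is '2\n'
print(parse(line))                                 # value for 'bar' is '2'
```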

Peter Otten · May 27, 2004

Michele said:
>>> import itertools
>>> def chop(it, n):
...     tup = (iter(it),)*n
...     return itertools.izip(*tup)
...
>>> list(chop([1,2,3,4,5,6],3))
[(1, 2, 3), (4, 5, 6)]
>>> list(chop([1,2,3,4,5,6],2))
[(1, 2), (3, 4), (5, 6)]
>>> list(chop([1,2,3,4,5,6],1))
[(1,), (2,), (3,), (4,), (5,), (6,)]

(I don't remember if this is already in itertools ... )

I don't think so. IMO the itertools examples page would be a better place
for the above than the cookbook. If an example had to be removed in
exchange, that would be iteritems(). Raymond, are you looking?

Peter
