Real-world use cases for map's None fill-in feature?

Raymond Hettinger

Proposal
--------
I am gathering data to evaluate a request for an alternate version of
itertools.izip() with a None fill-in feature like that for the built-in
map() function:
    >>> map(None, 'abc', '12345')
    [('a', '1'), ('b', '2'), ('c', '3'), (None, '4'), (None, '5')]

The motivation is to provide a means for looping over all data elements
when the input lengths are unequal. The question of the day is whether
that is both a common need and a good approach to real-world problems.
The answer can likely be found in results from other programming
languages and from surveying real-world Python code.
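
For readers on a current Python 3, where a function of this kind is available as itertools.zip_longest, the difference between truncating and filling can be tried directly. This is only a minimal sketch; the sample inputs 'abc' and '12345' are chosen to match the example above.

```python
from itertools import zip_longest

letters = 'abc'
digits = '12345'

# zip() stops as soon as the shortest input is exhausted.
truncated = list(zip(letters, digits))

# zip_longest() keeps going and fills the gaps (None by default).
filled = list(zip_longest(letters, digits))

print(truncated)
print(filled)
```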

Other languages
---------------
I scanned the docs for Haskell, SML, and Perl6's yen operator and found
that the norm for map() and zip() is to truncate to the shortest input
or raise an exception for unequal input lengths. Ruby takes the
opposite approach and fills in nil values -- the reasoning behind the
design choice is somewhat inscrutable:
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-dev/18651

Real-world code
---------------
I scanned the standard library, my own code, and a few third-party
tools. I found no instances where map's fill-in feature was used.

History of zip()
----------------
PEP 201 (lock-step iteration) documents that a fill-in feature was
contemplated and rejected for the zip() built-in introduced in Py2.0.
In the years before and after, SourceForge logs show no requests for a
fill-in feature.

Request for more information
----------------------------
My request for readers of comp.lang.python is to search your own code
to see if map's None fill-in feature was ever used in real-world code
(not toy examples). I'm curious about the context, how it was used,
and what alternatives were rejected (i.e. did the fill-in feature
improve the code). Likewise, I'm curious as to whether anyone has seen
a zip-style fill-in feature employed to good effect in some other
programming language.

Parallel to SQL?
----------------
If an iterator element's ordinal position were considered as a record
key, then the proposal equates to a database-style full outer join
operation (one which includes unmatched keys in the result) where record
order is significant. Does an outer-join have anything to do with
lock-step iteration? Is this a fundamental looping construct or just a
theoretical wish-list item? Does Python need itertools.izip_longest()
or would it just become a distracting piece of cruft?



Raymond Hettinger


FWIW, the OP's use case involved printing files in multiple
columns:

    for f, g in itertools.izip_longest(file1, file2, fillin_value=''):
        print '%-20s\t|\t%-20s' % (f.rstrip(), g.rstrip())

The alternative was straightforward but less terse:

    while 1:
        f = file1.readline()
        g = file2.readline()
        if not f and not g:
            break
        print '%-20s\t|\t%-20s' % (f.rstrip(), g.rstrip())
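
The two-column use case can be sketched in Python 3 with itertools.zip_longest and its fillvalue keyword. The helper name side_by_side and the width parameter are made up for illustration, and plain lists of lines stand in for the open files.

```python
from itertools import zip_longest

def side_by_side(lines1, lines2, width=20):
    """Yield one formatted row per line pair, padding the shorter input with ''."""
    for left, right in zip_longest(lines1, lines2, fillvalue=''):
        yield f'{left.rstrip():<{width}}\t|\t{right.rstrip():<{width}}'

# Hypothetical sample data standing in for two files of unequal length.
for row in side_by_side(['a\n', 'b\n'], ['1\n'], width=3):
    print(row)
```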
 
Alex Martelli

Raymond Hettinger said:
Request for more information
----------------------------
My request for readers of comp.lang.python is to search your own code
to see if map's None fill-in feature was ever used in real-world code
(not toy examples). I'm curious about the context, how it was used,
and what alternatives were rejected (i.e. did the fill-in feature

I had (years ago, version was 1.5.2) one real-world case of map(max,
seq1, seq2). The sequences represented alternate scores for various
features, using None to mean "the score for this feature cannot be
computed by the algorithm used to produce this sequence", and it was
common to have one sequence longer (using a later-developed algorithm
that computed more features). This use may have been an abuse of my
observation that max(None, N) and max(N, None) were always N on the
platform I was using at the time. I was relatively new at Python, and
in retrospect I feel I might have been going for "use all the new toys
we've just gotten" -- looping on feature index to compute the scores,
and explicitly testing for None, might have been a better approach than
building those lists (with seq1=map(scorer1, range(N)), btw) and then
running map on them, anyway. At any rate, I later migrated to a lazily
computed version, don't recall the exact details but it was something
like (in today's Python):

    class LazyMergedList(object):
        def __init__(self, *fs):
            self.fs = fs
            self.known = {}
        def __getitem__(self, n):
            try:
                return self.known[n]
            except KeyError:
                pass
            # Compute the best score on first access, then cache it.
            result = self.known[n] = max(f(n) for f in self.fs)
            return result

when it turned out that in most cases the downstream code wasn't
actually using all the features (just a small subset in each case), so
computing all of them ahead of time was a waste of cycles.

I don't recall ever relying on map's None-filling feature in other
real-world cases, and, as I mentioned, even here the reliance was rather
doubtful. OTOH, if I had easily been able to specify a different
filler, I _would_ have been able to use it a couple of times.


Alex
 
Anders Hammarquist

Request for more information

I had a quick look through our (Strakt's) codebase and found one example.

The code is used to process user-designed macros, where the user wants
to append data to strings stored in the system. Note that all data is
stored as lists of whatever the relevant data type is.

While I didn't write this bit of code (so I can't say what, if any,
alternatives were considered), it does seem to me the most
straightforward way to do it. Being able to say what the fill-in value
should be would make the code even simpler.

oldAttrVal is the original stored data, and attrValue is what the macro
wants to append.

--->8---
    newAttrVal = []
    for x, y in map(None, oldAttrVal, attrValue):
        newAttrVal.append(u''.join((x or '', y or '')))
--->8---
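
With a user-specified fill value the "x or ''" guards are no longer needed. A rough Python 3 sketch using itertools.zip_longest follows; the function name append_values is hypothetical, and it assumes every stored element is already a string (the original also tolerated None entries inside the lists).

```python
from itertools import zip_longest

def append_values(old_values, new_values):
    """Concatenate strings position by position, padding the shorter list with ''."""
    return [x + y for x, y in zip_longest(old_values, new_values, fillvalue='')]

print(append_values(['a', 'b', 'c'], ['1', '2']))
```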

/Anders
 
Raymond Hettinger

[Alex Martelli]
I had (years ago, version was 1.5.2) one real-world case of map(max,
seq1, seq2). The sequences represented alternate scores for various
features, using None to mean "the score for this feature cannot be
computed by the algorithm used to produce this sequence", and it was
common to have one sequence longer (using a later-developed algorithm
that computed more features). This use may have been an abuse of my
observation that max(None, N) and max(N, None) were always N on the
platform I was using at the time.

Analysis
--------

That particular dataset has three unique aspects allowing the map(max,
s1, s2, s3) approach to work at all.

1) Fortuitous alignment in various meanings of None:
   - the input sequence using it to mean "feature cannot be computed"
   - the auto-fillin of None meaning "feature used in later
     algorithms, but not earlier ones"
   - the implementation quirk where max(None, n) == max(n, None) == n

2) Use of a reduction function like max() which does not care about the
order of inputs (i.e. the output sequence does not indicate which
algorithm produced the best score).

3) Later-developed sequences had to be created with the knowledge of
the features used by all earlier sequences (lest two of the sequences
get extended with different features corresponding to the same ordinal
position).

Getting around the latter limitation suggests using a mapping
(feature->score) rather than tracking scores by ordinal position (with
position corresponding to a particular feature):

    bestscore = {}
    for d in d1, d2, d3:
        for feature, score in d.iteritems():
            bestscore[feature] = max(bestscore.get(feature, 0), score)

Such an approach also gets around dependence on the other two unique
aspects of the dataset. With dict.get() any object can be specified as
a default value (with zero being a better choice for a null input to
max()). Also, the pattern is not limited to commutative reduction
functions like max(); instead, it would work just as well with a
result.setdefault(feature, []).append(score) style accumulation of all
results or with other combining/analysis functions.
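
A runnable sketch of the dictionary-based merge in Python 3 (items() in place of iteritems()). The function names and the sample scores are illustrative assumptions, and the default of 0 presumes that scores are never negative.

```python
def best_scores(*score_maps):
    """Keep the highest score seen for each feature."""
    best = {}
    for scores in score_maps:
        for feature, score in scores.items():
            best[feature] = max(best.get(feature, 0), score)
    return best

def all_scores(*score_maps):
    """Collect every score per feature, in the order the mappings are given."""
    collected = {}
    for scores in score_maps:
        for feature, score in scores.items():
            collected.setdefault(feature, []).append(score)
    return collected

# Hypothetical feature->score mappings from two algorithms.
d1 = {'a': 3, 'b': 1}
d2 = {'b': 5, 'c': 2}
print(best_scores(d1, d2))
print(all_scores(d1, d2))
```

Because the features are keys rather than positions, it does not matter that d2 knows about a feature d1 never computed.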

So, while map's None fill-in feature happened to apply to this
dataset's unique features, I wonder if its availability steered you
away from a better data-structure with greater flexibility, less
dependence on quirks, and more generality.

Perhaps the lesson is that outer-join operations are best expressed
with dictionaries rather than sequences with unequal lengths.

I was relatively new at Python, and
in retrospect I feel I might have been going for "use all the new toys
we've just gotten"

That suggests that if itertools.zip_longest() doesn't turn out to be
TheRightTool(tm) for many tasks, then it may have ill-effects beyond
just being cruft -- it may steer folks away from better solutions. As
you know, it can take a while for Python newcomers to realize the full
power and generality of dictionary based approaches. I wonder if this
proposed itertool would distract from that realization.

I don't recall ever relying on map's None-filling feature in other
real-world cases, and, as I mentioned, even here the reliance was rather
doubtful. OTOH, if I had easily been able to specify a different
filler, I _would_ have been able to use it a couple of times.

Did you run across any cookbook code that would have been improved by
the proposed itertools.zip_longest() function?



Raymond
 
rurpy

Raymond Hettinger said:
Proposal
--------
I am gathering data to evaluate a request for an alternate version of
itertools.izip() with a None fill-in feature like that for the built-in
map() function:
    >>> map(None, 'abc', '12345')
    [('a', '1'), ('b', '2'), ('c', '3'), (None, '4'), (None, '5')]

The motivation is to provide a means for looping over all data elements
when the input lengths are unequal. The question of the day is whether
that is both a common need and a good approach to real-world problems.
The answer can likely be found in results from other programming
languages and from surveying real-world Python code.

Other languages
---------------
I scanned the docs for Haskell, SML, and Perl6's yen operator and found
that the norm for map() and zip() is to truncate to the shortest input
or raise an exception for unequal input lengths. Ruby takes the
opposite approach and fills-in nil values -- the reasoning behind the
design choice is somewhat inscrutable:
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-dev/18651
From what I can make out (with help of internet
language translation sites) the relevant part
(section [2]) of this presents three options for
handling unequal length arguments:
  1. zip to longest (Perl6 does it this way)
  2. zip to shortest (Python does it this way)
  3. use zip method and choose depending on
     whether argument list is shorter or longer
     than object's list.
It then solicits opinions on the best way.
It does not state or justify any particular choice.

If "perl6"=="perl6 yen operator" then there
is a contradiction with your earlier statement.

Real-world code
---------------
I scanned the standard library, my own code, and a few third-party
tools. I found no instances where map's fill-in feature was used.

History of zip()
----------------
PEP 201 (lock-step iteration) documents that a fill-in feature was
contemplated and rejected for the zip() built-in introduced in Py2.0.
In the years before and after, SourceForge logs show no requests for a
fill-in feature.

My perception is that many people view the process
of advocating for a library addition as
1. Very time consuming due to the large amount of
work involved in presenting and defending a proposal.
2. Having a very small chance of acceptance.
I do not know whether this is really the case or even if my
perception is correct, but if it is, it could account for the
lack of feature requests.
Request for more information
----------------------------
My request for readers of comp.lang.python is to search your own code
to see if map's None fill-in feature was ever used in real-world code
(not toy examples). I'm curious about the context, how it was used,
and what alternatives were rejected (i.e. did the fill-in feature
improve the code). Likewise, I'm curious as to whether anyone has seen
a zip-style fill-in feature employed to good effect in some other
programming language.

How well correlated is the use of map()-with-fill with the
(need for the) use of zip/izip-with-fill?

Parallel to SQL?
----------------
If an iterator element's ordinal position were considered as a record
key, then the proposal equates to a database-style full outer join
operation (one which includes unmatched keys in the result) where record
order is significant. Does an outer-join have anything to do with
lock-step iteration? Is this a fundamental looping construct or just a
theoretical wish-list item? Does Python need itertools.izip_longest()
or would it just become a distracting piece of cruft?

Raymond Hettinger

FWIW, the OP's use case involved printing files in multiple
columns:

    for f, g in itertools.izip_longest(file1, file2, fillin_value=''):
        print '%-20s\t|\t%-20s' % (f.rstrip(), g.rstrip())

The alternative was straightforward but less terse:

    while 1:
        f = file1.readline()
        g = file2.readline()
        if not f and not g:
            break
        print '%-20s\t|\t%-20s' % (f.rstrip(), g.rstrip())

Actually my use case did not have quite so much
perlish line noise :)
Compared to

    for f, g in izip2 (file1, file2, fill=''):
        print '%s\t%s' % (f, g)

the above looks like a relatively minor loss
of conciseness, but consider the uses of the
current izip, for example

    for i1, i2 in itertools.izip (iterable_1, iterable_2):
        print '%-20s\t|\t%-20s' % (i1.rstrip(), i2.rstrip())

can be replaced by:

    while 1:
        i1 = iterable_1.next()
        i2 = iterable_2.next()
        print '%-20s\t|\t%-20s' % (i1.rstrip(), i2.rstrip())

yet that was not justification for rejecting izip()'s
inclusion in itertools.

The other use case I had was a simple file diff.
All I cared about was if the files were the same or
not, and if not, what were the first differing lines.
This was to compare output from a process that
was supposed to match some saved reference
data. Because of error propagation, lines beyond
the first difference were meaningless. The code,
using an "iterate to longest with fill" izip would be
roughly:

    # Simple file diff to ident
    for ln1, ln2 in izip_long (file1, file2, fill="<EOF>"):
        if ln1 != ln2:
            break
    if ln1 == ln2:
        print "files are identical"
    else:
        print "files are different"

This same use case occurred again very recently
when writing unit tests to compare output of a parser
with known correct output during refactoring.

With file iterators one can imagine many potential
use cases for izip but not imap, but there are probably
few real uses extant because files may generally be
of different lengths, and there currently is no usable
izip for this case.

[jan09 08:30 utc]
 
Duncan Booth

Raymond said:
My request for readers of comp.lang.python is to search your own code
to see if map's None fill-in feature was ever used in real-world code
(not toy examples). I'm curious about the context, how it was used,
and what alternatives were rejected (i.e. did the fill-in feature
improve the code). Likewise, I'm curious as to whether anyone has seen
a zip-style fill-in feature employed to good effect in some other
programming language.

One example of padding out iterators (although I didn't use map's fill-in
to implement it) is turning a single column of items into a multi-column
table with the items laid out across the rows first. The last row may have
to be padded with some empty cells.

Here's some code I wrote to do that. Never mind for the moment that the use
of zip isn't actually defined here, it could use izip, but notice that the
input iterator has to be converted to a list first so that I can add a
suitable number of empty strings to the end. If there was an option to izip
to pad the last element with a value of choice (such as a blank string) the
code could work with iterators throughout:

    def renderGroups(self, group_size=2, allow_add=True):
        """Iterates over the items rendering one item for each group.
        Each group contains an iterator for group_size elements.
        The last group may be padded out with empty strings.
        """
        elements = list(self.renderIterator(allow_add)) + ['']*(group_size-1)
        eliter = iter(elements)
        return zip(*[eliter]*group_size)

If there was a padding option to izip this code could have been something
like:

    def renderGroups(self, group_size=2, allow_add=True):
        """Iterates over the items rendering one item for each group.
        Each group contains an iterator for group_size elements.
        The last group may be padded out with empty strings.
        """
        iter = self.renderIterator(allow_add)
        return itertools.izip(*[iter]*group_size, pad='')

The code is then used to build a table using tal like this:

    <tal:loop repeat="row python:slot.renderGroups(group_size=4);">
      <tr tal:define="isFirst repeat/row/start"
          tal:attributes="class python:test(isFirst, 'slot-top','')">
        <td class="slotElement" tal:repeat="cell row"
            tal:content="structure cell">4X Slot element</td>
      </tr>
    </tal:loop>
 
Raymond Hettinger

[Anders Hammarquist]:
I had a quick look through our (Strakt's) codebase and found one example.

Thanks for the research :)

The code is used to process user-designed macros, where the user wants
to append data to strings stored in the system. Note that all data is
stored as lists of whatever the relevant data type is.

While I didn't write this bit of code (so I can't say what, if any,
alternatives were considered), it does seem to me the most
straightforward way to do it. Being able to say what the fill-in value
should be would make the code even simpler.

oldAttrVal is the original stored data, and attrValue is what the macro
wants to append.

    newAttrVal = []
    for x, y in map(None, oldAttrVal, attrValue):
        newAttrVal.append(u''.join((x or '', y or '')))

I'm finding this case difficult to analyze and generalize without
knowing the significance of position in the list. It looks like None
fill-in is used because attrValue may be a longer list whenever the
user is specifying new system strings and it may be shorter when
there are no new strings and the system strings aren't being updated
at all. Either way, it looks like the ordinal position has some
meaning that is shared by both oldAttrVal and newAttrVal, perhaps a
message number or somesuch. If that is the case, is there some other
table that assigns meanings to the resulting strings according to their
index? What does the code look like that accesses newAttrVal and how
does it know the significance of various positions in the list? This
is important because it could shed some light on how an app finds
itself looping over two lists which share a common meaning for each
index position, yet are unequal in length.



Raymond
 
Raymond Hettinger

Duncan said:
One example of padding out iterators (although I didn't use map's fill-in
to implement it) is turning a single column of items into a multi-column
table with the items laid out across the rows first. The last row may have
to be padded with some empty cells.

ANALYSIS
--------

This case relies on the side-effects of zip's implementation details --
the trick of windowing or data grouping with code like: zip(it(),
it(), it()). The remaining challenge is handling missing values when
the reshape operation produces a rectangular matrix with more elements
than provided by the iterable input.

The proposed function directly meets the challenge:

    it = iter(iterable)
    result = izip_longest(*[it]*group_size, pad='')

Alternately, the need can be met with existing tools by pre-padding the
iterator with enough extra values to fill any holes:

    it = chain(iterable, repeat('', group_size-1))
    result = izip_longest(*[it]*group_size)

Both approaches require a certain measure of inventiveness, rely on
advanced tricks, and forgo readability to gain the raw speed and
conciseness afforded by a clever use of itertools. They are also a
challenge to review, test, modify, read, or explain to others.

In contrast, a simple generator is trivially easy to create and read,
albeit less concise and not as speedy:

    it = iter(iterable)
    while 1:
        row = tuple(islice(it, group_size))
        if len(row) == group_size:
            yield row
        else:
            yield row + ('',) * (group_size - len(row))
            break

The generator version is plain, simple, boring, and uninspirational.
But it took only seconds to write and did not require a knowledge of
advanced itertool combinations. It is more easily explained than the
versions with zip tricks.


Raymond
 
Paul Rubin

Raymond Hettinger said:
The generator version is plain, simple, boring, and uninspirational.
But it took only seconds to write and did not require a knowledge of
advanced itertool combinations. It is more easily explained than the
versions with zip tricks.

I had this cute idea of using dropwhile to detect the end of an iterable:

    it = chain(iterable, repeat(''))
    while True:
        row = tuple(islice(it, group_size))
        # next line raises StopIteration if row is entirely null-strings
        dropwhile(lambda x: x=='', row).next()
        yield row
 
Duncan Booth

Raymond said:
The generator version is plain, simple, boring, and uninspirational.
But it took only seconds to write and did not require a knowledge of
advanced itertool combinations. It is more easily explained than the
versions with zip tricks.

I can't argue with that.
 
Raymond Hettinger

The other use case I had was a simple file diff.
All I cared about was if the files were the same or
not, and if not, what were the first differing lines.
This was to compare output from a process that
was supposed to match some saved reference
data. Because of error propagation, lines beyond
the first difference were meaningless. . . .
This same use case occurred again very recently
when writing unit tests to compare output of a parser
with known correct output during refactoring.

Analysis
--------

Both of these cases compare two data streams and report the first
mismatch, if any. Data beyond the first mismatch is discarded.

The example code seeks to avoid managing two separate iterators and the
attendant code for trapping StopIteration and handling end-cases. The
simplification is accomplished by generating a single fill element so
that the end-of-file condition becomes its own element capable of being
compared or reported back as a difference. The EOF element serves as a
sentinel and allows a single line of comparison to handle all cases.
This is a normal and common use for sentinels.

The OP's code appends the sentinel using a proposed variant of zip()
which pads unequal iterables with a specified fill element:

    for x, y in izip_longest(file1, file2, fill='<EOF>'):
        if x != y:
            return 'Mismatch', x, y
    return 'Match'

Alternately, the example can be written using existing itertools:

    for x, y in izip(chain(file1, ['<EOF>']), chain(file2, ['<EOF>'])):
        if x != y:
            return 'Mismatch', x, y
    return 'Match'

This is a typical use of chain() and not at all tricky. The chain()
function was specifically designed for tacking one or more elements
onto the end of another iterable. It is ideal for appending sentinels.
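
The sentinel approach can be sketched as a complete Python 3 function, where the built-in zip() plays the role of izip(). The name first_mismatch is made up for this sketch, and the '<EOF>' marker is assumed never to occur as a real line.

```python
from itertools import chain

EOF = '<EOF>'  # sentinel; assumed never to appear as real data

def first_mismatch(lines1, lines2):
    """Return ('Mismatch', x, y) for the first differing pair, else 'Match'."""
    padded1 = chain(lines1, [EOF])
    padded2 = chain(lines2, [EOF])
    # zip() truncates, but each stream now ends in a sentinel, so a
    # length difference shows up as a real line compared against EOF.
    for x, y in zip(padded1, padded2):
        if x != y:
            return 'Mismatch', x, y
    return 'Match'

print(first_mismatch(['a', 'b'], ['a']))
```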


Raymond
 
Raymond Hettinger

Alternately, the need can be met with existing tools by pre-padding the
iterator with enough extra values to fill any holes:

    it = chain(iterable, repeat('', group_size-1))
    result = izip_longest(*[it]*group_size)

Typo: That should be izip() instead of izip_longest()
 
rurpy

Raymond Hettinger said:
Duncan said:
One example of padding out iterators (although I didn't use map's fill-in
to implement it) is turning a single column of items into a multi-column
table with the items laid out across the rows first. The last row may have
to be padded with some empty cells.

ANALYSIS
--------

This case relies on the side-effects of zip's implementation details --
the trick of windowing or data grouping with code like: zip(it(),
it(), it()). The remaining challenge is handling missing values when
the reshape operation produces a rectangular matrix with more elements
than provided by the iterable input.

The proposed function directly meets the challenge:

    it = iter(iterable)
    result = izip_longest(*[it]*group_size, pad='')

Alternately, the need can be met with existing tools by pre-padding the
iterator with enough extra values to fill any holes:

    it = chain(iterable, repeat('', group_size-1))
    result = izip_longest(*[it]*group_size)

I assumed you meant izip() here (and saw your followup).

Both approaches require a certain measure of inventiveness, rely on
advanced tricks, and forgo readability to gain the raw speed and
conciseness afforded by a clever use of itertools. They are also a
challenge to review, test, modify, read, or explain to others.

The inventiveness is in the "(*[it]*group_size, " part. The
rest is straightforward (assuming of course that itertools
has good documentation, and it was read first).

In contrast, a simple generator is trivially easy to create and read,
albeit less concise and not as speedy:

    it = iter(iterable)
    while 1:
        row = tuple(islice(it, group_size))
        if len(row) == group_size:
            yield row
        else:
            yield row + ('',) * (group_size - len(row))
            break

Yes, with 4 times the amount of code. (Yes, I am
one of those who believes production and maintenance
cost is, under many circumstances, roughly correlated
with LOC.)

And frankly, I don't find the above any more
comprehensible than:
    result = izip_longest(*[it]*group_size, pad='')
once a little thought is given to the *[it]*group_size
part. I see much more opaque code every time
I look at source code in the standard library.

The generator version is plain, simple, boring, and uninspirational.
But it took only seconds to write and did not require a knowledge of
advanced itertool combinations.

"advanced itertool combinations"?? Even I, newbie
that I am, found the concepts of repeat() and chain()
pretty straightforward. Of course having to
understand/use 3 itertools tools is more difficult
than understanding one (izip_longest). Better
documentation could mitigate that a lot.
But the solution using "advanced itertool combinations"
was yours, avoided altogether with an izip_long().

Also this same argument (uses of x can be easily
coded without x by using a generator) is equally
applicable to itertools.izip() itself, yes?

It is more easily explained than the versions with zip tricks.

Calling this a "trick" is unfair. The (current pre-2.5)
documentation still mentions no requirement that
izip() arguments be independent (despite the fact
that this issue was discussed here a couple of months
ago, as I remember). If I remember, it was not clear if
that should be a requirement or not, since it would
prevent any use of the same iterable more than
once in izip's arg list, it has not been documented
for 3(?) Python versions, and clearly people are
using the current behavior.
 
Cappy2112

I haven't used itertools yet, so I don't know their capabilities.

I have used map twice recently with None as the first argument. This
was also the first time I've used map, and I was disappointed when I
found out about the truncation. The lists map was iterating over in my
case were of unequal lengths, so I had to pad the lists to make sure
nothing was truncated.

The most universal solution would be to provide a mechanism to
truncate, pad, or remain the same length. However, with the pad
feature, room should be provided for the user to add the pad item.
 
Peter Otten

Raymond said:
Alternately, the need can be met with existing tools by pre-padding the
iterator with enough extra values to fill any holes:

    it = chain(iterable, repeat('', group_size-1))
    result = izip_longest(*[it]*group_size)

Both approaches require a certain measure of inventiveness, rely on
advanced tricks, and forgo readability to gain the raw speed and
conciseness afforded by a clever use of itertools. They are also a
challenge to review, test, modify, read, or explain to others.

Is this the author of itertools becoming its most articulate opponent? What
use is this collection of small functions sharing an underlying concept if
you are not supposed to combine them to your heart's content? You probably
cannot pull off some of those tricks until you have good working knowledge
of the iterator protocol, but that is becoming increasingly important to
understand all Python code.

In contrast, a simple generator is trivially easy to create and read,
albeit less concise and not as speedy:

    it = iter(iterable)
    while 1:
        row = tuple(islice(it, group_size))
        if len(row) == group_size:
            yield row
        else:
            if row:
                yield row + ('',) * (group_size - len(row))
            break

The generator version is plain, simple, boring, and uninspirational.

I can't argue with that :) But nobody spotted the bug within a day; so
dumbing down the code didn't pay off. Furthermore, simple code like above
is often inlined and therefore harder to test and an impediment to
modification. Once you put the logic into a separate function/generator it
doesn't really matter which version you use. You can't get the
chain/repeat/izip variant to meet your (changing) requirements? Throw it
away and just keep the (modified) test suite.

A newbie, by the way, would have /written/ neither. The it = iter(iterable)
voodoo isn't obvious and the barrier to switch from lst[:group_size] to
islice(it, group_size) to /improve/ one's code is high. I expect to see an
inlined list-based solution. The two versions are both part of a learning
experience and both worth the effort.
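
To make the comparison concrete, here is a sketch of both variants wrapped as separate Python 3 functions, with zip() standing in for izip(). The names grouped_zip and grouped_gen and the pad parameter are invented for illustration. The interesting input is one whose length is an exact multiple of group_size: without the "if row:" guard the generator would emit an extra row made up entirely of padding.

```python
from itertools import chain, islice, repeat

def grouped_zip(iterable, group_size, pad=''):
    """chain/repeat/zip variant: pre-pad, then let zip() drop the incomplete tail."""
    it = chain(iterable, repeat(pad, group_size - 1))
    # The same iterator object is passed group_size times, so each
    # output tuple consumes group_size consecutive items.
    return zip(*[it] * group_size)

def grouped_gen(iterable, group_size, pad=''):
    """Explicit generator variant, including the guard against an empty last row."""
    it = iter(iterable)
    while True:
        row = tuple(islice(it, group_size))
        if len(row) == group_size:
            yield row
        else:
            if row:  # nothing left over when the input divides evenly
                yield row + (pad,) * (group_size - len(row))
            break

print(list(grouped_zip('abcde', 2)))
print(list(grouped_gen('abcd', 2)))
```

With both behind a function boundary, a single set of tests covers either implementation.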

Regarding the thread's topic, I have no use cases for a map(None, ...)-like
izip_longest(), but occasionally I would prefer izip() to throw a
ValueError if its iterable arguments do not have the same "length".
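
A strict variant of that kind can be sketched in Python 3 on top of itertools.zip_longest with a private sentinel. The name zip_equal is hypothetical; recent Python versions (3.10 and later) provide similar behaviour through zip(..., strict=True).

```python
from itertools import zip_longest

_MISSING = object()  # private sentinel that cannot collide with real data

def zip_equal(*iterables):
    """Like zip(), but raise ValueError if the inputs run out at different times."""
    for row in zip_longest(*iterables, fillvalue=_MISSING):
        if any(item is _MISSING for item in row):
            raise ValueError('iterables have different lengths')
        yield row

print(list(zip_equal('ab', [1, 2])))
```

Note that the error is raised lazily, only once the shorter input has been exhausted.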

Peter
 
Raymond Hettinger

[Raymond]
[Peter Otten]
Is this the author of itertools becoming its most articulate opponent? What
use is this collection of small functions sharing an underlying concept if
you are not supposed to combine them to your heart's content? You probably
cannot pull off some of those tricks until you have good working knowledge
of the iterator protocol, but that is becoming increasingly important to
understand all Python code.

I'm happy with the module -- it has been well received and is in
widespread use. The components were designed to be useful both
individually and in combination.

OTOH, I sometimes cringe at code reminiscent of APL:

    it = chain(iterable, repeat('', group_size-1))
    result = izip(*[it]*group_size)

The code is understandable IF you're conversant with all the component
idioms; however, if you're the slightest bit rusty, the meaning of the
code is not obvious. Too much of the looping logic is implicit (1D
padded input reshaped and truncated to a 2D iterator of tuples); the
style is not purely functional (relying on side-effects from multiple
calls to the same iterator); there are two distinct meanings for the
star operator; and it is unlikely that most people remember the
precedence rules for whether *[it] expands before the [it]*group_size
repeats. All in all, it cannot be claimed to be a masterpiece of
clarity. That being said, if speed was essential, I would use it every
time (as a separate helper function and never as in-line code).

Of course, the main point of the post was that Duncan's use case was
readily solved with existing tools and did not demonstrate a need for
izip_longest(). His original code was almost there -- it just needed
to use chain() instead of list concatenation.

Regarding the thread's topic, I have no use cases for a map(None, ...)-like
izip_longest(), but occasionally I would prefer izip() to throw a
ValueError if its iterable arguments do not have the same "length".

The Standard ML authors agree. Their library offers both alternatives
(with and without an exception for unequal inputs):

http://www.standardml.org/Basis/list-pair.html#SIG:LIST_PAIR.zipEq:VAL

Thanks for the input,

Raymond
 
rurpy

Raymond Hettinger said:
I would characterize it as time consuming due to the amount of
research, discussion, and analysis it takes to determine whether or not
a proposal is a good idea.


It is less a matter of chance and more a matter of quality. Great
ideas usually make it. Crummy ideas have no chance unless no one takes
the time to think them through.

Great and crummy are not the problem, since the answer
in those cases is obvious. It is the middle ground where
the answer is not clear, where different people can hold
different views, that is the problem.
I've been monitoring and adjudicating feature requests for five years.
Pythonistas are not known for the lack of assertiveness. If a core
feature has usability problems, we tend to hear about it quickly.
Also, at PyCon, people are not shy about discussing issues that have
arisen.

Yet these are the people both most familiar with the
library as it exists and the most able to easily work
around any limitations, maybe without even thinking
about it. So I am not surprised that this might not
have come up.

To me, the izip solution for my use case was "obvious".
None of the other solutions posted here were.
Of course that could be fixed with documentation.
The lack of requests is not a definitive answer; however, it does
suggest that there is not a strong unmet need. The lack of examples
in the standard library and other code scans corroborates that notion.
This newsgroup query will further serve to gauge the level of interest
and to ferret out real-world use cases. The jury is still out.

Comments at end re use cases.
Close to 100%. A non-iterator version of izip_longest() is exactly
equivalent to map(None, it1, it2, ...).

Isn't the difference between non-iterator and iterator very significant?
If I use map() I can trivially determine the argument lengths and deal with
unequal length before map(). With iterators that is more
difficult. So I can imagine many cases where izip might
be applicable but map not, and a lack of map use cases
not representative of izip use cases.
Since "we already got one", the real issue is whether it has been so
darned useful that it warrants a second variant with two new features
(returns an iterator instead of a list and allows a user-specifiable
fill value).

I don't see it as having one and adding a second variant.
I see it as having 1/2 and adding the other 1/2.
. . .


The code was not intended to recapitulate your thread; instead, it was
a compact way of summarizing the problem context that first suggested
some value to izip_longest().

I realize that. I just thought that having a
lot of extraneous stuff like the formatting made
it look, at first glance, messier than it should.
Two thoughts:

1) The easily-coded-simple-alternative argument applies less strongly
to common cases (equal sequence lengths and finite sequences mixed with
infinite suppliers) than it does to less common cases (unequal sequence
lengths where order is important and missing data elements have
meaning).

2) The replacement code is not quite accurate -- the StopIteration
exception needs to be trapped.

Yes, but I don't think that negates the point.
Did you look at difflib?

Yes, but it was way overkill for what I needed.

~~~
Thanks for your response but I'm curious why you
mailed it rather than posted?

I am still left with a difficult-to-express feeling of
dissatisfaction at this process.

Please try to see it from the point of view of
someone who is not an expert at Python:

Here is izip().
My conception is it takes two sequence generators
and matches up the items from each. (I am talking
overall conceptual models here, not details.)
Here is my problem.
I have two files that produce lines and I want to
compare each line.
Seems like a perfect fit.

So I read that izip() only goes to the shortest iterable, and
I think, "why only the shortest? why not the longest?
what's so special about the shortest?"
At this point explanations involving lack of use cases
are not very convincing. I have a use. All the
alternative solutions are more code, less clear, less
obvious, less right. But most importantly, there
seems to be a symmetry between the two cases
(shortest vs longest) that makes the lack of
support for matching-to-longest somehow a
defect.
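
The line-comparison use case can be sketched as follows. This assumes a modern Python 3 interpreter, where itertools.zip_longest() supplies the matching-to-longest behavior under discussion; the function name differing_lines and the sample data are hypothetical, and lists of strings stand in for open files.

```python
from itertools import zip_longest

def differing_lines(lines_a, lines_b):
    """Yield (line_number, a, b) wherever the two inputs disagree."""
    # zip_longest pads the shorter input with None, so a missing line
    # shows up as a mismatch instead of being silently dropped.
    pairs = zip_longest(lines_a, lines_b)
    for number, (a, b) in enumerate(pairs, start=1):
        if a != b:
            yield number, a, b

old = ["alpha\n", "beta\n", "gamma\n"]
new = ["alpha\n", "BETA\n"]
for number, a, b in differing_lines(old, new):
    print(number, repr(a), repr(b))
# prints:
# 2 'beta\n' 'BETA\n'
# 3 'gamma\n' None
```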

Now if there is something fundamental about
matching items in parallel lists that makes it a
sensible thing to do only for equal lists (or to the
shortest list) that's fine. You seem to imply that's
the case by referencing Haskell, ML, etc. If so,
that needs to be pointed out in izip's docs.
(Though nothing I have read in this thread has
been convincing.)

If it is the case that a matching-longest izip is easily
handled by adding a line or two to code using izip-shortest,
that should be pointed out in the doc.

But if the answer is to write out an equivalent generator
in basic python, I cannot see izip as anything but
excessively specialized, and needing to be fixed.
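
For reference, an "equivalent generator in basic python" might look roughly like the sketch below. It is written for Python 3; the name zip_longest_sketch is hypothetical, and this is one possible formulation rather than the code posted earlier in the thread. It traps StopIteration for each input, which is the detail noted in point 2 above.

```python
def zip_longest_sketch(*iterables, fillvalue=None):
    """Yield tuples until every input is exhausted, padding with fillvalue."""
    iterators = [iter(it) for it in iterables]
    exhausted = [False] * len(iterators)
    while True:
        row = []
        for i, it in enumerate(iterators):
            if exhausted[i]:
                row.append(fillvalue)
                continue
            try:
                row.append(next(it))
            except StopIteration:
                # Remember that this input ran out and pad from now on.
                exhausted[i] = True
                row.append(fillvalue)
        if all(exhausted):
            return  # every input is done; the padded row is discarded
        yield tuple(row)

print(list(zip_longest_sketch('abc', '12345')))
# prints [('a', '1'), ('b', '2'), ('c', '3'), (None, '4'), (None, '5')]
```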

Re use-cases...

Use cases seem to be sought from readers
of c.l.p. and python-dev. That is a pretty small
percentage of python users, and those that
choose to respond are self-selecting. I would
expect the distribution of responders to be
skewed toward advanced users for example.
The other source seems to be a search of
the standard libraries but isn't that also likely
not representative of all the code out in the
wild?

Also, can anyone really remember their code
well enough to recall when some proposed
enhancement would be beneficial?

What I am suggesting is that use cases are
important but it also should be realized that
they may not always give an accurate quantitative
picture, and that some things still might be good
ideas even without use cases (and the converse of
course), not because the use cases don't exist,
but because they may not be seen by the current
use case solicitation process.
 
D

David Murmann

I am still left with a difficult-to-express feeling of
dissatisfaction at this process.

Please try to see it from the point of view of
someone who is not an expert at Python:

... [explains his POV]

i more or less completely agree with you, IOW i'd like izip
to change, too. but there are two problems that you haven't
mentioned. first is that, in the case of izip, it is not clear
how it should be fixed and if such a change does not naturally
fit an API it is difficult to incorporate. personally i think
i like the keyword version ("izip(*args, sentinel=None)") best,
but the .rest-method version is appealing too...

second (and i think this is the reason for the use-case search)
is that someone has to do it. that means implement it and fix
the docs, add a test-case and such stuff. if there are not many
use-cases the effort to do so might not be worthwhile.

that means if someone (you?) steps forward with a patch that does
this, it would dramatically increase the chance of a change ;).
 
R

Raymond Hettinger

[David Murmann]
i'd like izip
to change, too.

The zip() function, introduced in Py2.0, was popular and broadly
useful. The izip() function is a zip() substitute with better memory
utilization yet almost identical in how it is used. It is bug-free,
successful, fast, and won't change.

The map() function, introduced shortly after the transistor was
invented, incorporates an option that functions like zip() but fills-in
missing values and won't truncate. It probably seemed like a good idea
at the time, but AFAICT no one uses it (Alex once as a newbie; Strakt
once; me never; the standard library never; etc).

So, the question is not whether non-truncating fill-in will be
available. After all, we've already got one: map(None, it1, it2).

Instead, the question is whether to introduce another substantially
identical function with improved memory utilization and a specifiable
fill-in value. But, why would you offer a slightly improved variant of
something that doesn't get used?

Put another way: If you don't use map(None, it1, it2), then you're
going to have a hard time explaining why you need
itertools.izip_longest(it1, it2).


second (and i think this is the reason for the use-case search)
is that someone has to do it. that means implement it and fix
the docs, add a test-case and such stuff. if there are not many
use-cases the effort to do so might not be worthwhile.

In this case, the coding and testing are easy. So that's not the
problem. The real issue is the clutter factor from introducing new
functions if they're not going to be used, if they don't have good use
cases, and if there are better ways to approach most problems.

The reason for the use case search is to determine whether
izip_longest() would end-up as unutilized cruft and add dead-weight to
the language. The jury is still out but it doesn't look promising.


Raymond
 
R

Raymond Hettinger

[rurpy]
[raymond]
Close to 100%. A non-iterator version of izip_longest() is exactly
equivalent to map(None, it1, it2, ...).
[rurpy]
If I use map()
I can trivially determine the argument lengths and deal with
unequal length before map(). With iterators that is more
difficult. So I can imagine many cases where izip might
be applicable but map not, and a lack of map use cases
not representative of izip use cases.

You don't seem to understand what map() does. There is no need to
deal with unequal argument lengths before map(); it does the work for
you. It handles iterator inputs the same way. Meditate on this:

def izip_longest(*args):
    # Python 2 only: map(None, ...) zips its inputs, padding with None
    return iter(map(None, *args))

Modulo arbitrary fill values and lazily evaluated inputs, the semantics
are exactly what is being requested. Ergo, lack of use cases for
map(None,it1,it2) means that izip_longest(it1,it2) isn't needed.
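
The snippet above relies on Python 2 semantics; in Python 3, map(None, ...) fails with a TypeError once it is iterated. A rough Python 3 stand-in for the eager, list-returning behavior is sketched below. It assumes itertools.zip_longest() is available, and the name map_none is hypothetical.

```python
from itertools import zip_longest

def map_none(*iterables):
    """Eager stand-in for Python 2's map(None, it1, it2, ...)."""
    # Consumes every input up front and returns a list, so it cannot
    # be used with infinite iterators.
    return list(zip_longest(*iterables))

print(map_none('abc', '12345'))
# prints [('a', '1'), ('b', '2'), ('c', '3'), (None, '4'), (None, '5')]
```

The two differences named above remain visible here: the result is built eagerly, and the fill value is fixed at None.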

Raymond
 
