Real-world use cases for map's None fill-in feature?

Discussion in 'Python' started by Raymond Hettinger, Jan 9, 2006.

  1. Proposal
    --------
    I am gathering data to evaluate a request for an alternate version of
    itertools.izip() with a None fill-in feature like that for the built-in
    map() function:

    >>> map(None, 'abc', '12345') # demonstrate map's None fill-in feature

    [('a', '1'), ('b', '2'), ('c', '3'), (None, '4'), (None, '5')]

    The motivation is to provide a means for looping over all data elements
    when the input lengths are unequal. The question of the day is whether
    that is both a common need and a good approach to real-world problems.
    The answer can likely be found in results from other programming
    languages and from surveying real-world Python code.
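
    For concreteness, the requested behavior can be sketched as a plain
    generator in today's Python. This is only a sketch; the name
    izip_longest and the fillvalue keyword are assumptions here, not a
    settled API:

```python
from itertools import repeat

_SENTINEL = object()

def izip_longest(*iterables, fillvalue=None):
    # Yield tuples until every input is exhausted; shorter inputs
    # are padded with fillvalue instead of being truncated.
    iterators = [iter(it) for it in iterables]
    if not iterators:
        return
    active = len(iterators)
    while True:
        row = []
        for i, it in enumerate(iterators):
            item = next(it, _SENTINEL)
            if item is _SENTINEL:
                active -= 1
                if not active:
                    return          # last input ran dry mid-row
                iterators[i] = repeat(fillvalue)
                item = fillvalue
            row.append(item)
        yield tuple(row)
```

    With the default fill of None this reproduces the map(None, ...)
    output shown above.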

    Other languages
    ---------------
    I scanned the docs for Haskell, SML, and Perl6's yen operator and found
    that the norm for map() and zip() is to truncate to the shortest input
    or raise an exception for unequal input lengths. Ruby takes the
    opposite approach and fills in nil values -- the reasoning behind the
    design choice is somewhat inscrutable:
    http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-dev/18651

    Real-world code
    ---------------
    I scanned the standard library, my own code, and a few third-party
    tools. I found no instances where map's fill-in feature was used.

    History of zip()
    ----------------
    PEP 201 (lock-step iteration) documents that a fill-in feature was
    contemplated and rejected for the zip() built-in introduced in Py2.0.
    In the years before and after, SourceForge logs show no requests for a
    fill-in feature.

    Request for more information
    ----------------------------
    My request for readers of comp.lang.python is to search your own code
    to see if map's None fill-in feature was ever used in real-world code
    (not toy examples). I'm curious about the context, how it was used,
    and what alternatives were rejected (i.e. did the fill-in feature
    improve the code). Likewise, I'm curious as to whether anyone has seen
    a zip-style fill-in feature employed to good effect in some other
    programming language.

    Parallel to SQL?
    ----------------
    If an iterator element's ordinal position were considered as a record
    key, then the proposal equates to a database-style full outer join
    operation (one which includes unmatched keys in the result) where record
    order is significant. Does an outer-join have anything to do with
    lock-step iteration? Is this a fundamental looping construct or just a
    theoretical wish-list item? Does Python need itertools.izip_longest()
    or would it just become a distracting piece of cruft?
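
    The outer-join framing can be made concrete in today's Python by
    treating each element's index as its record key (the function name
    here is hypothetical, for illustration only):

```python
def outer_join_on_position(seq1, seq2, fill=None):
    # Index each element by ordinal position, then keep every key
    # that appears on either side -- a full outer join where the
    # "record key" is the position in the sequence.
    left = dict(enumerate(seq1))
    right = dict(enumerate(seq2))
    keys = sorted(left.keys() | right.keys())
    return [(left.get(k, fill), right.get(k, fill)) for k in keys]
```

    Applied to 'abc' and '12345' this yields exactly the map(None, ...)
    result from the proposal above.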



    Raymond Hettinger


    FWIW, the OP's use case involved printing files in multiple
    columns:

    for f, g in itertools.izip_longest(file1, file2, fillin_value=''):
        print '%-20s\t|\t%-20s' % (f.rstrip(), g.rstrip())

    The alternative was straightforward but less terse:

    while 1:
        f = file1.readline()
        g = file2.readline()
        if not f and not g:
            break
        print '%-20s\t|\t%-20s' % (f.rstrip(), g.rstrip())
     
    Raymond Hettinger, Jan 9, 2006
    #1

  2. Raymond Hettinger <> wrote:
    ...
    > Request for more information
    > ----------------------------
    > My request for readers of comp.lang.python is to search your own code
    > to see if map's None fill-in feature was ever used in real-world code
    > (not toy examples). I'm curious about the context, how it was used,
    > and what alternatives were rejected (i.e. did the fill-in feature


    I had (years ago, version was 1.5.2) one real-world case of map(max,
    seq1, seq2). The sequences represented alternate scores for various
    features, using None to mean "the score for this feature cannot be
    computed by the algorithm used to produce this sequence", and it was
    common to have one sequence longer (using a later-developed algorithm
    that computed more features). This use may have been an abuse of my
    observation that max(None, N) and max(N, None) were always N on the
    platform I was using at the time. I was relatively new at Python, and
    in retrospect I feel I might have been going for "use all the new toys
    we've just gotten" -- looping on feature index to compute the scores,
    and explicitly testing for None, might have been a better approach than
    building those lists (with seq1=map(scorer1, range(N)), btw) and then
    running map on them, anyway. At any rate, I later migrated to a lazily
    computed version, don't recall the exact details but it was something
    like (in today's Python):

    class LazyMergedList(object):
        def __init__(self, *fs):
            self.fs = fs
            self.known = {}
        def __getitem__(self, n):
            try: return self.known[n]
            except KeyError: pass
            result = self.known[n] = max(f(n) for f in self.fs)
            return result

    when it turned out that in most cases the downstream code wasn't
    actually using all the features (just a small subset in each case), so
    computing all of them ahead of time was a waste of cycles.

    I don't recall ever relying on map's None-filling feature in other
    real-world cases, and, as I mentioned, even here the reliance was rather
    doubtful. OTOH, if I had easily been able to specify a different
    filler, I _would_ have been able to use it a couple of times.


    Alex
     
    Alex Martelli, Jan 9, 2006
    #2

  3. In article <>,
    Raymond Hettinger <> wrote:
    >Request for more information
    >----------------------------
    >My request for readers of comp.lang.python is to search your own code
    >to see if map's None fill-in feature was ever used in real-world code
    >(not toy examples).


    I had a quick look through our (Strakt's) codebase and found one example.

    The code is used to process user-designed macros, where the user wants
    to append data to strings stored in the system. Note that all data is
    stored as lists of whatever the relevant data type is.

    While I didn't write this bit of code (so I can't say what, if any,
    alternatives were considered), it does seem to me the most straight-
    forward way to do it. Being able to say what the fill-in value should
    be would make the code even simpler.

    oldAttrVal is the original stored data, and attrValue is what the macro
    wants to append.

    --->8---
    newAttrVal = []
    for x, y in map(None, oldAttrVal, attrValue):
        newAttrVal.append(u''.join((x or '', y or '')))
    --->8---

    /Anders

    --
    -- Of course I'm crazy, but that doesn't mean I'm wrong.
    Anders Hammarquist |
    Physics student, Chalmers University of Technology, | Hem: +46 31 88 48 50
    Göteborg, Sweden. RADIO: SM6XMM and N2JGL | Mob: +46 707 27 86 87
     
    Anders Hammarquist, Jan 9, 2006
    #3
  4. [Alex Martelli]
    > I had (years ago, version was 1.5.2) one real-world case of map(max,
    > seq1, seq2). The sequences represented alternate scores for various
    > features, using None to mean "the score for this feature cannot be
    > computed by the algorithm used to produce this sequence", and it was
    > common to have one sequence longer (using a later-developed algorithm
    > that computed more features). This use may have been an abuse of my
    > observation that max(None, N) and max(N, None) were always N on the
    > platform I was using at the time.


    Analysis
    --------

    That particular dataset has three unique aspects allowing the map(max,
    s1, s2, s3) approach to work at all.

    1) Fortuitous alignment in various meanings of None:
       - the input sequence using it to mean "feature cannot be computed"
       - the auto-fill-in of None meaning "feature used in later
         algorithms, but not earlier ones"
       - the implementation quirk where max(None, n) == max(n, None) == n

    2) Use of a reduction function like max() which does not care about the
    order of inputs (i.e. the output sequence does not indicate which
    algorithm produced the best score).

    3) Later-developed sequences had to be created with the knowledge of
    the features used by all earlier sequences (lest two of the sequences
    get extended with different features corresponding to the same ordinal
    position).

    Getting around the latter limitation suggests using a mapping
    (feature->score) rather than tracking scores by ordinal position (with
    position corresponding to a particular feature):

    bestscore = {}
    for d in d1, d2, d3:
        for feature, score in d.iteritems():
            bestscore[feature] = max(bestscore.get(feature, 0), score)

    Such an approach also gets around dependence on the other two unique
    aspects of the dataset. With dict.get() any object can be specified as
    a default value (with zero being a better choice for a null input to
    max()). Also, the pattern is not limited to commutative reduction
    functions like max(); instead, it would work just as well with a
    result.setdefault(feature, []).append(score) style accumulation of all
    results or with other combining/analysis functions.
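
    The accumulation variant mentioned above can be sketched as follows
    (collect_scores is an assumed name; in today's Python, items() plays
    the role of iteritems()):

```python
def collect_scores(*score_dicts):
    # Gather every score per feature instead of reducing with max();
    # features missing from some dicts simply contribute fewer scores.
    result = {}
    for d in score_dicts:
        for feature, score in d.items():
            result.setdefault(feature, []).append(score)
    return result
```

    The same two-level loop shape serves both the reducing and the
    accumulating styles; only the innermost line changes.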

    So, while map's None fill-in feature happened to apply to this
    dataset's unique features, I wonder if its availability steered you
    away from a better data-structure with greater flexibility, less
    dependence on quirks, and more generality.

    Perhaps the lesson is that outer-join operations are best expressed
    with dictionaries rather than sequences with unequal lengths.


    > I was relatively new at Python, and
    > in retrospect I feel I might have been going for "use all the new toys
    > we've just gotten"


    That suggests that if itertools.izip_longest() doesn't turn out to be
    TheRightTool(tm) for many tasks, then it may have ill-effects beyond
    just being cruft -- it may steer folks away from better solutions. As
    you know, it can take a while for Python newcomers to realize the full
    power and generality of dictionary based approaches. I wonder if this
    proposed itertool would distract from that realization.


    > I don't recall ever relying on map's None-filling feature in other
    > real-world cases, and, as I mentioned, even here the reliance was rather
    > doubtful. OTOH, if I had easily been able to specify a different
    > filler, I _would_ have been able to use it a couple of times.


    Did you run across any cookbook code that would have been improved by
    the proposed itertools.izip_longest() function?



    Raymond
     
    Raymond Hettinger, Jan 9, 2006
    #4
  5. Raymond Hettinger

    Guest

    "Raymond Hettinger" <> wrote in message
    news:...
    > Proposal
    > --------
    > I am gathering data to evaluate a request for an alternate version of
    > itertools.izip() with a None fill-in feature like that for the built-in
    > map() function:
    >
    > >>> map(None, 'abc', '12345') # demonstrate map's None fill-in feature

    > [('a', '1'), ('b', '2'), ('c', '3'), (None, '4'), (None, '5')]
    >
    > The motivation is to provide a means for looping over all data elements
    > when the input lengths are unequal. The question of the day is whether
    > that is both a common need and a good approach to real-world problems.
    > The answer can likely be found in results from other programming
    > languages and from surveying real-world Python code.
    >
    > Other languages
    > ---------------
    > I scanned the docs for Haskell, SML, and Perl6's yen operator and found
    > that the norm for map() and zip() is to truncate to the shortest input
    > or raise an exception for unequal input lengths. Ruby takes the
    > opposite approach and fills-in nil values -- the reasoning behind the
    > design choice is somewhat inscrutable:
    > http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-dev/18651


    From what I can make out (with the help of internet language
    translation sites), the relevant part (section [2]) of this
    presents three options for handling unequal length arguments:
    1. zip to longest (Perl6 does it this way)
    2. zip to shortest (Python does it this way)
    3. use a zip method and choose depending on whether the
       argument list is shorter or longer than the object's list.
    It then solicits opinions on the best way.
    It does not state or justify any particular choice.

    If "perl6"=="perl6 yen operator" then there
    is a contradiction with your earlier statement.

    > Real-world code
    > ---------------
    > I scanned the standard library, my own code, and a few third-party
    > tools. I found no instances where map's fill-in feature was used.
    >
    > History of zip()
    > ----------------
    > PEP 201 (lock-step iteration) documents that a fill-in feature was
    > contemplated and rejected for the zip() built-in introduced in Py2.0.
    > In the years before and after, SourceForge logs show no requests for a
    > fill-in feature.


    My perception is that many people view the process
    of advocating for a library addition as
    1. Very time consuming due to the large amount of
    work involved in presenting and defending a proposal.
    2. Having a very small chance of acceptance.
    I do not know whether this is really the case or even if my
    perception is correct, but if it is, it could account for the
    lack of feature requests.

    > Request for more information
    > ----------------------------
    > My request for readers of comp.lang.python is to search your own code
    > to see if map's None fill-in feature was ever used in real-world code
    > (not toy examples). I'm curious about the context, how it was used,
    > and what alternatives were rejected (i.e. did the fill-in feature
    > improve the code). Likewise, I'm curious as to whether anyone has seen
    > a zip-style fill-in feature employed to good effect in some other
    > programming language.


    How well correlated is the use of map()-with-fill with the
    (need for) use of zip/izip-with-fill?

    > Parallel to SQL?
    > ----------------
    > If an iterator element's ordinal position were considered as a record
    > key, then the proposal equates to a database-style full outer join
    > operation (one which includes unmatched keys in the result) where record
    > order is significant. Does an outer-join have anything to do with
    > lock-step iteration? Is this a fundamental looping construct or just a
    > theoretical wish-list item? Does Python need itertools.izip_longest()
    > or would it just become a distracting piece of cruft?
    >
    > Raymond Hettinger
    >
    > FWIW, the OP's use case involved printing files in multiple
    > columns:
    >
    > for f, g in itertools.izip_longest(file1, file2, fillin_value=''):
    >     print '%-20s\t|\t%-20s' % (f.rstrip(), g.rstrip())
    >
    > The alternative was straightforward but less terse:
    >
    > while 1:
    >     f = file1.readline()
    >     g = file2.readline()
    >     if not f and not g:
    >         break
    >     print '%-20s\t|\t%-20s' % (f.rstrip(), g.rstrip())


    Actually my use case did not have quite so much
    perlish line noise :)
    Compared to
    for f, g in izip2(file1, file2, fill=''):
        print '%s\t%s' % (f, g)
    the above looks like a relatively minor loss
    of conciseness, but consider the uses of the
    current izip, for example

    for i1, i2 in itertools.izip(iterable_1, iterable_2):
        print '%-20s\t|\t%-20s' % (i1.rstrip(), i2.rstrip())

    can be replaced by:
    while 1:
        i1 = iterable_1.next()
        i2 = iterable_2.next()
        print '%-20s\t|\t%-20s' % (i1.rstrip(), i2.rstrip())

    yet that was not justification for rejecting izip()'s
    inclusion in itertools.

    The other use case I had was a simple file diff.
    All I cared about was if the files were the same or
    not, and if not, what were the first differing lines.
    This was to compare output from a process that
    was supposed to match some saved reference
    data. Because of error propagation, lines beyond
    the first difference were meaningless. The code,
    using an "iterate to longest with fill" izip would be
    roughly:

    # Simple file diff to identify the first mismatch
    for ln1, ln2 in izip_long(file1, file2, fill="<EOF>"):
        if ln1 != ln2:
            break
    if ln1 == ln2:
        print "files are identical"
    else:
        print "files are different"

    This same use case occurred again very recently
    when writing unit tests to compare output of a parser
    with known correct output during refactoring.

    With file iterators one can imagine many potential
    use cases for izip but not imap, but there are probably
    few real uses in existence because generally files may be
    of different lengths, and there currently is no usable
    izip for this case.

    [jan09 08:30 utc]
     
    , Jan 9, 2006
    #5
  6. Raymond Hettinger

    Duncan Booth Guest

    Raymond Hettinger wrote:

    > My request for readers of comp.lang.python is to search your own code
    > to see if map's None fill-in feature was ever used in real-world code
    > (not toy examples). I'm curious about the context, how it was used,
    > and what alternatives were rejected (i.e. did the fill-in feature
    > improve the code). Likewise, I'm curious as to whether anyone has seen
    > a zip-style fill-in feature employed to good effect in some other
    > programming language.


    One example of padding out iterators (although I didn't use map's fill-in
    to implement it) is turning a single column of items into a multi-column
    table with the items laid out across the rows first. The last row may have
    to be padded with some empty cells.

    Here's some code I wrote to do that. Never mind for the moment that the use
    of zip isn't actually defined here, it could use izip, but notice that the
    input iterator has to be converted to a list first so that I can add a
    suitable number of empty strings to the end. If there was an option to izip
    to pad the last element with a value of choice (such as a blank string) the
    code could work with iterators throughout:

    def renderGroups(self, group_size=2, allow_add=True):
        """Iterates over the items rendering one item for each group.
        Each group contains an iterator for group_size elements.
        The last group may be padded out with empty strings.
        """
        elements = list(self.renderIterator(allow_add)) + ['']*(group_size-1)
        eliter = iter(elements)
        return zip(*[eliter]*group_size)

    If there was a padding option to izip this could have been something
    like:

    def renderGroups(self, group_size=2, allow_add=True):
        """Iterates over the items rendering one item for each group.
        Each group contains an iterator for group_size elements.
        The last group may be padded out with empty strings.
        """
        it = self.renderIterator(allow_add)
        return itertools.izip(*[it]*group_size, pad='')

    The code is then used to build a table using tal like this:

    <tal:loop repeat="row python:slot.renderGroups(group_size=4);">
    <tr tal:define="isFirst repeat/row/start"
    tal:attributes="class python:test(isFirst, 'slot-top','')">
    <td class="slotElement" tal:repeat="cell row"
    tal:content="structure cell">4X Slot element</td>
    </tr>
    </tal:loop>
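
    For what it's worth, the shared-iterator grouping trick can be
    checked self-contained in today's Python. This sketch is a standalone
    function, not the class method above, and the names are assumptions:

```python
from itertools import chain, repeat

def render_groups(items, group_size=2, pad=''):
    # Every one of the group_size slots shares a SINGLE iterator,
    # so zip() pulls consecutive items into the same row; pre-padding
    # with group_size-1 pad values fills out the final short row
    # (and any wholly-pad leftovers are truncated by zip itself).
    it = chain(items, repeat(pad, group_size - 1))
    return list(zip(*[it] * group_size))
```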
     
    Duncan Booth, Jan 9, 2006
    #6
  7. [Anders Hammarquist]:
    > I had a quick look through our (Strakt's) codebase and found one example.


    Thanks for the research :)


    > The code is used to process user-designed macros, where the user wants
    > to append data to strings stored in the system. Note that all data is
    > stored as lists of whatever the relevant data type is.
    >
    > While I didn't write this bit of code (so I can't say what, if any,
    > alternatives were considered), it does seem to me the most straight-
    > forward way to do it. Being able to say what the fill-in value should
    > be would make the code even simpler.
    >
    > oldAttrVal is the original stored data, and attrValue is what the macro
    > wants to append.
    >
    > newAttrVal = []
    > for x, y in map(None, oldAttrVal, attrValue):
    >     newAttrVal.append(u''.join((x or '', y or '')))


    I'm finding this case difficult to analyze and generalize without
    knowing the significance of position in the list. It looks like None
    fill-in is used because attrValue may be a longer list whenever the
    user is specifying new system strings and it may be shorter when
    there are no new strings and the system strings aren't being updated
    at all. Either way, it looks like the ordinal position has some
    meaning that is shared by both oldAttrVal and newAttrVal, perhaps a
    message number or somesuch. If that is the case, is there some other
    table that assigns meanings to the resulting strings according to their
    index? What does the code look like that accesses newAttrVal and how
    does it know the significance of various positions in the list? This
    is important because it could shed some light on how an app finds
    itself looping over two lists which share a common meaning for each
    index position, yet they are unequal in length.



    Raymond
     
    Raymond Hettinger, Jan 9, 2006
    #7
  8. Duncan Booth wrote:
    > One example of padding out iterators (although I didn't use map's fill-in
    > to implement it) is turning a single column of items into a multi-column
    > table with the items laid out across the rows first. The last row may have
    > to be padded with some empty cells.


    ANALYSIS
    --------

    This case relies on the side-effects of zip's implementation details --
    the trick of windowing or data grouping with code like zip(it, it, it),
    where every argument is the same iterator. The remaining challenge is
    handling missing values when
    the reshape operation produces a rectangular matrix with more elements
    than provided by the iterable input.

    The proposed function directly meets the challenge:

    it = iter(iterable)
    result = izip_longest(*[it]*group_size, pad='')

    Alternately, the need can be met with existing tools by pre-padding the
    iterator with enough extra values to fill any holes:

    it = chain(iterable, repeat('', group_size-1))
    result = izip_longest(*[it]*group_size)

    Both approaches require a certain measure of inventiveness, rely on
    advanced tricks, and forgo readability to gain the raw speed and
    conciseness afforded by a clever use of itertools. They are also a
    challenge to review, test, modify, read, or explain to others.

    In contrast, a simple generator is trivially easy to create and read,
    albeit less concise and not as speedy:

    it = iter(iterable)
    while 1:
        row = tuple(islice(it, group_size))
        if len(row) == group_size:
            yield row
        else:
            yield row + ('',) * (group_size - len(row))
            break

    The generator version is plain, simple, boring, and uninspirational.
    But it took only seconds to write and did not require a knowledge of
    advanced itertool combinations. It is more easily explained than the
    versions with zip tricks.


    Raymond
     
    Raymond Hettinger, Jan 9, 2006
    #8
  9. Raymond Hettinger

    Paul Rubin Guest

    "Raymond Hettinger" <> writes:
    > The generator version is plain, simple, boring, and uninspirational.
    > But it took only seconds to write and did not require a knowledge of
    > advanced itertool combinations. It is more easily explained than the
    > versions with zip tricks.


    I had this cute idea of using dropwhile to detect the end of an iterable:

    it = chain(iterable, repeat(''))
    while True:
        row = tuple(islice(it, group_size))
        # next line raises StopIteration if row is entirely null-strings
        dropwhile(lambda x: x=='', row).next()
        yield row
     
    Paul Rubin, Jan 9, 2006
    #9
  10. Raymond Hettinger

    Duncan Booth Guest

    Raymond Hettinger wrote:

    > The generator version is plain, simple, boring, and uninspirational.
    > But it took only seconds to write and did not require a knowledge of
    > advanced itertool combinations. It is more easily explained than the
    > versions with zip tricks.
    >

    I can't argue with that.
     
    Duncan Booth, Jan 9, 2006
    #10
  11. wrote:
    > The other use case I had was a simple file diff.
    > All I cared about was if the files were the same or
    > not, and if not, what were the first differing lines.
    > This was to compare output from a process that
    > was supposed to match some saved reference
    > data. Because of error propagation, lines beyond
    > the first difference were meaningless.

    . . .
    > This same use case occured again very recently
    > when writing unit tests to compare output of a parser
    > with known correct output during refactoring.


    Analysis
    --------

    Both of these cases compare two data streams and report the first
    mismatch, if any. Data beyond the first mismatch is discarded.

    The example code seeks to avoid managing two separate iterators and the
    attendant code for trapping StopIteration and handling end-cases. The
    simplification is accomplished by generating a single fill element so
    that the end-of-file condition becomes its own element capable of being
    compared or reported back as a difference. The EOF element serves as a
    sentinel and allows a single line of comparison to handle all cases.
    This is a normal and common use for sentinels.

    The OP's code appends the sentinel using a proposed variant of zip()
    which pads unequal iterables with a specified fill element:

    for x, y in izip_longest(file1, file2, fill='<EOF>'):
        if x != y:
            return 'Mismatch', x, y
    return 'Match'

    Alternately, the example can be written using existing itertools:

    for x, y in izip(chain(file1, ['<EOF>']), chain(file2, ['<EOF>'])):
        if x != y:
            return 'Mismatch', x, y
    return 'Match'

    This is a typical use of chain() and not at all tricky. The chain()
    function was specifically designed for tacking one or more elements
    onto the end of another iterable. It is ideal for appending sentinels.
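
    A self-contained version of this sentinel approach in today's Python
    (compare_streams is an assumed name, and zip/chain stand in for
    izip/chain):

```python
from itertools import chain

def compare_streams(a, b, eof='<EOF>'):
    # Tack a sentinel onto the end of each stream so that one stream
    # ending early shows up as an ordinary, comparable element.
    for x, y in zip(chain(a, [eof]), chain(b, [eof])):
        if x != y:
            return ('Mismatch', x, y)
    return ('Match',)
```

    Note that equal-length streams compare their sentinels against each
    other, so the final (eof, eof) pair falls out of the same single test.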


    Raymond
     
    Raymond Hettinger, Jan 9, 2006
    #11
  12. > Alternately, the need can be met with existing tools by pre-padding the
    > iterator with enough extra values to fill any holes:
    >
    > it = chain(iterable, repeat('', group_size-1))
    > result = izip_longest(*[it]*group_size)


    Typo: That should be izip() instead of izip_longest()
     
    Raymond Hettinger, Jan 9, 2006
    #12
  13. Raymond Hettinger

    Guest

    "Raymond Hettinger" <> wrote:
    > Duncan Booth wrote:
    > > One example of padding out iterators (although I didn't use map's fill-in
    > > to implement it) is turning a single column of items into a multi-column
    > > table with the items laid out across the rows first. The last row may have
    > > to be padded with some empty cells.

    >
    > ANALYSIS
    > --------
    >
    > This case relies on the side-effects of zip's implementation details --
    > the trick of windowing or data grouping with code like: zip(it(),
    > it(), it()). The remaining challenge is handling missing values when
    > the reshape operation produces a rectangular matrix with more elements
    > than provided by the iterable input.
    >
    > The proposed function directly meets the challenge:
    >
    > it = iter(iterable)
    > result = izip_longest(*[it]*group_size, pad='')
    >
    > Alternately, the need can be met with existing tools by pre-padding the
    > iterator with enough extra values to fill any holes:
    >
    > it = chain(iterable, repeat('', group_size-1))
    > result = izip_longest(*[it]*group_size)


    I assumed you meant izip() here (and saw your followup)

    > Both approaches require a certain meaure of inventiveness, rely on
    > advacned tricks, and forgo readability to gain the raw speed and
    > conciseness afforded by a clever use of itertools. They are also a
    > challenge to review, test, modify, read, or explain to others.


    The inventiveness is in the "*[it]*group_size" part. The
    rest is straightforward (assuming of course that itertools
    has good documentation, and it was read first).

    > In contrast, a simple generator is trivially easy to create and read,
    > albiet less concise and not as speedy:
    >
    > it = iter(iterable)
    > while 1:
    >     row = tuple(islice(it, group_size))
    >     if len(row) == group_size:
    >         yield row
    >     else:
    >         yield row + ('',) * (group_size - len(row))
    >         break


    Yes, with 4 times the amount of code. (Yes, I am
    one of those who believes production and maintenance
    cost is, under many circumstances, roughly correlated
    with LOC.)

    And frankly, I don't find the above any more
    comprehensible than:
    > result = izip_longest(*[it]*group_size, pad='')

    once a little thought is given to the *[it]*group_size
    part. I see much more opaque code every time
    I look at source code in the standard library.

    > The generator version is plain, simple, boring, and uninspirational.
    > But it took only seconds to write and did not require a knowledge of
    > advanced itertool combinations.


    "advanced itertool combinations"?? Even I, newbie
    that I am, found the concepts of repeat() and chain()
    pretty straightforward. Of course having to
    understand/use 3 itertools tools is more difficult
    than understanding one (izip_longest). Better
    documentation could mitigate that a lot.
    But the solution using "advanced itertool combinations"
    was yours, avoided altogether with an izip_long().

    Also this same argument (uses of x can be easily
    coded without x by using a generator) is equally
    applicable to itertools.izip() itself, yes?

    > It is more easily explained than the versions with zip tricks.


    Calling this a "trick" is unfair. The (current pre-2.5)
    documentation still mentions no requirement that
    izip() arguments be independent (despite the fact
    that this issue was discussed here a couple of months
    ago, as I remember). If I remember correctly, it was
    not clear whether that should be a requirement, since
    it would prevent any use of the same iterable more
    than once in izip's arg list; it has not been documented
    for 3(?) Python versions, and clearly people are
    using the current behavior.
     
    , Jan 10, 2006
    #13
  14. Raymond Hettinger

    Cappy2112 Guest

    I haven't used itertools yet, so I don't know their capabilities.

    I have used map twice recently with None as the first argument. This
    was also the first time I've used map, and I was disappointed when I
    found out about the truncation. The lists map was iterating over in my
    case were of unequal lengths, so I had to pad the lists to make sure
    nothing was truncated.

    The most universal solution would be to provide a mechanism to
    truncate, pad, or require the same length. However, with the pad
    feature, the user should be able to specify the pad item.
     
    Cappy2112, Jan 10, 2006
    #14
  15. Raymond Hettinger

    Peter Otten Guest

    Raymond Hettinger wrote:

    > Alternately, the need can be met with existing tools by pre-padding the
    > iterator with enough extra values to fill any holes:
    >
    > it = chain(iterable, repeat('', group_size-1))
    > result = izip_longest(*[it]*group_size)
    >
    > Both approaches require a certain meaure of inventiveness, rely on
    > advacned tricks, and forgo readability to gain the raw speed and
    > conciseness afforded by a clever use of itertools. They are also a
    > challenge to review, test, modify, read, or explain to others.


    Is this the author of itertools becoming its most articulate opponent? What
    use is this collection of small functions sharing an underlying concept if
    you are not supposed to combine them to your heart's content? You probably
    cannot pull off some of those tricks until you have good working knowledge
    of the iterator protocol, but that is becoming increasingly important to
    understand all Python code.

    > In contrast, a simple generator is trivially easy to create and read,
    > albeit less concise and not as speedy:
    >
    > it = iter(iterable)
    > while 1:
    >     row = tuple(islice(it, group_size))
    >     if len(row) == group_size:
    >         yield row
    >     else:

              if row:
                  yield row + ('',) * (group_size - len(row))

    >         break
    >
    > The generator version is plain, simple, boring, and uninspirational.


    I can't argue with that :) But nobody spotted the bug within a day; so
    dumbing down the code didn't pay off. Furthermore, simple code like the above
    is often inlined and therefore harder to test and an impediment to
    modification. Once you put the logic into a separate function/generator it
    doesn't really matter which version you use. You can't get the
    chain/repeat/izip variant to meet your (changing) requirements? Throw it
    away and just keep the (modified) test suite.

    A newbie, by the way, would have /written/ neither. The it = iter(iterable)
    voodoo isn't obvious, and the barrier to switching from lst[:group_size] to
    islice(it, group_size) to /improve/ one's code is high. I expect to see an
    inlined list-based solution. The two versions are both part of a learning
    experience and both worth the effort.

    Regarding the thread's topic, I have no use cases for a map(None, ...)-like
    izip_longest(), but occasionally I would prefer izip() to throw a
    ValueError if its iterable arguments do not have the same "length".

    Peter
     
    Peter Otten, Jan 10, 2006
    #15
  16. [Raymond]
    > > Both approaches require a certain measure of inventiveness, rely on
    > > advanced tricks, and forgo readability to gain the raw speed and
    > > conciseness afforded by a clever use of itertools. They are also a
    > > challenge to review, test, modify, read, or explain to others.


    [Peter Otten]
    > Is this the author of itertools becoming its most articulate opponent? What
    > use is this collection of small functions sharing an underlying concept if
    > you are not supposed to combine them to your heart's content? You probably
    > cannot pull off some of those tricks until you have good working knowledge
    > of the iterator protocol, but that is becoming increasingly important to
    > understand all Python code.


    I'm happy with the module -- it has been well received and is in
    widespread use. The components were designed to be useful both
    individually and in combination.

    OTOH, I sometimes cringe at code reminiscent of APL:

    it = chain(iterable, repeat('', group_size-1))
    result = izip(*[it]*group_size)

    The code is understandable IF you're conversant with all the component
    idioms; however, if you're the slightest bit rusty, the meaning of the
    code is not obvious. Too much of the looping logic is implicit (1D
    padded input reshaped and truncated to a 2D iterator of tuples); the
    style is not purely functional (relying on side-effects from multiple
    calls to the same iterator); there are two distinct meanings for the
    star operator; and it is unlikely that most people remember the
    precedence rules for whether *[it] expands before the [it]*group_size
    repeats. All in all, it cannot be claimed to be a masterpiece of
    clarity. That being said, if speed was essential, I would use it every
    time (as a separate helper function and never as in-line code).
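    For readers less conversant with the component idioms, here is that
    trick spelled out as a separate helper, roughly as Raymond suggests
    using it (a sketch written with Python 3's zip, which behaves like the
    izip of the time):

```python
from itertools import chain, repeat

def grouper(iterable, group_size, pad=''):
    # Pre-pad the stream with enough pad values to complete the final
    # group, then pull group_size items at a time from the SAME iterator:
    # each of the group_size references advances the one shared iterator.
    it = chain(iterable, repeat(pad, group_size - 1))
    return zip(*[it] * group_size)

groups = list(grouper('abcdefg', 3))
# [('a', 'b', 'c'), ('d', 'e', 'f'), ('g', '', '')]
```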

    Of course, the main point of the post was that Duncan's use case was
    readily solved with existing tools and did not demonstrate a need for
    izip_longest(). His original code was almost there -- it just needed
    to use chain() instead of list concatenation.

    > Regarding the thread's topic, I have no use cases for a map(None, ...)-like
    > izip_longest(), but occasionally I would prefer izip() to throw a
    > ValueError if its iterable arguments do not have the same "length".


    The Standard ML authors agree. Their library offers both alternatives
    (with and without an exception for unequal inputs):

    http://www.standardml.org/Basis/list-pair.html#SIG:LIST_PAIR.zipEq:VAL

    Thanks for the input,

    Raymond
     
    Raymond Hettinger, Jan 10, 2006
    #16
  17. Raymond Hettinger

    Guest

    Raymond Hettinger <> wrote:
    >
    > > > History of zip()
    > > > ----------------
    > > > PEP 201 (lock-step iteration) documents that a fill-in feature was
    > > > contemplated and rejected for the zip() built-in introduced in Py2.0.
    > > > In the years before and after, SourceForge logs show no requests for a
    > > > fill-in feature.

    > >
    > > My perception is that many people view the process
    > > of advocating for a library addition as
    > > 1. Very time consuming due to the large amount of
    > > work involved in presenting and defending a proposal.

    >
    > I would characterize it as time consuming due to the amount of
    > research, discussion, and analysis it takes to determine whether or not
    > a proposal is a good idea.
    >
    > > 2. Having a very small chance of acceptance.

    >
    > It is less a matter of chance and more a matter of quality. Great
    > ideas usually make it. Crummy ideas have no chance unless no one takes
    > the time to think them through.


    Great and crummy are not the problem, since the answer
    in those cases is obvious. It is the middle ground, where
    the answer is not clear and different people can hold
    different views, that is the problem.

    > > I do not know whether this is really the case or even if my
    > > perception is correct, but if it is, it could account for the
    > > lack of feature requests.

    >
    > I've been monitoring and adjudicating feature requests for five years.
    > Pythonistas are not known for a lack of assertiveness. If a core
    > feature has usability problems, we tend to hear about it quickly.
    > Also, at PyCon, people are not shy about discussing issues that have
    > arisen.


    Yet these are the people both most familiar with the
    library as it exists and the most able to easily work
    around any limitations, maybe without even thinking
    about it. So I am not surprised that this might not
    have come up.

    To me, the izip solution for my use case was "obvious".
    None of the other solutions posted here were.
    Of course that could be fixed with documentation.

    > The lack of requests is not a definitive answer; however, it does
    > suggest that there is not a strong unmet need. The lack of examples
    > in the standard library and other code scans corroborates that notion.
    > This newsgroup query will further serve to gauge the level of interest
    > and to ferret out real-world use cases. The jury is still out.


    Comments at end re use cases.

    > > How well correlated is the use of map()-with-fill with the
    > > (need for) the use of zip/izip-with-fill?

    >
    > Close to 100%. A non-iterator version of izip_longest() is exactly
    > equivalent to map(None, it1, it2, ...).


    Isn't the difference between non-iterator and iterator very
    significant? If I use map() I can trivially determine the arguments'
    lengths and deal with unequal lengths before map(). With iterators that is more
    difficult. So I can imagine many cases where izip might
    be applicable but map not, and a lack of map use cases
    not representative of izip use cases.

    > Since "we already got one", the real issue is whether it has been so
    > darned useful that it warrants a second variant with two new features
    > (returns an iterator instead of a list and allows a user-specifiable
    > fill value).


    I don't see it as having one and adding a second variant.
    I see it as having 1/2 and adding the other 1/2.

    > > > FWIW, the OP's use case involved printing files in multiple
    > > > columns:
    > > >
    > > > for f, g in itertools.izip_longest(file1, file2, fillin_value=''):
    > > >     print '%-20s\t|\t%-20s' % (f.rstrip(), g.rstrip())

    > . . .
    >
    > > Actually my use case did not have quite so much
    > > Perlish line noise :)

    >
    > The code was not intended to recapitulate your thread; instead, it was
    > a compact way of summarizing the problem context that first suggested
    > some value to izip_longest().


    I realize that. I just thought that having a
    lot of extraneous stuff like the formatting made
    it look, at first glance, messier than it should.

    > > for i1, i2 in itertools.izip(iterable_1, iterable_2):
    > >     print '%-20s\t|\t%-20s' % (i1.rstrip(), i2.rstrip())
    > >
    > > can be replaced by:
    > >
    > > while 1:
    > >     i1 = iterable_1.next()
    > >     i2 = iterable_2.next()
    > >     print '%-20s\t|\t%-20s' % (i1.rstrip(), i2.rstrip())
    > >
    > > yet that was not justification for rejecting izip()'s
    > > inclusion in itertools.

    >
    > Two thoughts:
    >
    > 1) The easily-coded-simple-alternative argument applies less strongly
    > to common cases (equal sequence lengths and finite sequences mixed with
    > infinite suppliers) than it does to less common cases (unequal sequence
    > lengths where order is important and missing data elements have
    > meaning).
    >
    > 2) The replacement code is not quite accurate -- the StopIteration
    > exception needs to be trapped.


    Yes, but I don't think that negates the point.

    > > The other use case I had was a simple file diff.
    > > All I cared about was if the files were the same or
    > > not, and if not, what were the first differing lines.

    >
    > Did you look at difflib?


    Yes, but it was way overkill for what I needed.

    > Raymond


    ~~~
    Thanks for your response, but I'm curious why you
    mailed it rather than posting it?

    I am still left with a difficult-to-express feeling of
    dissatisfaction with this process.

    Please try to see it from the point of view of
    someone who is not an expert at Python:

    Here is izip().
    My conception is that it takes two sequence generators
    and matches up the items from each. (I am talking
    overall conceptual models here, not details.)
    Here is my problem.
    I have two files that produce lines and I want to
    compare each line.
    Seems like a perfect fit.

    So I read that izip() only goes to the shortest iterable,
    and I think, "why only the shortest? why not the longest?
    what's so special about the shortest?"
    At this point explanations involving lack of use cases
    are not very convincing. I have a use. All the
    alternative solutions are more code, less clear, less
    obvious, less right. But most importantly, there
    seems to be a symmetry between the two cases
    (shortest vs longest) that makes the lack of
    support for matching-to-longest somehow a
    defect.

    Now if there is something fundamental about
    matching items in parallel lists that makes it a
    sensible thing to do only for equal lists (or to the
    shortest list) that's fine. You seem to imply that's
    the case by referencing Haskell, ML, etc. If so,
    that needs to be pointed out in izip's docs.
    (Though nothing I have read in this thread has
    been convincing.)

    If it is the case that a matching-longest izip is easily
    handled by adding a line or two to code using izip-shortest,
    that should be pointed out in the doc.

    But if the answer is to write out an equivalent generator
    in basic python, I cannot see izip but as being
    excessively specialized, and needing to be fixed.
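    For what it's worth, the match-to-longest behavior being asked for can
    be written as a short generator in basic Python (a hypothetical sketch,
    not an official API; a sentinel object distinguishes real exhaustion
    from a legitimate None value in the data):

```python
def izip_longest_sketch(a, b, fill=None):
    _done = object()  # sentinel distinct from any real value
    ita, itb = iter(a), iter(b)
    while True:
        x = next(ita, _done)
        y = next(itb, _done)
        if x is _done and y is _done:
            break  # both inputs exhausted
        yield (fill if x is _done else x,
               fill if y is _done else y)

pairs = list(izip_longest_sketch('abc', '12345'))
# [('a', '1'), ('b', '2'), ('c', '3'), (None, '4'), (None, '5')]
```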

    Re use-cases...

    Use cases seem to be sought from readers
    of c.l.p. and python-dev. That is a pretty small
    percentage of python users, and those that
    choose to respond are self-selecting. I would
    expect the distribution of responders to be
    skewed toward advanced users for example.
    The other source seems to be a search of
    the standard libraries but isn't that also likely
    not representative of all the code out in the
    wild?

    Also, can anyone really remember their code
    well enough to recall when some proposed
    enhancement would be beneficial?

    What I am suggesting is that use cases are
    important, but it should also be realized that
    they may not always give an accurate quantitative
    picture, and that some things still might be good
    ideas even without use cases (and the converse of
    course), not because the use cases don't exist,
    but because they may not be seen by the current
    use-case solicitation process.
     
    , Jan 11, 2006
    #17
  18. wrote:
    > I am still left with a difficult to express feeling of
    > dissatifaction at this process.
    >
    > Plese try to see it from the point of view of
    > someone who it not a expert at Python:
    >
    > ... [explains his POV]


    i more or less completely agree with you, IOW i'd like izip
    to change, too. but there are two problems that you haven't
    mentioned. first is that, in the case of izip, it is not clear
    how it should be fixed and if such a change does not naturally
    fit an API it is difficult to incorporate. personally i think
    i like the keyword version ("izip(*args, sentinel=None)") best,
    but the .rest-method version is appealing too...

    second (and i think this is the reason for the use-case search)
    is that someone has to do it. that means implement it and fix
    the docs, add a test-case and such stuff. if there are not many
    use-cases the effort to do so might not be worthwhile.

    that means if someone (you?) steps forward with a patch that does
    this, it would dramatically increase the chance of a change ;).

    --
    David.
     
    David Murmann, Jan 12, 2006
    #18
  19. [David Murmann]
    > i'd like izip
    > to change, too.


    The zip() function, introduced in Py2.0, was popular and broadly
    useful. The izip() function is a zip() substitute with better memory
    utilization yet almost identical in how it is used. It is bugfree,
    successful, fast, and won't change.

    The map() function, introduced shortly after the transistor was
    invented, incorporates an option that functions like zip() but fills in
    missing values rather than truncating. It probably seemed like a good
    idea at the time, but AFAICT no one uses it (Alex once as a newbie;
    Strakt once; me never; the standard library never; etc.).

    So, the question is not whether non-truncating fill-in will be
    available. After all, we've already got one: map(None, it1, it2).

    Instead, the question is whether to introduce another substantially
    identical function with improved memory utilization and a specifiable
    fill-in value. But, why would you offer a slightly improved variant of
    something that doesn't get used?

    Put another way: If you don't use map(None, it1, it2), then you're
    going to have a hard time explaining why you need
    itertools.izip_longest(it1, it2).



    > second (and i think this is the reason for the use-case search)
    > is that someone has to do it. that means implement it and fix
    > the docs, add a test-case and such stuff. if there are not many
    > use-cases the effort to do so might not be worthwhile.


    In this case, the coding and testing are easy. So that's not the
    problem. The real issue is the clutter factor from introducing new
    functions if they're not going to be used, if they don't have good use
    cases, and if there are better ways to approach most problems.

    The reason for the use-case search is to determine whether
    izip_longest() would end up as unutilized cruft and dead weight in
    the language. The jury is still out, but it doesn't look promising.


    Raymond
     
    Raymond Hettinger, Jan 12, 2006
    #19
  20. []
    > > > How well correlated is the use of map()-with-fill with the
    > > > (need for) the use of zip/izip-with-fill?


    [raymond]
    > > Close to 100%. A non-iterator version of izip_longest() is exactly
    > > equivalent to map(None, it1, it2, ...).


    []
    > If I use map()
    > I can trivially determine the arguments lengths and deal with
    > unequal length before map(). With iterators that is more
    > difficult. So I can imagine many cases where izip might
    > be applicable but map not, and a lack of map use cases
    > not representative of izip use cases.


    You don't seem to understand what map() does. There is no need to
    deal with unequal argument lengths before map(); it does the work for
    you. It handles iterator inputs the same way. Meditate on this:

    def izip_longest(*args):
        return iter(map(None, *args))

    Modulo arbitrary fill values and lazily evaluated inputs, the semantics
    are exactly what is being requested. Ergo, lack of use cases for
    map(None,it1,it2) means that izip_longest(it1,it2) isn't needed.
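    The claimed semantics can be checked directly against itertools'
    zip_longest (the Python 3 spelling of izip_longest, which supplies the
    lazy evaluation and specifiable fill value under discussion):

```python
from itertools import zip_longest

# Produces the same pairs as Python 2's map(None, 'abc', '12345'):
pairs = list(zip_longest('abc', '12345'))
# [('a', '1'), ('b', '2'), ('c', '3'), (None, '4'), (None, '5')]
```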

    Raymond
     
    Raymond Hettinger, Jan 12, 2006
    #20
