While we're talking about annoyances

Discussion in 'Python' started by Steven D'Aprano, Apr 29, 2007.

  1. Am I the only one who finds that I'm writing more documentation than code?

    I recently needed to write a function to generate a rank table from a
    list. That is, a list of ranks, where the rank of an item is the position
    it would be in if the list were sorted:

    alist = list('defabc')
    ranks = [3, 4, 5, 0, 1, 2]

    To do that, I needed to generate an index table first. In the book
    "Numerical Recipes in Pascal" by William Press et al there is a procedure
    to generate an index table (46 lines of code) and one for a rank table
    (five lines).

    In Python, my index function is four lines of code and my rank function is
    five lines. I then wrote three more functions for verifying that my index
    and rank tables were calculated correctly (17 more lines) and four more
    lines to call doctest, giving a total of 30 lines of code.

    I also have 93 lines of documentation, including doctests, or three
    lines of documentation per line of code.

    For those interested, here is how to generate an index table and rank
    table in Python:


    def index(sequence):
    decorated = zip(sequence, xrange(len(sequence)))
    decorated.sort()
    return [idx for (value, idx) in decorated]

    def rank(sequence):
    table = [None] * len(sequence)
    for j, idx in enumerate(index(sequence)):
    table[idx] = j
    return table


    You can write your own damn documentation. *wink*



    --
    Steven.
    Steven D'Aprano, Apr 29, 2007
    #1
    1. Advertising

  2. Steven D'Aprano

    GHUM Guest

    Steven,

    > def index(sequence):
    > decorated = zip(sequence, xrange(len(sequence)))
    > decorated.sort()
    > return [idx for (value, idx) in decorated]


    would'nt that be equivalent code?

    def index(sequence):
    return [c for _,c in sorted((b,a) for a, b in
    enumerate(sequence))]

    tested, gave same results. But worsens your doc2code ratio :)

    Harald Armin Massa
    --
    GHUM, Apr 29, 2007
    #2
    1. Advertising

  3. GHUM wrote:
    > Steven,
    >
    >> def index(sequence):
    >> decorated = zip(sequence, xrange(len(sequence)))
    >> decorated.sort()
    >> return [idx for (value, idx) in decorated]

    >
    > would'nt that be equivalent code?
    >
    > def index(sequence):
    > return [c for _,c in sorted((b,a) for a, b in
    > enumerate(sequence))]


    Or even these:

    def index(sequence):
    return sorted(range(len(sequence)), key=sequence.__getitem__)

    def rank(sequence):
    return sorted(range(len(sequence)),
    key=index(sequence).__getitem__)

    Hint: if you find yourself using a decorate-sort-undecorate pattern,
    sorted(key=func) or sequence.sort(key=func) might be a better idea.
    --
    Michael Hoffman
    Michael Hoffman, Apr 29, 2007
    #3
  4. On 4/29/07, Steven D'Aprano <> wrote:
    > To do that, I needed to generate an index table first. In the book
    > "Numerical Recipes in Pascal" by William Press et al there is a procedure
    > to generate an index table (46 lines of code) and one for a rank table
    > (five lines).


    51 lines total.

    > In Python, my index function is four lines of code and my rank function is
    > five lines. I then wrote three more functions for verifying that my index
    > and rank tables were calculated correctly (17 more lines) and four more
    > lines to call doctest, giving a total of 30 lines of code.


    So 9 lines for Python, excluding tests.

    > I also have 93 lines of documentation, including doctests, or three
    > lines of documentation per line of code.


    Then, without documentation, Python is roughly 560% (51/9) as
    efficient as Pascal. But with documentation (assuming you need the
    same amount of documentation for the Python code as the Pascal code),
    (51 + 93)/(9 + 93) = 1.41 so only 141% as efficient as Pascal.

    I wonder what that means? Maybe Python the language is approaching the
    upper bound for how efficient an imperative programming language can
    be? On the other hand, there seem to be some progress that could be
    made to reduce the amount of work in writing documentation.
    Documentation in Esperanto instead of English maybe?

    --
    mvh Björn
    =?ISO-8859-1?Q?BJ=F6rn_Lindqvist?=, Apr 29, 2007
    #4
  5. Steven D'Aprano

    Ben Finney Guest

    "BJörn Lindqvist" <> writes:

    > On the other hand, there seem to be some progress that could be made
    > to reduce the amount of work in writing documentation.
    > Documentation in Esperanto instead of English maybe?


    Lojban <URL:http://www.lojban.org/> is both easier to learn world-wide
    than Euro-biased Esperanto, and computer-parseable. Seems a better[0]_
    choice for computer documentation to me.


    ... _[0] ignoring the fact that it's spoken by even fewer people than
    Esperanto.

    --
    \ "The greater the artist, the greater the doubt; perfect |
    `\ confidence is granted to the less talented as a consolation |
    _o__) prize." -- Robert Hughes |
    Ben Finney
    Ben Finney, Apr 29, 2007
    #5
  6. Steven D'Aprano

    Jarek Zgoda Guest

    Ben Finney napisa³(a):

    >> On the other hand, there seem to be some progress that could be made
    >> to reduce the amount of work in writing documentation.
    >> Documentation in Esperanto instead of English maybe?

    >
    > Lojban <URL:http://www.lojban.org/> is both easier to learn world-wide
    > than Euro-biased Esperanto, and computer-parseable. Seems a better[0]_
    > choice for computer documentation to me.


    German seems to be less "wordy" than English, despite the fact that most
    of nouns is much longer. ;)

    --
    Jarek Zgoda
    http://jpa.berlios.de/
    Jarek Zgoda, Apr 29, 2007
    #6
  7. On Apr 29, 11:46 am, Michael Hoffman <> wrote:
    > GHUM wrote:
    > > Steven,

    >
    > >> def index(sequence):
    > >> decorated = zip(sequence, xrange(len(sequence)))
    > >> decorated.sort()
    > >> return [idx for (value, idx) in decorated]

    >
    > > would'nt that be equivalent code?

    >
    > > def index(sequence):
    > > return [c for _,c in sorted((b,a) for a, b in
    > > enumerate(sequence))]

    >
    > Or even these:
    >
    > def index(sequence):
    > return sorted(range(len(sequence)), key=sequence.__getitem__)
    >
    > def rank(sequence):
    > return sorted(range(len(sequence)),
    > key=index(sequence).__getitem__)


    Better still:

    def rank(sequence):
    return index(index(sequence))

    :)
    But really these two versions of rank are slower than the original one
    (as sorting a list is O(nlogn) whereas filling a table with
    precomputed values is O(n) ).

    Anyway I would like to contribute my own index function:

    def index(seq):
    return sum(sorted(map(list,enumerate(seq)), key=list.pop), [])

    It's short and has the advantage of being self-documenting, which will
    save Steven a lot of annoying typing I hope ;) Who said Python
    couldn't rival with perl?

    --
    Arnaud
    Arnaud Delobelle, Apr 29, 2007
    #7
  8. Steven D'Aprano

    Paul Rubin Guest

    Steven D'Aprano <> writes:
    > I recently needed to write a function to generate a rank table from a
    > list. That is, a list of ranks, where the rank of an item is the position
    > it would be in if the list were sorted:
    >
    > alist = list('defabc')
    > ranks = [3, 4, 5, 0, 1, 2]


    fst = operator.itemgetter(0) # these should be builtins...
    snd = operator.itemgetter(1)

    ranks=map(fst, sorted(enumerate(alist), key=snd))
    Paul Rubin, Apr 29, 2007
    #8
  9. On Apr 29, 5:33 pm, Paul Rubin <http://> wrote:
    > Steven D'Aprano <> writes:
    > > I recently needed to write a function to generate a rank table from a
    > > list. That is, a list of ranks, where the rank of an item is the position
    > > it would be in if the list were sorted:

    >
    > > alist = list('defabc')
    > > ranks = [3, 4, 5, 0, 1, 2]

    >
    > fst = operator.itemgetter(0) # these should be builtins...
    > snd = operator.itemgetter(1)
    >
    > ranks=map(fst, sorted(enumerate(alist), key=snd))


    This is what the OP calls the index table, not the ranks table (both
    are the same for the example above, but that's an unfortunate
    coincidence...)

    --
    Arnaud
    Arnaud Delobelle, Apr 29, 2007
    #9
  10. [Steven D'Aprano]
    > I recently needed to write a function to generate a rank table from a
    > list. That is, a list of ranks, where the rank of an item is the position
    > it would be in if the list were sorted:
    >
    > alist = list('defabc')
    > ranks = [3, 4, 5, 0, 1, 2]

    .. . .
    > def rank(sequence):
    > table = [None] * len(sequence)
    > for j, idx in enumerate(index(sequence)):
    > table[idx] = j
    > return table


    FWIW, you can do ranking faster and more succinctly with the sorted()
    builtin:

    def rank(seq):
    return sorted(range(len(seq)), key=seq.__getitem__)


    Raymond
    Raymond Hettinger, Apr 29, 2007
    #10
  11. Arnaud Delobelle <> wrote:
    ...
    > > >> decorated.sort()

    ...
    > > def index(sequence):
    > > return sorted(range(len(sequence)), key=sequence.__getitem__)

    ...
    > But really these two versions of rank are slower than the original one
    > (as sorting a list is O(nlogn) whereas filling a table with
    > precomputed values is O(n) ).


    Wrong, because the original one also had a sort step, of course, so it
    was also, inevitably, O(N log N) -- I've quoted the .sort step above.

    > Anyway I would like to contribute my own index function:
    >
    > def index(seq):
    > return sum(sorted(map(list,enumerate(seq)), key=list.pop), [])
    >
    > It's short and has the advantage of being self-documenting, which will
    > save Steven a lot of annoying typing I hope ;) Who said Python
    > couldn't rival with perl?


    sum is for summing NUMBERS -- using it on lists is O(N squared).

    So, this solution is asymptotically VERY slow, as well as obfuscated.


    Alex
    Alex Martelli, Apr 30, 2007
    #11
  12. Alex Martelli wrote:
    > Arnaud Delobelle <> wrote:
    > ...
    >>>>> decorated.sort()

    > ...
    >>> def index(sequence):
    >>> return sorted(range(len(sequence)), key=sequence.__getitem__)

    > ...
    >> But really these two versions of rank are slower than the original one
    >> (as sorting a list is O(nlogn) whereas filling a table with
    >> precomputed values is O(n) ).

    >
    > Wrong, because the original one also had a sort step, of course, so it
    > was also, inevitably, O(N log N) -- I've quoted the .sort step above.


    Well, counting the index() function that is called in both cases, the
    original rank() had one sort, but my version has two sorts.
    --
    Michael Hoffman
    Michael Hoffman, Apr 30, 2007
    #12
  13. Michael Hoffman <> wrote:

    > Alex Martelli wrote:
    > > Arnaud Delobelle <> wrote:
    > > ...
    > >>>>> decorated.sort()

    > > ...
    > >>> def index(sequence):
    > >>> return sorted(range(len(sequence)), key=sequence.__getitem__)

    > > ...
    > >> But really these two versions of rank are slower than the original one
    > >> (as sorting a list is O(nlogn) whereas filling a table with
    > >> precomputed values is O(n) ).

    > >
    > > Wrong, because the original one also had a sort step, of course, so it
    > > was also, inevitably, O(N log N) -- I've quoted the .sort step above.

    >
    > Well, counting the index() function that is called in both cases, the
    > original rank() had one sort, but my version has two sorts.


    That doesn't affet the big-O behavior -- O(N log N) holds whether you
    have one sort, or three, or twentyseven.


    Alex
    Alex Martelli, Apr 30, 2007
    #13
  14. On Apr 30, 2:50 am, (Alex Martelli) wrote:
    > Arnaud Delobelle <> wrote:
    >
    > ...
    >
    > > > >> decorated.sort()

    > ...
    > > > def index(sequence):
    > > > return sorted(range(len(sequence)), key=sequence.__getitem__)

    > ...
    > > But really these two versions of rank are slower than the original one
    > > (as sorting a list is O(nlogn) whereas filling a table with
    > > precomputed values is O(n) ).

    >
    > Wrong, because the original one also had a sort step, of course, so it
    > was also, inevitably, O(N log N) -- I've quoted the .sort step above.


    I am fully aware of the meaning of O(...). Nevertheless (speed !=
    asymptotic speed) and one sort is still better than two sorts IMHO.
    Moreover the second sort is redundant and less clear than a simple
    loop.

    > > Anyway I would like to contribute my own index function:

    >
    > > def index(seq):
    > > return sum(sorted(map(list,enumerate(seq)), key=list.pop), [])

    >
    > > It's short and has the advantage of being self-documenting, which will
    > > save Steven a lot of annoying typing I hope ;) Who said Python
    > > couldn't rival with perl?

    >
    > sum is for summing NUMBERS -- using it on lists is O(N squared).
    >
    > So, this solution is asymptotically VERY slow, as well as obfuscated.


    And it was also a JOKE. There were some clues:
    * I claimed that the function was self-documenting, even though it
    was obviously obfuscated (as you rightly pointed out).
    * It relies on a side effect of the 'key' function list.pop, which
    is very bad form.
    * It is indeed very slow (yes, sum() is not the best for lists)
    * I mentioned that Python could rival perl.

    I was meant to be a clumsy but 'concise' amalgam that would perform
    the task (although not efficiently) while being difficult to make
    sense of.

    --
    Arnaud
    Arnaud Delobelle, Apr 30, 2007
    #14
  15. Alex Martelli wrote:
    > Michael Hoffman <> wrote:
    >
    >> Alex Martelli wrote:
    >>> Arnaud Delobelle <> wrote:
    >>> ...
    >>>>>>> decorated.sort()
    >>> ...
    >>>>> def index(sequence):
    >>>>> return sorted(range(len(sequence)), key=sequence.__getitem__)
    >>> ...
    >>>> But really these two versions of rank are slower than the original one
    >>>> (as sorting a list is O(nlogn) whereas filling a table with
    >>>> precomputed values is O(n) ).
    >>> Wrong, because the original one also had a sort step, of course, so it
    >>> was also, inevitably, O(N log N) -- I've quoted the .sort step above.

    >> Well, counting the index() function that is called in both cases, the
    >> original rank() had one sort, but my version has two sorts.

    >
    > That doesn't affet the big-O behavior -- O(N log N) holds whether you
    > have one sort, or three, or twentyseven.


    I've taught programming classes before, and I would have had to fail
    anybody who misunderstood speed badly enough to claim that something
    repeating an O(N log N) algorithm 27 times was no faster than doing it
    once. ;-)

    As Arnaud points out, asymptotic behavior is not the same as speed. His
    original statement that the more recently proposed definitions of rank()
    are slower than the OP's may be correct. And if it's not, it's not
    because they're all O(N log N).
    --
    Michael Hoffman
    Michael Hoffman, Apr 30, 2007
    #15
  16. Michael Hoffman <> wrote:
    ...
    > >> Well, counting the index() function that is called in both cases, the
    > >> original rank() had one sort, but my version has two sorts.

    > >
    > > That doesn't affet the big-O behavior -- O(N log N) holds whether you
    > > have one sort, or three, or twentyseven.

    >
    > I've taught programming classes before, and I would have had to fail
    > anybody who misunderstood speed badly enough to claim that something
    > repeating an O(N log N) algorithm 27 times was no faster than doing it
    > once. ;-)


    As for me, I would fail somebody who thought they could compare speed
    this way -- if the O(N log N) executed 27 times took (on a given
    machine) 1 millisecond times N times log N, and the other one (on the
    same machine) 1 second times N times log N (and in the big-O notation
    you can NEVER infer what the multiplicative constant is), for example,
    such a comparison would be totally of base.

    > As Arnaud points out, asymptotic behavior is not the same as speed. His
    > original statement that the more recently proposed definitions of rank()
    > are slower than the OP's may be correct. And if it's not, it's not
    > because they're all O(N log N).


    And if it is, it's not because of the "one sort" versus "two sorts": by
    that sole criterion you just cannot guess (sensibly) at speed (measuring
    is much better, though it has its own pitfalls).


    Alex
    Alex Martelli, Apr 30, 2007
    #16
  17. On Apr 30, 3:51 pm, (Alex Martelli) wrote:
    > Michael Hoffman <> wrote:
    >
    > ...
    >
    > > >> Well, counting the index() function that is called in both cases, the
    > > >> original rank() had one sort, but my version has two sorts.

    >
    > > > That doesn't affet the big-O behavior -- O(N log N) holds whether you
    > > > have one sort, or three, or twentyseven.


    Again we're not talking about big-O: we're talking about which one is
    faster.

    > > I've taught programming classes before, and I would have had to fail
    > > anybody who misunderstood speed badly enough to claim that something
    > > repeating an O(N log N) algorithm 27 times was no faster than doing it
    > > once. ;-)

    >
    > As for me, I would fail somebody who thought they could compare speed
    > this way -- if the O(N log N) executed 27 times took (on a given
    > machine) 1 millisecond times N times log N, and the other one (on the
    > same machine) 1 second times N times log N (and in the big-O notation
    > you can NEVER infer what the multiplicative constant is), for example,
    > such a comparison would be totally of base.


    Of course you are right: this is basic asymptotic analysis.
    You also know very well that this is not what I or Michael were
    thinking of. He was simply saying that repeating the *same thing* 27
    times takes more time than simply doing it once, as it was made clear
    by my earlier post that his solution involved repeating the same
    index() function twice.

    > > As Arnaud points out, asymptotic behavior is not the same as speed. His
    > > original statement that the more recently proposed definitions of rank()
    > > are slower than the OP's may be correct. And if it's not, it's not
    > > because they're all O(N log N).

    >
    > And if it is, it's not because of the "one sort" versus "two sorts": by
    > that sole criterion you just cannot guess (sensibly) at speed (measuring
    > is much better, though it has its own pitfalls).


    I'm afraid you can draw some conclusions in this case, precisely
    *because* one version has one sort less than the other (that is what I
    was hinting at in my first post).

    Let's say the time complexity of the first sort is:
    f(n) ~ Knlogn
    The time complexity of the second sort is:
    g(n) ~ Lnlogn
    The time complexity of the simple loop that can replace the second
    sort is:
    h(n) ~ Hn

    Now because the whole algorithm consists of doing one sort THEN the
    second thing(sort or loop), the time complexities of the two versions
    are as follow:

    Two sort version:
    f(n)+g(n) ~ Knlogn+Lnlogn
    ~ (K+L)nlogn
    One sort version:
    f(n)+h(n) ~ Knlogn + Hn
    ~ Knlogn(1+H/Klogn)
    ~ Knlogn

    So the two sort version is (K+L)/K times slower than the one-sort
    version asymptotically.
    (in practice I reckon K=L so that makes it twice slower)

    --
    Arnaud
    "Errare humanum est, sed perseverare diabolicum"
    Arnaud Delobelle, May 1, 2007
    #17
  18. Steven D'Aprano

    Tim Roberts Guest

    Michael Hoffman <> wrote:
    >
    >Hint: if you find yourself using a decorate-sort-undecorate pattern,
    >sorted(key=func) or sequence.sort(key=func) might be a better idea.


    Is it? I thought I remember reading on this very list some years ago that
    the performance of sequence.sort became much, much worse when a key
    function was supplied.

    I've tended to favor the "Schwarzian transform" (decorate-sort-undecorate)
    because of that.
    --
    Tim Roberts,
    Providenza & Boekelheide, Inc.
    Tim Roberts, May 2, 2007
    #18
  19. On Wed, 02 May 2007 06:10:54 +0000, Tim Roberts wrote:

    > Michael Hoffman <> wrote:
    >>
    >>Hint: if you find yourself using a decorate-sort-undecorate pattern,
    >>sorted(key=func) or sequence.sort(key=func) might be a better idea.

    >
    > Is it? I thought I remember reading on this very list some years ago that
    > the performance of sequence.sort became much, much worse when a key
    > function was supplied.


    You're probably thinking of a comparison function:

    sequence.sort(cmp=lambda x, y: cmp(y, x))

    is a slow way of sorting a sequence in reverse order. The fast
    way is the two liner:

    sequence.sort()
    sequence.reverse()

    Faster(?) still is:

    sequence.sort(reverse=True)


    > I've tended to favor the "Schwarzian transform" (decorate-sort-undecorate)
    > because of that.


    That's what the key= argument does. cmp= is slow because the comparison
    function is called for EVERY comparison. The key= function is only called
    once per element.



    --
    Steven D'Aprano
    Steven D'Aprano, May 2, 2007
    #19
  20. Steven D'Aprano wrote:
    > On Wed, 02 May 2007 06:10:54 +0000, Tim Roberts wrote:


    >> I've tended to favor the "Schwarzian transform" (decorate-sort-undecorate)
    >> because of that.

    >
    > That's what the key= argument does. cmp= is slow because the comparison
    > function is called for EVERY comparison. The key= function is only called
    > once per element.


    Right. Using sort(key=keyfunc) is supposed to be faster than
    decorate-sort-undecorate. And I think it is clearer too.
    --
    Michael Hoffman
    Michael Hoffman, May 2, 2007
    #20
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. JV

    cpl VS.NET annoyances

    JV, May 31, 2005, in forum: ASP .Net
    Replies:
    0
    Views:
    475
  2. Adrienne Boswell

    OT: Annoyances in PDF

    Adrienne Boswell, Apr 27, 2007, in forum: HTML
    Replies:
    6
    Views:
    317
    Neredbojias
    Apr 28, 2007
  3. Replies:
    38
    Views:
    758
    Sion Arrowsmith
    Apr 30, 2007
  4. Seebs

    [META] Talking about talking about C.

    Seebs, Oct 23, 2010, in forum: C Programming
    Replies:
    113
    Views:
    2,269
    Tim Rentsch
    Nov 24, 2010
  5. Ben Giddings

    While we're talking Smalltalk

    Ben Giddings, Aug 18, 2003, in forum: Ruby
    Replies:
    0
    Views:
    78
    Ben Giddings
    Aug 18, 2003
Loading...

Share This Page