# While we're talking about annoyances

Discussion in 'Python' started by Steven D'Aprano, Apr 29, 2007.

1. ### Steven D'ApranoGuest

Am I the only one who finds that I'm writing more documentation than code?

I recently needed to write a function to generate a rank table from a
list. That is, a list of ranks, where the rank of an item is the position
it would be in if the list were sorted:

alist = list('defabc')
ranks = [3, 4, 5, 0, 1, 2]

To do that, I needed to generate an index table first. In the book
"Numerical Recipes in Pascal" by William Press et al there is a procedure
to generate an index table (46 lines of code) and one for a rank table
(five lines).

In Python, my index function is four lines of code and my rank function is
five lines. I then wrote three more functions for verifying that my index
and rank tables were calculated correctly (17 more lines) and four more
lines to call doctest, giving a total of 30 lines of code.

I also have 93 lines of documentation, including doctests, or three
lines of documentation per line of code.

For those interested, here is how to generate an index table and rank
table in Python:

def index(sequence):
decorated = zip(sequence, xrange(len(sequence)))
decorated.sort()
return [idx for (value, idx) in decorated]

def rank(sequence):
table = [None] * len(sequence)
for j, idx in enumerate(index(sequence)):
table[idx] = j
return table

You can write your own damn documentation. *wink*

--
Steven.

Steven D'Aprano, Apr 29, 2007

2. ### GHUMGuest

Steven,

> def index(sequence):
> decorated = zip(sequence, xrange(len(sequence)))
> decorated.sort()
> return [idx for (value, idx) in decorated]

would'nt that be equivalent code?

def index(sequence):
return [c for _,c in sorted((b,a) for a, b in
enumerate(sequence))]

tested, gave same results. But worsens your doc2code ratio

Harald Armin Massa
--

GHUM, Apr 29, 2007

3. ### Michael HoffmanGuest

GHUM wrote:
> Steven,
>
>> def index(sequence):
>> decorated = zip(sequence, xrange(len(sequence)))
>> decorated.sort()
>> return [idx for (value, idx) in decorated]

>
> would'nt that be equivalent code?
>
> def index(sequence):
> return [c for _,c in sorted((b,a) for a, b in
> enumerate(sequence))]

Or even these:

def index(sequence):
return sorted(range(len(sequence)), key=sequence.__getitem__)

def rank(sequence):
return sorted(range(len(sequence)),
key=index(sequence).__getitem__)

Hint: if you find yourself using a decorate-sort-undecorate pattern,
sorted(key=func) or sequence.sort(key=func) might be a better idea.
--
Michael Hoffman

Michael Hoffman, Apr 29, 2007
4. ### =?ISO-8859-1?Q?BJ=F6rn_Lindqvist?=Guest

On 4/29/07, Steven D'Aprano <> wrote:
> To do that, I needed to generate an index table first. In the book
> "Numerical Recipes in Pascal" by William Press et al there is a procedure
> to generate an index table (46 lines of code) and one for a rank table
> (five lines).

51 lines total.

> In Python, my index function is four lines of code and my rank function is
> five lines. I then wrote three more functions for verifying that my index
> and rank tables were calculated correctly (17 more lines) and four more
> lines to call doctest, giving a total of 30 lines of code.

So 9 lines for Python, excluding tests.

> I also have 93 lines of documentation, including doctests, or three
> lines of documentation per line of code.

Then, without documentation, Python is roughly 560% (51/9) as
efficient as Pascal. But with documentation (assuming you need the
same amount of documentation for the Python code as the Pascal code),
(51 + 93)/(9 + 93) = 1.41 so only 141% as efficient as Pascal.

I wonder what that means? Maybe Python the language is approaching the
upper bound for how efficient an imperative programming language can
be? On the other hand, there seem to be some progress that could be
made to reduce the amount of work in writing documentation.
Documentation in Esperanto instead of English maybe?

--
mvh Björn

=?ISO-8859-1?Q?BJ=F6rn_Lindqvist?=, Apr 29, 2007
5. ### Ben FinneyGuest

"BJÃ¶rn Lindqvist" <> writes:

> On the other hand, there seem to be some progress that could be made
> to reduce the amount of work in writing documentation.
> Documentation in Esperanto instead of English maybe?

Lojban <URL:http://www.lojban.org/> is both easier to learn world-wide
than Euro-biased Esperanto, and computer-parseable. Seems a better[0]_
choice for computer documentation to me.

... _[0] ignoring the fact that it's spoken by even fewer people than
Esperanto.

--
\ "The greater the artist, the greater the doubt; perfect |
`\ confidence is granted to the less talented as a consolation |
_o__) prize." -- Robert Hughes |
Ben Finney

Ben Finney, Apr 29, 2007
6. ### Jarek ZgodaGuest

Ben Finney napisa³(a):

>> On the other hand, there seem to be some progress that could be made
>> to reduce the amount of work in writing documentation.
>> Documentation in Esperanto instead of English maybe?

>
> Lojban <URL:http://www.lojban.org/> is both easier to learn world-wide
> than Euro-biased Esperanto, and computer-parseable. Seems a better[0]_
> choice for computer documentation to me.

German seems to be less "wordy" than English, despite the fact that most
of nouns is much longer.

--
Jarek Zgoda
http://jpa.berlios.de/

Jarek Zgoda, Apr 29, 2007
7. ### Arnaud DelobelleGuest

On Apr 29, 11:46 am, Michael Hoffman <> wrote:
> GHUM wrote:
> > Steven,

>
> >> def index(sequence):
> >> decorated = zip(sequence, xrange(len(sequence)))
> >> decorated.sort()
> >> return [idx for (value, idx) in decorated]

>
> > would'nt that be equivalent code?

>
> > def index(sequence):
> > return [c for _,c in sorted((b,a) for a, b in
> > enumerate(sequence))]

>
> Or even these:
>
> def index(sequence):
> return sorted(range(len(sequence)), key=sequence.__getitem__)
>
> def rank(sequence):
> return sorted(range(len(sequence)),
> key=index(sequence).__getitem__)

Better still:

def rank(sequence):
return index(index(sequence))

But really these two versions of rank are slower than the original one
(as sorting a list is O(nlogn) whereas filling a table with
precomputed values is O(n) ).

Anyway I would like to contribute my own index function:

def index(seq):
return sum(sorted(map(list,enumerate(seq)), key=list.pop), [])

It's short and has the advantage of being self-documenting, which will
save Steven a lot of annoying typing I hope Who said Python
couldn't rival with perl?

--
Arnaud

Arnaud Delobelle, Apr 29, 2007
8. ### Paul RubinGuest

Steven D'Aprano <> writes:
> I recently needed to write a function to generate a rank table from a
> list. That is, a list of ranks, where the rank of an item is the position
> it would be in if the list were sorted:
>
> alist = list('defabc')
> ranks = [3, 4, 5, 0, 1, 2]

fst = operator.itemgetter(0) # these should be builtins...
snd = operator.itemgetter(1)

ranks=map(fst, sorted(enumerate(alist), key=snd))

Paul Rubin, Apr 29, 2007
9. ### Arnaud DelobelleGuest

On Apr 29, 5:33 pm, Paul Rubin <http://> wrote:
> Steven D'Aprano <> writes:
> > I recently needed to write a function to generate a rank table from a
> > list. That is, a list of ranks, where the rank of an item is the position
> > it would be in if the list were sorted:

>
> > alist = list('defabc')
> > ranks = [3, 4, 5, 0, 1, 2]

>
> fst = operator.itemgetter(0) # these should be builtins...
> snd = operator.itemgetter(1)
>
> ranks=map(fst, sorted(enumerate(alist), key=snd))

This is what the OP calls the index table, not the ranks table (both
are the same for the example above, but that's an unfortunate
coincidence...)

--
Arnaud

Arnaud Delobelle, Apr 29, 2007
10. ### Raymond HettingerGuest

[Steven D'Aprano]
> I recently needed to write a function to generate a rank table from a
> list. That is, a list of ranks, where the rank of an item is the position
> it would be in if the list were sorted:
>
> alist = list('defabc')
> ranks = [3, 4, 5, 0, 1, 2]

.. . .
> def rank(sequence):
> table = [None] * len(sequence)
> for j, idx in enumerate(index(sequence)):
> table[idx] = j
> return table

FWIW, you can do ranking faster and more succinctly with the sorted()
builtin:

def rank(seq):
return sorted(range(len(seq)), key=seq.__getitem__)

Raymond

Raymond Hettinger, Apr 29, 2007
11. ### Alex MartelliGuest

Arnaud Delobelle <> wrote:
...
> > >> decorated.sort()

...
> > def index(sequence):
> > return sorted(range(len(sequence)), key=sequence.__getitem__)

...
> But really these two versions of rank are slower than the original one
> (as sorting a list is O(nlogn) whereas filling a table with
> precomputed values is O(n) ).

Wrong, because the original one also had a sort step, of course, so it
was also, inevitably, O(N log N) -- I've quoted the .sort step above.

> Anyway I would like to contribute my own index function:
>
> def index(seq):
> return sum(sorted(map(list,enumerate(seq)), key=list.pop), [])
>
> It's short and has the advantage of being self-documenting, which will
> save Steven a lot of annoying typing I hope Who said Python
> couldn't rival with perl?

sum is for summing NUMBERS -- using it on lists is O(N squared).

So, this solution is asymptotically VERY slow, as well as obfuscated.

Alex

Alex Martelli, Apr 30, 2007
12. ### Michael HoffmanGuest

Alex Martelli wrote:
> Arnaud Delobelle <> wrote:
> ...
>>>>> decorated.sort()

> ...
>>> def index(sequence):
>>> return sorted(range(len(sequence)), key=sequence.__getitem__)

> ...
>> But really these two versions of rank are slower than the original one
>> (as sorting a list is O(nlogn) whereas filling a table with
>> precomputed values is O(n) ).

>
> Wrong, because the original one also had a sort step, of course, so it
> was also, inevitably, O(N log N) -- I've quoted the .sort step above.

Well, counting the index() function that is called in both cases, the
original rank() had one sort, but my version has two sorts.
--
Michael Hoffman

Michael Hoffman, Apr 30, 2007
13. ### Alex MartelliGuest

Michael Hoffman <> wrote:

> Alex Martelli wrote:
> > Arnaud Delobelle <> wrote:
> > ...
> >>>>> decorated.sort()

> > ...
> >>> def index(sequence):
> >>> return sorted(range(len(sequence)), key=sequence.__getitem__)

> > ...
> >> But really these two versions of rank are slower than the original one
> >> (as sorting a list is O(nlogn) whereas filling a table with
> >> precomputed values is O(n) ).

> >
> > Wrong, because the original one also had a sort step, of course, so it
> > was also, inevitably, O(N log N) -- I've quoted the .sort step above.

>
> Well, counting the index() function that is called in both cases, the
> original rank() had one sort, but my version has two sorts.

That doesn't affet the big-O behavior -- O(N log N) holds whether you
have one sort, or three, or twentyseven.

Alex

Alex Martelli, Apr 30, 2007
14. ### Arnaud DelobelleGuest

On Apr 30, 2:50 am, (Alex Martelli) wrote:
> Arnaud Delobelle <> wrote:
>
> ...
>
> > > >> decorated.sort()

> ...
> > > def index(sequence):
> > > return sorted(range(len(sequence)), key=sequence.__getitem__)

> ...
> > But really these two versions of rank are slower than the original one
> > (as sorting a list is O(nlogn) whereas filling a table with
> > precomputed values is O(n) ).

>
> Wrong, because the original one also had a sort step, of course, so it
> was also, inevitably, O(N log N) -- I've quoted the .sort step above.

I am fully aware of the meaning of O(...). Nevertheless (speed !=
asymptotic speed) and one sort is still better than two sorts IMHO.
Moreover the second sort is redundant and less clear than a simple
loop.

> > Anyway I would like to contribute my own index function:

>
> > def index(seq):
> > return sum(sorted(map(list,enumerate(seq)), key=list.pop), [])

>
> > It's short and has the advantage of being self-documenting, which will
> > save Steven a lot of annoying typing I hope Who said Python
> > couldn't rival with perl?

>
> sum is for summing NUMBERS -- using it on lists is O(N squared).
>
> So, this solution is asymptotically VERY slow, as well as obfuscated.

And it was also a JOKE. There were some clues:
* I claimed that the function was self-documenting, even though it
was obviously obfuscated (as you rightly pointed out).
* It relies on a side effect of the 'key' function list.pop, which
* It is indeed very slow (yes, sum() is not the best for lists)
* I mentioned that Python could rival perl.

I was meant to be a clumsy but 'concise' amalgam that would perform
the task (although not efficiently) while being difficult to make
sense of.

--
Arnaud

Arnaud Delobelle, Apr 30, 2007
15. ### Michael HoffmanGuest

Alex Martelli wrote:
> Michael Hoffman <> wrote:
>
>> Alex Martelli wrote:
>>> Arnaud Delobelle <> wrote:
>>> ...
>>>>>>> decorated.sort()
>>> ...
>>>>> def index(sequence):
>>>>> return sorted(range(len(sequence)), key=sequence.__getitem__)
>>> ...
>>>> But really these two versions of rank are slower than the original one
>>>> (as sorting a list is O(nlogn) whereas filling a table with
>>>> precomputed values is O(n) ).
>>> Wrong, because the original one also had a sort step, of course, so it
>>> was also, inevitably, O(N log N) -- I've quoted the .sort step above.

>> Well, counting the index() function that is called in both cases, the
>> original rank() had one sort, but my version has two sorts.

>
> That doesn't affet the big-O behavior -- O(N log N) holds whether you
> have one sort, or three, or twentyseven.

I've taught programming classes before, and I would have had to fail
anybody who misunderstood speed badly enough to claim that something
repeating an O(N log N) algorithm 27 times was no faster than doing it
once. ;-)

As Arnaud points out, asymptotic behavior is not the same as speed. His
original statement that the more recently proposed definitions of rank()
are slower than the OP's may be correct. And if it's not, it's not
because they're all O(N log N).
--
Michael Hoffman

Michael Hoffman, Apr 30, 2007
16. ### Alex MartelliGuest

Michael Hoffman <> wrote:
...
> >> Well, counting the index() function that is called in both cases, the
> >> original rank() had one sort, but my version has two sorts.

> >
> > That doesn't affet the big-O behavior -- O(N log N) holds whether you
> > have one sort, or three, or twentyseven.

>
> I've taught programming classes before, and I would have had to fail
> anybody who misunderstood speed badly enough to claim that something
> repeating an O(N log N) algorithm 27 times was no faster than doing it
> once. ;-)

As for me, I would fail somebody who thought they could compare speed
this way -- if the O(N log N) executed 27 times took (on a given
machine) 1 millisecond times N times log N, and the other one (on the
same machine) 1 second times N times log N (and in the big-O notation
you can NEVER infer what the multiplicative constant is), for example,
such a comparison would be totally of base.

> As Arnaud points out, asymptotic behavior is not the same as speed. His
> original statement that the more recently proposed definitions of rank()
> are slower than the OP's may be correct. And if it's not, it's not
> because they're all O(N log N).

And if it is, it's not because of the "one sort" versus "two sorts": by
that sole criterion you just cannot guess (sensibly) at speed (measuring
is much better, though it has its own pitfalls).

Alex

Alex Martelli, Apr 30, 2007
17. ### Arnaud DelobelleGuest

On Apr 30, 3:51 pm, (Alex Martelli) wrote:
> Michael Hoffman <> wrote:
>
> ...
>
> > >> Well, counting the index() function that is called in both cases, the
> > >> original rank() had one sort, but my version has two sorts.

>
> > > That doesn't affet the big-O behavior -- O(N log N) holds whether you
> > > have one sort, or three, or twentyseven.

Again we're not talking about big-O: we're talking about which one is
faster.

> > I've taught programming classes before, and I would have had to fail
> > anybody who misunderstood speed badly enough to claim that something
> > repeating an O(N log N) algorithm 27 times was no faster than doing it
> > once. ;-)

>
> As for me, I would fail somebody who thought they could compare speed
> this way -- if the O(N log N) executed 27 times took (on a given
> machine) 1 millisecond times N times log N, and the other one (on the
> same machine) 1 second times N times log N (and in the big-O notation
> you can NEVER infer what the multiplicative constant is), for example,
> such a comparison would be totally of base.

Of course you are right: this is basic asymptotic analysis.
You also know very well that this is not what I or Michael were
thinking of. He was simply saying that repeating the *same thing* 27
times takes more time than simply doing it once, as it was made clear
by my earlier post that his solution involved repeating the same
index() function twice.

> > As Arnaud points out, asymptotic behavior is not the same as speed. His
> > original statement that the more recently proposed definitions of rank()
> > are slower than the OP's may be correct. And if it's not, it's not
> > because they're all O(N log N).

>
> And if it is, it's not because of the "one sort" versus "two sorts": by
> that sole criterion you just cannot guess (sensibly) at speed (measuring
> is much better, though it has its own pitfalls).

I'm afraid you can draw some conclusions in this case, precisely
*because* one version has one sort less than the other (that is what I
was hinting at in my first post).

Let's say the time complexity of the first sort is:
f(n) ~ Knlogn
The time complexity of the second sort is:
g(n) ~ Lnlogn
The time complexity of the simple loop that can replace the second
sort is:
h(n) ~ Hn

Now because the whole algorithm consists of doing one sort THEN the
second thing(sort or loop), the time complexities of the two versions
are as follow:

Two sort version:
f(n)+g(n) ~ Knlogn+Lnlogn
~ (K+L)nlogn
One sort version:
f(n)+h(n) ~ Knlogn + Hn
~ Knlogn(1+H/Klogn)
~ Knlogn

So the two sort version is (K+L)/K times slower than the one-sort
version asymptotically.
(in practice I reckon K=L so that makes it twice slower)

--
Arnaud
"Errare humanum est, sed perseverare diabolicum"

Arnaud Delobelle, May 1, 2007
18. ### Tim RobertsGuest

Michael Hoffman <> wrote:
>
>Hint: if you find yourself using a decorate-sort-undecorate pattern,
>sorted(key=func) or sequence.sort(key=func) might be a better idea.

Is it? I thought I remember reading on this very list some years ago that
the performance of sequence.sort became much, much worse when a key
function was supplied.

I've tended to favor the "Schwarzian transform" (decorate-sort-undecorate)
because of that.
--
Tim Roberts,
Providenza & Boekelheide, Inc.

Tim Roberts, May 2, 2007
19. ### Steven D'ApranoGuest

On Wed, 02 May 2007 06:10:54 +0000, Tim Roberts wrote:

> Michael Hoffman <> wrote:
>>
>>Hint: if you find yourself using a decorate-sort-undecorate pattern,
>>sorted(key=func) or sequence.sort(key=func) might be a better idea.

>
> Is it? I thought I remember reading on this very list some years ago that
> the performance of sequence.sort became much, much worse when a key
> function was supplied.

You're probably thinking of a comparison function:

sequence.sort(cmp=lambda x, y: cmp(y, x))

is a slow way of sorting a sequence in reverse order. The fast
way is the two liner:

sequence.sort()
sequence.reverse()

Faster(?) still is:

sequence.sort(reverse=True)

> I've tended to favor the "Schwarzian transform" (decorate-sort-undecorate)
> because of that.

That's what the key= argument does. cmp= is slow because the comparison
function is called for EVERY comparison. The key= function is only called
once per element.

--
Steven D'Aprano

Steven D'Aprano, May 2, 2007
20. ### Michael HoffmanGuest

Steven D'Aprano wrote:
> On Wed, 02 May 2007 06:10:54 +0000, Tim Roberts wrote:

>> I've tended to favor the "Schwarzian transform" (decorate-sort-undecorate)
>> because of that.

>
> That's what the key= argument does. cmp= is slow because the comparison
> function is called for EVERY comparison. The key= function is only called
> once per element.

Right. Using sort(key=keyfunc) is supposed to be faster than
decorate-sort-undecorate. And I think it is clearer too.
--
Michael Hoffman

Michael Hoffman, May 2, 2007