While we're talking about annoyances

S

Steven D'Aprano

Am I the only one who finds that I'm writing more documentation than code?

I recently needed to write a function to generate a rank table from a
list. That is, a list of ranks, where the rank of an item is the position
it would be in if the list were sorted:

alist = list('defabc')
ranks = [3, 4, 5, 0, 1, 2]

To do that, I needed to generate an index table first. In the book
"Numerical Recipes in Pascal" by William Press et al there is a procedure
to generate an index table (46 lines of code) and one for a rank table
(five lines).

In Python, my index function is four lines of code and my rank function is
five lines. I then wrote three more functions for verifying that my index
and rank tables were calculated correctly (17 more lines) and four more
lines to call doctest, giving a total of 30 lines of code.

I also have 93 lines of documentation, including doctests, or three
lines of documentation per line of code.

For those interested, here is how to generate an index table and rank
table in Python:


def index(sequence):
decorated = zip(sequence, xrange(len(sequence)))
decorated.sort()
return [idx for (value, idx) in decorated]

def rank(sequence):
table = [None] * len(sequence)
for j, idx in enumerate(index(sequence)):
table[idx] = j
return table


You can write your own damn documentation. *wink*
 
G

GHUM

Steven,
def index(sequence):
decorated = zip(sequence, xrange(len(sequence)))
decorated.sort()
return [idx for (value, idx) in decorated]

would'nt that be equivalent code?

def index(sequence):
return [c for _,c in sorted((b,a) for a, b in
enumerate(sequence))]

tested, gave same results. But worsens your doc2code ratio :)

Harald Armin Massa
--
 
M

Michael Hoffman

GHUM said:
Steven,
def index(sequence):
decorated = zip(sequence, xrange(len(sequence)))
decorated.sort()
return [idx for (value, idx) in decorated]

would'nt that be equivalent code?

def index(sequence):
return [c for _,c in sorted((b,a) for a, b in
enumerate(sequence))]

Or even these:

def index(sequence):
return sorted(range(len(sequence)), key=sequence.__getitem__)

def rank(sequence):
return sorted(range(len(sequence)),
key=index(sequence).__getitem__)

Hint: if you find yourself using a decorate-sort-undecorate pattern,
sorted(key=func) or sequence.sort(key=func) might be a better idea.
 
?

=?ISO-8859-1?Q?BJ=F6rn_Lindqvist?=

To do that, I needed to generate an index table first. In the book
"Numerical Recipes in Pascal" by William Press et al there is a procedure
to generate an index table (46 lines of code) and one for a rank table
(five lines).

51 lines total.
In Python, my index function is four lines of code and my rank function is
five lines. I then wrote three more functions for verifying that my index
and rank tables were calculated correctly (17 more lines) and four more
lines to call doctest, giving a total of 30 lines of code.

So 9 lines for Python, excluding tests.
I also have 93 lines of documentation, including doctests, or three
lines of documentation per line of code.

Then, without documentation, Python is roughly 560% (51/9) as
efficient as Pascal. But with documentation (assuming you need the
same amount of documentation for the Python code as the Pascal code),
(51 + 93)/(9 + 93) = 1.41 so only 141% as efficient as Pascal.

I wonder what that means? Maybe Python the language is approaching the
upper bound for how efficient an imperative programming language can
be? On the other hand, there seem to be some progress that could be
made to reduce the amount of work in writing documentation.
Documentation in Esperanto instead of English maybe?
 
B

Ben Finney

BJörn Lindqvist said:
On the other hand, there seem to be some progress that could be made
to reduce the amount of work in writing documentation.
Documentation in Esperanto instead of English maybe?

Lojban <URL:http://www.lojban.org/> is both easier to learn world-wide
than Euro-biased Esperanto, and computer-parseable. Seems a better[0]_
choice for computer documentation to me.


... _[0] ignoring the fact that it's spoken by even fewer people than
Esperanto.
 
J

Jarek Zgoda

Ben Finney napisa³(a):
On the other hand, there seem to be some progress that could be made
to reduce the amount of work in writing documentation.
Documentation in Esperanto instead of English maybe?

Lojban <URL:http://www.lojban.org/> is both easier to learn world-wide
than Euro-biased Esperanto, and computer-parseable. Seems a better[0]_
choice for computer documentation to me.

German seems to be less "wordy" than English, despite the fact that most
of nouns is much longer. ;)
 
A

Arnaud Delobelle

GHUM said:
Steven,
def index(sequence):
decorated = zip(sequence, xrange(len(sequence)))
decorated.sort()
return [idx for (value, idx) in decorated]
would'nt that be equivalent code?
def index(sequence):
return [c for _,c in sorted((b,a) for a, b in
enumerate(sequence))]

Or even these:

def index(sequence):
return sorted(range(len(sequence)), key=sequence.__getitem__)

def rank(sequence):
return sorted(range(len(sequence)),
key=index(sequence).__getitem__)

Better still:

def rank(sequence):
return index(index(sequence))

:)
But really these two versions of rank are slower than the original one
(as sorting a list is O(nlogn) whereas filling a table with
precomputed values is O(n) ).

Anyway I would like to contribute my own index function:

def index(seq):
return sum(sorted(map(list,enumerate(seq)), key=list.pop), [])

It's short and has the advantage of being self-documenting, which will
save Steven a lot of annoying typing I hope ;) Who said Python
couldn't rival with perl?
 
P

Paul Rubin

Steven D'Aprano said:
I recently needed to write a function to generate a rank table from a
list. That is, a list of ranks, where the rank of an item is the position
it would be in if the list were sorted:

alist = list('defabc')
ranks = [3, 4, 5, 0, 1, 2]

fst = operator.itemgetter(0) # these should be builtins...
snd = operator.itemgetter(1)

ranks=map(fst, sorted(enumerate(alist), key=snd))
 
A

Arnaud Delobelle

Steven D'Aprano said:
I recently needed to write a function to generate a rank table from a
list. That is, a list of ranks, where the rank of an item is the position
it would be in if the list were sorted:
alist = list('defabc')
ranks = [3, 4, 5, 0, 1, 2]

fst = operator.itemgetter(0) # these should be builtins...
snd = operator.itemgetter(1)

ranks=map(fst, sorted(enumerate(alist), key=snd))

This is what the OP calls the index table, not the ranks table (both
are the same for the example above, but that's an unfortunate
coincidence...)
 
R

Raymond Hettinger

[Steven D'Aprano]
I recently needed to write a function to generate a rank table from a
list. That is, a list of ranks, where the rank of an item is the position
it would be in if the list were sorted:

alist = list('defabc')
ranks = [3, 4, 5, 0, 1, 2] .. . .
def rank(sequence):
table = [None] * len(sequence)
for j, idx in enumerate(index(sequence)):
table[idx] = j
return table

FWIW, you can do ranking faster and more succinctly with the sorted()
builtin:

def rank(seq):
return sorted(range(len(seq)), key=seq.__getitem__)


Raymond
 
A

Alex Martelli

Arnaud Delobelle said:
...
But really these two versions of rank are slower than the original one
(as sorting a list is O(nlogn) whereas filling a table with
precomputed values is O(n) ).

Wrong, because the original one also had a sort step, of course, so it
was also, inevitably, O(N log N) -- I've quoted the .sort step above.
Anyway I would like to contribute my own index function:

def index(seq):
return sum(sorted(map(list,enumerate(seq)), key=list.pop), [])

It's short and has the advantage of being self-documenting, which will
save Steven a lot of annoying typing I hope ;) Who said Python
couldn't rival with perl?

sum is for summing NUMBERS -- using it on lists is O(N squared).

So, this solution is asymptotically VERY slow, as well as obfuscated.


Alex
 
M

Michael Hoffman

Alex said:
Wrong, because the original one also had a sort step, of course, so it
was also, inevitably, O(N log N) -- I've quoted the .sort step above.

Well, counting the index() function that is called in both cases, the
original rank() had one sort, but my version has two sorts.
 
A

Alex Martelli

Michael Hoffman said:
Well, counting the index() function that is called in both cases, the
original rank() had one sort, but my version has two sorts.

That doesn't affet the big-O behavior -- O(N log N) holds whether you
have one sort, or three, or twentyseven.


Alex
 
A

Arnaud Delobelle

...


Wrong, because the original one also had a sort step, of course, so it
was also, inevitably, O(N log N) -- I've quoted the .sort step above.

I am fully aware of the meaning of O(...). Nevertheless (speed !=
asymptotic speed) and one sort is still better than two sorts IMHO.
Moreover the second sort is redundant and less clear than a simple
loop.
Anyway I would like to contribute my own index function:
def index(seq):
return sum(sorted(map(list,enumerate(seq)), key=list.pop), [])
It's short and has the advantage of being self-documenting, which will
save Steven a lot of annoying typing I hope ;) Who said Python
couldn't rival with perl?

sum is for summing NUMBERS -- using it on lists is O(N squared).

So, this solution is asymptotically VERY slow, as well as obfuscated.

And it was also a JOKE. There were some clues:
* I claimed that the function was self-documenting, even though it
was obviously obfuscated (as you rightly pointed out).
* It relies on a side effect of the 'key' function list.pop, which
is very bad form.
* It is indeed very slow (yes, sum() is not the best for lists)
* I mentioned that Python could rival perl.

I was meant to be a clumsy but 'concise' amalgam that would perform
the task (although not efficiently) while being difficult to make
sense of.
 
M

Michael Hoffman

Alex said:
That doesn't affet the big-O behavior -- O(N log N) holds whether you
have one sort, or three, or twentyseven.

I've taught programming classes before, and I would have had to fail
anybody who misunderstood speed badly enough to claim that something
repeating an O(N log N) algorithm 27 times was no faster than doing it
once. ;-)

As Arnaud points out, asymptotic behavior is not the same as speed. His
original statement that the more recently proposed definitions of rank()
are slower than the OP's may be correct. And if it's not, it's not
because they're all O(N log N).
 
A

Alex Martelli

Michael Hoffman said:
I've taught programming classes before, and I would have had to fail
anybody who misunderstood speed badly enough to claim that something
repeating an O(N log N) algorithm 27 times was no faster than doing it
once. ;-)

As for me, I would fail somebody who thought they could compare speed
this way -- if the O(N log N) executed 27 times took (on a given
machine) 1 millisecond times N times log N, and the other one (on the
same machine) 1 second times N times log N (and in the big-O notation
you can NEVER infer what the multiplicative constant is), for example,
such a comparison would be totally of base.
As Arnaud points out, asymptotic behavior is not the same as speed. His
original statement that the more recently proposed definitions of rank()
are slower than the OP's may be correct. And if it's not, it's not
because they're all O(N log N).

And if it is, it's not because of the "one sort" versus "two sorts": by
that sole criterion you just cannot guess (sensibly) at speed (measuring
is much better, though it has its own pitfalls).


Alex
 
A

Arnaud Delobelle


Again we're not talking about big-O: we're talking about which one is
faster.
As for me, I would fail somebody who thought they could compare speed
this way -- if the O(N log N) executed 27 times took (on a given
machine) 1 millisecond times N times log N, and the other one (on the
same machine) 1 second times N times log N (and in the big-O notation
you can NEVER infer what the multiplicative constant is), for example,
such a comparison would be totally of base.

Of course you are right: this is basic asymptotic analysis.
You also know very well that this is not what I or Michael were
thinking of. He was simply saying that repeating the *same thing* 27
times takes more time than simply doing it once, as it was made clear
by my earlier post that his solution involved repeating the same
index() function twice.
And if it is, it's not because of the "one sort" versus "two sorts": by
that sole criterion you just cannot guess (sensibly) at speed (measuring
is much better, though it has its own pitfalls).

I'm afraid you can draw some conclusions in this case, precisely
*because* one version has one sort less than the other (that is what I
was hinting at in my first post).

Let's say the time complexity of the first sort is:
f(n) ~ Knlogn
The time complexity of the second sort is:
g(n) ~ Lnlogn
The time complexity of the simple loop that can replace the second
sort is:
h(n) ~ Hn

Now because the whole algorithm consists of doing one sort THEN the
second thing(sort or loop), the time complexities of the two versions
are as follow:

Two sort version:
f(n)+g(n) ~ Knlogn+Lnlogn
~ (K+L)nlogn
One sort version:
f(n)+h(n) ~ Knlogn + Hn
~ Knlogn(1+H/Klogn)
~ Knlogn

So the two sort version is (K+L)/K times slower than the one-sort
version asymptotically.
(in practice I reckon K=L so that makes it twice slower)
 
T

Tim Roberts

Michael Hoffman said:
Hint: if you find yourself using a decorate-sort-undecorate pattern,
sorted(key=func) or sequence.sort(key=func) might be a better idea.

Is it? I thought I remember reading on this very list some years ago that
the performance of sequence.sort became much, much worse when a key
function was supplied.

I've tended to favor the "Schwarzian transform" (decorate-sort-undecorate)
because of that.
 
S

Steven D'Aprano

Is it? I thought I remember reading on this very list some years ago that
the performance of sequence.sort became much, much worse when a key
function was supplied.

You're probably thinking of a comparison function:

sequence.sort(cmp=lambda x, y: cmp(y, x))

is a slow way of sorting a sequence in reverse order. The fast
way is the two liner:

sequence.sort()
sequence.reverse()

Faster(?) still is:

sequence.sort(reverse=True)

I've tended to favor the "Schwarzian transform" (decorate-sort-undecorate)
because of that.

That's what the key= argument does. cmp= is slow because the comparison
function is called for EVERY comparison. The key= function is only called
once per element.
 
M

Michael Hoffman

Steven said:
That's what the key= argument does. cmp= is slow because the comparison
function is called for EVERY comparison. The key= function is only called
once per element.

Right. Using sort(key=keyfunc) is supposed to be faster than
decorate-sort-undecorate. And I think it is clearer too.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,768
Messages
2,569,574
Members
45,051
Latest member
CarleyMcCr

Latest Threads

Top