Feature request: sorting a list slice

G

George Sakkis

It would be useful if list.sort() accepted two more optional
parameters, start and stop, so that you can sort a slice in place. In
other words,

x = range(1000000)
x.sort(start=3, stop=-1)

would be equivalent to

x[3:-1] = sorted(x[3:-1])

but more efficient and without repeating the slice twice. If it's not
too late for feature additions, I'd love to see this in 2.5.


George
 
R

Raymond Hettinger

George said:
It would be useful if list.sort() accepted two more optional
parameters, start and stop, so that you can sort a slice in place. In
other words,

x = range(1000000)
x.sort(start=3, stop=-1)

would be equivalent to

x[3:-1] = sorted(x[3:-1])

but more efficient and without repeating the slice twice. If it's not
too late for feature additions, I'd love to see this in 2.5.

This is a false optimization. The slicing steps are O(n) and the sort
step is O(n log n) unless the data has some internal structure that
Timsort can use to get closer to O(n).

If you implemented this and timed in it real apps, I would be surprised
to find that the slice extraction and assignments were the performance
bottleneck. IMO, a worthwhile performance gain is unlikely.


Raymond
 
G

George Sakkis

Fredrik said:
useful for what? what's the use case ?

</F>

Although there is a use case (see below), I don't think it's strictly
necessary in this case. Arguments to add it:
- Don't Repeat Yourself.
- It doesn't introduce new syntax, builtin or standard library
function; only two optional keywords in an existing method.
- These keywords are already used in several list and string methods
(index,find,..) and their meaning is instantly obvious.
- Increased (even if small) efficiency. Yes, I know that in practice,
and especially in apps that involve I/O, network, DB connections and so
on, it's unlikely to be the bottleneck but this isn't a general
argument for being sloppy. The fact that having a "for i in
xrange(100000):pass" at the top of every file costs me only a few msec
is not a reason to keep it there.

IMO the only reason not to add it would be if it makes sort's
implementation much more complicate that it already is. I don't know
Timsort's details, but I doubt that this is the case; please correct me
if I am wrong.

And here's (a cut-down version of) the use case: a recursive
generalization of groupby. On each recursive call, the argument list is
sorted and regrouped. If sort() accepted a (start,end) range I could
pass this over instead of creating a slice of the input each time. It's
not hard to see that there can be an exponential number of calls wrt to
the input's length, so cutting down the average cost of each call may
be worthwhile for medium-largish sized inputs (of course it's still
exponential asymptotically, so the constant factor does not make much
difference for really large inputs).


import itertools as it

def groupBy(iterable, *keys):
return _groupBy(list(iterable), keys, len(keys))

def _groupBy(lst, keys, depth):
if depth == 0:
return lst
key = keys[-depth]
# sort group before subgrouping
lst.sort(key=key)
# group by key and then recursively do the same for each subgroup
return [(k, _groupBy(list(subgroup),keys,depth-1))
for k,subgroup in it.groupby(lst,key)]


#==== example ========================================

class Struct(object):
def __init__(self, args):
self.args = args
def __getitem__(self,index):
return self.args[index]

data = map(Struct, [
('a', 2, True),
('b', 1, False),
('a', 1, True),
('a', 1, False),
('b', 2, True),
('b', 2, True),
('a', 2, True),
])

from pprint import pprint
from operator import itemgetter
pprint(groupBy(data, itemgetter(2), itemgetter(0), itemgetter(1)))


#======= output =================================

[(False,
[('a', [(1, [<__main__.Struct object at 0xb7dc128c>])]),
('b', [(1, [<__main__.Struct object at 0xb7dc124c>])])]),
(True,
[('a',
[(1, [<__main__.Struct object at 0xb7dc126c>]),
(2,
[<__main__.Struct object at 0xb7dc122c>,
<__main__.Struct object at 0xb7dc12ec>])]),
('b',
[(2,
[<__main__.Struct object at 0xb7dc12ac>,
<__main__.Struct object at 0xb7dc12cc>])])])]


George
 
H

Heiko Wundram

Am Donnerstag 18 Mai 2006 22:13 schrieb Raymond Hettinger:
This is a false optimization. The slicing steps are O(n) and the sort
step is O(n log n) unless the data has some internal structure that
Timsort can use to get closer to O(n).

If you implemented this and timed in it real apps, I would be surprised
to find that the slice extraction and assignments were the performance
bottleneck. IMO, a worthwhile performance gain is unlikely.

I personally wouldn't find this to be a false optimization, at least not
memory-wise. If you sort really large lists, and only want to sort part of
the list, the memory gains of not having to do a slice should be enormous,
and there should be some time-gains too. And, additionally, implementing this
(if I understand timsort correctly, just had a look at the sources) isn't
hard.

Rather, I'd pose the usage question: why are you only sorting a part of a
list? lists are basically meant for homogenous data. And sorting only a part
of that should be a pretty meaningless operation, mostly...

Anyway, I've just written up a patch to extend sort() with a start/stop
parameter (which behaves just like the default slice notation). Generally,
this will make sort() setup slightly slower in the "standard" case (because
there's some more pointer arithmetic involved, but this should be negligible,
and is O(1)), but for the actual use case, the following numbers can be seen:

modelnine@phoenix ~/mercurial/python-modelnine $ ./python test.py
New average time: 13.7254650593 ms per loop
Old average time: 14.839854002 ms per loop
[10198 refs]
modelnine@phoenix ~/mercurial/python-modelnine $

This is just a very, very simple test (testing the case of a list with one
million entries, where 99000 are sorted):
from random import randrange
from time import time

x = [randrange(256) for x in xrange(1000000)]
timesnew, timesold = [], []

for reps in range(1000):
y = x[:]
timesnew.append(time())
y.sort(start=1000,stop=10000)
timesnew[-1] = time() - timesnew[-1]

for reps in range(1000):
y = x[:]
timesold.append(time())
y[1000:10000] = sorted(y[1000:10000])
timesold[-1] = time() - timesold[-1]

print "New average time:", sum(timesnew), "ms per loop"
print "Old average time:", sum(timesold), "ms per loop"
I'll post the patch to the SF bugtracker tomorrow, it's too late today for me
to review it, but generally, I wouldn't find it to be a bad addition, if
there's actually a general use case to only sort parts of a list.

--- Heiko.
 
H

Harold Fellermann

Fredrik said:
+1

useful for what? what's the use case ?

Actually, I was in need of such a construct only few weeks ago. The
task was to organize playlists
of an mp3 player. I wanted to sort future entries but not past ones
(according to an index in the
list that seperates past from future). I can imagine many more
situations in which sorting part of
a list is usefull.

While I agree that the improvement of the additional keywords is minor,
I think that the additional
load they impose on the sort function is also minor. In my oppinion,
this is a straight-forward
extension of what we already find in other places in the library. I
would like to see it in a future version.

- harold -
 
H

Heiko Wundram

Am Donnerstag 18 Mai 2006 19:27 schrieb George Sakkis:
It would be useful if list.sort() accepted two more optional
parameters, start and stop, so that you can sort a slice in place.

I've just submitted:

http://sourceforge.net/tracker/index.php?func=detail&aid=1491804&group_id=5470&atid=305470

to the bugtracker, which extends the (start, stop) keyword arguments to
list.reverse() (which I've needed more than once). The patch updates the test
suite, documentation, list object, and sorted() builtin to accept (or
specify) the new arguments.

Any comment/feedback would be appreciated.

--- Heiko.
 
G

George Sakkis

Heiko said:
Am Donnerstag 18 Mai 2006 19:27 schrieb George Sakkis:

I've just submitted:

http://sourceforge.net/tracker/index.php?func=detail&aid=1491804&group_id=5470&atid=305470

to the bugtracker, which extends the (start, stop) keyword arguments to
list.reverse() (which I've needed more than once). The patch updates the test
suite, documentation, list object, and sorted() builtin to accept (or
specify) the new arguments.

Any comment/feedback would be appreciated.

--- Heiko.

This is great, thanks Heiko ! Any idea on the chances of being
considered for inclusion in 2.5 ?

George
 
H

Heiko Wundram

Am Freitag 19 Mai 2006 23:24 schrieb George Sakkis:
This is great, thanks Heiko ! Any idea on the chances of being
considered for inclusion in 2.5 ?

Don't ask me, I'm not one of the core developers... ;-) But, anyway, the
people on python-dev are doing their best to review patches. Just: I rather
write them, than review them... ;-)

--- Heiko.
 
S

Steve Holden

The said:
useful for what? what's the use case ?
John Machin then wrote, without quoting any context at all:
Use case?
He means "under what circumstances do you see someone actually using the
proposed feature rather than just nodding and saying "that looks
useful". IOW, who would use the new feature, and why?

regards
Steve
 
R

Robert Kern

Steve said:
John Machin then wrote, without quoting any context at all:


He means "under what circumstances do you see someone actually using the
proposed feature rather than just nodding and saying "that looks
useful". IOW, who would use the new feature, and why?

I believe that John was asking George for a use case rather than asking Fredrik
what a use case was.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 
S

Steve Holden

Robert said:
I believe that John was asking George for a use case rather than asking Fredrik
what a use case was.
In which case, as I pointed out, John would have done better to quote a
little more context.

regards
Steve
 
R

Robert Kern

Steve said:
Robert Kern wrote:

In which case, as I pointed out, John would have done better to quote a
little more context.

No argument here.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 
J

John Machin

Context? The whole message asked for a new feature. Please tell me what
context I should have supplied.
 
R

Raymond Hettinger

Getting a patch ready for checkin (corrected, documented, reviewed, and
tested) is only part of the battle. The real challenge of language
design is figuring out whether something should be done.

Adding optional parameters to a method makes its invocation slower and
makes the API harder to learn and remember.

There has not yet been a thorough discussion of use cases. In
particular, people should scan their existing, deployed codebase for
examples of where a particular code fragment would have been improved.
We should then compare the fragment with and without the feature -- how
much more concise is it, how much is performance improved/harmed, is
the code more or less readable, etc.

If the perf gain is small and the use cases are infrequent, the
addition is likely unwarranted. There is an entire class of feature
requests that are more appropriate as recipes than for inclusion in the
language. If python had accepted most requests like this, it would
have become somewhat bloated and unattractive.
 
H

Heiko Wundram

Am Sonntag 21 Mai 2006 18:55 schrieb Raymond Hettinger:
If the perf gain is small and the use cases are infrequent, the
addition is likely unwarranted. There is an entire class of feature
requests that are more appropriate as recipes than for inclusion in the
language.

The thing is: having an explicit start/stop argument to reverse() and sort()
doesn't slow down method call much (it's just one if() whose body is skipped
when the parameters aren't passed, I'd say that the time that's lost here is
pretty insignificant, in the order of 10^-6 seconds, on _any_ modern
machine), and on the other hand warrants huge memory gains (if not
performance gains by saving a memcpy) when you do need to sort or reverse
slices of large lists.

I've had use cases of the latter (reversing large parts of even larger lists
in memory) in several data archiving and evaluation programs I wrote, but I
can also understand the use case that was made by George for having these
arguments for sort(), so that groupBy() can be extended easily to work on
subgroups without requiring slicing.

Anyway: having these "extensions" as a recipe won't work here: the main idea
is saving the slicing by having sort() and reverse() do the "slicing"
internally (rather, they just add a constant to the lower and upper bound,
and the user specifies these constants, the internal functions they call
already work on slices, and the current listobject.c gives them the whole
list as the slice by default).

The user can't patch this as an extension on the fly, because it requires
changes to the underlying listobject.c source. That's why George is asking
for inclusion of this patch.

I just wrote the patch because I had the time to do so, and I won't battle for
it's inclusion, but again, I see the use cases clearly, at the very least for
slice support in list.reverse() and array.reverse() (which the patch also
implements).

--- Heiko.
 
R

Robert Kern

John said:
Context? The whole message asked for a new feature. Please tell me what
context I should have supplied.

When you reply to a message, please quote part of that message. That's what was
meant by context.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,754
Messages
2,569,528
Members
45,000
Latest member
MurrayKeync

Latest Threads

Top