min max from tuples in list

  • Thread starter Robert Voigtländer
  • Start date
R

Robert Voigtländer

Hi,

I have a list like this:

a = [(52, 193), (52, 193), (52, 192), (51, 193), (51, 191), (51, 190), (51, 189), (51, 188), (50, 194), (50, 187), (50, 186), (50, 185), (50, 184), (49, 194), (49, 183), (49, 182), (49, 181), (48, 194), (48, 180), (48, 179), (48, 178), (48, 177), (47, 194), (47, 176), (47, 175), (47, 174), (47, 173), (46, 195), (46, 172), (46, 171), (46, 170), (46, 169), (45, 195), (45, 168), (45, 167), (45, 166), (44, 195), (44, 165), (44, 164), (44, 163), (44, 162), (43, 195), (43, 161), (43, 160), (43, 159), (43, 158), (42, 196), (42, 157), (42, 156), (42, 155), (41, 196), (41, 154), (41, 153), (41, 152),(41, 151), (40, 196), (40, 150), (40, 149), (40, 148), (40, 147), (39, 196), (39, 146), (39, 145), (39, 144), (39, 143), (38, 196), (38, 142), (38, 141), (38, 140), (37, 197), (37, 139), (37, 138), (37, 137), (37, 136), (36,197), (36, 135), (36, 134), (36, 133)]


I need to find a -performant- way to transform this into a list with tuples(a[0],[a[0][1]min],[a[0][1]max]).

Hard to explaint what I mean .. [0] of the first three tuples is 52. [1] is193,193 and 192.
What I need as result for these three tuples is: (52,192,193).

For the next five tuples it is (51,188,193).


Extra challenges:
- This list is sorted. For performance reasons I would like to keep it unsorted.
- There may be tuples where min=max.
- There my be tupples where [0] only exists once. So mix is automatically max


I hope I was able to explain what I mean.

Thanks
Robert
 
C

Chris Angelico

I need to find a -performant- way to transform this into a list with tuples (a[0],[a[0][1]min],[a[0][1]max]).

Hard to explaint what I mean .. [0] of the first three tuples is 52. [1] is 193,193 and 192.
What I need as result for these three tuples is: (52,192,193).

For the next five tuples it is (51,188,193).


Extra challenges:
- This list is sorted. For performance reasons I would like to keep it unsorted.
- There may be tuples where min=max.
- There my be tupples where [0] only exists once. So mix is automaticallymax

Yep, I see what you mean! Apart from the first of the challenges,
which is ambiguous: do you mean you'd rather be able to work with it
unsorted, or is that a typo, "keep it sorted"?

This is a common task of aggregation. Your list is of (key, value)
tuples, and you want to do some per-key statistics. Here are three
variants on the code:

# Fastest version, depends on the keys being already grouped
# and the values sorted within each group. It actually returns
# the last and first, not the smallest and largest.
def min_max_1(lst):
prev_key = None
for key, value in lst:
if key != prev_key:
if prev_key is not None: yield prev_key, value, key_max
key_max = value
if prev_key is not None: yield prev_key, value, key_max

# This version depends on the keys being grouped, but
# not on them being sorted within the groups.
def min_max_2(lst):
prev_key = None
for key, value in lst:
if key != prev_key:
if prev_key is not None: yield prev_key, key_min, key_max
key_min = key_max = value
else:
key_min = min(key_min, value)
key_max = min(key_max, value)
if prev_key is not None: yield prev_key, key_min, key_max

# Slowest version, does not depend on either the keys
# or the values being sorted. Will iterate over the entire
# list before producing any results. Returns tuples in
# arbitrary order, unlike the others (which will retain).
def min_max_3(lst):
data = {}
for key, value in lst:
if key not in data:
data[key]=(value, value)
else:
data[key][0] = min(data[key][0], value)
data[key][1] = min(data[key][1], value)
for key, minmax in data.items():
yield key, minmax[0], minmax[1]

Each of these is a generator that yields (key, min, max) tuples. The
third one needs the most memory and execution time; the others simply
take the input as it comes. None of them actually requires that the
input be a list - any iterable will do.

ChrisA
 
R

Robert Voigtländer

Wow, thanks for the educating answer. I'll work through all the varaints.
And yes, I meant keep it unsorted.

As I read it, sorting may be required then if I don't want to use the slowest variant. I'll test them all.

Thanks
Robert
 
P

Peter Otten

Robert said:
Hi,

I have a list like this:

a = [(52, 193), (52, 193), (52, 192), (51, 193), (51, 191), (51, 190),
(51, 189), (51, 188), (50, 194), (50, 187), (50, 186), (50, 185), (50,
184), (49, 194), (49, 183), (49, 182), (49, 181), (48, 194), (48, 180),
(48, 179), (48, 178), (48, 177), (47, 194), (47, 176), (47, 175), (47,
174), (47, 173), (46, 195), (46, 172), (46, 171), (46, 170), (46, 169),
(45, 195), (45, 168), (45, 167), (45, 166), (44, 195), (44, 165), (44,
164), (44, 163), (44, 162), (43, 195), (43, 161), (43, 160), (43, 159),
(43, 158), (42, 196), (42, 157), (42, 156), (42, 155), (41, 196), (41,
154), (41, 153), (41, 152), (41, 151), (40, 196), (40, 150), (40, 149),
(40, 148), (40, 147), (39, 196), (39, 146), (39, 145), (39, 144), (39,
143), (38, 196), (38, 142), (38, 141), (38, 140), (37, 197), (37, 139),
(37, 138), (37
, 137), (37, 136), (36, 197), (36, 135), (36, 134), (36, 133)]


I need to find a -performant- way to transform this into a list with
tuples (a[0],[a[0][1]min],[a[0][1]max]).

Hard to explaint what I mean .. [0] of the first three tuples is 52. [1]
is 193,193 and 192. What I need as result for these three tuples is:
(52,192,193).

For the next five tuples it is (51,188,193).


Extra challenges:
- This list is sorted. For performance reasons I would like to keep it
unsorted. - There may be tuples where min=max.
- There my be tupples where [0] only exists once. So mix is automatically
max


I hope I was able to explain what I mean.

I have a hunch that sorting the list might be quite efficient. You should at
least try

import operator
import itertools

a = ...
a.sort()
result = []
for key, group in itertools.groupby(a, key=operator.itemgetter(0)):
minpair = maxpair = next(group)
for maxpair in group:
pass
result.append((key, minpair[1], maxpair[1]))

for item in result:
print(item)

to see whether it is good enough.
 
C

Chris Angelico

Wow, thanks for the educating answer. I'll work through all the varaints.
And yes, I meant keep it unsorted.

As I read it, sorting may be required then if I don't want to use the slowest variant. I'll test them all.

Sorting would be slower than using the dictionary method. Much slower.
If you don't need to sort for any other reason, don't sort!

By the way, it's usually courteous to quote at least some of the text
you're responding to, to give context. Otherwise nobody knows what
you're talking about. This isn't a PHPBB forum; each post actually is
quite separate. You may find that Google Groups makes your life
unnecessarily difficult in this, so it's probably easier to use the
mailing list:

https://mail.python.org/mailman/listinfo/python-list

ChrisA
 
J

Jussi Piitulainen

Robert said:
Hi,

I have a list like this:

# shortened:
a = [(52, 193), (52, 193), (52, 192),
(51, 193), (51, 191), (51, 190), (51, 189), (51, 188),
(50, 194)]
I need to find a -performant- way to transform this into a list with
tuples (a[0],[a[0][1]min],[a[0][1]max]).

Hard to explaint what I mean .. [0] of the first three tuples is
52. [1] is 193,193 and 192. What I need as result for these three
tuples is: (52,192,193).

For the next five tuples it is (51,188,193).

I think you want [(50, 194, 194), (51, 188, 193), (52, 192, 193)] for
the shortened list, in some order. Then you might be happy with a dict
instead. If I misunderstand, the following will be irrelevant.
Extra challenges:
- This list is sorted. For performance reasons I would like to keep
it unsorted.
- There may be tuples where min=max.
- There my be tupples where [0] only exists once. So mix is
automatically max

Scan the list and keep the running min and max for each key in a dict.

mix = dict()
for k,v in a:
mix[k] = ( min(mix.get(k, (v,v))[0], v),
max(mix.get(k, (v,v))[1], v) )

# mix == {50: (194, 194), 51: (188, 193), 52: (192, 193)}
# As list: [ (k,v[0],v[1]) for k,v in mix.items() ]
 
P

Peter Otten

Peter said:
Robert said:
Hi,

I have a list like this:

a = [(52, 193), (52, 193), (52, 192), (51, 193), (51, 191), (51, 190),
(51, 189), (51, 188), (50, 194), (50, 187), (50, 186), (50, 185), (50,
184), (49, 194), (49, 183), (49, 182), (49, 181), (48, 194), (48, 180),
(48, 179), (48, 178), (48, 177), (47, 194), (47, 176), (47, 175), (47,
174), (47, 173), (46, 195), (46, 172), (46, 171), (46, 170), (46, 169),
(45, 195), (45, 168), (45, 167), (45, 166), (44, 195), (44, 165), (44,
164), (44, 163), (44, 162), (43, 195), (43, 161), (43, 160), (43, 159),
(43, 158), (42, 196), (42, 157), (42, 156), (42, 155), (41, 196), (41,
154), (41, 153), (41, 152), (41, 151), (40, 196), (40, 150), (40, 149),
(40, 148), (40, 147), (39, 196), (39, 146), (39, 145), (39, 144), (39,
143), (38, 196), (38, 142), (38, 141), (38, 140), (37, 197), (37, 139),
(37, 138), (37
, 137), (37, 136), (36, 197), (36, 135), (36, 134), (36, 133)]


I need to find a -performant- way to transform this into a list with
tuples (a[0],[a[0][1]min],[a[0][1]max]).

Hard to explaint what I mean .. [0] of the first three tuples is 52. [1]
is 193,193 and 192. What I need as result for these three tuples is:
(52,192,193).

For the next five tuples it is (51,188,193).


Extra challenges:
- This list is sorted. For performance reasons I would like to keep it
unsorted. - There may be tuples where min=max.
- There my be tupples where [0] only exists once. So mix is automatically
max


I hope I was able to explain what I mean.

I have a hunch that sorting the list might be quite efficient. You should
at least try

import operator
import itertools

a = ...
a.sort()
result = []
for key, group in itertools.groupby(a, key=operator.itemgetter(0)):
minpair = maxpair = next(group)
for maxpair in group:
pass
result.append((key, minpair[1], maxpair[1]))

for item in result:
print(item)

to see whether it is good enough.

On second thought -- Chris Angelico may be right ;)
So here's my dict-based variant:

def keyminmax(items):
d = collections.defaultdict(list)
for key, value in items:
d[key].append(value)
for key, values in d.items(): # d.iteritems() in python 2
yield key, min(values), max(values)

for item in keyminmax(a):
print(item)

It uses a bit more memory than Chris' code, but has fewer min()/max() calls.
 
S

Steven D'Aprano

Hi,

I have a list like this:

a = [(52, 193), (52, 193), (52, 192), ...


I need to find a -performant- way to transform this into a list with
tuples (a[0],[a[0][1]min],[a[0][1]max]).

I'm afraid I don't know what you mean by "performant". It doesn't appear
to be an English word, so far as I can tell. Do you mean efficient?

Hard to explaint what I mean .. [0] of the first three tuples is 52. [1]
is 193,193 and 192. What I need as result for these three tuples is:
(52,192,193).

For the next five tuples it is (51,188,193).


Extra challenges:
- This list is sorted. For performance reasons I would like to keep it
unsorted.

Normally when people talk about "performance", they mean to *maximise*
it, not minimise it. If you can guarantee that your data is already pre-
sorted, then the resulting algorithm is likely to be much more efficient
than one which doesn't require sorted data.

In any case, sorting in Python is amazingly fast. You may be pleasantly
surprised that a version that sorts your data, while nominally
O(N log N), may be much faster than an O(N) solution that doesn't require
sorted data. If I were a betting man, I'd be willing to wager a shiny new
dollar[1] that sorting works out faster for reasonable sized sets of data.

One general piece of advice for speeding up Python code: the more work
you can delegate to Python built-ins, like sort, min, max and so forth,
the faster your code is likely to be.

- There may be tuples where min=max.
- There my be tupples where [0] only exists once. So mix is
automatically max

Neither of these should be a problem.


The first approach I'd try would be:

- sort the list, which will group the tuples by their first item;

- conveniently, it will also ensure that the first tuple in each group
will have the minimum second item, and the last tuple of that group the
maximum value;

- use itertools.groupby to extract each group;

- grab the first tuple from the group, and take both its items;

- grab the last tuple from the group (if any), and take its second item;

- yield those three items.


Something like this should work, I expect:

from collections import deque
from itertools import groupby
from operator import itemgetter

def collect(data):
data = sorted(data)
groups = groupby(data, itemgetter(0))
d = deque([], maxlen=1)
for key, subiter in groups:
smallest = largest = next(subiter)[1]
d.extend(subiter)
try:
largest = d.pop()[1]
except IndexError:
pass
yield (key, smallest, largest)



And in use:

py> data = [(1, 4), (5, 3), (2, 7), (2, 3), (1, 3), (2, 5),
.... (5, 3), (4, 9)]
py> list(collect(data))
[(1, 3, 4), (2, 3, 7), (4, 9, 9), (5, 3, 3)]




[1] Betting man or not, nobody ever accused me of being a spend-thrift.
 
T

Tim Chase

In any case, sorting in Python is amazingly fast. You may be
pleasantly surprised that a version that sorts your data, while
nominally O(N log N), may be much faster than an O(N) solution that
doesn't require sorted data. If I were a betting man, I'd be
willing to wager a shiny new dollar[1] that sorting works out
faster for reasonable sized sets of data.

An interesting observation given the "Optimizing list processing"
thread you recently opened about algorithms for processing large
volumes of data and finding that elbow where two algorithms cross on
the performance graph. :)

-tkc
 
M

MRAB

Hi,

I have a list like this:

a = [(52, 193), (52, 193), (52, 192), ...


I need to find a -performant- way to transform this into a list with
tuples (a[0],[a[0][1]min],[a[0][1]max]).

I'm afraid I don't know what you mean by "performant". It doesn't appear
to be an English word, so far as I can tell. Do you mean efficient?
[snip]

There's some debate over whether it's English or not:

http://english.stackexchange.com/questions/38945/what-is-wrong-with-the-word-performant
 
P

Peter Otten

Steven said:
In any case, sorting in Python is amazingly fast. You may be pleasantly
surprised that a version that sorts your data, while nominally
O(N log N), may be much faster than an O(N) solution that doesn't require
sorted data. If I were a betting man, I'd be willing to wager a shiny new
dollar[1] that sorting works out faster for reasonable sized sets of data.

Well, that was my first reaction, too. But then


$ cat keyminmax.py
import operator
import itertools
import collections

def minmax_groupby(items):
for key, group in itertools.groupby(sorted(items),
key=operator.itemgetter(0)):
minpair = maxpair = next(group)
for maxpair in group:
pass
yield key, minpair[1], maxpair[1]

def minmax_dict(items):
d = collections.defaultdict(list)
for key, value in items:
d[key].append(value)
for key, values in d.items():
yield key, min(values), max(values)

a = [(52, 193), (52, 193), (52, 192), (51, 193), (51, 191), (51, 190),
(51, 189), (51, 188), (50, 194), (50, 187),(50, 186), (50, 185),
(50, 184), (49, 194), (49, 183), (49, 182), (49, 181), (48, 194),
(48, 180), (48, 179), (48, 178), (48, 177), (47, 194), (47, 176),
(47, 175), (47, 174), (47, 173), (46, 195), (46, 172), (46, 171),
(46, 170), (46, 169), (45, 195), (45, 168), (45, 167), (45, 166),
(44, 195), (44, 165), (44, 164), (44, 163), (44, 162), (43, 195),
(43, 161), (43, 160), (43, 159), (43, 158), (42, 196), (42, 157),
(42, 156), (42, 155), (41, 196), (41, 154), (41, 153), (41, 152),
(41, 151), (40, 196), (40, 150), (40, 149), (40, 148), (40, 147),
(39, 196), (39, 146), (39, 145), (39, 144), (39, 143), (38, 196),
(38, 142), (38, 141), (38, 140), (37, 197), (37, 139), (37, 138),
(37, 137), (37, 136), (36, 197), (36, 135), (36, 134), (36, 133)]

from collections import deque
from itertools import groupby
from operator import itemgetter

def collect(data):
data = sorted(data)
groups = groupby(data, itemgetter(0))
d = deque([], maxlen=1)
for key, subiter in groups:
smallest = largest = next(subiter)[1]
d.extend(subiter)
try:
largest = d.pop()[1]
except IndexError:
pass
yield (key, smallest, largest)


def time_dict():
for item in minmax_dict(a):
pass

def time_groupby():
for item in minmax_groupby(a):
pass

def time_daprano():
for item in collect(a):
pass

$ python -m timeit -s 'from keyminmax import time_groupby as t' 't()'
10000 loops, best of 3: 68.6 usec per loop
$ python -m timeit -s 'from keyminmax import time_dict as t' 't()'
10000 loops, best of 3: 53.3 usec per loop
$ python -m timeit -s 'from keyminmax import time_daprano as t' 't()'
10000 loops, best of 3: 75.7 usec per loop

So yes, sorting seems to be slower even for small datasets.
 
R

Roy Smith

Steven D'Aprano said:
I'm afraid I don't know what you mean by "performant".

I've heard the term used often. It means something like, "performs
well" or "runs fast". It may or may not be an English word, but that
doesn't stop people from using it :)
 
M

Mark Lawrence

I've heard the term used often. It means something like, "performs
well" or "runs fast". It may or may not be an English word, but that
doesn't stop people from using it :)

If "google" can be used to mean "make huge amouts of money with a
product that is inherently flawed" then I'll happily accept "performant"
as an English word, regardless of whether the English variant is UK, US,
Australian, New Zealand, Soth African, Geordie, Glaswegian or any other :)
 
S

Steven D'Aprano

Hi,

I have a list like this:

a = [(52, 193), (52, 193), (52, 192), ...


I need to find a -performant- way to transform this into a list with
tuples (a[0],[a[0][1]min],[a[0][1]max]).

I'm afraid I don't know what you mean by "performant". It doesn't
appear to be an English word, so far as I can tell. Do you mean
efficient?
[snip]

There's some debate over whether it's English or not:

http://english.stackexchange.com/questions/38945/what-is-wrong-with-the-
word-performant


Ah! I've never come across it before, and it wasn't in any of the
electronic dictionaries I tried, nor the dead-tree Shorter Oxford.

I think I don't dislike it.
 
D

Denis McMahon

I have a list like this:

a = [(52, 193), ...... (36, 133)]

# iterate over the list of tuples
# creates a dictionary n0:[n1a, n1b, n1c ... ]
# from tuples (n0,n1a), (n0,n1b), (n0,n1c) ...

b = {}
for x in a:
if x[0] in b:
pass
else:
b[x[0]] = []
if x[1] not in b[x[0]]:
b[x[0]].append( x[1] )

# iterate over the dictionary
# create the list of result tuples (n0, n1min, n1max) .....

c = [ (x, min(y), max(y) ) for x,y in b.iteritems() ]
print c

There may be a more efficient test for whether b[x[0]] eists than
trapping keyError.
 
S

Steven D'Aprano

Steven said:
In any case, sorting in Python is amazingly fast. You may be pleasantly
surprised that a version that sorts your data, while nominally O(N log
N), may be much faster than an O(N) solution that doesn't require
sorted data. If I were a betting man, I'd be willing to wager a shiny
new dollar[1] that sorting works out faster for reasonable sized sets
of data.

Well, that was my first reaction, too. But then

[snip timeit results]

On my system, I don't get quite the same results as you. I find that the
groupby solution is slightly faster than the dict solution, although both
are slightly faster than the version I came up with. All three are plenty
fast enough though. Using the same function definitions as you, I get:

py> from timeit import Timer
py> setup = "from __main__ import time_dict, time_groupby, time_daprano"
py> t1 = Timer("_ = time_dict()", setup)
py> t2 = Timer("_ = time_groupby()", setup)
py> t3 = Timer("_ = time_daprano()", setup)

and results:

py> min(t1.repeat(number=100000, repeat=6))
6.7522243447601795
py> min(t2.repeat(number=100000, repeat=6))
6.7246054373681545
py> min(t3.repeat(number=100000, repeat=6))
7.116707382723689

But, there's a flaw in the above analysis. The list that you give is pre-
sorted, and Python's sort routine is *very* efficient at sorting already
sorted lists. So I randomly shuffled it and redid the tests. Now the
advantage of the dict solution is clear:

py> from random import shuffle
py> shuffle(a)
py> min(t1.repeat(number=100000, repeat=6))
6.611802507191896
py> shuffle(a)
py> min(t2.repeat(number=100000, repeat=6))
8.499449279159307
py> shuffle(a)
py> min(t3.repeat(number=100000, repeat=6))
9.61335832811892


The dict solution is hardly changed (slightly faster, although I guess
that is just due to random timing fluctuations and not meaningful). The
other two, on the other hand, are quite a bit slower with unsorted data.

Nevertheless, it should be pointed out that even in the worst case, it's
still pretty damn fast: less than 0.0001s to process a list with 78
items, or about 1.3 microseconds per item.

A couple of other observations:

- The collect() function is the same algorithm as the minmax_groupby()
function, the only difference being the implementation. I would expect
(but haven't tested) that the collect() version using a deque instead of
manually iterating over the items in each group would be faster if there
were a lot of items per group.

- The dict version has an advantage that hashing integers is fast. If
hashing the keys is slower, it may no longer be quite so good.

- The dict version keeps a list of all the values it sees, including
duplicates. Instead of a list, a set may be sufficient:

def minmax_dict(items):
d = collections.defaultdict(set)
for key, value in items:
d[key].add(value)
for key, values in d.items():
yield key, min(values), max(values)

It might even be worthwhile experimenting with something which simply
stores the minimum and maximum values seen, rather than all the values.
If there are a lot of items in each group, or if the comparisons are
expensive, calculating the min and max in a single pass may be worthwhile:

http://code.activestate.com/recipes/577916-fast-minmax-function/
 
R

Robert Voigtländer

I've heard the term used often. It means something like, "performs
well" or "runs fast". It may or may not be an English word, but that
doesn't stop people from using it :)
If "google" can be used to mean "make huge amouts of money with a
product that is inherently flawed" then I'll happily accept "performant"
as an English word, regardless of whether the English variant is UK, US,
Australian, New Zealand, Soth African, Geordie, Glaswegian or any other :)

Indeed it's not an english word. I have to stop using it. In German it's used with the meaning of "runs fast". Even though it's already not that clearly defined there.

Thanks for the help on the topic of data aggregation. It helped a lot and I again learned somthing.
I have a performant .. well .. fast running solution now.
 
R

rusi

Indeed it's not an english word. I have to stop using it. In German
it's used with the meaning of "runs fast". Even though it's already
not that clearly defined there.
Thanks for the help on the topic of data aggregation. It helped a
lot and I again learned somthing. I have a performant .. well
.. fast running solution now.

Well "performant" is performant enough for the purposes of communicating
on the python list I think :D
 
D

Dennis Lee Bieber

Well "performant" is performant enough for the purposes of communicating
on the python list I think :D

Most probably could figure it out as being stylistically similar to
"conformant", which I believe IS used in English <G>

conformant => something that conforms
performant => something that performs
 
T

Tim Roberts

Dennis Lee Bieber said:
Most probably could figure it out as being stylistically similar to
"conformant", which I believe IS used in English <G>

conformant => something that conforms
performant => something that performs

Yes, I suspect it comes from people expecting too much consistency. If
something that has "conformance" is "conformant", then something that has
good "performance" must be "performant".
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,755
Messages
2,569,537
Members
45,022
Latest member
MaybelleMa

Latest Threads

Top