Merging sorted lists/iterators/generators into one stream of values...

  • Thread starter =?ISO-8859-1?Q?Lasse_V=E5gs=E6ther_Karlsen?=
  • Start date
A

Alex Martelli

Mike C. Fletcher said:
One thing to keep in mind (if you care about performance) is that you
one could use bisect, instead of sort, as the sorted list of streams is
already in order save for the one element you are processing. Btw, nice
trick with reverse to reduce memory copying, when did that get
introduced?

Python 2.4
Wonder if bisect can deal with reverse-sorted elements.

Not AFAIK.
Anyway, should reduce the O-complexity of that part of the operation,
though you'll still have to do a memcpy to shift the rest of the source
list's array, and if it can't deal with reverse-sorted lists it would
move you back to front-of-list popping.

Yeah, if you have to pop the front you're O(N) anyway, which is
basically what timsort does with nearly-ordered lists (though timsort
uses O(N) _comparisons_ while bisecting would use O(N) _moves_).

Oh, we're still missing use of a comparison function in both versions.

Trivial, just have the functions accept a key= and use that to prepare
the auxiliary list they're using anyway.
I'd think you'd want that functionality *available* if you're going to
make this a general tool. You'll also need to check for StopIteration
on creation of sources for null sequences.

True, the functions as prepared don't accept empty streams (exactly
because they don't specialcase StopIteration on the first calls to
next). Pretty trivial to remedy, of course.
Finally, why the 'i'
element? It's never used AFAICS.

It's used by the list's lexicographic comparison when the first elements
of two lists being compared are equal (which can happen, of course) to
avoid comparing the last elements (which are methods... no big deal if
they get compared, but it makes no logical sense).

So, an example enhanced merge_by_sort:


def merge_by_sort(streams, key=None):

if not key: key = lambda x: x
sources = []
for i, s in enumerate(streams):
try: first_item = s.next()
except StopIteration: pass
else: sources.append((key(item), i, item, s.next))

while sources:
sources.sort(reverse=True)
best_source = sources[-1]
yield best_source[2]
try: best_source[2] = best_source[-1]()
except StopIteration: sources.pop()
else: best_source[0] = key(best_source[2])


Of course, since the sort method DOES accept a key= parameter, this
could be simplified, but I'll leave it like this to make it trivial to
see how to recode the merging by heap as well (in 2.4)...


Alex
 
?

=?ISO-8859-1?Q?Lasse_V=E5gs=E6ther_Karlsen?=

<snip>

Another idea for this method would be that in some cases I noticed that
it was useful to know which source each element would come from as well,
as well as removing duplicates from the results.

For instance

s1 = [1, 3, 5, 7]
s2 = [2, 3, 4]

for k, s in merge_by_sort(s1, s2):
print k, "from source", s

this would print:

1 from source 0
2 from source 1
3 from source 1
3 from source 0
4 from source 1
5 from source 0
7 from source 0

and the above list has 3 twice, so possibly:

1 from sources [0]
2 from sources [1]
3 from sources [0, 1]
4 from sources [1]
5 from sources [0]
7 from sources [0]

This latter one would be a slightly more heavy method as it would have
to compare the N first elements of the list or heap to figure out what
indices to yield as well.

However, the previous solution could be:

def merge_by_sort(*sources, **options):
if "cmp" in options:
comparison = options["cmp"]
else:
comparison = cmp

iterables = []
for index, source in enumerate(sources):
try:
source = iter(source)
iterables.append([source.next(), index, source])
except StopIteration:
pass

iterables.sort(cmp=comparison, key=lambda x: x[0], reverse=True)
while iterables:
yield iterables[-1][0], iterables[-1][1]
try:
iterables[-1][0] = iterables[-1][2].next()
if len(iterables) > 1 and comparison(iterables[-1][0],
iterables[-2][0]) > 0:
iterables.sort(comparison, key=lambda x: x[0],
reverse=True)
except StopIteration:
iterables.pop(-1)
 
?

=?ISO-8859-1?Q?Lasse_V=E5gs=E6ther_Karlsen?=

Lasse said:
<snip>

Another idea for this method would be that in some cases I noticed that
it was useful to know which source each element would come from as well,
as well as removing duplicates from the results.
<snip>

The "removing duplicates" problem would probably be best as a separate
function and it occurs to me that perhaps Python has such a function
already.

Again, this function would need the following criteria:

1. Be able to iterate through something other than a list
2. Yield the values, not return a list
3. Take an arbitrary cmp function to determine what is a duplicate

As sugar, perhaps also the following criteria:

- Ability to "combine" the duplicates through a special function

A simple first-version function I hacked together does this:

def unique(source, cmp=cmp, key=None, combine=None):
it = iter(source)
first = True
value = it.next()
values = [value]
while True:
try:
value = it.next()
if key is not None:
cmp_result = cmp(values[0][key], value[key])
else:
cmp_result = cmp(values[0], value)
if cmp_result == 0:
values.append(value)
else:
if combine is not None:
yield combine(values)
else:
yield values[0]
values = [value]
except StopIteration:
if combine is not None:
yield combine(values)
else:
yield values[0]
break
raise StopIteration

Note that this function does not do any sorting so if the source that it
gets the values from is not sorted, the result will be very wrong. This
is again due to my criteria of being able to handle cursors retrieving
data from a database and thus avoid loading everything into memory.

The combine function is passed a list of "duplicate" values and must
return a value that will be yielded out of unique.

Example of usage:

def sum_counts(values):
value = values[0][0]
sum = 0
for row in values:
sum += row[1]
return value, sum

fruits = [["Apple", 10], ["Apple", 15], ["Banana", 23], ["Orange", 17],
["Orange", 17]]
for fruit, total_sum in unique(fruits, key=0, combine=sum_counts):
print fruit, "has a sum of", total_sum

This will produce:

Apple has a sum of 25
Banana has a sum of 23
Orange has a sum of 34

Function name is perhaps not the best one. It occurs to me that this is
the GROUP BY function in SQL so perhaps a different name is better, but
then again this might be moot if such a function already exists somewhere :)
 
G

George Sakkis

Function name is perhaps not the best one. It occurs to me that this
is the GROUP BY in SQL so perhaps a different name is better, but
then again this might be moot if such a already exists somewhere :)

Amazing, you keep reinventing things, even with the exact same name :)

from itertools import imap,groupby
from operator import itemgetter

for fruit,group in groupby(fruits, itemgetter(0)):
print fruit, "has a sum of", sum(imap(itemgetter(1),group))

For this to work as intended, fruits has to be already sorted by the
same key given to grouby; otherwise just replace fruits with
sorted(fruits, itemgetter(0)).

By the way, read all the functions in the itertools module
(http://docs.python.org/lib/itertools-functions.html), it will save you
a lot of time.

George
 
?

=?ISO-8859-1?Q?Lasse_V=E5gs=E6ther_Karlsen?=

George said:
Amazing, you keep reinventing things, even with the exact same name :)

from itertools import imap,groupby
from operator import itemgetter

for fruit,group in groupby(fruits, itemgetter(0)):
print fruit, "has a sum of", sum(imap(itemgetter(1),group))

For this to work as intended, fruits has to be already sorted by the
same key given to grouby; otherwise just replace fruits with
sorted(fruits, itemgetter(0)).

By the way, read all the functions in the itertools module
(http://docs.python.org/lib/itertools-functions.html), it will save you
a lot of time.

George

Itertools, meh, there's that wheel again :)

Didn't know about this one so thank you :)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,770
Messages
2,569,584
Members
45,075
Latest member
MakersCBDBloodSupport

Latest Threads

Top