Merging sorted lists/iterators/generators into one stream of values...

  • Thread starter Lasse Vågsæther Karlsen

Lasse Vågsæther Karlsen

I need to merge several sources of values into one stream of values. All
of the sources are sorted already and I need to retrieve the values from
them all in sorted order.

In other words:
s1 = [10, 20, 30, 40, 50]
s2 = [15, 25]
s3 = [17, 27, 37]

for value in ???(s1, s2, s3):
    print value

will print out 10, 15, 17, 20, 25, 27, 30, 37, 40, 50 in that order.

The sources are cursors retrieving data from several databases, not from
the same server, and there's a potential for a large number of rows from
several of the sources. As such, any method that would load it all into
memory and sort it is no good as it would take too much memory.

Is there a function or whatnot in Python that will do what I want? I
have come up with my own method but since it uses generators I'm not
sure how much overhead there is. Additionally, since the values
retrieved from the cursors will be dictionaries of "fieldname":value
pairs, the method would either need to handle that construct (and be
told what fieldname to sort on), or be able to take a function object to
use for the comparison operation.

Basically, I'll use my own function for this unless someone can point me
to something that is already available. Couldn't seem to find anything
in the builtin modules but I didn't find glob.glob when I was looking
for how to find filenames either so who knows what it's called :)

Since I need to deal with a variable list of sources (that is, 2 in one
application, 3 in another and 4-5 in a third), a simple 2-source method
isn't enough but if it's better than what I do for 2 sources then I can
make a wrapper for it, since that's what I do anyway.
 

Mike C. Fletcher

Lasse said:
I need to merge several sources of values into one stream of values. All
of the sources are sorted already and I need to retrieve the values from
them all in sorted order.

In other words:
s1 = [10, 20, 30, 40, 50]
s2 = [15, 25]
s3 = [17, 27, 37]

for value in ???(s1, s2, s3):
    print value

will print out 10, 15, 17, 20, 25, 27, 30, 37, 40, 50 in that order.

The sources are cursors retrieving data from several databases, not from
the same server, and there's a potential for a large number of rows from
several of the sources. As such, any method that would load it all into
memory and sort it is no good as it would take too much memory.

Is there a function or whatnot in Python that will do what I want? I
have come up with my own method but since it uses generators I'm not
sure how much overhead there is. Additionally, since the values
retrieved from the cursors will be dictionaries of "fieldname":value
pairs, the method would either need to handle that construct (and be
told what fieldname to sort on), or be able to take a function object to
use for the comparison operation.

Basically, I'll use my own function for this unless someone can point me
to something that is already available. Couldn't seem to find anything
in the builtin modules but I didn't find glob.glob when I was looking
for how to find filenames either so who knows what it's called :)

Since I need to deal with a variable list of sources (that is, 2 in one
application, 3 in another and 4-5 in a third), a simple 2-source method
isn't enough but if it's better than what I do for 2 sources then I can
make a wrapper for it, since that's what I do anyway.
I doubt you'll find a prebuilt one, it's fairly easy to create your own,
after all. Generators are fairly fast constructs in Python, btw.
Here's what I whipped up in a few minutes:

def firstIter( value ):
    it = iter( value )
    try:
        return it.next(), it
    except StopIteration, err:
        return None, None

def inorder( comparison, *args ):
    iterators = [
        [value,it]
        for (value,it) in [ firstIter( arg ) for arg in args ]
        if it is not None
    ]
    iterators.sort()
    while iterators:
        yield iterators[0][0]
        try:
            value = iterators[0][0] = iterators[0][1].next()
        except StopIteration, err:
            iterators.pop(0)
        else:
            if len(iterators) > 1 and comparison( value, iterators[1][0]) == 1:
                iterators.sort()
                continue

if __name__ == "__main__":
    s1 = [10, 20, 30, 40, 50]
    s2 = [15, 25]
    s3 = [17, 27, 37]
    s4 = []
    for value in inorder(cmp, s1, s2, s3, s4):
        print value
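
(For the record, the __main__ block above prints 10, 15, 17, 20, 25, 27,
30, 37, 40, 50 -- the order asked for in the original post.)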


Anyway, have fun,
Mike

--
________________________________________________
Mike C. Fletcher
Designer, VR Plumber, Coder
http://www.vrplumber.com
http://blog.vrplumber.com
 

Lasse Vågsæther Karlsen

Ok, that one looks sleeker than what I came up with.

Couple of things I learned from your solution, please correct me if I
misunderstood something:

1. a list containing other lists will sort itself based on the first
element of the inner lists?
2. sort() and pop() are not costly operations

Other than that you seem to not use the cmp operator but that's easily
fixed.

This one looks much better than mine, here's what I came up with:

def merge_sorted(iterables, comparer=None):
    """
    Generator that will take a list of iterables/generators that are
    individually sorted, and then produce values in a sorted order by
    taking the lowest value from each source each call.

    @param iterables: List of iterables/generators to retrieve values from
    @type iterables: List of iterables/generators

    @param comparer: Optional fn(v1, v2) function that can compare two
        values, should return <0 if v1<v2, >0 if v1>v2 and ==0 if v1==v2.
    @type comparer: function-object, example: lambda x, y: x-y

    @note: The "list of iterables" can actually be anything that produces
        a list of iterables, so you can use a function that yields lists
        for instance.
    """

    # First convert whatever we're given into a list of sources
    iterables = [iterable for iterable in iterables]

    # This series of if-statements will determine how many sources we
    # have and work out sub-problems that are manageable.
    if len(iterables) != 2:
        if len(iterables) == 0:
            # List, but no sources
            pass
        elif len(iterables) == 1:
            # Only 1 source, just return its contents
            for value in iterables[0]:
                yield value
        elif len(iterables) == 3:
            # 3 sources, sub-divide into 0 <--> (1, 2)
            left_iterable = iterables[0]
            right_iterable = merge_sorted([iterables[1], iterables[2]], comparer)
            for value in merge_sorted([left_iterable, right_iterable], comparer):
                yield value
        elif len(iterables) == 4:
            # 4 sources, sub-divide into (0, 1) <--> (2, 3)
            left_iterable = merge_sorted([iterables[0], iterables[1]], comparer)
            right_iterable = merge_sorted([iterables[2], iterables[3]], comparer)
            for value in merge_sorted((left_iterable, right_iterable), comparer):
                yield value
        elif len(iterables) > 4:
            # >4 sources, sub-divide into (0, 1) <--> (2, ...)
            left_iterable = merge_sorted([iterables[0], iterables[1]], comparer)
            right_iterable = merge_sorted(iterables[2:], comparer)
            for value in merge_sorted((left_iterable, right_iterable), comparer):
                yield value
        raise StopIteration

    # The method will only get here if there are only two sources, which
    # is an easy case to handle
    i1 = iter(iterables[0])
    i2 = iter(iterables[1])

    # Grab the first two values from the two sources, if possible
    try:
        v1 = i1.next()
        has_v1 = True
    except StopIteration:
        has_v1 = False
    try:
        v2 = i2.next()
        has_v2 = True
    except StopIteration:
        has_v2 = False

    # As long as we got values from both generators/iterators/whatever,
    # compare and yield the lowest of the two, and then get the next
    # value from the corresponding source
    while has_v1 and has_v2:
        # Work out which of v1 and v2 comes first
        if comparer is not None:
            comp = comparer(v1, v2)
            if comp <= 0:
                yield_v1 = True
            else:
                yield_v1 = False
        else:
            if v1 <= v2:
                yield_v1 = True
            else:
                yield_v1 = False

        # Yield the next value, then grab a new value from the
        # corresponding source
        if yield_v1:
            yield v1
            try:
                v1 = i1.next()
            except StopIteration:
                has_v1 = False
        else:
            yield v2
            try:
                v2 = i2.next()
            except StopIteration:
                has_v2 = False

    # When we get here, we got 3 possibilities:
    # 1. has_v1 == True, has_v2 == False --> yield rest of v1/i1 and
    #    just exit on the StopIteration exception
    # 2. has_v1 == False, has_v2 == True --> yield rest of v2/i2 and
    #    just exit on the StopIteration exception
    # 3. has_v1 == has_v2 == False --> while-loops will skip, function
    #    falls off the end
    while has_v1:
        yield v1
        v1 = i1.next()
    while has_v2:
        yield v2
        v2 = i2.next()
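
(An illustrative driver, not part of the original post, using the sample
lists from the top of the thread:)

if __name__ == "__main__":
    # Illustrative usage only; expected output is
    # 10, 15, 17, 20, 25, 27, 30, 37, 40, 50
    s1 = [10, 20, 30, 40, 50]
    s2 = [15, 25]
    s3 = [17, 27, 37]
    for value in merge_sorted([s1, s2, s3]):
        print value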
 

Lasse Vågsæther Karlsen

Thanks, that looks like Mike's solution except that it uses the
built-in heapq module.

While this one contains less code than Mike's solution it seems to lack
the ability to control the comparison operation, which means it won't
work in my case. I need to both be able to sort on an arbitrary field
name (could be done using a list where I place the field first), and
also to be able to sort in a different order than smallest-first.

Perhaps adjusting the data that is returned through each source
would do that. I'll look into it.
 

George Sakkis

Lasse Vågsæther Karlsen said:
Thanks, that looks like Mike's solution except that it uses the
built-in heapq module.

This makes a big difference for the algorithmic complexity; replacing an item in a heap is much more
efficient than sorting the whole list.
While this one contains less code than Mike's solution it seems to lack
the ability to control the comparison operation, which means it won't
work in my case. I need to both be able to sort on an arbitrary field
name (could be done using a list where I place the field first), and
also to be able to sort in a different order than smallest-first.

Perhaps adjusting the data that is returned through each source
would do that. I'll look into it.

Yes, it's a little inconvenient that the builtin heap doesn't take a comparison operation but you
can easily roll your own heap by transforming each item to a (key,item) tuple. Now that I'm thinking
about it, it might be a good addition to the cookbook.
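
A rough sketch of that (key, item) decoration, assuming dictionary rows
and a known field name; merge_on_field and its arguments are
illustrative, not from a posted recipe:

import heapq

def merge_on_field(field, *cursors):
    # Decorate each row as (key, index, row, iterator) so the heap orders
    # by the chosen field and never has to compare rows or iterators.
    heap = []
    for i, cursor in enumerate(cursors):
        it = iter(cursor)
        try:
            row = it.next()
        except StopIteration:
            continue
        heap.append((row[field], i, row, it))
    heapq.heapify(heap)
    while heap:
        key, i, row, it = heap[0]
        yield row
        try:
            row = it.next()
        except StopIteration:
            heapq.heappop(heap)
        else:
            heapq.heapreplace(heap, (row[field], i, row, it))

For a largest-first merge one could push negated numeric keys (or
otherwise invert the key) instead.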

George
 

Mike C. Fletcher

Lasse said:
Ok, that one looks sleeker than what I came up with.

Couple of things I learned from your solution, please correct me if I
misunderstood something:

1. a list containing other lists will sort itself based on the first
element of the inner lists?
Correct, comparison is recursive for lists/tuples.
2. sort() and pop() are not costly operations
They *can* be costly, but the algorithm reduces the number of times they
are called to less-than-linear for all but pathological cases (i.e. they
are only called when you need to switch streams). It also requires
sorting only the number of streams, rather than the whole set.
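
(A quick interactive illustration of point 1, with made-up values:)

>>> pairs = [[20, 'second'], [10, 'b'], [10, 'a']]
>>> pairs.sort()
>>> pairs
[[10, 'a'], [10, 'b'], [20, 'second']]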
Other than that you seem to not use the cmp operator but that's easily
fixed.
Sorry about that, revised version below:

def firstIter( value ):
    it = iter( value )
    try:
        return it.next(), it
    except StopIteration, err:
        return None, None

def cmpFirstWith( comparison ):
    def compare( a,b ):
        return comparison(a[0],b[0])
    return compare

def inorder( comparison, *args ):
    iterators = [
        [value,it]
        for (value,it) in [ firstIter( arg ) for arg in args ]
        if it is not None
    ]
    iterators.sort( cmpFirstWith( comparison ) )
    while iterators:
        yield iterators[0][0]
        try:
            value = iterators[0][0] = iterators[0][1].next()
        except StopIteration, err:
            iterators.pop(0)
        else:
            if len(iterators) > 1 and comparison( value, iterators[1][0]) == 1:
                iterators.sort( cmpFirstWith( comparison ) )
                continue

if __name__ == "__main__":
    s1 = [10, 20, 30, 40, 50]
    s2 = [15, 25]
    s3 = [17, 27, 37]
    s4 = []
    for value in inorder(cmp, s1, s2, s3, s4):
        print value
    s1 = [{'a':'b'}, {'a':'e'}]
    s2 = [{'a':'d'}, {'a':'z'}]
    def comp( a,b ):
        return cmp( a['a'],b['a'])
    for value in inorder(comp, s1, s2 ):
        print value

Have fun,
Mike

--
________________________________________________
Mike C. Fletcher
Designer, VR Plumber, Coder
http://www.vrplumber.com
http://blog.vrplumber.com
 

Lasse Vågsæther Karlsen

Lasse said:
I need to merge several sources of values into one stream of values. All
of the sources are sorted already and I need to retrieve the values from
<snip>

Ok, after working through the various sources and solutions, here's what
I finally ended up with:

def merge_sorted(comparison, *sources):
    iterables = []
    for source in sources:
        try:
            source = iter(source)
            iterables.append([source.next(), source])
        except StopIteration:
            pass

    iterables.sort(cmp=comparison, key=lambda x: x[0])
    while iterables:
        yield iterables[0][0]
        try:
            iterables[0][0] = iterables[0][1].next()
            if len(iterables) > 1 and comparison(iterables[0][0], iterables[1][0]) > 0:
                iterables.sort(comparison, key=lambda x: x[0])
        except StopIteration:
            iterables.pop(0)
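
(An illustrative run, not from the original post: merging made-up
dictionary rows on field 'a', largest first, which is the kind of
control the comparison argument is there for. Each source must already
be sorted in that same order.)

if __name__ == "__main__":
    s1 = [{'a': 30}, {'a': 10}]
    s2 = [{'a': 25}, {'a': 5}]
    descending_on_a = lambda x, y: cmp(y['a'], x['a'])
    for row in merge_sorted(descending_on_a, s1, s2):
        print row
    # prints {'a': 30}, {'a': 25}, {'a': 10}, {'a': 5}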

Thanks to Mike and George for the solutions and pointers.
 

Alex Martelli

George Sakkis said:
This make a big difference for the algorithmic complexity; replacing an
item in a heap is much more efficient than sorting the whole list.

In the most general case, yes. However, Python's sort ("timsort") is
preternaturally fast when sorting sequences that are mostly sorted
except for maybe one element being in the wrong place... try it (and
read the Timbot's article included in Python's sources, and the sources
themselves)... I suspect that heapq will still be faster, but by
nowhere as much as one might think.

Yes, it's a little inconvenient that the builtin heap doesn't take a
comparison operation but you can easily roll your own heap by transforming
each item to a (key,item) tuple. Now that I'm thinking about it, it might
be a good addition to the cookbook.

I believe Python 2.5 adds a key= argument to heapq's functions...


Alex
 

Kay Schluehr

Alex said:
try it (and read the Timbot's article included in Python's sources, and the
sources themselves)...

Just some reading advice. The translated PyPy source
pypy/objectspace/listsort.py might be more accessible than the
corresponding C code.

Kay
 

Tim Peters

[Alex Martelli]
[Kay Schluehr]
Just some reading advice. The translated PyPy source
pypy/objectspace/listsort.py might be more accessible than the
corresponding C code.
[cfbolz]
indeed. it is at

http://codespeak.net/svn/pypy/dist/pypy/objspace/std/listsort.py

While the Python version is certainly easier to read, I believe Alex
had in mind the detailed English _explanation_ of the algorithm:

http://cvs.sf.net/viewcvs.py/python/python/dist/src/Objects/listsort.txt

It's a complex algorithm, dripping with subtleties that aren't
apparent from staring at an implementation.

Note that if a list has N elements, sorting it requires at least N-1
comparisons, because that's the minimum number of compares needed
simply to determine whether or not it's already sorted. A heap-based
priority queue never requires more than O(log(N)) compares to push or
pop an element. If N is small, it shouldn't make much difference. As
N gets larger, the advantage of a heap grows without bound.
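
(To put numbers on that: with 50 streams, each re-sort costs at least 49
comparisons just to confirm the order, while a heap replace costs on the
order of log2(50), roughly 6.)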
 

Lasse Vågsæther Karlsen

Alex said:
George Sakkis <[email protected]> wrote:


I believe Python 2.5 adds a key= argument to heapq's functions...
<snip>

I will revisit the heapq solution when 2.5 is released then.

Thanks for the heads up. For the moment I will stay with the list
solution that Mike came up with, slightly changed to accommodate tips
and pointers from others in this thread.
 

George Sakkis

Lasse Vågsæther Karlsen said:
<snip>

I will revisit the heapq solution when 2.5 is released then.

Thanks for the heads up. For the moment I will stay with the list
solution that Mike came up with, slightly changed to accommodate tips
and pointers from others in this thread.

Just added a recipe at http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/440673. You can try
both and see if there's any significant performance difference for your data.

George
 

Lasse Vågsæther Karlsen

George Sakkis wrote:
Just added a recipe at http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/440673. You can try
both and see if there's any significant performance difference for your data.
<snip>

Thanks, will take a look at it later. The sort solution seems to work
nicely. It might make a big difference if I have a lot of sources,
though, as I bet the overhead of sorting N items gets higher than that
of manipulating a heap to place an item in the right spot, but with 4-5
or a few more sources it might not make an impact at all.
 

George Sakkis

Lasse Vågsæther Karlsen said:
George Sakkis wrote:

<snip>

Thanks, will take a look at it later. The sort solution seems to work
nicely. It might make a big difference if I have a lot of sources,
though, as I bet the overhead of sorting N items gets higher than that
of manipulating a heap to place an item in the right spot, but with 4-5
or a few more sources it might not make an impact at all.

Unless you're talking about hundreds or thousands of sources, it
probably won't. I would still go for the heap solution since IMO the
resulting code is more readable and easier to understand.

George
 

Alex Martelli

George Sakkis said:
Unless you're talking about hundreds or thousands of sources, it
probably won't. I would still go for the heap solution since IMO the
resulting code is more readable and easier to understand.

I'm not so sure about either sentence...:

Helen:~/pynut/samp alex$ python merger.py --numstreams=10 --minlen=100
--how=S
Best time for 10 loops: 0.247116088867

Helen:~/pynut/samp alex$ python merger.py --numstreams=10 --minlen=100
--how=H
Best time for 10 loops: 0.10344004631

i.e., a heap solution may be over 4 times faster than a sort-based one
(in the following implementations). Readability seems quite comparable
(skipping the rest of the infrastructure, which generates random sorted
streams and ensures a stream is exhausted and verifies it etc etc):

def merge_by_sort(streams):
    sources = [[s.next(), i, s.next] for i, s in enumerate(streams)]
    while sources:
        sources.sort(reverse=True)
        best_source = sources[-1]
        yield best_source[0]
        try: best_source[0] = best_source[-1]()
        except StopIteration: sources.pop()

def merge_by_heap(streams):
    sources = [[s.next(), i, s.next] for i, s in enumerate(streams)]
    heapq.heapify(sources)
    while sources:
        best_source = sources[0]
        yield best_source[0]
        try: best_source[0] = best_source[-1]()
        except StopIteration: heapq.heappop(sources)
        else: heapq.heapreplace(sources, best_source)
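
(For completeness, a small check against the thread's sample lists; both
functions expect iterators, since they call .next() directly, and
merge_by_heap assumes heapq has been imported:)

import heapq  # merge_by_heap above relies on this being in scope

if __name__ == "__main__":
    s1 = [10, 20, 30, 40, 50]
    s2 = [15, 25]
    s3 = [17, 27, 37]
    print list(merge_by_sort([iter(s) for s in (s1, s2, s3)]))
    print list(merge_by_heap([iter(s) for s in (s1, s2, s3)]))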


Hmmm, I wonder if something like merge_by_heap would be a good candidate
for itertool. Raymond...?


Alex
 

George Sakkis

Alex Martelli said:
I'm not so sure about either sentence...:

Helen:~/pynut/samp alex$ python merger.py --numstreams=10 --minlen=100
--how=S
Best time for 10 loops: 0.247116088867

Helen:~/pynut/samp alex$ python merger.py --numstreams=10 --minlen=100
--how=H
Best time for 10 loops: 0.10344004631

i.e., a heap solution may be over 4 times faster than a sort-based one
(in the following implementations).

Interesting; I thought timsort on small almost ordered lists would be practically as fast as the
heap. Still, how is 0.10344 over 4 times faster than 0.247116 ?
Readability seems quite comparable
(skipping the rest of the infrastructure, which generates random sorted
streams and ensures a stream is exhausted and verifies it etc etc):

def merge_by_sort(streams):
    sources = [[s.next(), i, s.next] for i, s in enumerate(streams)]
    while sources:
        sources.sort(reverse=True)
        best_source = sources[-1]
        yield best_source[0]
        try: best_source[0] = best_source[-1]()
        except StopIteration: sources.pop()

def merge_by_heap(streams):
    sources = [[s.next(), i, s.next] for i, s in enumerate(streams)]
    heapq.heapify(sources)
    while sources:
        best_source = sources[0]
        yield best_source[0]
        try: best_source[0] = best_source[-1]()
        except StopIteration: heapq.heappop(sources)
        else: heapq.heapreplace(sources, best_source)

Indeed, these are almost equivalent as far as readability goes; the previous examples in the thread
were less clear. By the way, why do you need 'i' and enumerate above ?
Hmmm, I wonder if something like merge_by_heap would be a good candidate
for itertool. Raymond...?


Alex

Yes, it would be nice to make it into 2.5.

George
 

Mike C. Fletcher

Alex said:
...

....

i.e., a heap solution may be over 4 times faster than a sort-based one
(in the following implementations). Readability seems quite comparable
(skipping the rest of the infrastructure, which generates random sorted
streams and ensures a stream is exhausted and verifies it etc etc):
One thing to keep in mind (if you care about performance) is that one
could use bisect, instead of sort, as the sorted list of streams is
already in order save for the one element you are processing. Btw, nice
trick with reverse to reduce memory copying, when did that get
introduced? Wonder if bisect can deal with reverse-sorted elements.
Anyway, it should reduce the O-complexity of that part of the operation,
though you'll still have to do a memcpy to shift the rest of the source
list's array, and if it can't deal with reverse-sorted lists it would
move you back to front-of-list popping.
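
A rough sketch of that bisect variant (illustrative only: it keeps the
sources in ascending order and accepts the front-of-list pop mentioned
above):

import bisect

def merge_by_bisect(streams):
    sources = [[s.next(), i, s.next] for i, s in enumerate(streams)]
    sources.sort()
    while sources:
        best_source = sources.pop(0)   # smallest current value; O(n) shift
        yield best_source[0]
        try:
            best_source[0] = best_source[-1]()
        except StopIteration:
            pass
        else:
            # O(log n) search for the slot, plus the memmove to insert
            bisect.insort(sources, best_source)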

Oh, we're still missing use of a comparison function in both versions.
I'd think you'd want that functionality *available* if you're going to
make this a general tool. You'll also need to check for StopIteration
on creation of sources for null sequences. Finally, why the 'i'
element? It's never used AFAICS.
def merge_by_sort(streams):
    sources = [[s.next(), i, s.next] for i, s in enumerate(streams)]
    while sources:
        sources.sort(reverse=True)
        best_source = sources[-1]
        yield best_source[0]
        try: best_source[0] = best_source[-1]()
        except StopIteration: sources.pop()
....

Have fun,
Mike

--
________________________________________________
Mike C. Fletcher
Designer, VR Plumber, Coder
http://www.vrplumber.com
http://blog.vrplumber.com
 

Alex Martelli

George Sakkis said:
Interesting; I thought timsort on small almost ordered lists would be
practically as fast as the heap. Still, how is 0.10344 over 4 times faster
than 0.247116 ?

Oops, 2.5 times (for 10 streams) -- the ratio keeps getting bigger with
the number of streams, the figure 4 I had in mind was probably from
comparisons with 50 streams or so.

timsort needs N comparisons even if the list is already ordered (just to
check it is) while heaps can do with log(N) comparisons or even less in
the best case -- that's the key issue here.
sources = [[s.next(), i, s.next] for i, s in enumerate(streams)]
...
Indeed, these are almost equivalent as far as readability goes; the
previous examples in the thread were less clear. By the way, why do you
need 'i' and enumerate above ?

Making sure that, if the first items are identical, the sorting (or
comparison for heap) does not compare the _methods_ -- no big deal if it
does, but that doesn't logically make sense. If we had key= arguments,
that would ENSURE no such comparison, but heaps didn't support that in
2.4 and I wanted to keep the playing field level, so I used the above
old trick to ensure stable sorting without comparisons beyond a given
field (more details were explained in the 1st edition of the Cookbook,
but I hope I give the general idea).
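
(A tiny illustration of that trick, with made-up stand-ins for the bound
methods:)

a = [10, 0, "stream 0's next method"]
b = [10, 1, "stream 1's next method"]
print a < b   # True: the tie on 10 is broken by 0 < 1, so the third
              # slot is never compared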


Alex
 
