Efficiently sequencing a stream of objects

C

Chris

Does anyone know of an efficient algorithm for merging many streams of
objects?

Suppose I have N streams of words. Each stream returns words in
alphabetical sequence. For example:

stream 0: aardvark, bear, dog
stream 1: cat, groundhog
stream 3: eagle, fox

I want the output to be:
aardvark, bear, cat, dog, eagle, fox, groundhog.

The number of words can be large (in the millions) and the number of
streams can also be large (in the thousands).

The brute force approach is to create an array of streams, get the
lowest one from each, and output the lowest word found. Repeat until all
streams exhausted. The problem is that the running time is O(N * M),
where N is the number of streams and M is the *total number of elements
across all streams*. This is obviously not scalable.
 
E

Eric Sosman

Chris said:
Does anyone know of an efficient algorithm for merging many streams of
objects?

Suppose I have N streams of words. Each stream returns words in
alphabetical sequence. For example:

stream 0: aardvark, bear, dog
stream 1: cat, groundhog
stream 3: eagle, fox

I want the output to be:
aardvark, bear, cat, dog, eagle, fox, groundhog.

The number of words can be large (in the millions) and the number of
streams can also be large (in the thousands).

The brute force approach is to create an array of streams, get the
lowest one from each, and output the lowest word found. Repeat until all
streams exhausted. The problem is that the running time is O(N * M),
where N is the number of streams and M is the *total number of elements
across all streams*. This is obviously not scalable.

Merge pairs of streams, then merge pairs of merged streams,
then merge pairs of twice-merged streams, ... Once the tree is
set up and until it begins to collapse as the streams start to
run out, you'll need about lg(M) comparisons per item emitted.
Use K-way instead of two-way merges ("Pair up by threes" -- L.P.
Berra) and you'll use about (K-1)*ln(M)/ln(K) comparisons per
item, but lower overhead might compensate for the increased
comparison count.

See also "replacement selection," which is not exactly what
you're trying to do but is closely related.
 
P

Patricia Shanahan

Chris said:
Does anyone know of an efficient algorithm for merging many streams of
objects?

Suppose I have N streams of words. Each stream returns words in
alphabetical sequence. For example:

stream 0: aardvark, bear, dog
stream 1: cat, groundhog
stream 3: eagle, fox

I want the output to be:
aardvark, bear, cat, dog, eagle, fox, groundhog.

The number of words can be large (in the millions) and the number of
streams can also be large (in the thousands).

The brute force approach is to create an array of streams, get the
lowest one from each, and output the lowest word found. Repeat until all
streams exhausted. The problem is that the running time is O(N * M),
where N is the number of streams and M is the *total number of elements
across all streams*. This is obviously not scalable.

java.util.PriorityQueue

Patricia
 
P

Patricia Shanahan

Eric said:
Merge pairs of streams, then merge pairs of merged streams,
then merge pairs of twice-merged streams, ... Once the tree is
set up and until it begins to collapse as the streams start to
run out, you'll need about lg(M) comparisons per item emitted.
Use K-way instead of two-way merges ("Pair up by threes" -- L.P.
Berra) and you'll use about (K-1)*ln(M)/ln(K) comparisons per
item, but lower overhead might compensate for the increased
comparison count.

See also "replacement selection," which is not exactly what
you're trying to do but is closely related.

Why this rather than java.util.PriorityQueue? The two approaches have he
same computational complexity, but the PriorityQueue implementation
should be much simpler.

Patricia
 
E

Eric Sosman

Patricia said:
Why this rather than java.util.PriorityQueue? The two approaches have he
same computational complexity, but the PriorityQueue implementation
should be much simpler.

PriorityQueue has several significant disadvantages, chief
among them being that <font size=tiny style=shamefaced> I didn't
think of it ... </font>
 
M

Mark Thornton

Eric said:
PriorityQueue has several significant disadvantages, chief
among them being that <font size=tiny style=shamefaced> I didn't
think of it ... </font>

I think the biggest failing of PriorityQueue was its absence prior to
jdk 1.5 (aka 5).

Mark Thornton
 
C

Chris

java.util.PriorityQueue
I kind of thought that might help, but couldn't think of a way to
implement it efficiently. In a burst of caffeine, though, I produced the
following code. Thanks for the inspiration. This is elegant. The only
thing I don't like about it is all the casting required, but that's the
cost of using Collections.


/**
* Wraps multiple Iterators, and returns the Objects in the Iterators
* in the correct, ascending sequence. Assumes that the Iterators
* contain Objects that implement the Comparable interface.
*/
public class MergingIterator implements Iterator {
private PriorityQueue queue = new PriorityQueue();

public void add(Iterator it) {
if (!it.hasNext())
return; // do nothing, do not add

Carrier carrier = new Carrier();
carrier.it = it;
carrier.next = (Comparable)it.next();
queue.offer(carrier);
}


public boolean hasNext() {
return !queue.isEmpty();
}

public Object next() {
Carrier carrier = (Carrier)queue.poll();
Comparable next = carrier.next;

// if there's another object to be had from this stream, insert it.
// if not, the queue just gets smaller
if (carrier.it.hasNext()) {
carrier.next = (Comparable) carrier.it.next();
queue.offer(carrier);
}

return next;
}

public void remove() {
throw new UnsupportedOperationException();
}

private class Carrier implements Comparable {
Iterator it;
Comparable next;

public int compareTo(Object other) {
return next.compareTo(((Carrier)other).next);
}
}
}
 
P

Patricia Shanahan

Chris said:
I kind of thought that might help, but couldn't think of a way to
implement it efficiently. In a burst of caffeine, though, I produced the
following code. Thanks for the inspiration. This is elegant. The only
thing I don't like about it is all the casting required, but that's the
cost of using Collections.

Nice. Take a look at generics. You might be able to do something to
reduce the casting with the 1.5 collections.

Patricia
 
C

Chris

Patricia said:
Nice. Take a look at generics. You might be able to do something to
reduce the casting with the 1.5 collections.

Patricia

Generics, unfortunately, are worthless for performance improvements.
They provide only compile-time type checking. At runtime, all the casts
are still there.
 
P

Piotr Kobzda

Chris said:
Generics, unfortunately, are worthless for performance improvements.
They provide only compile-time type checking. At runtime, all the casts
are still there.

Not all cast are there. Compare your version with "genericified" one
(attached below).


piotr

--
public class MergingIterator<E extends Comparable<? super E>> implements
Iterator<E> {
private PriorityQueue<Carrier> queue = new PriorityQueue<Carrier>();

public void add(Iterator<? extends E> it) {
if (!it.hasNext())
return; // do nothing, do not add

Carrier carrier = new Carrier();
carrier.it = it;
carrier.next = it.next();
queue.offer(carrier);
}


public boolean hasNext() {
return !queue.isEmpty();
}

public E next() {
Carrier carrier = queue.poll();
E next = carrier.next;

// if there's another object to be had from this stream, insert it.
// if not, the queue just gets smaller
if (carrier.it.hasNext()) {
carrier.next = carrier.it.next();
queue.offer(carrier);
}

return next;
}

public void remove() {
throw new UnsupportedOperationException();
}

private class Carrier implements Comparable<Carrier> {
Iterator<? extends E> it;
E next;

public int compareTo(Carrier other) {
return next.compareTo(other.next);
}
}
}
 
P

Piotr Kobzda

Piotr said:
Not all cast are there. Compare your version with "genericified" one
(attached below).

Well, I compared them instruction by instruction, and Chris is right,
all casts are still there. Not exactly in the same places, but
effectively all casts from original version are performed in generic
code also.

I assumed at first, that generic version of Carrier will take away one
cast, unfortunately, that cast is moved into synthetic compareTo(Objcet)
method generated to fulfill all needs of parameterized Comparator
implementation.


Maybe planned the next generation of generics will improve that somehow:
http://tech.puredanger.com/java7#reified


However, even without any performance improvements, there are still
other benefits of generic version, like clearer implementation (IMHO),
constrains previously expressed in comments only, are now checked by the
compiler, etc.


piotr
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,755
Messages
2,569,536
Members
45,011
Latest member
AjaUqq1950

Latest Threads

Top