Finding 1000 largest numbers from a file having some billion numbers

R

Richard Tobin

user923005 said:
You will have to adjust the (K-element) heap for every element in the
list (e.g. one billion of them, less 1000).
This involves at least two operations:
1. Adding a new element. (Constant time, so this just affects the
constant of proportionality)
2. Removing the max element. (This requires a re-balance and is not
O(1)).

It may depend on K, but K is a constant (1000). The time to rebalance
does not depend on N.

(And you don't have to adjust the heap for the vast majority of the
billion items, since they will almost all be smaller than the current
smallest heap item.)

-- Richard
 
U

user923005

It may depend on K, but K is a constant (1000). The time to rebalance
does not depend on N.

Then the total set size (of one billion) is also a constant and so the
whole thing is O(1).
(And you don't have to adjust the heap for the vast majority of the
billion items, since they will almost all be smaller than the current
smallest heap item.)

Here is an adversary data set:
1000 instances of 0, 1000 instances of 1, 1000 instances of 2 ... 1000
instances of 1000000.

Also, the sequence 1... 1000000000 is an adversary set.
 
C

christian.bau

It's still O(N) {of course}.
O(1) will win you some kind of prize, I am sure.

Picking the top 1000 items out of one billion is obviously O (1) as
long as you implement it using an algorithm with some upper bound for
the execution time.

Picking the top M out of N items using a heap takes O (N log M) worst
case; if the N items are in random order then I suppose it is
something like O (N + M log (N/M) log M). If you suspect that the N
elements might be sorted from lowest to highest, then you could use a
heap and insert the elements in order 1, N, 2, N-1, 3, N-2, 4, N-3
etc.
 
R

Richard Tobin

It may depend on K, but K is a constant (1000). The time to rebalance
does not depend on N.
[/QUOTE]
Then the total set size (of one billion) is also a constant and so the
whole thing is O(1).

The original question was phrased so as to suggest that the number 1000
was fixed, and 1 billion was an order of magnitude estimate of the size.
Here is an adversary data set:
1000 instances of 0, 1000 instances of 1, 1000 instances of 2 ... 1000
instances of 1000000.

Yes, I said in one of the first articles in the thread that I was
assuming that the data was in a random order. I didn't bother to
repeat it in each article.

-- Richard
 
U

user923005

The original question was phrased so as to suggest that the number 1000
was fixed, and 1 billion was an order of magnitude estimate of the size.


It doesn't work that way. In the same way Radix sort is not O(n) but
O(n*m) where m is the string length. For any value of m, it is just a
fixed constant, but it is crucial to the understanding of the
algorithm and also crucial to the runtime. In a similar way, the
number of items in the subset is going to have a significant impact on
the performance of the operation.
Yes, I said in one of the first articles in the thread that I was
assuming that the data was in a random order. I didn't bother to
repeat it in each article.

That is a bad assumption. Real data usually has a great deal of
pattern in it, and nearly sorted is a very frequent arrival pattern
for data.
 
U

user923005

Picking the top 1000 items out of one billion is obviously O (1) as
long as you implement it using an algorithm with some upper bound for
the execution time.

Picking the top M out of N items using a heap takes O (N log M) worst
case; if the N items are in random order then I suppose it is
something like O (N + M log (N/M) log M). If you suspect that the N
elements might be sorted from lowest to highest, then you could use a
heap and insert the elements in order 1, N, 2, N-1, 3, N-2, 4, N-3
etc.

I agree with your analysis.

This thread makes me wonder, is it possible to create a "leaking heap"
that simply disregards data outside of its useful range, and when an
item is inserted, the least significant item simply gets shoved off a
cliff into the etherbits.

I think it might be a useful item.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,769
Messages
2,569,580
Members
45,055
Latest member
SlimSparkKetoACVReview

Latest Threads

Top