Finding the 1000 largest numbers in a file with a billion or so entries

Subra

Hi,

What is the best way to find the 1000 largest numbers from a file with
a hell of a lot of entries? Can you please help me find a way to do
it? Do I need to go for B+ trees?

Please help,
Subramanya M
 
Richard Tobin

Subra said:
What is the best way to find the 1000 largest numbers from a file with
a hell of a lot of entries? Can you please help me find a way to do
it? Do I need to go for B+ trees?

This is really just an algorithm question, nothing to do with C in
particular.

I can't see a better way to do it than keeping the 1000 largest so far
in a sorted list with some structure that allows you to quickly insert
a new item in the correct position and discard the old 1000th number.
A B+-tree has those properties.

Keep a note of the current 1000th value so that you can discard
smaller values without even looking at the tree. If you really have a
billion numbers, and they're in a random order, the vast majority will
be smaller than the current 1000th, so the efficiency of the structure
in which you keep the largest 1000 may be of little importance. You
might consider how the average number of insertions varies with the
number of items in the file.

-- Richard
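A minimal C sketch of the approach Richard describes, keeping the 1000
largest values seen so far in a sorted array and rejecting anything at
or below the current 1000th value before touching the array. The names
here and the use of plain int values are illustrative assumptions, not
something from the thread:

#include <string.h>

#define K 1000

static int top[K];     /* sorted ascending: top[0] is the current 1000th largest */
static int count;      /* how many slots are filled so far */

void consider(int x)
{
    int lo, hi;

    /* cheap rejection: with random input, almost every number ends here */
    if (count == K && x <= top[0])
        return;

    /* binary search for the insertion point */
    lo = 0;
    hi = count;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (top[mid] < x)
            lo = mid + 1;
        else
            hi = mid;
    }

    if (count < K) {
        /* still filling up: shift the tail right and insert */
        memmove(&top[lo + 1], &top[lo], (count - lo) * sizeof top[0]);
        top[lo] = x;
        count++;
    } else {
        /* full: drop the old minimum and slide the prefix left */
        memmove(&top[0], &top[1], (lo - 1) * sizeof top[0]);
        top[lo - 1] = x;
    }
}

The shifting costs up to 1000 moves per surviving value, but as noted
above, with a billion random inputs the survivors are so rare that the
cost of this step barely matters.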
 
Eric Sosman

Subra said:
Hi,

What is the best way to find the 1000 largest numbers from a file with
a hell of a lot of entries? Can you please help me find a way to do
it? Do I need to go for B+ trees?

I'd suggest using a heap, as in Heapsort.
 
bytebro

What is the best way to find the 1000 largest numbers from a file with
a hell of a lot of entries? Can you please help me find a way to do
it? Do I need to go for B+ trees?

sort file | tail -1000
 
Richard Tobin

What is the best way to find the 1000 largest numbers from a file with
a hell of a lot of entries? Can you please help me find a way to do
it? Do I need to go for B+ trees?
sort file | tail -1000

Apart from the fact that you at least need "sort -n", it's true that
this might be the best way if you only wanted to do it a small number
of times. If the count were increased from a billion to, say, a
hundred billion, it would probably not be usable on most current machines.

-- Richard
 
user923005

Hi,

What is the best way to find the 1000 largest numbers from a file with
a hell of a lot of entries? Can you please help me find a way to do
it? Do I need to go for B+ trees?

Please help,
Subramanya M

Your question is better suited to an algorithms newsgroup.
The answer to your question is called Quickselect() and is described
in detail in:
T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to
Algorithms, Cambridge: The MIT Press, 1990.

It's O(N). All the other solutions posed here are O(N*log(N))
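A compressed C sketch of the partitioning idea behind Quickselect.
This is an illustration of the general technique rather than the code
in Cormen et al., and it assumes the whole array fits in memory, which
is exactly the objection raised below:

#include <stdlib.h>

static void swap(int *a, int *b)
{
    int t = *a;
    *a = *b;
    *b = t;
}

/* Rearrange a[lo..hi] so the element that belongs at index `want` in
   sorted order ends up there, partitioning around random pivots. */
static void quickselect(int a[], int lo, int hi, int want)
{
    while (lo < hi) {
        int pivot = a[lo + rand() % (hi - lo + 1)];
        int i = lo, j = hi;

        while (i <= j) {
            while (a[i] < pivot) i++;
            while (a[j] > pivot) j--;
            if (i <= j)
                swap(&a[i++], &a[j--]);
        }
        if (want <= j)
            hi = j;          /* the target rank lies in the left part  */
        else if (want >= i)
            lo = i;          /* the target rank lies in the right part */
        else
            return;          /* the pivot block already contains it    */
    }
}

/* After this call, a[n-k..n-1] hold the k largest values (unordered). */
void top_k(int a[], int n, int k)
{
    if (k > 0 && k <= n)
        quickselect(a, 0, n - 1, n - k);
}

The expected cost is linear in n because, on average, each pass
discards a constant fraction of the remaining range.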
 
Eric Sosman

user923005 wrote on 03/06/07 13:12:
Your question is better suited to an algorithms newsgroup.
The answer to your question is called Quickselect() and is described
in detail in:
T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to
Algorithms, Cambridge: The MIT Press, 1990.

Observation #1: The "Quick Select" Google turns up
chooses *one* value from a set, not a thousand.

Observation #2: It also needs random access to the
entire set, which consists in this case of a billion
elements.
It's O(N). All the other solutions posed here are O(N*log(N))

Observation #3: This is wrong. For (counter-)example,
I suggested using a heap, whose time complexity would be
O(N*log(1000))=O(N).
 
user923005

Eric Sosman wrote:
Observation #1: The "Quick Select" Google turns up
chooses *one* value from a set, not a thousand.

Sorry, but Google is full of crap (I bet it was a Wikipedia article,
and half of those are baloney). True, one value is returned, but the
set is partitioned with the top k elements at the top of the set.
That's how it finds the one it is looking for: by partitioning around
random pivots.
Observation #2: It also needs random access to the
entire set, which consists in this case of a billion
elements.

There you have me. You would have to be able to load all billion
keys. As an alternative, you could process the data one million rows
at a time, take the top 1000 from each of those one thousand sets, and
perform the algorithm on that. It's still O(1), with a larger constant
of proportionality.
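A rough C sketch of that chunked reduction, assuming the numbers
arrive as text on stdin. qsort on each bounded chunk stands in for
whatever selection routine you prefer (Quickselect, a heap, and so
on); the point is only that at most one chunk plus the current
survivors is ever in memory:

#include <stdio.h>
#include <stdlib.h>

#define CHUNK 1000000
#define K     1000

static int desc(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x < y) - (x > y);            /* sort descending */
}

/* Keep only the (at most) K largest of buf[0..n-1] in buf[0..K-1]. */
static long keep_top_k(int buf[], long n)
{
    qsort(buf, (size_t)n, sizeof buf[0], desc);
    return n < K ? n : K;
}

/* Read at most `max` whitespace-separated integers from fp. */
static long read_chunk(FILE *fp, int buf[], long max)
{
    long n = 0;
    while (n < max && fscanf(fp, "%d", &buf[n]) == 1)
        n++;
    return n;
}

int main(void)
{
    static int buf[CHUNK + K];   /* one chunk plus the survivors so far */
    long have = 0, got, i;

    while ((got = read_chunk(stdin, buf + have, CHUNK)) > 0)
        have = keep_top_k(buf, have + got);

    for (i = 0; i < have; i++)
        printf("%d\n", buf[i]);
    return 0;
}

With qsort each chunk costs O(CHUNK log CHUNK), so the whole pass is
proportional to the input size times a log of the chunk size; swapping
in a linear-time selection brings it back to the O(N) claimed above.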
Observation #3: This is wrong. For (counter-)example,
I suggested using a heap, whose time complexity would be
O(N*log(1000))=O(N).

Using a heap is not O(n). If you consider the set a fixed size (e.g.
one billion) and you consider the region of interest a fixed size
(e.g. 1000) then true, we can call it O(n). But in algorithmic terms
both of those things are variables. We could (in fact) bogosort the
whole mess and then pick the top 1000 items. Since 1e9 is a constant,
and if N is a constant, then N! is a constant -- we could say that the
whole thing is still O(1). Of course, it does not work like that.
 
CBFalconer

Subra said:
What is the best way to find the 1000 largest numbers from a file with
a hell of a lot of entries? Can you please help me find a way to do
it? Do I need to go for B+ trees?

Look up heaps. All you need is an array[1000] and a few auxiliary
variables.

--
<http://www.cs.auckland.ac.nz/~pgut001/pubs/vista_cost.txt>
<http://www.securityfocus.com/columnists/423>

"A man who is right every time is not likely to do very much."
-- Francis Crick, co-discoverer of DNA
"There is nothing more amazing than stupidity in action."
-- Thomas Matthews
 
user923005

It turns out that the ideal solution to his problem is discussed in
this paper:

"Improved Algorithms and Analysis for Secretary Problems and
Generalizations"
Miklos Ajtai, Nimrod Megiddo, Orli Waarts
IEEE Symposium on Foundations of Computer Science

It deals not only with the selection issue, but also directly with the
problem of not being able to hold all of the items in memory.
 
user923005

Sorry, but Google is full of crap (I bet it was a Wikipedia article,
and half of those are baloney). True, one value is returned, but the
set is partitioned with the top k elements at the top of the set.
That's how it finds the one it is looking for: by partitioning around
random pivots.


There you have me. You would have to be able to load all billion
keys. As an alternative, you could process the data one million rows
at a time, take the top 1000 from each of those one thousand sets, and
perform the algorithm on that. It's still O(1), with a larger constant
of proportionality.

It's still O(N) {of course}.
O(1) will win you some kind of prize, I am sure.
 
Eric Sosman

user923005 wrote on 03/06/07 14:06:
user923005 wrote on 03/06/07 13:12:
[snip]

It's O(N). All the other solutions posed here are O(N*log(N))

Observation #3: This is wrong. For (counter-)example,
I suggested using a heap, whose time complexity would be
O(N*log(1000))=O(N).


Using a heap is not O(n). If you consider the set a fixed size (e.g.
one billion) and you consider the region of interest a fixed size
(e.g. 1000) then true, we can call it O(n). But in algorithmic terms
both of those things are variables. We could (in fact) bogosort the
whole mess and then pick the top 1000 items. Since 1e9 is a constant,
and if N is a constant, then N! is a constant -- we could say that the
whole thing is still O(1). Of course, it does not work like that.

You seem to be mixing up two different N's: the size
of the data set being scanned, and the size of the heap.
My suggestion (which I didn't spell out fully, because I
thought it was obvious) is to initialize by reading the
first 1000 numbers and forming a heap from them, then to
read the remaining numbers one at a time, sifting-up each
into the heap and discarding the smallest to keep the
heap size fixed at 1000. Taking N to mean the data size:

Initialization: O(log 1)+O(log 2)+...+O(log 1000) = O(1)
Each subsequent sift-up: O(log 1000) = O(1)
Number of sift-ups: N-1000 = O(N)
Total: O(N)*O(1) + O(1) = O(N)

The important point is that the heap size is fixed at
1000, so each heap operation takes O(1) time. If the
problem were to find the largest M of N numbers, then the
parameterization would be different. The heap approach
would then be characterized as

Initialization: O(log 1)+O(log 2)+...+O(log M) = O(1)
Each subsequent sift-up: O(log M)
Number of sift-ups: N-M
Total: (N-M)*O(log M) + O(1) = (N-M)*O(log M)
If M << N, this is O(N * log M)

... but even here no O(N log N) term is present.
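One way to code the fixed-size heap Eric describes, as a sketch only:
a min-heap of 1000 values whose root is the current 1000th largest.
The steady-state step is written here as replace-the-root-and-sift-down
rather than Eric's sift-up wording, but the cost per surviving value is
the same O(log 1000); all names are illustrative:

#define K 1000

static long heap[K];   /* min-heap: heap[0] is the smallest of the kept K */
static int  filled;    /* number of slots used so far */

/* Restore the heap property from node i downward: O(log K) per call. */
static void sift_down(int i)
{
    for (;;) {
        int left = 2 * i + 1, right = left + 1, small = i;
        long t;
        if (left < filled && heap[left] < heap[small])
            small = left;
        if (right < filled && heap[right] < heap[small])
            small = right;
        if (small == i)
            break;
        t = heap[i]; heap[i] = heap[small]; heap[small] = t;
        i = small;
    }
}

/* Restore the heap property from node i upward (used while filling). */
static void sift_up(int i)
{
    while (i > 0) {
        int parent = (i - 1) / 2;
        long t;
        if (heap[parent] <= heap[i])
            break;
        t = heap[i]; heap[i] = heap[parent]; heap[parent] = t;
        i = parent;
    }
}

void consider(long x)
{
    if (filled < K) {            /* the first 1000 numbers just build the heap */
        heap[filled] = x;
        sift_up(filled++);
    } else if (x > heap[0]) {    /* bigger than the current 1000th largest     */
        heap[0] = x;             /* replace the smallest kept value ...        */
        sift_down(0);            /* ... and re-establish the heap in O(log K)  */
    }                            /* everything else is rejected in O(1)        */
}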
 
Eric Sosman

Eric Sosman wrote on 03/06/07 14:54:
[...]

The important point is that the heap size is fixed at
1000, so each heap operation takes O(1) time. If the
problem were to find the largest M of N numbers, then the
parameterization would be different. The heap approach
would then be characterized as

Initialization: O(log 1)+O(log 2)+...+O(log M) = O(1)

My day for retractions: That should be O(M log M), of
course.
Each subsequent sift-up: O(log M)
Number of sift-ups: N-M
Total: (N-M)*O(log M) + O(1) = (N-M)*O(log M)

... which makes this (N-M)*O(log M) + O(M log M)
= O(N log M) + O(M log M)
If M << N, this is O(N * log M)

... but even here no O(N log N) term is present.

At least I got *something* right. Until my next
retraction, anyhow ... Sigh.
 
user923005

Eric Sosman wrote:
user923005 wrote on 03/06/07 13:12:
[snip]
It's O(N). All the other solutions posed here are O(N*log(N))
Observation #3: This is wrong. For (counter-)example,
I suggested using a heap, whose time complexity would be
O(N*log(1000))=O(N).
Using a heap is not O(n). If you consider the set a fixed size (e.g.
one billion) and you consider the region of interest a fixed size
(e.g. 1000) then true, we can call it O(n). But in algorithmic terms
both of those things are variables. We could (in fact) bogosort the
whole mess and then pick the top 1000 items. Since 1e9 is a constant,
and if N is a constant, then N! is a constant -- we could say that the
whole thing is still O(1). Of course, it does not work like that.

You seem to be mixing up two different N's: the size
of the data set being scanned, and the size of the heap.

No, I am saying that we should consider both of them as variables --
the size of the data and the size of the heap.
My suggestion (which I didn't spell out fully, because I
thought it was obvious) is to initialize by reading the
first 1000 numbers and forming a heap from them, then to
read the remaining numbers one at a time, sifting-up each
into the heap and discarding the smallest to keep the
heap size fixed at 1000. Taking N to mean the data size:

Initialization: O(log 1)+O(log 2)+...+O(log 1000) = O(1)
Each subsequent sift-up: O(log 1000) = O(1)
Number of sift-ups: N-1000 = O(N)
Total: O(N)*O(1) + O(1) = O(N)

The important point is that the heap size is fixed at
1000, so each heap operation takes O(1) time. If the
problem were to find the largest M of N numbers, then the
parameterization would be different. The heap approach
would then be characterized as

Initialization: O(log 1)+O(log 2)+...+O(log M) = O(1)
Each subsequent sift-up: O(log M)
Number of sift-ups: N-M
Total: (N-M)*O(log M) + O(1) = (N-M)*O(log M)
If M << N, this is O(N * log M)

... but even here no O(N log N) term is present.

Yes, it is O(N*log(M)), not O(N*log(N)).
The solution I proposed is O(N).
Your solution might be faster [even if I can load the whole thing into
RAM], because the typical runtime for a well designed quickselect is ~
24*N.
log2(1000) is just under 10, so if the proportionality constant for
your method is better than 2.4 it is probably faster in practice.
Also, loading all 1e9 elements is probably impractical, and so I have
a large constant of proportionality to deal with in the practical
solution (processing the subsets).

I guess that the paper I gave a link to has the best solution, but the
heap solution is admirable for low memory + efficiency.
 
CBFalconer

user923005 said:
... snip ...

Using a heap is not O(n). If you consider the set a fixed size
(e.g. one billion) and you consider the region of interest a
fixed size (e.g. 1000) then true, we can call it O(n). But in
algorithmic terms both of those things are variables. We could
(in fact) bogosort the whole mess and then pick the top 1000
items. Since 1e9 is a constant, and if N is a constant, then
N! is a constant -- we could say that the whole thing is still
O(1). Of course, it does not work like that.

You are confusing heapsort with the action of building a heap.
O(1) means that the time involved is constant. O(N) means that it
is proportional to N (not N!). All in the limit as N increases
indefinitely. Building a heap is O(N). Heapsort is O(NlogN).
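To make the distinction concrete, a small generic sketch (not code
from any post above): bottom-up construction of a max-heap over an
array is O(N); what makes heapsort O(N log N) is the N extract-max
steps that follow the build.

/* Downward fix-up for a max-heap stored in a[0..n-1]. */
static void sift_down_max(long a[], int n, int i)
{
    for (;;) {
        int left = 2 * i + 1, right = left + 1, big = i;
        long t;
        if (left < n && a[left] > a[big])
            big = left;
        if (right < n && a[right] > a[big])
            big = right;
        if (big == i)
            break;
        t = a[i]; a[i] = a[big]; a[big] = t;
        i = big;
    }
}

/* Bottom-up build: summing the sift depths over all nodes gives O(N),
   unlike N separate insertions, which would cost O(N log N). */
void build_heap(long a[], int n)
{
    int i;
    for (i = n / 2 - 1; i >= 0; i--)
        sift_down_max(a, n, i);
}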

--
<http://www.cs.auckland.ac.nz/~pgut001/pubs/vista_cost.txt>
<http://www.securityfocus.com/columnists/423>

"A man who is right every time is not likely to do very much."
-- Francis Crick, co-discoverer of DNA
"There is nothing more amazing than stupidity in action."
-- Thomas Matthews
 
Richard Tobin

It's O(N). All the other solutions posed here are O(N*log(N))
Observation #3: This is wrong. For (counter-)example,
I suggested using a heap, whose time complexity would be
O(N*log(1000))=O(N).

If you only look in the heap if the number is greater than the heap
minimum, it will be better than that.

-- Richard
 
Richard Tobin

If you only look in the heap if the number is greater than the heap
minimum, it will be better than that.

Oops, not better than O(N) of course. But there won't be a constant
factor of log(1000).

-- Richard
 
Eric Sosman

Richard said:
Oops, not better than O(N) of course. But there won't be a constant
factor of log(1000).

You're misusing the O notation. Consider: all of

O(log 1000)
O(1000)
O(10^100)
O(10^(10^100))

... are O(1). Constant factors simply disappear with big-Oh.
(Which is why I always grind my teeth when some nitwit tries to
explain the difference between O(lg N) and O(log N), unaware
that they are identical.)

As to comparing against the heap minimum: That would be the
first comparison in the sift-up, hence doing it an extra time
would not be an optimization. It might be an optimization if
the heap were being maintained with "sift-down" steps starting
from the root instead of from the leaves. But since by the
nature of the problem "most" numbers are likely to be rejected
early (unless the original data have a strong ascending
bias), starting from the leaves and sifting up seems natural.
 
Richard Tobin

Oops, not better than O(N) of course. But there won't be a constant
factor of log(1000).
You're misusing the O notation. Consider: all of

O(log 1000)
O(1000)
O(10^100)
O(10^(10^100))

Yes, that's why I corrected my mis-statement that it was better
than O(N).
As to comparing against the heap minimum: That would be the
first comparison in the sift-up, hence doing it an extra time
would not be an optimization.

Ok, but the statement above (that it's O(N * log(1000))) suggested
that you would need log(k) operations on average with a heap of size
k, which in turn suggested that there wasn't an "early out" when the
value was less than the minimum.

-- Richard
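To make that concrete, reusing the names from the heap sketch earlier
in the thread (heap[0] holding the smallest of the kept 1000): the
early out is just a guard in front of the heap adjustment, so the
O(log 1000) sift is only paid by values that actually displace the
current minimum.

void consider_with_early_out(long x)
{
    if (x <= heap[0])    /* below the current 1000th largest: O(1) rejection */
        return;
    heap[0] = x;         /* rare survivor: replace the minimum ...           */
    sift_down(0);        /* ... and pay the O(log 1000) fix-up               */
}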
 
user923005

... snip ...



You are confusing heapsort with the action of building a heap.

No, I am not.
O(1) means that the time involved is constant. O(N) means that it
is proportional to N (not N!).
Agreed.

All in the limit as N increases
indefinitely. Building a heap is O(N).
Agreed.

Heapsort is O(NlogN).
Agreed.

You will have to adjust the (K-element) heap for every element in the
list (e.g. one billion of them, less 1000).
This involves at least two operations:
1. Adding a new element. (Constant time, so this just affects the
constant of proportionality)
2. Removing the max element. (This requires a re-balance and is not
O(1)).
 
