Finding 1000 largest numbers from a file having some billion numbers

Discussion in 'C Programming' started by Subra, Mar 6, 2007.

1. SubraGuest

Hi,

What is the best way to find the 1000 largest numbers from the file
having hell lot of entries ? Can you please help me to find out the
way ? Do I need to go for B+ trees ??

Subramanya M

Subra, Mar 6, 2007

2. Richard TobinGuest

In article <>,
Subra <> wrote:

> What is the best way to find the 1000 largest numbers from the file
>having hell lot of entries ? Can you please help me to find out the
>way ? Do I need to go for B+ trees ??

This is really just an algorithm question, nothing to do with C in
particular.

I can't see a better way to do it than keeping the 1000 largest so far
in a sorted list with some structure that allows you to quickly insert
a new item in the correct position and discard the old 1000th number.
A B+-tree has those properties.

Keep a note of the current 1000th value so that you can discard
smaller values without even looking at the tree. If you really have a
billion numbers, and they're in a random order, the vast majority will
be smaller than the current 1000th, so the efficiency of the structure
in which you keep the largest 1000 may be of little importance. You
might consider how the average number of insertions varies with the
number of items in the file.

-- Richard

--
"Consideration shall be given to the need for as many as 32 characters
in some alphabets" - X3.4, 1963.

Richard Tobin, Mar 6, 2007
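[Editor's note: the approach Richard describes -- a sorted list of the
1000 largest so far, plus a one-comparison rejection test against the
current 1000th value -- can be sketched in C as below. This is an
illustrative sketch only, not code from the thread; the array layout,
the names `top`/`offer`, and the use of `long` are my assumptions.]

```c
#include <string.h>

#define K 1000

static long top[K];   /* ascending; top[0] is the current K-th largest */
static int count = 0; /* slots filled so far */

void offer(long x)
{
    int lo = 0, hi = count;

    if (count == K && x <= top[0])
        return;               /* the cheap, overwhelmingly common case */

    while (lo < hi) {         /* binary search for the insertion point */
        int mid = lo + (hi - lo) / 2;
        if (top[mid] < x) lo = mid + 1; else hi = mid;
    }

    if (count < K) {          /* still filling: shift up and insert */
        memmove(&top[lo + 1], &top[lo], (count - lo) * sizeof top[0]);
        top[lo] = x;
        count++;
    } else {                  /* full: discard top[0], shift down */
        memmove(&top[0], &top[1], (lo - 1) * sizeof top[0]);
        top[lo - 1] = x;
    }
}
```

With a billion random-order inputs almost every call returns at the
first comparison, which is exactly Richard's point: the choice of
structure for the 1000 largest barely matters.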

3. Eric SosmanGuest

Re: Finding 1000 largest numbers from a file having some billion numbers

Subra wrote:
> Hi,
>
> What is the best way to find the 1000 largest numbers from the file
> having hell lot of entries ? Can you please help me to find out the
> way ? Do I need to go for B+ trees ??

I'd suggest using a heap, as in Heapsort.

--
Eric Sosman

Eric Sosman, Mar 6, 2007
4. bytebroGuest

On 6 Mar, 10:30, "Subra" <> wrote:
> What is the best way to find the 1000 largest numbers from the file
> having hell lot of entries ? Can you please help me to find out the
> way ? Do I need to go for B+ trees ??

sort file | tail -1000

bytebro, Mar 6, 2007
5. Richard TobinGuest

In article <>,
bytebro <> wrote:

>> What is the best way to find the 1000 largest numbers from the file
>> having hell lot of entries ? Can you please help me to find out the
>> way ? Do I need to go for B+ trees ??

>sort file | tail -1000

Apart from the fact that you at least need "sort -n", it's true that
that might be the best way if you only wanted to do it a small number
of times. If the number were increased from a billion to, say, a
hundred billion it would probably not be usable on most current machines.

-- Richard
--
"Consideration shall be given to the need for as many as 32 characters
in some alphabets" - X3.4, 1963.

Richard Tobin, Mar 6, 2007
6. user923005Guest

On Mar 6, 2:30 am, "Subra" <> wrote:
> Hi,
>
> What is the best way to find the 1000 largest numbers from the file
> having hell lot of entries ? Can you please help me to find out the
> way ? Do I need to go for B+ trees ??
>
> Subramanya M

Your question is better suited to news:comp.programming.

The answer to your question is called Quickselect() and is described
in detail in:
T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to
Algorithms, Cambridge: The MIT Press, 1990.

It's O(N). All the other solutions posed here are O(N*log(N))

user923005, Mar 6, 2007
7. Eric SosmanGuest

Re: Finding 1000 largest numbers from a file having some billion numbers

user923005 wrote On 03/06/07 13:12,:
> On Mar 6, 2:30 am, "Subra" <> wrote:
>
>>Hi,
>>
>> What is the best way to find the 1000 largest numbers from the file
>>having hell lot of entries ? Can you please help me to find out the
>>way ? Do I need to go for B+ trees ??
>>
>>Subramanya M

>
>
> Your question is better suited to news:comp.programming.
>
> The answer to your question is called Quickselect() and is described
> in detail in:
> T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to
> Algorithms, Cambridge: The MIT Press, 1990.

Observation #1: The "Quick Select" Google turns up
chooses *one* value from a set, not a thousand.

Observation #2: Quickselect requires random access to the
entire set, which consists in this case of a billion
elements.

> It's O(N). All the other solutions posed here are O(N*log(N))

Observation #3: This is wrong. For (counter-)example,
I suggested using a heap, whose time complexity would be
O(N*log(1000))=O(N).

--

Eric Sosman, Mar 6, 2007
8. user923005Guest

On Mar 6, 10:46 am, Eric Sosman <> wrote:
> user923005 wrote On 03/06/07 13:12,:
>
> > On Mar 6, 2:30 am, "Subra" <> wrote:

>
> >>Hi,

>
> >> What is the best way to find the 1000 largest numbers from the file
> >>having hell lot of entries ? Can you please help me to find out the
> >>way ? Do I need to go for B+ trees ??

>
> >>Subramanya M

>
> > Your question is better suited to news:comp.programming.

>
> > The answer to your question is called Quickselect() and is described
> > in detail in:
> > T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to
> > Algorithms, Cambridge: The MIT Press, 1990.

>
> Observation #1: The "Quick Select" Google turns up
> chooses *one* value from a set, not a thousand.

Sorry, but google is full of crap (I bet it was a Wikipedia article
which half the time are bologna). True, one value is returned, but
the set is partitioned with the top k elements at the top of the set.
That's how it finds the one it is looking for, by partitioning random
partitions.

> Observation #2: Quickselect requires random access to the
> entire set, which consists in this case of a billion
> elements.

There you have me. You would have to be able to load all billion
keys. As an alternative, you could process the data one million rows
at a time. Take the top 1000 from those one thousand sets and perform
the algorithm on that. It's still O(1), with a larger constant of
proportionality.

> > It's O(N). All the other solutions posed here are O(N*log(N))

>
> Observation #3: This is wrong. For (counter-)example,
> I suggested using a heap, whose time complexity would be
> O(N*log(1000))=O(N).

Using a heap is not O(n). If you consider the set a fixed size (e.g.
one billion) and you consider the region of interest a fixed size
(e.g. 1000) then true, we can call it O(n). But in algorithmic terms
both of those things are variables. We could (in fact) bogosort the
whole mess and then pick the top 1000 items. Since 1e9 is a constant,
and if N is a constant, then N! is a constant -- we could say that the
whole thing is still O(1). Of course, it does not work like that.

> --
>

user923005, Mar 6, 2007
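[Editor's note: the partitioning behaviour user923005 describes -- one
value is returned, but the k largest end up grouped at one end of the
array -- can be sketched as a Quickselect-style loop in C. A sketch
under assumptions: the Lomuto-style partition and the names
`select_top_k`/`swap_long` are mine, not from CLR or the thread.]

```c
#include <stddef.h>

/* After the call, the k largest values occupy a[0..k-1],
 * in no particular order. */
static void swap_long(long *x, long *y) { long t = *x; *x = *y; *y = t; }

void select_top_k(long *a, size_t n, size_t k)
{
    size_t lo = 0, hi = n;           /* active range is a[lo..hi-1] */

    if (k == 0 || k >= n)
        return;
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2, i = lo, j;
        swap_long(&a[mid], &a[hi - 1]);   /* park the pivot at the end */
        for (j = lo; j + 1 < hi; j++)
            if (a[j] > a[hi - 1])         /* descending partition */
                swap_long(&a[i++], &a[j]);
        swap_long(&a[i], &a[hi - 1]);     /* pivot into final slot i */
        if (i == k - 1)
            return;                       /* a[0..k-1] are the k largest */
        if (i > k - 1) hi = i; else lo = i + 1;
    }
}
```

Expected O(N) total, as the post says -- but it needs the whole array
in memory, which is the sticking point raised in the replies.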
9. CBFalconerGuest

Subra wrote:
>
> What is the best way to find the 1000 largest numbers from the file
> way ? Do I need to go for B+ trees ??

Look up heaps. All you need is an array[1000] and a few auxiliary
variables.

--
<http://www.cs.auckland.ac.nz/~pgut001/pubs/vista_cost.txt>
<http://www.securityfocus.com/columnists/423>

"A man who is right every time is not likely to do very much."
-- Francis Crick, co-discoverer of DNA
"There is nothing more amazing than stupidity in action."
-- Thomas Matthews

CBFalconer, Mar 6, 2007
10. user923005Guest

It turns out that the ideal solution to his problem is discussed in
this paper:

"Improved Algorithms and Analysis for Secretary Problems and
Generalizations"
Miklos Ajtai, Nimrod Megiddo, Orli Waarts
IEEE Symposium on Foundations of Computer Science

It deals not only with the selection issue, but also directly with the
problem of not being able to hold all of the items in memory.

user923005, Mar 6, 2007
11. user923005Guest

On Mar 6, 11:06 am, "user923005" <> wrote:
> On Mar 6, 10:46 am, Eric Sosman <> wrote:
>
>
>
> > user923005 wrote On 03/06/07 13:12,:

>
> > > On Mar 6, 2:30 am, "Subra" <> wrote:

>
> > >>Hi,

>
> > >> What is the best way to find the 1000 largest numbers from the file
> > >>having hell lot of entries ? Can you please help me to find out the
> > >>way ? Do I need to go for B+ trees ??

>
> > >>Subramanya M

>
> > > Your question is better suited to news:comp.programming.

>
> > > The answer to your question is called Quickselect() and is described
> > > in detail in:
> > > T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to
> > > Algorithms, Cambridge: The MIT Press, 1990.

>
> > Observation #1: The "Quick Select" Google turns up
> > chooses *one* value from a set, not a thousand.

>
> Sorry, but google is full of crap (I bet it was a Wikipedia article
> which half the time are bologna). True, one value is returned, but
> the set is partitioned with the top k elements at the top of the set.
> That's how it finds the one it is looking for, by partitioning random
> partitions.
>
> > Observation #2: Quickselect requires random access to the
> > entire set, which consists in this case of a billion
> > elements.

>
> There you have me. You would have to be able to load all billion
> keys. As an alternative, you could process the data one million rows
> at a time. Take the top 1000 from those one thousand sets and perform
> the algorithm on that. It's still O(1), with a larger constant of
> proportionality/

It's still O(N) {of course}.
O(1) will win you some kind of prize, I am sure.

user923005, Mar 6, 2007
12. Eric SosmanGuest

Re: Finding 1000 largest numbers from a file having some billion numbers

user923005 wrote On 03/06/07 14:06,:
> On Mar 6, 10:46 am, Eric Sosman <> wrote:
>
>>user923005 wrote On 03/06/07 13:12,:
>>>
>>> [snip]
>>>
>>>It's O(N). All the other solutions posed here are O(N*log(N))

>>
>> Observation #3: This is wrong. For (counter-)example,
>>I suggested using a heap, whose time complexity would be
>>O(N*log(1000))=O(N).

>
>
> Using a heap is not O(n). If you consider the set a fixed size (e.g.
> one billion) and you consider the region of interest a fixed size
> (e.g. 1000) then true, we can call it O(n). But in algorithmic terms
> both of those things are variables. We could (in fact) bogosort the
> whole mess and then pick the top 1000 items. Since 1e9 is a constant,
> and if N is a constant, then N! is a constant -- we could say that the
> whole thing is still O(1). Of course, it does not work like that.

You seem to be mixing up two different N's: the size
of the data set being scanned, and the size of the heap.
My suggestion (which I didn't spell out fully, because I
thought it was obvious) is to initialize by reading the
first 1000 numbers and forming a heap from them, then to
read the remaining numbers one at a time, sifting-up each
into the heap and discarding the smallest to keep the
heap size fixed at 1000. Taking N to mean the data size:

Initialization: O(log 1)+O(log 2)+...+O(log 1000) = O(1)
Each subsequent sift-up: O(log 1000) = O(1)
Number of sift-ups: N-1000 = O(N)
Total: O(N)*O(1) + O(1) = O(N)

The important point is that the heap size is fixed at
1000, so each heap operation takes O(1) time. If the
problem were to find the largest M of N numbers, then the
parameterization would be different. The heap approach
would then be characterized as

Initialization: O(log 1)+O(log 2)+...+O(log M) = O(1)
Each subsequent sift-up: O(log M)
Number of sift-ups: N-M
Total: (N-M)*O(log M) + O(1) = (N-M)*O(log M)
If M << N, this is O(N * log M)

.... but even here no O(N log N) term is present.

--

Eric Sosman, Mar 6, 2007
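[Editor's note: Eric's scheme -- build a 1000-element heap from the
first 1000 numbers, then process the rest one at a time with O(log M)
heap work -- can be sketched in C as below. Eric describes sifting up
from the leaves; this replace-the-root-and-sift-down form is the
equivalent most implementations use. The names `h`/`feed` and the use
of `long` are my assumptions, not code from the thread.]

```c
#include <stddef.h>

#define M 1000

/* Min-heap of the M largest values seen so far; the root h[0]
 * is the current M-th largest. */
static long h[M];
static size_t filled = 0;

static void sift_down(size_t i)
{
    for (;;) {
        size_t l = 2 * i + 1, r = l + 1, s = i;
        long t;
        if (l < filled && h[l] < h[s]) s = l;
        if (r < filled && h[r] < h[s]) s = r;
        if (s == i) return;
        t = h[i]; h[i] = h[s]; h[s] = t;
        i = s;
    }
}

static void sift_up(size_t i)
{
    while (i > 0 && h[i] < h[(i - 1) / 2]) {
        long t = h[i];
        h[i] = h[(i - 1) / 2];
        h[(i - 1) / 2] = t;
        i = (i - 1) / 2;
    }
}

void feed(long x)
{
    if (filled < M) {           /* initialization: O(M log M) total */
        h[filled] = x;
        sift_up(filled++);
    } else if (x > h[0]) {      /* early out against the current minimum */
        h[0] = x;               /* discard the old minimum ... */
        sift_down(0);           /* ... and restore the heap: O(log M) */
    }
}
```

The `x > h[0]` test is the "early out" Richard mentions later in the
thread: with random-order input, the vast majority of calls do no heap
work at all.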
13. Eric SosmanGuest

Re: Finding 1000 largest numbers from a file having some billion numbers

Eric Sosman wrote On 03/06/07 14:54,:
> [...]
>
> The important point is that the heap size is fixed at
> 1000, so each heap operation takes O(1) time. If the
> problem were to find the largest M of N numbers, then the
> parameterization would be different. The heap approach
> would then be characterized as
>
> Initialization: O(log 1)+O(log 2)+...+O(log M) = O(1)

My day for retractions: That should be O(M log M), of
course.

> Each subsequent sift-up: O(log M)
> Number of sift-ups: N-M
> Total: (N-M)*O(log M) + O(1) = (N-M)*O(log M)

... which makes this (N-M)*O(log M) + O(M log M)
= O(N log M) + O(M log M)

> If M << N, this is O(N * log M)
>
> ... but even here no O(N log N) term is present.

At least I got *something* right. Until my next
retraction, anyhow ... Sigh.

--

Eric Sosman, Mar 6, 2007
14. user923005Guest

On Mar 6, 11:54 am, Eric Sosman <> wrote:
> user923005 wrote On 03/06/07 14:06,:
>
> > On Mar 6, 10:46 am, Eric Sosman <> wrote:

>
> >>user923005 wrote On 03/06/07 13:12,:

>
> >>> [snip]

>
> >>>It's O(N). All the other solutions posed here are O(N*log(N))

>
> >> Observation #3: This is wrong. For (counter-)example,
> >>I suggested using a heap, whose time complexity would be
> >>O(N*log(1000))=O(N).

>
> > Using a heap is not O(n). If you consider the set a fixed size (e.g.
> > one billion) and you consider the region of interest a fixed size
> > (e.g. 1000) then true, we can call it O(n). But in algorithmic terms
> > both of those things are variables. We could (in fact) bogosort the
> > whole mess and then pick the top 1000 items. Since 1e9 is a constant,
> > and if N is a constant, then N! is a constant -- we could say that the
> > whole thing is still O(1). Of course, it does not work like that.

>
> You seem to be mixing up two different N's: the size
> of the data set being scanned, and the size of the heap.

No, I am saying that we should consider both of them as variables --
the size of the data and the size of the heap.

> My suggestion (which I didn't spell out fully, because I
> thought it was obvious) is to initialize by reading the
> first 1000 numbers and forming a heap from them, then to
> read the remaining numbers one at a time, sifting-up each
> into the heap and discarding the smallest to keep the
> heap size fixed at 1000. Taking N to mean the data size:
>
> Initialization: O(log 1)+O(log 2)+...+O(log 1000) = O(1)
> Each subsequent sift-up: O(log 1000) = O(1)
> Number of sift-ups: N-1000 = O(N)
> Total: O(N)*O(1) + O(1) = O(N)
>
> The important point is that the heap size is fixed at
> 1000, so each heap operation takes O(1) time. If the
> problem were to find the largest M of N numbers, then the
> parameterization would be different. The heap approach
> would then be characterized as
>
> Initialization: O(log 1)+O(log 2)+...+O(log M) = O(1)
> Each subsequent sift-up: O(log M)
> Number of sift-ups: N-M
> Total: (N-M)*O(log M) + O(1) = (N-M)*O(log M)
> If M << N, this is O(N * log M)
>
> ... but even here no O(N log N) term is present.

Yes, it is O(N*log(M)), not O(N*log(N)).
The solution I proposed is O(N).
Your solution might be faster [even if I can load the whole thing into
RAM], because the typical runtime for a well designed quickselect is ~
24*N.
log2(1000) is just under 10, so if the proportionality constant for
your method is better than 2.4 it is probably faster in practice.
Also, loading all 1e9 elements is probably impractical, and so I have
a large constant of proportionality to deal with in the practical
solution (processing the subsets).

I guess that the paper I gave a link to has the best solution, but the
heap solution is admirable for low memory + efficiency.

user923005, Mar 6, 2007
15. CBFalconerGuest

user923005 wrote:
> On Mar 6, 10:46 am, Eric Sosman <> wrote:
>

.... snip ...
>>
>> Observation #3: This is wrong. For (counter-)example,
>> I suggested using a heap, whose time complexity would be
>> O(N*log(1000))=O(N).

>
> Using a heap is not O(n). If you consider the set a fixed size
> (e.g. one billion) and you consider the region of interest a
> fixed size (e.g. 1000) then true, we can call it O(n). But in
> algorithmic terms both of those things are variables. We could
> (in fact) bogosort the whole mess and then pick the top 1000
> items. Since 1e9 is a constant, and if N is a constant, then
> N! is a constant -- we could say that the whole thing is still
> O(1). Of course, it does not work like that.

You are confusing heapsort with the action of building a heap.
O(1) means that the time involved is constant. O(N) means that it
is proportional to N (not N!). All in the limit as N increases
indefinitely. Building a heap is O(N). Heapsort is O(NlogN).

--
<http://www.cs.auckland.ac.nz/~pgut001/pubs/vista_cost.txt>
<http://www.securityfocus.com/columnists/423>

"A man who is right every time is not likely to do very much."
-- Francis Crick, co-discoverer of DNA
"There is nothing more amazing than stupidity in action."
-- Thomas Matthews

CBFalconer, Mar 6, 2007
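[Editor's note: the O(N) heap construction CBFalconer refers to is
Floyd's bottom-up build -- sift down each internal node, last one
first. A minimal sketch follows, as a max-heap over `long`s; the
function names are mine, not from the thread.]

```c
#include <stddef.h>

static void heap_sift_down(long *a, size_t n, size_t i)
{
    for (;;) {
        size_t l = 2 * i + 1, r = l + 1, big = i;
        long t;
        if (l < n && a[l] > a[big]) big = l;
        if (r < n && a[r] > a[big]) big = r;
        if (big == i) return;
        t = a[i]; a[i] = a[big]; a[big] = t;
        i = big;
    }
}

void build_heap(long *a, size_t n)
{
    size_t i;
    for (i = n / 2; i-- > 0; )   /* internal nodes only: O(N) overall */
        heap_sift_down(a, n, i);
}
```

The O(N) bound holds because most nodes sit near the leaves and sift
only a short distance; it is the subsequent N root deletions that make
full heapsort O(N log N).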
16. Richard TobinGuest

Re: Finding 1000 largest numbers from a file having some billion numbers

In article <1173206785.750827@news1nwk>,
Eric Sosman <> wrote:

>> It's O(N). All the other solutions posed here are O(N*log(N))

> Observation #3: This is wrong. For (counter-)example,
>I suggested using a heap, whose time complexity would be
>O(N*log(1000))=O(N).

If you only look in the heap if the number is greater than the heap
minimum, it will be better than that.

-- Richard

--
"Consideration shall be given to the need for as many as 32 characters
in some alphabets" - X3.4, 1963.

Richard Tobin, Mar 6, 2007
17. Richard TobinGuest

Re: Finding 1000 largest numbers from a file having some billion numbers

In article <esktms$13tb$>, I wrote:

>> Observation #3: This is wrong. For (counter-)example,
>>I suggested using a heap, whose time complexity would be
>>O(N*log(1000))=O(N).

>If you only look in the heap if the number is greater than the heap
>minimum, it will be better than that.

Oops, not better than O(N) of course. But there won't be a constant
factor of log(1000).

-- Richard
--
"Consideration shall be given to the need for as many as 32 characters
in some alphabets" - X3.4, 1963.

Richard Tobin, Mar 6, 2007
18. Eric SosmanGuest

Re: Finding 1000 largest numbers from a file having some billion numbers

Richard Tobin wrote:
> In article <esktms$13tb$>, I wrote:
>
>>> Observation #3: This is wrong. For (counter-)example,
>>> I suggested using a heap, whose time complexity would be
>>> O(N*log(1000))=O(N).

>
>> If you only look in the heap if the number is greater than the heap
>> minimum, it will be better than that.

>
> Oops, not better than O(N) of course. But there won't be a constant
> factor of log(1000).

You're misusing the O notation. Consider: all of

O(log 1000)
O(1000)
O(10^100)
O(10^(10^100))

.... are O(1). Constant factors simply disappear with big-Oh.
(Which is why I always grind my teeth when some nitwit tries to
explain the difference between O(lg N) and O(log N), unaware
that they are identical.)

As to comparing against the heap minimum: That would be the
first comparison in the sift-up, hence doing it an extra time
would not be an optimization. It might be an optimization if
the heap were being maintained with "sift-down" steps starting
from the root instead of from the leaves. But since by the
nature of the problem "most" numbers are likely to be rejected
early (unless the original data have a strong ascending
bias), starting from the leaves and sifting up seems natural.

--
Eric Sosman

Eric Sosman, Mar 7, 2007
19. Richard TobinGuest

Re: Finding 1000 largest numbers from a file having some billion numbers

In article <>,
Eric Sosman <> wrote:

>>>> Observation #3: This is wrong. For (counter-)example,
>>>> I suggested using a heap, whose time complexity would be
>>>> O(N*log(1000))=O(N).

>>> If you only look in the heap if the number is greater than the heap
>>> minimum, it will be better than that.

>> Oops, not better than O(N) of course. But there won't be a constant
>> factor of log(1000).

> You're misusing the O notation. Consider: all of
>
> O(log 1000)
> O(1000)
> O(10^100)
> O(10^(10^100))

Yes, that's why I corrected my mis-statement that it was better
than O(N).

> As to comparing against the heap minimum: That would be the
>first comparison in the sift-up, hence doing it an extra time
>would not be an optimization.

Ok, but the statement above (that it's O(N * log(1000))) suggested
that you would need log(k) operations on average with a heap of size
k, which in turn suggested that there wasn't an "early out" when the
value was less than the minimum.

-- Richard

--
"Consideration shall be given to the need for as many as 32 characters
in some alphabets" - X3.4, 1963.

Richard Tobin, Mar 7, 2007
20. user923005Guest

On Mar 6, 2:52 pm, CBFalconer <> wrote:
> user923005 wrote:
> > On Mar 6, 10:46 am, Eric Sosman <> wrote:

>
> ... snip ...
>
> >> Observation #3: This is wrong. For (counter-)example,
> >> I suggested using a heap, whose time complexity would be
> >> O(N*log(1000))=O(N).

>
> > Using a heap is not O(n). If you consider the set a fixed size
> > (e.g. one billion) and you consider the region of interest a
> > fixed size (e.g. 1000) then true, we can call it O(n). But in
> > algorithmic terms both of those things are variables. We could
> > (in fact) bogosort the whole mess and then pick the top 1000
> > items. Since 1e9 is a constant, and if N is a constant, then
> > N! is a constant -- we could say that the whole thing is still
> > O(1). Of course, it does not work like that.

>
> You are confusing heapsort with the action of building a heap.

No I am not.

> O(1) means that the time involved is constant. O(N) means that it
> is proportional to N (not N!).

Agreed.

> All in the limit as N increases
> indefinitely. Building a heap is O(N).

Agreed.

>Heapsort is O(NlogN).

Agreed.

You will have to adjust the (K-element) heap for every element in the
list (e.g. one billion of them, less 1000).
This involves at least two operations:
1. Adding a new element. (Constant time, so this just affects the
constant of proportionality)
2. Removing the max element. (This requires a re-balance and is not
O(1)).

user923005, Mar 7, 2007
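[Editor's note: the two per-element steps listed above -- add the new
element, remove the one being displaced -- are, in the usual
implementation, fused into a single pass: overwrite the root and sift
down once, i.e. one O(log K) re-balance rather than two. A sketch
under assumptions, as a min-heap over `long`s; the name
`heap_replace_min` is mine, not from the thread.]

```c
#include <stddef.h>

void heap_replace_min(long *h, size_t k, long x)
{
    size_t i = 0;

    h[0] = x;                    /* overwrite instead of pop-then-push */
    for (;;) {
        size_t l = 2 * i + 1, r = l + 1, s = i;
        long t;
        if (l < k && h[l] < h[s]) s = l;
        if (r < k && h[r] < h[s]) s = r;
        if (s == i) return;
        t = h[i]; h[i] = h[s]; h[s] = t;
        i = s;
    }
}
```

This does not change the asymptotics -- each adjustment is still
O(log K) -- but it halves the constant relative to a separate
delete-then-insert pair.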