Finding 1000 largest numbers from a file having some billion numbers

Discussion in 'C Programming' started by Subra, Mar 6, 2007.

  1. Subra

    Subra Guest

    Hi,

    What is the best way to find the 1000 largest numbers from the file
    having hell lot of entries ? Can you please help me to find out the
    way ? Do I need to go for B+ trees ??

    Please help,
    Subramanya M
    Subra, Mar 6, 2007
    #1

  2. In article <>,
    Subra <> wrote:

    > What is the best way to find the 1000 largest numbers from the file
    >having hell lot of entries ? Can you please help me to find out the
    >way ? Do I need to go for B+ trees ??


    This is really just an algorithm question, nothing to do with C in
    particular.

    I can't see a better way to do it than keeping the 1000 largest so far
    in a sorted list with some structure that allows you to quickly insert
    a new item in the correct position and discard the old 1000th number.
    A B+-tree has those properties.

    Keep a note of the current 1000th value so that you can discard
    smaller values without even looking at the tree. If you really have a
    billion numbers, and they're in a random order, the vast majority will
be smaller than the current 1000th, so the efficiency of the structure
    in which you keep the largest 1000 may be of little importance. You
    might consider how the average number of insertions varies with the
    number of items in the file.
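[A rough aside on that last point — standard record-value reasoning, not from the post: in a uniformly random order the i-th value read belongs among the current largest 1000 with probability min(1, 1000/i), so the expected number of insertions into the structure is

\[
1000 + \sum_{i=1001}^{N} \frac{1000}{i} \;\approx\; 1000\left(1 + \ln\frac{N}{1000}\right),
\]

roughly 14,800 for N = 10^9: almost every value costs only the single comparison against the current 1000th.]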

    -- Richard

    --
    "Consideration shall be given to the need for as many as 32 characters
    in some alphabets" - X3.4, 1963.
    Richard Tobin, Mar 6, 2007
    #2

  3. Subra

    Eric Sosman Guest

Re: Finding 1000 largest numbers from a file having some billion numbers

    Subra wrote:
    > Hi,
    >
    > What is the best way to find the 1000 largest numbers from the file
    > having hell lot of entries ? Can you please help me to find out the
    > way ? Do I need to go for B+ trees ??


    I'd suggest using a heap, as in Heapsort.

    --
    Eric Sosman
    lid
    Eric Sosman, Mar 6, 2007
    #3
  4. Subra

    bytebro Guest

    On 6 Mar, 10:30, "Subra" <> wrote:
    > What is the best way to find the 1000 largest numbers from the file
    > having hell lot of entries ? Can you please help me to find out the
    > way ? Do I need to go for B+ trees ??


    sort file | tail -1000
    bytebro, Mar 6, 2007
    #4
  5. In article <>,
    bytebro <> wrote:

    >> What is the best way to find the 1000 largest numbers from the file
    >> having hell lot of entries ? Can you please help me to find out the
    >> way ? Do I need to go for B+ trees ??


    >sort file | tail -1000


    Apart from the fact that you at least need "sort -n", it's true that
    that might be the best way if you only wanted to do it a small number
    of times. If the number were increased from a billion to, say, a
    hundred billion it would probably not be usable on most current machines.

    -- Richard
    --
    "Consideration shall be given to the need for as many as 32 characters
    in some alphabets" - X3.4, 1963.
    Richard Tobin, Mar 6, 2007
    #5
  6. Subra

    user923005 Guest

    On Mar 6, 2:30 am, "Subra" <> wrote:
    > Hi,
    >
    > What is the best way to find the 1000 largest numbers from the file
    > having hell lot of entries ? Can you please help me to find out the
    > way ? Do I need to go for B+ trees ??
    >
    > Please help,
    > Subramanya M


    Your question is better suited to news:comp.programming.

    The answer to your question is called Quickselect() and is described
    in detail in:
    T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to
    Algorithms, Cambridge: The MIT Press, 1990.

    It's O(N). All the other solutions posed here are O(N*log(N))
    user923005, Mar 6, 2007
    #6
  7. Subra

    Eric Sosman Guest

Re: Finding 1000 largest numbers from a file having some billion numbers

    user923005 wrote On 03/06/07 13:12,:
    > On Mar 6, 2:30 am, "Subra" <> wrote:
    >
    >>Hi,
    >>
    >> What is the best way to find the 1000 largest numbers from the file
    >>having hell lot of entries ? Can you please help me to find out the
    >>way ? Do I need to go for B+ trees ??
    >>
    >>Please help,
>>Subramanya M
>
    > Your question is better suited to news:comp.programming.
    >
    > The answer to your question is called Quickselect() and is described
    > in detail in:
    > T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to
    > Algorithms, Cambridge: The MIT Press, 1990.


    Observation #1: The "Quick Select" Google turns up
    chooses *one* value from a set, not a thousand.

    Observation #2: It also needs random access to the
    entire set, which consists in this case of a billion
    elements.

    > It's O(N). All the other solutions posed here are O(N*log(N))


    Observation #3: This is wrong. For (counter-)example,
    I suggested using a heap, whose time complexity would be
    O(N*log(1000))=O(N).

    --
    Eric Sosman, Mar 6, 2007
    #7
  8. Subra

    user923005 Guest

    On Mar 6, 10:46 am, Eric Sosman <> wrote:
    > user923005 wrote On 03/06/07 13:12,:
>
> > On Mar 6, 2:30 am, "Subra" <> wrote:
> >> Hi,
> >>
> >> What is the best way to find the 1000 largest numbers from the file
> >> having hell lot of entries ? Can you please help me to find out the
> >> way ? Do I need to go for B+ trees ??
> >>
> >> Please help,
> >> Subramanya M
> >
> > Your question is better suited to news:comp.programming.
> >
> > The answer to your question is called Quickselect() and is described
> > in detail in:
> > T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to
> > Algorithms, Cambridge: The MIT Press, 1990.
>
> Observation #1: The "Quick Select" Google turns up
> chooses *one* value from a set, not a thousand.


    Sorry, but google is full of crap (I bet it was a Wikipedia article
    which half the time are bologna). True, one value is returned, but
    the set is partitioned with the top k elements at the top of the set.
    That's how it finds the one it is looking for, by partitioning random
    partitions.

    > Observation #2: It also needs random access to the
    > entire set, which consists in this case of a billion
    > elements.


    There you have me. You would have to be able to load all billion
    keys. As an alternative, you could process the data one million rows
    at a time. Take the top 1000 from those one thousand sets and perform
    the algorithm on that. It's still O(1), with a larger constant of
proportionality.

    > > It's O(N). All the other solutions posed here are O(N*log(N))

    >
    > Observation #3: This is wrong. For (counter-)example,
    > I suggested using a heap, whose time complexity would be
    > O(N*log(1000))=O(N).


    Using a heap is not O(n). If you consider the set a fixed size (e.g.
    one billion) and you consider the region of interest a fixed size
    (e.g. 1000) then true, we can call it O(n). But in algorithmic terms
    both of those things are variables. We could (in fact) bogosort the
    whole mess and then pick the top 1000 items. Since 1e9 is a constant,
    and if N is a constant, then N! is a constant -- we could say that the
    whole thing is still O(1). Of course, it does not work like that.

    > --
    >
    user923005, Mar 6, 2007
    #8
  9. Subra

    CBFalconer Guest

    Subra wrote:
    >
    > What is the best way to find the 1000 largest numbers from the file
    > having hell lot of entries ? Can you please help me to find out the
    > way ? Do I need to go for B+ trees ??


    Look up heaps. All you need is an array[1000] and a few auxiliary
    variables.

    --
    <http://www.cs.auckland.ac.nz/~pgut001/pubs/vista_cost.txt>
    <http://www.securityfocus.com/columnists/423>

    "A man who is right every time is not likely to do very much."
-- Francis Crick, co-discoverer of DNA
    "There is nothing more amazing than stupidity in action."
    -- Thomas Matthews
    CBFalconer, Mar 6, 2007
    #9
  10. Subra

    user923005 Guest

    It turns out that the ideal solution to his problem is discussed in
    this paper:

    "Improved Algorithms and Analysis for Secretary Problems and
    Generalizations"
    Miklos Ajtai, Nimrod Megiddo, Orli Waarts
    IEEE Symposium on Foundations of Computer Science

    It deals not only with the selection issue, but also directly with the
    problem of not being able to hold all of the items in memory.
    user923005, Mar 6, 2007
    #10
  11. Subra

    user923005 Guest

    On Mar 6, 11:06 am, "user923005" <> wrote:
    > On Mar 6, 10:46 am, Eric Sosman <> wrote:
>
> > user923005 wrote On 03/06/07 13:12,:
> > > On Mar 6, 2:30 am, "Subra" <> wrote:
> > >> Hi,
> > >>
> > >> What is the best way to find the 1000 largest numbers from the file
> > >> having hell lot of entries ? Can you please help me to find out the
> > >> way ? Do I need to go for B+ trees ??
> > >>
> > >> Please help,
> > >> Subramanya M
> > >
> > > Your question is better suited to news:comp.programming.
> > >
> > > The answer to your question is called Quickselect() and is described
> > > in detail in:
> > > T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to
> > > Algorithms, Cambridge: The MIT Press, 1990.
> >
> > Observation #1: The "Quick Select" Google turns up
> > chooses *one* value from a set, not a thousand.
>
> Sorry, but google is full of crap (I bet it was a Wikipedia article
> which half the time are bologna). True, one value is returned, but
> the set is partitioned with the top k elements at the top of the set.
> That's how it finds the one it is looking for, by partitioning random
> partitions.
>
> > Observation #2: It also needs random access to the
> > entire set, which consists in this case of a billion
> > elements.
>
> There you have me. You would have to be able to load all billion
> keys. As an alternative, you could process the data one million rows
> at a time. Take the top 1000 from those one thousand sets and perform
> the algorithm on that. It's still O(1), with a larger constant of
> proportionality.


    It's still O(N) {of course}.
    O(1) will win you some kind of prize, I am sure.
    user923005, Mar 6, 2007
    #11
  12. Subra

    Eric Sosman Guest

Re: Finding 1000 largest numbers from a file having some billion numbers

    user923005 wrote On 03/06/07 14:06,:
    > On Mar 6, 10:46 am, Eric Sosman <> wrote:
    >
    >>user923005 wrote On 03/06/07 13:12,:
    >>>
    >>> [snip]
    >>>
>>>It's O(N). All the other solutions posed here are O(N*log(N))
>>
    >> Observation #3: This is wrong. For (counter-)example,
    >>I suggested using a heap, whose time complexity would be
>>O(N*log(1000))=O(N).
>
    > Using a heap is not O(n). If you consider the set a fixed size (e.g.
    > one billion) and you consider the region of interest a fixed size
    > (e.g. 1000) then true, we can call it O(n). But in algorithmic terms
    > both of those things are variables. We could (in fact) bogosort the
    > whole mess and then pick the top 1000 items. Since 1e9 is a constant,
    > and if N is a constant, then N! is a constant -- we could say that the
    > whole thing is still O(1). Of course, it does not work like that.


    You seem to be mixing up two different N's: the size
    of the data set being scanned, and the size of the heap.
    My suggestion (which I didn't spell out fully, because I
    thought it was obvious) is to initialize by reading the
    first 1000 numbers and forming a heap from them, then to
    read the remaining numbers one at a time, sifting-up each
    into the heap and discarding the smallest to keep the
    heap size fixed at 1000. Taking N to mean the data size:

    Initialization: O(log 1)+O(log 2)+...+O(log 1000) = O(1)
    Each subsequent sift-up: O(log 1000) = O(1)
    Number of sift-ups: N-1000 = O(N)
    Total: O(N)*O(1) + O(1) = O(N)

    The important point is that the heap size is fixed at
    1000, so each heap operation takes O(1) time. If the
    problem were to find the largest M of N numbers, then the
    parameterization would be different. The heap approach
    would then be characterized as

    Initialization: O(log 1)+O(log 2)+...+O(log M) = O(1)
    Each subsequent sift-up: O(log M)
    Number of sift-ups: N-M
    Total: (N-M)*O(log M) + O(1) = (N-M)*O(log M)
    If M << N, this is O(N * log M)

... but even here no O(N log N) term is present.

    --
    Eric Sosman, Mar 6, 2007
    #12
  13. Subra

    Eric Sosman Guest

Re: Finding 1000 largest numbers from a file having some billion numbers

    Eric Sosman wrote On 03/06/07 14:54,:
    > [...]
    >
    > The important point is that the heap size is fixed at
    > 1000, so each heap operation takes O(1) time. If the
    > problem were to find the largest M of N numbers, then the
    > parameterization would be different. The heap approach
    > would then be characterized as
    >
    > Initialization: O(log 1)+O(log 2)+...+O(log M) = O(1)


    My day for retractions: That should be O(M log M), of
    course.

    > Each subsequent sift-up: O(log M)
    > Number of sift-ups: N-M
    > Total: (N-M)*O(log M) + O(1) = (N-M)*O(log M)


    ... which makes this (N-M)*O(log M) + O(M log M)
    = O(N log M) + O(M log M)

    > If M << N, this is O(N * log M)
    >
    > ... but even here no O(N log N) term is present.


    At least I got *something* right. Until my next
    retraction, anyhow ... Sigh.

    --
    Eric Sosman, Mar 6, 2007
    #13
  14. Subra

    user923005 Guest

    On Mar 6, 11:54 am, Eric Sosman <> wrote:
    > user923005 wrote On 03/06/07 14:06,:
>
> > On Mar 6, 10:46 am, Eric Sosman <> wrote:
> >> user923005 wrote On 03/06/07 13:12,:
> >>> [snip]
> >>> It's O(N). All the other solutions posed here are O(N*log(N))
> >>
> >> Observation #3: This is wrong. For (counter-)example,
> >> I suggested using a heap, whose time complexity would be
> >> O(N*log(1000))=O(N).
> >
> > Using a heap is not O(n). If you consider the set a fixed size (e.g.
> > one billion) and you consider the region of interest a fixed size
> > (e.g. 1000) then true, we can call it O(n). But in algorithmic terms
> > both of those things are variables. We could (in fact) bogosort the
> > whole mess and then pick the top 1000 items. Since 1e9 is a constant,
> > and if N is a constant, then N! is a constant -- we could say that the
> > whole thing is still O(1). Of course, it does not work like that.
>
> You seem to be mixing up two different N's: the size
> of the data set being scanned, and the size of the heap.


    No, I am saying that we should consider both of them as variables --
    the size of the data and the size of the heap.

    > My suggestion (which I didn't spell out fully, because I
    > thought it was obvious) is to initialize by reading the
    > first 1000 numbers and forming a heap from them, then to
    > read the remaining numbers one at a time, sifting-up each
    > into the heap and discarding the smallest to keep the
    > heap size fixed at 1000. Taking N to mean the data size:
    >
    > Initialization: O(log 1)+O(log 2)+...+O(log 1000) = O(1)
    > Each subsequent sift-up: O(log 1000) = O(1)
    > Number of sift-ups: N-1000 = O(N)
    > Total: O(N)*O(1) + O(1) = O(N)
    >
    > The important point is that the heap size is fixed at
    > 1000, so each heap operation takes O(1) time. If the
    > problem were to find the largest M of N numbers, then the
    > parameterization would be different. The heap approach
    > would then be characterized as
    >
    > Initialization: O(log 1)+O(log 2)+...+O(log M) = O(1)
    > Each subsequent sift-up: O(log M)
    > Number of sift-ups: N-M
    > Total: (N-M)*O(log M) + O(1) = (N-M)*O(log M)
    > If M << N, this is O(N * log M)
    >
    > ... but even here no O(N log N) term is present.


    Yes, it is O(N*log(M)), not O(N*log(N)).
    The solution I proposed is O(N).
    Your solution might be faster [even if I can load the whole thing into
    RAM], because the typical runtime for a well designed quickselect is ~
    24*N.
    log2(1000) is just under 10, so if the proportionality constant for
    your method is better than 2.4 it is probably faster in practice.
    Also, loading all 1e9 elements is probably impractical, and so I have
    a large constant of proportionality to deal with in the practical
    solution (processing the subsets).

    I guess that the paper I gave a link to has the best solution, but the
    heap solution is admirable for low memory + efficiency.
    user923005, Mar 6, 2007
    #14
  15. Subra

    CBFalconer Guest

    user923005 wrote:
    > On Mar 6, 10:46 am, Eric Sosman <> wrote:
    >

... snip ...
    >>
    >> Observation #3: This is wrong. For (counter-)example,
    >> I suggested using a heap, whose time complexity would be
>> O(N*log(1000))=O(N).
>
    > Using a heap is not O(n). If you consider the set a fixed size
    > (e.g. one billion) and you consider the region of interest a
    > fixed size (e.g. 1000) then true, we can call it O(n). But in
    > algorithmic terms both of those things are variables. We could
    > (in fact) bogosort the whole mess and then pick the top 1000
    > items. Since 1e9 is a constant, and if N is a constant, then
    > N! is a constant -- we could say that the whole thing is still
    > O(1). Of course, it does not work like that.


    You are confusing heapsort with the action of building a heap.
    O(1) means that the time involved is constant. O(N) means that it
    is proportional to N (not N!). All in the limit as N increases
    indefinitely. Building a heap is O(N). Heapsort is O(NlogN).

    --
    <http://www.cs.auckland.ac.nz/~pgut001/pubs/vista_cost.txt>
    <http://www.securityfocus.com/columnists/423>

    "A man who is right every time is not likely to do very much."
-- Francis Crick, co-discoverer of DNA
    "There is nothing more amazing than stupidity in action."
    -- Thomas Matthews
    CBFalconer, Mar 6, 2007
    #15
16. Re: Finding 1000 largest numbers from a file having some billion numbers

    In article <1173206785.750827@news1nwk>,
    Eric Sosman <> wrote:

    >> It's O(N). All the other solutions posed here are O(N*log(N))


    > Observation #3: This is wrong. For (counter-)example,
    >I suggested using a heap, whose time complexity would be
    >O(N*log(1000))=O(N).


    If you only look in the heap if the number is greater than the heap
    minimum, it will be better than that.

    -- Richard

    --
    "Consideration shall be given to the need for as many as 32 characters
    in some alphabets" - X3.4, 1963.
    Richard Tobin, Mar 6, 2007
    #16
Re: Finding 1000 largest numbers from a file having some billion numbers

    In article <esktms$13tb$>, I wrote:

    >> Observation #3: This is wrong. For (counter-)example,
    >>I suggested using a heap, whose time complexity would be
    >>O(N*log(1000))=O(N).


    >If you only look in the heap if the number is greater than the heap
    >minimum, it will be better than that.


    Oops, not better than O(N) of course. But there won't be a constant
    factor of log(1000).

    -- Richard
    --
    "Consideration shall be given to the need for as many as 32 characters
    in some alphabets" - X3.4, 1963.
    Richard Tobin, Mar 6, 2007
    #17
  18. Subra

    Eric Sosman Guest

Re: Finding 1000 largest numbers from a file having some billion numbers

    Richard Tobin wrote:
    > In article <esktms$13tb$>, I wrote:
    >
    >>> Observation #3: This is wrong. For (counter-)example,
    >>> I suggested using a heap, whose time complexity would be
>>> O(N*log(1000))=O(N).
>
    >> If you only look in the heap if the number is greater than the heap
>> minimum, it will be better than that.
>
    > Oops, not better than O(N) of course. But there won't be a constant
    > factor of log(1000).


    You're misusing the O notation. Consider: all of

    O(log 1000)
    O(1000)
    O(10^100)
    O(10^(10^100))

... are O(1). Constant factors simply disappear with big-Oh.
    (Which is why I always grind my teeth when some nitwit tries to
    explain the difference between O(lg N) and O(log N), unaware
    that they are identical.)

    As to comparing against the heap minimum: That would be the
    first comparison in the sift-up, hence doing it an extra time
    would not be an optimization. It might be an optimization if
    the heap were being maintained with "sift-down" steps starting
    from the root instead of from the leaves. But since by the
    nature of the problem "most" numbers are likely to be rejected
early (unless the original data have a strong ascending
    bias), starting from the leaves and sifting up seems natural.

    --
    Eric Sosman
    lid
    Eric Sosman, Mar 7, 2007
    #18
Re: Finding 1000 largest numbers from a file having some billion numbers

    In article <>,
    Eric Sosman <> wrote:

    >>>> Observation #3: This is wrong. For (counter-)example,
    >>>> I suggested using a heap, whose time complexity would be
    >>>> O(N*log(1000))=O(N).


    >>> If you only look in the heap if the number is greater than the heap
    >>> minimum, it will be better than that.


    >> Oops, not better than O(N) of course. But there won't be a constant
    >> factor of log(1000).


    > You're misusing the O notation. Consider: all of
    >
    > O(log 1000)
    > O(1000)
    > O(10^100)
    > O(10^(10^100))


    Yes, that's why I corrected my mis-statement that it was better
    than O(N).

    > As to comparing against the heap minimum: That would be the
    >first comparison in the sift-up, hence doing it an extra time
    >would not be an optimization.


    Ok, but the statement above (that it's O(N * log(1000))) suggested
    that you would need log(k) operations on average with a heap of size
    k, which in turn suggested that there wasn't an "early out" when the
    value was less than the minimum.

    -- Richard


    --
    "Consideration shall be given to the need for as many as 32 characters
    in some alphabets" - X3.4, 1963.
    Richard Tobin, Mar 7, 2007
    #19
  20. Subra

    user923005 Guest

    On Mar 6, 2:52 pm, CBFalconer <> wrote:
    > user923005 wrote:
> > On Mar 6, 10:46 am, Eric Sosman <> wrote:
>
> ... snip ...
>
> >> Observation #3: This is wrong. For (counter-)example,
> >> I suggested using a heap, whose time complexity would be
> >> O(N*log(1000))=O(N).
> >
> > Using a heap is not O(n). If you consider the set a fixed size
> > (e.g. one billion) and you consider the region of interest a
> > fixed size (e.g. 1000) then true, we can call it O(n). But in
> > algorithmic terms both of those things are variables. We could
> > (in fact) bogosort the whole mess and then pick the top 1000
> > items. Since 1e9 is a constant, and if N is a constant, then
> > N! is a constant -- we could say that the whole thing is still
> > O(1). Of course, it does not work like that.
>
> You are confusing heapsort with the action of building a heap.


    No I am not.

    > O(1) means that the time involved is constant. O(N) means that it
    > is proportional to N (not N!).


    Agreed.

    > All in the limit as N increases
    > indefinitely. Building a heap is O(N).


    Agreed.

    >Heapsort is O(NlogN).

    Agreed.

    You will have to adjust the (K-element) heap for every element in the
    list (e.g. one billion of them, less 1000).
    This involves at least two operations:
    1. Adding a new element. (Constant time, so this just affects the
    constant of proportionality)
    2. Removing the max element. (This requires a re-balance and is not
    O(1)).
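[My own accounting of that cost, hedged: with the heap held at a fixed size K, the re-balance is O(log K) per displaced element, so the worst case over N inputs is O(N log K). With randomly ordered input, only about K ln(N/K) values ever displace a heap element, giving an expected total of

\[
O(N) \;+\; K\ln\frac{N}{K}\cdot O(\log K),
\]

dominated by the N cheap comparisons against the heap minimum.]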
    user923005, Mar 7, 2007
    #20
