Algorithm to find nth largest or nth smallest in a range

Discussion in 'C++' started by Code4u, Jul 11, 2005.

  1. Code4u

    Code4u Guest

    I need to write an algorithm that sheds the outliers in a large data
    set, for example, I might want to ignore the smallest 2% of values and
    find the next smallest. Boost has a nth_element algorithm, however, it
    partially sorts the data. I have a requirement that the data remain in
    the orginal order. Of course I could make a copy of the vector or
    array and then apply nth_element, but this would be expensive in
    memory and would require another pass through the data. Does anyone
    know of a more efficient algorithm that meets my requirements?

    TIA
    Code4u, Jul 11, 2005
    #1
    1. Advertising

  2. Code4u

    Ian Guest

    Code4u wrote:
    > I need to write an algorithm that sheds the outliers in a large data
    > set, for example, I might want to ignore the smallest 2% of values and
    > find the next smallest. Boost has a nth_element algorithm, however, it
    > partially sorts the data. I have a requirement that the data remain in
    > the orginal order. Of course I could make a copy of the vector or
    > array and then apply nth_element, but this would be expensive in
    > memory and would require another pass through the data. Does anyone
    > know of a more efficient algorithm that meets my requirements?
    >

    I don't see how you could do this without two passes.

    You have to make one pass to determine whatever statistics you require
    from the data.

    Is the set that large and the overhead that great that you cant use
    nth_element?

    Ian
    Ian, Jul 11, 2005
    #2
    1. Advertising

  3. Code4u wrote:
    > I need to write an algorithm that sheds the outliers in a large data
    > set, for example, I might want to ignore the smallest 2% of values and
    > find the next smallest. Boost has a nth_element algorithm, however, it
    > partially sorts the data. I have a requirement that the data remain in
    > the orginal order. Of course I could make a copy of the vector or
    > array and then apply nth_element, but this would be expensive in
    > memory and would require another pass through the data. Does anyone
    > know of a more efficient algorithm that meets my requirements?
    >
    > TIA


    Because you have to trim a portion of the data based on the entire data
    set, as in the smallest 2%, you will have to have the data sorted (or
    partially sorted) to know what elements lie out of range.
    I think the algorithm you mention will work in linear time using a
    version of quicksort's partitioning algorithm, but it works by
    recursively shuffling elements around to find the nth smallest.

    If you know where the 2% mark is before you start looking at the data,
    then a linear search - single pass - ignoring elements out of range will
    work.

    Otherwise, you will have to do one pass to find the bounds and another
    to find the right element. This is still linear with respect to the
    input data size, complexity in the average case is 1.5 * N

    --Paul
    Paul Bilnoski, Jul 11, 2005
    #3
  4. Code4u

    Me Guest

    > I need to write an algorithm that sheds the outliers in a large data
    > set, for example, I might want to ignore the smallest 2% of values and
    > find the next smallest. Boost has a nth_element algorithm, however, it
    > partially sorts the data. I have a requirement that the data remain in
    > the orginal order. Of course I could make a copy of the vector or
    > array and then apply nth_element, but this would be expensive in
    > memory and would require another pass through the data. Does anyone
    > know of a more efficient algorithm that meets my requirements?


    You haven't described your data but if you can use a histogram, use it.
    It's one pass on the data and then another on the histogram while
    keeping a running total to get the index you want. Otherwise:

    size_t buf[2*n]
    buf[0..n-1] = 0..n-1
    sort buf[0..n-1]
    for (size_t i = n; i < whatever; i += n) {
    buf[n..2n-1] = i..2i-1
    sort buf[n..2n-1]
    inplace_merge buf, buf+n, buf+2n
    }
    index = buf[n]
    (this obviously assumes the data set is of some positive length m*n but
    it's trivial to fix it so it does the proper bounds checks)

    buf is an index buffer, so the sort/inplace_merge should do an
    indirection on it when doing the comparison. You don't even need to use
    an index buffer if the values are cheap to copy as opposed to doing an
    indirection. Not to mention the much better locality of reference
    copying the values has.
    Me, Jul 11, 2005
    #4
  5. Code4u

    Stephen Howe Guest

    >I need to write an algorithm that sheds the outliers in a large data
    > set, for example, I might want to ignore the smallest 2% of values and
    > find the next smallest. Boost has a nth_element algorithm, however, it
    > partially sorts the data. I have a requirement that the data remain in
    > the orginal order. Of course I could make a copy of the vector or
    > array and then apply nth_element, but this would be expensive in
    > memory and would require another pass through the data. Does anyone
    > know of a more efficient algorithm that meets my requirements?


    There is an algorithm that does this.
    It is much more expensive than nth_element as it must make multiple passes
    and do some book-keeping to keep track of the element you want because of
    the requirement of "remain" in sorted order. It zooms into the kth element
    in multiple passes.

    But if I was you, I would look into having an extra element in your data set
    called (intrusive)

    int originalOrder;

    When you want to eliminate the smallest 2%

    1. Iterate over your vector numbering elements originalOrder as 0 to
    size() - 1.
    2. Do nth_element and use erase(remove() ) to get rid of the bottom 2%
    3. Apply sort() with 3 parameter ordering by originalOrder.

    You now have your original order and eliminated smallest 2%.

    Stephen Howe
    Stephen Howe, Jul 13, 2005
    #5
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Sanitarium
    Replies:
    2
    Views:
    530
    Whitecrest
    Dec 4, 2003
  2. Keke922
    Replies:
    2
    Views:
    3,082
    Shane Beasley
    Nov 2, 2003
  3. Peter Ammon
    Replies:
    13
    Views:
    3,845
    Arin Chaudhuri
    Jun 15, 2004
  4. Replies:
    9
    Views:
    437
  5. Dmitry Chichkov
    Replies:
    9
    Views:
    1,526
    Dmitry Chichkov
    Sep 3, 2010
Loading...

Share This Page