Algorithm to find nth largest or nth smallest in a range

Code4u · Jul 10, 2005

I need to write an algorithm that sheds the outliers in a large data
set, for example, I might want to ignore the smallest 2% of values and
find the next smallest. Boost has a nth_element algorithm, however, it
partially sorts the data. I have a requirement that the data remain in
the orginal order. Of course I could make a copy of the vector or
array and then apply nth_element, but this would be expensive in
memory and would require another pass through the data. Does anyone
know of a more efficient algorithm that meets my requirements?

TIA

Ian · Jul 10, 2005

Code4u said:
I need to write an algorithm that sheds the outliers in a large data
set, for example, I might want to ignore the smallest 2% of values and
find the next smallest. Boost has a nth_element algorithm, however, it
partially sorts the data. I have a requirement that the data remain in
the orginal order. Of course I could make a copy of the vector or
array and then apply nth_element, but this would be expensive in
memory and would require another pass through the data. Does anyone
know of a more efficient algorithm that meets my requirements?

I don't see how you could do this without two passes.

You have to make one pass to determine whatever statistics you require
from the data.

Is the set that large and the overhead that great that you cant use
nth_element?

Ian

Paul Bilnoski · Jul 10, 2005

Code4u said:
I need to write an algorithm that sheds the outliers in a large data
set, for example, I might want to ignore the smallest 2% of values and
find the next smallest. Boost has a nth_element algorithm, however, it
partially sorts the data. I have a requirement that the data remain in
the orginal order. Of course I could make a copy of the vector or
array and then apply nth_element, but this would be expensive in
memory and would require another pass through the data. Does anyone
know of a more efficient algorithm that meets my requirements?

TIA

Because you have to trim a portion of the data based on the entire data
set, as in the smallest 2%, you will have to have the data sorted (or
partially sorted) to know what elements lie out of range.
I think the algorithm you mention will work in linear time using a
version of quicksort's partitioning algorithm, but it works by
recursively shuffling elements around to find the nth smallest.

If you know where the 2% mark is before you start looking at the data,
then a linear search - single pass - ignoring elements out of range will
work.

Otherwise, you will have to do one pass to find the bounds and another
to find the right element. This is still linear with respect to the
input data size, complexity in the average case is 1.5 * N

--Paul

Me · Jul 11, 2005

I need to write an algorithm that sheds the outliers in a large data

set, for example, I might want to ignore the smallest 2% of values and
find the next smallest. Boost has a nth_element algorithm, however, it
partially sorts the data. I have a requirement that the data remain in
the orginal order. Of course I could make a copy of the vector or
array and then apply nth_element, but this would be expensive in
memory and would require another pass through the data. Does anyone
know of a more efficient algorithm that meets my requirements?

You haven't described your data but if you can use a histogram, use it.
It's one pass on the data and then another on the histogram while
keeping a running total to get the index you want. Otherwise:

size_t buf[2*n]
buf[0..n-1] = 0..n-1
sort buf[0..n-1]
for (size_t i = n; i < whatever; i += n) {
buf[n..2n-1] = i..2i-1
sort buf[n..2n-1]
inplace_merge buf, buf+n, buf+2n
}
index = buf[n]
(this obviously assumes the data set is of some positive length m*n but
it's trivial to fix it so it does the proper bounds checks)

buf is an index buffer, so the sort/inplace_merge should do an
indirection on it when doing the comparison. You don't even need to use
an index buffer if the values are cheap to copy as opposed to doing an
indirection. Not to mention the much better locality of reference
copying the values has.

Stephen Howe · Jul 12, 2005

I need to write an algorithm that sheds the outliers in a large data

set, for example, I might want to ignore the smallest 2% of values and
find the next smallest. Boost has a nth_element algorithm, however, it
partially sorts the data. I have a requirement that the data remain in
the orginal order. Of course I could make a copy of the vector or
array and then apply nth_element, but this would be expensive in
memory and would require another pass through the data. Does anyone
know of a more efficient algorithm that meets my requirements?

There is an algorithm that does this.
It is much more expensive than nth_element as it must make multiple passes
and do some book-keeping to keep track of the element you want because of
the requirement of "remain" in sorted order. It zooms into the kth element
in multiple passes.

But if I was you, I would look into having an extra element in your data set
called (intrusive)

int originalOrder;

When you want to eliminate the smallest 2%

1. Iterate over your vector numbering elements originalOrder as 0 to
size() - 1.
2. Do nth_element and use erase(remove() ) to get rid of the bottom 2%
3. Apply sort() with 3 parameter ordering by originalOrder.

You now have your original order and eliminated smallest 2%.

Stephen Howe

Implementing a Q-Learning Algorithm with Logistic Regression Normalization in C++	0	Jun 4, 2025
how can I find the smallest item in a vector	1	Feb 20, 2007
Selecting k smallest or largest elements from a large list in python; (benchmarking)	9	Sep 1, 2010
find largest frame	7	Oct 19, 2007
how to find the last matching element in a container?	2	Oct 8, 2010
Algorithm for generating pitch-class sets in prime form	1	Jan 28, 2011
Program to find occurences of a word in a file	18	Nov 19, 2007
STL algorithm to find max value in a set or map?	5	Nov 2, 2005

Algorithm to find nth largest or nth smallest in a range

Code4u

Ian

Paul Bilnoski

Me

Stephen Howe

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads