sorting just the largest values


Alex Hart

I have a very large list of data (up to 20,000 lines), and I just want
to print out the largest 50 values. What is the quickest way to sort
this using perl?

Thanks in advance.

- Alex Hart
 

Arndt Jonasson

Alex Hart said:
I have a very large list of data (up to 20,000 lines), and I just want
to print out the largest 50 values. What is the quickest way to sort
this using perl?

For very large lists of data: while reading in the lines, keep a
sorted array with the up to 50 largest seen values.

But 20000 is not very large, or even large. Read the whole file and
sort it.

If you say "sort" to search.cpan.org, it will show you some modules
that are likely to be useful.
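
A minimal sketch of the read-and-sort approach, assuming one numeric
value per line (the file name is a placeholder):

open my $fh, '<', 'data.txt' or die "Can't open data.txt: $!";
chomp( my @values = <$fh> );
close $fh;

my @sorted = sort { $b <=> $a } @values;      # descending numeric sort
my $last   = $#sorted < 49 ? $#sorted : 49;   # in case there are fewer than 50 lines
print "$_\n" for @sorted[ 0 .. $last ];       # the 50 largest values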
 

Gregory Toomey

Alex said:
I have a very large list of data (up to 20,000 lines), and I just want
to print out the largest 50 values. What is the quickest way to sort
this using perl?

Thanks in advance.

- Alex Hart

Selection sort - just do the outer loop 50 times.
http://en.wikipedia.org/wiki/Selection_sort

The time complexity is O(k*n), effectively O(n) when selecting the top
k items with k << n, compared to O(n log n) for sorting the whole list.
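
A rough sketch of that partial selection sort, assuming numeric values
already read into @values (the names here are placeholders):

sub top_n_by_selection {
    my ( $n, @data ) = @_;
    $n = @data if $n > @data;
    for my $i ( 0 .. $n - 1 ) {
        # swap the maximum of the still-unsorted tail into position $i
        my $max = $i;
        for my $j ( $i + 1 .. $#data ) {
            $max = $j if $data[$j] > $data[$max];
        }
        @data[ $i, $max ] = @data[ $max, $i ];
    }
    return @data[ 0 .. $n - 1 ];    # the $n largest, biggest first
}

my @top50 = top_n_by_selection( 50, @values );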

gtoomey
 

John W. Krahn

Alex said:
I have a very large list of data (up to 20,000 lines), and I just want
to print out the largest 50 values. What is the quickest way to sort
this using perl?

my @largest = `sort -n yourfile | tail -50`;



John
 

Eric Bohlman

I have a very large list of data (up to 20,000 lines), and I just want
to print out the largest 50 values. What is the quickest way to sort
this using perl?

Most people wouldn't consider 20,000 items to be a "very large list."
I'd suggest first just trying the brute-force approach of reading them
into an array, sorting them, and taking the last 50 results. Only if
this turns out to be unacceptably slow or memory-consuming for your
application should you consider something different. Remember that
perl's sort routines are implemented in optimized C code, so even if you
implement a lower-time-complexity algorithm yourself, the value of N for
which it becomes faster may be quite large.

If the brute-force approach turns out to be unacceptable (again, based on
actual measurement, not guesswork), here's a linear-time (O(N)) selection
algorithm:

1) Read the first 50 values into the top 50 list.

2) For each subsequent value, find the smallest value in the top 50 list
that's less than it. If there isn't one, do nothing. If there is, kick
it out of the list and add the new value to the list.

Step 2 will be simpler to program if you sort the list once after step 1.
You'll find shift() and splice() particularly helpful.

[UNTESTED]

# NB: the default sort and 'lt' compare as strings; for numeric data
# use sort { $a <=> $b } and '<' instead.
my @top;
while (<>) {
    push @top, $_ if $. <= 50;
    @top = sort @top if $. == 50;        # sort once, after the first 50 lines are in
    if ($. > 50) {
        for (my $i = 49; $i >= 0; --$i) {
            if ($top[$i] lt $_) {
                # this is where the new value belongs
                splice @top, $i + 1, 0, $_;   # insert the new value just above $top[$i]
                shift @top;                   # remove the smallest value
                last;
            }
        }
    }
}
 

Alex Hart

Most people wouldn't consider 20,000 items to be a "very large list."
I'd suggest first just trying the brute-force approach of reading them
into an array, sorting them, and taking the last 50 results. Only if

I'm developing a real-time application, and a particular sort was
taking several minutes when the list got up to 20,000. It might be a
problem with the data being too ordered to start with (or does perl
shuffle before it sorts?). I already extract the sort key and just
sort on that. I can try the packed string sort-key, but I think even a
slow method of finding the top 50 must be faster than sorting the whole
list.
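
For reference, a packed sort-key (GRT-style) sketch for plain
non-negative 32-bit integers; the pack format is an assumption and
would have to match the real data:

# "N" packs an unsigned 32-bit integer big-endian, so the default
# string sort puts the packed keys in numeric order.
my @sorted = map  { unpack 'N', $_ }
             sort
             map  { pack 'N', $_ } @values;
my @top50  = reverse @sorted[ -50 .. -1 ];   # assumes at least 50 values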

I'll try some of the suggestions here, and I'll be sure to benchmark
along the way.
 

xhoster

Alex Hart said:
I'm developing a real-time application, and a particular sort was
taking several minutes when the list got up to 20,000.

That is definitely pathological.
It might be a
problem with the data being too ordered to start with (or does perl
shuffle before it sorts?).

Depends on the version. Which version are you using?
I'll try some of the suggestions here, and I'll be sure to benchmark
along the way.

Unfortunately, benchmarking is extremely difficult in cases where there are
unknown and unpredictable pathological patterns in the data. Be careful.
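
One way to at least exercise both random and already-ordered input is
the core Benchmark module; a rough sketch (the routines being compared
are placeholders for whatever approaches make the short list):

use Benchmark qw(cmpthese);

my @random  = map { int rand 1_000_000 } 1 .. 20_000;
my @ordered = sort { $a <=> $b } @random;    # the suspected pathological case

for my $set ( [ random => \@random ], [ ordered => \@ordered ] ) {
    my ( $name, $data ) = @$set;
    print "--- $name input ---\n";
    cmpthese( -3, {
        full_sort => sub { my @t = ( sort { $b <=> $a } @$data )[ 0 .. 49 ] },
        top_only  => sub { my @t = top_n_by_selection( 50, @$data ) },   # placeholder routine
    } );
}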

Xho
 

Anno Siegel

That is definitely pathological.

I agree. There's some gross inefficiency going on. We should see the
sort; it can certainly be sped up.

In general, the solution to the "top-n" problem is a heap. In pseudo-
code (using a mythical module Heap, but there are real ones on CPAN):

use Heap;
my $h = Heap->new();
my $n = 50;
for ( @data ) {
    $h->insert( $_ );
    $h->extract_minimum if $h->size > $n;   # evict the smallest once we hold more than $n
}
my @top_n = map $h->extract_minimum, 1 .. $h->size;

If the heap implementation doesn't have a ->size method, it is easy to
use an external counter instead.

If $n is larger than the data size, the algorithm does a heap sort on
the data. If $n is smaller, it is faster than the equivalent sort.
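
A hand-rolled version of the same idea for the top-50 case, with no
CPAN module assumed (an untested sketch; it expects one numeric value
per line on standard input):

my $n = 50;
my @heap;      # 0-based binary min-heap holding the values kept so far

sub sift_up {
    my $i = shift;
    while ( $i > 0 ) {
        my $p = int( ( $i - 1 ) / 2 );       # parent index
        last if $heap[$p] <= $heap[$i];
        @heap[ $p, $i ] = @heap[ $i, $p ];
        $i = $p;
    }
}

sub sift_down {
    my $i = shift;
    while (1) {
        my $l = 2 * $i + 1;                  # left child
        last if $l > $#heap;
        my $small = $l;
        my $r = $l + 1;                      # right child
        $small = $r if $r <= $#heap && $heap[$r] < $heap[$small];
        last if $heap[$i] <= $heap[$small];
        @heap[ $i, $small ] = @heap[ $small, $i ];
        $i = $small;
    }
}

while (<>) {
    chomp;
    if ( @heap < $n ) {                      # still filling the heap
        push @heap, $_;
        sift_up($#heap);
    }
    elsif ( $_ > $heap[0] ) {                # bigger than the smallest value kept
        $heap[0] = $_;
        sift_down(0);
    }
}

print "$_\n" for sort { $b <=> $a } @heap;   # the top 50, largest first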

Anno
 

David Combs

...
In general, the solution to the "top-n" problem is a heap. In pseudo-
code (using a mythical module Heap, but there are real ones on CPAN):

use Heap;
my $h = Heap->new();
my $n = 50;
for ( @data ) {
    $h->insert( $_ );
    $h->extract_minimum if $h->size > $n;   # evict the smallest once we hold more than $n
}
my @top_n = map $h->extract_minimum, 1 .. $h->size;

If the heap implementation doesn't have a ->size method, it is easy to
use an external counter instead.

If $n is larger than the data size, the algorithm does a heap sort on
the data. If $n is smaller, it is faster than the equivalent sort.

Anno

Perhaps (not sure about this) it would be faster (fewer steps?)
to use one of the *newer* (than heapsort) priority-queue methods,
eg splay trees and the like?

(Or maybe no real difference?)

David
 
