Top 10 list algorithm

Fatted

Is there a better (faster) way of implementing my top_sort function?
(Creating a top 10 list of the highest numbers from a large set).
Or did I get lucky? :)

--
#!/usr/bin/perl
use strict;
use warnings;


my @top10 = (0,0,0,0,0,0,0,0,0,0);


for(my $i = 0; $i < 10000; $i++)
{
    my $num = rand();
    @top10 = @{top_sort(\@top10, $num)};
}


# PRINT RESULTS
foreach (@top10)
{
    print $_."\n";
}


sub top_sort
{
    my $array = shift;
    my $val = shift;

    my @top;

    # IF THE VALUE IS LESS THAN THE LAST VALUE IN THE TOP LIST,
    # NO PROCESSING REQUIRED
    if(defined($val) && $val >= 0)
    {
        if($val < $array->[$#$array])
        {
            return($array);
        }
    }

    for(my $i = 0; $i < $#$array; ++$i)
    {
        if($val > $array->[$i])
        {
            # MAKE NEW TOP10 BY TAKING HIGHER MATCHES PLUS THE NEW
            # VALUE, PLUS THE LOWER MATCHES UP TO THE LIMIT OF THE
            # ARRAY.
            @top = @$array[0..($i-1)];
            push(@top, $val);
            push(@top, @$array[$i..($#$array-1)]);
            last;
        }
    }

    # IN CASE OF PROBLEMS
    if(!defined($top[0]))
    {
        @top = @$array;
    }
    return(\@top);
}
 
Peter Hickman

Well, I timed yours and wrote mine; we have a draw time-wise, but I would
prefer to maintain my code rather than yours.

peter@anise:~$ time -p perl tt1.pl (yours)
0.999883882371392
0.999865909542621
0.999743501048854
0.999707466553314
0.99966005237998
0.999624140212287
0.999602350183547
0.999588096251578
0.999575394398029
0.999448703053851
real 0.06
user 0.06
sys 0.00
peter@anise:~$ time -p perl tt2.pl (mine)
0.999866027199676
0.999793617152395
0.999740853320571
0.999674339504168
0.999568519910227
0.999387627277898
0.999359789594308
0.999242443185917
0.999208602751779
0.998986709806623
real 0.06
user 0.06
sys 0.00
peter@anise:~$


#!/usr/bin/perl

use strict;
use warnings;

my %data;

for ( my $i = 0 ; $i < 10000 ; $i++ ) {
    $data{rand()}++;
}

my @keys = sort { $b <=> $a } (keys %data);

foreach (@keys[0..9]) {
    print "$_\n";
}
 
Fatted

Peter said:
Well, I timed yours and wrote mine; we have a draw time-wise, but I would
prefer to maintain my code rather than yours.

<snip>

Thanks for having a look! Although when I tried your code versus mine,
mine appeared faster (over 50000 iterations for each). I think you've
got a much faster machine than mine and might need to do a couple of
powers of 10 more iterations to make comparisons. (I agree your code is
much neater though):

# YOURS

$ perl -d:DProf yours.pl
0.999979388766942
0.999954322899448
0.999919576561904
0.999919418581605
0.999911123899963
0.999901084906806
0.999897822368329
0.999895118668992
0.999870641207909
0.999817769402721
$ dprofpp
Total Elapsed Time = 1.27994 Seconds
User+System Time = 1.11994 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
0.89 0.010 0.010 2 0.0050 0.0050 main::BEGIN
0.00 - -0.000 1 - - strict::import
0.00 - -0.000 1 - - warnings::BEGIN
0.00 - -0.000 1 - - strict::bits
0.00 - -0.000 1 - - warnings::import

# MINE

$ perl -d:DProf mine.pl
0.999987822921746
0.999961276796125
0.999960878033757
0.999940209732916
0.999902694088071
0.999897664019262
0.99988878249334
0.999879222743452
0.999863848759027
0.999815730523473
$ dprofpp
Total Elapsed Time = 0.829862 Seconds
User+System Time = 0.809862 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
52.4 0.425 0.425 50000 0.0000 0.0000 main::top_sort
1.23 0.010 0.010 1 0.0100 0.0100 warnings::BEGIN
1.23 0.010 0.020 2 0.0050 0.0100 main::BEGIN
0.00 - -0.000 1 - - strict::bits
0.00 - -0.000 1 - - strict::import
0.00 - -0.000 1 - - warnings::import
 
Peter Hickman

However, as the numbers get bigger, from 10000 to 100000, yours is faster than
mine. So I wrote a version 3; having @top10 as a global variable rather than
passing it around helped a lot.

peter@anise:~$ time -p perl tt1.pl
0.999996444354355
0.999980725756771
0.999961809075284
0.999949246149662
0.999939670484004
0.999931145711979
0.999920330943766
0.999915345843593
0.999905530188123
0.999901835815507
real 0.59
user 0.58
sys 0.00
peter@anise:~$ time -p perl tt2.pl
0.999996550854185
0.999988689081981
0.999954812938235
0.999954112486751
0.999952499012984
0.999916621841294
0.999913707186565
0.999912082149152
0.999911171437891
0.999908870450732
real 0.88
user 0.84
sys 0.03
peter@anise:~$ time -p perl tt3.pl
0.999996013738279
0.999986292514876
0.999969367845608
0.999963241575056
0.999954341617741
0.999953365735578
0.999946424706014
0.999936395023607
0.999934874017583
0.99993412098663
real 0.15
user 0.15
sys 0.00
peter@anise:~$

#!/usr/bin/perl

use strict;
use warnings;

my @top10 = ( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 );

for ( my $i = 0 ; $i < 100000 ; $i++ ) {
    my $num = rand();
    top_sort( $num );
}

# PRINT RESULTS
foreach ( reverse @top10 ) {
    print $_ . "\n";
}

sub top_sort {
    my ( $value ) = @_;

    if ( $value > $top10[0] ) {
        $top10[0] = $value;
        # numeric sort, smallest value first (plain sort would compare as strings)
        @top10 = sort { $a <=> $b } @top10;
    }
}
 
Fatted

Christian said:
I'd get rid of all those shifts and if()s. I didn't time the
speed, but my solution would look like the following:

Thanks for your solution. I timed yours against mine (but over 50000
iterations) and it's a bit slower (but a helluva lot neater):

# WINTER

$ perl -d:DProf winter.pl
0.999946217800538
0.999944015909879
0.999942441099883
0.999925040395816
0.999911130801731
0.999891723472704
0.99987181193384
0.999861976302704
0.999800055313095
0.999793198711462
$ dprofpp
Total Elapsed Time = 1.639946 Seconds
User+System Time = 1.549946 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
0.65 0.010 0.010 1 0.0100 0.0100 warnings::BEGIN
0.00 - -0.000 1 - - strict::bits
0.00 - -0.000 1 - - strict::import
0.00 - -0.000 1 - - warnings::import
0.00 - 0.010 2 - 0.0050 main::BEGIN

# MINE

$ perl -d:DProf mine.pl
0.999987822921746
0.999961276796125
0.999960878033757
0.999940209732916
0.999902694088071
0.999897664019262
0.99988878249334
0.999879222743452
0.999863848759027
0.999815730523473
$ dprofpp
Total Elapsed Time = 0.829862 Seconds
User+System Time = 0.809862 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
52.4 0.425 0.425 50000 0.0000 0.0000 main::top_sort
1.23 0.010 0.010 1 0.0100 0.0100 warnings::BEGIN
1.23 0.010 0.020 2 0.0050 0.0100 main::BEGIN
0.00 - -0.000 1 - - strict::bits
0.00 - -0.000 1 - - strict::import
0.00 - -0.000 1 - - warnings::import
 
Fatted

Peter said:
However, as the numbers get bigger, from 10000 to 100000, yours is faster
than mine. So I wrote a version 3; having @top10 as a global variable
rather than passing it around helped a lot.

Yep, that's about 3 times faster (and tidier) over 50000 iterations than
mine using "time -p". ("perl -d:DProf" was giving me negative times :)
Thanks.
 
Shawn Corey

Christian said:
Fatted said:
Is there a better (faster) way of implementing my top_sort function?
(Creating a top 10 list of the highest numbers from a large set).
Or did I get lucky? :)


I'd get rid of all those shifts and if()s. I didn't time the
speed, but my solution would look like the following:
------------------------------------------------------------
#!perl

use strict;
use warnings;

my @top10 = (0,0,0,0,0,0,0,0,0,0);

for (0..9999)
{
    # Just sort the random value into @top10 and keep the
    # 10 topmost values via a list slice.
    @top10 = (sort { $b <=> $a } @top10, rand())[0..9];
}

foreach (@top10)
{
    print $_."\n";
}

__END__

I wouldn't sort for every number added.

--- Shawn

#!/usr/bin/perl

use strict;
use warnings;

my @top10 = ( 0 ) x 10;

for my $i ( 1 .. 10_000 ){
    my $num = rand();
    push @top10, $num;
    # Change 1_000 to be as large as your memory can handle
    if( @top10 > 1_000 ){
        @top10 = sort { $b <=> $a } @top10;
        $#top10 = 9;
    }
}
$#top10 = 9;

print "$_\n" for @top10;

__END__
 
Anno Siegel

Fatted said:
Is there a better (faster) way of implementing my top_sort function?
(Creating a top 10 list of the highest numbers from a large set).
Or did I get lucky? :)

The standard approach to selection of the top k of n elements is
to use a heap.

A heap is a data structure that allows insertion of arbitrary elements
and extraction of the minimum element, both in log k time where k is
the size of the heap. Inspection of the minimum can be done in unit
time. There are a few implementations on CPAN.

The strategy is to insert into the heap every element that is larger than
the current minimum (or any element while the heap is not yet full). If
an insertion makes the heap larger than k, also remove the current
minimum. In the end, the heap contains the top k elements, which can be
extracted in ascending order.

The process takes n*log k time instead of the n*log n needed to
sort all n values.

If n is much larger than k, you can even use a simple list instead
of a heap. You either sort it after each insertion, or insert each
element in its place (straight insertion sort). The worst case is then
n*k*log k, but insertions into the top-k list become rare once it
stabilizes, so for big n and small k this still beats the n*log n of
sorting everything.
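
For illustration, here is a rough, self-contained sketch of that strategy
with a hand-rolled array-based min-heap (no CPAN module; the sub names
sift_up/sift_down are just made up for this sketch):

#!/usr/bin/perl
use strict;
use warnings;

my $k = 10;
my @heap;    # array-based min-heap of the current top-$k candidates

sub sift_up {
    my ($heap, $i) = @_;
    while ($i > 0) {
        my $parent = int(($i - 1) / 2);
        last if $heap->[$parent] <= $heap->[$i];
        @$heap[$parent, $i] = @$heap[$i, $parent];
        $i = $parent;
    }
}

sub sift_down {
    my ($heap, $i) = @_;
    while (1) {
        my ($l, $r) = (2 * $i + 1, 2 * $i + 2);
        my $min = $i;
        $min = $l if $l < @$heap && $heap->[$l] < $heap->[$min];
        $min = $r if $r < @$heap && $heap->[$r] < $heap->[$min];
        last if $min == $i;
        @$heap[$min, $i] = @$heap[$i, $min];
        $i = $min;
    }
}

for (1 .. 100_000) {
    my $num = rand();
    if (@heap < $k) {
        # Heap not yet full: insert unconditionally (log k).
        push @heap, $num;
        sift_up(\@heap, $#heap);
    }
    elsif ($num > $heap[0]) {
        # Beats the current minimum: replace the root and restore
        # the heap property (log k).
        $heap[0] = $num;
        sift_down(\@heap, 0);
    }
}

# Extract the top $k in ascending order by repeatedly removing the minimum.
while (@heap) {
    print "$heap[0]\n";
    $heap[0] = $heap[-1];
    pop @heap;
    sift_down(\@heap, 0) if @heap;
}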

Anno
 
Fatted

Anno said:
The standard approach to selection of the top k of n elements is
to use a heap.

If n is much larger than k, you can even use a simple list instead
of a heap. You either sort it after each insertion, or insert each
element in its place (straight insertion sort).

What do you mean by a straight insertion sort?

(Thanks for explanation by the way).
 
John Bokma

Fatted said:
What do you mean by a straight insertion sort?

Since the list is sorted (because you insert in the right place) you can
use binary search to find the element's location, O(log k), and insert it.

Or am I wrong :)
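
For concreteness, a rough sketch of that binary-search insertion into a
descending top-10 list (illustrative only; insert_top is just a name made
up here):

#!/usr/bin/perl
use strict;
use warnings;

# Keep @$top sorted in descending order; find the insertion point for $val
# with a binary search (log k comparisons), then splice it in.
sub insert_top {
    my ($top, $val, $k) = @_;

    # Nothing to do if the list is full and $val doesn't beat the smallest.
    return if @$top >= $k && $val <= $top->[-1];

    my ($lo, $hi) = (0, scalar @$top);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if   ($top->[$mid] >= $val) { $lo = $mid + 1 }
        else                        { $hi = $mid }
    }

    splice @$top, $lo, 0, $val;     # insert in place
    pop @$top if @$top > $k;        # drop the smallest if we went over size
}

my @top10;
insert_top(\@top10, rand(), 10) for 1 .. 100_000;
print "$_\n" for @top10;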
 
John W. Krahn

Fatted said:
Is there a better (faster) way of implementing my top_sort function?
(Creating a top 10 list of the highest numbers from a large set).
Or did I get lucky? :)

--
#!/usr/bin/perl
use strict;
use warnings;

my @top10 = (0,0,0,0,0,0,0,0,0,0);

for(my $i = 0; $i < 10000; $i++)
{
    my $num = rand();
    @top10 = @{top_sort(\@top10, $num)};
}

# PRINT RESULTS
foreach (@top10)
{
    print $_."\n";
}

sub top_sort
{
    my $array = shift;
    my $val = shift;

    my @top;

    # IF THE VALUE IS LESS THAN THE LAST VALUE IN THE TOP LIST,
    # NO PROCESSING REQUIRED
    if(defined($val) && $val >= 0)
    {
        if($val < $array->[$#$array])
        {
            return($array);
        }
    }

    for(my $i = 0; $i < $#$array; ++$i)
    {
        if($val > $array->[$i])
        {
            # MAKE NEW TOP10 BY TAKING HIGHER MATCHES PLUS THE NEW
            # VALUE, PLUS THE LOWER MATCHES UP TO THE LIMIT OF THE
            # ARRAY.
            @top = @$array[0..($i-1)];
            push(@top, $val);
            push(@top, @$array[$i..($#$array-1)]);
            last;
        }
    }

Your algorithm is flawed because you don't compare $val to the last element of
@$array so under certain conditions the last element of @top will be incorrect.

What's that saying about premature optimization? :)
    # IN CASE OF PROBLEMS
    if(!defined($top[0]))
    {
        @top = @$array;
    }
    return(\@top);
}


John
 
Fatted

John said:
Your algorithm is flawed because you don't compare $val to the last
element of @$array so under certain conditions the last element of @top
will be incorrect.

Yeah, good spot. I created the above as a simplification of some of my
code for comments in c.l.p.m. Kinda got carried away with the
simplification...
What's that saying about premature optimization? :)

But it would be faster :)
..
..

:(
 
Gregory Toomey

Fatted said:
Is there a better (faster) way of implementing my top_sort function?

Your algorithm looks like a sort, i.e. O(n log n).

There is an algorithm that will find the nth sorted element in a list in
O(n) time, so finding the top 10 would be O(n) as well.
But of course I can't google a reference to the algorithm.
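
For reference, the usual candidate is quickselect (expected linear time;
median-of-medians gives a worst-case linear bound). A rough, purely
illustrative sketch:

#!/usr/bin/perl
use strict;
use warnings;

# Quickselect: after the call, the $k largest values of @$arr occupy
# $arr->[$lo .. $lo + $k - 1] (in no particular order). Expected linear time.
sub select_top {
    my ($arr, $lo, $hi, $k) = @_;
    return if $lo >= $hi || $k <= 0;

    # Lomuto-style partition around a random pivot, larger values first.
    my $p = $lo + int(rand($hi - $lo + 1));
    @$arr[$p, $hi] = @$arr[$hi, $p];
    my $pivot = $arr->[$hi];
    my $i = $lo;
    for my $j ($lo .. $hi - 1) {
        if ($arr->[$j] > $pivot) {
            @$arr[$i, $j] = @$arr[$j, $i];
            $i++;
        }
    }
    @$arr[$i, $hi] = @$arr[$hi, $i];    # pivot lands at position $i

    my $left = $i - $lo + 1;            # pivot plus everything above it
    if    ($k < $left) { select_top($arr, $lo, $i - 1, $k) }
    elsif ($k > $left) { select_top($arr, $i + 1, $hi, $k - $left) }
}

my @data = map { rand() } 1 .. 100_000;
select_top(\@data, 0, $#data, 10);
print "$_\n" for sort { $b <=> $a } @data[0 .. 9];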

gtoomey
 
Anno Siegel

John Bokma said:
Since the list is sorted (because you insert in the right place) you can
use binary search to find the element's location, O(log k), and insert it.

Or am I wrong :)

That's it.

It is not a good general-purpose sorting algorithm because insertion
means you must move n/2 elements on average to make room for a new
one, so the insertions alone make it an n**2 process.

On the other hand, if you must *keep* a list sorted while adding and
removing elements, it is practically your only choice. In this
particular case it isn't so bad because the list of top-k candidates
is short and remains short.

Anno
 
John Bokma

Anno said:

[ insertion sort ]
It is not a good general-purpose sorting algorithm because insertion
means you must move n/2 elements on average to make room for a new
one, so the insertions alone make it an n**2 process.

Depends on your data structure of course.
On the other hand, if you must *keep* a list sorted while adding and
removing elements, it is practically your only choice. In this
particular case it isn't so bad because the list of top-k candidates
is short and remains short.

Bubble sort might outperform normal quicksort if k is small, I guess.
 
Eric Schwartz

Anno said:
It is not a good general-purpose sorting algorithm because insertion
means you must move n/2 elements on average to make room for a new
one, so the insertions alone make it an n**2 process.

But n=10, so in this case, that doesn't matter much. If you're
developing a general top-N algorithm, that's a much bigger
consideration.

-=Eric
 
Anno Siegel

John Bokma said:

[ insertion sort ]
It is not a good general-purpose sorting algorithm because insertion
means you must move n/2 elements on average to make room for a new
one, so the insertions alone make it an n**2 process.

Depends on your data structure of course.

In general, yes. In this case, we want a compact list (at least of
pointers) to perform a binary search on.
Bubble sort might outperform normal quicksort if k is small, I guess.

Well, bubble sort avoids the overhead of insertion; it swaps elements.

Anno
 
ctcgag

John Bokma said:

[ insertion sort ]
It is not a good general-purpose sorting algorithm because insertion
means you must move n/2 elements on average to make room for a new
one, so the insertions alone make it an n**2 process.

Depends on your data structure of course.

In general, yes. In this case, we want a compact list (at least of
pointers) to perform a binary search on.
Bubble sort might outperform normal quicksort if k is small, I guess.

Well, bubble sort avoids the overhead of insertion; it swaps elements.

Bubble sort does insertion by swapping elements, which is a pretty awful
way to do insertion, especially when insertion itself is supported by very
low-level (even hardware-level) optimizations.

Xho
 
ctcgag

Fatted said:
Is there a better (faster) way of implementing my top_sort function?
(Creating a top 10 list of the highest numbers from a large set).
Or did I get lucky? :)

You appear to be operating with a set of 10_000 numbers. In my book,
that is a couple of orders of magnitude below "large".

For 10,000, I wouldn't even bother with anything except

@top10 = (sort {whatever} @data)[0..9];
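
With the comparator filled in as a numeric descending sort, that might
look something like:

#!/usr/bin/perl
use strict;
use warnings;

my @data  = map { rand() } 1 .. 10_000;

# Numeric descending sort, then take the first ten.
my @top10 = (sort { $b <=> $a } @data)[0 .. 9];

print "$_\n" for @top10;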

--
#!/usr/bin/perl
use strict;
use warnings;

my @top10 = (0,0,0,0,0,0,0,0,0,0);

What if some of the top10 numbers are negative?
for(my $i = 0;$i < 10000; $i++)
{
my $num = rand();
@top10 = @{top_sort(\@top10, $num)};

Invoking the overhead of a subroutine call for each member of the set
is going to kill your performance. If you are really worried about
performance, you should batch up your numbers.
for(my $i = 0; $i < $#$array; ++$i)

$i < @$array
or
$i<=$#$array


Xho
 
Anno Siegel

Abigail said:
Fatted wrote:
!! Is there a better (faster) way of implementing my top_sort function?
!! (Creating a top 10 list of the highest numbers from a large set).
!! Or did I get lucky? :)


#!/usr/bin/perl

use strict;
use warnings;
no warnings qw /syntax/;

my @heap = map { int rand 1_000_000 } 1 .. 100_000;

sub heapify;
sub heapify {
    my $index = shift;

    my ($c1, $c2) = (2 * $index + 1, 2 * $index + 2);
    my $max = $index;
    $max = $c1 if $c1 < @heap && $heap[$c1] > $heap[$max];
    $max = $c2 if $c2 < @heap && $heap[$c2] > $heap[$max];

    return if $max == $index;

    @heap[$index, $max] = @heap[$max, $index];
    heapify $max;
}


for (my $i = int(@heap / 2); $i >= 0; $i--) { heapify $i }

if (@heap) {
    for (1 .. 10) {
        print $heap[0], "\n";
        my $tmp = pop @heap;
        last unless @heap;
        $heap[0] = $tmp;
        heapify 0;
    }
}

Well, you're heapifying the entire 100,000-element list when a ten-element
heap would do. Might as well sort :)

With a bit more machinery heap size can be kept to 10 the whole time.
A new element is only inserted when it is larger than the current
minimum (you need a minimum-heap for top ten). If the heap is at
full size, the old minimum is deleted, keeping the size constant. For
a fixed k of ten, this gives linear time behavior (worst case) in
the size of the original list; the worst case is that a new element
must be inserted into the top ten on every step. In practice the top
ten become rather stable, and after a while there are fewer and fewer
insertions.

Besides heapify() and min/max extraction (not named, but present in your
code), heap insertion is needed. That's the one non-trivial addition;
apparently it can't be based on heapify().

(Untested and incomplete)

my $h = Heap->new;
for ( map int rand 1_000_000, 1 .. 100_000 ) {
    if ( $h->size < SIZE or $_ > $h->minimum ) {
        $h->insert( $_ );
        $h->extract_min if $h->size > SIZE;
    }
}

In fact it runs a little faster, though OO overhead eats a lot of
the advantage.
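
For anyone who wants to run the sketch above: a bare-bones, purely
illustrative stand-in for the Heap class, providing just the four methods
the sketch calls on top of an array-based binary min-heap (the main
script would also need something like "use constant SIZE => 10;"). This
is not a CPAN module.

package Heap;

use strict;
use warnings;

sub new     { bless { data => [] }, shift }
sub size    { scalar @{ $_[0]{data} } }
sub minimum { $_[0]{data}[0] }

sub insert {
    my ($self, $val) = @_;
    my $d = $self->{data};
    push @$d, $val;
    my $i = $#$d;
    while ($i > 0) {                       # sift the new value up
        my $p = int(($i - 1) / 2);
        last if $d->[$p] <= $d->[$i];
        @$d[$p, $i] = @$d[$i, $p];
        $i = $p;
    }
}

sub extract_min {
    my $self = shift;
    my $d    = $self->{data};
    my $min  = $d->[0];
    $d->[0] = $d->[-1];
    pop @$d;
    my $i = 0;
    while (1) {                            # sift the moved value down
        my ($l, $r) = (2 * $i + 1, 2 * $i + 2);
        my $s = $i;
        $s = $l if $l < @$d && $d->[$l] < $d->[$s];
        $s = $r if $r < @$d && $d->[$r] < $d->[$s];
        last if $s == $i;
        @$d[$s, $i] = @$d[$i, $s];
        $i = $s;
    }
    return $min;
}

1;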

Anno
 
