More math than perl...

B

Bill H

Background:

I have a routine I am writing in perl that will give me the median for
a 0 to 5 rating. The ratings are stored in a file and I load the
values into 7 different variables, RATE0 - RATE5 and one called TOTAL.
When a person rates a page I increment one of the RATE variables
based on what they selected (0 - 5) and increment TOTAL so I have a
running count (which is really just the sum of RATE0 - RATE5).

The problem I have (and I hope I am explaining this right), to
calculate a median, I have to make an array that contains all the
values, sorted from low to high, and then look at the value of the
element in the middle to get the median. As an example if I have the
following (not real code, just an example of the logic):

$RATE[0] = 3;
$RATE[1] = 1;
$RATE[2] = 0;
$RATE[3] = 4;
$RATE[4] = 1;
$RATE[5] = 2;

Then my array would be:

@ARRAY = (0,0,0,1,3,3,3,3,4,5,5);

And the median would be $ARRAY[5] or 3. With an even number of
elements in @ARRAY I have to add the value below the middle and the
value above the middle, divide by 2 to get the median.

For a small sample this is no problem, but when the number of people
who have rated it get in to the 1000's this array is going to be too
cumbersome. Does anyone know of a simpler way to do it in perl without
adding in modules or using alot of memory?

Any / all ideas are welcomed, but please remember that the example I
gave is just typed to give you an idea and is not any real code I am
using.

Bill H
 
C

Charlton Wilbur

BH> The problem I have (and I hope I am explaining this right), to
BH> calculate a median, I have to make an array that contains all
BH> the values, sorted from low to high, and then look at the
BH> value of the element in the middle to get the median.

So, let me see if I understand this right.

You have five variables, $RATE0 through $RATE5, and each contains a
count of how many people rated that page that number?

You could probably do something slick, but the brute-force method
looks like this:

my @array = ((0) x $RATE0, (1) x $RATE1, (2) x $RATE2,
(3) x $RATE3, (4) x $RATE4, (5) x $RATE5);

my $median;

if (@array % 2)
{
$median = ($array[(@array-1)/2] + $array[(@array+1)/2])/2;
}
else
{
$median = $array[@array/2];
}

Alternately, if you did the sensible thing and kept $RATE0 through
$RATE5 in an array, you could say, much more elegantly,

my @array = map { ($_) x $RATE[$_] } (0..5);

Charlton
 
M

Mumia W.

[...]
And the median would be $ARRAY[5] or 3. With an even number of
elements in @ARRAY I have to add the value below the middle and the
value above the middle, divide by 2 to get the median.

For a small sample this is no problem, but when the number of people
who have rated it get in to the 1000's this array is going to be too
cumbersome. Does anyone know of a simpler way to do it in perl without
adding in modules or using alot of memory?
[...]

I would just build the array in memory. On any reasonably modern system,
you'll have to have millions of values before you run out of memory.

I know the mean can be calculated "on the fly"--without storing all of
the values to be examined, but I can't see how this is to be done with
the median; I don't think it's possible.

PS.
I would have given this post a more descriptive subject line like:
calculating median without using too much memory.
 
J

John W. Krahn

Bill said:
Background:

I have a routine I am writing in perl that will give me the median for
a 0 to 5 rating. The ratings are stored in a file and I load the
values into 7 different variables, RATE0 - RATE5 and one called TOTAL.
When a person rates a page I increment one of the RATE variables
based on what they selected (0 - 5) and increment TOTAL so I have a
running count (which is really just the sum of RATE0 - RATE5).

The problem I have (and I hope I am explaining this right), to
calculate a median, I have to make an array that contains all the
values, sorted from low to high, and then look at the value of the
element in the middle to get the median. As an example if I have the
following (not real code, just an example of the logic):

$RATE[0] = 3;
$RATE[1] = 1;
$RATE[2] = 0;
$RATE[3] = 4;
$RATE[4] = 1;
$RATE[5] = 2;

Then my array would be:

@ARRAY = (0,0,0,1,3,3,3,3,4,5,5);

And the median would be $ARRAY[5] or 3. With an even number of
elements in @ARRAY I have to add the value below the middle and the
value above the middle, divide by 2 to get the median.

For a small sample this is no problem, but when the number of people
who have rated it get in to the 1000's this array is going to be too
cumbersome. Does anyone know of a simpler way to do it in perl without
adding in modules or using alot of memory?

Any / all ideas are welcomed, but please remember that the example I
gave is just typed to give you an idea and is not any real code I am
using.

Perhaps this is close to what you require:

$ perl -le'
my @RATES = ( 3, 1, 0, 4, 1, 2 );
my $TOTAL = 11;

my $half = int( $TOTAL / 2 );
for my $i ( 0 .. $#RATES ) {
if ( ( $half -= $RATES[ $i ] ) < 0 ) {
print "Median = $i";
last;
}
}
'
Median = 3



John
 
B

Bill H

BH> The problem I have (and I hope I am explaining this right), to
BH> calculate a median, I have to make an array that contains all
BH> the values, sorted from low to high, and then look at the
BH> value of the element in the middle to get the median.

So, let me see if I understand this right.

You have five variables, $RATE0 through $RATE5, and each contains a
count of how many people rated that page that number?

You could probably do something slick, but the brute-force method
looks like this:

my @array = ((0) x $RATE0, (1) x $RATE1, (2) x $RATE2,
(3) x $RATE3, (4) x $RATE4, (5) x $RATE5);

my $median;

if (@array % 2)
{
$median = ($array[(@array-1)/2] + $array[(@array+1)/2])/2;}

else
{
$median = $array[@array/2];

}

Alternately, if you did the sensible thing and kept $RATE0 through
$RATE5 in an array, you could say, much more elegantly,

my @array = map { ($_) x $RATE[$_] } (0..5);

Charlton

Thanks Charlton, but would this not still make a large array if the
total number of people is high (unles I am missing something in it).

Bill H
 
X

xhoster

Bill H said:
Background:

I have a routine I am writing in perl that will give me the median for
a 0 to 5 rating. The ratings are stored in a file and I load the
values into 7 different variables, RATE0 - RATE5 and one called TOTAL.
When a person rates a page I increment one of the RATE variables
based on what they selected (0 - 5) and increment TOTAL so I have a
running count (which is really just the sum of RATE0 - RATE5).

The problem I have (and I hope I am explaining this right), to
calculate a median, I have to make an array that contains all the
values, sorted from low to high, and then look at the value of the
element in the middle to get the median. As an example if I have the
following (not real code, just an example of the logic):

$RATE[0] = 3;
$RATE[1] = 1;
$RATE[2] = 0;
$RATE[3] = 4;
$RATE[4] = 1;
$RATE[5] = 2;

Compute the median directly from the structure you already have.

use List::Util qw(sum);

sub median_from_bins {
my ($bins,$total)=@_;
$total=sum @$bins unless defined $total;
my $sofar=0;
for (my $x=0; $x<=5; $x++) {
$sofar+=$bins->[$x];
return $x if $sofar>$total/2;
if ($sofar == $total/2) {
my $y=$x+1;
$y++ until $bins->[$y];
return ($x+$y)/2;
};
die "Should never get here $x $sum $total @$bins";
};

my $median = median_from_bins(\@RATE,$TOTAL);

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
 
T

Tad McClellan

Bill H said:
BH> The problem I have (and I hope I am explaining this right), to
BH> calculate a median, I have to make an array that contains all
BH> the values, sorted from low to high, and then look at the
BH> value of the element in the middle to get the median.
Alternately, if you did the sensible thing and kept $RATE0 through
$RATE5 in an array, you could say, much more elegantly,

my @array = map { ($_) x $RATE[$_] } (0..5);


[ it is bad 'net manners to quote .sigs ...]

Thanks Charlton, but would this not still make a large array if the
total number of people is high (unles I am missing something in it).


How many hundreds of thousands of people do you expect
will take your survey?
 
P

Peter Jamieson

Bill H said:
Background:

I have a routine I am writing in perl that will give me the median for
a 0 to 5 rating. The ratings are stored in a file and I load the
values into 7 different variables, RATE0 - RATE5 and one called TOTAL.
When a person rates a page I increment one of the RATE variables
based on what they selected (0 - 5) and increment TOTAL so I have a
running count (which is really just the sum of RATE0 - RATE5).

The problem I have (and I hope I am explaining this right), to
calculate a median, I have to make an array that contains all the
values, sorted from low to high, and then look at the value of the
element in the middle to get the median. As an example if I have the
following (not real code, just an example of the logic):

$RATE[0] = 3;
$RATE[1] = 1;
$RATE[2] = 0;
$RATE[3] = 4;
$RATE[4] = 1;
$RATE[5] = 2;

Then my array would be:

@ARRAY = (0,0,0,1,3,3,3,3,4,5,5);

And the median would be $ARRAY[5] or 3. With an even number of
elements in @ARRAY I have to add the value below the middle and the
value above the middle, divide by 2 to get the median.

For a small sample this is no problem, but when the number of people
who have rated it get in to the 1000's this array is going to be too
cumbersome. Does anyone know of a simpler way to do it in perl without
adding in modules or using alot of memory?

Any / all ideas are welcomed, but please remember that the example I
gave is just typed to give you an idea and is not any real code I am
using.

Bill H

Bill, If keeping memory use low is a priority and you indeed need the median
of thousands of ratings then for your data you can probably safely use the
mean value
since as your data count increases the mean and median will converge.
INT($mean) could give you a whole number if needed.
Cheers, Peter
 
M

Mumia W.

Bill said:
[ problem calculating the median without using too much memory ]
Bill H


use strict;
use warnings;
@ARRAY = (0,0,0,1,3,3,3,3,4,5,5);

This kind of array is what Bill wanted to avoid creating.
# using the same array since you are concerned about memory.
# need to load the array to handle sorting of 2 digit numbers.
@ARRAY = sort map {sprintf "%05d", $_} @ARRAY;

How is that simpler than this?

@ARRAY = sort { $a said:
$midPoint = $#ARRAY / 2;
$median = $ARRAY[int $midPoint];

if ($midPoint != int $midPoint) {
$upperPoint = $midPoint +1;
$median = ($median + $ARRAY[int $upperPoint]) / 2;
}

print "median = $median\n";

use POSIX 'ceil';
print "median = ", $ARRAY[ceil(@ARRAY/2)], "\n";
But this is why I use the Statistics::Descriptive::Discrete module to
calculate medians.

Bill said he didn't want to use any modules.
 
M

Michele Dondi

if (@array % 2)
{
$median = ($array[(@array-1)/2] + $array[(@array+1)/2])/2;
}
else
{
$median = $array[@array/2];
}

Actually, AIUI the index of latter should be (@array-1)/2 (or
$#array/2) and the two should calculations should be swapped. Of
course, this is IMHO a good place where to use the ternary conditional
operator.[*]


[*] On a second thought, do "of course" and "IMHO" clash?


Michele
 
B

Bill H

Bill said:
[ problem calculating the median without using too much memory ]
Bill H
use strict;
use warnings;
@ARRAY = (0,0,0,1,3,3,3,3,4,5,5);

This kind of array is what Bill wanted to avoid creating.


# using the same array since you are concerned about memory.
# need to load the array to handle sorting of 2 digit numbers.
@ARRAY = sort map {sprintf "%05d", $_} @ARRAY;

How is that simpler than this?

@ARRAY = sort { $a said:
$midPoint = $#ARRAY / 2;
$median = $ARRAY[int $midPoint];
if ($midPoint != int $midPoint) {
$upperPoint = $midPoint +1;
$median = ($median + $ARRAY[int $upperPoint]) / 2;
}
print "median = $median\n";

use POSIX 'ceil';
print "median = ", $ARRAY[ceil(@ARRAY/2)], "\n";


But this is why I use the Statistics::Descriptive::Discrete module to
calculate medians.

Bill said he didn't want to use any modules.

Thanks for the help guys. I ended up using a combination of the
examples given:


sub getMedian
{
my @values = @_;
my @median = map { ($_) x $values[$_] } (0..5);
my $m = int(@median / 2);
if ($m != @median / 2)
{
$m = int(($median[$m] + $median[$m + 1]) / 2);
}
else
{
$m = $median[$m];
}
return ($m);
}

where I call it with:

$median = getMedian(@RATE);

I do end up creating the array, but I think it will be ok.

Bill H
 
B

Bill H

On 10/05/2007 09:55 PM, l v wrote:
Bill H wrote:
[ problem calculating the median without using too much memory ]
Bill H
use strict;
use warnings;
@ARRAY = (0,0,0,1,3,3,3,3,4,5,5);
This kind of array is what Bill wanted to avoid creating.
How is that simpler than this?
@ARRAY = sort { $a <=> $b } @ARRAY;
$midPoint = $#ARRAY / 2;
$median = $ARRAY[int $midPoint];
if ($midPoint != int $midPoint) {
$upperPoint = $midPoint +1;
$median = ($median + $ARRAY[int $upperPoint]) / 2;
}
print "median = $median\n";
use POSIX 'ceil';
print "median = ", $ARRAY[ceil(@ARRAY/2)], "\n";
But this is why I use the Statistics::Descriptive::Discrete module to
calculate medians.
Bill said he didn't want to use any modules.

Thanks for the help guys. I ended up using a combination of the
examples given:

sub getMedian
{
my @values = @_;
my @median = map { ($_) x $values[$_] } (0..5);
my $m = int(@median / 2);
if ($m != @median / 2)
{
$m = int(($median[$m] + $median[$m + 1]) / 2);
}
else
{
$m = $median[$m];
}
return ($m);

}

where I call it with:

$median = getMedian(@RATE);

I do end up creating the array, but I think it will be ok.

Bill H- Hide quoted text -

- Show quoted text -

After playing with it for awhile I wonder if median is what I really
need. Logically, if you have 60 people rate the page at 0 and 30
people rate it at 5 then the page rating should be somewhere between 1
and 2, but using a median it would still be ranked at a 0 (middle
element in the array would be a 0). I know it aint strictly perl, but
any thoughts?

Bill H
 
Q

QoS

Bill H said:
On 10/05/2007 09:55 PM, l v wrote:
Bill H wrote:
[ problem calculating the median without using too much memory ]
Bill H
use strict;
use warnings;
@ARRAY = (0,0,0,1,3,3,3,3,4,5,5);
This kind of array is what Bill wanted to avoid creating.
# using the same array since you are concerned about memory.
# need to load the array to handle sorting of 2 digit numbers.
@ARRAY = sort map {sprintf "%05d", $_} @ARRAY;
How is that simpler than this?
@ARRAY = sort { $a <=> $b } @ARRAY;
$midPoint = $#ARRAY / 2;
$median = $ARRAY[int $midPoint];
if ($midPoint != int $midPoint) {
$upperPoint = $midPoint +1;
$median = ($median + $ARRAY[int $upperPoint]) / 2;
}
print "median = $median\n";
use POSIX 'ceil';
print "median = ", $ARRAY[ceil(@ARRAY/2)], "\n";
But this is why I use the Statistics::Descriptive::Discrete module to
calculate medians.
Bill said he didn't want to use any modules.

Thanks for the help guys. I ended up using a combination of the
examples given:

sub getMedian
{
my @values = @_;
my @median = map { ($_) x $values[$_] } (0..5);
my $m = int(@median / 2);
if ($m != @median / 2)
{
$m = int(($median[$m] + $median[$m + 1]) / 2);
}
else
{
$m = $median[$m];
}
return ($m);

}

where I call it with:

$median = getMedian(@RATE);

I do end up creating the array, but I think it will be ok.

Bill H- Hide quoted text -

- Show quoted text -

After playing with it for awhile I wonder if median is what I really
need. Logically, if you have 60 people rate the page at 0 and 30
people rate it at 5 then the page rating should be somewhere between 1
and 2, but using a median it would still be ranked at a 0 (middle
element in the array would be a 0). I know it aint strictly perl, but
any thoughts?

Bill H

Perhaps then you need the average; first create a total for each of
your possible values and then add these totals together, then divide
by six for the average.

Here is a simple example which needs some hardening but may show the
concept fairly clearly.

#!/usr/bin/perl/
use strict;
use warnings;

my @values = (2, 4, 3, 9, 4, 16);
my $total = 0;
my $average;

foreach my $i (0..5) {
$total += $values[$i] || 0;
}
$average = $total / 6;
print "The average response is: [$average]\n";
 
D

Doug Miller

After playing with it for awhile I wonder if median is what I really
need.

Probably not.
Logically, if you have 60 people rate the page at 0 and 30
people rate it at 5 then the page rating should be somewhere between 1
and 2,

((60 * 0) + (30 * 5)) / (60 + 30) = 150/90 = 1.667
but using a median it would still be ranked at a 0 (middle
element in the array would be a 0). I know it aint strictly perl, but
any thoughts?

Use the mean instead. Or you could display a full statistical report: mean,
median, mode, and standard deviation. :)
 
M

Michele Dondi

I know the mean can be calculated "on the fly"--without storing all of
the values to be examined, but I can't see how this is to be done with
the median; I don't think it's possible.

Sure it is possible:

#!/usr/bin/perl

use strict;
use warnings;
use List::Util 'sum';
use constant TESTS => 20;
use Test::More tests => TESTS;

sub naive {
my @arr = map +($_) x $_[$_], 0..$#_;
@arr % 2 ?
@arr[(@arr-1)/2] :
(@arr[@arr/2 - 1] + @arr[@arr/2])/2;
}

sub findidx {
my $i=shift;
($i -= $_[$_])<0 and return $_ for 0..$#_;
}

sub smart {
my $t=sum @_;
$t%2 ?
findidx +($t-1)/2, @_ :
(findidx($t/2-1, @_) + findidx($t/2, @_))/2;
}

for (1..TESTS) {
my @a=map int rand 10, 0..5;
is smart(@a), naive(@a), "Test @a";
}

__END__


Note: it is to be noted here that smart() is not very smart because I
feel the calculations performed by (findidx($t/2-1, @_) and
findidx($t/2, @_) are very much the same, but in the first attempt
with no helper sub I always got some failing error, so this one
however bloated at least shows as a proff of concept that it is not
necessary to go brute force.
PS.
I would have given this post a more descriptive subject line like:
calculating median without using too much memory.

Seconded.


Michele
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,744
Messages
2,569,484
Members
44,906
Latest member
SkinfixSkintag

Latest Threads

Top