# More math than perl...

Discussion in 'Perl Misc' started by Bill H, Oct 5, 2007.

1. ### Bill H (Guest)

Background:

I have a routine I am writing in perl that will give me the median for
a 0 to 5 rating. The ratings are stored in a file and I load the
values into 7 different variables, RATE0 - RATE5 and one called TOTAL.
When a person rates a page I increment one of the RATE variables
based on what they selected (0 - 5) and increment TOTAL so I have a
running count (which is really just the sum of RATE0 - RATE5).

The problem I have (and I hope I am explaining this right) is that to
calculate a median, I have to make an array that contains all the
values, sorted from low to high, and then look at the value of the
element in the middle. As an example, if I have the following (not
real code, just an example of the logic):

$RATE[0] = 3;
$RATE[1] = 1;
$RATE[2] = 0;
$RATE[3] = 4;
$RATE[4] = 1;
$RATE[5] = 2;

Then my array would be:

@ARRAY = (0,0,0,1,3,3,3,3,4,5,5);

And the median would be $ARRAY[5], or 3. With an even number of
elements in @ARRAY I have to add the value below the middle and the
value above the middle, then divide by 2 to get the median.

For a small sample this is no problem, but when the number of people
who have rated it gets into the thousands this array is going to be
too cumbersome. Does anyone know of a simpler way to do it in Perl
without adding modules or using a lot of memory?

Any / all ideas are welcomed, but please remember that the example I
gave is just typed to give you an idea and is not any real code I am
using.

Bill H

Bill H, Oct 5, 2007

2. ### Charlton Wilbur (Guest)

>>>>> "BH" == Bill H <> writes:

BH> The problem I have (and I hope I am explaining this right), to
BH> calculate a median, I have to make an array that contains all
BH> the values, sorted from low to high, and then look at the
BH> value of the element in the middle to get the median.

So, let me see if I understand this right.

You have six variables, $RATE0 through $RATE5, and each contains a
count of how many people rated that page that number?

You could probably do something slick, but the brute-force method
looks like this:

my @array = ((0) x $RATE0, (1) x $RATE1, (2) x $RATE2,
             (3) x $RATE3, (4) x $RATE4, (5) x $RATE5);

my $median;

if (@array % 2)
{
    $median = ($array[(@array-1)/2] + $array[(@array+1)/2])/2;
}
else
{
    $median = $array[@array/2];
}

Alternately, if you did the sensible thing and kept $RATE0 through
$RATE5 in an array, you could say, much more elegantly,

my @array = map { ($_) x $RATE[$_] } (0..5);

Charlton

--
Charlton Wilbur

Charlton Wilbur, Oct 5, 2007

3. ### Mumia W. (Guest)

On 10/05/2007 04:12 PM, Bill H wrote:
> [...]
> And the median would be $ARRAY[5] or 3. With an even number of
> elements in @ARRAY I have to add the value below the middle and the
> value above the middle, divide by 2 to get the median.
>
> For a small sample this is no problem, but when the number of people
> who have rated it get in to the 1000's this array is going to be too
> cumbersome. Does anyone know of a simpler way to do it in perl without
> adding in modules or using alot of memory?
> [...]

I would just build the array in memory. On any reasonably modern system,
you'll have to have millions of values before you run out of memory.

I know the mean can be calculated "on the fly"--without storing all of
the values to be examined, but I can't see how this is to be done with
the median; I don't think it's possible.

PS.
I would have given this post a more descriptive subject line like:
calculating median without using too much memory.

Mumia W., Oct 5, 2007
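
The remark above that the mean can be computed "on the fly" can be sketched roughly as follows. This is only an illustration, assuming ratings arrive one at a time; `make_running_mean` is a hypothetical helper name, not code from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a closure that accepts one rating at a time
# and returns the mean of everything seen so far. Only a count and a sum
# are kept, so memory use does not grow with the number of ratings.
sub make_running_mean {
    my ($count, $sum) = (0, 0);
    return sub {
        my ($rating) = @_;
        $count++;
        $sum += $rating;
        return $sum / $count;
    };
}

# Feed in the eleven ratings from the original example, one by one.
my $mean = make_running_mean();
my $latest;
$latest = $mean->($_) for (0, 0, 0, 1, 3, 3, 3, 3, 4, 5, 5);
printf "mean = %.3f\n", $latest;
```

The same trick does not carry over to the median of arbitrary values, but with only six possible ratings the six counts already hold everything needed.
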
4. ### John W. Krahn (Guest)

Bill H wrote:
> For a small sample this is no problem, but when the number of people
> who have rated it get in to the 1000's this array is going to be too
> cumbersome. Does anyone know of a simpler way to do it in perl without
> adding in modules or using alot of memory?

Perhaps this is close to what you require:

$ perl -le'
my @RATES = ( 3, 1, 0, 4, 1, 2 );
my $TOTAL = 11;

my $half = int( $TOTAL / 2 );
for my $i ( 0 .. $#RATES ) {
    if ( ( $half -= $RATES[ $i ] ) < 0 ) {
        print "Median = $i";
        last;
    }
}
'
Median = 3

John
--
Perl isn't a toolbox, but a small machine shop where you
can special-order certain sorts of tools at low cost and
in short order. -- Larry Wall

John W. Krahn, Oct 5, 2007
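
The loop above can be read as a lookup in the cumulative counts. As a sketch, with $c_k$ the number of people who gave rating $k$ (the symbols $c_k$, $C_k$, $N$ and $v(j)$ are introduced here for illustration and do not appear in the thread):

```latex
N = \sum_{k=0}^{5} c_k, \qquad C_k = \sum_{i=0}^{k} c_i

% value found at 0-based position j of the sorted list, 0 \le j < N
v(j) = \min \{\, k : C_k > j \,\}

\text{median} =
\begin{cases}
  v\!\left(\tfrac{N-1}{2}\right) & N \text{ odd} \\[4pt]
  \tfrac{1}{2}\left[\, v\!\left(\tfrac{N}{2}-1\right) + v\!\left(\tfrac{N}{2}\right) \right] & N \text{ even}
\end{cases}
```

For the example counts (3, 1, 0, 4, 1, 2), N = 11 and the cumulative counts are 3, 4, 4, 8, 9, 11. The first one greater than j = 5 is C_3 = 8, so the median is 3, which matches the output above. The loop as posted covers the odd case; an even total needs the second lookup as well.
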
5. ### Bill H (Guest)

On Oct 5, 5:44 pm, Charlton Wilbur <> wrote:
> Alternately, if you did the sensible thing and kept $RATE0 through
> $RATE5 in an array, you could say, much more elegantly,
>
> my @array = map { ($_) x $RATE[$_] } (0..5);

Thanks Charlton, but would this not still make a large array if the
total number of people is high (unless I am missing something in it)?

Bill H

Bill H, Oct 5, 2007
6. ### Guest

Bill H <> wrote:
> $RATE[0] = 3;
> $RATE[1] = 1;
> $RATE[2] = 0;
> $RATE[3] = 4;
> $RATE[4] = 1;
> $RATE[5] = 2;

Compute the median directly from the structure you already have.

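use List::Util qw(sum);

sub median_from_bins {
    my ($bins, $total) = @_;
    $total = sum @$bins unless defined $total;
    return unless $total;    # no ratings yet, so no median
    my $sofar = 0;
    for (my $x = 0; $x <= 5; $x++) {
        $sofar += $bins->[$x];
        # More than half of the ratings are <= $x, so $x is the median.
        return $x if $sofar > $total / 2;
        # Exactly half are <= $x (even total): the two middle elements
        # are $x and the next rating with a non-zero count.
        if ($sofar == $total / 2) {
            my $y = $x + 1;
            $y++ until $bins->[$y];
            return ($x + $y) / 2;
        }
    }
    die "Should never get here $sofar $total @$bins";
}

my $median = median_from_bins(\@RATE, $TOTAL);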

Xho

--
The costs of publication of this article were defrayed in part by the
this fact.

, Oct 5, 2007

Bill H <> wrote:
> On Oct 5, 5:44 pm, Charlton Wilbur <> wrote:
>> >>>>> "BH" == Bill H <> writes:

>>
>> BH> The problem I have (and I hope I am explaining this right), to
>> BH> calculate a median, I have to make an array that contains all
>> BH> the values, sorted from low to high, and then look at the
>> BH> value of the element in the middle to get the median.

>> Alternately, if you did the sensible thing and kept $RATE0 through
>> $RATE5 in an array, you could say, much more elegantly,
>>
>> my @array = map { ($_) x $RATE[$_] } (0..5);

>> --
>> Charlton Wilbur
>>

[ it is bad 'net manners to quote .sigs ...]

> Thanks Charlton, but would this not still make a large array if the
> total number of people is high (unles I am missing something in it).

How many hundreds of thousands of people do you expect

--
email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"

8. ### Peter Jamieson (Guest)

"Bill H" <> wrote in message
news:...
> For a small sample this is no problem, but when the number of people
> who have rated it get in to the 1000's this array is going to be too
> cumbersome. Does anyone know of a simpler way to do it in perl without
> adding in modules or using alot of memory?

Bill, if keeping memory use low is a priority and you really do need the
median of thousands of ratings, then for your data you can probably use
the mean instead, since for reasonably symmetric ratings the mean and
median will be close once the count is large. int($mean) would give you
a whole number if needed.
Cheers, Peter

Peter Jamieson, Oct 6, 2007
9. ### Mumia W. (Guest)

On 10/05/2007 09:55 PM, l v wrote:
> Bill H wrote:
>> [ problem calculating the median without using too much memory ]
>> Bill H
>>

>
>
> use strict;
> use warnings;
> @ARRAY = (0,0,0,1,3,3,3,3,4,5,5);

This kind of array is what Bill wanted to avoid creating.

>
> # using the same array since you are concerned about memory.
> # need to load the array to handle sorting of 2 digit numbers.
> @ARRAY = sort map {sprintf "%05d", $_} @ARRAY;

How is that simpler than this?

@ARRAY = sort { $a <=> $b } @ARRAY;

> $midPoint = $#ARRAY / 2;
> $median = $ARRAY[int $midPoint];
>
> if ($midPoint != int $midPoint) {
>     $upperPoint = $midPoint +1;
>     $median = ($median + $ARRAY[int $upperPoint]) / 2;
> }
>
> print "median = $median\n";
>

use POSIX 'ceil';
# ceil(@ARRAY/2) is a 1-based position, so subtract 1 for the 0-based index
print "median = ", $ARRAY[ceil(@ARRAY/2) - 1], "\n";

>
> But this is why I use the Statistics::Descriptive::Discrete module to
> calculate medians.
>

Bill said he didn't want to use any modules.

Mumia W., Oct 6, 2007
10. ### Michele Dondi (Guest)

On 05 Oct 2007 17:44:43 -0400, Charlton Wilbur
<> wrote:

>if (@array % 2)
>{
>    $median = ($array[(@array-1)/2] + $array[(@array+1)/2])/2;
>}
>else
>{
>    $median = $array[@array/2];
>}

Actually, AIUI the index in the latter should be (@array-1)/2 (or
$#array/2), and the two calculations should be swapped. Of course,
this is IMHO a good place to use the ternary conditional
operator.[*]

[*] On a second thought, do "of course" and "IMHO" clash?

Michele
--
{\$_=pack'B8'x25,unpack'A8'x32,\$a^=sub{pop^pop}->(map substr
((\$a||=join'',map--\$|x\$_,(unpack'w',unpack'u','G^<R<Y]*YB='
..'KYU;*EVH[.FHF2W+#"\Z*5TI/ER<Z`S(G.DZZ9OX0Z')=~/./g)x2,\$_,
256),7,249);s/[^\w,]/ /g;\$ \=/^J/?\$/:"\r";print,redo}#JAPH,

Michele Dondi, Oct 6, 2007
11. ### Bill H (Guest)


Thanks for the help guys. I ended up using a combination of the
examples given:

sub getMedian
{
    my @values = @_;
    my @median = map { ($_) x $values[$_] } (0..5);
    my $m = int(@median / 2);
    if (@median % 2)
    {
        # odd count: take the single middle element
        $m = $median[$m];
    }
    else
    {
        # even count: average the two middle elements, truncated to a
        # whole rating
        $m = int(($median[$m - 1] + $median[$m]) / 2);
    }
    return ($m);
}

where I call it with:

$median = getMedian(@RATE);

I do end up creating the array, but I think it will be ok.

Bill H

Bill H, Oct 6, 2007
12. ### Bill H (Guest)

On Oct 6, 7:01 am, Bill H <> wrote:
> I do end up creating the array, but I think it will be ok.

After playing with it for a while I wonder if median is what I really
need. Logically, if you have 60 people rate the page at 0 and 30
people rate it at 5, then the page rating should be somewhere between
1 and 2, but using a median it would still be ranked at 0 (the middle
element in the array would be a 0). I know it ain't strictly Perl, but
any thoughts?

Bill H

Bill H, Oct 6, 2007
13. ### Guest

Bill H <> wrote in message-id: <>

> After playing with it for awhile I wonder if median is what I really
> need. Logically, if you have 60 people rate the page at 0 and 30
> people rate it at 5 then the page rating should be somewhere between 1
> and 2, but using a median it would still be ranked at a 0 (middle
> element in the array would be a 0). I know it aint strictly perl, but
> any thoughts?
>
> Bill H

Perhaps then you need the average; first create a total for each of
[...] by six for the average.

Here is a simple example which needs some hardening but may show the
concept fairly clearly.

#!/usr/bin/perl
use strict;
use warnings;

my @values = (2, 4, 3, 9, 4, 16);
my $total = 0;
my $average;

foreach my $i (0..5) {
    $total += $values[$i] || 0;
}
$average = $total / 6;
print "The average response is: [$average]\n";

, Oct 6, 2007
14. ### Doug Miller (Guest)

In article <>, Bill H <> wrote:

>After playing with it for awhile I wonder if median is what I really
>need.

Probably not.

>Logically, if you have 60 people rate the page at 0 and 30
>people rate it at 5 then the page rating should be somewhere between 1
>and 2,

((60 * 0) + (30 * 5)) / (60 + 30) = 150/90 = 1.667

>but using a median it would still be ranked at a 0 (middle
>element in the array would be a 0). I know it aint strictly perl, but
>any thoughts?

Use the mean instead. Or you could display a full statistical report: mean,
median, mode, and standard deviation.

--
Regards,
Doug Miller (alphageek at milmac dot com)

It's time to throw all their damned tea in the harbor again.

Doug Miller, Oct 6, 2007
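
The weighted mean worked out above, along with the mode and standard deviation, can be computed straight from the six counts. A minimal sketch, assuming the counts are passed as an array reference indexed by rating; `rating_stats` is a hypothetical name, and the standard deviation is the population form:

```perl
use strict;
use warnings;

# Hypothetical helper: mean, mode and population standard deviation from
# per-rating counts. $counts->[$k] is how many people gave rating $k.
sub rating_stats {
    my ($counts) = @_;
    my ($n, $sum, $sumsq, $mode) = (0, 0, 0, 0);
    for my $k (0 .. $#$counts) {
        my $c = $counts->[$k] || 0;
        $n     += $c;
        $sum   += $k * $c;
        $sumsq += $k * $k * $c;
        # keep the first rating with the largest count (ties go to the lower rating)
        $mode = $k if $c > ($counts->[$mode] || 0);
    }
    return unless $n;            # no ratings yet
    my $mean = $sum / $n;
    my $var  = $sumsq / $n - $mean ** 2;
    $var = 0 if $var < 0;        # guard against floating-point rounding
    return (mean => $mean, mode => $mode, stddev => sqrt $var);
}

# The 60-zeros / 30-fives case discussed above.
my %s = rating_stats([60, 0, 0, 0, 0, 30]);
printf "mean %.3f, mode %d, stddev %.3f\n", @s{qw(mean mode stddev)};
```

For that input the mean comes out at 150/90, the same 1.667 as in the arithmetic above, while the mode (like the median) is 0.
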
15. ### Michele Dondi (Guest)

On Fri, 05 Oct 2007 16:49:31 -0500, "Mumia W."
<> wrote:

>I know the mean can be calculated "on the fly"--without storing all of
>the values to be examined, but I can't see how this is to be done with
>the median; I don't think it's possible.

Sure it is possible:

#!/usr/bin/perl

use strict;
use warnings;
use List::Util 'sum';
use constant TESTS => 20;
use Test::More tests => TESTS;

# Brute force: expand the counts into the full sorted array.
sub naive {
    my @arr = map +($_) x $_[$_], 0..$#_;
    @arr % 2 ?
        $arr[(@arr-1)/2] :
        ($arr[@arr/2 - 1] + $arr[@arr/2])/2;
}

# Rating found at 0-based sorted position $i, without building the array.
sub findidx {
    my $i=shift;
    ($i -= $_[$_])<0 and return $_ for 0..$#_;
}

sub smart {
    my $t=sum @_;
    $t%2 ?
        findidx +($t-1)/2, @_ :
        (findidx($t/2-1, @_) + findidx($t/2, @_))/2;
}

for (1..TESTS) {
    my @a=map int rand 10, 0..5;
    is smart(@a), naive(@a), "Test @a";
}

__END__

Note: smart() is not very smart, because I feel the calculations
performed by findidx($t/2-1, @_) and findidx($t/2, @_) are very much
the same; but in my first attempt, with no helper sub, I always got
some failure. So this one, however bloated, at least serves as a
proof of concept that it is not necessary to go brute force.

>PS.
>I would have given this post a more descriptive subject line like:
>calculating median without using too much memory.

Seconded.

Michele
--
{\$_=pack'B8'x25,unpack'A8'x32,\$a^=sub{pop^pop}->(map substr
((\$a||=join'',map--\$|x\$_,(unpack'w',unpack'u','G^<R<Y]*YB='
..'KYU;*EVH[.FHF2W+#"\Z*5TI/ER<Z`S(G.DZZ9OX0Z')=~/./g)x2,\$_,
256),7,249);s/[^\w,]/ /g;\$ \=/^J/?\$/:"\r";print,redo}#JAPH,

Michele Dondi, Oct 6, 2007
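
The two findidx() lookups discussed above can be folded into a single pass over the counts. The following is only a sketch of one way to do it, not the approach any poster settled on; `median_one_pass` is a hypothetical name, and the counts are assumed to be non-negative integers indexed by rating.

```perl
use strict;
use warnings;

# Hypothetical helper: median from per-rating counts in one pass, with no
# expanded array. @_ holds the counts; index $k is the rating value.
sub median_one_pass {
    my @counts = @_;
    my $total = 0;
    $total += $_ for @counts;
    return unless $total;                  # no ratings yet

    my $lo_pos = int(($total - 1) / 2);    # lower middle position (0-based)
    my $hi_pos = int($total / 2);          # upper middle; same as $lo_pos when odd
    my $seen = 0;
    my $lo;
    for my $k (0 .. $#counts) {
        $seen += $counts[$k];
        # first rating whose cumulative count passes the lower middle position
        $lo = $k if !defined $lo && $seen > $lo_pos;
        # once the upper middle position is passed, both middle values are known
        return ($lo + $k) / 2 if $seen > $hi_pos;
    }
    return;    # not reached for non-negative integer counts
}

print "median = ", median_one_pass(3, 1, 0, 4, 1, 2), "\n";
```

Because only the six counts and a running total are touched, the cost is the same whether eleven people or several million have rated the page.
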