More math than perl...

Discussion in 'Perl Misc' started by Bill H, Oct 5, 2007.

  1. Bill H

    Bill H Guest


    I have a routine I am writing in perl that will give me the median for
    a 0 to 5 rating. The ratings are stored in a file and I load the
    values into 7 different variables, RATE0 - RATE5 and one called TOTAL.
    When a person rates a page I increment one of the RATE variables
    based on what they selected (0 - 5) and increment TOTAL so I have a
    running count (which is really just the sum of RATE0 - RATE5).

    The problem I have (and I hope I am explaining this right), to
    calculate a median, I have to make an array that contains all the
    values, sorted from low to high, and then look at the value of the
    element in the middle to get the median. As an example if I have the
    following (not real code, just an example of the logic):

    $RATE[0] = 3;
    $RATE[1] = 1;
    $RATE[2] = 0;
    $RATE[3] = 4;
    $RATE[4] = 1;
    $RATE[5] = 2;

    Then my array would be:

    @ARRAY = (0,0,0,1,3,3,3,3,4,5,5);

    And the median would be $ARRAY[5] or 3. With an even number of
    elements in @ARRAY I have to add the value below the middle and the
    value above the middle, divide by 2 to get the median.

    For a small sample this is no problem, but when the number of people
    who have rated it get in to the 1000's this array is going to be too
    cumbersome. Does anyone know of a simpler way to do it in perl without
    adding in modules or using alot of memory?

    Any / all ideas are welcomed, but please remember that the example I
    gave is just typed to give you an idea and is not any real code I am

    Bill H
    Bill H, Oct 5, 2007
    1. Advertisements

  2. BH> The problem I have (and I hope I am explaining this right), to
    BH> calculate a median, I have to make an array that contains all
    BH> the values, sorted from low to high, and then look at the
    BH> value of the element in the middle to get the median.

    So, let me see if I understand this right.

    You have five variables, $RATE0 through $RATE5, and each contains a
    count of how many people rated that page that number?

    You could probably do something slick, but the brute-force method
    looks like this:

    my @array = ((0) x $RATE0, (1) x $RATE1, (2) x $RATE2,
    (3) x $RATE3, (4) x $RATE4, (5) x $RATE5);

    my $median;

    if (@array % 2)
    $median = ($array[(@array-1)/2] + $array[(@array+1)/2])/2;
    $median = $array[@array/2];

    Alternately, if you did the sensible thing and kept $RATE0 through
    $RATE5 in an array, you could say, much more elegantly,

    my @array = map { ($_) x $RATE[$_] } (0..5);

    Charlton Wilbur, Oct 5, 2007
    1. Advertisements

  3. Bill H

    Mumia W. Guest

    I would just build the array in memory. On any reasonably modern system,
    you'll have to have millions of values before you run out of memory.

    I know the mean can be calculated "on the fly"--without storing all of
    the values to be examined, but I can't see how this is to be done with
    the median; I don't think it's possible.

    I would have given this post a more descriptive subject line like:
    calculating median without using too much memory.
    Mumia W., Oct 5, 2007
  4. Perhaps this is close to what you require:

    $ perl -le'
    my @RATES = ( 3, 1, 0, 4, 1, 2 );
    my $TOTAL = 11;

    my $half = int( $TOTAL / 2 );
    for my $i ( 0 .. $#RATES ) {
    if ( ( $half -= $RATES[ $i ] ) < 0 ) {
    print "Median = $i";
    Median = 3

    John W. Krahn, Oct 5, 2007
  5. Bill H

    Bill H Guest

    Thanks Charlton, but would this not still make a large array if the
    total number of people is high (unles I am missing something in it).

    Bill H
    Bill H, Oct 5, 2007
  6. Bill H

    xhoster Guest

    Compute the median directly from the structure you already have.

    use List::Util qw(sum);

    sub median_from_bins {
    my ($bins,$total)[email protected]_;
    $total=sum @$bins unless defined $total;
    my $sofar=0;
    for (my $x=0; $x<=5; $x++) {
    return $x if $sofar>$total/2;
    if ($sofar == $total/2) {
    my $y=$x+1;
    $y++ until $bins->[$y];
    return ($x+$y)/2;
    die "Should never get here $x $sum $total @$bins";

    my $median = median_from_bins(\@RATE,$TOTAL);


    -------------------- http://NewsReader.Com/ --------------------
    The costs of publication of this article were defrayed in part by the
    payment of page charges. This article must therefore be hereby marked
    advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
    this fact.
    xhoster, Oct 5, 2007

  7. [ it is bad 'net manners to quote .sigs ...]

    How many hundreds of thousands of people do you expect
    will take your survey?
    Tad McClellan, Oct 6, 2007
  8. Bill, If keeping memory use low is a priority and you indeed need the median
    of thousands of ratings then for your data you can probably safely use the
    mean value
    since as your data count increases the mean and median will converge.
    INT($mean) could give you a whole number if needed.
    Cheers, Peter
    Peter Jamieson, Oct 6, 2007
  9. Bill H

    Mumia W. Guest

    This kind of array is what Bill wanted to avoid creating.
    How is that simpler than this?

    use POSIX 'ceil';
    print "median = ", $ARRAY[ceil(@ARRAY/2)], "\n";
    Bill said he didn't want to use any modules.
    Mumia W., Oct 6, 2007
  10. Actually, AIUI the index of latter should be (@array-1)/2 (or
    $#array/2) and the two should calculations should be swapped. Of
    course, this is IMHO a good place where to use the ternary conditional

    [*] On a second thought, do "of course" and "IMHO" clash?

    Michele Dondi, Oct 6, 2007
  11. Bill H

    Bill H Guest

    Thanks for the help guys. I ended up using a combination of the
    examples given:

    sub getMedian
    my @values = @_;
    my @median = map { ($_) x $values[$_] } (0..5);
    my $m = int(@median / 2);
    if ($m != @median / 2)
    $m = int(($median[$m] + $median[$m + 1]) / 2);
    $m = $median[$m];
    return ($m);

    where I call it with:

    $median = getMedian(@RATE);

    I do end up creating the array, but I think it will be ok.

    Bill H
    Bill H, Oct 6, 2007
  12. Bill H

    Bill H Guest

    After playing with it for awhile I wonder if median is what I really
    need. Logically, if you have 60 people rate the page at 0 and 30
    people rate it at 5 then the page rating should be somewhere between 1
    and 2, but using a median it would still be ranked at a 0 (middle
    element in the array would be a 0). I know it aint strictly perl, but
    any thoughts?

    Bill H
    Bill H, Oct 6, 2007
  13. Bill H

    QoS Guest

    Perhaps then you need the average; first create a total for each of
    your possible values and then add these totals together, then divide
    by six for the average.

    Here is a simple example which needs some hardening but may show the
    concept fairly clearly.

    use strict;
    use warnings;

    my @values = (2, 4, 3, 9, 4, 16);
    my $total = 0;
    my $average;

    foreach my $i (0..5) {
    $total += $values[$i] || 0;
    $average = $total / 6;
    print "The average response is: [$average]\n";
    QoS, Oct 6, 2007
  14. Bill H

    Doug Miller Guest

    Probably not.
    ((60 * 0) + (30 * 5)) / (60 + 30) = 150/90 = 1.667
    Use the mean instead. Or you could display a full statistical report: mean,
    median, mode, and standard deviation. :)
    Doug Miller, Oct 6, 2007
  15. Sure it is possible:


    use strict;
    use warnings;
    use List::Util 'sum';
    use constant TESTS => 20;
    use Test::More tests => TESTS;

    sub naive {
    my @arr = map +($_) x $_[$_], 0..$#_;
    @arr % 2 ?
    @arr[(@arr-1)/2] :
    (@arr[@arr/2 - 1] + @arr[@arr/2])/2;

    sub findidx {
    my $i=shift;
    ($i -= $_[$_])<0 and return $_ for 0..$#_;

    sub smart {
    my $t=sum @_;
    $t%2 ?
    findidx +($t-1)/2, @_ :
    (findidx($t/2-1, @_) + findidx($t/2, @_))/2;

    for (1..TESTS) {
    my @a=map int rand 10, 0..5;
    is smart(@a), naive(@a), "Test @a";


    Note: it is to be noted here that smart() is not very smart because I
    feel the calculations performed by (findidx($t/2-1, @_) and
    findidx($t/2, @_) are very much the same, but in the first attempt
    with no helper sub I always got some failing error, so this one
    however bloated at least shows as a proff of concept that it is not
    necessary to go brute force.

    Michele Dondi, Oct 6, 2007
    1. Advertisements

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments (here). After that, you can post your question and our members will help you out.