[SUMMARY] Sampling (#39)

Discussion in 'Ruby' started by Ruby Quiz, Jul 21, 2005.

  1. Ruby Quiz

    Ruby Quiz Guest

    To me, the solutions to this week's quiz are a fascinating combination of the
    various degrees of correct, dangerous, and fast. Let's examine a few different
    solutions and I'll try to show you why I think that.

    The fast characteristic seemed to be in abundant supply. Ezra Zygmuntowicz
    started scaring people with time readouts before solutions were even posted.
    Let's start with that code:

    members, limit, index = ARGV[0].to_i, ARGV[1].to_i, 0
    member_range = limit / members
    0.upto(members-1) do
    res = rand member_range
    puts res + index
    index += member_range

    How does this work?

    We can see that it reads in the two settings and initializes an index to zero.
    Next if calculates a member_range by dividing limit by members. That line is
    when this code started to smell fishy to me.

    The upto() iterator does all the work. It generates a random number in the
    member_range, adds that to the index and spits it out, then bumps index by
    member_range. This happens members times.

    I think it's important to think about how this works. Let's assume we want two
    members with a limit of ten. That's going to give us a member_range of five.
    So we'll chose a random number between zero and four, add zero to it (index),
    and spit that out. Then bump index by five and repeat. Put another way, we'll
    choose a random member of the first half of the range and then a random member
    from the second half of the range. We may get one and nine or even four and
    five, but we'll never see seven and eight since they're both in the same half of
    the range.

    That's a lightning quick solution. It solves the quiz example in 26 seconds on
    my box. However, the real question is, is it correct? Obviously, that's
    subjective. My opinion is that I can find combinations of the range that it
    will not choose and I don't consider that to be meeting the requirement of
    randomly selecting members from the (amended) quiz. Many of us are probably
    familiar with the famous quote:

    "Anyone who considers arithmetical methods of producing random digits is,
    of course, in a state of sin." --Von Neumann

    I think that's probably more applicable to the above code than it is to random
    number generation in general.

    Let's look at another solution. This one is from Dominik Bathon:


    if $DEBUG
    def ptime(evt)
    $ptimeg ||= Time.now.to_f
    STDERR.puts "#{evt} at #{Time.now.to_f - $ptimeg}"
    def ptime(evt)
    # noop

    # the actuall sampling, just store the seen values in a hash and return the
    # sorted hash keys
    def sample(cnt, lim)
    x = {}
    tmp = nil

    for i in 0...cnt
    while x.has_key?(tmp = rand(lim))
    x[tmp] = true
    ptime "rand"

    x = x.keys.sort
    ptime "sort"

    # this is the key to success, but needs lots of ram

    ptime "start"

    x = sample(cnt=ARGV[0].to_i, ARGV[1].to_i)

    # creating the newline string only once saves 5s
    nl = "\n"
    i = 0
    while i+10 <= cnt
    # this is saves about 1s
    print x, nl, x[i+1], nl, x[i+2], nl, x[i+3], nl, x[i+4],
    nl, x[i+5], nl, x[i+6], nl, x[i+7], nl, x[i+8], nl, x[i+9], nl
    i += 10
    for j in i...cnt
    print x[j], nl
    ptime "print"

    I don't want to focus too much on the details here. The first chunk of code
    just defines a timing helper method and the last chunk it some heavily optimized
    printing. The actual solution is the sample() method in the middle. The
    comment before the method tells you exactly how it works, so I'm not going to
    insult your intelligence by repeating it. This is a great time to bring up an
    interesting question that came up on Ruby Talk though:

    "How would someone not being aware of the advantages a Hash-lookup gives
    (compared to Array-lookup) choose to implement the problem with a Hash?
    It is not obvious for inexperienced programmers." --Stefan Mahlitz

    I think the keyword in what we're trying to find here is "unique". Whenever I
    think unique, I start thinking about a Hash. The keys of a Hash are unique, by
    definition. In order to find out if something is unique in an Array, you've got
    to walk it. In order to find out if something if something exists in a Hash,
    you just need to check for that one key. That's slower than a single Array
    lookup, but much faster than five million Array lookups. Get into the habit of
    making that quick mental translation: Unique (almost always) means Hash.

    Back to Dominik's code.

    Again, this code is fast. It solves the quiz example in 40 seconds for me. My
    main concern is what Dominik reported in the submission message:

    "But the "real solution" is GC.disable. That has a downside of course. The
    above run of sample.rb needs approx. 400MB of RAM. So, don't try this at
    home if you have less than 512MB ;-)" --Dominik Bathon

    I'm probably a little old school, but that warning scares me. What if we add a
    zero to both numbers in the quiz example? If I looked into some code and saw
    that it was using a lot of memory, I don't think I would try untying the memory
    safety net with GC.disable(). Clearly this solution works and is very fast for
    certain samples on certain hardware, but I think it needs some caution tape
    warning users and apparently Dominik does too.

    If you like the Hash idea, here's the cleanest version I saw from Jim Menard
    with a slight syntax change suggested by Daniel Brockman:

    #!/usr/bin/env ruby

    require 'set'

    def sample(members, range)
    selected = Set.new
    members.times {
    val = rand(range)
    end until selected.add?(val)

    puts sample(ARGV[0].to_i, ARGV[1].to_i).join("\n")

    This solution uses the Ruby set library. (Sets are implemented using Hashes,
    because of the aforementioned unique behavior.) This version just adds a random
    choice to the selected set, member times. The add?() method ensures that it was
    a new member for the set, causing the code to loop until it returns true. The
    results are then converted to an Array, sorted, and printed.

    This solution took two minutes and ten seconds with the quiz example on my box.
    It still allocates a good sized chunk of memory, but significantly less than
    Dominik's code (around 150 MB).

    The other issue with these kinds of solutions is collisions. In the quiz
    example the members requested are much less than the limit. However, as those
    two numbers get closer together, random selections will be repeats a lot more
    often. That can reverse the nice runtimes of these solutions. Here's a simple
    script to demonstrate collisions:

    #!/usr/local/bin/ruby -w

    members, limit = ARGV.map { |n| Integer(n) }

    choices = Hash.new
    collisions = 0

    until choices.size == members
    choice = rand(limit)
    if choices.include? choice
    collisions += 1
    choices[rand(limit)] = true

    warn "#{collisions} collisions"

    Now watch a few runs of that script

    $ ruby collisions.rb 2 10
    0 collisions
    $ ruby collisions.rb 2 10
    1 collisions
    $ ruby collisions.rb 2 10
    0 collisions
    $ ruby collisions.rb 2 10
    0 collisions
    $ ruby collisions.rb 9 10
    25 collisions
    $ ruby collisions.rb 9 10
    21 collisions
    $ ruby collisions.rb 9 10
    50 collisions
    $ ruby collisions.rb 9_999 10_000
    39205236 collisions
    $ time ruby collisions.rb 9_999 10_000
    35249691 collisions

    real 1m13.996s
    user 1m13.876s
    sys 0m0.084s

    As you can see the number of collisions can climb very fast and the runtime is
    quickly dropping off. Joost Diepenmaat suggested that it might be a good idea
    to switch algorithms when members > limit / 2. Joost's idea was simply to
    reverse the above algorithm, looking for what not to include, but here's a
    different option from Matthew D Moss:


    (k, n) = ARGV.map { |s| s.to_i }
    n.times do |i|
    r = rand * (n - i)
    unless (k < r)
    puts i
    k -= 1

    This solution does not use memory to store the numbers or even need to call
    sort() at any point. It works by walking the entire range and randomly
    selecting which numbers to include. You can see that it selects a random number
    in what's left of the range on each iteration. When the number of members
    needed is less than that choice, the code just moves on to the next number.
    However, when members is equal to that choice, the number is selected (printed)
    and the remaining members count is dropped. The behavior ensure that we get
    enough numbers, with each number having an equal chance at selection.

    The minus with this one is speed. Walking all those numbers takes time. This
    algorithm is very similar to my own, used to make the quiz. You saw how high
    that runtime got. However, the code's not gobbling up memory and it is going to
    find a proper sample, eventually.

    That concludes our tour of the solutions. As you can see, there are many
    trade-offs to be made over this problem. If you want super speed and can spare
    the memory, a Hash based solution is a good answer. Otherwise, something like
    Matthew's solution is probably the best choice. It's fast enough on smaller
    inputs, it doesn't keep eating memory, and it gives a good random sample.

    Many, many thanks to all who decided to jump in on this problem. The solutions
    and discussion were all vary insightful.

    Tomorrow we will find out if heaps grow on trees...
    Ruby Quiz, Jul 21, 2005
    1. Advertisements

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. ALuPin


    ALuPin, Mar 11, 2005, in forum: VHDL
  2. Albretch
    Nov 29, 2004
  3. steflhermitte
    Apr 21, 2005
  4. Steven Bethard

    sampling items from a nested list

    Steven Bethard, Feb 16, 2005, in forum: Python
    Felix Wiemann
    Feb 17, 2005
  5. Replies:
  6. Brian Quinlan

    Sampling a population

    Brian Quinlan, Jun 2, 2006, in forum: Python
    Roger Miller
    Jun 2, 2006
  7. Marco

    sampling rate

    Marco, Aug 21, 2006, in forum: VHDL
    Thomas Fischer
    Aug 24, 2006
  8. wanwan

    sampling rate problem

    wanwan, Nov 10, 2006, in forum: Java
    Karl Uppiano
    Nov 10, 2006