[SUMMARY] Sampling (#39)

R

Ruby Quiz

To me, the solutions to this week's quiz are a fascinating combination of the
various degrees of correct, dangerous, and fast. Let's examine a few different
solutions and I'll try to show you why I think that.

The fast characteristic seemed to be in abundant supply. Ezra Zygmuntowicz
started scaring people with time readouts before solutions were even posted.
Let's start with that code:

#!/usr/local/bin/ruby
members, limit, index = ARGV[0].to_i, ARGV[1].to_i, 0
member_range = limit / members
0.upto(members-1) do
res = rand member_range
puts res + index
index += member_range
end

How does this work?

We can see that it reads in the two settings and initializes an index to zero.
Next if calculates a member_range by dividing limit by members. That line is
when this code started to smell fishy to me.

The upto() iterator does all the work. It generates a random number in the
member_range, adds that to the index and spits it out, then bumps index by
member_range. This happens members times.

I think it's important to think about how this works. Let's assume we want two
members with a limit of ten. That's going to give us a member_range of five.
So we'll chose a random number between zero and four, add zero to it (index),
and spit that out. Then bump index by five and repeat. Put another way, we'll
choose a random member of the first half of the range and then a random member
from the second half of the range. We may get one and nine or even four and
five, but we'll never see seven and eight since they're both in the same half of
the range.

That's a lightning quick solution. It solves the quiz example in 26 seconds on
my box. However, the real question is, is it correct? Obviously, that's
subjective. My opinion is that I can find combinations of the range that it
will not choose and I don't consider that to be meeting the requirement of
randomly selecting members from the (amended) quiz. Many of us are probably
familiar with the famous quote:

"Anyone who considers arithmetical methods of producing random digits is,
of course, in a state of sin." --Von Neumann

I think that's probably more applicable to the above code than it is to random
number generation in general.

Let's look at another solution. This one is from Dominik Bathon:

#!/usr/local/bin/ruby

if $DEBUG
def ptime(evt)
$ptimeg ||= Time.now.to_f
STDERR.puts "#{evt} at #{Time.now.to_f - $ptimeg}"
end
else
def ptime(evt)
# noop
end
end

# the actuall sampling, just store the seen values in a hash and return the
# sorted hash keys
def sample(cnt, lim)
x = {}
tmp = nil

for i in 0...cnt
while x.has_key?(tmp = rand(lim))
end
x[tmp] = true
end
ptime "rand"

x = x.keys.sort
ptime "sort"
x
end

# this is the key to success, but needs lots of ram
GC.disable

ptime "start"

x = sample(cnt=ARGV[0].to_i, ARGV[1].to_i)

# creating the newline string only once saves 5s
nl = "\n"
i = 0
while i+10 <= cnt
# this is saves about 1s
print x, nl, x[i+1], nl, x[i+2], nl, x[i+3], nl, x[i+4],
nl, x[i+5], nl, x[i+6], nl, x[i+7], nl, x[i+8], nl, x[i+9], nl
i += 10
end
for j in i...cnt
print x[j], nl
end
ptime "print"

I don't want to focus too much on the details here. The first chunk of code
just defines a timing helper method and the last chunk it some heavily optimized
printing. The actual solution is the sample() method in the middle. The
comment before the method tells you exactly how it works, so I'm not going to
insult your intelligence by repeating it. This is a great time to bring up an
interesting question that came up on Ruby Talk though:

"How would someone not being aware of the advantages a Hash-lookup gives
(compared to Array-lookup) choose to implement the problem with a Hash?
It is not obvious for inexperienced programmers." --Stefan Mahlitz

I think the keyword in what we're trying to find here is "unique". Whenever I
think unique, I start thinking about a Hash. The keys of a Hash are unique, by
definition. In order to find out if something is unique in an Array, you've got
to walk it. In order to find out if something if something exists in a Hash,
you just need to check for that one key. That's slower than a single Array
lookup, but much faster than five million Array lookups. Get into the habit of
making that quick mental translation: Unique (almost always) means Hash.

Back to Dominik's code.

Again, this code is fast. It solves the quiz example in 40 seconds for me. My
main concern is what Dominik reported in the submission message:

"But the "real solution" is GC.disable. That has a downside of course. The
above run of sample.rb needs approx. 400MB of RAM. So, don't try this at
home if you have less than 512MB ;-)" --Dominik Bathon

I'm probably a little old school, but that warning scares me. What if we add a
zero to both numbers in the quiz example? If I looked into some code and saw
that it was using a lot of memory, I don't think I would try untying the memory
safety net with GC.disable(). Clearly this solution works and is very fast for
certain samples on certain hardware, but I think it needs some caution tape
warning users and apparently Dominik does too.

If you like the Hash idea, here's the cleanest version I saw from Jim Menard
with a slight syntax change suggested by Daniel Brockman:

#!/usr/bin/env ruby

require 'set'

def sample(members, range)
selected = Set.new
members.times {
begin
val = rand(range)
end until selected.add?(val)
}
selected.to_a.sort
end

puts sample(ARGV[0].to_i, ARGV[1].to_i).join("\n")

This solution uses the Ruby set library. (Sets are implemented using Hashes,
because of the aforementioned unique behavior.) This version just adds a random
choice to the selected set, member times. The add?() method ensures that it was
a new member for the set, causing the code to loop until it returns true. The
results are then converted to an Array, sorted, and printed.

This solution took two minutes and ten seconds with the quiz example on my box.
It still allocates a good sized chunk of memory, but significantly less than
Dominik's code (around 150 MB).

The other issue with these kinds of solutions is collisions. In the quiz
example the members requested are much less than the limit. However, as those
two numbers get closer together, random selections will be repeats a lot more
often. That can reverse the nice runtimes of these solutions. Here's a simple
script to demonstrate collisions:

#!/usr/local/bin/ruby -w

members, limit = ARGV.map { |n| Integer(n) }

choices = Hash.new
collisions = 0

until choices.size == members
choice = rand(limit)
if choices.include? choice
collisions += 1
else
choices[rand(limit)] = true
end
end

warn "#{collisions} collisions"

Now watch a few runs of that script

$ ruby collisions.rb 2 10
0 collisions
$ ruby collisions.rb 2 10
1 collisions
$ ruby collisions.rb 2 10
0 collisions
$ ruby collisions.rb 2 10
0 collisions
$ ruby collisions.rb 9 10
25 collisions
$ ruby collisions.rb 9 10
21 collisions
$ ruby collisions.rb 9 10
50 collisions
$ ruby collisions.rb 9_999 10_000
39205236 collisions
$ time ruby collisions.rb 9_999 10_000
35249691 collisions

real 1m13.996s
user 1m13.876s
sys 0m0.084s

As you can see the number of collisions can climb very fast and the runtime is
quickly dropping off. Joost Diepenmaat suggested that it might be a good idea
to switch algorithms when members > limit / 2. Joost's idea was simply to
reverse the above algorithm, looking for what not to include, but here's a
different option from Matthew D Moss:

#!/usr/local/bin/ruby

(k, n) = ARGV.map { |s| s.to_i }
n.times do |i|
r = rand * (n - i)
unless (k < r)
puts i
k -= 1
end
end

This solution does not use memory to store the numbers or even need to call
sort() at any point. It works by walking the entire range and randomly
selecting which numbers to include. You can see that it selects a random number
in what's left of the range on each iteration. When the number of members
needed is less than that choice, the code just moves on to the next number.
However, when members is equal to that choice, the number is selected (printed)
and the remaining members count is dropped. The behavior ensure that we get
enough numbers, with each number having an equal chance at selection.

The minus with this one is speed. Walking all those numbers takes time. This
algorithm is very similar to my own, used to make the quiz. You saw how high
that runtime got. However, the code's not gobbling up memory and it is going to
find a proper sample, eventually.

That concludes our tour of the solutions. As you can see, there are many
trade-offs to be made over this problem. If you want super speed and can spare
the memory, a Hash based solution is a good answer. Otherwise, something like
Matthew's solution is probably the best choice. It's fast enough on smaller
inputs, it doesn't keep eating memory, and it gives a good random sample.

Many, many thanks to all who decided to jump in on this problem. The solutions
and discussion were all vary insightful.

Tomorrow we will find out if heaps grow on trees...
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,770
Messages
2,569,583
Members
45,075
Latest member
MakersCBDBloodSupport

Latest Threads

Top