[QUIZ] Sampling (#39)

Jim Menard · Jul 18, 2005

A fun quiz. Thanks, James.

I've posted my solution at http://www.io.com/~jimm/rubyquiz/quiz39/, but
here's the whole page.

Here is my solution to Ruby Quiz 39: Sampling. This solution, which takes
roughly 1.75 minutes to run realtime (0.010 sec user time) is so short, I
present it here in its entirety:

#! /usr/bin/env ruby

require 'set'

def sample(members, range)
selected = Set.new
members.times {
val = rand(range)
val = rand(range) until selected.add?(val)
}
selected.to_a.sort
end

puts sample(ARGV[0].to_i, ARGV[1].to_i).join("\n")

This solution is completely out-of-the-box. I played with a few optimizations,
without looking for this algorithm anywhere online. Instead of using a Set
(which is backed by a Hash), I wanted to create a range-length array and mark
each selected value using that array, but ran out of memory with a range of 10e9.

Disk I/O took about 45 seconds of the total time. There's probably a way to
speed that up.

Jim

James Edward Gray II · Jul 18, 2005

However I've tended to avoid that approach,
since when the number of samples desired approaches the
upper_bound, e.g. 5_000_000 5_000_000 worst case, I've been
burned in the past by such algorithms working very hard to find
the last few empty slots.

Good thinking.

James Edward Gray II

James Edward Gray II · Jul 18, 2005

I wrote a simple class to hold the different methods I used, where
@count is the number of members to be reported, @limit is the upper
bound of the range. @numbers is an Array and @numberHash a - guess
- Hash.

If you post the complete code, including the mentioned class, it
would make it easier for us to play with your solutions.

James Edward Gray II

James Edward Gray II · Jul 18, 2005

b) As soon as samples.size > total.size / 2 it's trivial to turn the
algorithm on its head (i.e. randomly select items that you'll skip,
then loop through the set and print all items that aren't selected).

That's a massive slowdown though, right? Because you now need to
count to 1_000_000_000 (for the quiz example.)

I agree that switching strategies is a good idea, but there are
better strategies for the high members counts and/or memory
problems. Take a look through the submissions...

James Edward Gray II

James Edward Gray II · Jul 18, 2005

I have wondered about this, since I first read this week's quiz:
Did your solution really need half an hour? What is it doing that
long?

Yes, mine really took half an hour. I assume it spent the time
counting to 1_000_000_000. Here's the code:

#!/usr/local/bin/ruby

usage = <<END_USAGE
Usage: #{File.basename(__FILE__)} MEMBERS LIMIT

MEMBERS: The number of members you want in the sample. (Integer)
LIMIT: The upper limit for the sample range. All (Integer)
members will between 0 and LIMIT - 1.

Note that MEMBERS must be less than LIMIT.
END_USAGE

unless ARGV.size == 2 and ARGV.all? { |num| num =~ /^[\d_]+$/ }
puts usage
exit
end

members, limit = ARGV.map { |num| Integer(num) }

unless members < limit
puts usage
exit
end

0.upto(limit - 1) do |num|
if rand(limit) % (limit - num) < members
puts num
members -= 1
end
end

__END__

Yes, I (and most of the others) didn't handle that special case. As
Joost Diepenmaat said in his reply to Bill Kelly's solution:

As soon as samples.size > total.size / 2 it's trivial to turn the
algorithm on its head (i.e. randomly select items that you'll skip,
then loop through the set and print all items that aren't selected).

See my reply to that message.

James Edward Gray II

James Edward Gray II · Jul 18, 2005

I have wondered about this, since I first read this week's quiz:
Did your solution really need half an hour? What is it doing that
long?

Yes, mine really took half an hour. I assume it spent the time
counting to 1_000_000_000. Here's the code:

#!/usr/local/bin/ruby

usage = <<END_USAGE
Usage: #{File.basename(__FILE__)} MEMBERS LIMIT

MEMBERS: The number of members you want in the sample. (Integer)
LIMIT: The upper limit for the sample range. All (Integer)
members will between 0 and LIMIT - 1.

Note that MEMBERS must be less than LIMIT.
END_USAGE

unless ARGV.size == 2 and ARGV.all? { |num| num =~ /^[\d_]+$/ }
puts usage
exit
end

members, limit = ARGV.map { |num| Integer(num) }

unless members < limit
puts usage
exit
end

0.upto(limit - 1) do |num|
if rand(limit) % (limit - num) < members
puts num
members -= 1
end
end

__END__

Yes, I (and most of the others) didn't handle that special case. As
Joost Diepenmaat said in his reply to Bill Kelly's solution:

As soon as samples.size > total.size / 2 it's trivial to turn the
algorithm on its head (i.e. randomly select items that you'll skip,
then loop through the set and print all items that aren't selected).

See my reply to that message.

James Edward Gray II

Daniel Brockman · Jul 18, 2005

Jim Menard said:
def sample(members, range)
selected = Set.new
members.times {
val = rand(range)
val = rand(range) until selected.add?(val)
}
selected.to_a.sort
end

I just wanted to point out that this,

val = rand(range)
val = rand(range) until selected.add?(val)

can also be written like this:

begin
val = rand(range)
end until selected.add?(val)

Jim Menard · Jul 18, 2005

Daniel said:
I just wanted to point out that this,

can also be written like this:

begin
val = rand(range)
end until selected.add?(val)

Much better. More readable. Thanks, Daniel.

Jim "Sentence Fragments" Menard

Paolo Capriotti · Jul 18, 2005

This is my solution. It's slow, but uses O(1) memory.

Paolo

---

# Idea of the algorithm:
# Let X_1, ..., X_m be random variables representing a
# uniform sampling of the interval [0, n[, with
# X_1 <=3D X_2 <=3D ... <=3D X_m.
# One can show that the discrete density of X_1 is the function
# p(k) =3D m/n * \prod_{i=3D0}^{k-1} (n-m-i)/(n-1-i).
# A random integer with this law is calculated, giving a value
# of, say, k_1.
# With respect to the conditional probability of {X_1 =3D k_1},
# (X_2, ..., X_n) is a uniform sampling of the interval [k_1 + 1, n[,
# hence the same argument can be applied, continuing in this fashion=20
# until all m members have been produced.

def sample(m, n)
c =3D 0
i =3D rand
p =3D ps =3D m / n.to_f
delta =3D n - m
m =3D m.to_f
=20
while n>0
n-=3D1
if ps > i
i =3D rand
m -=3D 1
p =3D ps =3D m / n
puts c
else
p =3D p * delta / n
ps +=3D p
delta -=3D 1
end
=20
c +=3D 1
end
end

if ARGV.size =3D=3D 2
m, n =3D ARGV.map(&method

Integer))
sample(m, n)
end

Stefan Mahlitz · Jul 19, 2005

Olaf said:
Stefan Mahlitz said:

Why use a Hash?

Click to expand...

Efficiency. Array#include? runs in linear time (i.e. the running time
is proportional to the size of the array because all elements have to
be tested sequentially), while Hash#[] runs in constant time (i.e. the
running time is independent of the size of the hash). So the
array-based solution will scale very badly for large numbers.

I knew that after the benchmarks.

To ask the same question in another way: How would someone not being
aware of the advantages a Hash-lookup gives (compared to Array-lookup)
choose to implement the problem with a Hash? It is not obvious for
inexperienced programmers.

Well, I remembered after seeing the first solution with Hash (it has
been a long time since I wrote the last program).

Stefan Mahlitz

Stefan Mahlitz · Jul 19, 2005

James said:
On Jul 17, 2005, at 1:05 PM, Stefan Mahlitz wrote:

If you post the complete code, including the mentioned class, it would
make it easier for us to play with your solutions.

Sorry, forgot that.

I attached the file, hope it comes through.

# ruby quiz #39
#
# The challenge is to implement a program called "sample" that takes exactly two
# integer inputs. The first of those should be the number of members you would
# like included in the sample. The second is the upper boundary (exclusive) of
# the range of integers members can be selected from. The lower boundary is zero
# (inclusive).
#
# Your program should output exactly the requested number of members from the
# defined range to STDOUT, one number per line. Each member must be unique and
# they should appear in sorted order.

require 'benchmark'
include Benchmark

class Sampler
def initialize(count, limit)
@count, @limit = count, limit
if @count > @limit
raise(ArgumentError, "Can not generate unique numbers if count is bigger than limit.")
end
@numbers = []
@numberHash = {}
end

# generates @count random numbers
#
# after first generation looks whether uniq! deleted some numbers
#
# enters loop where each new iteration calls uniq!
def generate_with_uniq_first
@numbers.clear
@count.times do
@numbers << rand(@limit)
end

@numbers.uniq!
# there is a chance, that there are no unique number in
# @numbers, let's see whether our array contains @count elements

while @numbers.length < @count
# bad luck, there have been duplicate numbers
# try adding a new number and look again
@numbers << rand(@limit)
@numbers.uniq!
end
end

# generates @count random numbers
#
# generates missing count of numbers at once and calls
# uniq! to eliminate duplicates
def generate_with_uniq_second
@numbers.clear
while @numbers.length < @count
# generate missing numbers
(@count - @numbers.length).times do
@numbers << rand(@limit)
end
@numbers.uniq!
end
end

# generates @count random numbers
# checks for each whether it is already included in array
def generate_with_array_include
@numbers.clear
@count.times do
newNumber = rand(@limit)
if @numbers.include?(newNumber)
redo
else
@numbers << newNumber
end
end
end

# generates @count random numbers
# checks for each whether it is already included in hash (thanks Dominik Bathon)
def generate_with_hash_has_key
@numberHash.clear
@count.times do
newNumber = rand(@limit)
if @numberHash.has_key?(newNumber)
redo
else
@numberHash[newNumber] = nil
end
end
@numbers = @numberHash.keys
end

# Hash gives you uniq! for free, just fill the Hash until its length
# reaches the @count (thanks Joost Diepenmaat)
def generate_with_hash_length
@numberHash.clear
while @numberHash.length < @count
@numberHash[rand(@limit)] = nil
end
@numbers = @numberHash.keys
end

# benchmarks every method that contains "generate_"
def benchmark_generate
generateMethods = self.methods.grep(/generate_/).sort
bmbm(0) do |bench|
generateMethods.each {|aMethodName|
bench.report(aMethodName) {instance_eval(aMethodName)}
}
end
end

# output sorted
def to_s
@numbers.sort.join("\n")
end
end

sampler = Sampler.new(ARGV[0].to_i, ARGV[1].to_i)
sampler.benchmark_generate

Stefan Mahlitz · Jul 19, 2005

Stefan said:
Sorry, forgot that.

I attached the file, hope it comes through.

Comments - even on style - are very welcome!

[SOLUTION] [QUIZ] Sampling (#39)	1	Jul 18, 2005
[SUMMARY] Sampling (#39)	0	Jul 21, 2005
Minimum Total Difficulty	0	Nov 15, 2023
I would like to use awk to calculate the total number of records processed	1	Aug 25, 2022
[QUIZ] Befunge (#184)	24	Nov 22, 2008
[QUIZ] Counting Toothpicks (#111)	47	Jan 26, 2007
[QUIZ] Symbolify (#169)	74	Jul 11, 2008
[QUIZ] Bowling Scores (#181)	8	Oct 24, 2008

[QUIZ] Sampling (#39)

Jim Menard

James Edward Gray II

James Edward Gray II

James Edward Gray II

James Edward Gray II

James Edward Gray II

Daniel Brockman

Jim Menard

Paolo Capriotti

Stefan Mahlitz

Stefan Mahlitz

Stefan Mahlitz

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads