[QUIZ] Sampling (#39)

J

Jim Menard

A fun quiz. Thanks, James.

I've posted my solution at http://www.io.com/~jimm/rubyquiz/quiz39/, but
here's the whole page.

Here is my solution to Ruby Quiz 39: Sampling. This solution, which takes
roughly 1.75 minutes to run realtime (0.010 sec user time) is so short, I
present it here in its entirety:

#! /usr/bin/env ruby

require 'set'

def sample(members, range)
selected = Set.new
members.times {
val = rand(range)
val = rand(range) until selected.add?(val)
}
selected.to_a.sort
end

puts sample(ARGV[0].to_i, ARGV[1].to_i).join("\n")

This solution is completely out-of-the-box. I played with a few optimizations,
without looking for this algorithm anywhere online. Instead of using a Set
(which is backed by a Hash), I wanted to create a range-length array and mark
each selected value using that array, but ran out of memory with a range of 10e9.

Disk I/O took about 45 seconds of the total time. There's probably a way to
speed that up.

Jim
 
J

James Edward Gray II

However I've tended to avoid that approach,
since when the number of samples desired approaches the
upper_bound, e.g. 5_000_000 5_000_000 worst case, I've been
burned in the past by such algorithms working very hard to find
the last few empty slots.

Good thinking. ;)

James Edward Gray II
 
J

James Edward Gray II

I wrote a simple class to hold the different methods I used, where
@count is the number of members to be reported, @limit is the upper
bound of the range. @numbers is an Array and @numberHash a - guess
- Hash.

If you post the complete code, including the mentioned class, it
would make it easier for us to play with your solutions.

James Edward Gray II
 
J

James Edward Gray II

b) As soon as samples.size > total.size / 2 it's trivial to turn the
algorithm on its head (i.e. randomly select items that you'll skip,
then loop through the set and print all items that aren't selected).

That's a massive slowdown though, right? Because you now need to
count to 1_000_000_000 (for the quiz example.)

I agree that switching strategies is a good idea, but there are
better strategies for the high members counts and/or memory
problems. Take a look through the submissions...

James Edward Gray II
 
J

James Edward Gray II

I have wondered about this, since I first read this week's quiz:
Did your solution really need half an hour? What is it doing that
long?

Yes, mine really took half an hour. I assume it spent the time
counting to 1_000_000_000. Here's the code:

#!/usr/local/bin/ruby

usage = <<END_USAGE
Usage: #{File.basename(__FILE__)} MEMBERS LIMIT

MEMBERS: The number of members you want in the sample. (Integer)
LIMIT: The upper limit for the sample range. All (Integer)
members will between 0 and LIMIT - 1.

Note that MEMBERS must be less than LIMIT.
END_USAGE

unless ARGV.size == 2 and ARGV.all? { |num| num =~ /^[\d_]+$/ }
puts usage
exit
end

members, limit = ARGV.map { |num| Integer(num) }

unless members < limit
puts usage
exit
end

0.upto(limit - 1) do |num|
if rand(limit) % (limit - num) < members
puts num
members -= 1
end
end

__END__
Yes, I (and most of the others) didn't handle that special case. As
Joost Diepenmaat said in his reply to Bill Kelly's solution:

As soon as samples.size > total.size / 2 it's trivial to turn the
algorithm on its head (i.e. randomly select items that you'll skip,
then loop through the set and print all items that aren't selected).

See my reply to that message.

James Edward Gray II
 
J

James Edward Gray II

I have wondered about this, since I first read this week's quiz:
Did your solution really need half an hour? What is it doing that
long?

Yes, mine really took half an hour. I assume it spent the time
counting to 1_000_000_000. Here's the code:

#!/usr/local/bin/ruby

usage = <<END_USAGE
Usage: #{File.basename(__FILE__)} MEMBERS LIMIT

MEMBERS: The number of members you want in the sample. (Integer)
LIMIT: The upper limit for the sample range. All (Integer)
members will between 0 and LIMIT - 1.

Note that MEMBERS must be less than LIMIT.
END_USAGE

unless ARGV.size == 2 and ARGV.all? { |num| num =~ /^[\d_]+$/ }
puts usage
exit
end

members, limit = ARGV.map { |num| Integer(num) }

unless members < limit
puts usage
exit
end

0.upto(limit - 1) do |num|
if rand(limit) % (limit - num) < members
puts num
members -= 1
end
end

__END__


Yes, I (and most of the others) didn't handle that special case. As
Joost Diepenmaat said in his reply to Bill Kelly's solution:

As soon as samples.size > total.size / 2 it's trivial to turn the
algorithm on its head (i.e. randomly select items that you'll skip,
then loop through the set and print all items that aren't selected).

See my reply to that message.

James Edward Gray II
 
D

Daniel Brockman

Jim Menard said:
def sample(members, range)
selected = Set.new
members.times {
val = rand(range)
val = rand(range) until selected.add?(val)
}
selected.to_a.sort
end

I just wanted to point out that this,
val = rand(range)
val = rand(range) until selected.add?(val)

can also be written like this:

begin
val = rand(range)
end until selected.add?(val)
 
J

Jim Menard

Daniel said:
I just wanted to point out that this,




can also be written like this:

begin
val = rand(range)
end until selected.add?(val)

Much better. More readable. Thanks, Daniel.

Jim "Sentence Fragments" Menard
 
P

Paolo Capriotti

This is my solution. It's slow, but uses O(1) memory.

Paolo

---

# Idea of the algorithm:
# Let X_1, ..., X_m be random variables representing a
# uniform sampling of the interval [0, n[, with
# X_1 <=3D X_2 <=3D ... <=3D X_m.
# One can show that the discrete density of X_1 is the function
# p(k) =3D m/n * \prod_{i=3D0}^{k-1} (n-m-i)/(n-1-i).
# A random integer with this law is calculated, giving a value
# of, say, k_1.
# With respect to the conditional probability of {X_1 =3D k_1},
# (X_2, ..., X_n) is a uniform sampling of the interval [k_1 + 1, n[,
# hence the same argument can be applied, continuing in this fashion=20
# until all m members have been produced.

def sample(m, n)
c =3D 0
i =3D rand
p =3D ps =3D m / n.to_f
delta =3D n - m
m =3D m.to_f
=20
while n>0
n-=3D1
if ps > i
i =3D rand
m -=3D 1
p =3D ps =3D m / n
puts c
else
p =3D p * delta / n
ps +=3D p
delta -=3D 1
end
=20
c +=3D 1
end
end

if ARGV.size =3D=3D 2
m, n =3D ARGV.map(&method:)Integer))
sample(m, n)
end
 
S

Stefan Mahlitz

Olaf said:
Stefan Mahlitz said:
Why use a Hash?

Efficiency. Array#include? runs in linear time (i.e. the running time
is proportional to the size of the array because all elements have to
be tested sequentially), while Hash#[] runs in constant time (i.e. the
running time is independent of the size of the hash). So the
array-based solution will scale very badly for large numbers.

I knew that after the benchmarks. ;)

To ask the same question in another way: How would someone not being
aware of the advantages a Hash-lookup gives (compared to Array-lookup)
choose to implement the problem with a Hash? It is not obvious for
inexperienced programmers.

Well, I remembered after seeing the first solution with Hash (it has
been a long time since I wrote the last program).

Stefan Mahlitz
 
S

Stefan Mahlitz

James said:
On Jul 17, 2005, at 1:05 PM, Stefan Mahlitz wrote:

If you post the complete code, including the mentioned class, it would
make it easier for us to play with your solutions.

Sorry, forgot that.

I attached the file, hope it comes through.

# ruby quiz #39
#
# The challenge is to implement a program called "sample" that takes exactly two
# integer inputs. The first of those should be the number of members you would
# like included in the sample. The second is the upper boundary (exclusive) of
# the range of integers members can be selected from. The lower boundary is zero
# (inclusive).
#
# Your program should output exactly the requested number of members from the
# defined range to STDOUT, one number per line. Each member must be unique and
# they should appear in sorted order.

require 'benchmark'
include Benchmark

class Sampler
def initialize(count, limit)
@count, @limit = count, limit
if @count > @limit
raise(ArgumentError, "Can not generate unique numbers if count is bigger than limit.")
end
@numbers = []
@numberHash = {}
end

# generates @count random numbers
#
# after first generation looks whether uniq! deleted some numbers
#
# enters loop where each new iteration calls uniq!
def generate_with_uniq_first
@numbers.clear
@count.times do
@numbers << rand(@limit)
end

@numbers.uniq!
# there is a chance, that there are no unique number in
# @numbers, let's see whether our array contains @count elements

while @numbers.length < @count
# bad luck, there have been duplicate numbers
# try adding a new number and look again
@numbers << rand(@limit)
@numbers.uniq!
end
end

# generates @count random numbers
#
# generates missing count of numbers at once and calls
# uniq! to eliminate duplicates
def generate_with_uniq_second
@numbers.clear
while @numbers.length < @count
# generate missing numbers
(@count - @numbers.length).times do
@numbers << rand(@limit)
end
@numbers.uniq!
end
end

# generates @count random numbers
# checks for each whether it is already included in array
def generate_with_array_include
@numbers.clear
@count.times do
newNumber = rand(@limit)
if @numbers.include?(newNumber)
redo
else
@numbers << newNumber
end
end
end

# generates @count random numbers
# checks for each whether it is already included in hash (thanks Dominik Bathon)
def generate_with_hash_has_key
@numberHash.clear
@count.times do
newNumber = rand(@limit)
if @numberHash.has_key?(newNumber)
redo
else
@numberHash[newNumber] = nil
end
end
@numbers = @numberHash.keys
end

# Hash gives you uniq! for free, just fill the Hash until its length
# reaches the @count (thanks Joost Diepenmaat)
def generate_with_hash_length
@numberHash.clear
while @numberHash.length < @count
@numberHash[rand(@limit)] = nil
end
@numbers = @numberHash.keys
end

# benchmarks every method that contains "generate_"
def benchmark_generate
generateMethods = self.methods.grep(/generate_/).sort
bmbm(0) do |bench|
generateMethods.each {|aMethodName|
bench.report(aMethodName) {instance_eval(aMethodName)}
}
end
end

# output sorted
def to_s
@numbers.sort.join("\n")
end
end

sampler = Sampler.new(ARGV[0].to_i, ARGV[1].to_i)
sampler.benchmark_generate
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,756
Messages
2,569,535
Members
45,008
Latest member
obedient dusk

Latest Threads

Top