[QUIZ] Sampling (#39)

R

Ruby Quiz

The three rules of Ruby Quiz:

1. Please do not post any solutions or spoiler discussion for this quiz until
48 hours have passed from the time on this message.

2. Support Ruby Quiz by submitting ideas as often as you can:

http://www.rubyquiz.com/

3. Enjoy!

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

This week's Ruby quiz is a classic sampling problem that I've seen many
interesting solutions for in the past.

The challenge is to implement a program called "sample" that takes exactly two
integer inputs. The first of those should be the number of members you would
like included in the sample. The second is the upper boundary (exclusive) of
the range of integers members can be selected from. The lower boundary is zero
(inclusive).

Your program should output exactly the requested number of members from the
defined range to STDOUT, one number per line. Each member must be unique and
they should appear in sorted order.

Here are some sample runs of my solution to illustrate the task:

$ ./sample
Usage: sample MEMBERS LIMIT

MEMBERS: The number of members you want in the sample. (Integer)
LIMIT: The upper limit for the sample range. All (Integer)
members will between 0 and LIMIT - 1.

Note that MEMBERS must be less than LIMIT.
$ ./sample 3 10
0
1
5
$ ./sample 3 10
1
2
8
$ ./sample 3 10
2
3
9
$ ./sample 9 10
0
2
3
4
5
6
7
8
9
$ ./sample 20 400
1
3
4
25
32
36
42
56
93
111
137
139
153
213
222
226
263
289
293
314

Now for all the algorithm junkies out there that enjoy a little friendly
competition, here's the time to beat:

$ time ./sample 5_000_000 1_000_000_000 > big_sample.txt

real 30m10.593s
user 29m54.544s
sys 0m4.736s
$ ls -l big_sample.txt
-rw-r--r-- 1 james staff 49011436 Jul 11 15:26 big_sample.txt
$ head big_sample.txt
112
221
450
655
900
1216
1399
1469
1494
1628
$ tail big_sample.txt
999995325
999996330
999996552
999996682
999997426
999997676
999998253
999999153
999999658
999999693
 
J

Jim Menard

Ruby said:
The challenge is to implement a program called "sample" that takes exactly two
integer inputs. The first of those should be the number of members you would
like included in the sample. The second is the upper boundary (exclusive) of
the range of integers members can be selected from. The lower boundary is zero
(inclusive).

Your program should output exactly the requested number of members from the
defined range to STDOUT, one number per line. Each member must be unique and
they should appear in sorted order.

The description does not say that the output needs to be randomly selected
from the input range. Without that, printing the first <members> numbers would
be sufficient---and really fast.

(0...members).each { | i | puts i }

Should the output be randomly selected from the range of input values?

Jim
 
J

James Edward Gray II

Now for all the algorithm junkies out there that enjoy a little
friendly
competition, here's the time to beat:

$ time ./sample 5_000_000 1_000_000_000 > big_sample.txt

real 30m10.593s
user 29m54.544s
sys 0m4.736s

P.S. Taunting each other with fast time readouts is not a spoiler.
Feel free! ;)

James Edward Gray II
 
J

James Edward Gray II

The description does not say that the output needs to be randomly
selected from the input range. Without that, printing the first
<members> numbers would be sufficient---and really fast.

(0...members).each { | i | puts i }

Should the output be randomly selected from the range of input values?

Egad, yes. Massive oversight on my part. Sorry!

James Edward Gray II
 
W

Wybo Dekker

$ time ruby -e '5_000_000.times do puts rand(1_000_000_000) end' >big_sample.txt
elap 16.612
user 16.421
syst 0.177
CPU 99.91%

$ wc big_sample.txt
5000000 5000000 49446636 big_sample.txt

or do I miss something??
 
B

Belorion

$ time ruby -e '5_000_000.times do puts rand(1_000_000_000) end' >big_sam= ple.txt
elap 16.612
user 16.421
syst 0.177
CPU 99.91%
=20
$ wc big_sample.txt
5000000 5000000 49446636 big_sample.txt
=20
or do I miss something??

The above example will result in duplicate members. Every number in
the list must be unique.

Matt
 
J

jason r tibbetts

James,

What are your machine's specs, just for comparison's sake? I don't want
to get too discouraged by my 5+ year-old Sun beater workstation. :)

I've also found that the printing of the results (in the form of an
array) is taking several minutes in and of itself. Since optimizing the
printout probably isn't the point of the quiz, can anyone recommend
something faster than arr.each{ | elem | p elem } ?
 
S

Stefan Lang

James,

What are your machine's specs, just for comparison's sake? I don't want
to get too discouraged by my 5+ year-old Sun beater workstation. :)

I've also found that the printing of the results (in the form of an
array) is taking several minutes in and of itself. Since optimizing the
printout probably isn't the point of the quiz, can anyone recommend
something faster than arr.each{ | elem | p elem } ?

I guess "puts arr" is faster.
 
J

jason r tibbetts

jason said:
James,

What are your machine's specs, just for comparison's sake? I don't want
to get too discouraged by my 5+ year-old Sun beater workstation. :)

I've also found that the printing of the results (in the form of an
array) is taking several minutes in and of itself. Since optimizing the
printout probably isn't the point of the quiz, can anyone recommend
something faster than arr.each{ | elem | p elem } ?

With the aforementioned old Sun box (SunOS dakota 5.8 Generic_108528-27
sun4u sparc SUNW,Ultra-5_10, 440MHz):

Attempt #1:
time ./sample.rb 5_000_000 1_000_000_000 > big_sample.txt
955.00u 46.42s 18:41.51 89.2%

Attempt #2:
time ./sample-logn2.rb 5_000_000 1_000_000_000 > big_sample.txt
388.37u 50.92s 9:40.11 75.7%

tail big_sample.txt
999998043
999998053
999998441
999998442
999998587
999999146
999999158
999999418
999999439
999999517
 
J

Jim Menard

tmp> time ruby sampling.rb 5_000_000 1_000_000_000 >big_sample.txt

real 2m36.816s
user 0m0.010s
sys 0m0.000s
tmp> head big_sample.txt
221
248
508
769
776
857
1186
1395
1434
1868
tmp> tail big_sample.txt
999997722
999998019
999998183
999998422
999998885
999998919
999999068
999999338
999999660
999999806

Jim
--
Jim Menard, (e-mail address removed), http://www.io.com/~jimm
Dash dash space newline
Four-line witty quotation
Perfect message end
-- Donald Welsh in rec.humor.oracle.d
 
E

Ezra Zygmuntowicz

P.S. Taunting each other with fast time readouts is not a
spoiler. Feel free! ;)

James Edward Gray II
ezra:~/Sites ez$ time ./sample.rb 5_000_000 1_000_000_000 >
big_sample.txt

real 0m43.838s
user 0m35.820s
sys 0m1.360s
ezra:~/Sites ez$ ls -l big_sample.txt
-rw-r--r-- 1 ez ez 49444445 Jul 15 10:46 big_sample.txt
ezra:~/Sites ez$ head big_sample.txt
168
285
566
604
912
1183
1335
1473
1728
1919
ezra:~/Sites ez$ tail big_sample.txt
999998155
999998313
999998484
999998680
999998825
999999151
999999330
999999465
999999621
999999877
ezra:~/Sites ez$

-Ezra Zygmuntowicz
Yakima Herald-Republic
WebMaster
509-577-7732
(e-mail address removed)
 
C

Cassio Pennachin

P.S. Taunting each other with fast time readouts is not a spoiler.
Feel free! ;)


galadriel:~->time ./sampler.rb 5000000 1000000000 >! tmp3.txt
38.450u 1.005s 0:48.85 80.7% 0+0k 0+0io 0pf+0w
galadriel:~->head tmp3.txt
11
160
206
307
467
504
647
743
890
936
galadriel:~->tail tmp3.txt
462469526
462469642
462469741
462469894
462470051
462470190
462470246
462470344
462470366
462470382

--=20
Cassio Pennachin
 
J

Jim Freeze

$ time ./sample 5_000_000 1_000_000_000 > big_sample.txt
Here are my results on a lowly powerbook:

time ./sample 5_000_000 1_000_000_000 > /dev/null
136.990u 1.120s 2:50.91 80.8% 0+0k 0+0io 0pf+0w

time ./sample 5_000_000 1_000_000_000 > big_sample.txt
222.100u 2.500s 3:53.70 96.1% 0+0k 1+28io 0pf+0w

% head big_sample.txt
402
421
549
571
757
1025
1224
1604
2064
2602
% tail big_sample.txt
999998571
999998613
999998644
999998648
999998841
999999116
999999414
999999609
999999609
999999915

Here's my typical interaction with a program.
(I usually seem to be able to find all the ways
that don't work first.)
I've included a source listing of the application
part of the code to demonstrate commandline.

% ./sample
Usage: sample [-h] number limit

% ./sample -h
NAME

sample - Generate a list of samples

SYNOPSIS

sample [-h] number limit

OPTIONS

--help,-h
Displays help page.

% ./sample 10 9
ERROR: Number must be less than or equal to limit.

% ./sample 10 11.1
ERROR: Argument 'limit' with value '11.1' must be an integer.
Usage: sample [-h] number limit

% ./sample 10.1 11
ERROR: Argument 'number' with value '10.1' must be an integer.
Usage: sample [-h] number limit

% ./sample 3 10
3
4

% ./sample 5 20
6
7
9
14
18


% cat sample
#!/usr/bin/env ruby

begin
require 'commandline'
rescue LoadError
require 'rubygems'
retry
end

class App < CommandLine::Application

def initialize
synopsis "[-h] number limit"
short_description "Generate a list of samples"
option :help
expected_args :number, :limit
end

def convert_to_integer(*args)
arg, val = nil, nil
args.each { |arg|
val = instance_variable_get("@#{arg}")
instance_variable_set("@#{arg}", Integer(val))
}
rescue
raise "Argument '#{arg}' with value '#{val}' must be an integer.\n"+
usage
end

def main
convert_to_integer :number, :limit

puts Sample.create_sample_list(@number, @limit)
end

end#class App


class Sample
def self.create_sample_list(number, limit)
Sample.new(number, limit).samples
end
#.... no peeking...
#.... sorry, you won't find a spoiler here...
 
J

Jim Freeze

* Cassio Pennachin said:
galadriel:~->time ./sampler.rb 5000000 1000000000 >! tmp3.txt
38.450u 1.005s 0:48.85 80.7% 0+0k 0+0io 0pf+0w
galadriel:~->tail tmp3.txt
462469526
462469642
462469741
462469894
462470051
462470190
462470246
462470344
462470366
462470382

Shouldn't those number be more like

999999xxx
 
C

Cassio Pennachin

Shouldn't those number be more like
=20
999999xxx

As far as I know, they just have to be between 0 and LIMIT-1. In
fact, if the tail end of the sample is heavily concentrated as you
suggest, it's probably a badly biased sample. Unfortunately, my
sampler is biased as well, just not in this particular way.

It's not hard to remove the bias, but then the script would be 50-60%
slower, which makes it significantly less useful for taunting others
;-).

Cassio
--=20
Cassio Pennachin
 
B

Belorion

=20
As far as I know, they just have to be between 0 and LIMIT-1. In
fact, if the tail end of the sample is heavily concentrated as you
suggest, it's probably a badly biased sample. Unfortunately, my
sampler is biased as well, just not in this particular way.
=20
It's not hard to remove the bias, but then the script would be 50-60%
slower, which makes it significantly less useful for taunting others
;-).
=20
Cassio

I think Jim's point was that if your last digit is 462470382 then you
don't have a random sample... which is specified as a criteria for the
quiz. Even if you don't have a lot of Otherwise one could just do

SAMPLE_SIZE.times{ |ii| print ii }

And be done with it. Clearly if all of your members are under
462470382 then you haven't truly "sampled" the population of
1_000_000_000.
 
C

Cassio Pennachin

I think Jim's point was that if your last digit is 462470382 then you
don't have a random sample...=20

If by definition a random sample has to be uniform over the
range/population it samples, you are correct. Although even by such
definition it's possible that the last number would be this low in a
uniform sampling, the odds are very low. And I know that my script
doesn't uniformly sample the range, as I said before. I coded it to
be fast yet still meet the letter of the requirements.

The quiz seems to require a script that outputs a sorted series of
random numbers, with a given size and all between 0 and a given limit.
My program meets all these criteria. Clearly, it is not a uniform
sample of the population -- and neither is any algorithm that
frequently generates 999999xxx for the last several entries. There is
no reason to prefer one or the other, is there?

So feel free call my solution a mild abuse of the *spirit* of the
requirements in the name of entertainment ;-).

Cassio
 
J

Jacob Fugal

=20
As far as I know, they just have to be between 0 and LIMIT-1. In
fact, if the tail end of the sample is heavily concentrated as you
suggest, it's probably a badly biased sample. Unfortunately, my
sampler is biased as well, just not in this particular way.
=20
It's not hard to remove the bias, but then the script would be 50-60%
slower, which makes it significantly less useful for taunting others
;-).

Probability of a random sample X from population (0...1_000_000_000)
being less than or equal to 462_470_382 is 0.462470382. Probability of
ALL 5_000_000 random samples from same population being less than or
equal to 462_470_382 is (0.462470382 ^ 5_000_000) < 1.0e-1_674_580.
Those are VERY slim chances. Additionally, that figure assumes repeats
are allowed. Since they're not, after each sample below 462_470_382,
it will be even LESS likely that the next sample will also be below
462_470_382 since there are fewer low numbers to choose from. I
wouldn't be surprised if the actual chance were less than
1.0e-2_000_000. It IS possible, but not likely.

Now on the other hand, the probability of the last number being >=3D
999_999_000 is the probability of ANY being >=3D 999_999_000 which is
exactly the same as the negation of the probability that ALL are <
999_999_000. Using the same procedure as above, that probability is
(0.999999 ^ 5_000_000) =3D=3D 0.006738. So the probability of ANY of the
samples being at least 999_999_000 is (1.0 - 0.006738) =3D 0.993262, or
99.3262%. Pretty good chances.

Jacob Fugal
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,770
Messages
2,569,583
Members
45,075
Latest member
MakersCBDBloodSupport

Latest Threads

Top