# [QUIZ] Sampling (#39)

Discussion in 'Ruby' started by Ruby Quiz, Jul 15, 2005.

1. ### Ruby QuizGuest

The three rules of Ruby Quiz:

1. Please do not post any solutions or spoiler discussion for this quiz until
48 hours have passed from the time on this message.

2. Support Ruby Quiz by submitting ideas as often as you can:

http://www.rubyquiz.com/

3. Enjoy!

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

This week's Ruby quiz is a classic sampling problem that I've seen many
interesting solutions for in the past.

The challenge is to implement a program called "sample" that takes exactly two
integer inputs. The first of those should be the number of members you would
like included in the sample. The second is the upper boundary (exclusive) of
the range of integers members can be selected from. The lower boundary is zero
(inclusive).

Your program should output exactly the requested number of members from the
defined range to STDOUT, one number per line. Each member must be unique and
they should appear in sorted order.

Here are some sample runs of my solution to illustrate the task:

\$ ./sample
Usage: sample MEMBERS LIMIT

MEMBERS: The number of members you want in the sample. (Integer)
LIMIT: The upper limit for the sample range. All (Integer)
members will between 0 and LIMIT - 1.

Note that MEMBERS must be less than LIMIT.
\$ ./sample 3 10
0
1
5
\$ ./sample 3 10
1
2
8
\$ ./sample 3 10
2
3
9
\$ ./sample 9 10
0
2
3
4
5
6
7
8
9
\$ ./sample 20 400
1
3
4
25
32
36
42
56
93
111
137
139
153
213
222
226
263
289
293
314

Now for all the algorithm junkies out there that enjoy a little friendly
competition, here's the time to beat:

\$ time ./sample 5_000_000 1_000_000_000 > big_sample.txt

real 30m10.593s
user 29m54.544s
sys 0m4.736s
\$ ls -l big_sample.txt
-rw-r--r-- 1 james staff 49011436 Jul 11 15:26 big_sample.txt
112
221
450
655
900
1216
1399
1469
1494
1628
\$ tail big_sample.txt
999995325
999996330
999996552
999996682
999997426
999997676
999998253
999999153
999999658
999999693

Ruby Quiz, Jul 15, 2005

2. ### Jim MenardGuest

Ruby Quiz wrote:

> The challenge is to implement a program called "sample" that takes exactly two
> integer inputs. The first of those should be the number of members you would
> like included in the sample. The second is the upper boundary (exclusive) of
> the range of integers members can be selected from. The lower boundary is zero
> (inclusive).
>
> Your program should output exactly the requested number of members from the
> defined range to STDOUT, one number per line. Each member must be unique and
> they should appear in sorted order.

The description does not say that the output needs to be randomly selected
from the input range. Without that, printing the first <members> numbers would
be sufficient---and really fast.

(0...members).each { | i | puts i }

Should the output be randomly selected from the range of input values?

Jim
--
Jim Menard, , http://www.io.com/~jimm
"Ask not a question of USENET, for it will answer both 'Yea.' and 'Nay.'
and 'Ask in another group.'" -- Simon Slavin

Jim Menard, Jul 15, 2005

3. ### James Edward Gray IIGuest

On Jul 15, 2005, at 8:00 AM, Ruby Quiz wrote:

> Now for all the algorithm junkies out there that enjoy a little
> friendly
> competition, here's the time to beat:
>
> \$ time ./sample 5_000_000 1_000_000_000 > big_sample.txt
>
> real 30m10.593s
> user 29m54.544s
> sys 0m4.736s

P.S. Taunting each other with fast time readouts is not a spoiler.
Feel free!

James Edward Gray II

James Edward Gray II, Jul 15, 2005
4. ### James Edward Gray IIGuest

On Jul 15, 2005, at 8:10 AM, Jim Menard wrote:

> The description does not say that the output needs to be randomly
> selected from the input range. Without that, printing the first
> <members> numbers would be sufficient---and really fast.
>
> (0...members).each { | i | puts i }
>
> Should the output be randomly selected from the range of input values?

Egad, yes. Massive oversight on my part. Sorry!

James Edward Gray II

James Edward Gray II, Jul 15, 2005
5. ### Wybo DekkerGuest

On Fri, 15 Jul 2005, James Edward Gray II wrote:

> On Jul 15, 2005, at 8:00 AM, Ruby Quiz wrote:
>
> > Now for all the algorithm junkies out there that enjoy a little friendly
> > competition, here's the time to beat:
> >
> > \$ time ./sample 5_000_000 1_000_000_000 > big_sample.txt
> >
> > real 30m10.593s
> > user 29m54.544s
> > sys 0m4.736s

>

\$ time ruby -e '5_000_000.times do puts rand(1_000_000_000) end' >big_sample.txt
elap 16.612
user 16.421
syst 0.177
CPU 99.91%

\$ wc big_sample.txt
5000000 5000000 49446636 big_sample.txt

or do I miss something??
--
Wybo

Wybo Dekker, Jul 15, 2005
6. ### BelorionGuest

On 7/15/05, Wybo Dekker <> wrote:
> \$ time ruby -e '5_000_000.times do puts rand(1_000_000_000) end' >big_sam=

ple.txt
> elap 16.612
> user 16.421
> syst 0.177
> CPU 99.91%
>=20
> \$ wc big_sample.txt
> 5000000 5000000 49446636 big_sample.txt
>=20
> or do I miss something??

The above example will result in duplicate members. Every number in
the list must be unique.

Matt

Belorion, Jul 15, 2005
7. ### jason r tibbettsGuest

Belorion wrote:
> On 7/15/05, Wybo Dekker <> wrote:
>
>>\$ time ruby -e '5_000_000.times do puts rand(1_000_000_000) end' >big_sample.txt
>> elap 16.612
>> user 16.421
>> syst 0.177
>> CPU 99.91%
>>
>>\$ wc big_sample.txt
>> 5000000 5000000 49446636 big_sample.txt
>>
>>or do I miss something??

>
>
> The above example will result in duplicate members. Every number in
> the list must be unique.

And sorted.

jason r tibbetts, Jul 15, 2005
8. ### jason r tibbettsGuest

James,

What are your machine's specs, just for comparison's sake? I don't want
to get too discouraged by my 5+ year-old Sun beater workstation.

I've also found that the printing of the results (in the form of an
array) is taking several minutes in and of itself. Since optimizing the
printout probably isn't the point of the quiz, can anyone recommend
something faster than arr.each{ | elem | p elem } ?

Ruby Quiz wrote:
> The three rules of Ruby Quiz:
>
> 1. Please do not post any solutions or spoiler discussion for this quiz until
> 48 hours have passed from the time on this message.
>
> 2. Support Ruby Quiz by submitting ideas as often as you can:
>
> http://www.rubyquiz.com/
>
> 3. Enjoy!
>
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
>
> This week's Ruby quiz is a classic sampling problem that I've seen many
> interesting solutions for in the past.
>
> The challenge is to implement a program called "sample" that takes exactly two
> integer inputs. The first of those should be the number of members you would
> like included in the sample. The second is the upper boundary (exclusive) of
> the range of integers members can be selected from. The lower boundary is zero
> (inclusive).
>
> Your program should output exactly the requested number of members from the
> defined range to STDOUT, one number per line. Each member must be unique and
> they should appear in sorted order.
>
> Here are some sample runs of my solution to illustrate the task:
>
> \$ ./sample
> Usage: sample MEMBERS LIMIT
>
> MEMBERS: The number of members you want in the sample. (Integer)
> LIMIT: The upper limit for the sample range. All (Integer)
> members will between 0 and LIMIT - 1.
>
> Note that MEMBERS must be less than LIMIT.
> \$ ./sample 3 10
> 0
> 1
> 5
> \$ ./sample 3 10
> 1
> 2
> 8
> \$ ./sample 3 10
> 2
> 3
> 9
> \$ ./sample 9 10
> 0
> 2
> 3
> 4
> 5
> 6
> 7
> 8
> 9
> \$ ./sample 20 400
> 1
> 3
> 4
> 25
> 32
> 36
> 42
> 56
> 93
> 111
> 137
> 139
> 153
> 213
> 222
> 226
> 263
> 289
> 293
> 314
>
> Now for all the algorithm junkies out there that enjoy a little friendly
> competition, here's the time to beat:
>
> \$ time ./sample 5_000_000 1_000_000_000 > big_sample.txt
>
> real 30m10.593s
> user 29m54.544s
> sys 0m4.736s
> \$ ls -l big_sample.txt
> -rw-r--r-- 1 james staff 49011436 Jul 11 15:26 big_sample.txt
> 112
> 221
> 450
> 655
> 900
> 1216
> 1399
> 1469
> 1494
> 1628
> \$ tail big_sample.txt
> 999995325
> 999996330
> 999996552
> 999996682
> 999997426
> 999997676
> 999998253
> 999999153
> 999999658
> 999999693
>
>
>

jason r tibbetts, Jul 15, 2005
9. ### Stefan LangGuest

On Friday 15 July 2005 18:38, jason r tibbetts wrote:
> James,
>
> What are your machine's specs, just for comparison's sake? I don't want
> to get too discouraged by my 5+ year-old Sun beater workstation.
>
> I've also found that the printing of the results (in the form of an
> array) is taking several minutes in and of itself. Since optimizing the
> printout probably isn't the point of the quiz, can anyone recommend
> something faster than arr.each{ | elem | p elem } ?

I guess "puts arr" is faster.

--
Stefan

Stefan Lang, Jul 15, 2005
10. ### jason r tibbettsGuest

jason r tibbetts wrote:
> James,
>
> What are your machine's specs, just for comparison's sake? I don't want
> to get too discouraged by my 5+ year-old Sun beater workstation.
>
> I've also found that the printing of the results (in the form of an
> array) is taking several minutes in and of itself. Since optimizing the
> printout probably isn't the point of the quiz, can anyone recommend
> something faster than arr.each{ | elem | p elem } ?

With the aforementioned old Sun box (SunOS dakota 5.8 Generic_108528-27
sun4u sparc SUNW,Ultra-5_10, 440MHz):

Attempt #1:
time ./sample.rb 5_000_000 1_000_000_000 > big_sample.txt
955.00u 46.42s 18:41.51 89.2%

Attempt #2:
time ./sample-logn2.rb 5_000_000 1_000_000_000 > big_sample.txt
388.37u 50.92s 9:40.11 75.7%

tail big_sample.txt
999998043
999998053
999998441
999998442
999998587
999999146
999999158
999999418
999999439
999999517

> Ruby Quiz wrote:
>
>> The three rules of Ruby Quiz:
>>
>> 1. Please do not post any solutions or spoiler discussion for this
>> quiz until
>> 48 hours have passed from the time on this message.
>>
>> 2. Support Ruby Quiz by submitting ideas as often as you can:
>>
>> http://www.rubyquiz.com/
>>
>> 3. Enjoy!
>>
>> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
>>
>>
>> This week's Ruby quiz is a classic sampling problem that I've seen many
>> interesting solutions for in the past.
>>
>> The challenge is to implement a program called "sample" that takes
>> exactly two
>> integer inputs. The first of those should be the number of members
>> you would
>> like included in the sample. The second is the upper boundary
>> (exclusive) of
>> the range of integers members can be selected from. The lower
>> boundary is zero
>> (inclusive).
>>
>> Your program should output exactly the requested number of members
>> from the
>> defined range to STDOUT, one number per line. Each member must be
>> unique and
>> they should appear in sorted order.
>>
>> Here are some sample runs of my solution to illustrate the task:
>>
>> \$ ./sample Usage: sample MEMBERS LIMIT
>>
>> MEMBERS: The number of members you want in the sample. (Integer)
>> LIMIT: The upper limit for the sample range. All (Integer)
>> members will between 0 and LIMIT - 1.
>>
>> Note that MEMBERS must be less than LIMIT.
>> \$ ./sample 3 10
>> 0
>> 1
>> 5
>> \$ ./sample 3 10
>> 1
>> 2
>> 8
>> \$ ./sample 3 10
>> 2
>> 3
>> 9
>> \$ ./sample 9 10
>> 0
>> 2
>> 3
>> 4
>> 5
>> 6
>> 7
>> 8
>> 9
>> \$ ./sample 20 400
>> 1
>> 3
>> 4
>> 25
>> 32
>> 36
>> 42
>> 56
>> 93
>> 111
>> 137
>> 139
>> 153
>> 213
>> 222
>> 226
>> 263
>> 289
>> 293
>> 314
>>
>> Now for all the algorithm junkies out there that enjoy a little friendly
>> competition, here's the time to beat:
>>
>> \$ time ./sample 5_000_000 1_000_000_000 > big_sample.txt
>>
>> real 30m10.593s
>> user 29m54.544s
>> sys 0m4.736s
>> \$ ls -l big_sample.txt -rw-r--r-- 1 james staff 49011436
>> Jul 11 15:26 big_sample.txt
>> 221
>> 450
>> 655
>> 900
>> 1216
>> 1399
>> 1469
>> 1494
>> 1628
>> \$ tail big_sample.txt 999995325
>> 999996330
>> 999996552
>> 999996682
>> 999997426
>> 999997676
>> 999998253
>> 999999153
>> 999999658
>> 999999693
>>
>>
>>

>
>
>
>

jason r tibbetts, Jul 15, 2005
11. ### Jim MenardGuest

tmp> time ruby sampling.rb 5_000_000 1_000_000_000 >big_sample.txt

real 2m36.816s
user 0m0.010s
sys 0m0.000s
221
248
508
769
776
857
1186
1395
1434
1868
tmp> tail big_sample.txt
999997722
999998019
999998183
999998422
999998885
999998919
999999068
999999338
999999660
999999806

Jim
--
Jim Menard, , http://www.io.com/~jimm
Dash dash space newline
Four-line witty quotation
Perfect message end
-- Donald Welsh in rec.humor.oracle.d

Jim Menard, Jul 15, 2005
12. ### Ezra ZygmuntowiczGuest

On Jul 15, 2005, at 6:10 AM, James Edward Gray II wrote:
>>

>
> P.S. Taunting each other with fast time readouts is not a
> spoiler. Feel free!
>
> James Edward Gray II
>
>

ezra:~/Sites ez\$ time ./sample.rb 5_000_000 1_000_000_000 >
big_sample.txt

real 0m43.838s
user 0m35.820s
sys 0m1.360s
ezra:~/Sites ez\$ ls -l big_sample.txt
-rw-r--r-- 1 ez ez 49444445 Jul 15 10:46 big_sample.txt
168
285
566
604
912
1183
1335
1473
1728
1919
ezra:~/Sites ez\$ tail big_sample.txt
999998155
999998313
999998484
999998680
999998825
999999151
999999330
999999465
999999621
999999877
ezra:~/Sites ez\$

-Ezra Zygmuntowicz
Yakima Herald-Republic
WebMaster
509-577-7732

Ezra Zygmuntowicz, Jul 15, 2005
13. ### Cassio PennachinGuest

On 7/15/05, James Edward Gray II <> wrote:
> P.S. Taunting each other with fast time readouts is not a spoiler.
> Feel free!

galadriel:~->time ./sampler.rb 5000000 1000000000 >! tmp3.txt
38.450u 1.005s 0:48.85 80.7% 0+0k 0+0io 0pf+0w
11
160
206
307
467
504
647
743
890
936
462469526
462469642
462469741
462469894
462470051
462470190
462470246
462470344
462470366
462470382

--=20
Cassio Pennachin

Cassio Pennachin, Jul 15, 2005
14. ### Jim FreezeGuest

> > \$ time ./sample 5_000_000 1_000_000_000 > big_sample.txt
> >
> > real 30m10.593s
> > user 29m54.544s
> > sys 0m4.736s

Here are my results on a lowly powerbook:

time ./sample 5_000_000 1_000_000_000 > /dev/null
136.990u 1.120s 2:50.91 80.8% 0+0k 0+0io 0pf+0w

time ./sample 5_000_000 1_000_000_000 > big_sample.txt
222.100u 2.500s 3:53.70 96.1% 0+0k 1+28io 0pf+0w

402
421
549
571
757
1025
1224
1604
2064
2602
% tail big_sample.txt
999998571
999998613
999998644
999998648
999998841
999999116
999999414
999999609
999999609
999999915

Here's my typical interaction with a program.
(I usually seem to be able to find all the ways
that don't work first.)
I've included a source listing of the application
part of the code to demonstrate commandline.

% ./sample
Usage: sample [-h] number limit

% ./sample -h
NAME

sample - Generate a list of samples

SYNOPSIS

sample [-h] number limit

OPTIONS

--help,-h
Displays help page.

% ./sample 10 9
ERROR: Number must be less than or equal to limit.

% ./sample 10 11.1
ERROR: Argument 'limit' with value '11.1' must be an integer.
Usage: sample [-h] number limit

% ./sample 10.1 11
ERROR: Argument 'number' with value '10.1' must be an integer.
Usage: sample [-h] number limit

% ./sample 3 10
3
4

% ./sample 5 20
6
7
9
14
18

% cat sample
#!/usr/bin/env ruby

begin
require 'commandline'
require 'rubygems'
retry
end

class App < CommandLine::Application

def initialize
synopsis "[-h] number limit"
short_description "Generate a list of samples"
option :help
expected_args :number, :limit
end

def convert_to_integer(*args)
arg, val = nil, nil
args.each { |arg|
val = instance_variable_get("@#{arg}")
instance_variable_set("@#{arg}", Integer(val))
}
rescue
raise "Argument '#{arg}' with value '#{val}' must be an integer.\n"+
usage
end

def main
convert_to_integer :number, :limit

puts Sample.create_sample_list(@number, @limit)
end

end#class App

class Sample
def self.create_sample_list(number, limit)
Sample.new(number, limit).samples
end
#.... no peeking...
#.... sorry, you won't find a spoiler here...
--
Jim Freeze

Jim Freeze, Jul 15, 2005
15. ### Jim FreezeGuest

* Cassio Pennachin <> [2005-07-16 03:04:12 +0900]:

> galadriel:~->time ./sampler.rb 5000000 1000000000 >! tmp3.txt
> 38.450u 1.005s 0:48.85 80.7% 0+0k 0+0io 0pf+0w
> 462469526
> 462469642
> 462469741
> 462469894
> 462470051
> 462470190
> 462470246
> 462470344
> 462470366
> 462470382

Shouldn't those number be more like

999999xxx

--
Jim Freeze

Jim Freeze, Jul 15, 2005
16. ### Cassio PennachinGuest

> Shouldn't those number be more like
>=20
> 999999xxx

As far as I know, they just have to be between 0 and LIMIT-1. In
fact, if the tail end of the sample is heavily concentrated as you
suggest, it's probably a badly biased sample. Unfortunately, my
sampler is biased as well, just not in this particular way.

It's not hard to remove the bias, but then the script would be 50-60%
slower, which makes it significantly less useful for taunting others
;-).

Cassio
--=20
Cassio Pennachin

Cassio Pennachin, Jul 15, 2005
17. ### BelorionGuest

On 7/15/05, Cassio Pennachin <> wrote:
> > Shouldn't those number be more like
> >
> > 999999xxx

>=20
> As far as I know, they just have to be between 0 and LIMIT-1. In
> fact, if the tail end of the sample is heavily concentrated as you
> suggest, it's probably a badly biased sample. Unfortunately, my
> sampler is biased as well, just not in this particular way.
>=20
> It's not hard to remove the bias, but then the script would be 50-60%
> slower, which makes it significantly less useful for taunting others
> ;-).
>=20
> Cassio
> --
> Cassio Pennachin

I think Jim's point was that if your last digit is 462470382 then you
don't have a random sample... which is specified as a criteria for the
quiz. Even if you don't have a lot of Otherwise one could just do

SAMPLE_SIZE.times{ |ii| print ii }

And be done with it. Clearly if all of your members are under
462470382 then you haven't truly "sampled" the population of
1_000_000_000.

Belorion, Jul 15, 2005
18. ### Cassio PennachinGuest

> I think Jim's point was that if your last digit is 462470382 then you
> don't have a random sample...=20

If by definition a random sample has to be uniform over the
range/population it samples, you are correct. Although even by such
definition it's possible that the last number would be this low in a
uniform sampling, the odds are very low. And I know that my script
doesn't uniformly sample the range, as I said before. I coded it to
be fast yet still meet the letter of the requirements.

The quiz seems to require a script that outputs a sorted series of
random numbers, with a given size and all between 0 and a given limit.
My program meets all these criteria. Clearly, it is not a uniform
sample of the population -- and neither is any algorithm that
frequently generates 999999xxx for the last several entries. There is
no reason to prefer one or the other, is there?

So feel free call my solution a mild abuse of the *spirit* of the
requirements in the name of entertainment ;-).

Cassio

Cassio Pennachin, Jul 15, 2005
19. ### Jacob FugalGuest

On 7/15/05, Cassio Pennachin <> wrote:
> > Shouldn't those number be more like
> >
> > 999999xxx

>=20
> As far as I know, they just have to be between 0 and LIMIT-1. In
> fact, if the tail end of the sample is heavily concentrated as you
> suggest, it's probably a badly biased sample. Unfortunately, my
> sampler is biased as well, just not in this particular way.
>=20
> It's not hard to remove the bias, but then the script would be 50-60%
> slower, which makes it significantly less useful for taunting others
> ;-).

Probability of a random sample X from population (0...1_000_000_000)
being less than or equal to 462_470_382 is 0.462470382. Probability of
ALL 5_000_000 random samples from same population being less than or
equal to 462_470_382 is (0.462470382 ^ 5_000_000) < 1.0e-1_674_580.
Those are VERY slim chances. Additionally, that figure assumes repeats
are allowed. Since they're not, after each sample below 462_470_382,
it will be even LESS likely that the next sample will also be below
462_470_382 since there are fewer low numbers to choose from. I
wouldn't be surprised if the actual chance were less than
1.0e-2_000_000. It IS possible, but not likely.

Now on the other hand, the probability of the last number being >=3D
999_999_000 is the probability of ANY being >=3D 999_999_000 which is
exactly the same as the negation of the probability that ALL are <
999_999_000. Using the same procedure as above, that probability is
(0.999999 ^ 5_000_000) =3D=3D 0.006738. So the probability of ANY of the
samples being at least 999_999_000 is (1.0 - 0.006738) =3D 0.993262, or
99.3262%. Pretty good chances.

Jacob Fugal

Jacob Fugal, Jul 15, 2005
20. ### Wybo DekkerGuest

On Sat, 16 Jul 2005, jason r tibbetts wrote:

> Belorion wrote:
> > On 7/15/05, Wybo Dekker <> wrote:
> >
> > > \$ time ruby -e '5_000_000.times do puts rand(1_000_000_000) end'
> > > >big_sample.txt
> > > elap 16.612
> > > user 16.421
> > > syst 0.177
> > > CPU 99.91%
> > >
> > > \$ wc big_sample.txt
> > > 5000000 5000000 49446636 big_sample.txt
> > >
> > > or do I miss something??

> >
> >
> > The above example will result in duplicate members. Every number in
> > the list must be unique.

>
> And sorted.

I'll just wait until the last and final specification comes out...

--
Wybo

Wybo Dekker, Jul 15, 2005