[QUIZ] Sampling (#39)


Jacob Fugal

If by definition a random sample has to be uniform over the
range/population it samples, you are correct. Although even by such a
definition it's possible that the last number would be this low in a
uniform sampling, the odds are very low. And I know that my script
doesn't uniformly sample the range, as I said before. I coded it to
be fast yet still meet the letter of the requirements.

We can concede this. I still believe that the *spirit* of the
requirements was a Uniform Random Distribution.
... and neither is any algorithm that
frequently generates 999999xxx for the last several entries.

This statement I contest. I showed in my other post that the
probability the last number is 999999xxx is greater than 99%.
Expanding that calculation (I can show the work upon request), the
probability of all of the last ten numbers being 999999xxx is 93.5%.
That's getting small enough that you'll see some 999998xxx numbers in
the last ten every now and then, but it's still quite likely. Chances
that you'll see a 999997xxx number in the last ten: 0.05% (about once
in every 2000 trials).

Jacob Fugal
 

jason r tibbetts

Wybo said:
I'll just wait until the last and final specification comes out...

From the original posting:
Your program should output exactly the requested number of members from the
defined range to STDOUT, one number per line. Each member must be unique and
they should appear in sorted order.

The only thing that was left out of the original was the requirement
that the values be random.
 

David Brady

Belorion said:
I think Jim's point was that if your last number is 462470382 then you
don't have a random sample... which is specified as a criterion for the
quiz.

"People who think about this topic almost invariably get into
philosophical discussions about what the word "random" means. In a
sense, there is no such thing as a random number; for example, is 2 a
random number?"
-- Donald Knuth, The Art of Computer Programming, Vol II (2nd Edition),
page 2.

Knuth, in turn, quotes John Von Neumann, on the preceding page:
"Any one who considers arithmetical methods of producing random digits
is, of course, in a state of sin."

Knuth goes on to quantify various properties of randomness; Cassio's
method fails the very first one (uniform distribution). But meh: Cassio
has just steeped in the evil a little longer. :)

I would be interested to see a uniform solution approach the speed of Cassio's solution.

This is a GREAT Quiz, by the way. I've keyed in a standard, textbook
solution to this problem and my runtime is 9.5 minutes for ONE TENTH of
the sample (sampling 500K from 100M). Assuming purely O(n), that's 1.6
*hours* to run the full sample on my Athlon 2200. So, I have a
solution, and I could quit now. But now I want to try my hand playing
with the profiler. :)
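
For reference, the usual "standard, textbook solution" here is
selection sampling -- Knuth's Algorithm S, TAOCP Vol. 2, Sec. 3.4.2.
A minimal sketch of that technique (assumed, not necessarily the exact
code in question); it scans the whole population, which is why these
runtimes come out in minutes to hours:

# Algorithm S: visit each member of the population in order and select
# it with probability (still_needed / still_remaining). The output is
# unique, sorted, and uniform, but the runtime is O(population size).
def selection_sample(wanted, population)
  selected = 0
  population.times do |t|
    if rand(population - t) < wanted - selected
      puts t
      selected += 1
      break if selected == wanted
    end
  end
end

selection_sample(5_000_000, 1_000_000_000)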

-dB
 

jason r tibbetts

David said:
"People who think about this topic almost invariably get into
philosophical discussions about what the word "random" means. In a
sense, there is no such thing as a random number; for example, is 2 a
random number?"
-- Donald Knuth, The Art of Computer Programming, Vol II (2nd Edition),
page 2.

Knuth, in turn, quotes John Von Neumann, on the preceding page:
"Any one who considers arithmetical methods of producing random digits
is, of course, in a state of sin."

Knuth goes on to quantify various properties of randomness; Cassio's
method fails the very first one (uniform distribution). But meh: Cassio
has just steeped in the evil a little longer. :)

I would be interested to see a uniform solution approach the speed of Cassio's solution.

This is a GREAT Quiz, by the way. I've keyed in a standard, textbook
solution to this problem and my runtime is 9.5 minutes for ONE TENTH of
the sample (sampling 500K from 100M). Assuming purely O(n), that's 1.6
*hours* to run the full sample on my Athlon 2200. So, I have a
solution, and I could quit now. But now I want to try my hand playing
with the profiler. :)

Ezra's output looks like what you're looking for, but it seems to have
gone unnoticed in the wake of Cassio's post.
ezra:~/Sites ez$ time ./sample.rb 5_000_000 1_000_000_000 > big_sample.txt

real 0m43.838s
user 0m35.820s
sys 0m1.360s
ezra:~/Sites ez$ ls -l big_sample.txt
-rw-r--r-- 1 ez ez 49444445 Jul 15 10:46 big_sample.txt
ezra:~/Sites ez$ head big_sample.txt
168
285
566
604
912
1183
1335
1473
1728
1919
ezra:~/Sites ez$ tail big_sample.txt
999998155
999998313
999998484
999998680
999998825
999999151
999999330
999999465
999999621
999999877
ezra:~/Sites ez$

I /really/ want to know how this is possible. Multi-processor solution?
 

Jacob Fugal

Ezra's output looks like what you're looking for, but it seems to have
gone unnoticed in the wake of Cassio's post.

I /really/ want to know how this is possible. Multi-processor solution?

The thing that really gets me is the disk latency. Here's my trial for
just writing 5_000_000 numbers to disk:

$ cat test.rb
5_000_000.times { |i| puts i }

$ time ruby test.rb > big_file.txt

real 0m41.781s
user 0m26.436s
sys 0m2.626s

That's 41 seconds just to format and write the numbers. Ezra's disk
may be faster/better, but still I'd bet at least half the time
reported is spent in I/O. His algorithm must be phenomenal to get down
to that speed.

As a side note (not saying Ezra did this, but I am tempted to do it),
would a program utilizing RubyInline be permitted? :)

Jacob Fugal
 

Cassio Pennachin

Whoa! You're obviously right, my bias is immensely more severe. I
should have made
these simple checks before...

Cassio

Probability of a random sample X from population (0...1_000_000_000)
being less than or equal to 462_470_382 is 0.462470382. Probability of
ALL 5_000_000 random samples from same population being less than or
equal to 462_470_382 is (0.462470382 ^ 5_000_000) < 1.0e-1_674_580.
Those are VERY slim chances. Additionally, that figure assumes repeats
are allowed. Since they're not, after each sample below 462_470_382,
it will be even LESS likely that the next sample will also be below
462_470_382 since there are fewer low numbers to choose from. I
wouldn't be surprised if the actual chance were less than
1.0e-2_000_000. It IS possible, but not likely.

Now on the other hand, the probability of the last number being >=
999_999_000 is the probability of ANY being >= 999_999_000, which is
exactly the same as the negation of the probability that ALL are <
999_999_000. Using the same procedure as above, that probability is
(0.999999 ^ 5_000_000) == 0.006738. So the probability of ANY of the
samples being at least 999_999_000 is (1.0 - 0.006738) == 0.993262, or
99.3262%. Pretty good chances.

Jacob Fugal
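
Those figures are easy to reproduce (a quick sketch; the printed
values are approximate):

# P(all 5_000_000 independent samples fall below 999_999_000):
p_all_below = 0.999999 ** 5_000_000
puts p_all_below        # => 0.006737... (roughly e**-5, the 0.006738 above)
puts 1.0 - p_all_below  # => 0.993262..., the 99.3262% above

# 0.462470382 ** 5_000_000 underflows a Float straight to 0.0,
# so take logarithms instead:
puts 5_000_000 * Math.log10(0.462470382)
# => about -1_674_580, i.e. a probability on the order of 1.0e-1_674_580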


--
Cassio Pennachin
 

Jim Menard

I've managed to shave off one minute. Still nowhere near the 43 seconds posted
by Ezra.

tmp> time ruby sampling.rb 5_000_000 1_000_000_000 >big_sample.txt

real 1m46.449s
user 0m0.010s
sys 0m0.030s

Jim
 

jason r tibbetts

Jacob said:
The thing that really gets me is the disk latency. Here's my trial for
just writing 5_000_000 numbers to disk:

$ cat test.rb
5_000_000.times { |i| puts i }

$ time ruby test.rb > big_file.txt

real 0m41.781s
user 0m26.436s
sys 0m2.626s

That's 41 seconds just to format and write the numbers. Ezra's disk
may be faster/better, but still I'd bet at least half the time
reported is spent in I/O. His algorithm must be phenomenal to get down
to that speed.

Try Jim Freeze's example of doing a run that writes to /dev/null to
remove disk speed from the equation.
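
That is, something along these lines:

$ time ruby test.rb > /dev/null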
 

Ezra Zygmuntowicz

Ezra's output looks like what you're looking for, but it seems to
have gone unnoticed in the wake of Cassio's post.

I /really/ want to know how this is possible. Multi-processor
solution?

Nope. This is running on a single 1.6GHz G5 tower with about 15 other
programs running, including Photoshop and Dreamweaver. And my solution
is only 9 lines of code ;-) Here's some more results with a little
more time shaved off:
ezra:~/Sites/______raill_stuff ez$ time ./sample.rb 5_000_000
1_000_000_000 > big_sample.txt

real 0m39.860s
user 0m34.343s
sys 0m1.370s
ezra:~/Sites/______raill_stuff ez$ ls -l big_sample.txt
-rw-r--r-- 1 ez ez 49444444 Jul 15 14:25 big_sample.txt
ezra:~/Sites/______raill_stuff ez$ head big_sample.txt
61
260
555
666
816
1126
1349
1552
1740
1862
ezra:~/Sites/______raill_stuff ez$ tail big_sample.txt
999998018
999998306
999998426
999998735
999998949
999999011
999999340
999999480
999999643
999999840
ezra:~/Sites/______raill_stuff ez$

-Ezra Zygmuntowicz
Yakima Herald-Republic
WebMaster
509-577-7732
(e-mail address removed)
 

Ara.T.Howard

Nope. This is running on a single 1.6GHz G5 tower with about 15 other programs
running, including Photoshop and Dreamweaver. And my solution is only 9 lines
of code ;-) Here's some more results with a little more time shaved off:
ezra:~/Sites/______raill_stuff ez$ time ./sample.rb 5_000_000 1_000_000_000 >
big_sample.txt

real 0m39.860s
user 0m34.343s
sys 0m1.370s
ezra:~/Sites/______raill_stuff ez$ ls -l big_sample.txt
-rw-r--r-- 1 ez ez 49444444 Jul 15 14:25 big_sample.txt

this means your __average__ line size is 9. that means your average number
has 8 digits (ignoring newline). a uniform sampling should have an average
line size of about 5 digits (6 chars counting newline). seems like a heavy
bias towards the top. do other runs yield similar file sizes or is this
merely chance?

i was getting this kind of result by picking values in a range that ensured i'd
have enough left to fulfill the remainder... it tends to skew high.

on a side note - on the G5 here it seems to take me 54 seconds to even write a
5,000,000 entry array full of zeros:

~ ahoward$ time ruby -e' puts(Array::new(5_000_000, 0)) ' > out

real 0m54.361s
user 0m53.100s
sys 0m0.560s

our dell 2650s give similar times... odd...

-a
--
===============================================================================
| email :: ara [dot] t [dot] howard [at] noaa [dot] gov
| phone :: 303.497.6469
| My religion is very simple. My religion is kindness.
| --Tenzin Gyatso
===============================================================================
 

Jacob Fugal

this means your __average__ line size is 9. that means your average number
has 8 digits (ignoring newline). a uniform sampling should have an average
line size of about 5 digits (6 chars counting newline). seems like a heavy
bias towards the top. do other runs yield similar file sizes or is this
merely chance?

Not necessarily. Consider: numbers from 100_000_000 up have 9
digits. Those numbers comprise 90% of the population space. Numbers
from 10_000_000 to 99_999_999 have 8 digits. These numbers comprise 90%
of the remaining space, or 9% of the total population space. Numbers
under 10_000_000 comprise only 1% of the population space. Obviously,
you're going to have many, many more 8- and 9-digit numbers than 7
digits or under. It doesn't surprise me at all that the average line
length is *at least* 9 characters (with the newline). I'd expect the
full floating point average to be closer to 10 characters.

And, indeed: 49444444 / 5000000 = 9.8888888

Jacob Fugal
 

Mauricio Fernández

Nope. This is running on a single 1.6GHz G5 tower with about 15 other
programs running, including Photoshop and Dreamweaver. And my solution
is only 9 lines of code ;-) Here's some more results with a little
more time shaved off:
ezra:~/Sites/______raill_stuff ez$ time ./sample.rb 5_000_000
1_000_000_000 > big_sample.txt

real 0m39.860s
user 0m34.343s
sys 0m1.370s

My first take is also 9L long.

Timed on a 1536.849MHz K7 with only a couple processes awake, out of 150+.

batsman@tux-chan:/tmp$ time ruby sample.rb 5_000_000 1_000_000_000 > big_sample.txt

real 0m33.880s
user 0m32.750s
sys 0m0.610s
batsman@tux-chan:/tmp$ ls -l big_sample.txt
-rw-r--r-- 1 batsman batsman 49442396 2005-07-16 00:19 big_sample.txt
batsman@tux-chan:/tmp$ tail big_sample.txt
999997948
999998026
999998293
999998375
999998678
999998872
999999237
999999555
999999632
999999719
batsman@tux-chan:/tmp$ head big_sample.txt
103
260
521
857
1009
1395
1505
1896
2140
2302
 

Jacob Fugal

And, indeed: 49444444 / 5000000 = 9.8888888

On further examination, the expected length of a sample value would be[1]:

(1..9).sum { |n| n * 9 * 10 ** (n - 10) }

This evaluates to 8.888888889, or 9.888888889 characters per line, or
an expected file size of 49444444.445 bytes. I think Ezra's output is
right on.

Jacob Fugal

[1] My implementation of sum, for those interested.

module Enumerable
  def sum
    inject(0) { |sum, n| sum + yield(n) }
  end
end
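
Evaluating it (with 10.0 rather than 10, so that newer Rubies return a
Float rather than a Rational):

(1..9).sum { |n| n * 9 * 10.0 ** (n - 10) }  # => 8.888888889 (approx.)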
 

Ezra Zygmuntowicz

Nope. This is running on a single 1.6GHz G5 tower with about 15
other programs running, including Photoshop and Dreamweaver. And my
solution is only 9 lines of code ;-) Here's some more results with
a little more time shaved off:
ezra:~/Sites/______raill_stuff ez$ time ./sample.rb 5_000_000
1_000_000_000 > big_sample.txt

real 0m39.860s
user 0m34.343s
sys 0m1.370s
ezra:~/Sites/______raill_stuff ez$ ls -l big_sample.txt
-rw-r--r-- 1 ez ez 49444444 Jul 15 14:25 big_sample.txt

this means your __average__ line size is 9. that means your average
number has 8 digits (ignoring newline). a uniform sampling should have
an average line size of about 5 digits (6 chars counting newline).
seems like a heavy bias towards the top. do other runs yield similar
file sizes or is this merely chance?

i was getting this kind of result by picking values in a range that
ensured i'd have enough left to fulfill the remainder... it tends to
skew high.

on a side note - on the G5 here it seems to take me 54 seconds to even
write a 5,000,000 entry array full of zeros:

~ ahoward$ time ruby -e' puts(Array::new(5_000_000, 0)) ' > out

real 0m54.361s
user 0m53.100s
sys 0m0.560s

our dell 2650s give similar times... odd...

-a
--
===============================================================================
| email :: ara [dot] t [dot] howard [at] noaa [dot] gov
| phone :: 303.497.6469
| My religion is very simple. My religion is kindness.
| --Tenzin Gyatso
===============================================================================

I get pretty much the same file size every time I run this. And if
you look at the original quiz posting by James Edward Gray you will
see that the sample run data that he said was the time to beat had a
size of 49011436, whereas mine is 49444444. So yeah, mine is a bit
skewed higher, I guess. But when I look through the results it looks
like a nice random spread covering all the numbers up to the limit.
My solution is really very simple, and almost all the time it takes
to run is spent on IO. When I run the following:

ez$ time ruby -e'1.upto(5_000_000) {|i| puts i} ' > out

real 0m27.236s
user 0m23.802s
sys 0m0.998s

So it takes about 13 seconds for the math and 27 seconds for the IO.

Cheers-
-Ezra Zygmuntowicz
Yakima Herald-Republic
WebMaster
509-577-7732
(e-mail address removed)
 

Ara.T.Howard

I get pretty much the same file size every time I run this. And if you look
at the original quiz posting by James Edward Gray you will see that the
sample run data that he said was the time to beat had a size of 49011436,
whereas mine is 49444444. So yeah, mine is a bit skewed higher, I guess. But
when I look through the results it looks like a nice random spread covering
all the numbers up to the limit. My solution is really very simple, and almost
all the time it takes to run is spent on IO. When I run the following:
ez$ time ruby -e'1.upto(5_000_000) {|i| puts i} ' > out

real 0m27.236s
user 0m23.802s
sys 0m0.998s

So it takes about 13 seconds for the math and 27 seconds for the IO.

wow. you are right there. i thought for sure doing a single io call (puts
array) would be faster. i know ruby buffers, but i thought the function calls
alone would kill you doing many puts.
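
fwiw, a rough benchmark sketch of the two styles (illustrative only;
run with stdout sent to /dev/null so the terminal doesn't dominate):

require 'benchmark'

a = Array.new(1_000_000) { |i| i }

Benchmark.bm(10) do |x|
  x.report('many puts') { a.each { |n| puts n } }
  x.report('one puts')  { puts a }                    # one call, many lines
  x.report('join')      { print a.join("\n"), "\n" }  # one big string
end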

cheers.

-a
--
===============================================================================
| email :: ara [dot] t [dot] howard [at] noaa [dot] gov
| phone :: 303.497.6469
| My religion is very simple. My religion is kindness.
| --Tenzin Gyatso
===============================================================================
 

Bill Kelly

From: "Mauricio Fern=E1ndez said:
My first take is also 9L long.

Timed on a 1536.849MHz K7 with only a couple processes awake, out of 150+.

batsman@tux-chan:/tmp$ time ruby sample.rb 5_000_000 1_000_000_000 > big_sample.txt

real 0m33.880s
user 0m32.750s
sys 0m0.610s

This is on a 2.4GHz Celeron, otherwise idle... The solution is
10L (including two lines for retrieving the command line arguments,
which could be one line...)

$ time ruby sample.rb 5_000_000 1_000_000_000 > big_sample.txt
real 0m34.022s
user 0m33.290s
sys 0m0.590s

$ wc -l big_sample.txt
5000000 big_sample.txt

$ head big_sample.txt
5
304
423
627
820
1173
1282
1575
1639
1897

$ tail big_sample.txt
999998156
999998353
999998481
999998737
999998991
999999041
999999367
999999475
999999630
999999996


Removing the puts that outputs each sample, but leaving the
rest of the logic, the time is:

real 0m11.793s
user 0m11.790s
sys 0m0.000s


Regards,

Bill
 

Ezra Zygmuntowicz

This is on a 2.4GHz Celeron, otherwise idle... The solution is
10L (including two lines for retrieving the command line arguments,
which could be one line...)

$ time ruby sample.rb 5_000_000 1_000_000_000 > big_sample.txt
real 0m34.022s
user 0m33.290s
sys 0m0.590s

$ wc -l big_sample.txt
5000000 big_sample.txt

$ head big_sample.txt
5
304
423
627
820
1173
1282
1575
1639
1897

$ tail big_sample.txt
999998156
999998353
999998481
999998737
999998991
999999041
999999367
999999475
999999630
999999996

Here's my final solution's time, again on a 1.6GHz G5. I got the code
down to 7 lines and I don't think I can get it any faster.

ezra:~/Sites/______raill_stuff ez$ time ./sample.rb 5_000_000
1_000_000_000 > big_sample.txt

real 0m31.023s
user 0m29.406s
sys 0m0.857s

ezra:~/Sites/______raill_stuff ez$ ls -l big_sample.txt
-rw-r--r-- 1 ez ez 49444445 Jul 15 16:36 big_sample.txt
ezra:~/Sites/______raill_stuff ez$ head big_sample.txt
157
330
558
693
946
1024
1231
1529
1644
1872
ezra:~/Sites/______raill_stuff ez$ tail big_sample.txt
999998135
999998267
999998411
999998702
999998998
999999071
999999231
999999441
999999653
999999876
ezra:~/Sites/______raill_stuff ez$ wc big_sample.txt
5000000 big_sample.txt

-Ezra Zygmuntowicz
Yakima Herald-Republic
WebMaster
509-577-7732
(e-mail address removed)
 

James Edward Gray II

As a side note (not saying Ezra did this, but I am tempted to do it),
would a program utilizing RubyInline be permitted? :)

Use whatever tools you like, as always.

James Edward Gray II
 

James Edward Gray II

James,

What are your machine's specs, just for comparison's sake? I don't
want to get too discouraged by my 5+ year-old Sun beater
workstation. :)

Dual 2.0 GHz G5. 2 Gigs of RAM.

I've also found that the printing of the results (in the form of an
array) is taking several minutes in and of itself. Since optimizing
the printout probably isn't the point of the quiz, can anyone
recommend something faster than arr.each{ |elem| p elem }?

IO is certainly a part of this quiz. The benchmark library is your
friend... ;)
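
For instance, a sketch of the kind of comparison meant here (send
stdout to /dev/null; the labels and sizes are illustrative):

require 'benchmark'

arr = (1..500_000).to_a

Benchmark.bm(12) do |x|
  x.report('p each')    { arr.each { |elem| p elem } }
  x.report('puts each') { arr.each { |elem| puts elem } }
  x.report('puts join') { puts arr.join("\n") }
end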

James Edward Gray II
 
