need some Ruby magic

A

ara.t.howard

=2E

I did some numbers and found the following probability estimates:

array size P(#rand() collision)
1000 5.54558e-11
10000 5.55056e-09
1000000 5.55096e-05
10000000 5.53574e-03

A collision implies that the associated pair of elements has got the same
ordering in the output array, instead of 50-50 chances of it being revers= ed.

More info, including the Ruby code I used, available at
http://eigenclass.org/hiki.rb?sort_by+rand+is+biased


I'm not totally sure it's really unbiased, but at first sight it looks go=
od.

the big difference is that a value will have rand called multiple times for=
it
- in otherwords an element compareds differently every time. this is bound=
to
help shuffle more uniformly it seems:

harp:~ > cat a.rb
%w( a b c d ).sort{|a,b| ra =3D rand; rb =3D rand; p a =3D> ra; p b =3D>=
rb; ra <=3D> rb }

harp:~ > ruby a.rb
{"a"=3D>0.181439380218727}
{"c"=3D>0.579403886629126}
{"c"=3D>0.713513989939958}
{"d"=3D>0.573687508540169}
{"a"=3D>0.851534740906118}
{"d"=3D>0.00156074151427565}
{"b"=3D>0.803591007691647}
{"a"=3D>0.494066110872563}
{"b"=3D>0.834676789626288}
{"c"=3D>0.792106134460754}

so, note that "a" was sorted using a different value each time.

can anyone who remembers stats and knows the sort algorithm weigh in?

-a
--=20
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D
| ara [dot] t [dot] howard [at] noaa [dot] gov
| all happiness comes from the desire for others to be happy. all misery
| comes from the desire for oneself to be happy.
| -- bodhicaryavatara
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D
 
M

Mauricio Fernández

the big difference is that a value will have rand called multiple times for

That might be the very reason why it is ultimately much more biased than
sort_by{ rand }... I'm keeping the latter.
If the limited precision of rand() became a problem, I could always use
arr.sort_by{ [rand, rand] } where collisions happen with a probability
around 6.16298e-19 for a 10_000_000 element array, or more generally
Array.new(N){rand} (but then we'd have to look into MT for another
possible source of residual bias).
it - in otherwords an element compareds differently every time. this is
bound to help shuffle more uniformly it seems:

We cannot trust our intuition in these cases; in the following example, the
tenth element is more likely to stay where it was (look at pos[9]):

a = (1..100); pos = Array.new(100, 0)

t = Time.new; 100000.times{ pos[a.sort{rand <=> rand}.index(10)] += 1 }
Time.new - t
# => 223.891608

pos
# => [1176, 1208, 1083, 1153, 1191, 1130, 1098, 1133, 1189, 1438, 1150, 1043,
========
# 1121, 1067, 1076, 1069, 1078, 1113, 1129, 1143,1115, 1074, 1081, 1069, 1057,
# [...]

t = Time.new; 10000000.times{ pos[a.sort{rand <=> rand}.index(10)] += 1 }
Time.new - t
# => 22781.664168

pos
# => [105944, 105861, 105546, 105681, 104059, 101196, 99080, 100675, 110851,
# 132223, 101670, 96018, 96147, 96688, 97709, 99051, 97823, 97345, 98799,
========
# [...]

The bias can be observed easily with smaller arrays:

a = (1..20); pos = Array.new(20, 0)
t = Time.new
100000.times{ pos[a.sort{rand <=> rand}.index(3)] += 1 }
Time.new - t # => 20.848851
pos
# => [5084, 5694, 6561, 4326, 4205, 4792, 4913, 4827, 4791, 4571, 4560, 4707,
# 4863, 4934, 5112, 5011, 5071, 5326, 5482, 5170]

a = (1..10); pos = Array.new(10, 0)
t = Time.new
1_000_000.times{ pos[a.sort{rand <=> rand}.index(5)] += 1 }
Time.new - t # => 77.438912
pos
# => [113836, 103238, 99442, 88682, 101306, 83294, 93296, 100830, 100744, 115332]

Compare the latter to

a = (1..10); pos = Array.new(10, 0)
t = Time.new
1_000_000.times{ pos[a.sort_by{rand}.index(5)] += 1 }
Time.new - t # => 32.163499
pos
# => [99822, 100020, 100328, 100040, 99603, 100098, 100195, 100023, 99875, 99996]
 
U

Uwe Schmitt

||=20

|| > More info, including the Ruby code I used, available at
|| > http://eigenclass.org/hiki.rb?sort_by+rand+is+biased
|| >
|| >> does
|| >>
|| >> %w( a b c d ).sort{ rand <=3D> rand }
|| >>
|| >> help?

Hi, I'm new to ruby, but I use Python for some years now.
So I suggest the following 'pythonic' solution ;-)

def shuffle(a)
=20
b=3D[]
# decorate
a.each { |x| b << [rand,x]; }

# sort
b.sort! { |x,y| x[0] <=3D> y[0] }

# undecorate
b.collect { |x| x[1] }
end

a=3D%w{a b c d e f}
puts shuffle a =20

Is there a shorter or more elegant way to do this=20
'decorate-sort-undecorate' in ruby ?
In Python it is quite easy by using list comprehensions...


Greetings, Uwe
 
M

Michael Ulm

Mauricio said:
On Tue, Dec 06, 2005 at 08:23:52AM +0900, (e-mail address removed) wrote:
=20 for
=20
=20
That might be the very reason why it is ultimately much more biased tha= n
sort_by{ rand }... I'm keeping the latter.

Indeed,

ar =3D [0, 1, 2]; result =3D Hash.new(0)
100000.times {result[ar.sort {rand <=3D> rand}] +=3D 1}
p result
# =3D> {[2, 1, 0]=3D>25235,
[2, 0, 1]=3D>12285,
[1, 0, 2]=3D>12426,
[0, 1, 2]=3D>25111,
[1, 2, 0]=3D>12468,
[0, 2, 1]=3D>12475}

Michael

--=20
Michael Ulm
R&D Team
ISIS Information Systems Austria
tel: +43 2236 27551-219, fax: +43 2236 21081
e-mail: (e-mail address removed)
Visit our Website: www.isis-papyrus.com
 
M

Mauricio Fernández

Hi, I'm new to ruby, but I use Python for some years now.
So I suggest the following 'pythonic' solution ;-) [...]
Is there a shorter or more elegant way to do this
'decorate-sort-undecorate' in ruby ?
arr.sort_by{rand}

In Python it is quite easy by using list comprehensions...

What does that look like in Python?
Is it shorter than the above one? }:)
 
U

Uwe Schmitt

||=20
|| On Tue, Dec 06, 2005 at 07:32:34PM +0900, Uwe Schmitt wrote:
|| > Hi, I'm new to ruby, but I use Python for some years now.
|| > So I suggest the following 'pythonic' solution ;-)
|| [...]
|| > Is there a shorter or more elegant way to do this=20
|| > 'decorate-sort-undecorate' in ruby ?
||=20
|| arr.sort_by{rand}

that is not equivalent to my solution.
the python equivalent to yours is

arr.sort(key =3D lambda x: random())

||=20
|| > In Python it is quite easy by using list comprehensions...
|| What does that look like in Python?=20

from random import random

def shuffle(a):
b =3D [ (random(), i) for i in a]
b.sort()
return [ x[1] for x in b ]

print shuffle(range(10))

|| Is it shorter than the above one? }:)
||=20
|| --=20
|| Mauricio Fernandez

greetings, Uwe


||=20
 
M

Michael Ulm

Uwe said:
|| arr.sort_by{rand}

that is not equivalent to my solution.
the python equivalent to yours is

arr.sort(key = lambda x: random())

||
|| > In Python it is quite easy by using list comprehensions...
|| What does that look like in Python?

from random import random

def shuffle(a):
b = [ (random(), i) for i in a]
b.sort()
return [ x[1] for x in b ]

print shuffle(range(10))

I would have thought those two functions do the same thing
with the exception that shuffle does not sort in place.
I'm probably missing something obvious. Could you explain
the difference between the two versions?

Michael

--
Michael Ulm
R&D Team
ISIS Information Systems Austria
tel: +43 2236 27551-219, fax: +43 2236 21081
e-mail: (e-mail address removed)
Visit our Website: www.isis-papyrus.com
 
U

Uwe Schmitt

|| > || arr.sort_by{rand}
|| > that is not equivalent to my solution.
|| > the python equivalent to yours is
|| > arr.sort(key = lambda x: random())

|| > || > In Python it is quite easy by using list comprehensions...
|| > || What does that look like in Python?
|| >
|| > from random import random
|| >
|| > def shuffle(a):
|| > b = [ (random(), i) for i in a]
|| > b.sort()
|| > return [ x[1] for x in b ]
|| >
|| > print shuffle(range(10))

|| I would have thought those two functions do the same thing
|| with the exception that shuffle does not sort in place.
|| I'm probably missing something obvious. Could you explain
|| the difference between the two versions?

If you use random as sorting key, the result of a<=>b
will differ each time you evaluate a<=>b.
In my case I attach a random number to each item,
then I sort my list according to this number and remove
this 'decoration' afterwards.

Pythonprogrammers call this pattern "decorate-sort-undecorate".

As I do not now anything about the ruby interanls of sort,
I prefer my version.

Greetings, Uwe.
 
T

ts

U> If you use random as sorting key, the result of a<=>b
U> will differ each time you evaluate a<=>b.
U> In my case I attach a random number to each item,
U> then I sort my list according to this number and remove
U> this 'decoration' afterwards.

run

ri Enum#sort_by

and see what it do.


Guy Decoux
 
B

Bill Kelly

Hi,

From: "Uwe Schmitt said:
||
|| On Tue, Dec 06, 2005 at 07:32:34PM +0900, Uwe Schmitt wrote:
|| > Is there a shorter or more elegant way to do this
|| > 'decorate-sort-undecorate' in ruby ?
||
|| arr.sort_by{rand}

that is not equivalent to my solution.
the python equivalent to yours is

arr.sort(key = lambda x: random())

No, Array#sort_by does the 'decorate-sort-undecorate', aka.
Schwartzian Transform.

I don't read Python very well, but I think the ruby equivalent
to 'arr.sort(key = lambda x: random())' might be:

arr.sort { rand <=> rand }

But that is different from arr.sort_by {rand}


Regards,

Bill
 
U

Uwe Schmitt

||=20
||=20
|| U> If you use random as sorting key, the result of a<=3D>b
|| U> will differ each time you evaluate a<=3D>b.=20
|| U> In my case I attach a random number to each item,
|| U> then I sort my list according to this number and remove
|| U> this 'decoration' afterwards.
||=20
|| run
||=20
|| ri Enum#sort_by
||=20
|| and see what it do.

Thanks. As I told you before I'm new to ruby ;-)

Greetings, Uwe
 
M

Mauricio Fernández

=20
And that is a good thing? How can the sequence be "random" if it
never contains the same two numbers in row? (Remember Enigma...)

Nah, I misread genrand_real; the above shows it is possible.

--=20
Mauricio Fernandez
 
G

gabriele renzi

Uwe Schmitt ha scritto:
|| > || arr.sort_by{rand}
|| > that is not equivalent to my solution.
|| > the python equivalent to yours is
|| > arr.sort(key = lambda x: random())

|| > || > In Python it is quite easy by using list comprehensions...
|| > || What does that look like in Python?
|| >
|| > from random import random
|| >
|| > def shuffle(a):
|| > b = [ (random(), i) for i in a]
|| > b.sort()
|| > return [ x[1] for x in b ]
|| >
|| > print shuffle(range(10))

|| I would have thought those two functions do the same thing
|| with the exception that shuffle does not sort in place.
|| I'm probably missing something obvious. Could you explain
|| the difference between the two versions?

If you use random as sorting key, the result of a<=>b
will differ each time you evaluate a<=>b.
In my case I attach a random number to each item,
then I sort my list according to this number and remove
this 'decoration' afterwards.

Pythonprogrammers call this pattern "decorate-sort-undecorate".

Are you sure of what you say? AFAIK
sort(key=lambda x: somefunc..))

will do decorate/sort/undecorate under the hood, behaving like ruby's
#sort_by and like your version.
OTOH
sort(cmp=lambda x,y: something.. ) will behave like ruby's #sort.
 
S

Stephen Waits

Ok, I think I've beat sort_by{ rand } on large Arrays , by about 30%. I
ended up with this:

h = Hash.new
self.each { |v| h[rand(1000000000)] = v }
h.keys.sort.collect { |k| h[k] }

Explanation..

* Hash works better than an Array of Arrays
[[rand,x[0]],[rand,x[1]],...] because Array#<=> seems to be slow.

* Got some speedup by avoiding Float#<=>, hence the Integer version of rand.

* On rand(1000000000); chose 1billion rather arbitrarily to keep things
out of the range of BigNum. Oddly, using 2^31-1 caused BigNums to show
up. Ideally we'd use the largest FixNum, but I don't know how to
retrieve that.

* Bias should be minimal for most applications, even at rand(1Billion).

* Can it be improved further???

Find my benchmark results, and script below.

--Steve


user system total real
shuffle (10) 0.016000 0.000000 0.016000 ( 0.016000)
shuffle2 (10) 0.000000 0.000000 0.000000 ( 0.000000)
shuffle3 (10) 0.000000 0.000000 0.000000 ( 0.000000)
sort (10) 0.000000 0.000000 0.000000 ( 0.000000)
sort_by (10) 0.000000 0.000000 0.000000 ( 0.000000)

user system total real
shuffle (100) 0.156000 0.000000 0.156000 ( 0.156000)
shuffle2 (100) 0.047000 0.000000 0.047000 ( 0.046000)
shuffle3 (100) 0.031000 0.000000 0.031000 ( 0.032000)
sort (100) 0.125000 0.000000 0.125000 ( 0.125000)
sort_by (100) 0.031000 0.000000 0.031000 ( 0.031000)

user system total real
shuffle (1000) 1.625000 0.000000 1.625000 ( 1.625000)
shuffle2 (1000) 0.656000 0.000000 0.656000 ( 0.672000)
shuffle3 (1000) 0.266000 0.000000 0.266000 ( 0.281000)
sort (1000) 2.094000 0.000000 2.094000 ( 2.093000)
sort_by (1000) 0.453000 0.000000 0.453000 ( 0.453000)

user system total real
shuffle (10000) 16.469000 0.032000 16.501000 ( 16.578000)
shuffle2 (10000) 8.156000 0.000000 8.156000 ( 8.203000)
shuffle3 (10000) 3.344000 0.047000 3.391000 ( 3.391000)
sort (10000) 25.453000 0.000000 25.453000 ( 25.593000)
sort_by (10000) 4.765000 0.000000 4.765000 ( 4.796000)



#!/usr/bin/env ruby

require 'benchmark'

class Array

def shuffle
newarr = self.dup
2.times do
self.length.times do |x|
y = rand(self.length)
newarr[x], newarr[y] = newarr[y], newarr[x]
end
end
newarr
end

def shuffle2
tmparr = self.collect { |x| [rand(1000000000),x] }
tmparr.sort.collect { |x| x[1] }
end

def shuffle3
h = Hash.new
self.each { |v| h[rand(1000000000)] = v }
h.keys.sort.collect { |k| h[k] }
end

end

[10, 100, 1000, 10000].each do |n|

# populate an Array size n
a = Array.new(n) { |i| i }

# run our benchmarks
Benchmark.bm(16) do |bb|

# my method
bb.report("shuffle ("+n.to_s+")") do
100.times do
a.shuffle
end
end

# decorate, sort, undecorate
bb.report("shuffle2 ("+n.to_s+")") do
100.times do
a.shuffle2
end
end

# decorate, sort, undecorate with a hash
bb.report("shuffle3 ("+n.to_s+")") do
100.times do
a.shuffle3
end
end

# Array's sort
bb.report("sort ("+n.to_s+")") do
100.times do
a.sort { rand(10000) <=> 5000 }
end
end

# Enumerable's sort_by
bb.report("sort_by ("+n.to_s+")") do
100.times do
a.sort_by { rand }
end
end

printf("\n")

end

end
 
S

Stephen Waits

Stephen said:
h = Hash.new
self.each { |v| h[rand(1000000000)] = v }
h.keys.sort.collect { |k| h[k] }

OOPS!

I forgot to deal with collisions in the hash! :(

--Steve
 
S

Stephen Waits

Stephen said:
I forgot to deal with collisions in the hash! :(

Here's another shot.. still faster than sort_by{rand} on my system, and
now should be unbiased:

def shuffle3
h = Hash.new
self.each do |v|
k = rand(1000000000) until !h.has_key?(k)
h[k] = v
end
end

shuffle3 (10000) 3.625000 0.016000 3.641000 ( 3.656000)
sort_by (10000) 5.250000 0.000000 5.250000 ( 5.312000)

--Steve
 
M

Mauricio Fernández

Here's another shot.. still faster than sort_by{rand} on my system, and
now should be unbiased:

def shuffle3
h = Hash.new
self.each do |v|
k = rand(1000000000) until !h.has_key?(k)

This is not equivalent to
begin
k = rand(1000000000)
end until !h.has_key?


Compare
k = 1 until true
k # => nil
to
begin; k = 1 end until true
k # => 1


You forgot
h.keys.sort.collect { |k| h[k] }
end
end

shuffle3 (10000) 3.625000 0.016000 3.641000 ( 3.656000)
sort_by (10000) 5.250000 0.000000 5.250000 ( 5.312000)


After fixing shuffle3, I get:

module Enumerable
def shuffle3
h = Hash.new
k = rand(1000000000)
self.each do |v|
k = rand(1000000000) until !h.has_key?(k)
h[k] = v
end
h.keys.sort.collect { |k| h[k] }
end

def shuffle4
h = {}
k = rand(1000000000)
self.sort_by{ k = rand(1000000000) while h.has_key?(k); h[k] = k }
end
end

GC.disable
(1..10).shuffle4 # => [8, 3, 10, 2, 4, 6, 7, 5, 9, 1]
a = 1..100000
t0 = Time.new # => Tue Dec 06 22:13:56 CET 2005
a.shuffle4
(t1 = Time.new) - t0 # => 0.766411
a.shuffle3
(t2 = Time.new) - t1 # => 0.575967
a.sort_by{ rand }
Time.new - t2 # => 0.637288
 
S

Stephen Waits

Mauricio said:
=20
Compare
k =3D 1 until true
k # =3D> nil
to
begin; k =3D 1 end until true
k # =3D> 1

Oops.. I thought too much like C. Now I understand that the loop is=20
called 0 or more times, except in the case of "begin; end while" where=20
it's 1 or more. Thank you for teaching me!
After fixing shuffle3, I get:

I'm getting slightly different results, still better than sort_by{rand}=20
and better than your new shuffle4, but now only on larger Arrays. I=20
wonder too how performance might vary when Enumerable is mixed into ADTs=20
other than Array.

--Steve


user system total real
shuffle3 (1000) 0.375000 0.000000 0.375000 ( 0.375000)
shuffle4 (1000) 0.484000 0.000000 0.484000 ( 0.485000)
sort_by{rand} (1000) 0.328000 0.016000 0.344000 ( 0.344000)

user system total real
shuffle3 (10000) 3.906000 0.031000 3.937000 ( 3.953000)
shuffle4 (10000) 5.531000 0.047000 5.578000 ( 5.594000)
sort_by{rand} (10000) 4.672000 0.109000 4.781000 ( 4.812000)

user system total real
shuffle3 (100000) 49.703000 0.328000 50.031000 ( 50.280000)
shuffle4 (100000) 71.938000 0.453000 72.391000 ( 73.061000)
sort_by{rand} (100000) 62.000000 0.782000 62.782000 ( 63.451000)


module Enumerable

def shuffle3
h =3D {}
self.each do |v|
begin
k =3D rand(1000000000)
end while h.has_key?(k)
h[k] =3D v
end
h.keys.sort.collect { |k| h[k] }
end

def shuffle4
h =3D {}
k =3D rand(1000000000)
self.sort_by{ k =3D rand(1000000000) while h.has_key?(k); h[k] =3D k=
}
end

end
 
H

hmassa

Why not
a.dup.sort { 2*rand(12345) <=> 12345 }

? that way, we'd have NO colisions ever.
 
B

Brian Schröder

Why not
a.dup.sort { 2*rand(12345) <=3D> 12345 }

? that way, we'd have NO colisions ever.

What are you all trying to achieve by variations of
rand <=3D> rand,

why not simply use
rand(3) - 1
?

which is not in danger of decreasing randomness by combining random numbers=
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,769
Messages
2,569,577
Members
45,054
Latest member
LucyCarper

Latest Threads

Top