Indexing hash with longer strings faster?


jablan

Hi, I have some weird results measuring Ruby's hash operations,
particularly indexing a hash with a two-member string array. Have a look:
>> require 'benchmark'
=> false
>> puts(Benchmark.measure do
?>   h = {}
?>   1000000.times do |i|
?>     i1 = rand(100)
?>     i2 = rand(100)
?>     a = [i1.to_s, i2.to_s]
?>     h[a] ||= 0
?>     h[a] += 1
?>   end
?>   puts h.size
?> end)
10000
 12.900000   0.050000  12.950000 ( 13.088721)
=> nil
>> puts(Benchmark.measure do
?>   h = {}
?>   1000000.times do |i|
?>     i1 = rand(100)
?>     i2 = rand(100)
?>     a = [i1.to_s + '00000', i2.to_s + '00000']
?>     h[a] ||= 0
?>     h[a] += 1
?>   end
?>   puts h.size
?> end)
10000
  7.700000   0.040000   7.740000 (  7.858381)
=> nil

In other words, I just made the array members slightly longer, and
working with the hash became considerably faster.

Why's that happening?

BTW,

jablan-mbp:~ $ ruby -v
ruby 1.8.6 (2008-08-11 patchlevel 287) [universal-darwin9.0]
 

Robert Dober

Benchmarking is tricky business. I am not sure my benchmarks are worth
much, but they do not confirm your results. BTW, I am using: ruby 1.9.1p243
(2009-07-16 revision 24175) [i686-linux]

First, you really should not benchmark inside irb. Second, you should
use bmbm for rehearsals, but even with that you can get bitten by the
tons of things happening on your machine.
Nevertheless, for what it's worth - and because I adore benchmarking ;)
- here are my code and results:

require 'benchmark'

N = 100_000
5.times do
  Benchmark.bmbm do | bm |
    h = Hash::new( 0 )
    bm.report( "2 chars" ) do
      N.times do
        h[ [rand(100).to_s, rand(100).to_s] ] += 1
      end
    end
    h = Hash::new( 0 )
    bm.report( "7 chars" ) do
      N.times do
        h[ [rand(100).to_s + "00000", rand(100).to_s + "00000" ] ] += 1
      end
    end
  end
end
robert@siena:~/log/ruby/ML 19:50:13
519/25 > ruby bm1.rb
Rehearsal -------------------------------------------
2 chars 0.700000 0.010000 0.710000 ( 0.953483)
7 chars 0.960000 0.010000 0.970000 ( 1.367117)
---------------------------------- total: 1.680000sec

user system total real
2 chars 0.740000 0.020000 0.760000 ( 1.031102)
7 chars 1.010000 0.020000 1.030000 ( 1.323072)
Rehearsal -------------------------------------------
2 chars 0.720000 0.000000 0.720000 ( 0.955180)
7 chars 0.980000 0.010000 0.990000 ( 1.320693)
---------------------------------- total: 1.710000sec

user system total real
2 chars 0.710000 0.000000 0.710000 ( 0.923019)
7 chars 0.900000 0.010000 0.910000 ( 1.262338)
Rehearsal -------------------------------------------
2 chars 0.700000 0.000000 0.700000 ( 0.926117)
7 chars 0.990000 0.020000 1.010000 ( 1.356475)
---------------------------------- total: 1.710000sec

user system total real
2 chars 0.710000 0.000000 0.710000 ( 0.940901)
7 chars 0.880000 0.020000 0.900000 ( 1.269465)
Rehearsal -------------------------------------------
2 chars 0.700000 0.010000 0.710000 ( 1.016537)
7 chars 0.990000 0.010000 1.000000 ( 1.360516)
---------------------------------- total: 1.710000sec

user system total real
2 chars 0.720000 0.010000 0.730000 ( 0.920533)
7 chars 0.920000 0.000000 0.920000 ( 1.198681)
Rehearsal -------------------------------------------
2 chars 0.720000 0.010000 0.730000 ( 0.946056)
7 chars 1.020000 0.010000 1.030000 ( 1.343575)
---------------------------------- total: 1.760000sec

user system total real
2 chars 0.720000 0.010000 0.730000 ( 1.023185)
7 chars 0.930000 0.000000 0.930000 ( 1.211184)




--
Toutes les grandes personnes ont d'abord été des enfants, mais peu
d'entre elles s'en souviennent.

All adults have been children first, but not many remember.

[Antoine de Saint-Exupéry]
 

Mladen Jablanović

First, thanks a lot for your benchmarking tips - I'm new to this
business - and thanks for your benchmark-friendly rewrite of my code.

However, I still keep getting unexpected results:

jablan-mbp:dev $ ruby bm1.rb
Rehearsal -------------------------------------------
2 chars 0.840000 0.000000 0.840000 ( 0.863001)
7 chars 0.590000 0.010000 0.600000 ( 0.597103)
---------------------------------- total: 1.440000sec

user system total real
2 chars 0.900000 0.000000 0.900000 ( 0.909053)
7 chars 0.620000 0.000000 0.620000 ( 0.644989)
Rehearsal -------------------------------------------
2 chars 0.910000 0.000000 0.910000 ( 0.924786)
7 chars 0.620000 0.010000 0.630000 ( 0.631793)
---------------------------------- total: 1.540000sec

user system total real
2 chars 0.900000 0.000000 0.900000 ( 0.909935)
7 chars 0.570000 0.000000 0.570000 ( 0.578445)
Rehearsal -------------------------------------------
2 chars 0.880000 0.010000 0.890000 ( 0.892464)
7 chars 0.650000 0.000000 0.650000 ( 0.672226)
---------------------------------- total: 1.540000sec

user system total real
2 chars 0.910000 0.010000 0.920000 ( 0.947815)
7 chars 0.580000 0.000000 0.580000 ( 0.593960)
Rehearsal -------------------------------------------
2 chars 0.890000 0.010000 0.900000 ( 0.896618)
7 chars 0.650000 0.000000 0.650000 ( 0.660125)
---------------------------------- total: 1.550000sec

user system total real
2 chars 0.900000 0.000000 0.900000 ( 0.919042)
7 chars 0.590000 0.010000 0.600000 ( 0.595202)
Rehearsal -------------------------------------------
2 chars 0.900000 0.000000 0.900000 ( 0.913919)
7 chars 0.660000 0.010000 0.670000 ( 0.682484)
---------------------------------- total: 1.570000sec

user system total real
2 chars 0.920000 0.000000 0.920000 ( 0.950286)
7 chars 0.590000 0.010000 0.600000 ( 0.601025)

I will try to run the code on other platforms and/or ruby versions and
I will post the findings... The bad thing is that I'm not doing this
just for kicks, I need to optimise a nasty script... :(
 

Robert Dober

2009/7/28 Mladen Jablanović <[email protected]>:
Is this 1.8?
Given your results it is not completely unthinkable that the hash
values for "00".."99" have more collisions, and if collision handling
is done with linked lists, that might slow things down.
A priori I would not believe in that, but why not check?

In 1.9 I get the expected:
[*"00".."99"].map( &:hash ).uniq.size
=> 100

can you try this in 1.8
...map{ |x| x.hash }.uniq.size

?

HTH
Robert
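
A minimal sketch of the effect Robert describes: if many keys share a
hash value, a chained table must scan a bucket's list linearly on every
lookup. The bucket count below is an arbitrary illustrative assumption,
not MRI's actual table size:

# Distribute the 10_000 two-string keys over a fixed number of buckets,
# the way a chained hash table would, and report the chain lengths.
# BUCKETS is an illustrative choice, not what MRI really uses.
BUCKETS = 8192

keys = []
0.upto(99) { |i| 0.upto(99) { |j| keys << [i.to_s, j.to_s] } }

chains = Hash.new(0)
keys.each { |k| chains[k.hash % BUCKETS] += 1 }

lengths = chains.values
puts "occupied buckets: #{lengths.size}"
puts "longest chain:    #{lengths.max}"
# Every key in a chain longer than 1 pays extra eql? comparisons on
# each lookup; that is where the slowdown would come from.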
 

Robert Dober

Sorry, I forgot two things. First, you did not produce "00", "01", ...
(your keys are "0", "1", ...), thus it should be

[*0..99].map( &:to_s ).map( &:hash ).uniq.size

And secondly, a negative result does not say anything about collisions;
only a positive one (a result < 100) would. In that case we would need
to look into the source (I do not think profiling would show internal
hash behavior).
Cheers
Robert
 

Mladen Jablanović

You pointed in the right direction here!

Although your example gives me 100 as the result (all the hashes are
unique), here is what suggests that hash collisions are indeed the
reason for the slowness:

[*0..99].map{|i| [*0..99].map{|j| [i.to_s,j.to_s].hash}}.flatten.uniq.size
=> 1756
[*0..99].map{|i| [*0..99].map{|j| [i.to_s + '00000',j.to_s + '00000'].hash}}.flatten.uniq.size
=> 10000
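
Grouping the 10,000 keys by their hash value makes the clustering
concrete; a minimal sketch (run under 1.8 to reproduce the effect):

# Count how many [i, j] string pairs share each Array#hash value.
# Under this 1.8 build the 10_000 pairs collapse onto 1756 distinct
# values, so some values must be shared by several keys.
groups = Hash.new(0)
0.upto(99) do |i|
  0.upto(99) do |j|
    groups[[i.to_s, j.to_s].hash] += 1
  end
end

puts "distinct hash values: #{groups.size}"
puts "largest cluster:      #{groups.values.max}"
printf "average cluster:      %.2f\n", 10_000.0 / groups.size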

Can you please just post the ruby 1.9 results here?

Thanks!
 

Robert Dober


OMG my code was sloppy, but you got it right. Here is the 1.9 code:

506/18 > cat hashes.rb && ruby -v hashes.rb
#!/usr/local/bin/ruby -w
# encoding: utf-8
# file: /home/robert/log/ruby/ML/hashes.rb
p [*0..99].map{|i| [*0..99].map{|j| [i.to_s,j.to_s].hash}}.flatten.uniq.size
p [*0..99].map{|i| [*0..99].map{|j| [i.to_s + '00000',j.to_s + '00000'].hash}}.flatten.uniq.size

# vim: sts=2 sw=2 ft=ruby expandtab nu :
ruby 1.9.1p243 (2009-07-16 revision 24175) [i686-linux]
10000
10000

as expected :).
Cheers
Robert
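
For what it's worth, the 1.8/1.9 difference is consistent with how 1.8
combined element hashes in Array#hash: roughly, start from the length,
then rotate left by one bit and XOR in each element's hash (per MRI
1.8's array.c). XOR of two small, clustered string hashes discards
entropy. A sketch of that combining scheme, as an illustrative
approximation rather than an exact reimplementation:

# Approximate 1.8-style Array#hash combining: rotate-left-by-one, then
# XOR in each element's hash. This illustrates the scheme; it is not a
# byte-exact port of array.c (word size, signedness etc. differ).
def ary_hash_18(ary, word_bits = 64)
  mask = (1 << word_bits) - 1
  h = ary.length
  ary.each do |e|
    h = ((h << 1) | (h >> (word_bits - 1))) & mask  # rotate left by 1
    h ^= e.hash & mask
  end
  h
end

pairs = []
0.upto(99) { |i| 0.upto(99) { |j| pairs << [i.to_s, j.to_s] } }
p pairs.map { |pr| ary_hash_18(pr) }.uniq.size
# With 1.8's weak String#hash the count collapses well below 10_000;
# with well-mixed element hashes (1.9+) it stays at 10_000.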
 
