Inserting hash value slows down as table gets larger

P

Philip Rhoades

People,

In response to people's suggestions about speeding up my script by
replacing output to many small files with output to one large file I
have implemented a hash table I can write out with YAML. However, I
find as the hash table gets larger, the script slows down . . but when I
try and work out what is happening by producing a small test script that
does more or less the same thing, I can't reproduce the problem . .

The test script is:


#!/usr/bin/ruby

h1 = Hash.new( 0 )
srand = 0

# seeds = [ '01', '01', '01', '22' ]
# seeds = [ '01', '01', '20', '22' ]
# seeds = [ '01', '32', '20', '22' ]
seeds = [ '50', '32', '20', '22' ]

for a in '01' .. seeds[0]
start = Time.now
# puts a
for b in '01' .. seeds[1]
# puts b
for c in '01' .. seeds[2]
# puts c
for d in '01' .. seeds[3]
# print "#{d} "
h1[ "#{a}.#{b}.#{c}.#{d}" ] = Array.new(2){ Array.new(1){
Array.new( 20, rand(1) ) } }
end
# puts
end
# puts
end
# puts
stop = Time.now
puts stop - start
end


The script is faster with the hash insertion commented out of course and
the time between iterations of the outer loop are constant in both
scripts - but they are longer and STILL constant with the insertion not
commented out in the test script. In my actual script, when the
insertion is not commented out the time between iterations in the outer
loop gets longer and longer eg 36 sec -> a few minutes before I kill it
about half way through . .

Can anyone suggest a way of working out why the production hash
insertion behaves differently and somewhat unexpectedly?

Thanks,

Phil.
--
Philip Rhoades

GPO Box 3411
Sydney NSW 2001
Australia
E-mail: (e-mail address removed)
 
D

Dave Baldwin

I suspect is it the memory related to object creation for Array.new(2){ =
Array.new(1){ Array.new( 20, rand(1) ) } }

With this line as originally
h1[ "#{a}.#{b}.#{c}.#{d}" ] =3D Array.new(2){ Array.new(1){ =
Array.new( 20, rand(1) ) } }
I get 20.8 seconds

set to=20
h1[ "#{a}.#{b}.#{c}.#{d}" ] =3D 3
I get 2.54 seconds

set to=20
Array.new(2){ Array.new(1){ Array.new( 20, rand(1) ) } }

I get 3.75 seconds

set to=20
# h1[ "#{a}.#{b}.#{c}.#{d}" ] =3D Array.new(2){ Array.new(1){ =
Array.new( 20, rand(1) ) } }

I get 0.47 seconds

Dave.

People,
=20
In response to people's suggestions about speeding up my script by =
replacing output to many small files with output to one large file I =
have implemented a hash table I can write out with YAML. However, I =
find as the hash table gets larger, the script slows down . . but when I =
try and work out what is happening by producing a small test script that =
does more or less the same thing, I can't reproduce the problem . .
=20
The test script is:
=20
=20
#!/usr/bin/ruby
=20
h1 =3D Hash.new( 0 )
srand =3D 0
=20
# seeds =3D [ '01', '01', '01', '22' ]
# seeds =3D [ '01', '01', '20', '22' ]
# seeds =3D [ '01', '32', '20', '22' ]
seeds =3D [ '50', '32', '20', '22' ]
=20
for a in '01' .. seeds[0]
start =3D Time.now
# puts a
for b in '01' .. seeds[1]
# puts b
for c in '01' .. seeds[2]
# puts c
for d in '01' .. seeds[3]
# print "#{d} "
h1[ "#{a}.#{b}.#{c}.#{d}" ] =3D Array.new(2){ Array.new(1){ = Array.new( 20, rand(1) ) } }
end
# puts
end
# puts
end
# puts
stop =3D Time.now
puts stop - start
end
=20
=20
The script is faster with the hash insertion commented out of course =
and the time between iterations of the outer loop are constant in both =
scripts - but they are longer and STILL constant with the insertion not =
commented out in the test script. In my actual script, when the =
insertion is not commented out the time between iterations in the outer =
loop gets longer and longer eg 36 sec -> a few minutes before I kill it =
about half way through . .
=20
Can anyone suggest a way of working out why the production hash =
insertion behaves differently and somewhat unexpectedly?
 
B

Brian Candler

Dave Baldwin wrote in post #987924:
set to
Array.new(2){ Array.new(1){ Array.new( 20, rand(1) ) } }

I get 3.75 seconds

set to
# h1[ "#{a}.#{b}.#{c}.#{d}" ] = Array.new(2){ Array.new(1){
Array.new( 20, rand(1) ) } }

I get 0.47 seconds

Try these too:

h1[ "#{a}.#{b}.#{c}.#{d}".freeze ] = 3

h1[ "#{a}.#{b}.#{c}.#{d}".freeze ] = Array.new etc

It's a foible of Ruby that if you use a String as a hash key, it is
dup'd and frozen - unless it was frozen already, which is what the above
does.

The other thing to beware of is YAML - try using Marshal instead as a
comparison, because YAML might not be very efficient when dealing with
very large data structures. I think you said you needed a human-editable
form, but this would rule out YAML as the source of the problem.

If it is, then you can look at alternative YAML libraries, or at JSON.
 
P

Philip Rhoades

Brian,


Dave Baldwin wrote in post #987924:
set to
Array.new(2){ Array.new(1){ Array.new( 20, rand(1) ) } }

I get 3.75 seconds

set to
# h1[ "#{a}.#{b}.#{c}.#{d}" ] = Array.new(2){ Array.new(1){
Array.new( 20, rand(1) ) } }

I get 0.47 seconds

Try these too:

h1[ "#{a}.#{b}.#{c}.#{d}".freeze ] = 3

h1[ "#{a}.#{b}.#{c}.#{d}".freeze ] = Array.new etc

It's a foible of Ruby that if you use a String as a hash key, it is
dup'd and frozen - unless it was frozen already, which is what the above
does.


Adding "freeze" improves the situation by about a second on my machine:

18.3 -> 17.3

The other thing to beware of is YAML - try using Marshal instead as a
comparison, because YAML might not be very efficient when dealing with
very large data structures. I think you said you needed a human-editable
form, but this would rule out YAML as the source of the problem.


Yes that is very slow too but that is another issue - I can live with
that for the convenience of having human readable data . .

If it is, then you can look at alternative YAML libraries, or at JSON.


The other thought I had was putting the data into a sqlite3 db - I will
try it and see what happens but I don't imagine it would be faster than
a memory based hash table?

Thanks,

Phil.
--
Philip Rhoades

GPO Box 3411
Sydney NSW 2001
Australia
E-mail: (e-mail address removed)
 
B

Brian Candler

The other thought I had was putting the data into a sqlite3 db - I will
try it and see what happens but I don't imagine it would be faster than
a memory based hash table?

Probably not, but at least it persists, and it may be cheaper to make
updates to a large external data structure than rebuild the entire
structure in memory each time.

In your application, do you really need to use Strings?

Your sample code looks like it's handling numeric-style data (although I
realise this is just a test case for the problems you're having).
Integers in the range -2^30..+2^30 (or larger in on a 64-bit machine)
have their values encoded within the reference, so no memory allocation
is done.

Or, if you're handling a relatively small set of unique values, you
could use symbols instead of strings. Each symbol reference again
doesn't allocate any memory; it just points to the entry in the symbol
table.

Or you could use frozen strings and share the references.

LABEL1 = "00".freeze
LABEL2 = "01".freeze
MAP = {LABEL1 => LABEL1, LABEL2=>LABEL2}
a = MAP["00"]
puts a.object_id
puts LABEL1.object_id

Although that's more work than symbols, it might be useful depending on
your use case. For example, you could replace a subset of the values you
see with these frozen strings (which covers the majority of the data),
whilst still allowing arbitrary other strings.
 
P

Philip Rhoades

Brian,


Probably not, but at least it persists, and it may be cheaper to make
updates to a large external data structure than rebuild the entire
structure in memory each time.


Yes, using a db was much slower

In your application, do you really need to use Strings?


I guess not but I preferred a tag of:

01.01.01.01

to:

1010101

Your sample code looks like it's handling numeric-style data (although I
realise this is just a test case for the problems you're having).
Integers in the range -2^30..+2^30 (or larger in on a 64-bit machine)
have their values encoded within the reference, so no memory allocation
is done.


Are you talking about the hash key or the hash values? - the values in
the real script will all be floats . .

Or, if you're handling a relatively small set of unique values, you
could use symbols instead of strings. Each symbol reference again
doesn't allocate any memory; it just points to the entry in the symbol
table.


Not sure what you mean - example?

Or you could use frozen strings and share the references.

LABEL1 = "00".freeze
LABEL2 = "01".freeze
MAP = {LABEL1 => LABEL1, LABEL2=>LABEL2}
a = MAP["00"]
puts a.object_id
puts LABEL1.object_id


I ran that code but I don't understand how it helps . .

Although that's more work than symbols, it might be useful depending on
your use case. For example, you could replace a subset of the values you
see with these frozen strings (which covers the majority of the data),
whilst still allowing arbitrary other strings.


Still not clear - examples?

The other thing that occurred to me was that on my 64-bit machine maybe
I could run 2-3 threads for inserting into the hash table?

BTW, I did end up changing to JSON - a YAML dump on a 32000x22 hash
table was deadly . .

Thanks,

Phil.
--
Philip Rhoades

GPO Box 3411
Sydney NSW 2001
Australia
E-mail: (e-mail address removed)
 
B

Brian Candler

Philip Rhoades wrote in post #988264:
Are you talking about the hash key or the hash values?
Either.

- the values in
the real script will all be floats . .

Then they will be allocated on the heap, just like strings. I presume
you're aware of the inherent inaccuracy of floats (in any language), and
are OK with this.
=> false
Not sure what you mean - example?

a = []
a[0] = :foo
a[1] = :foo
a[2] = :foo

puts a[0].object_id
puts a[1].object_id
puts a[2].object_id
Or you could use frozen strings and share the references.

LABEL1 = "00".freeze
LABEL2 = "01".freeze
MAP = {LABEL1 => LABEL1, LABEL2=>LABEL2}
a = MAP["00"]
puts a.object_id
puts LABEL1.object_id


I ran that code but I don't understand how it helps . .

It uses less memory if you have (say) millions of identical strings. It
may help garbage collection performance, but not much else.
Still not clear - examples?

Suppose the strings "foo" and "bar" comprise 80% of your hash keys or
values. Then mapping them to the same frozen string means that you only
have one instance of string "foo" and one instance of string "bar" in
the system, instead of (say) millions of distinct strings. You can still
use individual strings for the other 20%.

This is really an edge optimisation though, you really shouldn't need to
be worrying about these things - if they are significant, then perhaps
ruby is the wrong language for the problem in hand.
The other thing that occurred to me was that on my 64-bit machine maybe
I could run 2-3 threads for inserting into the hash table?

Noooo..... even in ruby 1.9, there is a global interpreter lock.
Multiple threads gain you nothing really, except for threads which are
blocked on I/O.

Even if there were not, having multiple threads contending on the same
hash (and controlling access via, say, a mutex) would be pretty much
guaranteed to make performance worse not better.

Regards,

Brian.
 
P

Philip Rhoades

Brian,


Philip Rhoades wrote in post #988264:

Either.


Right - for the following in my test script:

h1[ "#{a}.#{b}.#{c}.#{d}" ] = Array.new(2){ Array.new(1){ Array.new( 20,
rand(100) ) } }
h1[ "#{a}.#{b}.#{c}.#{d}".freeze ] = Array.new(2){ Array.new(1){
Array.new( 20, rand(100) ) } }
h1[ "#{a}.#{b}.#{c}.#{d}".to_i ] = Array.new(2){ Array.new(1){
Array.new( 20, rand(100) ) } }
h1[ "#{a}.#{b}.#{c}.#{d}".to_i.freeze ] = Array.new(2){ Array.new(1){
Array.new( 20, rand(100) ) } }

I get the following times:

18.350s
18.113s
4.724s
4.896s

So I guess I should live with the slight decrease of readability when
searching for particular results in the JSON output file by using ints
instead of strings for the hash keys.

Then they will be allocated on the heap, just like strings. I presume
you're aware of the inherent inaccuracy of floats (in any language), and
are OK with this.

=> false


I suppose I could convert them all to six or eight digit ints . . they
are measures of biological diversity and changing them backwards and
forwards is a bit of a hassle but maybe it is worth doing for the speed
advantage? - I will try my test script with floats and see what happens.

Not sure what you mean - example?

a = []
a[0] = :foo
a[1] = :foo
a[2] = :foo

puts a[0].object_id
puts a[1].object_id
puts a[2].object_id


The reason the numbers are called "seeds" is that they correspond to the
seed for the random dumber generator in the C/C++ simulation program -
so they are all unique for each of the 32,000 simulations.

Or you could use frozen strings and share the references.

LABEL1 = "00".freeze
LABEL2 = "01".freeze
MAP = {LABEL1 => LABEL1, LABEL2=>LABEL2}
a = MAP["00"]
puts a.object_id
puts LABEL1.object_id


I ran that code but I don't understand how it helps . .

It uses less memory if you have (say) millions of identical strings.


Not the case for the keys and unlikely for the values.

It
may help garbage collection performance, but not much else


Suppose the strings "foo" and "bar" comprise 80% of your hash keys or
values. Then mapping them to the same frozen string means that you only
have one instance of string "foo" and one instance of string "bar" in
the system, instead of (say) millions of distinct strings. You can still
use individual strings for the other 20%.


Unfortunately this doesn't correspond to my case . .

This is really an edge optimisation though, you really shouldn't need to
be worrying about these things - if they are significant, then perhaps
ruby is the wrong language for the problem in hand.


Noooo..... even in ruby 1.9, there is a global interpreter lock.
Multiple threads gain you nothing really, except for threads which are
blocked on I/O.

Right.


Even if there were not, having multiple threads contending on the same
hash (and controlling access via, say, a mutex) would be pretty much
guaranteed to make performance worse not better.


OK - oh well it was worth a thought!

Many thanks!

Regards,

Phil.
--
Philip Rhoades

GPO Box 3411
Sydney NSW 2001
Australia
E-mail: (e-mail address removed)
 
B

Brian Candler

"#{a}.#{b}.#{c}.#{d}".to_i.freeze

I don't think that does what you expect. Try it in irb.
 
P

Philip Rhoades

Brian,


I don't think that does what you expect. Try it in irb.


Oops! . . forgot to take the dots out . .

In any case, in the actual live script (as opposed to the test script) I
did the same comparison of:

$recs[ "#{$seed_pre}.#{$seed_post}.#{$seed}.#{sgen}".freeze ] = $iarr
$recs[ "#{$seed_pre}#{$seed_post}#{$seed}#{sgen}".to_i.freeze ] = $iarr

and unfortunately there was no real difference because of the amount of
other processing that happens inside the loops . .

Thanks,

Phil.
--
Philip Rhoades

GPO Box 3411
Sydney NSW 2001
Australia
E-mail: (e-mail address removed)
 
P

Philip Rhoades

People,

To add to my last post:


Brian,


Philip Rhoades wrote in post #988264:

Either.


Right - for the following in my test script:

h1[ "#{a}.#{b}.#{c}.#{d}" ] = Array.new(2){ Array.new(1){ Array.new( 20,
rand(100) ) } }
h1[ "#{a}.#{b}.#{c}.#{d}".freeze ] = Array.new(2){ Array.new(1){
Array.new( 20, rand(100) ) } }
h1[ "#{a}.#{b}.#{c}.#{d}".to_i ] = Array.new(2){ Array.new(1){
Array.new( 20, rand(100) ) } }
h1[ "#{a}.#{b}.#{c}.#{d}".to_i.freeze ] = Array.new(2){ Array.new(1){
Array.new( 20, rand(100) ) } }

I get the following times:

18.350s
18.113s
4.724s
4.896s

So I guess I should live with the slight decrease of readability when
searching for particular results in the JSON output file by using ints
instead of strings for the hash keys.

Then they will be allocated on the heap, just like strings. I presume
you're aware of the inherent inaccuracy of floats (in any language), and
are OK with this.

=> false


I suppose I could convert them all to six or eight digit ints . . they
are measures of biological diversity and changing them backwards and
forwards is a bit of a hassle but maybe it is worth doing for the speed
advantage? - I will try my test script with floats and see what happens.

Or, if you're handling a relatively small set of unique values, you
could use symbols instead of strings. Each symbol reference again
doesn't allocate any memory; it just points to the entry in the symbol
table.

Not sure what you mean - example?

a = []
a[0] = :foo
a[1] = :foo
a[2] = :foo

puts a[0].object_id
puts a[1].object_id
puts a[2].object_id


The reason the numbers are called "seeds" is that they correspond to the
seed for the random dumber generator in the C/C++ simulation program -
so they are all unique for each of the 32,000 simulations.

Or you could use frozen strings and share the references.

LABEL1 = "00".freeze
LABEL2 = "01".freeze
MAP = {LABEL1 => LABEL1, LABEL2=>LABEL2}
a = MAP["00"]
puts a.object_id
puts LABEL1.object_id


I ran that code but I don't understand how it helps . .

It uses less memory if you have (say) millions of identical strings.


Not the case for the keys and unlikely for the values.

It
may help garbage collection performance, but not much else


Suppose the strings "foo" and "bar" comprise 80% of your hash keys or
values. Then mapping them to the same frozen string means that you only
have one instance of string "foo" and one instance of string "bar" in
the system, instead of (say) millions of distinct strings. You can still
use individual strings for the other 20%.


Unfortunately this doesn't correspond to my case . .

This is really an edge optimisation though, you really shouldn't need to
be worrying about these things - if they are significant, then perhaps
ruby is the wrong language for the problem in hand.


Noooo..... even in ruby 1.9, there is a global interpreter lock.
Multiple threads gain you nothing really, except for threads which are
blocked on I/O.

Right.


Even if there were not, having multiple threads contending on the same
hash (and controlling access via, say, a mutex) would be pretty much
guaranteed to make performance worse not better.


OK - oh well it was worth a thought!


I ran the normal number of the inner two loops and only one of the
outermost loop (ie 1/50th of the total) and got the following with
different versions of Ruby:

ruby-1.8.7-p334 [ x86_64 ] = 31.41
ruby-1.9.2-p180 [ x86_64 ] = 19.53
rbx-head [ ] = 128.42

Many thanks!

Regards,

Phil.

--
Philip Rhoades

GPO Box 3411
Sydney NSW 2001
Australia
E-mail: (e-mail address removed)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,743
Messages
2,569,478
Members
44,898
Latest member
BlairH7607

Latest Threads

Top