Quicker finding strings? Alternative for array, hash, set?

P

Patrick Put

I've been searching for this information already, but cannot really find
the answer I'm looking for.

I have to find values in a list of some 40,000 strings. Putting all
these in an array is not likely to be the quickest way.... and it isn't.
A set seems to be faster already. Maybe using a map where I only use the
keys is also faster. But I'm wondering: is there no such thing as a tree
to store and retrieve items in a fast way?

If I remember well, for n elements in a list it should take log(n) steps
then, where it takes n/2 just going through an array if I only want to
pick up the element. In my situation it is much closer to n because most
of the times the value is NOT in the list.

Any ideas?
 
J

Jesús Gabriel y Galán

I've been searching for this information already, but cannot really find
the answer I'm looking for.

I have to find values in a list of some 40,000 strings. Putting all
these in an array is not likely to be the quickest way.... and it isn't.
A set seems to be faster already. Maybe using a map where I only use the
keys is also faster. But I'm wondering: is there no such thing as a tree
to store and retrieve items in a fast way?

If I remember well, for n elements in a list it should take log(n) steps
then, where it takes n/2 just going through an array if I only want to
pick up the element. In my situation it is much closer to n because most
of the times the value is NOT in the list.

Any ideas?

If you are only concerned with lookup time, a hash has amortized O(1)
lookup time:

require 'benchmark'

words = ("aaaa".."zzzz").to_a # => 456,976 strings

hash = {}
words.each {|w| hash[w] = true}

TIMES = 10
Benchmark.bmbm do |x|
x.report("Array#find existing") do
TIMES.times do
words.find {|w| w == "mmmm"}
end
end
x.report("Array#find non-existing") do
TIMES.times do
words.find {|w| w == "mmmmm"}
end
end
x.report("Hash#[] existing") do
TIMES.times do
hash["mmmm"]
end
end
x.report("Hash#[] non-existing") do
TIMES.times do
hash["mmmmm"]
end
end
end



jesus@jesus-laptop:~/personal/programacion/ruby/Benchmarks/lib$ ruby
hash_array2.rb
Rehearsal -----------------------------------------------------------
Array#find existing 3.160000 0.700000 3.860000 ( 3.870061)
Array#find non-existing 6.280000 1.630000 7.910000 ( 7.935304)
Hash#[] existing 0.000000 0.000000 0.000000 ( 0.000019)
Hash#[] non-existing 0.000000 0.000000 0.000000 ( 0.000018)
------------------------------------------------- total: 11.770000sec

user system total real
Array#find existing 3.130000 0.710000 3.840000 ( 3.866618)
Array#find non-existing 5.820000 1.580000 7.400000 ( 9.195417)
Hash#[] existing 0.000000 0.000000 0.000000 ( 0.000030)
Hash#[] non-existing 0.000000 0.000000 0.000000 ( 0.000032)


Jesus.
 
P

Patrick Put

Rehearsal -----------------------------------------------------------
Array#find existing 3.160000 0.700000 3.860000 ( 3.870061)
Array#find non-existing 6.280000 1.630000 7.910000 ( 7.935304)
Hash#[] existing 0.000000 0.000000 0.000000 ( 0.000019)
Hash#[] non-existing 0.000000 0.000000 0.000000 ( 0.000018)
------------------------------------------------- total: 11.770000sec

user system total real
Array#find existing 3.130000 0.710000 3.840000 ( 3.866618)
Array#find non-existing 5.820000 1.580000 7.400000 ( 9.195417)
Hash#[] existing 0.000000 0.000000 0.000000 ( 0.000030)
Hash#[] non-existing 0.000000 0.000000 0.000000 ( 0.000032)

That is impressive indeed! No need for any other containers here I
guess. ;-) Thanks for showing me this!

I ran a program over some 1.8 billion lines of data this weekend. Good
thing it was more slowed down because of IO than my Ruby programming. It
would have been faster of course, but the damage is not that big...
 
R

Robert Klemme

2009/2/2 Patrick Put said:
Rehearsal -----------------------------------------------------------
Array#find existing 3.160000 0.700000 3.860000 ( 3.870061)
Array#find non-existing 6.280000 1.630000 7.910000 ( 7.935304)
Hash#[] existing 0.000000 0.000000 0.000000 ( 0.000019)
Hash#[] non-existing 0.000000 0.000000 0.000000 ( 0.000018)
------------------------------------------------- total: 11.770000sec

user system total real
Array#find existing 3.130000 0.710000 3.840000 ( 3.866618)
Array#find non-existing 5.820000 1.580000 7.400000 ( 9.195417)
Hash#[] existing 0.000000 0.000000 0.000000 ( 0.000030)
Hash#[] non-existing 0.000000 0.000000 0.000000 ( 0.000032)

That is impressive indeed! No need for any other containers here I
guess. ;-) Thanks for showing me this!

If you do not want to attach information to those words then a Set is
more appropriate: same O(1) lookup but no key value mapping.

Cheers

robert
 
P

Patrick Put

If you do not want to attach information to those words then a Set is
more appropriate: same O(1) lookup but no key value mapping.
Is the implementation "under the hood" the same? That would mean my
choice to use a Set was not so bad after all.
 
J

Jesús Gabriel y Galán

Is the implementation "under the hood" the same? That would mean my
choice to use a Set was not so bad after all.

If you check here:

http://www.ruby-doc.org/core/classes/Set.html

you can see the source code for each method. For example this is the method new:

# File lib/set.rb, line 64
def initialize(enum = nil, &block) # :yields: o
@hash ||= Hash.new

enum.nil? and return

if block
enum.each { |o| add(block[o]) }
else
merge(enum)
end
end

So you can see, it uses a Hash underneath. Let's take a look at the add method:

# File lib/set.rb, line 195
def add(o)
@hash[o] = true
self
end

So, it's using a hash with true as the values (the same I did in my
example). Of course, if the true doesn't mean anything to you (which
shouldn't as you are not using the values), using a Set is a bit more
clear, and has convenience methods for Set "arithmetic".

Jesus.
 
P

Patrick Put

Very clear. I should/could have checked this myself. I'm only using Ruby
for a few months now, and loving it more and more.

Thanks guys!
 
S

Sandor Szücs

Is the implementation "under the hood" the same?

Set uses a Hash as storage.
See set.rb:
--------------8<-------------
# The equality of each couple of elements is determined according to
# Object#eql? and Object#hash, since Set uses Hash as storage.
-------------->8-------------

regards, Sandor Sz=FCcs
--
 
K

Ken Bloom

2009/2/2 Patrick Put said:
Rehearsal -----------------------------------------------------------
Array#find existing 3.160000 0.700000 3.860000 ( 3.870061)
Array#find non-existing 6.280000 1.630000 7.910000 ( 7.935304)
Hash#[] existing 0.000000 0.000000 0.000000 ( 0.000019)
Hash#[] non-existing 0.000000 0.000000 0.000000 ( 0.000018)
------------------------------------------------- total: 11.770000sec

user system total real
Array#find existing 3.130000 0.710000 3.840000 ( 3.866618)
Array#find non-existing 5.820000 1.580000 7.400000 ( 9.195417)
Hash#[] existing 0.000000 0.000000 0.000000 ( 0.000030)
Hash#[] non-existing 0.000000 0.000000 0.000000 ( 0.000032)

That is impressive indeed! No need for any other containers here I
guess. ;-) Thanks for showing me this!

If you do not want to attach information to those words then a Set is
more appropriate: same O(1) lookup but no key value mapping.

Cheers

robert

A trie might also be a good idea. Some tries were proposed as solutions
to Ruby Quiz #103 (http://rubyquiz.com/quiz103.html)
 
R

Robert Klemme

2009/2/3 Ken Bloom said:
2009/2/2 Patrick Put said:
Rehearsal -----------------------------------------------------------
Array#find existing 3.160000 0.700000 3.860000 ( 3.870061)
Array#find non-existing 6.280000 1.630000 7.910000 ( 7.935304)
Hash#[] existing 0.000000 0.000000 0.000000 ( 0.000019)
Hash#[] non-existing 0.000000 0.000000 0.000000 ( 0.000018)
------------------------------------------------- total: 11.770000sec

user system total real
Array#find existing 3.130000 0.710000 3.840000 ( 3.866618)
Array#find non-existing 5.820000 1.580000 7.400000 ( 9.195417)
Hash#[] existing 0.000000 0.000000 0.000000 ( 0.000030)
Hash#[] non-existing 0.000000 0.000000 0.000000 ( 0.000032)

That is impressive indeed! No need for any other containers here I
guess. ;-) Thanks for showing me this!

If you do not want to attach information to those words then a Set is
more appropriate: same O(1) lookup but no key value mapping.
A trie might also be a good idea. Some tries were proposed as solutions
to Ruby Quiz #103 (http://rubyquiz.com/quiz103.html)

Good idea. But this is only a realistic option if the trie is
implemented in C I guess - otherwise the standard String#hash and
Set#include? are faster.

Kind regards

robert
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,766
Messages
2,569,569
Members
45,042
Latest member
icassiem

Latest Threads

Top