Database-like Algorithmic Problem

C

Christophe Mckeon

hi all,

ok here's the scoop. i have a collection of hashes all with an arbitrary
number of symbolic keys pointing to arbitrary objects. i have a template
hash also containing symbol keys pointing to 'matching objects', i.e.
objects to be matched with the objects in the aforementioned collection,
and i want to pull out the first match. an example:

# please don't psychoanalyze the below :)
collection = [
{ :foo => 'python', :bar => 'php', :x => 69 },
{ :foo => 'scheme', :mama => 'ruby', :papa => 'C' },
{ :x => 386 },
{ :mama => 'ruby', :papa => 'simula', :lil_bro => 'fortran' }
]

template1 = { :mama => 'ruby', :papa => 'C' }
template2 = { :x => Numeric }

match_first(template1, collection)
# should return: { :foo => 'scheme', :mama => 'ruby', :papa => 'C' }
match_all(template2, collection)
# should return: [{:x => 386 }, {:foo => 'python', :bar => 'php', :x =>
69}]

so obviously === is being used internally as is evidenced by Numeric in
template2, but what kind of datastructures/algorithms should i be
looking at for the most efficient (time-wise) implementation, keeping in
mind, there may be many hashes in the collection and those hashes are
generally not very large, i.e. contain few key/value pairs. i'm not too
savy on the taxonomy of algorithms so any pointers to what i should be
looking at to read up on would be appreciated.

my potentially naive solution is something like the following (in full
color http://pastie.caboo.se/130033):

require 'set'

@collection = {}

def add hash
hash.each do |key,value|
(@collection[key] ||= Set.new) << hash
end
end

def candidates template
sets = []
template.keys.each do |key|
sets << @collection[key] if @collection.include? key
end
if sets.empty?
return sets
else
return sets.inject(sets.pop) do |intersection, set|
intersection & set
end
end
end

def match template, hash
template.each do |key, match|
return false unless match === hash[key]
end
return true
end

def match_first template
candidates(template).each do |hash|
return hash if match(template, hash)
end
return nil
end

def match_all template
return candidates(template).select do |hash|
match(template, hash)
end
end

[
{ :foo => 'python', :bar => 'php', :x => 69 },
{ :foo => 'scheme', :mama => 'ruby', :papa => 'C' },
{ :x => 386, :papa => 'clause' },
{ :mama => 'ruby', :papa => 'simula', :lil_bro => 'fortran' },
{ :mama => 1946, :boo => 'far', :x => 'x'}
].each do |hash|
add hash
end

puts match_first({ :mama => 'ruby', :papa => 'C' }).inspect
puts match_all({:foo => String}).inspect

thanks for reading this far!
_c
 
R

Robert Klemme

2007/12/18 said:
ok here's the scoop. i have a collection of hashes all with an arbitrary
number of symbolic keys pointing to arbitrary objects. i have a template
hash also containing symbol keys pointing to 'matching objects', i.e.
objects to be matched with the objects in the aforementioned collection,
and i want to pull out the first match. an example:

# please don't psychoanalyze the below :)

You are crazy. Please contact a doctor ASAP. Oh, sorry. :)
collection = [
{ :foo => 'python', :bar => 'php', :x => 69 },
{ :foo => 'scheme', :mama => 'ruby', :papa => 'C' },
{ :x => 386 },
{ :mama => 'ruby', :papa => 'simula', :lil_bro => 'fortran' }
]

template1 = { :mama => 'ruby', :papa => 'C' }
template2 = { :x => Numeric }

match_first(template1, collection)
# should return: { :foo => 'scheme', :mama => 'ruby', :papa => 'C' }
match_all(template2, collection)
# should return: [{:x => 386 }, {:foo => 'python', :bar => 'php', :x =>
69}]

so obviously === is being used internally as is evidenced by Numeric in
template2, but what kind of datastructures/algorithms should i be
looking at for the most efficient (time-wise) implementation, keeping in
mind, there may be many hashes in the collection and those hashes are
generally not very large, i.e. contain few key/value pairs. i'm not too
savy on the taxonomy of algorithms so any pointers to what i should be
looking at to read up on would be appreciated.

my potentially naive solution is something like the following (in full
color http://pastie.caboo.se/130033):

thanks for reading this far!

I think you are on a pretty good way. Here's what I would do differently:

1. Do no recreate all those candidate collections for every
match_first call. Instead, stuff everything into a class and provide a
method that will create an /index/ of the data (see 2).

2. I would use Hashes to store your indexes, i.e. instead of creating
candidates all over again, create them once (see 1) and keep them
there.

Kind regards

robert
 
C

Christophe Mckeon

You are crazy. Please contact a doctor ASAP. Oh, sorry. :)
:)


I think you are on a pretty good way. Here's what I would do
differently:

...

thanks robert. yeah i'd like to do some caching, but as hashes are
constantly
added and removed from the collection, i'll have to weigh up the
overhead of managing a cache, maybe even implement it and do some
profiling w/ and w/o. guess i'll have to read up on cache algs.

regards,
_c
 
C

cruiserdan

Interesting problem. Aside from seeing a doctor, my suggestions are in
the comments in the revised variant below:

---

require 'set'

# encapsulate
class DB

# use hash's default syntax
def initialize
@c = Hash.new{ | hash, key | hash[key] = Set.new }
end

# was: add, use << to follow ruby idiom for array, set
# store val with hash for quicker comparison
def <<(hash)
hash.each do |key,val|
@c[key] << [ val, hash ]
end
end

# need a delete operation ...
def delete(hash)
@c.delete(hash)
end

# was: match_all, use select to follow ruby enumerable idiom
# determine candidates on the fly, avoid intersection code using Set
def select(template)
good = Set.new; bad = Set.new
template.each do |key,val1|
(@c[key]).each do |val2,hash|
( val1 === val2 ? good : bad ) << hash unless bad.include?
hash
end
end
return good.to_a
end

# was: match_first, use select to follow ruby enumerable idiom
# determine candidates on the fly, avoid intersection code using Set
def find(template)
good = Set.new; bad = Set.new
template.each do |key,val1|
(@c[key]).each do |val2,hash|
( val1 === val2 ? good : bad ) << hash unless bad.include?
hash
end
return good.to_a.first unless good.empty?
end
return nil
end
end


---

HTH!

Dan Yoder
http://dev.zeraweb.com/
 
C

Christophe Mckeon

cruiserdan said:
class DB
...
end

thanks for that, i benchmarked it and it is many times faster. now i
have to see how i'm going to do caching etc..

_c
 
C

Christophe Mckeon

Christophe said:
thanks for that, i benchmarked it and it is many times faster. now i
have to see how i'm going to do caching etc..

_c

oops, i read my benchmarks wrong, it is actually not *that* much faster

yours: 9.020000 0.940000 9.960000 ( 11.283543)
mine: 9.220000 1.380000 10.600000 ( 11.127394)

but is an improvement, so thanks. i'll do some more comprehensive
benchmarks
and post them here in a bit.
 
C

Christophe Mckeon

there's a bug in yours. not sure where yet. try

d = DB.new
d << {:a => 2, :b => 1, :c => 'b'}
puts d.find({ :a => Numeric, :b => 'a' }).inspect

returns {:a => 2, :b => 1, :c => 'b'}
 
C

cruiserdan

Okay, here's a fixed version.

There were 2 problems. The first was that I forgot to remove the bad
(unmatched) hashes before returning a result. The second was that you
have to go through every key in the template to find a match (I was
returning after the first match). Thus find is no more efficient than
select due to the way the data is stored.

I also took out the 'unless bad.include? hash' clause from the compare
because I think it actually is slower that way, depending on what you
are comparing. I thought it would be faster than the original version
posted because I don't have to generate a set of candidates first, so
the initial benchmarks were disappointing. But maybe this version will
do even better. I look forward to seeing your results.

---

# new version of select
def select(template)
good = Set.new; bad = Set.new
template.each do |key,val1|
(@c[key]).each do |val2,hash|
( val1 === val2 ? good : bad ) << hash # unless bad.include?
hash
end
end
return (good - bad).to_a
end

# new version of find
def find(template)
r = select template
r.first unless r.empty?
end
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,262
Messages
2,571,052
Members
48,769
Latest member
Clifft

Latest Threads

Top