Optimisation help needed

M

Martin DeMello

I'm developing a set of ruby utilities to work with DAWG (directed
acyclic word graph) files, as produced by Graham Toal's program at
http://www.gtoal.com/wordgames/dawgutils/

Essentially a dawg is a variant of a trie, with a fixed-length (4 byte)
record containing a letter, a pointer to another node (the root of a
subtree of words starting with the current prefix), and bits flagging
the end of node and end of word markers for the current record
(see http://www.wutka.com/dawg.html)

The problem is that this is a very C-oriented data structure - a node is
stored as a consecutive series of records, which you read until you find
the letter you're looking for or you come to the end of node marker. The
pointer is likewise just an offset into the contiguous memory block the
dawg occupies (or into the file, but slurping the whole thing into
memory is one of the first optimisations I tried). All this ends up
being horribly inefficient in ruby - indeed, the bottleneck is the
get_children function, which I've implemented thus:

def children(index)
eon = eil
a = nil
until(eon)
a = memoized_read_record(index)
yield a
eon = a.eon?
index += 1
end
end

The entire code, and the dawg file I'm using to test it are up on
http://rubyforge.org/frs/?group_id=562 - could people please take a look
at it and see what I can do without dropping into C? (It could be that
my anagram algorithm is naive too - suggestions welcomed)

martin
 
L

linus sellberg

Martin said:
at it and see what I can do without dropping into C? (It could be that
my anagram algorithm is naive too - suggestions welcomed)

One of the most efficient algorithms for finding anagrams sorts both the
grammar and the word you want to find anagrams for before it starts to
compare anything.

(that is, bar => abr)
 
R

Robert Klemme

Martin DeMello said:
I'm developing a set of ruby utilities to work with DAWG (directed
acyclic word graph) files, as produced by Graham Toal's program at
http://www.gtoal.com/wordgames/dawgutils/

Essentially a dawg is a variant of a trie, with a fixed-length (4 byte)
record containing a letter, a pointer to another node (the root of a
subtree of words starting with the current prefix), and bits flagging
the end of node and end of word markers for the current record
(see http://www.wutka.com/dawg.html)

The problem is that this is a very C-oriented data structure - a node is
stored as a consecutive series of records, which you read until you find
the letter you're looking for or you come to the end of node marker. The
pointer is likewise just an offset into the contiguous memory block the
dawg occupies (or into the file, but slurping the whole thing into
memory is one of the first optimisations I tried). All this ends up
being horribly inefficient in ruby - indeed, the bottleneck is the
get_children function, which I've implemented thus:

def children(index)
eon = eil
a = nil
until(eon)
a = memoized_read_record(index)
yield a
eon = a.eon?
index += 1
end
end

The entire code, and the dawg file I'm using to test it are up on
http://rubyforge.org/frs/?group_id=562 - could people please take a look
at it and see what I can do without dropping into C? (It could be that
my anagram algorithm is naive too - suggestions welcomed)

martin

This is what I came up with so far - it's cleaner but unfortunately not
faster. Interestingly enough the StrinIO variant is slower...

Have to go to bed now. HTH

robert
 
R

Robert Klemme

Robert Klemme said:
This is what I came up with so far - it's cleaner but unfortunately not
faster. Interestingly enough the StrinIO variant is slower...

Have to go to bed now. HTH

Ha, I must've been totally blind. Try this version instead!

You probably can make it even faster if you marshal the dictionary once
you've read it.

Good night

robert
 
M

Martin DeMello

linus sellberg said:
One of the most efficient algorithms for finding anagrams sorts both the
grammar and the word you want to find anagrams for before it starts to
compare anything.

(that is, bar => abr)

That doesn't extend itself well to anagrams with wildcards, though.
DAWGs are nice for those.

martin
 
M

Martin DeMello

Robert Klemme said:
Ha, I must've been totally blind. Try this version instead!

You probably can make it even faster if you marshal the dictionary once
you've read it.

Thanks - looks a lot more rubyish than mine, too! Will test it out when
I get home.

martin
 
R

Robert Klemme

Martin DeMello said:
Thanks - looks a lot more rubyish than mine, too! Will test it out when
I get home.

martin

There we some errors still. I did a bit of fixing and tidying up. Most
interesting I find the simplification achieved by "EOW =
DawgNode.new.freeze".

Have fun

robert
 
M

Martin DeMello

Robert Klemme said:
There we some errors still. I did a bit of fixing and tidying up. Most
interesting I find the simplification achieved by "EOW =
DawgNode.new.freeze".

Subtle problem:
current.children[char] = eoword ? DawgNode::EOW : nodes[ptr]

doesn't work because a node could be both EOW *and* have children. But
looks like a very nice clean approach on the whole - will happily
replace my old code with this.

martin
 
M

Martin DeMello

Martin DeMello said:
Robert Klemme said:
There we some errors still. I did a bit of fixing and tidying up. Most
interesting I find the simplification achieved by "EOW =
DawgNode.new.freeze".

Subtle problem:
current.children[char] = eoword ? DawgNode::EOW : nodes[ptr]

doesn't work because a node could be both EOW *and* have children. But
looks like a very nice clean approach on the whole - will happily
replace my old code with this.

(And yes, it turned out to be a very small fix - merely adding the eow
attribute back to every node and saying

if char != 0
current.children[char] = nodes[ptr]
current.eow = eoword
end

plus a few supporting changes elsewhere.)

martin
 
M

Martin DeMello

Martin DeMello said:
(And yes, it turned out to be a very small fix - merely adding the eow
attribute back to every node and saying

if char != 0
current.children[char] = nodes[ptr]
current.eow = eoword
end

My turn to be too sleepy for this :) Of course it's a bit more tricky
than that.

current.children[char].eow = eoword if ptr > 1

seems to work; I'll bang on it some more in the morning.

martin
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,764
Messages
2,569,566
Members
45,041
Latest member
RomeoFarnh

Latest Threads

Top