M
Martin DeMello
I'm developing a set of ruby utilities to work with DAWG (directed
acyclic word graph) files, as produced by Graham Toal's program at
http://www.gtoal.com/wordgames/dawgutils/
Essentially a dawg is a variant of a trie, with a fixed-length (4 byte)
record containing a letter, a pointer to another node (the root of a
subtree of words starting with the current prefix), and bits flagging
the end of node and end of word markers for the current record
(see http://www.wutka.com/dawg.html)
The problem is that this is a very C-oriented data structure - a node is
stored as a consecutive series of records, which you read until you find
the letter you're looking for or you come to the end of node marker. The
pointer is likewise just an offset into the contiguous memory block the
dawg occupies (or into the file, but slurping the whole thing into
memory is one of the first optimisations I tried). All this ends up
being horribly inefficient in ruby - indeed, the bottleneck is the
get_children function, which I've implemented thus:
def children(index)
eon = eil
a = nil
until(eon)
a = memoized_read_record(index)
yield a
eon = a.eon?
index += 1
end
end
The entire code, and the dawg file I'm using to test it are up on
http://rubyforge.org/frs/?group_id=562 - could people please take a look
at it and see what I can do without dropping into C? (It could be that
my anagram algorithm is naive too - suggestions welcomed)
martin
acyclic word graph) files, as produced by Graham Toal's program at
http://www.gtoal.com/wordgames/dawgutils/
Essentially a dawg is a variant of a trie, with a fixed-length (4 byte)
record containing a letter, a pointer to another node (the root of a
subtree of words starting with the current prefix), and bits flagging
the end of node and end of word markers for the current record
(see http://www.wutka.com/dawg.html)
The problem is that this is a very C-oriented data structure - a node is
stored as a consecutive series of records, which you read until you find
the letter you're looking for or you come to the end of node marker. The
pointer is likewise just an offset into the contiguous memory block the
dawg occupies (or into the file, but slurping the whole thing into
memory is one of the first optimisations I tried). All this ends up
being horribly inefficient in ruby - indeed, the bottleneck is the
get_children function, which I've implemented thus:
def children(index)
eon = eil
a = nil
until(eon)
a = memoized_read_record(index)
yield a
eon = a.eon?
index += 1
end
end
The entire code, and the dawg file I'm using to test it are up on
http://rubyforge.org/frs/?group_id=562 - could people please take a look
at it and see what I can do without dropping into C? (It could be that
my anagram algorithm is naive too - suggestions welcomed)
martin