W> Uri Guttman wrote:
W> ) GM> What we need in these situations is something different called DAWG.
W> ) GM> Please have a look at this article to understand what I mean.
W> ) GM> http://en.wikipedia.org/wiki/DAWG
W> )
W> ) from that article:
W> )
W> ) for a query operation that tests whether a given string belongs
W> ) to the set in time proportional to its length
W> )
W> )
W> ) do you understand O() theory? hashes are close to O(1) while DAWG is
W> ) effectively O(N). not even close.
W> No, hash lookups on strings are O(N), where N is the length of the string.
W> This DAWG structure is the same as a trie in complexity. Also O(N).
i would disagree. the hash computation is O(N) on the length of the
string, and then you still need one string compare to check that the
bucket hit is correct. but both of those are tight linear scans over
contiguous bytes. the tree designs have to chase a pointer from node to
node for every character of the string, which is a much heavier O(N)
per step. so on average a hash lookup will be close to O(1) in table
operations, with minor differences based on string length. a tree
lookup pays the per-character node traversal in all cases. you need to
compare the dominant operation in each design, not just the string
length.
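to make that concrete, here is a rough python sketch (python only
because it is compact; a perl hash-of-hashes trie would look the same).
the names and data are mine, purely illustrative:

```python
# illustrative sketch, not from the thread: the same membership test
# done with a hash (python dict) and with a trie built from nested dicts.

def trie_insert(root, word):
    """walk/extend one node per character; the None key marks end-of-word."""
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node[None] = True

def trie_lookup(root, word):
    """one hop into a child node per character: O(len(word)), but each
    step is a separate dict probe, not a flat byte scan."""
    node = root
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return None in node

words = {"cat": True, "car": True, "dog": True}
trie = {}
for w in words:
    trie_insert(trie, w)

# the hash lookup scans the string twice (hash it, then one final
# compare on the bucket hit); the trie lookup chases a node per char.
print("cat" in words, trie_lookup(trie, "cat"))  # True True
print("cow" in words, trie_lookup(trie, "cow"))  # False False
```

both lookups are linear in the string length, but the hash does its
linear work in two flat passes while the trie pays a node traversal per
character, which is the point about comparing the dominant operation.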
W> The only difference is a constant factor, which is quite in favour of
W> hashes, because calculating a hash is such a simple operation, and can be
W> unrolled. (Also, in Perl, it's internal, which is a big speed win.)
that isn't just a constant factor. the two O(N) parts are doing very
different amounts of work per character: a sequential byte scan versus
a pointer chase to another node.
W> The problem with DAWG over a trie is that building it is effectively akin
W> to data compression, especially when it grows large and you have multiple
W> choices how to proceed.
that is another layer of work that adds more cost to the tree than
hashing has. note that even in hashing you occasionally need to rebuild
the hash when it gets too full. that cost is still accounted for in the
close to O(1) growth (it is a large part of why it isn't exactly O(1)).
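a toy sketch of that amortized rebuild (mine, made-up names, not how
perl or python actually implement it internally):

```python
# toy chained hash table that rebuilds (doubles) when it gets too full.
# the occasional O(n) rebuild is spread over the n inserts that forced
# it, which is why inserts and lookups stay amortized close to O(1).

class ToyHash:
    def __init__(self):
        self.nbuckets = 4
        self.buckets = [[] for _ in range(self.nbuckets)]
        self.size = 0
        self.rebuilds = 0

    def _bucket(self, key):
        return self.buckets[hash(key) % self.nbuckets]

    def insert(self, key):
        if key not in self._bucket(key):
            self._bucket(key).append(key)
            self.size += 1
        if self.size > 0.75 * self.nbuckets:
            self._rebuild()

    def _rebuild(self):
        """double the table and rehash everything: O(n), but rare."""
        self.rebuilds += 1
        old = self.buckets
        self.nbuckets *= 2
        self.buckets = [[] for _ in range(self.nbuckets)]
        for chain in old:
            for key in chain:
                self.buckets[hash(key) % self.nbuckets].append(key)

    def contains(self, key):
        return key in self._bucket(key)

h = ToyHash()
for i in range(100):
    h.insert("key%d" % i)
print(h.contains("key42"), h.contains("nope"), h.rebuilds)
```

100 inserts trigger only a handful of doublings (4 to 256 buckets), so
the rebuild cost per insert shrinks as the table grows.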
W> It's perfect for static dictionaries where you build it once, and then use
W> it for millions of lookups. Dynamic structures seem nigh impossible.
W> I think the best way to build this data structure would be to first build a
W> simple trie, and then run a sort of compressor on it that combines the
W> redundant suffixen.
either way, the OP can't code it nor can he compare the growth of either
algorithm.
uri