Hashes are good, but not good enough.

George Mpouras

We all find hashes pretty convenient, and helpful to make things fast; but
the problem with hashes is that they consume too much memory on big trees,
and they are too slow for hundred millions of keys. Try it yourself.

What we need in these situations is something different called DAWG.

Please have a look at this article to understand what I mean.

http://en.wikipedia.org/wiki/DAWG



I tried to implement this DAWG data structure with Perl but I found it VERY
difficult.

What are your thoughts about the necessity of this new data structure?
 
Uri Guttman

GM> We all find hashes pretty convenient, and helpful to make things
GM> fast; but the problem with hashes is that they consume too much
GM> memory on big trees, and they are too slow for hundred millions of
GM> keys. Try it yourself.

no need to try it. that large a hash is foolish in all ways. you are
choosing the wrong tool.

GM> What we need in these situations is something different called DAWG.

GM> Please have a look at this article to understand what I mean.

GM> http://en.wikipedia.org/wiki/DAWG

from that article:

for a query operation that tests whether a given string belongs
to the set in time proportional to its length


do you understand O() theory? hashes are close to O(1) while DAWG is
effectively O(N). not even close.

GM> I tried to implement this DAWG data structure with Perl but I
GM> found it VERY difficult.

then you also have difficulty understanding the speed
differences. undoubtedly you have run into thrashing which is not a hash
problem but a ram problem.

GM> What are your thoughts about the necessity of this new data structure.

not needed in core perl. hashes are far faster. perl has always traded
off ram size for more speed in its design. a DAWG will never be able to
replace a hash for that reason. by far most hashes fit in ram just
fine. if you are scaling that large you likely need a db or something
disk based anyhow.

uri
 
George Mpouras

Uri Guttman said:
> no need to try it. that large a hash is foolish in all ways. you are
> choosing the wrong tool.
[...]



The millions can also be 2 or 1000 keys. There is something very important
here: DAWG structures are much smaller than hashes, and searching them is
also faster.
 
Willem

Uri Guttman wrote:
) GM> What we need in these situations is something different called DAWG.
) GM> Please have a look at this article to understand what I mean.
) GM> http://en.wikipedia.org/wiki/DAWG
)
) from that article:
)
) for a query operation that tests whether a given string belongs
) to the set in time proportional to its length
)
)
) do you understand O() theory? hashes are close to O(1) while DAWG is
) effectively O(N). not even close.

No, hash lookups on strings are O(N), where N is the length of the string.
This DAWG structure is the same as a trie in complexity. Also O(N).

The only difference is a constant factor, which is quite in favour of
hashes, because calculating a hash is such a simple operation, and can be
unrolled. (Also, in Perl, it's internal, which is a big speed win.)

The problem with DAWG over a trie is that building it is effectively akin
to data compression, especially when it grows large and you have multiple
choices how to proceed.

It's perfect for static dictionaries where you build it once, and then use
it for millions of lookups. Dynamic structures seem nigh impossible.

I think the best way to build this data structure would be to first build a
simple trie, and then run a sort of compressor on it that combines the
redundant suffixen.
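
For what it's worth, a rough sketch of that in pure Perl (nested hashes as
trie nodes; the "compressor" is just a bottom-up pass that shares identical
subtrees, which is suffix merging in its simplest form; share() and the
word list are only illustrative):

    use strict;
    use warnings;

    # Step 1: a plain trie out of nested hashes; {_end} marks a stored word.
    my %trie;
    for my $w (qw(tap taps top tops)) {
        my $node = \%trie;
        $node = $node->{$_} ||= {} for split //, $w;
        $node->{_end} = 1;
    }

    # Step 2: "compress" by sharing identical subtrees bottom-up. Two nodes
    # with the same children and the same _end flag get merged into one.
    my %registry;
    sub share {
        my ($node) = @_;
        for my $ch (grep { $_ ne '_end' } keys %$node) {
            $node->{$ch} = share($node->{$ch});    # canonicalise children first
        }
        my $sig = join ';',
            map { $_ eq '_end' ? '_end' : "$_=$node->{$_}" }    # ref addr as id
            sort keys %$node;
        return $registry{$sig} ||= $node;
    }
    my $dawg = share(\%trie);
    printf "distinct nodes after sharing: %d\n", scalar keys %registry;

On those four toy words the eight trie nodes should collapse to five shared
ones; on a real word list the savings are much larger.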


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 
Willem

Uri Guttman wrote:
) hashes don't change their speed based on key length.

Excuse me?
You have to traverse the whole length of the key to calculate the hash.
And after that, to check that you've got the right entry, you have to
compare the search key to the hash key. Again O(N) in key length.
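
To make that concrete, here is a toy djb2-style string hash (not perl's
actual internal hash function, just the same shape of loop): it touches
every character of the key once, so computing it is already O(N) in key
length before you even reach the bucket.

    use strict;
    use warnings;

    # Toy djb2-style string hash: one cheap step per character, so just
    # computing the hash is O(N) in the length of the key.
    sub toy_hash {
        my ($key) = @_;
        my $h = 5381;
        for my $ch (split //, $key) {
            $h = ($h * 33 + ord $ch) % 2**32;
        }
        return $h;
    }

    printf "%-15s -> %u\n", $_, toy_hash($_) for qw(cat cats concatenation);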


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 
Ted Zlatanov

GM> We all find hashes pretty convenient, and helpful to make things fast; but
GM> the problem with hashes is that they consume too much memory on big trees,
GM> and they are too slow for hundred millions of keys.

This is true and any experienced programmer will tell you not to put
hundreds of millions of keys in a hash in any language. Can you give a
specific case where this is needed?

GM> What we need in these situations is something different called DAWG.

I think that's very debatable. At the very least you should understand
the data and the functional requirements when you deal with non-trivial
data management on a medium-to-large scale. This is what good database
designers will help you accomplish. Here's a sample of the questions I
would ask:

- is access batched, random, or mixed?

- is the data clustered in any way? Can it be partitioned?

- are updates random, append, or both?

- what are the performance requirements for each of the above?

- what's the budget?

- how stable does the environment need to be (e.g. 99.999% uptime)?

- what's the security model?

- have we done this before? Has this been done before?

etc etc. A different data structure will not help you with most of
these questions. Also, remember that VENDORS AND PROGRAMMERS LIE. They
lie about cost, stability, performance, and support. So suspect anyone
that tells you managing this much data is easy with ANY product,
including Perl.

Ted
 
Uri Guttman

W> Uri Guttman wrote:
W> ) GM> What we need in these situations is something different called DAWG.
W> ) GM> Please have a look at this article to understand what I mean.
W> ) GM> http://en.wikipedia.org/wiki/DAWG
W> )
W> ) from that article:
W> )
W> ) for a query operation that tests whether a given string belongs
W> ) to the set in time proportional to its length
W> )
W> )
W> ) do you understand O() theory? hashes are close to O(1) while DAWG is
W> ) effectively O(N). not even close.

W> No, hash lookups on strings are O(N), where N is the length of the string.
W> This DAWG structure is the same as a trie in complexity. Also O(N).

i would disagree. the hash computation is O(N) on the length of the
string and then you still need a string compare to check that the bucket
hit is correct. but that is a linear scan of a string. the tree designs
have to go node to node for the length of the string. it is a higher
order amount of work which is O(N). so on average a hash lookup will be
close to O(1) with minor differences based on string length. a tree
lookup will be O(N) on string length in all cases. you need to compare
the largest operations to each other, not the string length.

W> The only difference is a constant factor, which is quite in favour of
W> hashes, because calculating a hash is such a simple operation, and can be
W> unrolled. (Also, in Perl, it's internal, which is a big speed win.)

that isn't just a constant factor. the O(N) parts are doing different
amounts of work.

W> The problem with DAWG over a trie is that building it is effectively akin
W> to data compression, especially when it grows large and you have multiple
W> choices how to proceed.

that is another higher level of work that adds more to the tree than
hashing. note that even in hashing you occasionally need to rebuild the
hash when it gets too filled. that is still accounted for in the close
to O(1) growth (it is a large part of why it isn't exactly O(1)).

W> It's perfect for static dictionaries where you build it once, and then use
W> it for millions of lookups. Dynamic structures seem nigh impossible.

W> I think the best way to build this data structure would be to first build a
W> simple trie, and then run a sort of compressor on it that combines the
W> redundant suffixen.

either way, the OP can't code it nor can he compare the growth of either
algorithm.

uri
 
Willem

Uri Guttman wrote:
) W> No, hash lookups on strings are O(N), where N is the length of the string.
) W> This DAWG structure is the same as a trie in complexity. Also O(N).
)
) i would disagree. the hash computation is O(N) on the length of the
) string and then you still need a string compare to check that the bucket
) hit is correct. but that is a linear scan of a string. the tree designs
) have to go node to node for the length of the string. it is a higher
) order amount of work which is O(N). so on average a hash lookup will be
) close to O(1) with minor differences based on string length. a tree
) lookup will be O(N) on string length in all cases. you need to compare
) the largest operation to each other, not the string length.

You should read up on big-oh notation. You certainly cannot sweep an O(N)
operation under the carpet and claim that something is O(1), because the
O(N) bit is blazingly fast. The whole basis of big-oh is that O(f(x)) is
*always* based on the largest *growing* factor of f(x), and that any
constant factors are irrelevant to the big-oh.

To restate: hashes and tries have the same big-oh time complexity for key
lookups, but the constant factor probably differs by a few orders of magnitude.

) W> The only difference is a constant factor, which is quite in favour of
) W> hashes, because calculating a hash is such a simple operation, and can be
) W> unrolled. (Also, in Perl, it's internal, which is a big speed win.)
)
) that isn't just a constant factor. the O(N) parts are doing different
) amounts of work.

I don't follow. The amount of work done for a single character, when
comparing the string to the key in a hash, is constant. The amount of
work done for a single character, when following nodes in a tree, is
also constant. Since they are both constant, the difference is a constant
factor.

) W> The problem with DAWG over a trie is that building it is effectively akin
) W> to data compression, especially when it grows large and you have multiple
) W> choices how to proceed.
)
) that is another higher level of work that adds more to the tree than
) hashing. note that even in hashing you occasionally need to rebuild the
) hash when it gets too filled. that is still accounted for in the close
) to O(1) growth (it is a large part of why it isn't exactly O(1)).

This part of my reply had nothing to do with the time complexity.
The DAWG structure is effectively ROM. You build it once, at quite
a high cost, and then you get very fast lookups with relatively small
space usage.

) W> It's perfect for static dictionaries where you build it once, and then use
) W> it for millions of lookups. Dynamic structures seem nigh impossible.
)
) W> I think the best way to build this data structure would be to first build a
) W> simple trie, and then run a sort of compressor on it that combines the
) W> redundant suffixen.
)
) either way, the OP can't code it nor can he compare the growth of either
) algorithm.

I think what the OP especially missed is the time complexity of inserting
keys, as opposed to a trie. As a side note, it is also impossible to
store data along with the keys, because the uniqueness of the end nodes
is lost.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 
Uri Guttman

W> Uri Guttman wrote:
W> ) W> No, hash lookups on strings are O(N), where N is the length of the string.
W> ) W> This DAWG structure is the same as a trie in complexity. Also O(N).
W> )
W> ) i would disagree. the hash computation is O(N) on the length of the
W> ) string and then you still need a string compare to check that the bucket
W> ) hit is correct. but that is a linear scan of a string. the tree designs
W> ) have to go node to node for the length of the string. it is a higher
W> ) order amount of work which is O(N). so on average a hash lookup will be
W> ) close to O(1) with minor differences based on string length. a tree
W> ) lookup will be O(N) on string length in all cases. you need to compare
W> ) the largest operation to each other, not the string length.

W> You should read up on big-oh notation. You certainly cannot sweep an O(N)
W> operation under the carpet and claim that something is O(1), because the
W> O(N) bit is blazingly fast. The whole basis of big-oh is that O(f(x)) is
W> *always* based on the largest *growing* factor of f(x), and that any
W> constant factors are irrelevant to the big-oh.

you can do that if the work per unit is less than the tree scanning
operation. it is done all the time. you need to look at the largest work
unit and compare them. a disk sort will use ram sorts but the ram sort
work is irrelevant as the disk sort will dominate the run time. the
dominant part of a hash is just a hash generation which is much faster
in a scan than going down a tree. a hash and tree may both be O(N) but
they have different dominant parts. hashing is considered close to O(1)
because it doesn't make a difference in how many elements you have, it
is close to a constant time per lookup. any tree is dependent on the
size of the tree itself. that means it is O(log N) at best and O(N) at
worst. the scale factor is the number of elements in the structure, not
the length of the keys. this is how you do proper algorithm analysis -
you need to find the growth curve and that is dependent on the count of
elements, not their string length.

W> To restate: hashes and tries have the same big-oh time complexity
W> for key lookups, but the constant factor probably differs by a few
W> orders of magnitude.

and as i showed you, you are wrong. in growth terms hashes are close to
O(1) and any tree is no better than O(log N).

uri
 
Willem

Uri Guttman wrote:
) W> You should read up on big-oh notation. You certainly cannot sweep an O(N)
) W> operation under the carpet and claim that something is O(1), because the
) W> O(N) bit is blazingly fast. The whole basis of big-oh is that O(f(x)) is
) W> *always* based on the largest *growing* factor of f(x), and that any
) W> constant factors are irrelevant to the big-oh.
)
) you can do that if the work per unit is less than the tree scanning
) operation. it is done all the time. you need to look at the largest work
) unit and compare them. a disk sort will use ram sorts but the ram sort
) work is irrelevant as the disk sort will dominate the run time. the
) dominant part of a hash is just a hash generation which is much faster
) in a scan than going down a tree. a hash and tree may both be O(N) but
) they have different dominant parts. hashing is considered close to O(1)
) because it doesn't make a difference in how many elements you have, it
) is close to a constant time per lookup. any tree is dependent on the
) size of the tree itself. that means it is O(log N) at best and O(N) at
) worst. the scale factor is the number of elements in the structure, not
) the length of the keys. this is how you do proper algorithm analysis -
) you need to find the growth curve and that is dependent on the count of
) elements, not their string length.

You're talking about trees. A trie is different, in that each node is
a small part of a key, not a whole key. A lookup in a trie takes N
operations, where N is the key length. In a hash a lookup also takes
at least N operations, the only difference being that the operations
themselves are faster by a constant factor.
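
A rough side-by-side sketch in pure Perl (which makes the trie's constant
factor even worse than a C trie would be, since every character costs a
Perl-level hash hop; the word list and trie_lookup() are just illustrative):

    use strict;
    use warnings;

    my @words = qw(do dog dogs door);

    # Plain hash: the per-character work happens inside perl's C-level hashing.
    my %hash = map { $_ => 1 } @words;

    # Trie of nested hashes: the per-character work is one Perl-level hop per node.
    my %trie;
    for my $w (@words) {
        my $node = \%trie;
        $node = $node->{$_} ||= {} for split //, $w;
        $node->{_end} = 1;
    }

    sub trie_lookup {
        my ($root, $key) = @_;
        my $node = $root;
        for my $ch (split //, $key) {
            $node = $node->{$ch} or return 0;
        }
        return $node->{_end} ? 1 : 0;
    }

    for my $key (qw(dog dorm)) {
        printf "%-4s  hash:%d  trie:%d\n",
            $key, (exists $hash{$key} ? 1 : 0), trie_lookup(\%trie, $key);
    }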

) W> To restate: hashes and tries have the same big-oh time complexity
) W> for key lookups, but the constant factor probably differs by a few
) W> orders of magnitude.
)
) and as i showed you, you are wrong. in growth terms hashes are close to
) O(1) and any tree is no better than O(log N).

You did not show me any such thing, all I saw was some handwaving about
how an O(N) operation is dominated by a O(1) operation, which is hogwash
in complexity theory. O(N) *always* dominates O(1). No exceptions.

Also, a trie is very different from a tree in this respect. log N
doesn't even enter into the picture.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 
Ilya Zakharevich

GM> We all find hashes pretty convenient, and helpful to make things fast; but
GM> the problem with hashes is that they consume too much memory on big trees,
GM> and they are too slow for hundred millions of keys.

> This is true and any experienced programmer will tell you not to put
> hundreds of millions of keys in a hash in any language.

???!!! How come? If what I need is hash-like semantics, and I need
"hundreds of millions of keys", why would I not use a hash? I have never
programmed in this "any language", so you can reduce your arguments to
Perl. Again, to be specific, assume that this particular process has a
ulimit of 100GB of heap, and other processes do not use much memory, so
there won't be any swapping...

Ilya

PS [I skip the rest, which is very on-point...]
 
Ilya Zakharevich

> We all find hashes pretty convenient, and helpful to make things fast; but
> the problem with hashes is that they consume too much memory on big trees,
> and they are too slow for hundred millions of keys. Try it yourself.
>
> What we need in these situations is something different called DAWG.
>
> Please have a look at this article to understand what I mean.
>
> http://en.wikipedia.org/wiki/DAWG

I find this claim

a DAWG is very similar to a trie, but it is much more space efficient

very suspicious. If you want the DAWG to be dynamic, then the extra "parent"
pointers you need to keep would, in most situations, far outweigh the
space savings. If the DAWG is not intended to change, you do not need
parent pointers, but still, I can't imagine many situations in which
the tail-folding would save much.

Are you sure that your set of keys is so special?

Ilya
 
George Mpouras

> either way, the OP can't code it nor can he compare the growth of either
> algorithm.

DAWG and compressed DAWG unfortunately are not possible to implement
in pure perl, but a C++ module is possible.
I believe CDAWG functionality will give Perl a huge advantage over
other languages. CDAWGs can give instant answers over billions of
records. They are even faster than perfect hashes (no-collision hashes).
 
Peter J. Holzer

GM> We all find hashes pretty convenient, and helpful to make things fast; but
GM> the problem with hashes is that they consume too much memory on big trees,
GM> and they are too slow for hundred millions of keys.

> This is true and any experienced programmer will tell you not to put
> hundreds of millions of keys in a hash in any language.

Actually, only part of this is true. Hashes do consume a lot of memory,
but they are not slow for a large number of keys. As Willem correctly
pointed out, they are O(n) on the length of the key but O(1) on the
number of keys. As long as your hash fits into RAM, lookups and
insertions are the same speed whether your hash contains 100 keys or
100,000,000.
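
That's easy to check with the core Benchmark module; the exact numbers
vary, but the lookup rates for the small and the large hash should come
out about the same (a sketch only, and note the million-key hash needs a
fair amount of RAM):

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    # Two hashes with identically shaped keys; only the number of keys differs.
    my %small = map { sprintf('k%09d', $_) => 1 } 1 .. 1_000;
    my %big   = map { sprintf('k%09d', $_) => 1 } 1 .. 1_000_000;

    cmpthese(-2, {
        small_hash => sub { my $x = exists $small{'k000000500'} },
        big_hash   => sub { my $x = exists $big{'k000000500'} },
    });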

hp
 
Willem

George Mpouras wrote:
)>
)> either way, the OP can't code it nor can he compare the growth of either
)> algorithm.
)>
)
) DAWG and compressed DAWG unfortunately are not possible to implement
) in pure perl, but a C++ module is possible.

Where did you get that strange idea?
Of course they can be implemented in pure perl.

) I believe CDAWG functionality will give Perl a huge advantage over
) other languages. CDAWGs can give instant answers over billions of
) records. They are even faster than perfect hashes (no-collision hashes).

This DAWG data-structure only has very few, limited use cases.
It is definitely not a replacement for perl hashes.

If you want to use it, fine, but just write it as a perl module,
perhaps with a C library for speed. You can even tie it to a hash.
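
A skeleton of that tie idea (Tie::DAWG is a made-up name, and a plain hash
stands in here for the real DAWG lookup code, whether pure Perl or XS):

    package Tie::DAWG;    # hypothetical module name
    use strict;
    use warnings;

    # Minimal read-only tie interface; the real build/lookup code would
    # sit behind these methods.
    sub TIEHASH {
        my ($class, @words) = @_;
        my %set = map { $_ => 1 } @words;    # stand-in for a real DAWG
        return bless { set => \%set }, $class;
    }
    sub FETCH  { my ($self, $key) = @_; return $self->{set}{$key} }
    sub EXISTS { my ($self, $key) = @_; return exists $self->{set}{$key} }
    sub STORE  { die "this DAWG is read-only\n" }
    sub DELETE { die "this DAWG is read-only\n" }

    package main;
    tie my %dict, 'Tie::DAWG', qw(cat cats dog);
    print exists $dict{cats} ? "yes\n" : "no\n";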


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 
Jürgen Exner

George Mpouras said:
> We all find hashes pretty convenient, and helpful to make things fast; but
> the problem with hashes is that they consume too much memory on big trees,
> and they are too slow for hundred millions of keys. Try it yourself.
>
> What we need in these situations is something different called DAWG.
>
> Please have a look at this article to understand what I mean.
>
> http://en.wikipedia.org/wiki/DAWG
>
> I tried to implement this DAWG data structure with Perl but I found it VERY
> difficult.
>
> What are your thoughts about the necessity of this new data structure.

Apples and oranges.

A hash is a mapping from string to scalar; a DAWG is a special
implementation of a string set, which trades space for algorithmic
complexity (ever tried deleting a word from a DAWG?).

The one has nothing to do with the other. Of course you can use hashes
to implement a DAWG, but that's a different story.

jue
 
Jürgen Exner

Willem said:
> You did not show me any such thing, all I saw was some handwaving about
> how an O(N) operation is dominated by a O(1) operation, which is hogwash
> in complexity theory. O(N) *always* dominates O(1). No exceptions.

Would you agree if, instead of "dominate", Uri had used "most
expensive"? I guess that would have made it clearer.

For the kind of operation we are looking at here, the most expensive
operation is RAM access. And the number of times the algorithm reads
from RAM for a hash is typically 1 (or a small number if the bucket has
more than one element), while for a DAWG or tree it will be one RAM access
for each character in the word, i.e. O(n) in total.

jue
 
Jürgen Exner

George Mpouras said:
> DAWG and compressed DAWG unfortunately are not possible to implement
> in pure perl,

Perl is (except for space constraints) Turing-complete, therefore of
course you can implement DAWGs in Perl.
And I think it's not even that hard using cascading HoHs. Of course,
designing the algorithms to add and in particular to delete elements is
a totally different story.
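
For the plain cascading-HoH trie, even delete-with-pruning is manageable;
it is the shared-suffix (DAWG) version where deletion gets genuinely hard.
A sketch only, with hypothetical helper names:

    use strict;
    use warnings;

    # Build a small nested-hash trie (one level per character, _end marks words).
    my %trie;
    for my $w (qw(car cart card)) {
        my $node = \%trie;
        $node = $node->{$_} ||= {} for split //, $w;
        $node->{_end} = 1;
    }

    # Delete a word, pruning nodes that become empty on the way back up.
    sub trie_delete {
        my ($root, $word) = @_;
        my @chars = split //, $word;
        my @path  = ($root);
        my $node  = $root;
        for my $ch (@chars) {
            $node = $node->{$ch} or return 0;    # word not present
            push @path, $node;
        }
        return 0 unless delete $node->{_end};
        for my $i (reverse 0 .. $#chars) {
            last if keys %{ $path[$i + 1] };     # still has children or _end
            delete $path[$i]{ $chars[$i] };
        }
        return 1;
    }

    print trie_delete(\%trie, 'cart') ? "deleted\n" : "not found\n";
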
> I believe CDAWG functionality will give Perl a huge advantage over
> other languages. CDAWGs can give instant answers over billions of
> records. They are even faster than perfect hashes (no-collision hashes).

Yeah, but where do you need such features? I am sure there are areas,
maybe genetics or astronomy or particle physics or fluid dynamics.
But Perl is a general-purpose language, and the needs of people working
in those exotic fields of science are so, well, exotic, that they need
to do their own specialized coding anyway.

DAWGs are nothing but a very special implementation of string sets. If
anything, you should be asking for string sets to be added as a basic
data structure to Perl. Then, as a second step, you can discuss the actual
implementation of string sets, and DAWGs may be one option among many.

jue
 
Jürgen Exner

Willem said:
> This DAWG data-structure only has very few, limited use cases.
> It is definitely not a replacement for perl hashes.

Quite right. A DAWG is merely a set of strings, while a hash is a
mapping (from strings to scalars) and thus can also be used to simulate
a string set.

> If you want to use it, fine, but just write it as a perl module,
> perhaps with a C library for speed.

Exactly

jue
 
Willem

Jürgen Exner wrote:
) Would you agree if instead of "dominate" Uri would have used "most
) expensive"? I guess that would have made it clearer.
)
) For the kind of operation we are looking at here the most expensive
) operation is RAM access. And the number of times the algorithm reads
) from RAM for a hash is typically 1 or a small number if the bucket has
) more than one element while for a DAWG or tree it will be one RAM access
) for each character in the word, i.e. total of O(n).

You can't just claim that one RAM access of unlimited size can be counted
as one operation. How long that access takes is still linearly dependent
on the size of the key.

I'm not saying that hashes are not orders of magnitude faster than tries.

What I'm saying is that you can't use complexity theory as a proof of this.

For example, you could (roughly) say the following:
(The numbers are just examples, to make the point)

The time to check if a key of length N is in a hash is 1000 + N.
The time to check if a key of length N is in a trie is 1000 * N.

Now, looking at that, it's obvious that a hash is a lot faster.

But from the viewpoint of complexity theory, 1000 + N = O(N),
and 1000 * N = O(N), so they're both O(N). They have the same
time complexity.

If two algorithms have the same time complexity, you have
to look at other things to determine which will be faster.
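
A quick way to see that additive constant at work (a sketch using the core
Benchmark module; exact ratios depend on the perl build, but the
1000-character key should come out far less than 100 times slower than the
10-character one):

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    # Hash lookups with a 10-character and a 1000-character key: the per-key
    # cost does grow with length, but the fixed overhead dominates.
    my $short = 'x' x 10;
    my $long  = 'x' x 1000;
    my %h = ($short => 1, $long => 1);

    cmpthese(-2, {
        key_10   => sub { my $x = exists $h{$short} },
        key_1000 => sub { my $x = exists $h{$long} },
    });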


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 
