Hashes are good, but not good enough.


Peter J. Holzer

Ilya Zakharevich said:
I find this claim

a DAWG is very similar to a trie, but it is much more space efficient

very suspicious. If you want DAWG to be dynamic, then extra "parent"
pointers you need to keep would, in most situations, far outweigh the
space savings. If DAWG is not intended to change, you do not need
parent pointers,

I don't think a DAWG is intended to change. Inserts and deletions look
expensive to me. You probably build it once from a collection of words
and then do only lookups.
but still, I can't imagine many situations in which
the tail-folding would save much.

The data structure seems to be designed to represent a dictionary of
(English) words. Many words share a common suffix, so I would expect
significant space savings compared to a trie. If you use it to store
data without common suffixes it will degenerate into a trie, so it should
never be worse than a trie, IMHO.

hp
 

Jürgen Exner

Willem said:
Jürgen Exner wrote:
) Would you agree if instead of "dominate" Uri would have used "most
) expensive"? I guess that would have made it clearer.
)
) For the kind of operation we are looking at here the most expensive
) operation is RAM access. And the number of times the algorithm reads
) from RAM for a hash is typically 1 or a small number if the bucket has
) more than one element while for a DAWG or tree it will be one RAM access
) for each character in the word, i.e. total of O(n).

You can't just claim that one RAM access of unlimited size can be counted
as one operation. How long that access takes is still linearly dependent
on the size of the key.

???
You lost me, with both sentences. Probably I simply don't understand
what you mean. But from my understanding
- there is no "unlimited size". In a hash without collisions there is
exactly one retrieval operation which grabs exactly the one single
element that is wanted.
- the time it takes to retrieve this one element is a constant because
it is retrieved by direct access instead of going through a chain of
"check for next letter; retrieve pointer to the rest of the word"
I'm not saying that hashes are not orders of magnitude faster than tries.

What I'm saying is that you can't use complexity theory as a proof of this.

For example, you could (roughly) say the following:
(The numbers are just examples, to make the point)

The time to check if a key of length N is in a hash is 1000 + N.

But it isn't. The hash function returns the direct location of where to
find the desired value, therefore the cost is 1000+1.

jue
 

Peter J. Holzer

???
You lost me, with both sentences. Probably I simply don't understand
what you mean. But from my understanding
- there is no "unlimited size". In a hash without collisions there is
exactly one retrieval operation which grabs exactly the one single
element that is wanted.

A hash in Perl is a mapping from string to scalar. A lookup in a hash
consists of:

1) Compute the hash value of the key.

2) Locate the bucket for the hash value.

3) For each key/value pair in the bucket, perform a string comparison
until the matching key is found.

ad 1) To compute the hash value of a string, you have to retrieve each
byte of the string. So for a key of n bytes length, this step consists
of n memory accesses plus n computation steps, so this step is
O(key_length). (you may be able to optimize this to n/4 or n/8 memory
accesses, but that's a constant factor so it doesn't change Big O).

ad 2) Locating the bucket is typically a single memory access. This
memory access is typically much more expensive than each of the byte
accesses in step 1 (which had a good chance of hitting L1 cache). But
it's a single operation with an upper bound in time, so it's O(1).

ad 3) Each string comparison is O(key_length) for a hit and at most
O(key_length) for a miss (for a good hash function I would expect a
difference in the first few bytes, though). So this is something between
O(key_length) + O(bucket_size) and O(key_length) * O(bucket_size).

In total we now have a complexity between

O(key_length) + O(1) + O(key_length) + O(bucket_size)
= O(key_length) + O(bucket_size)

and

O(key_length) + O(1) + O(key_length) * O(bucket_size)
= O(key_length) * O(bucket_size)

Since the bucket size should be roughly constant, this reduces to

O(key_length).


IMO this is almost completely irrelevant, though. Big O notation is only
useful to determine the behaviour for large values, while hash keys are
typically short.
But it isn't. The hash function returns the direct location of where to
find the desired value, therefore the cost is 1000+1.

The hash function itself is O(key_length).
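
To make those three steps visible, here is a toy bucketed hash in pure
Perl (only a sketch, of course: perl's real hashes do all of this in C,
and the hash function and bucket count here are made up for illustration):

#!/usr/bin/perl
use strict;
use warnings;

my $NBUCKETS = 8;                      # tiny on purpose, just for illustration
my @buckets  = map { [] } 1 .. $NBUCKETS;

sub hashval {                          # step 1: touches every byte -> O(key_length)
    my ($key) = @_;
    my $h = 0;
    $h = ($h * 33 + ord($_)) % 2**32 for split //, $key;
    return $h;
}

sub store {
    my ($key, $value) = @_;
    my $bucket = $buckets[ hashval($key) % $NBUCKETS ];   # step 2: one lookup -> O(1)
    for my $pair (@$bucket) {                             # step 3: compare keys in the bucket
        if ($pair->[0] eq $key) { $pair->[1] = $value; return }
    }
    push @$bucket, [ $key, $value ];
}

sub lookup {
    my ($key) = @_;
    my $bucket = $buckets[ hashval($key) % $NBUCKETS ];   # step 2
    for my $pair (@$bucket) {                             # step 3: eq is itself O(key_length)
        return $pair->[1] if $pair->[0] eq $key;
    }
    return undef;
}

store("tap", 1);
store("tops", 2);
print lookup("tops"), "\n";            # prints 2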

hp
 

Peter J. Holzer

???
You lost me, with both sentences. Probably I simply don't understand
what you mean. But from my understanding
- there is no "unlimited size". In a hash without collisions there is
exactly one retrieval operation which grabs exactly the one single
element that is wanted.

A hash in Perl is a mapping from string to scalar. A lookup in a hash
consists of:

1) Compute the hash value of the key.

2) Locate the bucket for the hash value.

3) For each key/value pair in the bucket, perform a string comparison
until the matching key is found.
[...]

this reduces to

O(key_length).


IMO this is almost completely irrelevant, though. Big O notation is only
useful to determine the behaviour for large values, while hash keys are
typically short.

Maybe not quite so irrelevant. I expected the linear behaviour to show
itself only for keys of at least a few thousand bytes, but I was wrong.
At least on my system, some dependence on key length is noticeable even
for very short key lengths and from about 200 bytes on the run-time is
almost completely linear (i.e. a lookup of a 400-byte key takes about
twice as long as a lookup of a 200-byte key).

Here is a short benchmark program:


#!/usr/bin/perl
use warnings;
use strict;

use Time::HiRes qw(time);

my $n1 = 1000;
my $n2 = 1000;

$| = 1;
for (my $len = 10; $len < 1_000_000; $len *= 2) {

    print "$len preparing ...";
    my @keys = ();
    for (1 .. $n1) {
        my $k = "";
        for (1 .. $len) {
            $k .= chr(int(rand(64)) + 32);
        }
        push @keys, $k;
    }
    print "\n";
    my $t0 = time;
    my %hash;
    for (1 .. $n2) {
        for (@keys) {
            $hash{$_}++;
        }
    }
    my $t1 = time;
    printf "%d %g\n", $len, ($t1 - $t0);
}

hp
 

scattered

We all find hashes pretty convenient, and helpful to make things fast; but
the problem with hashes is that they consume too much memory on big trees,
and they are too slow for hundreds of millions of keys. Try it yourself.

What we need in these situations is something different called DAWG.

Please have a look at this article to understand what I mean.

http://en.wikipedia.org/wiki/DAWG

I tried to implement this DAWG data structure with Perl but I found it VERY
difficult.

What are your thoughts about the necessity of this new data structure?

A DAWG is a data structure for strings whose sole purpose is to tell
whether a given string is in a set of strings. It can't be used as a data
structure which implements a key-data mapping. The entire point of a
DAWG is that strings which begin and end the same way end up at the
same final node - so you can't just put the keyed data at that node.
You could put a hash at a final node, with keys consisting of those
keys in the original set of keys which lead to the same DAWG node -
but in your intended application of 100s of millions of keys, you
would doubtless just replace one huge hash by thousands of smaller,
but still large hashes. There might be *some* scenario in which this
would be advantageous (since these smaller hashes would have fewer
collisions) - but the overhead would be substantial and the lack of
easy insertions and deletions a major drawback. Hashes are good at
what they do, and DAWGs provide no real competition.

Having said all this, a DAWG implementation in Perl would be quite
nice, though it would clearly belong in CPAN rather than as a core
part of the language. It is hard to believe that it hasn't been done
by anybody, but maybe not (the unfortunate tendency for semi-literate
people to spell "dog" as "dawg" in the mistaken belief that this is
somehow cute or cool makes searching for DAWG implementations somewhat
annoying). A Perl DAWG implementation could make code for something
like http://dotnetperls.com/scrabble more elegant.

-scattered
 

Ilya Zakharevich

I don't think a DAWG is intended to change. Inserts and deletions look
expensive to me. You probably build it once from a collection of words
and then do only lookups.

There is no problem on deletion. For insertion, one needs a
dictionary of suffixes - which is, essentially, given by parent
pointers, since they form a DAWG as well.
The data structure seems to be designed to represent a dictionary of
(English) words. Many words share a common suffix, so I would expect
significant space savings compared to a trie.

"Significant" in which sense? For me, half an order of magnitude is
significant. 30% is not - in most situations Perl is used for. I
would expect it to be of the second type for English words - but it is
just a gut feeling... (And I was considering "DAWG vs *folded* tries"
- those where a node contains a substring, not a char.)
If you use it to store
data without common suffixes it will degenerate into a trie, so it should
never be worse than a trie, IMHO.

DAWG may be MUCH worse than a folded trie - even if you fold a DAWG.

Ilya
 

Ted Zlatanov

GM> We all find hashes pretty convenient, and helpful to make things fast; but
GM> the problem with hashes is that they consume too much memory on big trees,
GM> and they are too slow for hundreds of millions of keys.
IZ> ???!!! How comes? If what I need is a hash-like semantic, and I need
IZ> "hundreds of millions of keys", why would not I use hash? I never
IZ> programmed in this "any language", so you can reduce your arguments to
IZ> Perl. Again, to be specific, assume that this particular process has
IZ> ulimit of 100GB heap, and other processes do not use much memory, so
IZ> there won't be any swapping...


PJH> Actually, only part of this is true. Hashes do consume a lot of memory,
PJH> but they are not slow for a large number of keys. As Willem correctly
PJH> pointed out, they are O(n) on the length of the key but O(1) on the
PJH> number of keys. As long as your hash fits into RAM, lookups and
PJH> insertions are the same speed whether your hash contains 100 keys or
PJH> 100,000,000.

Hundreds of millions of hash keys will take up a lot of memory, probably
unnecessarily. Iterating through them will be painful even unsorted (as
with any large amount of data, it's the management that's hard, not
accumulating it). I would *at least* partition the keys with a
multi-level hash. So I stand by "don't put hundreds of millions of keys
in a hash" whether you're using Perl or Java or C or C++. It may be
possible, but it's not practical.
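
By a multi-level hash I mean something along these lines (a rough sketch;
the two-character prefix and the helper names are arbitrary choices):

use strict;
use warnings;

# One outer hash keyed on a short prefix, one inner hash per prefix.
my %partitioned;

sub part_store {
    my ($key, $value) = @_;
    $partitioned{ substr($key, 0, 2) }{$key} = $value;
}

sub part_fetch {
    my ($key) = @_;
    return $partitioned{ substr($key, 0, 2) }{$key};
}

sub part_iterate {
    # walk one partition at a time instead of the whole key space
    for my $prefix (sort keys %partitioned) {
        for my $key (keys %{ $partitioned{$prefix} }) {
            # ... process $key / $partitioned{$prefix}{$key} here
        }
    }
}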

If a *string set* is required, then much more compact structures exist.
We've discussed some and I would add bloom filters to the list because
they are both compact and fast (at the cost of some false positives).

Ted
 

George Mpouras

If a *string set* is required, then much more compact structures exist.
We've discussed some and I would add bloom filters to the list because
they are both compact and fast (at the cost of some false positives).

Ted

the problem with bloom filters is that if you search for more than 50
keys using "or" they are completely useless because they will ALWAYS
give you false positives.
 

Ted Zlatanov

GM> the problem with bloom filters is that if you search for more than 50
GM> keys using "or" they are completely useless because they will ALWAYS
GM> give you false positives.

That's definitely not true. The false positive rate is a function of
how many bytes you store per entry. So you can make false positives
much less likely if you add a few bytes per entry. The idea with bloom
filters is to filter out all the words that are definitely not in the
set and then test against a slower lookup for the definitive answer (or
run a slower function that can fail gracefully on false positives).
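
To illustrate, a minimal Bloom filter fits in a few lines of Perl (just a
sketch; the filter size, the number of hash functions, and the use of
Digest::MD5 with double hashing are my choices, not anything canonical):

#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

my $m    = 8 * 1024 * 1024;            # bits in the filter (1 MB here)
my $k    = 7;                          # hash functions per key
my $bits = "\0" x ($m / 8);

sub positions {                        # derive $k bit positions from one MD5 digest
    my ($key) = @_;
    my ($h1, $h2) = unpack "NN", md5($key);
    return map { ($h1 + $_ * $h2) % $m } 1 .. $k;
}

sub bf_add {
    my ($key) = @_;
    vec($bits, $_, 1) = 1 for positions($key);
}

sub bf_maybe_contains {
    my ($key) = @_;
    for my $pos (positions($key)) {
        return 0 unless vec($bits, $pos, 1);   # definitely not in the set
    }
    return 1;                                  # probably in the set (may be a false positive)
}

bf_add($_) for qw(tap taps top tops);
print bf_maybe_contains("tops") ? "maybe\n" : "no\n";   # maybe
print bf_maybe_contains("dawg") ? "maybe\n" : "no\n";   # almost certainly no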

The speed and compactness of bloom filters usually make that trade
worthwhile, but of course if you can't tolerate false positives at any
rate then they are not right for you. I would certainly not call them
"completely useless" in the context of a string set, though.

Ted
 

Xho Jingleheimerschmidt

Ted said:
Hundreds of millions of hash keys will take up a lot of memory, probably
unnecessarily.

Any space is unnecessary. I could always just throw the problem over
the wall to someone else,
Iterating through them will be painful even unsorted

Painful for whom? Does a CPU feel pain? And if so, do I care?

(as
with any large amount of data, it's the management that's hard, not
accumulating it). I would *at least* partition the keys with a
multi-level hash.

What would that get you? Would the CPU suffer less pain iterating over it?


Xho
 

Ted Zlatanov

XJ> Any space is unnecessary. I could always just throw the problem over
XJ> the wall to someone else,

(AKA "put it in a database", and of course databases are *engineered* to
store and index large amounts of data)

Any time you take an argument to extremes it sounds awkward. As I keep
asking, why do you need hundreds of millions of keys? Are you actually
going to use the values? What's the usage pattern? My point is that
"just stuff hundreds of millions of keys in a hash" is VERY likely not
the answer to any actual need and experienced programmers know that.

XJ> Painful for whom? Does a CPU feel pain? And if so, do I care?

For the user waiting for the results. Again, what's the usage pattern?
Will you need to iterate? If so, how quickly do you need the results?
Do you mind waiting for one CPU to crunch the data? Or are you going to
fork or use threads? You'll certainly have time to think about this
while you're waiting for Perl to sputter through the opcodes in a tight
loop, busting the cachelines every time.

XJ> What would that get you? Would the CPU suffer less pain iterating over it?

Yes, and then it would have more time to compose register limericks.

AX RAX RIP RIP RIP
DI DI AX AX AX
SI DI IP
DI SI RIP
RAX AX EIP!

Ted
 

Ilya Zakharevich

XJ> Any space is unnecessary. I could always just throw the problem over
XJ> the wall to someone else,

(AKA "put it in a database", and of course databases are *engineered* to
store and index large amounts of data)

So are Perl hashes - and I expect they are going to be order(s?) of
magnitude more space efficient and quicker than databases. (Provided
that memory holds enough water.)
XJ> Painful for whom? Does a CPU feel pain? And if so, do I care?

For the user waiting for the results.

I expect the opposite. (But only based on "theoretical experience". ;-)

Which database can efficiently serve hundreds of millions of records
without (excessive) disk access with a 200-byte-per-record memory
footprint?

Ilya
 

George Mpouras

Which database can efficiently serve hundreds of millions of records
without (excessive) disk access with a 200-byte-per-record memory
footprint?

Ilya

How about Google?
Have you ever thought about the data size of Google and how it is possible
to get instant results?
As far as I can understand, Google is using perfect hashes over inverted files.
 

Xho Jingleheimerschmidt

Ted said:
XJ> Any space is unnecessary. I could always just throw the problem over
XJ> the wall to someone else,

(AKA "put it in a database", and of course databases are *engineered* to
store and index large amounts of data)

Most database systems, at least RDBMS ones, are actually primarily
engineered for ACID compliance, and these designs impose huge overheads.
My first answer to most insufferably slow join problems is just to
suck all the data into giant Perl hashes and do it there.
Any time you take an argument to extremes it sounds awkward. As I keep
asking, why do you need hundreds of millions of keys?

I don't, that I know of. But I am not the only person who does things,
and I oppose the idea we should insult people's problems just because
we don't understand them. I've used ordinary Perl hashes with up to
16777216 keys and would go to 4 times that without a second thought. Beyond
that, I'd probably still do it, but just keep a close eye on "top" until
I was comfortable with it.

Are you actually
going to use the values? What's the usage pattern? My point is that
"just stuff hundreds of millions of keys in a hash" is VERY likely not
the answer to any actual need and experienced programmers know that.

Except for the ones that disagree with it.
XJ> Painful for whom? Does a CPU feel pain? And if so, do I care?

For the user waiting for the results. Again, what's the usage pattern?
Will you need to iterate? If so, how quickly do you need the results?

They don't *need* the results at all. But they would like them more
quickly than they will get them by waiting for me to code and debug a
highly specialized data structure, rather than using the general purpose
one I already have.

Do you mind waiting for one CPU to crunch the data?

If I did, I'd do something about it. That something would probably
involve Perl hashes to some degree, but 100 million element ones. It
probably wouldn't involve DAWG.
Or are you going to
fork or use threads? You'll certainly have time to think about this
while you're waiting for Perl to sputter through the opcodes in a tight
loop, busting the cachelines every time.

Normally I would think about some *different* problem while waiting for
one crunch to go through. Or go get some lunch. Or go home and sleep.

What do you think PostgreSQL, Oracle and DAWGs do to cachelines?

Xho
 

Ted Zlatanov

XJ> Most database systems, at least RDBMS ones, are actually primarily
XJ> engineered for ACID compliance, and these designs impose huge
XJ> overheads. My first answer to most insufferably slow join problems is
XJ> just to suck all the data into giant Perl hashes and do it there.

I was talking about databases in general, including NoSQL types without
the ACID constraint.

Joins are a special kind of query where it does make sense to do things
in memory if the database can't, so I would agree with you that's a good
idea, depending on the specific need. I actually did something similar
to that kind of join query in Perl and compared it to just running the
query inside Sybase. Perl was faster for a while but Sybase won in the
end as the amount of data grew.

XJ> I don't, that I know of. But I am not the only person who does
XJ> things, and I oppose the idea we should insult people's problems just
XJ> because we don't understand them.

I think if you read my original message, 90% of it was trying to find
out more about the problem. I was trying to say I'm deeply suspicious
of "just throwing lots of data in a hash" as a way to solve actual
problems. It may work, but as the amount of data grows it won't
necessarily scale.

XJ> They don't *need* the results at all. But they would like them more
XJ> quickly than they will get them by waiting for me to code and debug a
XJ> highly specialized data structure, rather than using the general
XJ> purpose one I already have.

Er, the data structure is *not* the issue and I wasn't advocating a
complicated one. Whether you throw 100 million keys in a hash or 100
million rows in a database table, you're still managing large amounts of
data in a flat structure. But you're not answering the usage questions,
which I'll say again are what really matters.

XJ> Normally I would think about some *different* problem while waiting
XJ> for one crunch to go through. Or go get some lunch. Or go home and
XJ> sleep.

XJ> What do you think PostgreSQL, Oracle and DAWGs do to cachelines?

Interesting question. DAWGs don't do anything, they are a data
structure. If the code is in C and the data structure is compact, you
have a chance to avoid these penalties. I think DAWG won't do well
there but tries and bloom filters may.

With Perl the data and instructions are just not local enough. This is
my experience on x86 Nehalems and a reasonable inner loop.

Databases are usually engineered for concurrent access, which lets you
scale much more than Perl hashes would. I don't know performance
specifics for PostgreSQL or Oracle, but they do seem to be popular,
especially Oracle, for large data sets. I would imagine Oracle, at
least, has a couple of engineers checking the code performance at the
CPU level.

Ted
 

Peter J. Holzer

There is no problem on deletion.

A deletion may force you to split nodes. For example, use the DAWG from
the wikipedia article which represents the words "tap", "taps", "top",
"tops":


digraph "g1" {
n0 -> n1 [ label="t" ];
n1 -> n2 [ label="a" ];
n1 -> n2 [ label="o" ];
n2 -> n3 [ label="p" ];
n3 -> n4 [ label="s" ];
n3 -> n5 [ label="EOW" ];
n4 -> n5 [ label="EOW" ];
}

[use the dotty program from the graphviz package to view the graph]

When you delete "tops" from this DAWG, nodes n2 and n3 can no longer be
used for "top" since they would also lead to "tops". So you have to
create new nodes for "top":

digraph "g2" {
n0 -> n1 [ label="t" ];
n1 -> n2 [ label="a" ];
n1 -> n6 [ label="o" ];
n2 -> n3 [ label="p" ];
n3 -> n4 [ label="s" ];
n3 -> n5 [ label="EOW" ];
n4 -> n5 [ label="EOW" ];
n6 -> n7 [ label="p" ];
n7 -> n5 [ label="EOW" ];
}

In this case I think this is the only possible result of the deletion,
but in more complex DAWGs there may be more than one possible result, so
you also have to find the optimal one out of several possible results.
For insertion, one needs a
dictionary of suffixes - which is, essentially, given by parent
pointers, since they form a DAWG as well.

An insertion may add or remove nodes (or both). For example if we start
at ("tap", "top"):

digraph "g3" {
n0 -> n1 [ label="t" ];
n1 -> n2 [ label="a" ];
n1 -> n2 [ label="o" ];
n2 -> n3 [ label="p" ];
n3 -> n5 [ label="EOW" ];
}

and add "taps", we need to split n2 and n3 to get graph g2. And if we
then add "tops" we can again merge nodes to get to graph g1. Having a
suffix dictionary probably helps (I haven't tried to implement it), but
in any case it doesn't look quite straightforward to me. A suffix
dictionary also needs extra space, and since the whole reason for a DAWG
is to save space you wouldn't want that (I'm not sure whether you can
even call a DAWG with parent pointers a DAWG).

"Significant" in which sense? For me, half an order of magnitude is
significant. 30% is not - in most situations Perl is used for.

My rule of thumb is that it's worth considering if it's an improvement
by a factor of at least two (i.e., it saves at least 50% of memory). But
of course that depends on the problem. If I have a 16 bit system a
reduction from 70k (too big) to 60k (small enough) may be worthwhile. On
my workstation with 4GB RAM I may not care about the possibility of
reducing memory consumption from 2GB to 0.5 GB.
I would expect it to be of the second type for English words - but it is
just a gut feeling...

My gut feeling is that it's better. But I don't have any data for it so
I won't make any claims.
(And I was considering "DAWG vs *folded* tries"
- those where a node contains a substring, not a char.)

I found several references to "folded tries", but all of them were about
storing unicode strings and they stored only part of a character per
node. Anyway, whatever optimization you use, you can also use it for a
DAWG, so I don't see the difference.

DAWG may be MUCH worse than a folded trie - even if you fold a DAWG.

You didn't show why my assumption that DAWG will degenerate into a trie
in the worst case is false.

hp
 

Peter J. Holzer

GM> We all find hashes pretty convenient, and helpful to make things fast; but
GM> the problem with hashes is that they consume too much memory on big trees,
GM> and they are too slow for hundreds of millions of keys.

IZ> ???!!! How comes? If what I need is a hash-like semantic, and I need
IZ> "hundreds of millions of keys", why would not I use hash? I never
IZ> programmed in this "any language", so you can reduce your arguments to
IZ> Perl. Again, to be specific, assume that this particular process has
IZ> ulimit of 100GB heap, and other processes do not use much memory, so
IZ> there won't be any swapping...


PJH> Actually, only part of this is true. Hashes do consume a lot of memory,
PJH> but they are not slow for a large number of keys. As Willem correctly
PJH> pointed out, they are O(n) on the length of the key but O(1) on the
PJH> number of keys. As long as your hash fits into RAM, lookups and
PJH> insertions are the same speed whether your hash contains 100 keys or
PJH> 100,000,000.

Hundreds of millions of hash keys will take up a lot of memory, probably
unnecessarily.

If it's not necessary then don't use it. We are of course talking about
the case where a hash is the appropriate data structure.
Iterating through them will be painful even unsorted (as
with any large amount of data, it's the management that's hard, not
accumulating it).

Hashes are generally used for fast random lookup, not for iterating.
I would *at least* partition the keys with a
multi-level hash.

That takes even more memory, possibly a lot more (depends where you
split the key), and it doesn't improve locality. It may help if you only
need to access data from one partition, but then why are you putting the
other data into memory at all? It also helps for sorting, but you may
not have to sort your data.

So I stand by "don't put hundreds of millions of keys
in a hash" whether you're using Perl or Java or C or C++. It may be
possible, but it's not practical.

It is practical as long as it fits into memory. One case where I could
have used a hash of 50+ million elements (but I couldn't at the time,
since that was on a 32 bit system so it wouldn't fit into memory) was a
comparison between two datasets: Read the first one into a hash, then
step through the second one record by record and look up the data from
the first. Total time: roughly the time to read the two files
sequentially. What I did instead was sort both datasets and then compare
the sorted files. Works, but was a lot slower (but still faster than
letting Oracle do the comparison).
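
In code, the comparison I had in mind looks roughly like this (a sketch
only; the file names and the tab-separated record layout are invented
for the example):

#!/usr/bin/perl
use strict;
use warnings;

my %first;
open my $a_fh, '<', 'dataset_a.txt' or die "dataset_a.txt: $!";
while (<$a_fh>) {
    chomp;
    my ($key, $rest) = split /\t/, $_, 2;
    $first{$key} = $rest // '';        # the whole first dataset lives in memory
}
close $a_fh;

open my $b_fh, '<', 'dataset_b.txt' or die "dataset_b.txt: $!";
while (<$b_fh>) {
    chomp;
    my ($key, $rest) = split /\t/, $_, 2;
    $rest //= '';
    if (!exists $first{$key}) {
        print "only in B: $key\n";
    } elsif ($first{$key} ne $rest) {
        print "differs:   $key\n";
    }
    delete $first{$key};               # whatever is left at the end is only in A
}
close $b_fh;

print "only in A: $_\n" for keys %first;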

If a *string set* is required, then much more compact structures exist.

Yes. But you weren't talking about string sets. You claimed that having
huge hashes is slow (it isn't) and always a mistake (it isn't).

hp
 

Bo Lindbergh

For insertion, one needs a
dictionary of suffixes - which is, essentially, given by parent
pointers, since they form a DAWG as well.

No, they don't. Consider the DAWG matching the two words "abc" and "bc".
The vertex with an outgoing edge labelled "c" has two incoming edges,
both labelled "b".

The data structure seems to be designed to represent a dictionary of
(English) words. Many words share a common suffix, so I would expect
significant space savings compared to a trie.

The /usr/share/dict/words that comes with Mac OS X 10.6 has 234936 words
totalling 2251877 bytes (newlines excluded). A trie matching these
words (with a one-bit flag per vertex instead of end-of-word edges) has
791089 vertices. An optimal DAWG based on this trie has 130892 vertices.
I don't know about you people, but I call a factor of six "significant
savings".


/Bo Lindbergh
 

Bo Lindbergh

Having said all this, a DAWG implementation in Perl would be quite
nice, though it would clearly belong in CPAN rather than as a core
part of the language. It is hard to believe that it hasn't been done
by anybody

Maybe the high resource usage during the construction phase
makes people think it's not worth the effort to write?

Anyway, since it isn't _completely_ useless, I named it Text::DAWG
rather than Acme::DAWG. Coming soon to a CPAN mirror near you!


/Bo Lindbergh
 
