Hashes are good, but not good enough.


Ilya Zakharevich

A deletion may force you...

Now I see how much I goofed here... Thanks!
I found several references to "folded tries", but all of them were about
storing unicode strings and they stored only part of a character per
node. Anyway, whatever optimization you use, you can also use it for a
DAWG, so I don't see the difference.

Won't work for DAWGs: E.g., in a folded trie for /usr/bin/words, after
I reach prefix "a" "b" "d" "i", all I need to store is "cate". I do
not expect you get any significant number of such chains of letters in
the corresponding DAWG.
You didn't show why my assumption that a DAWG will degenerate into a trie
in the worst case is false.

I just do not see why "optimizing" a folded trie into a (folded) DAWG
would ALWAYS save space. At LEAST, this should depend on the size of
overhead per node. Storing M strings of length L would take about LM
bytes of memory (imagining these strings as tails in a folded trie).
Now convert them into a DAWG - you get N nodes of size S. Obviously,
for large enough S, NS is much more than LM.

My gut feeling is that even with reasonable implementations, QUITE
OFTEN one would have NS > ML.

Yours,
Ilya
 

Ilya Zakharevich

The /usr/share/dict/words that comes with Mac OS X 10.6 has 234936 words
totalling 2251877 bytes (newlines excluded). A trie matching these
words (with a one-bit flag per vertex instead of end-of-word edges) has
791089 vertices. An optimal DAWG based on this trie has 130892 vertices.
I don't know about you people, but I call a factor of six "significant
savings".

This is an example of a completely useless stat. What is the URL for
this list? Do you consider tries or folded tries (those with a
string-per-node, not char-per-node)? A 6x difference in node count MAY
still be trivial if node sizes differ.

Note that if you implement your DAWG
as struct {struct node*, char}

then on a 64-bit machine it would take 2094272 bytes, so it would provide
little space advantage over the initial word list (but blazing-fast
retrieval).
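
(That is 16 bytes per node on a typical 64-bit machine - an 8-byte pointer
plus a one-byte char, padded for alignment - times the 130892 nodes quoted
above: 130892 * 16 = 2094272 bytes.)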

Ilya
 

George Mpouras

Bo Lindbergh said:
Maybe the high resource usage during the construction phase
makes people think it's not worth the effort to write?

Anyway, since it isn't _completely_ useless, I named it Text::DAWG
rather than Acme::DAWG. Coming soon to a CPAN mirror near you!


/Bo Lindbergh

Hi Bo,
I tried at the weekend to make one, but I ended up with two Bloom filter
variants. I am waiting for your module, I really want to see its insides,
please tell us when it is ready!
G.MPouras
 

George Mpouras

Bo Lindbergh said:
Maybe the high resource usage during the construction phase
makes people think it's not worth the effort to write?

Anyway, since it isn't _completely_ useless, I named it Text::DAWG
rather than Acme::DAWG. Coming soon to a CPAN mirror near you!


/Bo Lindbergh


Also, try to avoid the object-oriented interface because it slows things
down; the classic functions are preferred here.
 

Ted Zlatanov

PJH> If it's not necessary then don't use it. We are of course talking about
PJH> the case where a hash is the appropriate data structure.

This is a circular argument: of course if it's the appropriate data
structure, it's the appropriate data structure. My point was that
taking this approach is impractical generally.

PJH> Hashes are generally used for fast random lookup, not for iterating.

Sure, once it's loaded, it's fast. But the startup time is atrocious
and you can only do lookups. Also, do you need deletions, and if so,
are you happy with Perl's memory management?

PJH> That takes even more memory, possibly a lot more (depends where you
PJH> split the key), and it doesn't improve locality. It may help if you only
PJH> need to access data from one partition, but then why are you putting the
PJH> other data into memory at all? It also helps for sorting, but you may
PJH> not have to sort your data.

In other words, it may help and it may not, and we can't know if we
don't know the specific use case. Right.

PJH> It is practical as long as it fits into memory. One case where I could
PJH> have used a hash of 50+ million elements (but I couldn't at the time,
PJH> since that was on a 32 bit system so it wouldn't fit into memory) was a
PJH> comparison between two datasets: Read the first one into a hash, then
PJH> step through the second one record by record and look up the data from
PJH> the first. Total time: roughly the time to read the two files
PJH> sequentially. What I did instead was sort both datasets and then compare
PJH> the sorted files. Works, but was a lot slower (but still faster than
PJH> letting Oracle do the comparison).

This is similar to the case where I ended up with Sybase because the
Perl-based approach didn't scale. I can't say whether in your situation
it would have been a fast and sufficient approach to just use a hash,
but against Sybase I was able to parallelize the queries and do specific
comparisons on the interesting fields instead of the whole record. It
cut the run time significantly to do it against an RDBMS. As I said, it
was faster to just use a hash at first, but it quickly became an
unmanageable hack. That was my experience.

PJH> Yes. But you weren't talking about string sets.

I was giving practical advice that could actually be relevant to the
OP *after* discussing hashes.

PJH> You claimed that having huge hashes is slow (it isn't) and always a
PJH> mistake (it isn't).

OK, let's say it's fast for lookups in some cases and is a passable
solution in some very rare cases. Is that a fair summary?

Ted
 

Peter J. Holzer

Now I see how much I goofed here... Thanks!



Won't work for DAWGs: E.g., in a folded trie for /usr/bin/words, after

ITYM /usr(/share)/dict/words.
I reach prefix "a" "b" "d" "i", all I need to store is "cate". I do
not expect you get any significant number of such chains of letters in
the corresponding DAWG.

I don't see why you couldn't do the same thing in a DAWG. In addition,
you can reuse the suffix /cate(|d|s)$/ for the prefixes "adjudi",
"advo", "allo", "authenti", etc.

I just do not see why "optimizing" a folded trie into a (folded) DAWG
would ALWAYS save space.

I didn't say it always saves space, I said it cannot take more space. In
the worst case (no common suffixes) a DAWG just degenerates into a trie.
At LEAST, this should depend on the size of overhead per node.

This should be the same, since the information per node is the same in both
structures (a list of possible continuations).

There are a lot of ways a node in a trie or DAWG can be implemented.

It can be a fixed size array of pointers (I think this is the "classic"
trie - obviously this is only useful if you have a small alphabet, e.g.
the 26 characters of the latin alphabet; having 1114112 pointers per
node for each possible unicode character would be clearly impractical).

It can be an array of pointers associated with a range of characters.

It can be an array of (character, pointer) pairs. This array can be
sorted or unsorted.

It can be a linked list of (character, pointer) pairs (that adds another
pointer, of course).

Where the character is explicitly stored, it can also be a substring.

Instead of a character you can also use part of a character (e.g. a
single byte of the UTF-8 encoding).

Instead of a pointer you can use an offset into an array. The offset
can also be relative.
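
To make at least one of these concrete, here is a toy Perl sketch of a
per-node hash of children (an illustration only, not a recommendation for
any particular representation):

use strict;
use warnings;

# Toy trie: each node is a hash mapping the next character to the child
# node; "\0" marks end-of-word.
my %root;

sub trie_insert {
    my ($node, $word) = @_;
    for my $ch (split //, $word) {
        $node = $node->{$ch} //= {};
    }
    $node->{"\0"} = 1;
}

sub trie_lookup {
    my ($node, $word) = @_;
    for my $ch (split //, $word) {
        $node = $node->{$ch} or return 0;
    }
    return exists $node->{"\0"} ? 1 : 0;
}

trie_insert(\%root, $_) for qw(abdicate abdicated abdicates);
for my $w (qw(abdicate abdicat)) {
    print "$w: ", trie_lookup(\%root, $w) ? "found" : "not found", "\n";
}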


As should be obvious the size per node can vary enormously depending on
the chosen representation. It also depends on the dictionary (because
some representations have nodes of variable size). So which
representation has the smallest memory footprint depends on the
dictionary and the machine architecture. And of course, a small memory
footprint may not be your only criterion. Maybe you are more interested
in fast lookups, or in an easy way to add or delete members.

All the variations I listed above apply to tries and DAWGs alike. It is
possible that a specific space optimization works better for tries than
for DAWGs, but that isn't obvious at all and I'd need to see a specific
example to believe it.

Storing M strings of length L would take about LM
bytes of memory (imagining these strings as tails in a folded trie).
Now convert them into a DAWG - you get N nodes of size S. Obviously,
for large enough S, NS is much more than LM.

You were comparing tries to DAWGs, not strings to DAWGs. So you must
compare N(t) * S(t) (number of nodes in the trie times average size of
each node) with N(d) * S(d) (number of nodes in the DAWG times average
size of the node). S(t) and S(d) should be roughly the same (they
contain the same information), and N(d) should not be greater than N(t)
(if it is you can always reorganize the DAWG into a trie).
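
To put numbers on it with Bo's dictionary: if S is the same in both
structures, the trie costs 791089 * S bytes against 130892 * S for the
DAWG, so the factor of roughly six survives whatever S turns out to be.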

My gut feeling is that even with reasonable implementations, QUITE
OFTEN one would have NS > ML.

Possible, but that applies equally to tries as to DAWGs.

hp
 

Peter J. Holzer

PJH> If it's not necessary then don't use it. We are of course talking about
PJH> the case where a hash is the appropriate data structure.

This is a circular argument: of course if it's the appropriate data
structure, it's the appropriate data structure.

Not quite. I say that when a hash is the appropriate data structure in
general, it very often is the appropriate data structure for a large
number of elements, because a hash scales well (lookup, insertion and
deletion are O(1)).

It becomes impractical only when the size exceeds the size of available
RAM (but note that many databases also use hash-based indexes, so hashes
are also usable for on-disk data, you just have to organize your data a
bit differently).
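
In Perl terms the closest analogue is a tied on-disk hash; a rough sketch,
assuming DB_File (i.e. Berkeley DB) is installed, with a made-up file name:

use strict;
use warnings;
use Fcntl;
use DB_File;

# On-disk hash: lookups and inserts go to the file, not to RAM.
tie my %big, 'DB_File', 'bigdata.db', O_RDWR|O_CREAT, 0644, $DB_HASH
    or die "cannot tie bigdata.db: $!";

$big{'some key'} = 'some value';
print $big{'some key'}, "\n";

untie %big;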

My point was that
taking this approach is impractical generally.

And my point is that hashes are *generally* well suited for large data
structures because of their O(1) behaviour for all operations. Most
other data structures scale less well.
PJH> Hashes are generally used for fast random lookup, not for iterating.

Sure, once it's loaded, it's fast. But the startup time is atrocious

The startup time is O(n). For most other data structures it's worse (for
trees it's O(n * log n), for example). The insertion time for each
element may be a bit long (on my test system, inserting a hash member
with a short key takes about 2µs - 10 times as long as inserting an
array member), but that's the same for small hashes as for large hashes,
so it's not an argument against large hashes specifically.
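
(The comparison was nothing fancier than this kind of Benchmark run - not
my exact script, just the shape of it:)

use strict;
use warnings;
use Benchmark qw(cmpthese);

# Compare inserting hash members vs pushing array members.
my (%h, @a);
my $i = 0;
cmpthese(-1, {
    hash_insert => sub { $h{ 'key' . $i++ } = 1 },
    array_push  => sub { push @a, 1 },
});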

Of course loading a hash with 100 million members into memory and then
doing a single lookup would be silly. But again that's true for any data
structure. The time to build the data structure must amortize itself
over the lifetime of the structure. So a huge hash only makes sense if
the time to build it is small compared to the time saved by doing fast
lookups. Adding 100 million records into a database takes even longer
(a *lot* longer), but if it lives long enough it may eventually amortize
itself.

and you can only do lookups.

You can also walk the whole hash in hash order. If you have different
requirements, don't use a hash. But again that's mostly independent of
the hash size.
Also, do you need deletions, and if so, are you happy with Perl's
memory management?

I've griped about Perl's memory management often enough that you
probably know that I'm not happy with it :). But it doesn't actually
treat large hashes badly. Deleting a single member is still O(1) - the
reference counting GC is good in this case, I think a mark-sweep based
one would be worse. What's bad is freeing a large hash at once - it's
equivalent to deleting each member in turn. But again, that's the same
for any data structure which consists of a large number of objects in
perl.
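
A throwaway way to see the effect (not a proper benchmark):

use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Freeing a big hash means freeing every entry in turn.
my %h = map { $_ => 1 } 1 .. 1_000_000;

my $t0 = [gettimeofday];
undef %h;
printf "freeing 1e6 entries took %.3f s\n", tv_interval($t0);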

PJH> That takes even more memory, possibly a lot more (depends where you
PJH> split the key), and it doesn't improve locality. It may help if you only
PJH> need to access data from one partition, but then why are you putting the
PJH> other data into memory at all? It also helps for sorting, but you may
PJH> not have to sort your data.

In other words, it may help and it may not, and we can't know if we
don't know the specific use case. Right.

You forgot that it may also hurt a lot. So the advice to always
partition large hashes without looking at the specific use case may
backfire badly.


So you don't know the use case but you give general advice? Right.

PJH> It is practical as long as it fits into memory. One case where I could
PJH> have used a hash of 50+ million elements (but I couldn't at the time,
PJH> since that was on a 32 bit system so it wouldn't fit into memory) was a
PJH> comparison between two datasets: Read the first one into a hash, then
PJH> step through the second one record by record and look up the data from
PJH> the first. Total time: roughly the time to read the two files
PJH> sequentially. What I did instead was sort both datasets and then compare
PJH> the sorted files. Works, but was a lot slower (but still faster than
PJH> letting Oracle do the comparison).

This is similar to the case where I ended up with Sybase because the
Perl-based approach didn't scale.

That doesn't have much to do with the scalability of Perl hashes (within
RAM limits). You delegated a task to a program written for that task in
C instead of writing it in Perl and discovered that it was faster at it
than your Perl program. Big surprise. (BTW, did you look at the
execution plan? I wouldn't be surprised if it said "hash join" at some
place or other).
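
For reference, the hash-based comparison I was describing is essentially
the following (file names, the tab-separated format and the key column are
made up for the sketch):

use strict;
use warnings;

# Pass 1: load the first dataset into a hash keyed by record key.
open my $first, '<', 'dataset_a.txt' or die "dataset_a.txt: $!";
my %seen;
while (my $line = <$first>) {
    chomp $line;
    my ($key, $rest) = split /\t/, $line, 2;
    $seen{$key} = $rest // '';
}
close $first;

# Pass 2: stream the second dataset and look each record up.
open my $second, '<', 'dataset_b.txt' or die "dataset_b.txt: $!";
while (my $line = <$second>) {
    chomp $line;
    my ($key, $rest) = split /\t/, $line, 2;
    $rest //= '';
    if (!exists $seen{$key}) {
        print "only in B: $key\n";
    } elsif ($seen{$key} ne $rest) {
        print "differs:   $key\n";
    }
}
close $second;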

it would have been a fast and sufficient approach to just use a hash,
but against Sybase I was able to parallelize the queries and do specific
comparisons on the interesting fields instead of the whole record. It
cut the run time significantly to do it against an RDBMS. As I said, it
was faster to just use a hash at first, but it quickly became an
unmanageable hack. That was my experience.

The possibility of parallelization may be an advantage, but that's a
weakness of Perl generally, not specific to large hashes.

Limiting the comparisons to interesting fields should be possible with
hashes also (in fact I would strongly advise to load only the
interesting data into memory to reduce memory consumption).

Other than that I suspect that you just ran out of RAM. Perl is a memory
hog and adds a large overhead. But again, that's perl's weakness in
general and not specific to hashes (much less large hashes).

PJH> You claimed that having huge hashes is slow (it isn't) and always a
PJH> mistake (it isn't).

OK, let's say it's fast for lookups in some cases and is a passable
solution in some very rare cases. Is that a fair summary?

No. There are use cases where hashes are appropriate and some where they
are not. This depends on a lot of factors but the hash size is of
relatively small importance. In fact, hashes scale better than almost
any other data structure.

I strongly disagree with "small hashes good, large hashes bad".

hp
 

Ted Zlatanov

PJH> The startup time is O(n). For most other data structures it's worse (for
PJH> trees it's O(n * log n), for example). The insertion time for each
PJH> element may be a bit long (on my test system, inserting a hash member
PJH> with a short key takes about 2µs - 10 times as long as inserting an
PJH> array member), but that's the same for small hashes as for large hashes,
PJH> so it's not an argument against large hashes specifically.

PJH> Of course loading a hash with 100 million members into memory and then
PJH> doing a single lookup would be silly. But again that's true for any data
PJH> structure. The time to build the data structure must amortize itself
PJH> over the lifetime of the structure. So a huge hash only makes sense if
PJH> the time to build it is small compared to the time saved by doing fast
PJH> lookups. Adding 100 million records into a database takes even longer
PJH> (a *lot* longer), but if it lives long enough it may eventually amortize
PJH> itself.

A tiny MySQL database will happily chug through the same amount of
records as a 100GB hash, have very cheap startup time once the records
are in persistent storage, will be accessible from multiple processes
and network locations, and have other nice properties like better
queries and deterministic enumeration. I don't buy your argument, which
boils down to "fast lookups are worth a lot of trouble, like locking
yourself down to a single process, starting up slowly every time, and
using up a lot of memory." It doesn't make technical or business sense
to propose it as a solution unless you're certain there is no need for a
database and anticipate no need for long-term maintenance or extensions.

PJH> You can also walk the whole hash in hash order. If you have different
PJH> requirements, don't use a hash. But again that's mostly independent of
PJH> the hash size.

Well, no. The hash size can make it impossible to construct a list of
the keys if memory runs out. So it certainly matters and you have to be
VERY careful to avoid making a copy of the keys. A trie, by comparison,
can be comfortably walked without using much memory.
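
(To be concrete about the key-copy trap, the difference is between these
two loops - a tiny sketch, but imagine 100 million keys:)

use strict;
use warnings;

my %hash = map { "key$_" => $_ } 1 .. 5;

# keys() materialises the complete key list before the loop starts:
for my $k (keys %hash) { print "$k\n" }

# each() walks one entry at a time without building that list:
while (my ($k, $v) = each %hash) { print "$k => $v\n" }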

PJH> You forgot that it may also hurt a lot. So the advice to always
PJH> partition large hashes without looking at the specific use case may
PJH> backfire badly.

I think all along I've shown great willingness to look at specific use cases.

PJH> That doesn't have much to do with the scalability of Perl hashes (within
PJH> RAM limits). You delegated a task to a program written for that task in
PJH> C instead of writing it in Perl and discovered that it was faster at it
PJH> than your Perl program. Big surprise.

But Peter, we're talking about how to approach a problem here. The tool
matters, certainly: you want to delegate to tools written for managing
large amounts of data. Perl hashes are not such a tool. They are a
general-purpose storage structure.

PJH> Other than that I suspect that you just ran out of RAM. Perl is a memory
PJH> hog and adds a large overhead. But again, that's perl's weakness in
PJH> general and not specific to hashes (much less large hashes).

I think you'll find most hash table implementations in any language have
that problem. They must overallocate by quite a bit, for instance, to
amortize the malloc costs.

PJH> There are use cases where hashes are appropriate and some where
PJH> they are not. This depends on a lot of factors but the hash
PJH> size is of relatively small importance.

I agree as long as the resulting hash size is manageable in terms of the
specific needs of the user. But when it becomes a problem (AKA the
point at which the CompSci major calls the CompEng major to complain the
system is slow) it's a *big* problem. So I'm very suspicious of large
hash tables.

PJH> In fact, hashes scale better than almost any other data structure.

That's definitely not true. They make terrible arrays ;)

PJH> I strongly disagree with "small hashes good, large hashes bad".

I don't think I said that. I use verbs, not grunts.

Ted
 

Peter J. Holzer

Well, no. The hash size can make it impossible to construct a list of
the keys if memory runs out.

How often do I have to repeat "as long as it fits into memory"? I don't
think it makes sense to continue this discussion if you don't read what
I write.

hp
 

Ted Zlatanov

PJH> How often do I have to repeat "as long as it fits into memory"? I don't
PJH> think it makes sense to continue this discussion if you don't read what
PJH> I write.

By "it" you meant the hash itself, right? I am saying that the list of
keys may be large enough that constructing it for any purpose will make
you run out of memory even if the hash fits by itself.

Also you haven't mentioned memory overallocation and hash resizing.
These performance optimizations make it hard to judge exactly when the
hash will stop fitting in memory, with unfortunate consequences.

Ted
 

Bo Lindbergh

Ilya Zakharevich said:
This is an example of a completely useless stat. What is the URL for
this list? Do you consider tries or folded tries (those with a
string-per-node, not char-per-node)? A 6x difference in node count MAY
still be trivial if node sizes differ.

http://www.stacken.kth.se/~blgl/triedawgfoldtest/

The provided program should give you more than enough useful information
to compare the sizes of various trie variations. Plus, you can feed its
output to Graphviz to make pretty diagrams! (Bring lots of paper if you
want to visualise an entire English dictionary.)


/Bo Lindbergh
 

George Mpouras

I tried the Text::DAWG module using sample data generated by the following
code:

my $how_many_numbers = 5_000_000;
my $length_of_number = 14;
for (1 .. $how_many_numbers) {
    my $n;
    $n .= int rand 10 for 1 .. $length_of_number;
    print "$n\n";
}

The data size is exactly 80 MB.
I killed the Perl process while its memory was at 3.7 GB;
I was forced to kill it because all my physical memory was consumed.

Supposedly the DAWG should be smaller than the actual data.
I tried a Bloom filter and was able to store all these 80 MB of numbers in
only 13k with very few false positives.
Apparently something does not work very well here ...
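
(For reference, this is roughly what I mean by a Bloom filter - a
simplified sketch, not one of my two variants, and the bit-vector size and
hash count are only for illustration:)

use strict;
use warnings;
use Digest::MD5 qw(md5);

my $bits   = 8 * 1024 * 1024;          # 1 MB bit vector
my $k      = 4;                        # hash functions per key
my $filter = "\0" x ($bits / 8);

sub _positions {
    my ($key) = @_;
    my @h = unpack 'N4', md5($key);    # four 32-bit values from one MD5
    return map { $_ % $bits } @h[0 .. $k - 1];
}

sub bloom_add {
    my ($key) = @_;
    vec($filter, $_, 1) = 1 for _positions($key);
}

sub bloom_check {                      # 0 = definitely absent, 1 = maybe present
    my ($key) = @_;
    for my $pos (_positions($key)) {
        return 0 unless vec($filter, $pos, 1);
    }
    return 1;
}

bloom_add('12345678901234');
print bloom_check('12345678901234') ? "maybe present\n" : "definitely absent\n";
print bloom_check('00000000000000') ? "maybe present\n" : "definitely absent\n";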
 

Bo Lindbergh

George Mpouras said:
I tried the Text::DAWG module using sample data generated by the following
code:

my $how_many_numbers = 5_000_000;
my $length_of_number = 14;
for (1 .. $how_many_numbers) {
    my $n;
    $n .= int rand 10 for 1 .. $length_of_number;
    print "$n\n";
}

The data size is exactly 80 MB.
I killed the Perl process while its memory was at 3.7 GB;
I was forced to kill it because all my physical memory was consumed.

Supposedly the DAWG should be smaller than the actual data.
I tried a Bloom filter and was able to store all these 80 MB of numbers in
only 13k with very few false positives.
Apparently something does not work very well here ...

This is just as expected. The Text::DAWG documentation plainly states:

However, the unoptimised trie and the optimisation process itself
uses many times as much memory as the final result.


/Bo Lindbergh
 

Bo Lindbergh

http://www.stacken.kth.se/~blgl/triedawgfoldtest/
The provided program should give you more than enough useful information
to compare the sizes of various trie variations. Plus, you can feed its
output to Graphviz to make pretty diagrams! (Bring lots of paper if you
want to visualise an entire English dictionary.)

using the words

agapi
agapo
ogapo

the part "gap" should be common but it is not[/QUOTE]

No, it shouldn't. An infix will be shared only if every combination
of prefixes and suffixes occurs. You have to add "ogapi" to the set
to fulfill this condition.
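
A throwaway sketch that shows why, by listing what can follow each
one-character prefix (add "ogapi" to the list and rerun):

use strict;
use warnings;

my @words = qw(agapi agapo ogapo);

my %after;
for my $w (@words) {
    push @{ $after{ substr($w, 0, 1) } }, substr($w, 1);
}
for my $first (sort keys %after) {
    print "$first -> @{ $after{$first} }\n";
}
# Prints:
#   a -> gapi gapo
#   o -> gapo
# The suffix sets differ, so the "gap" parts end up as separate subgraphs
# in the minimal DAWG.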


/Bo Lindbergh
 
