Keith Thompson said:
Jürgen Exner said:
[...]
Also hashing algorythm is much slower than binary search,
???
hash access is typically O(1) while binary search is O(log n). Why do
you think hashing is slower than binary search?
Big-O notation only tells you how an algorithm scales with increasing n.
It doesn't tell you how it performs for any particular n (especially not
for a small n).
<nitpick>
The data structure is called a "hash table". There is also the "hash
function". Both are frequently abbreviated as "hash". "Hashing" may
refer to computing a hash function or to filling a hash table. In
both cases it is an algorithm.
Absolutely. That's why I said "typically". Of course, if you got a
degenerated hash then access might be as bad as a linear search.
A degenerated hash is not necessary.
A successful lookup of an element in a hash consists of the following
steps:
1) compute the hash of the lookup key (cost: c1 * length(lookup_key))
2) compute the location of element in the hash (cost: negligible)
3) compare the lookup key to the element key
(cost c2 * length(lookup_key) if successfull, almost zero if
unsuccessfull, because two strings with the same hash collision are
unlikely to have a common prefix).
4) If unsuccessfull, compute the location of the next element (this may
be a simple pointer lookup or involve the computation of a new hash)
and continue at 3.
If the hash is properly constructed, step 4 is very rare, so the access
time is
c1 * length(lookup_key) + c2 * length(lookup_key)
For a binary tree, you have to descend d levels, and do a string
comparison at every level, so the cost is
c2 * length(common_prefix(lookup_key, node1_key)) +
c2 * length(common_prefix(lookup_key, node2_key)) +
...
c2 * length(lookup_key)
(the common prefix is likely to get longer as we descend down the tree)
Note that the last term is the same, so we can eliminate it.
It follows that the hash access is faster than the binary tree[1] access
if the computation of the hash is faster than the (d-1) partial string
compares. This depends on the length of lookup_key, the distribution of
the keys in the dictionary, the hash algorithm, the depth of the tree,
how much of your data is already in the CPU cache, etc.
In short, it is impossible to say that a hash is faster than a binary
tree or vice versa. You have to benchmark particular implementations for
particular data.
There are some general observations, though:
The tree gets slower with increasing depth while the hash is quite
insensitive to the number of elements, so the hash has an advantage
for a large number of elements.
For a hash lookup you have to compute a hash of the entire key up front,
while for the tree you only have to compare the common prefix (+1 byte),
so the tree has an advantage if the keys are long (and don't have a
common prefix).
The hash is likely to be more cache friendly because you need to
look only at two contiguous memory locations while in the tree you need
to look at (d+1) memory locations.
Also ACK.
hp
[1] George also wrote "btree", which is not a binary tree. For a btree
it gets even more complicated, but btrees are normally used for disk
data structures, not in memory, although some in-memory data structures
(e.g. Judy arrays) use similar techniques to exploit caching.