hash

Discussion in 'Perl Misc' started by George Mpouras, May 12, 2012.

  1. The clever smartmatch $a ~~ @array is a lot slower than exists $hash{$a}.
    Also the hashing algorithm is much slower than binary search, so wouldn't
    it be better if we stored key/value pairs in btree structures instead of hashes?
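
    George's speed claim is easy to check with the core Benchmark module. A
    minimal sketch (data sizes and key names are arbitrary; grep stands in
    for the linear scan that $a ~~ @array performs against an array):

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    # Arbitrary test data: 1000 keys, stored both as an array and a hash.
    my @array  = map { "key$_" } 1 .. 1000;
    my %hash   = map { $_ => 1 } @array;
    my $needle = 'key500';

    cmpthese(-1, {
        # Linear scan, comparable to what $needle ~~ @array does for arrays.
        array_scan  => sub { my $found = grep { $_ eq $needle } @array },
        # Single hash lookup, O(1) on average.
        hash_exists => sub { my $found = exists $hash{$needle} },
    });
    ```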
     
    George Mpouras, May 12, 2012
    #1

  2. "George Mpouras" <> wrote:
    [...]
    >Also hashing algorithm is much slower than binary search,


    ???

    hash access is typically O(1) while binary search is O(log n). Why do
    you think hashing is slower than binary search?

    >so wouldn't be
    >better if we store key,values to btree structures instead of hashes ?


    Is this a question about the internal implementation of Perl hashes deep
    down in the guts of the abstract Perl machine?

    jue
     
    Jürgen Exner, May 12, 2012
    #2

  3. Jürgen Exner <> writes:
    > "George Mpouras" <> wrote:
    > [...]
    >>Also hashing algorithm is much slower than binary search,

    >
    > ???
    >
    > hash access is typically O(1) while binary search is O(log n). Why do
    > you think hashing is slower than binary search?


    Hashing itself is not really an algorithm; it's a data structure.

    Some particular thing you do with hashing may well be slower than binary
    search, but without knowing what the OP is doing it's difficult to
    speculate.

    [...]

    --
    Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
    Will write code for food.
    "We must do something. This is something. Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
     
    Keith Thompson, May 13, 2012
    #3
  4. Keith Thompson <> wrote:
    >Jürgen Exner <> writes:
    >> "George Mpouras" <> wrote:
    >> [...]
    >>>Also hashing algorithm is much slower than binary search,

    >>
    >> ???
    >>
    >> hash access is typically O(1) while binary search is O(log n). Why do
    >> you think hashing is slower than binary search?

    >
    >Hashing itself is not really an algorithm; it's a data structure.


    True

    >Some particular thing you do with hashing may well be slower than binary
    >search,


    Absolutely. That's why I said "typically". Of course, if you get a
    degenerate hash then access might be as bad as a linear search.

    >but without knowing what the OP is doing it's difficult to
    >speculate.


    ACK

    jue
     
    Jürgen Exner, May 13, 2012
    #4

  5. >> hash access is typically O(1) while binary search is O(log n). Why do
    >> you think hashing is slower than binary search?

    >
    >Hashing itself is not really an algorithm; it's a data structure.


    True

    A hash is a data structure, and there is internal logic for how to
    retrieve values. If there is a big number of keys then the hash may have
    conflicting keys; these conflicts are stored as lists, where retrieval is
    slow, O(n). Retrieving btree values is much faster, O(log n). There is no
    particular problem; I had a conversation with my colleague discussing this
    and maybe there is a need for a module to make it easy.
     
    George Mpouras, May 13, 2012
    #5
  6. On 2012-05-13 03:10, Jürgen Exner <> wrote:
    > Keith Thompson <> wrote:
    >>Jürgen Exner <> writes:
    >>> "George Mpouras" <> wrote:
    >>> [...]
    >>>>Also hashing algorithm is much slower than binary search,
    >>>
    >>> ???
    >>>
    >>> hash access is typically O(1) while binary search is O(log n). Why do
    >>> you think hashing is slower than binary search?


    Big-O notation only tells you how an algorithm scales with increasing n.
    It doesn't tell you how it performs for any particular n (especially not
    for a small n).


    >>Hashing itself is not really an algorithm; it's a data structure.

    >
    > True


    <nitpick>
    The data structure is called a "hash table". There is also the "hash
    function". Both are frequently abbreviated as "hash". "Hashing" may
    refer to computing a hash function or to filling a hash table. In
    both cases it is an algorithm.
    </nitpick>


    >>Some particular thing you do with hashing may well be slower than binary
    >>search,

    >
    > Absolutely. That's why I said "typically". Of course, if you get a
    > degenerate hash then access might be as bad as a linear search.


    A degenerate hash is not necessary.

    A successful lookup of an element in a hash consists of the following
    steps:

    1) compute the hash of the lookup key (cost: c1 * length(lookup_key))
    2) compute the location of the element in the hash (cost: negligible)
    3) compare the lookup key to the element key
    (cost c2 * length(lookup_key) if successful, almost zero if
    unsuccessful, because two strings that collide in the hash are
    unlikely to share a common prefix).
    4) If unsuccessful, compute the location of the next element (this may
    be a simple pointer lookup or involve the computation of a new hash)
    and continue at 3.

    If the hash is properly constructed, step 4 is very rare, so the access
    time is

    c1 * length(lookup_key) + c2 * length(lookup_key)

    For a binary tree, you have to descend d levels, and do a string
    comparison at every level, so the cost is

    c2 * length(common_prefix(lookup_key, node1_key)) +
    c2 * length(common_prefix(lookup_key, node2_key)) +
    ...
    c2 * length(lookup_key)

    (the common prefix is likely to get longer as we descend down the tree)

    Note that the last term is the same, so we can eliminate it.

    It follows that the hash access is faster than the binary tree[1] access
    if the computation of the hash is faster than the (d-1) partial string
    compares. This depends on the length of lookup_key, the distribution of
    the keys in the dictionary, the hash algorithm, the depth of the tree,
    how much of your data is already in the CPU cache, etc.

    In short, it is impossible to say that a hash is faster than a binary
    tree or vice versa. You have to benchmark particular implementations for
    particular data.

    There are some general observations, though:

    The tree gets slower with increasing depth while the hash is quite
    insensitive to the number of elements, so the hash has an advantage
    for a large number of elements.

    For a hash lookup you have to compute a hash of the entire key up front,
    while for the tree you only have to compare the common prefix (+1 byte),
    so the tree has an advantage if the keys are long (and don't have a
    common prefix).

    The hash is likely to be more cache friendly because you need to
    look only at two contiguous memory locations while in the tree you need
    to look at (d+1) memory locations.
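
    The "benchmark particular implementations" advice can be followed with a
    small sketch: a hash lookup versus a binary search over a sorted array,
    which stands in for a balanced tree here (core Perl has no tree type, and
    both do about log2(n) string comparisons per lookup). Key format and data
    size are arbitrary assumptions:

    ```perl
    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    # Binary search over a sorted array, standing in for a balanced tree.
    sub bsearch {
        my ($aref, $key) = @_;
        my ($lo, $hi) = (0, $#$aref);
        while ($lo <= $hi) {
            my $mid = int(($lo + $hi) / 2);
            my $cmp = $aref->[$mid] cmp $key;
            if    ($cmp < 0) { $lo = $mid + 1 }
            elsif ($cmp > 0) { $hi = $mid - 1 }
            else             { return $mid }    # found: return index
        }
        return -1;                              # not found
    }

    # Zero-padded keys so lexical order matches numeric order.
    my @sorted = map { sprintf 'key%05d', $_ } 1 .. 10_000;
    my %hash   = map { $_ => 1 } @sorted;
    my $needle = 'key05000';

    cmpthese(-1, {
        hash    => sub { my $x = exists $hash{$needle} },
        bsearch => sub { my $x = bsearch(\@sorted, $needle) },
    });
    ```

    As the post argues, the winner depends on key length, key distribution,
    and cache behaviour, so results from one data set don't generalize.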


    >>but without knowing what the OP is doing it's difficult to
    >>speculate.

    >
    > ACK


    Also ACK.

    hp

    [1] George also wrote "btree", which is not a binary tree. For a btree
    it gets even more complicated, but btrees are normally used for disk
    data structures, not in memory, although some in-memory data structures
    (e.g. Judy arrays) use similar techniques to exploit caching.


    --
    _ | Peter J. Holzer | Deprecating human carelessness and
    |_|_) | Sysadmin WSR | ignorance has no successful track record.
    | | | |
    __/ | http://www.hjp.at/ | -- Bill Code on
     
    Peter J. Holzer, May 13, 2012
    #6
  7. Keith Thompson <> writes:
    > Jürgen Exner <> writes:
    >> "George Mpouras" <> wrote:
    >> [...]
    >>>Also hashing algorithm is much slower than binary search,

    >>
    >> ???
    >>
    >> hash access is typically O(1) while binary search is O(log n). Why do
    >> you think hashing is slower than binary search?

    >
    > Hashing itself is not really an algorithm; it's a data structure.


    Hashing is an algorithm and not a data structure: Usually, it refers
    to 'calculate a "hash value"' (relatively small integer) from some
    (significantly) larger 'input data value'. Usually, this hash value is
    then used as index into a vector of pointers to locate 'a list' on
    which some kind of data item associated with this 'input data value'
    (key) should either exist or needs to be put on.
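
    That description maps directly onto a few lines of code. A toy sketch
    (the djb2-style constants 5381 and 33 are illustrative choices, not what
    perl itself uses):

    ```perl
    use strict;
    use warnings;

    # Toy hash function: reduce a string to a relatively small integer
    # (djb2-style; perl's real hash function is different).
    sub toy_hash {
        my ($key) = @_;
        my $h = 5381;
        $h = ($h * 33 + ord $_) % 2**32 for split //, $key;
        return $h;
    }

    # Use the hash value as an index into a table of lists (buckets).
    my $nbuckets = 8;
    my @table = map { [] } 1 .. $nbuckets;
    for my $key (qw(foo bar baz quux)) {
        push @{ $table[ toy_hash($key) % $nbuckets ] }, $key;
    }
    ```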
     
    Rainer Weikusat, May 13, 2012
    #7
  8. "George Mpouras" <>
    writes:

    [...]

    > hash is a data structure and there is a an internal logic of how to
    > retrieve values. If there is a big number of keys then hash may have
    > conflict keys; these conflicts are stored as lists where its retrieval
    > is slow (n). Retrieving btree values is much faster O(log n).


    This doesn't exactly make sense: Let's assume I'm storing 1024
    key-value pairs in a hash table and what I end up with are 128 lists of
    length 8. Ignoring the overhead for calculating the hash values, this
    means that I can locate each of these 1024 pairs with at most 8
    key comparisons. If the same 1024 key-value pairs were organized as
    some kind of balanced binary search tree, that would be log2(1024) =
    10 key comparisons. Since '128 pointers' isn't exactly a lot of data
    nowadays, it should be possible to double the size of the table and
    thus end up with 256 lists of length 4 and so on.

    NB: This is a theoretical consideration and 'in practice' things
    aren't that simple. But it should be sufficient to show that 'searching
    a linked list is slow while ...' doesn't hold water.
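
    The arithmetic in this example is easy to check; a quick sketch:

    ```perl
    use strict;
    use warnings;

    my $pairs = 1024;
    my $slots = 128;

    # Worst-case comparisons walking one collision chain:
    my $chain_comparisons = $pairs / $slots;                 # 1024/128 = 8

    # Comparisons descending a balanced binary search tree:
    my $tree_comparisons = int(log($pairs) / log(2) + 0.5);  # log2(1024) = 10

    printf "chain: %d comparisons, balanced BST: %d comparisons\n",
        $chain_comparisons, $tree_comparisons;
    ```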
     
    Rainer Weikusat, May 13, 2012
    #8
  9. Ben Morrow <> writes:
    > Quoth Rainer Weikusat <>:
    >> Keith Thompson <> writes:
    >> > Jürgen Exner <> writes:
    >> >> "George Mpouras" <> wrote:
    >> >> [...]
    >> >>>Also hashing algorithm is much slower than binary search,
    >> >>
    >> >> hash access is typically O(1) while binary search is O(log n). Why do
    >> >> you think hashing is slower than binary search?
    >> >
    >> > Hashing itself is not really an algorithm; it's a data structure.

    >>
    >> Hashing is an algorithm and not a data structure: Usually, it refers
    >> to 'calculate a "hash value"' (relatively small integer) from some
    >> (significantly) larger 'input data value'. Usually, this hash value is
    >> then used as index into a vector of pointers to locate 'a list' on
    >> which some kind of data item associated with this 'input data value'
    >> (key) should either exist or needs to be put on.

    >
    > I don't know about 'usually'. One of the more common uses of hash
    > algorithms is in message digests and digital signatures and such.


    The common use of hashing in perl is the implementation of so-called
    hashes (if you think that hashes in perl are sufficiently uncommon
    that referring to them as 'common' seems unwarranted, please feel
    free to argue for another definition of 'commonly used'). In this
    case, it reduces a sequence of bytes to an unsigned 32-bit value
    which is used as 'base value' for an index into a table of
    lists.

    So-called 'hash functions' have other uses than in lookup
    tables (such as for generating message digests which can then be
    'signed' by encrypting them with the private key of a public/private
    key pair for public key cryptography to produce a so-called 'digital
    signature', or to calculate hashed message authentication codes [HMAC]),
    but in Perl such uses are relatively rare, and in any case, this is
    beside the point in a discussion about 'fast/slow lookups'.

    > I think perhaps Keith meant to say 'A hash *table* is not an
    > algorithm, it's a data structure', which is entirely true.


    The term is commonly used but this is really just as sloppy as
    referring to 'the usual case' as 'the usual case': A 'hash table' is
    inherently in no way different from any other 'table' (for this
    definition of table), just the way it is being used differs.
     
    Rainer Weikusat, May 13, 2012
    #9
  10. On Sun, 13 May 2012 19:11:39 +0100, Rainer Weikusat wrote:

    > "George Mpouras" <>
    > writes:
    >
    > [...]
    >
    >> hash is a data structure and there is an internal logic of how to
    >> retrieve values. If there is a big number of keys then hash may have
    >> conflict keys; these conflicts are stored as lists where retrieval
    >> is slow (n). Retrieving btree values is much faster O(log n).

    >
    > This doesn't exactly make sense: Let's assume I'm storing 1024 key-value
    > pairs in a hash table and what I end up with are 128 lists of length 8.
    > Ignoring the overhead for calculating the hash values, this means that I
    > can locate each of these 1024 pairs with at most 8 key
    > comparisons.


    If you end up with so many lists of that length, something is seriously
    wrong with your hash table. It's either too small, or you have a very bad
    hashing function.

    > If the same 1024 key-value pairs were organized as some
    > kind of balanced binary search tree, that would be log2(1024) = 10 key
    > comparisons. Since '128 pointers' isn't exactly a lot of data nowadays,
    > it should be possible to double the size of the table and thus end up
    > with 256 lists of length 4 and so on.


    George was talking about a btree, not a binary tree. Those are fairly
    different beasts.

    HTH,
    M4
     
    Martijn Lievaart, May 14, 2012
    #10
  11. On 2012-05-14 06:47, Martijn Lievaart <> wrote:
    > On Sun, 13 May 2012 19:11:39 +0100, Rainer Weikusat wrote:
    >> "George Mpouras" <>
    >> writes:
    >>> hash is a data structure and there is an internal logic of how to
    >>> retrieve values. If there is a big number of keys then hash may have
    >>> conflict keys; these conflicts are stored as lists where retrieval
    >>> is slow (n). Retrieving btree values is much faster O(log n).

    >>
    >> This doesn't exactly make sense: Let's assume I'm storing 1024 key-value
    >> pairs in a hash table and what I end up with are 128 lists of length 8.
    >> Ignoring the overhead for calculating the hash values, this means that I
    >> can locate each of these 1024 pairs with at most 8 key
    >> comparisons.

    >
    > If you end up with so many lists of that length, something is seriously
    > wrong with your hash table. It's either too small, or you have a very bad
    > hashing function.


    The perl hash implementation always keeps the load factor <= 1, so a
    hash table with 1024 elements has at least 1024 slots. Of course it's
    possible that 896 of these slots are empty and 128 slots have 8 elements
    each, but as you say that would mean a very bad hash function or a
    deliberate attack. To guard against the latter possibility, perl >= 5.8
    (IIRC) monitors the chain length and changes the seed of the hash
    function if a chain grows too long.
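
    On perls of that era the bucket usage can be inspected directly: a
    non-empty hash in scalar context reports a 'used/total' bucket string
    (since 5.26 it simply returns the key count instead). A small sketch:

    ```perl
    use strict;
    use warnings;

    my %h = map { ("key$_" => 1) } 1 .. 1024;

    # Pre-5.26 perls print something like "775/1024" here (used buckets /
    # total buckets); 5.26 and later just print the number of keys.
    print "scalar context: ", scalar(%h), "\n";
    print "keys:           ", scalar(keys %h), "\n";
    ```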

    But I think Rainer was trying to come up with a worst-case scenario
    where a hash is still faster (requires fewer comparisons) than a binary
    tree, as the next paragraph shows:

    >> If the same 1024 key-value pairs were organized as some
    >> kind of balanced binary search tree, that would be log2(1024) = 10 key
    >> comparisons. Since '128 pointers' isn't exactly a lot of data nowadays,
    >> it should be possible to double the size of the table and thus end up
    >> with 256 lists of length 4 and so on.

    >
    > George was talking about a btree, not a binary tree. Those are fairly
    > different beasts.


    Yes, I noted that too in an earlier posting, but btrees are normally
    used for on-disk data structures and they don't implement a binary
    search, so it seems likely that George really meant binary trees, not
    btrees.

    hp


    --
    _ | Peter J. Holzer | Deprecating human carelessness and
    |_|_) | Sysadmin WSR | ignorance has no successful track record.
    | | | |
    __/ | http://www.hjp.at/ | -- Bill Code on
     
    Peter J. Holzer, May 14, 2012
    #11
  12. On Mon, 14 May 2012 10:00:55 +0200, Peter J. Holzer wrote:

    >> George was talking about a btree, not a binary tree. Those are fairly
    >> different beasts.

    >
    > Yes, I noted that too in an earlier posting, but btrees are normally
    > used for on-disk data structures and they don't implement a binary


    Btrees are used all over the place, not just on-disk. OTOH, in memory you
    are much more likely to actually use a red-black tree or similar, but the
    interface and performance characteristics are so similar to a btree that
    they are often called btrees as well.

    > search, so it seems likely that George really meant binary trees, not
    > btrees.


    Possible, even probable.

    M4
     
    Martijn Lievaart, May 14, 2012
    #12
  13. Martijn Lievaart <> writes:
    > On Sun, 13 May 2012 19:11:39 +0100, Rainer Weikusat wrote:
    >> "George Mpouras" <>
    >> writes:
    >>
    >> [...]
    >>
    >>> hash is a data structure and there is an internal logic of how to
    >>> retrieve values. If there is a big number of keys then hash may have
    >>> conflict keys; these conflicts are stored as lists where retrieval
    >>> is slow (n). Retrieving btree values is much faster O(log n).

    >>
    >> This doesn't exactly make sense: Let's assume I'm storing 1024 key-value
    >> pairs in a hash table and what I end up with are 128 lists of length 8.
    >> Ignoring the overhead for calculating the hash values, this means that I
    >> can locate each of these 1024 pairs with at most 8 key
    >> comparisons.

    >
    > If you end up with so many lists of that length, something is seriously
    > wrong with your hash table. It's either too small, or you have a very bad
    > hashing function.


    This was an example meant to illustrate that 'searching a linked
    list is slow while ...' doesn't make sense.

    Apart from that: If I have a fixed size table (128 slots in this
    example) and my key/value pairs are distributed evenly onto these 128
    slots (yielding 128 lists of length 8), my hash function quite
    obviously did the best job it is theoretically capable of doing.

    >> If the same 1024 key-value pairs were organized as some
    >> kind of balanced binary search tree, that would be log2(1024) = 10 key
    >> comparisons. Since '128 pointers' isn't exactly a lot of data nowadays,
    >> it should be possible to double the size of the table and thus end up
    >> with 256 lists of length 4 and so on.

    >
    > George was talking about a btree, not a binary tree. Those are fairly
    > different beasts.


    And I was writing about a data structure with the performance
    characteristics he mentioned. Even if he was really referring to 'a
    B-tree' and not just using btree as an abbreviation for 'binary tree'
    (which I very strongly suspect), that doesn't matter for this example.
     
    Rainer Weikusat, May 14, 2012
    #13
  14. Martijn Lievaart <> writes:
    > On Mon, 14 May 2012 10:00:55 +0200, Peter J. Holzer wrote:
    >>> George was talking about a btree, not a binary tree. Those are fairly
    >>> different beasts.

    >>
    >> Yes, I noted that too in an earlier posting, but btrees are normally
    >> used for on-disk data structures and they don't implement a binary

    >
    > Btrees are used all over the place, not just on-disk. OTOH, in memory you
    > are much more likely to actually use a red-black tree or similar, but the
    > interface and performance characteristics are so similar to a btree that
    > they are often called btrees as well.


    A 'red-black tree' is a balanced, binary search tree with the
    'red-black' referring to a specific balancing algorithm.
     
    Rainer Weikusat, May 14, 2012
    #14
  15. On Mon, 14 May 2012 13:49:48 +0100, Rainer Weikusat wrote:

    > Martijn Lievaart <> writes:
    >> On Mon, 14 May 2012 10:00:55 +0200, Peter J. Holzer wrote:
    >>>> George was talking about a btree, not a binary tree. Those are fairly
    >>>> different beasts.
    >>>
    >>> Yes, I noted that too in an earlier posting, but btrees are normally
    >>> used for on-disk data structures and they don't implement a binary

    >>
    >> Btrees are used all over the place, not just on-disk. OTOH, in memory
    >> you are much more likely to actually use a red-black tree or similar,
    >> but the interface and performance characteristics are so similar to a
    >> btree that they are often called btrees as well.

    >
    > A 'red-black tree' is a balanced, binary search tree with the
    > 'red-black' referring to a specific balancing algorithm.


    I actually meant skip lists (brain fart), but the statement is still not
    incorrect.

    M4
     
    Martijn Lievaart, May 14, 2012
    #15
  16. Martijn Lievaart <> writes:
    > On Mon, 14 May 2012 13:49:48 +0100, Rainer Weikusat wrote:
    >> Martijn Lievaart <> writes:
    >>> On Mon, 14 May 2012 10:00:55 +0200, Peter J. Holzer wrote:
    >>>>> George was talking about a btree, not a binary tree. Those are fairly
    >>>>> different beasts.
    >>>>
    >>>> Yes, I noted that too in an earlier posting, but btrees are normally
    >>>> used for on-disk data structures and they don't implement a binary
    >>>
    >>> Btrees are used all over the place, not just on-disk. OTOH, in memory
    >>> you are much more likely to actually use a red-black tree or similar,
    >>> but the interface and performance characteristics are so similar to a
    >>> btree that they are often called btrees as well.

    >>
    >> A 'red-black tree' is a balanced, binary search tree with the
    >> 'red-black' referring to a specific balancing algorithm.

    >
    > I actually meant skip lists (brain fart), but the statement is still not
    > incorrect.


    Since a B-tree is a more general tree structure than a binary search
    tree, every binary search tree is also a B-tree, just not vice
    versa. While that's not consistent with a statement you made in another
    subthread, it is surely 'not incorrect'. But this "data structures I
    heard of!"-bingo is IMO pretty pointless.
     
    Rainer Weikusat, May 14, 2012
    #16
  17. On Mon, 14 May 2012 22:49:11 +0100, Rainer Weikusat wrote:

    > Martijn Lievaart <> writes:
    >> On Mon, 14 May 2012 13:49:48 +0100, Rainer Weikusat wrote:
    >>> Martijn Lievaart <> writes:
    >>>> On Mon, 14 May 2012 10:00:55 +0200, Peter J. Holzer wrote:
    >>>>>> George was talking about a btree, not a binary tree. Those are
    >>>>>> fairly different beasts.
    >>>>>
    >>>>> Yes, I noted that too in an earlier posting, but btrees are normally
    >>>>> used for on-disk data structures and they don't implement a binary
    >>>>
    >>>> Btrees are used all over the place, not just on-disk. OTOH, in memory
    >>>> you are much more likely to actually use a red-black tree or similar,
    >>>> but the interface and performance characteristics are so similar to a
    >>>> btree that they are often called btrees as well.
    >>>
    >>> A 'red-black tree' is a balanced, binary search tree with the
    >>> 'red-black' referring to a specific balancing algorithm.

    >>
    >> I actually meant skip lists (brain fart), but the statement is still not
    >> incorrect.

    >
    > Since a B-tree is a more general tree structure than a binary search
    > tree, every binary search tree is also a B-tree, just not vice versa.
    > While that's not consistent with a statement you made in another
    > subthread, it is surely 'not incorrect'. But this "data structures I
    > heard of!"-bingo is IMO pretty pointless.


    Well, that was not my intention. Mind responding to the point?

    M4
     
    Martijn Lievaart, May 15, 2012
    #17
  18. Martijn Lievaart <> writes:
    > On Mon, 14 May 2012 22:49:11 +0100, Rainer Weikusat wrote:
    >
    >> Martijn Lievaart <> writes:
    >>> On Mon, 14 May 2012 13:49:48 +0100, Rainer Weikusat wrote:
    >>>> Martijn Lievaart <> writes:
    >>>>> On Mon, 14 May 2012 10:00:55 +0200, Peter J. Holzer wrote:
    >>>>>>> George was talking about a btree, not a binary tree. Those are
    >>>>>>> fairly different beasts.
    >>>>>>
    >>>>>> Yes, I noted that too in an earlier posting, but btrees are normally
    >>>>>> used for on-disk data structures and they don't implement a binary
    >>>>>
    >>>>> Btrees are used all over the place, not just on-disk. OTOH, in memory
    >>>>> you are much more likely to actually use a red-black tree or similar,
    >>>>> but the interface and performance characteristics are so similar to a
    >>>>> btree that they are often called btrees as well.
    >>>>
    >>>> A 'red-black tree' is a balanced, binary search tree with the
    >>>> 'red-black' referring to a specific balancing algorithm.
    >>>
    >>> I actually meant skip lists (brain fart), but the statement is still not
    >>> incorrect.

    >>
    >> Since a B-tree is a more general tree structure than a binary search
    >> tree, every binary search tree is also a B-tree, just not vice versa.
    >> While that's not consistent with a statement you made in another
    >> subthread, it is surely 'not incorrect'. But this "data structures I
    >> heard of!"-bingo is IMO pretty pointless.

    >
    > Well, that was not my intention. Mind responding to the point?


    Which point?
     
    Rainer Weikusat, May 15, 2012
    #18
  19. On Tue, 15 May 2012 12:11:40 +0100, Rainer Weikusat wrote:

    > Which point?


    Never mind. I don't feel like explaining everything.

    M4
     
    Martijn Lievaart, May 15, 2012
    #19
  20. Martijn Lievaart <> writes:
    > On Tue, 15 May 2012 12:11:40 +0100, Rainer Weikusat wrote:
    >> Which point?

    >
    > Never mind. I don't feel like explaining everything.


    And I 'feel' that you neither wrote anything even remotely related to
    the original lookup question nor anything which could be considered
    'a point' at all, just a bunch of rambling statements about various
    data structures. This may be wrong, but if you can't be bothered to
    express yourself clearly in the face of a misunderstanding, whatever
    you meant to express doesn't matter.
     
    Rainer Weikusat, May 16, 2012
    #20
