which hash table - chained or open-addressed?

Julia

Hi there,

I have a question about hash tables:

When I tried to use a hash table in my program, it was hard for me to
choose between a chained hash table and an open-addressed hash table.
Can anyone give me some ideas, like what the advantages and
disadvantages of each type of hash table are? Is there a suggested way
to choose one of them, or is it just a personal preference?

Thanks in advance,

Julia
 
user923005

Julia said:
Hi there,

I have a question about hash tables:

When I tried to use a hash table in my program, it was hard for me to
choose between a chained hash table and an open-addressed hash table.
Can anyone give me some ideas, like what the advantages and
disadvantages of each type of hash table are? Is there a suggested way
to choose one of them, or is it just a personal preference?

I suggest that you use skiplists for chaining. That way, even if every
single item hashed to the same slot, you would still have O(log(n))
access instead of O(n).

The simplest way to do it is to create an array of skiplists and use
the hash value to choose which skiplist to use. As a bonus, you can put
the addresses of the skiplists into a priority queue and fully order
the data if you need to.
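
As a rough illustration of that layout, here is a minimal sketch in C.
The skiplist API (skiplist_create, skiplist_insert, skiplist_find) and
the hash_of_skiplists names are hypothetical, invented for this example;
any real skiplist implementation could be dropped in.

#include <stddef.h>

struct skiplist;   /* opaque; provided by whatever skiplist library you use */
struct skiplist *skiplist_create(int (*cmp)(const void *, const void *));
int skiplist_insert(struct skiplist *sl, void *item);
void *skiplist_find(const struct skiplist *sl, const void *key);

#define NBUCKETS 1024                        /* power of two, so we can mask */

struct hash_of_skiplists {
    struct skiplist *bucket[NBUCKETS];       /* one skiplist per hash slot   */
    size_t (*hash)(const void *key);
    int (*cmp)(const void *a, const void *b);
};

static void *hos_find(struct hash_of_skiplists *t, const void *key)
{
    size_t i = t->hash(key) & (NBUCKETS - 1);      /* pick which skiplist */
    return t->bucket[i] ? skiplist_find(t->bucket[i], key) : NULL;
}

static int hos_insert(struct hash_of_skiplists *t, void *item)
{
    size_t i = t->hash(item) & (NBUCKETS - 1);
    if (t->bucket[i] == NULL)                      /* create buckets lazily */
        t->bucket[i] = skiplist_create(t->cmp);
    return skiplist_insert(t->bucket[i], item);
}

Even if one bucket degenerates, lookups in it stay O(log n) rather than O(n).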

Anyway, you knocked on the wrong door. Your question is not about the
C language, but about programming in general.
 
CBFalconer

Julia said:
When I tried to use a hash table in my program, it was hard for me to
choose between a chained hash table and an open-addressed hash table.
Can anyone give me some ideas, like what the advantages and
disadvantages of each type of hash table are? Is there a suggested way
to choose one of them, or is it just a personal preference?

Download <http://cbfalconer.home.att.net/download/hashlib.zip>

--
<http://www.cs.auckland.ac.nz/~pgut001/pubs/vista_cost.txt>

"A man who is right every time is not likely to do very much."
-- Francis Crick, co-discoverer of DNA
"There is nothing more amazing than stupidity in action."
-- Thomas Matthews
 
websnarf

Julia said:
When I tried to use a hash table in my program, it was hard for me to
choose between a chained hash table and an open-addressed hash table.
Can anyone give me some ideas, like what the advantages and
disadvantages of each type of hash table are? Is there a suggested way
to choose one of them, or is it just a personal preference?

I've measured the performance of both, and under ideal conditions for
both, they are essentially identical, because the main overhead is
identical in both (the hash and the compare functions). Chained hash
tables are generally much easier to code. With "open address" hash
tables, implementation requires a very complicated and carefully
designed hashing system (dealing with deletes, and making a correct
chaining mechanism, for example). Open address hash tables have the
advantage that you can embed the contents of a struct in each entry
while with chained hashing you have to have either an additional pair
of "next, prev" pointers in your struct or else otherwise provide an
additional container for each entry.
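
To picture the difference, here are two toy entry layouts (the field
names and sizes are purely illustrative, not from any particular
implementation):

/* Chained: each element carries a link, or lives in a separately
   allocated node that does (a doubly linked variant would also carry
   a prev pointer). */
struct chained_node {
    struct chained_node *next;
    char key[32];
    int  value;
};

/* Open-addressed: the table is just an array of entries, no links;
   the struct contents are embedded directly in each slot.  Here an
   empty key marks a free slot. */
struct open_entry {
    char key[32];
    int  value;
};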

There are a lot of classic mistakes made during implementation of hash
tables that can reduce their performance. 1) If you use a prime-sized
table (as some books recommend) you will pay a horrendous penalty for
the % operator on every access; always use a power of 2 sized table
(and an odd, or in fact, prime numbered skip value for the open address
hash table.) 2) If you fail to store the full hash function value
along with the entry, you will be forced to call the native comparison
function for each associated entry during a search -- by pre-screening
by the original hash value, if your hash function is of high quality,
you will reduce the cost by almost never calling the comparison
function *except* on exact matches (thus also improving whatever the
branch prediction performance is in your comparison function).
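
To make both points concrete, here is a minimal open-addressing lookup
sketch (the struct layout and names are assumptions for this example,
not from any particular library; insertion, deletion and resizing are
omitted, and the table is assumed never to be completely full):

#include <stddef.h>

struct entry {
    size_t  hash;    /* full hash value, stored to pre-screen comparisons */
    void   *item;    /* NULL marks an empty slot */
};

struct table {
    struct entry *slots;
    size_t        mask;                   /* size - 1, size a power of two */
    size_t      (*hash)(const void *key);
    int         (*cmp)(const void *a, const void *b);
};

static void *table_find(const struct table *t, const void *key)
{
    size_t h    = t->hash(key);
    size_t i    = h & t->mask;            /* cheap mask instead of h % size */
    size_t skip = (h >> 16) | 1;          /* odd skip derived from high bits */

    while (t->slots[i].item != NULL) {
        /* Compare stored hashes first; call cmp() only when they match,
           which with a good hash almost always means an exact hit. */
        if (t->slots[i].hash == h && t->cmp(t->slots[i].item, key) == 0)
            return t->slots[i].item;
        i = (i + skip) & t->mask;         /* odd skip + power-of-two size
                                             visits every slot eventually */
    }
    return NULL;                          /* empty slot reached: not present */
}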

Remember that, essentially, the whole purpose of implementing a hash
table is to use the computer's underlying hardware (in this case,
random access to memory) to turn what is, in theory, an O(log n)
problem (via balanced-tree-based implementations) into an O(1) problem.
So in a sense, you are not creating some theoretical piece of
art -- leaning on the hardware or taking the above things into account
is thematic to what you are doing.
 
William Ahern

Hi there,

I have a question about hash tables:

When I tried to use a hash table in my program, it was hard for me to
choose between a chained hash table and an open-addressed hash table.
Can anyone give me some ideas, like what the advantages and
disadvantages of each type of hash table are? Is there a suggested way
to choose one of them, or is it just a personal preference?

I wrote out a long post, then deleted it. This article should have all or
most of the answers you desire:

http://en.wikipedia.org/wiki/Hash_table#Chaining
http://en.wikipedia.org/wiki/Hash_table#Open_addressing

Although maybe it doesn't clearly state the most important, and I would
hope obvious, point: it depends.
 
Eric Sosman

[...]
There are a lot of classic mistakes made during implementation of hash
tables that can reduce their performance. 1) If you use a prime-sized
table (as some books recommend) you will pay a horrendous penalty for
the % operator on every access; always use a power of 2 sized table
(and an odd, or in fact, prime numbered skip value for the open address
hash table.)

The horrendousness of the penalty for division should be
assessed in light of what else is afoot. It may be drowned
by the computation of the hash function itself, for example.
Also, there's a trade-off: If the division costs you N cycles
more than an AND would but spreads the data better and saves
you an average of C comparisons, it may be worth while. Note,
too, that an AND simply discards bits from the hash value;
if you engage in a more complex "folding" operation to let
all the bits participate, the cycle count starts creeping up
toward that of the division.

If you choose a power-of-two table size, it doesn't matter
whether the skip value is prime or not: as long as it's odd,
it'll work just fine. More generally, if you choose a table
of size T, the skip distances Si should be relatively prime to
T but need not be prime. The attraction of a prime T is that
it's easy to find Si that are relatively prime to it!
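
A tiny throwaway test makes the relative-primality condition easy to
see (the table size and skip values below are arbitrary, chosen only
for illustration):

#include <stdio.h>
#include <stdbool.h>

/* How many distinct slots does the sequence (i * skip) % size visit in
   its first `size` steps?  It covers them all iff gcd(skip, size) == 1. */
static int coverage(int size, int skip)
{
    bool seen[64] = { false };          /* size must be <= 64 for this demo */
    int visited = 0;
    for (int i = 0, pos = 0; i < size; i++, pos = (pos + skip) % size)
        if (!seen[pos]) { seen[pos] = true; visited++; }
    return visited;
}

int main(void)
{
    printf("size 64, skip  7 -> %2d slots\n", coverage(64, 7));   /* 64 */
    printf("size 64, skip  9 -> %2d slots\n", coverage(64, 9));   /* 64: odd but not prime */
    printf("size 64, skip 12 -> %2d slots\n", coverage(64, 12));  /* 16: gcd is 4 */
    return 0;
}
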
2) If you fail to store the full hash function value
along with the entry, you will be forced to call the native comparison
function for each associated entry during a search -- by pre-screening
by the original hash value, if your hash function is of high quality,
you will reduce the cost by almost never calling the comparison
function *except* on exact matches (thus also improving whatever the
branch prediction performance is in your comparison function).

This is another interesting trade-off. In an open-addressed
table that holds pointers to the "payload" data (rather than the
data itself), holding on to the computed hash alongside the data
pointer will probably double the size of the table. And there's
where the interesting part comes in: For the same expenditure of
storage, you could have a pointers-only table with twice as many
entries, or half the "load factor" for the same number of stored
items. Search and insertion times rise steeply at high load
factors (in ways that depend on the algorithms), so the potential
reduction in probe count might make up for the increased number
of item comparisons per probe.
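
For concreteness, the two layouts being weighed look roughly like this
on a typical 64-bit machine (the sizes are illustrative assumptions,
not measurements):

#include <stdint.h>

/* Layout A: pointer only -- about 8 bytes per slot. */
struct slot_ptr_only {
    void *item;
};

/* Layout B: pointer plus cached hash -- about 16 bytes per slot.  For the
   same memory budget, layout A gets twice as many slots (half the load
   factor), while layout B can skip most item comparisons while probing. */
struct slot_with_hash {
    void     *item;
    uint64_t  hash;
};
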
Remember that, essentially, the whole purpose of implementing a hash
table is to use the computer's underlying hardware (in this case,
random access to memory) to turn what is, in theory, an O(log n)
problem (via balanced-tree-based implementations) into an O(1) problem.
So in a sense, you are not creating some theoretical piece of
art -- leaning on the hardware or taking the above things into account
is thematic to what you are doing.

Agreed. But "the hardware" behaves differently with different
arrangements of data, so reliable predictions are not easy. Also,
"random access" memory nowadays seldom has uniform performance: Not
only are there NUMA architectures to contend with, but even on
supposedly-uniform implementations the behavior of the caches can
influence the running time strongly. Measure, measure, measure!
 
CBFalconer

.... snip ...

There are a lot of classic mistakes made during implementation of hash
tables that can reduce their performance. 1) If you use a prime-sized
table (as some books recommend) you will pay a horrendous penalty for
the % operator on every access; always use a power of 2 sized table
(and an odd, or in fact, prime numbered skip value for the open address
hash table.) 2) If you fail to store the full hash function value
along with the entry, you will be forced to call the native comparison
function for each associated entry during a search -- by pre-screening
by the original hash value, if your hash function is of high quality,
you will reduce the cost by almost never calling the comparison
function *except* on exact matches (thus also improving whatever the
branch prediction performance is in your comparison function).

The cost of the modulo operation depends on the cost of the
division operation. Most machines today can do very fast
division. The prime table size guarantees that rehashed values
will never miss a possible slot.

When the 'misses' (i.e. failure to find an item on the first probe)
are low, there is no point whatsoever to saving the hash value. In
addition (see hashlib) that is computed only once for a search, as
is the rehash value. At most your recommendation will remove one call
to the hash function(s), since you have to compute the first value
anyway.

At any rate, the time to worry about such things is when the
operation is unsatisfactorily slow or memory hungry. The first
requirement is correctness.

These are not 'classic mistakes', they are deliberate choices.

 
websnarf

Eric said:
[...]
There are a lot of classic mistakes made during implementation of hash
tables that can reduce their performance. 1) If you use a prime-sized
table (as some books recommend) you will pay a horrendous penalty for
the % operator on every access; always use a power of 2 sized table
(and an odd, or in fact, prime numbered skip value for the open address
hash table.)

The horrendousness of the penalty for division should be
assessed in light of what else is afoot. It may be drowned
by the computation of the hash function itself, for example.

True, but in those cases, using a hash instead of a balanced tree is
probably not the real performance issue anyway. Hash tables are about
hard-core "to the metal" optimizing -- if not, then you are on
theoretically very shaky ground anyway.
Also, there's a trade-off: If the division costs you N cycles
more than an AND would but spreads the data better and saves
you an average of C comparisons, it may be worth while.

Ok, but versus a good design, this cannot be the case.
[...] Note,
too, that an AND simply discards bits from the hash value;
if you engage in a more complex "folding" operation to let
all the bits participate, the cycle count starts creeping up
toward that of the division.

Real hash functions don't require folding.
If you choose a power-of-two table size, it doesn't matter
whether the skip value is prime or not: as long as it's odd,
it'll work just fine.

This is not correct. The reason takes a little thought. By using
primes, you reduce the probability that any distinct pair of skip
values will have a low gcd(). So if you happen to have a few overly
long chains "near" each other, the probability that they interfere with
each other multiple times within a chained sequence is minimized. This
is seen most clearly by picking the numbers 3 and 9, and assuming that
two chains are at the same offset modulo 3. If the chain with the skip
of 3 is really long (15 items, say) then one third of them (5 items)
will intersect with the chain with the skip of 9. If the skip values
are distinct primes, then successive interference can only happen at
most every skip1 * skip2 items. Thus using prime skip values will tend
to, in fact, spread the distribution of entries better and consequently
reduce the maximum chain length on average.
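
A quick brute-force check of those numbers (the table size, offsets and
chain length below are made up to mirror the 3-versus-9 example):

#include <stdio.h>

/* Count how many of the first n probe positions of (start1 + i*skip1) % t
   also occur among the first n positions of (start2 + j*skip2) % t. */
static int collisions(int t, int start1, int skip1,
                      int start2, int skip2, int n)
{
    int hits = 0;
    for (int i = 0; i < n; i++) {
        int p = (start1 + i * skip1) % t;
        for (int j = 0; j < n; j++)
            if (p == (start2 + j * skip2) % t) { hits++; break; }
    }
    return hits;
}

int main(void)
{
    printf("skips 3 and 9: %d of 15 probes collide\n",
           collisions(1024, 0, 3, 0, 9, 15));    /* prints 5 */
    printf("skips 5 and 7: %d of 15 probes collide\n",
           collisions(1024, 0, 5, 0, 7, 15));    /* prints 3 */
    return 0;
}
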
[...] More generally, if you choose a table
of size T, the skip distances Si should be relatively prime to
T but need not be prime. The attraction of a prime T is that
it's easy to find Si that are relatively prime to it!

Easy but costly. To find small primes for skip values (it turns out
that you never need more than 512 or so) one is typically at worst
going to do a (fits in a tiny corner of the L1 cache) table look up.
There turns out to be a perfectly balanced way of doing this if you
have a good 32-bit (or higher) hash function -- you use the lower bits
of the hash for the initial index, and the upper bits for the index
into the prime table to determine the skip value. So you only need a
single hash function.
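
A sketch of that "one hash, two uses" idea (the prime table and the
constants here are only examples, not taken from any existing
implementation):

#include <stdint.h>

static const uint32_t skip_primes[8] = { 3, 5, 7, 11, 13, 17, 19, 23 };

#define TABLE_SIZE 4096u                   /* power of two */

struct probe {
    uint32_t index;                        /* initial slot */
    uint32_t skip;                         /* distance between probes */
};

static struct probe make_probe(uint32_t hash)
{
    struct probe p;
    p.index = hash & (TABLE_SIZE - 1);           /* low bits pick the slot */
    p.skip  = skip_primes[(hash >> 29) & 7];     /* high bits pick the skip */
    return p;
}

/* Probes then go index, index+skip, index+2*skip, ... modulo TABLE_SIZE.
   Every entry in skip_primes is odd, so each sequence still visits every
   slot of the power-of-two table before repeating. */
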
This is another interesting trade-off. In an open-addressed
table that holds pointers to the "payload" data (rather than the
data itself), holding on to the computed hash alongside the data
pointer will probably double the size of the table.

I don't know what kind of thing you want to hash where the index and
stored hash are not dominated by the actual raw data that you are
hashing in the first place.
[...] And there's
where the interesting part comes in: For the same expenditure of
storage, you could have a pointers-only table with twice as many
entries,

Huh? In a chained hash solution, you don't have chains that point at
NULL. And aliased entries will map to the same place. That means you
*can't* have twice as many entries, if you are memory limited. The
worst case scenario is if you were hashing an integer or a float or
something, but in those cases, you would want a special design and are
likely to have a specially crafted hash function (the identity
function, for example). Besides such marginal cases, your reasoning
fails kindergarten arithmetic.
[...] or half the "load factor" for the same number of stored
items. Search and insertion times rise steeply at high load
factors (in ways that depend on the algorithms), so the potential
reduction in probe count might make up for the increased number
of item comparisons per probe.

You must be smoking pot. You are positing that the cost of additional
hash value storage (keep in mind that in a hash table, there *IS* no
locality, unless you do repeated requests, in which case, there is
trivial locality) is in some way comparable to calling the element
comparison function. That's generally just going to be ridiculous.
 
websnarf

CBFalconer said:
The cost of the modulo operation depends on the cost of the
division operation. Most machines today can do very fast
division.

Excuse me?!?! Do you have even the first clue who you are talking to?
Can you explain to me the qualifications you have that justify an
outrageous statement like that? I'll admit my information is
secondhand, but at least it's backed up by academic papers, the technical
specifications, direct discussions with the CPU implementors themselves
and actual measurement. Executive summary: Division is relatively slow
(around 25 clocks), and there is precious little you can do about it
(successive generations of CPUs won't get any better.)
[...] The prime table size guarantees that rehashed values
will never miss a possible slot.

This guarantee can be provided by any number of methods. Power-of-2
tables with odd skip values (and in fact odd prime skip values) have
exactly the same property.
When the 'misses' (i.e. failure to find an item on the first probe)
are low, there is no point whatsoever to saving the hash value.

That's retarded. All fresh inserts lead to misses. As a generic hash
table implementor yourself, you ought to know that you don't get to
pick and choose what scenarios your hash table will be used in.
[...] In
addition (see hashlib) that is computed only once for a search, as
is the rehash value. At most your recommendation will remove one call
to the hash function(s), since you have to compute the first value
anyway.

After all this time you are still not getting it. It reduces the
number of *ELEMENT COMPARISONS*. You know the "int (*cmp)(void *,
void*)" function that you need to pass in to tell if two elements are
the same? *That* is the call that I am interested in minimizing calls
to. (There are precidents for this -- like every single academic paper
ever written on sorts, for example.) The technique above leverages the
fact that if two elements are the same, their hash is the same, but
comparing two hash values is much faster than any other kind of
comparison (except in trivial cases where you are hashing scalars --
but here you would use the scalar value itself as the full hash).

On a typically loaded hash table such as yours which auto-resizes, you
can expect *about* 1.4 slot checks per element scan (this depends on
the scenario, but this is a fair ballpark for common ones). If you don't
do hash prescanning, then on hits you are calling the raw comparison
function 40% more often than you need to, and on misses -- basically
you are calling it infinitely more often (on misses, with stored hash
prescanning, you almost never need to call the comparison function.)
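
That "about 1.4" ballpark matches the classic uniform-hashing estimate
for a successful search, roughly (1/a) * ln(1/(1-a)) probes at load
factor a; a quick back-of-the-envelope check (assuming that formula and
nothing scenario-specific):

#include <stdio.h>
#include <math.h>

/* Expected probes per successful search in an open-addressed table at
   load factor a, under the uniform-hashing assumption.  At a = 0.5 this
   gives about 1.39, i.e. the ~1.4 slot checks mentioned above. */
static double expected_probes(double a)
{
    return (1.0 / a) * log(1.0 / (1.0 - a));
}

int main(void)
{
    for (double a = 0.3; a < 0.95; a += 0.2)
        printf("load factor %.1f -> %.2f probes per successful search\n",
               a, expected_probes(a));
    return 0;
}
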
At any rate, the time to worry about such things is when the
operation is unsatisfactorily slow or memory hungry. The first
requirement is correctness.

If you gave up using a balanced tree (which is theoretically sound for
all sizes), and are using a hash table (whose theoretical performance
is highly conditional on the scenario), then you have already committed
to low-level optimization as your implementation strategy.
 
Spiros Bousbouras

Excuse me?!?! Do you have even the first clue who you are talking to?
Can you explain to me the qualifications you have that justify an
outrageous statement like that? I'll admit my information is
secondhand, but at least it's backed up by academic papers, the technical
specifications, direct discussions with the CPU implementors themselves
and actual measurement.

Can you provide some references?
 
CBFalconer

... snip ...

Easy but costly. To find small primes for skip values (it turns out
that you never need more than 512 or so) one is typically at worst
going to do a (fits in a tiny corner of the L1 cache) table look up.
There turns out to be a perfectly balanced way of doing this if you
have a good 32-bit (or higher) hash function -- you use the lower bits
of the hash for the initial index, and the upper bits for the index
into the prime table to determine the skip value. So you only need a
single hash function.

Fatal to the clustering problem. This means that the skip value
depends only on the original hash, and thus everything with that
particular original hash interferes. That is why an independent
rehash function is used.

 
websnarf

Spiros said:
Can you provide some references?

1. http://citeseer.ist.psu.edu/oberman95division.html
2. http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF
3. http://www.agner.org/optimize/optimizing_assembly.pdf

The instruction latency listed for an Opteron for 32 bits is 39 clocks
(apparently, it's gotten slower over time). Compare this to a mask
operation, which is a third of a clock. So we're talking about a
difference of up to 117 times faster.
 
websnarf

CBFalconer said:
Fatal to the clustering problem.

Blatantly incorrect. It's only fatal if your hash function is retarded.
I've tested this myself, and even in the worst case of degradation
(i.e., starting to use all of your address space) the design remains as
robust as possible.
[...] This means that the skip value
depends only on the original hash, and thus everything with that
particular original hash interferes.

Right, and if you knew anything about hash functions you would realize
that this is why it works, not why it is a problem.
[...] That is why an independent rehash function is used.

Seriously -- where do you get these voodoo theories of yours from? Any
hash function with reasonable "avalanche properties" or "low
funnelling", as Bob Jenkins calls it, (otherwise known as "a hash
function") gives you independent bits to begin with.
 
CBFalconer

CBFalconer said:
Fatal to the clustering problem.

Blatantly incorrect. It's only fatal if your hash function is retarded.
I've tested this myself, and even in the worst case of degradation
(i.e., starting to use all of your address space) the design remains as
robust as possible.
[...] This means that the skip value
depends only on the original hash, and thus everything with that
particular original hash interferes.

Right, and if you knew anything about hash functions you would
realize that this is why it works, not why it is a problem.
[...] That is why an independent rehash function is used.

Seriously -- where do you get these voodoo theories of yours from?
Any hash function with reasonable "avalanche properties" or "low
funnelling", as Bob Jenkins calls it, (otherwise known as "a hash
function") gives you independent bits to begin with.

Your ignorance and inexperience are showing. Stick to things you
are knowledgeable about, which apparently includes at least some
mathematics. It does not include elementary manners.

 
Richard Harter

Fatal to the clustering problem. This means that the skip value
depends only on the original hash, and thus everything with that
particular original hash interferes. That is why an independent
rehash function is used.

Not so. In effect he is saying to use a hash function value that is
wide enough to hold both the original hash and the rehash.

BTW: Isn't this off topic?
 
Richard Heathfield

(e-mail address removed) said:

Excuse me?!?! Do you have even the first clue who you are talking to?

Unless you're Dennis Ritchie, I don't see why it matters who he's talking
to. Do the relevant facts change because of who you are?

<snip>
 
websnarf

Richard said:
(e-mail address removed) said:



Unless you're Dennis Ritchie, I don't see why it matters who he's talking
to. Do the relevant facts change because of who you are?

He did not *ASK* me what the performance of % was (a reasonable thing
to do), he *TOLD* me something that is ridiculously untrue. Now I
understand that the people in this group necessarily take the position
that they have no idea what the performance of their underlying
hardware is, but he was taking the position that he knows. And that
was in response to someone who really does know -- more to the point,
everyone *KNOWS* that I know. So his position was not only untrue, it
is an assertion that he knows the fact in direct contradiction of what
I (whose raison d'etre could be considered knowing this sort of thing)
just told him.

And what makes you think Dennis Ritchie knows what the performance of %
is? I mean, if he takes the same position as most of you here in
c.l.c. (and in fact most programmers) then presumably he doesn't know
and remains actively ignorant. But perhaps you know something about
him that I don't.
 
CBFalconer

He did not *ASK* me what the performance of % was (a reasonable thing
to do), he *TOLD* me something that is ridiculously untrue. ...

You are a blithering idiot, besides an unmannered boor. You have
no idea whom you are talking to, not that it matters. After
something like 50 years of building machines, and getting maximum
performance out of them, I think I have some idea of the
fundamentals and their relative importance. You, on the other
hand, do not appear to.

(Probably spoiled and coddled from infancy by over-doting parents)

 
Julia

I checked the Wikipedia link and it helps a lot. In my case, the key
values are relatively continuous but with a few empty slots and a few
out-of-range values, for example, 3, 102, 103, 104, 106, 107, 108,
109, 110, 112, 113, 114, 115, 116, ... 120, 13000. However, for
different calls to the hash table, the starting value can be very
different, for example, 102 on one call and 20333556 on another call.
I guess an open-addressing hash table should be OK for my application.
Am I right?

Best regards,

Julia
 
