Hash table: linear probe vs. chaining

Ian Pilcher

Background:

I'm working on a Java program, and I need to associate some additional
data with a large number of objects. I do not have the freedom to
modify or wrap the existing object classes. Additionally, I don't want
to interfere with the garbage collection of the objects.

I've come to the conclusion that I need a "WeakIdentityHashMap," a data
structure which combines the features of an IdentityHashMap and a
WeakHashMap. After a couple of attempts to build such a thing on top of
one of the existing classes, I don't think that enough of the internals
are exposed, and I'm going to have to create my own hash table.


The question:

Sun's documentation states that they use a linear-probe hash table for
the IdentityHashMap; the GNU Classpath does the same thing. As far as I
can tell, the other hash table-based collections use chaining.

It seems to me that a linear-probe hash table is a bad idea if mappings
are going to be removed from the table very often. When searching for a
given key, the search must continue until the key itself is found or a
slot which has NEVER been occupied is found. (Finding a slot which is
CURRENTLY empty is not enough, since it may have been occupied when the
key being sought was inserted into the map.)
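
To make that concrete, here is roughly what a lookup has to do if
removal merely marks slots with a "deleted" tombstone (a sketch only;
all of the names are made up, and the table is assumed never full):

    // keys[i] == null      -> slot was NEVER occupied: search can stop
    // keys[i] == TOMBSTONE -> slot is CURRENTLY empty, but was occupied
    //                         once, so the search has to keep going
    static final Object TOMBSTONE = new Object();

    static int findSlot(Object[] keys, Object key) {
        int i = (System.identityHashCode(key) & 0x7fffffff) % keys.length;
        while (keys[i] != null) {          // stop only at never-occupied
            if (keys[i] == key)
                return i;                  // found the key itself
            i = (i + 1) % keys.length;     // probe past tombstones too
        }
        return -1;                         // key is not in the table
    }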

Anyone see a flaw in my reasoning?

Thanks!
 
Ben Pfaff

Ian Pilcher said:
It seems to me that a linear-probe hash table is a bad idea if mappings
are going to be removed from the table very often. When searching for a
given key, the search must continue until the key itself is found or a
slot which has NEVER been occupied is found. (Finding a slot which is
CURRENTLY empty is not enough, since it may have been occupied when the
key being sought was inserted into the map.)

Anyone see a flaw in my reasoning?

You are overlooking the algorithm for removing an item from a
linear-probing hash table. In Knuth this is Algorithm 6.4R
(Deletion with linear probing).
 
MSCHAEF.COM

Ian Pilcher said:
It seems to me that a linear-probe hash table is a bad idea if mappings
are going to be removed from the table very often.

As Ben Pfaff pointed out, there are ways to solve this.

One advantage of linear probing is that it can have better performance on
modern hardware. Linear scans through memory are significantly cheaper
than linked list traversals, even if the linked list nodes are adjacent in
memory. The linked list really has two disadvantages:

* It requires both a pointer access and a comparison for each node
checked. This means that more data has to be loaded from memory to
check a list node.

* In linear probing, successive probes are guaranteed to be adjacent to
each other in the address space; linked list nodes carry no such
guarantee. This makes it more likely that the memory subsystem can
predict the access pattern and prefetch data before you need it.

I've done some simple tests on my Centrino laptop: A linked list scan of
adjacent nodes was half the performance [1] of a linear array scan.
Randomly ordering the list elements in memory could drop performance to
(not by) 4-5% of the array scan.
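
The comparison is easy to sketch in Java, if anyone wants to try it
(this is not the original benchmark, which was native code; take it
as illustrative only):

    // Summing ints from an array vs. chasing them through list nodes.
    static final class Node {
        int value;
        Node next;
        Node(int value, Node next) { this.value = value; this.next = next; }
    }

    static long sumArray(int[] a) {
        long sum = 0;
        for (int v : a) sum += v;          // sequential, prefetch-friendly
        return sum;
    }

    static long sumList(Node head) {
        long sum = 0;
        for (Node n = head; n != null; n = n.next)
            sum += n.value;                // one pointer chase per element
        return sum;
    }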

-Mike

[1] This was a simple microbenchmark involving summing lists of 32-bit
integers. Pointers were also 32 bits, so a linked list having half the
performance of an array is not too surprising.
 
Ian Pilcher

Ben said:
You are overlooking the algorithm for removing an item from a
linear-probing hash table. In Knuth this is Algorithm 6.4R
(Deletion with linear probing).

Only because I don't know about it. (I'm not a professional programmer;
what knowledge I have about hash tables comes from Google and the GNU
Classpath source code.)

Can you point me to a description of this algorithm on the web?

Trying to think it through myself, the goal would presumably be to
eliminate "holes" caused by removing a mapping that has previously
caused an entry to "bounce" to a subsequent slot. How might one go
about this...

When an entry is removed, search "forward" in the table for an entry
that could occupy the newly freed slot. ("Forward" being the direction
in which entries "bounce".) In this case, the search can end at any
empty slot.

If an eligible entry is found, move it to the newly freed slot and
recursively remove it from its original slot.
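
In code, I imagine it would look something like this (just a sketch of
the idea; keys[], values[] and home() are made up):

    // After emptying slot i, scan forward.  The entry at slot j may
    // stay only if its home slot h lies cyclically in (i, j]; if not,
    // the new hole at i breaks its probe chain, so move it to the hole.
    void removeAt(int i) {
        keys[i] = null;
        values[i] = null;
        int j = i;
        while (true) {
            j = (j + 1) % keys.length;
            if (keys[j] == null)
                return;                    // any empty slot ends the scan
            int h = home(keys[j]);         // slot this key hashes to
            boolean stays = (i < j) ? (i < h && h <= j)
                                    : (i < h || h <= j);
            if (!stays) {
                keys[i] = keys[j];
                values[i] = values[j];
                keys[j] = null;
                values[j] = null;
                i = j;                     // now fill the hole at j instead
            }
        }
    }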

How does that sound?
 
Ian Pilcher

MSCHAEF.COM said:
One advantage of linear probing is that it can have better performance on
modern hardware. Linear scans through memory are significantly cheaper
than linked list traversals, even if the linked list nodes are adjacent in
memory. The linked list really has two disadvantages:

* It requires both a pointer access and a comparison for each node
checked. This means that more data has to be loaded from memory to
check a list node.

* In linear probing, successive probes are guaranteed to be adjacent to
each other in the address space; linked list nodes carry no such
guarantee. This makes it more likely that the memory subsystem can
predict the access pattern and prefetch data before you need it.

I think the fact that this is Java may reduce the impact of these
points, particularly since my keys are going to be Java WeakReference
objects.

In the linear probing case, each key in the table will actually be a
reference to a WeakReference (or a subclass thereof), which can be used
to obtain a reference to the actual key. In the chaining case, I can
use a subclass of WeakReference as my list node, so the number of
objects will be the same, as will the number of references that need to
be traversed per comparison.
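
For example (a sketch; the names are mine, not from any existing
class):

    import java.lang.ref.ReferenceQueue;
    import java.lang.ref.WeakReference;

    // The chain node IS the weak reference to the key, so chaining
    // costs no extra objects over the linear-probe layout.
    final class WeakEntry extends WeakReference<Object> {
        final int hash;        // cached identity hash; survives key expiry
        Object value;          // strong reference to the value
        WeakEntry next;        // next node in this bucket's chain

        WeakEntry(Object key, Object value, WeakEntry next,
                  ReferenceQueue<Object> queue) {
            super(key, queue);
            this.hash = System.identityHashCode(key);
            this.value = value;
            this.next = next;
        }
    }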

If Java would allow me to create an actual array of objects, I think
your logic would hold, but since objects are *always* accessed through
references in Java, I think that the chained table should perform just
as well.

What do you think?

Thanks for the feedback, BTW!
 
Thomas Hawtin

Ian said:
I think the fact that this is Java may reduce the impact of these
points, particularly since my keys are going to be Java WeakReference
objects.

As your key? Conventionally, entries subclass WeakReference to point to
your key, with a conventional strong reference to the value. Although I
guess a custom implementation could collapse the key into the entry.

I don't understand why Sun have not added a WeakIdentityHashMap.
WeakHashMap only works if you have complete control.

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4500542

Ian said:
In the linear probing case, each key in the table will actually be a
reference to a WeakReference (or a subclass thereof), which can be used
to obtain a reference to the actual key. In the chaining case, I can
use a subclass of WeakReference as my list node, so the number of
objects will be the same, as will the number of references that need to
be traversed per comparison.

The current custom weak identity hash table in ThreadLocal has a
WeakReference for an entry, but does linear probing. I can't quite work
out why. Neither that implementation nor my (leak-free) cyclic-list
reimplementation is nice when it comes to expired WeakReferences.

Ian said:
If Java would allow me to create an actual array of objects, I think
your logic would hold, but since objects are *always* accessed through
references in Java, I think that the chained table should perform just
as well.

Creating (resettable) arrays of WeakReferences with good performance
would, I think, be challenging.

It's perhaps surprising that HashMap is not implemented like
IdentityHashMap. Even an implementation with chains using offsets into
the array would appear to be better.
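
What I mean is, roughly, something like the following sketch (the
names are mine, not from any JDK class):

    // Chaining via indices into parallel arrays, instead of separate
    // node objects.  head[bucket] holds the first entry's index and
    // next[i] the following entry's index; -1 means end of chain.
    final class OffsetChained {
        private final int[] head;
        private final int[] next;
        private final Object[] keys;
        private final Object[] values;

        OffsetChained(int capacity) {
            head = new int[capacity];
            next = new int[capacity];
            keys = new Object[capacity];
            values = new Object[capacity];
            java.util.Arrays.fill(head, -1);
        }

        Object get(Object key) {
            int bucket = (key.hashCode() & 0x7fffffff) % head.length;
            for (int i = head[bucket]; i != -1; i = next[i])
                if (keys[i].equals(key))
                    return values[i];      // whole chain lives in the arrays
            return null;
        }
    }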

Tom Hawtin
 
Ian Pilcher

Thomas said:
As your key? Conventionally, entries subclass WeakReference to point to
your key, with a conventional strong reference to the value. Although I
guess a custom implementation could collapse the key into the entry.

If I follow the linear probe path, I will likely just use an Object[]
twice as large as the capacity of the hash table. The "keys," at even
indices, would be references to WeakReference objects (or a simple sub-
class); the "values," at odd indices, would be references to the value
objects.
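
A lookup over that layout would go something like this (sketch only;
"table" is the Object[] just described, and the table is assumed never
completely full):

    // Keys (WeakReference objects) live at even indices, values at the
    // following odd indices, so the probe stride is 2.
    Object get(Object key) {
        int slots = table.length >>> 1;
        int i = ((System.identityHashCode(key) & 0x7fffffff) % slots) << 1;
        while (true) {
            WeakReference<?> ref = (WeakReference<?>) table[i];
            if (ref == null)
                return null;               // never-occupied slot: no mapping
            if (ref.get() == key)
                return table[i + 1];       // identity match; value is next
            i = (i + 2) % table.length;    // linear probe to the next pair
        }
    }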

Thomas said:
I don't understand why Sun have not added a WeakIdentityHashMap.
WeakHashMap only works if you have complete control.

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4500542

LMAO. The best part is that their own evaluation, from back in 2001 or
so, agrees with your assessment.

Thomas said:
Creating (resettable) arrays of WeakReferences with good performance
would, I think, be challenging.

Not sure I follow, but I don't think it matters. My point was that the
"locality" argument for a linear probe table only holds water when the
keys are actually *in* the array. When the array only holds references
to the keys, examining all the keys that correspond to a particular hash
code might actually be marginally faster in a chained hash table.
 
Arash Partow

Hi Ian,

The fact of the matter is that memory is cheap these days, so USE it!
Implement chaining as your resolution to hash collisions, but DO NOT
use linked lists; instead use dynamic arrays (in the case of Java, use
a Vector as your bucket). Assess the data or types of data you
will be hashing with various types of hash functions to find a
balance between data, hash function and initial bucket sizes.
A good hash function would have an average chain length of 1-2,
with an upper one-standard-deviation chain length of about 5 (i.e. less
than 0.1% of the buckets should be of that chain length).

DO NOT use linear probing - as it is only a mental exercise for
computational theorists and is in the overwhelming majority
of cases totally and utterly useless.

You also mentioned weak identity? Could this mean you are only
after existence information and not the data that is represented
by the key? If so, a Bloom filter is a much faster and less
processor- and memory-intensive way of achieving such functionality.
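
A minimal sketch of the idea (illustrative only - remember that a
Bloom filter can return false positives, and answers nothing but
membership):

    import java.util.BitSet;

    final class BloomFilter {
        private final BitSet bits;
        private final int size, hashes;

        BloomFilter(int size, int hashes) {
            this.bits = new BitSet(size);
            this.size = size;
            this.hashes = hashes;
        }

        private int index(Object key, int i) {
            int h = key.hashCode() + i * 0x9e3779b9;  // cheap double hashing
            return (h & 0x7fffffff) % size;
        }

        void add(Object key) {
            for (int i = 0; i < hashes; i++)
                bits.set(index(key, i));
        }

        boolean mightContain(Object key) {
            for (int i = 0; i < hashes; i++)
                if (!bits.get(index(key, i)))
                    return false;                     // definitely absent
            return true;                              // possibly present
        }
    }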

The following URL contains information regarding these topics:
http://www.partow.net/programming/hashfunctions/index.html

Arash Partow
________________________________________________________
Be one who knows what they don't know,
Instead of being one who knows not what they don't know,
Thinking they know everything about all things.
http://www.partow.net
 
Ben Pfaff

Ian Pilcher said:
Only because I don't know about it. (I'm not a professional programmer;
what knowledge I have about hash tables comes from Google and the GNU
Classpath source code.)

Can you point me to a description of this algorithm on the web?

I couldn't easily find a description of it on the web. There is
an implementation of it in a hash table of mine (in C), which can
be viewed here:
http://cvs.savannah.gnu.org/viewcvs/pspp/pspp/src/hash.c?rev=1.17&view=markup
 
Thomas Hawtin

Arash said:
The fact of the matter is that memory is cheap these days, so USE it!

Memory is cheap, but it is not free. Cache is relatively expensive. The
crunch point these days seems to be memory bandwidth. Look at the
Niagara: each core has four hardware threads to smooth over memory
latency (Niagara II is planned to have eight, apparently). The chip has
four DDR2 channels. I assume the designers did not do that just because
it was fun and inexpensive.

If you are writing a class for wide and frequent use, you should be
thinking about making it fast and compact.

Arash said:
Implement chaining as your resolution to hash collisions, but DO NOT
use linked lists; instead use dynamic arrays (in the case of Java, use
a Vector as your bucket). Assess the data or types of data you
will be hashing with various types of hash functions to find a
balance between data, hash function and initial bucket sizes.
A good hash function would have an average chain length of 1-2,
with an upper one-standard-deviation chain length of about 5 (i.e. less
than 0.1% of the buckets should be of that chain length).

Short chains are good. With a short chain length, there is little
advantage to an array over a linked list. A Vector would require even
more overhead (plus synchronisation; did you mean ArrayList?). When it
comes to rehashing, a linked list requires no extra allocations;
Vectors would take even more memory.

Fortunately the mean bucket length should be less than one entry (with
a 0.75 load factor threshold; the actual load should be between 0.375
and 0.75, with resizing doubling the table length).

Arash said:
DO NOT use linear probing - as it is only a mental exercise for
computational theorists and is in the overwhelming majority
of cases totally and utterly useless.

Write once. Improve performance everywhere.

Arash said:
You also mentioned weak identity?

Weak references to keys, with only object identity of importance, yes.

Arash said:
Could this mean you are only
after existence information and not the data that is represented
by the key?

No. We are talking maps. For instance, I mentioned a thread-local
implementation that has one-per-thread maps from thread-local object to
value. If the thread-local object disappears, then its entries should be
removed.

Arash said:
If so, a Bloom filter is a much faster and less
processor- and memory-intensive way of achieving such functionality.

But a Bloom filter isn't much cop where objects can disappear without
notice, or where an exact mapping is required.

Tom Hawtin
 
Russell Shaw

Arash said:
Hi Ian,

The fact of the matter is that memory is cheap these days, so USE it!
Implement chaining as your resolution to hash collisions, but DO NOT
use linked lists; instead use dynamic arrays

Why? Realloc'ing an array to make a longer one often means a new malloc
followed by a behind-the-scenes memmove(), which is awfully expensive
compared to extending a linked list.
 
Rob Thorpe

Russell said:
Why? Realloc'ing an array to make a longer one often means a new malloc
followed by a behind-the-scenes memmove(), which is awfully expensive
compared to extending a linked list.

It is, but it should not happen often. For example, let's say that the
implementation initially allocates a 5-element array. It may at some
point in the future expand it in the inefficient way you mention, but
only if 5 keys have hit the same slot. If this has happened, then the
hash table is probably too small for the data and should be rehashed;
in this case many methods become slow, especially linear probing.

The best argument against using variable-length arrays is that the
initial length in many languages is probably more like 255 elements;
most of these will be wasted and would be much better used making the
hash table bigger.
 
Rob Thorpe

Arash said:
Hi Ian,

The fact of the matter is that memory is cheap these days, so USE it!
Implement chaining as your resolution to hash collisions, but DO NOT
use linked lists; instead use dynamic arrays (in the case of Java, use
a Vector as your bucket). Assess the data or types of data you
will be hashing with various types of hash functions to find a
balance between data, hash function and initial bucket sizes.
A good hash function would have an average chain length of 1-2,
with an upper one-standard-deviation chain length of about 5 (i.e. less
than 0.1% of the buckets should be of that chain length).

Sounds reasonable.

Arash said:
DO NOT use linear probing - as it is only a mental exercise for
computational theorists and is in the overwhelming majority
of cases totally and utterly useless.

I'm not sure about that.

The point about memory is that if you use Y bytes outside the hash
table in arrays you have to know that it is more worthwhile than using
those bytes to make a bigger hash table.

In the example above, let's say that a 5-element array is attached to
each bucket. This may well work very well, but does it work better
than a non-chained hash table that is 5 times the size? (Obviously you
could improve this comparison by only allocating an array when it's
needed.)

I expect the answer depends on the workload.
 
Chuck F.

Ben said:
hashlib doesn't implement the algorithm in question.

True. However, it does implement deletions in a linear probing
table, which is not especially common.
 
Charles Richmond

Chuck F. said:
He can also look at the coding for my hashlib package, which allows
deletions. See:

<http://cbfalconer.home.att.net/download/hashlib.zip>

Welcome back, CBF!!!

I have been kicking around the idea of creating a hash table myself. I
*need* to learn more about hashing, so I plan to study Knuth on the
subject. Of course, I intend to study *your* hashlib also, CBF.

I do *not* need deletion, but I plan to use a re-hash function in case
of a collision.
 