How to implement a Hash Table in C

ravi

Hi
can anybody tell me which ds will be best suited to implement a
hash table in C/C++?

Thanks in advance.
 
Malcolm McLean

ravi said:
Hi
can anybody tell me which ds will be best suited to implement a
hash table in C/C++?

Thanks in advance.
Which ds?
You can read all about hash tables in my book, Basic Algorithms. The hash
tables chapter is free.
 
santosh

Ben Bacarisse

Malcolm McLean said:
Which ds?
You can read all about hash tables in my book, Basic Algorithms. The
hash tables chapter is free.

I think it should be pointed out that your hash table is not like
most. Most hash tables are dynamic structures that can grow as data
are added. Yours is fixed. The interface should have a clear warning
that the table fails (silently) if an attempt is made to add an item
beyond the initial capacity.

The point I made before (that the function loops indefinitely when the
table becomes full) is still true, but I now see that the add function
will silently fail long before that point is reached. If this limit
is by design, I think the text should make that limitation very clear.

It is odd (and probably very confusing to a beginner) that the add
function seems to return a success/fail indication, but that it does
not use it to signal this disastrous case.

[Aside: shouldn't a book on basic algorithms illustrate (or at least
describe) how a hash table can grow to accommodate more data?]

To the OP:
1) A proper hash table can grow as data are added.

2) The add function should return an error if the data can't be added
-- for example, some do if the key already exists but it certainly
should if the operation would overflow allocated storage.

3) Most designers would make a hash table, in C at least, that did not
duplicate the key and the data. The idea being to leave the actual
data allocation up to the table user. The table itself would just
store pointers.
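
As a concrete illustration of those three points, here is a minimal sketch of
such a table in C. Every name here (ht_create, ht_add, ht_grow) is invented
for illustration; this is not code from the book or from hashlib.

#include <stdlib.h>
#include <string.h>

typedef struct ht_entry {
    const char *key;        /* not copied: caller owns key and value */
    void *value;
    struct ht_entry *next;  /* separate chaining */
} ht_entry;

typedef struct {
    ht_entry **buckets;
    size_t nbuckets;
    size_t count;
} ht;

static size_t ht_hash(const char *s)
{
    size_t h = 5381;                  /* djb2 string hash */
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

ht *ht_create(size_t nbuckets)
{
    ht *t = malloc(sizeof *t);
    if (!t)
        return NULL;
    t->buckets = calloc(nbuckets, sizeof *t->buckets);
    if (!t->buckets) {
        free(t);
        return NULL;
    }
    t->nbuckets = nbuckets;
    t->count = 0;
    return t;
}

/* Double the bucket array and re-link every entry (point 1). */
static int ht_grow(ht *t)
{
    size_t n = t->nbuckets * 2;
    ht_entry **b = calloc(n, sizeof *b);
    if (!b)
        return -1;
    for (size_t i = 0; i < t->nbuckets; i++) {
        ht_entry *e = t->buckets[i];
        while (e) {
            ht_entry *nx = e->next;
            size_t j = ht_hash(e->key) % n;
            e->next = b[j];
            b[j] = e;
            e = nx;
        }
    }
    free(t->buckets);
    t->buckets = b;
    t->nbuckets = n;
    return 0;
}

/* Returns 0 on success, -1 on duplicate key or out of memory (point 2). */
int ht_add(ht *t, const char *key, void *value)
{
    if (t->count >= t->nbuckets && ht_grow(t) != 0)
        return -1;                    /* never fail silently */
    size_t i = ht_hash(key) % t->nbuckets;
    for (ht_entry *e = t->buckets[i]; e; e = e->next)
        if (strcmp(e->key, key) == 0)
            return -1;                /* key already present */
    ht_entry *e = malloc(sizeof *e);
    if (!e)
        return -1;
    e->key = key;                     /* store pointers only (point 3) */
    e->value = value;
    e->next = t->buckets[i];
    t->buckets[i] = e;
    t->count++;
    return 0;
}

Because only pointers are stored, the caller owns every key and value and
must keep them alive while the table does; that is the convention point 3
describes.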
 
Malcolm McLean

Ben Bacarisse said:
I think it should be pointed out that your hash table is not like
most. Most hash tables are dynamic structures that can grow as data
are added. Yours is fixed. The interface should have a clear warning
that the table fails (silently) if an attempt is made to add an item
beyond the initial capacity.

The point I made before (that the function loops indefinitely when the
table becomes full) is still true, but I now see that the add function
will silently fail long before that point is reached. If this limit
is by design, I think the text should make that limitation very clear.
I think your understanding of hash tables might need to be raised a notch.
If a hash table becomes over-full then the algorithm becomes pathological.
Just as if there are only a few empty spaces in a car park, you spend
virtually all your time crawling around looking for a slot.
So the table length is twice the capacity, passed into the constructor.
However, if the table items are large, that wastes space. So the table
itself is a list of pointers. However, calling malloc() for each item would
be inefficient unless the items are extremely large. So we use the fixed
allocator (in the chapter Memory Games, also free).
This will be exhausted once the table capacity is reached, so the algorithm
will never reach the pathological condition. I admit that this code might be
a bit hairy. It is probably best to add an explicit test against table
capacity to make clear that the table cannot be over-filled.
The initial code is a simple hash table that uses linear probing and has a
fixed capacity. The second implementation introduces quadratic probing and
dynamic tables. These do raise a lot of memory management and efficiency
issues, which are evaded by making the second table store only pointers.
It is odd (and probably very confusing to a beginner) that the add
function seems to return a success/fail indication, but that it does
not use it to signal this disastrous case.
The "disastrous case" can't happen. The fixed allocator will run out of
memory and return null when capity is reached. Please avoid this sort of
hyperbole.
[Aside: shouldn't a book on basic algorithms illustrate (or at least
describe) how a hash table can grow to accommodate more data?]
It does. That's in the second half of the chapter. I introduce the central
idea first, and then discuss some refinements and complications.
To the OP:
1) A proper hash table can grow as data are added.
Done.

2) The add function should return an error if the data can't be added
-- for example, some do if the key already exists but it certainly
should if the operation would overflow allocated storage.
Done.

3) Most designers would make a hash table, in C at least, that did not
duplicate the key and the data. The idea being to leave the actual
data allocation up to the table user. The table itself would just
store pointers.
The second table does store pointers. The best representation of keys is a
whole separate issue, which I don't go into in any detail.
The other basic point you've missed is that the code is designed to
introduce hash tables gradually. So a few details are missing from the first
examples, so that the central concept of hashing can be brought across.
If you want to go from a state of not understanding hash tables to
understanding how to code them, go to my book. If you want a drop-in hash
table to put into production C code, Chuck's table is better. There is no
point in us duplicating our efforts.
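
For readers following the argument, here is a minimal sketch of the
fixed-capacity, linear-probing scheme described above, with the explicit
capacity test included from the start. The names are invented for
illustration; this is not the code from Basic Algorithms.

#include <stdlib.h>

typedef struct {
    char **keys;       /* NULL marks an empty slot */
    void **values;
    size_t len;        /* table length: twice the capacity */
    size_t capacity;   /* maximum number of stored items */
    size_t count;
} fixed_ht;

static size_t hash_str(const char *s)
{
    size_t h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Constructor returns NULL on failure, as elsewhere in this thread. */
fixed_ht *fixed_ht_create(size_t capacity)
{
    fixed_ht *t = malloc(sizeof *t);
    if (!t)
        return NULL;
    t->len = capacity * 2;     /* table stays at most half full */
    t->capacity = capacity;
    t->count = 0;
    t->keys = calloc(t->len, sizeof *t->keys);
    t->values = calloc(t->len, sizeof *t->values);
    if (!t->keys || !t->values) {
        free(t->keys);
        free(t->values);
        free(t);
        return NULL;
    }
    return t;
}

/* Returns 0 on success, -1 when the table is at capacity.  Because
   len == 2 * capacity, the probe loop below is guaranteed to find an
   empty slot whenever the capacity test passes. */
int fixed_ht_add(fixed_ht *t, char *key, void *value)
{
    if (t->count >= t->capacity)
        return -1;                     /* the explicit capacity test */
    size_t i = hash_str(key) % t->len;
    while (t->keys[i] != NULL)         /* linear probing */
        i = (i + 1) % t->len;
    t->keys[i] = key;
    t->values[i] = value;
    t->count++;
    return 0;
}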
 
Ark Khasin

Ben said:
.... snip ...
Mr McLean appears to be a man of strong faith. As such, he is not
persuaded or dissuaded by rational argument, of which there is plenty
elsethread.
 
CBFalconer

Malcolm said:
.... snip ...

I think your understanding of hash tables might need to be raised
a notch. If a hash table becomes over-full then the algorithm
becomes pathological. Just as if there are only a few empty spaces
in a car park, you spend virtually all your time crawling around
looking for a slot.

Not necessarily so. For example, the code in hashlib automatically
expands the table, so that the system is normally between roughly
44 and 88% full. This expansion can continue to ridiculous limits.
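
The expand-and-rehash step CBFalconer describes can be sketched as below,
reusing the fixed_ht layout from the earlier sketch (and ignoring its
capacity field). The 87.5% trigger and the names are illustrative, not
hashlib's actual source; doubling when the table passes that point drops it
back to roughly 44% full, the band mentioned above.

/* Grow the table once it passes 7/8 (87.5%) full.  Every occupied
   slot must be re-inserted: probe positions depend on the length. */
int maybe_expand(fixed_ht *t)
{
    if (t->count * 8 <= t->len * 7)
        return 0;                      /* still below the threshold */

    size_t newlen = t->len * 2;
    char **nk = calloc(newlen, sizeof *nk);
    void **nv = calloc(newlen, sizeof *nv);
    if (!nk || !nv) {
        free(nk);
        free(nv);
        return -1;                     /* report failure to the caller */
    }
    for (size_t i = 0; i < t->len; i++) {
        if (t->keys[i] == NULL)
            continue;
        size_t j = hash_str(t->keys[i]) % newlen;
        while (nk[j] != NULL)          /* linear probing in the new array */
            j = (j + 1) % newlen;
        nk[j] = t->keys[i];
        nv[j] = t->values[i];
    }
    free(t->keys);
    free(t->values);
    t->keys = nk;
    t->values = nv;
    t->len = newlen;
    return 0;
}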
 
Keith Thompson

ravi said:
can anybody tell me which ds will be best suited to implement a
hash table in C/C++?

Until I read some of the responses, I had no idea what you meant by
"ds" (apparently it's an abbreviation for "data structure").

Please don't use abbreviations like that unless you're sure everyone
is going to understand what they mean. It doesn't take very long to
type out the words "data structure".
 
Ben Bacarisse

Malcolm McLean said:
I think your understanding of hash tables might need to be raised a
notch.

If you are going to be like that...

The "disastrous case" can't happen. The fixed allocator will run out
of memory and return null when capity is reached. Please avoid this
sort of hyperbole.

OK. Less hyperbole. Your code exhibits undefined behaviour. You
'dereference' a null pointer. On my system I get a segmentation
violation before hash_add gets any say in the matter. If or when you
find the error, you will say it is a detail, a typo, whatever. I
don't think you worry about details like this.

Do you care, for example, that your statement:

The test for primality is interesting. The only absolutely sure way
of determining whether a number is prime, as far as is known, is to
attempt to divide by all prime numbers below or equal to its square
root.

is not true? Probably not. I will try, in future, to leave you alone.
 
Malcolm McLean

Ben Bacarisse said:
If you are going to be like that...



OK. Less hyperbole. Your code exhibits undefined behaviour. You
'dereference' a null pointer. On my system I get a segmentation
violation before hash_add gets any say in the matter. If or when you
find the error, you will say it is a detail, a typo, whatever. I
don't think you worry about details like this.
Obviously if you say things like "the function will not return at all when
the table is full" you cannot understand that a hash table (to be pedantic,
one without "perfect hashing") cannot be more than about 88% full before the
algorithm becomes pathological. In fact I pass in a "capacity" argument
which is half the table size.
I think you need to read the chapter again, as a pupil rather than as a
critic.

As for the dereferencing of a null pointer, that will happen if the
constructor fails and returns NULL. That is standard behaviour throughout
the book. If objects cannot be created, the constructing function returns a
null pointer to show they have failed.
I couldn't see another error. The code has been tested, but only on two
systems (a UNIX mainframe and Windows), so that doesn't mean no errors
remain, and of course typographical errors do creep in in the process of
reformatting for print - I wish I had a tool to format source code
automatically but I don't.
The normal thing when you find an error such as a dereference of a null is
to say "your code dereferences a null" rather than to talk airily about
"catastrophic behaviour" in an arrogant manner.
Do you care, for example, that your statement:

The test for primality is interesting. The only absolutely sure way
of determining whether a number is prime, as far as is known, is to
attempt to divide by all prime numbers below or equal to its square
root.

is not true? Probably not. I will try, in future, to leave you alone.
You mean randomised primality testing? It's not absolutely sure.
I am no mathematician, maybe you could tell us your method?
 
Richard Tobin

Malcolm McLean said:
You mean randomised primality testing? It's not absolutely sure.
I am no mathematician, maybe you could tell us your method?

Google for "primality testing". There are deterministic tests faster
than testing all the primes < sqrt(n), and the probabilistic tests can
be used to reduce the probability as much as you like (for example, to
less than the chance of your computer being struck by lightning).

-- Richard
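
To make that concrete: for the 32-bit sizes a hash table length could ever
need, the Miller-Rabin test with the fixed witness set {2, 7, 61} is fully
deterministic (it is a published result that no composite below
4,759,123,141 passes all three witnesses, which covers every 32-bit value)
and far faster than dividing by every prime up to sqrt(n). A sketch:

#include <stdint.h>

static uint32_t mulmod(uint32_t a, uint32_t b, uint32_t m)
{
    return (uint32_t)(((uint64_t)a * b) % m);  /* cannot overflow 64 bits */
}

static uint32_t powmod(uint32_t base, uint32_t exp, uint32_t m)
{
    uint32_t r = 1;
    base %= m;
    while (exp) {                     /* square and multiply */
        if (exp & 1)
            r = mulmod(r, base, m);
        base = mulmod(base, base, m);
        exp >>= 1;
    }
    return r;
}

/* Deterministic for all uint32_t n with witnesses 2, 7 and 61. */
int is_prime(uint32_t n)
{
    static const uint32_t witness[] = { 2, 7, 61 };
    if (n < 2 || n % 2 == 0)
        return n == 2;
    uint32_t d = n - 1;
    int s = 0;
    while (d % 2 == 0) {              /* n - 1 = d * 2^s with d odd */
        d /= 2;
        s++;
    }
    for (int i = 0; i < 3; i++) {
        uint32_t a = witness[i] % n;
        if (a == 0)
            continue;                 /* witness is a multiple of n */
        uint32_t x = powmod(a, d, n);
        if (x == 1 || x == n - 1)
            continue;
        int composite = 1;
        for (int r = 1; r < s; r++) {
            x = mulmod(x, x, n);
            if (x == n - 1) {
                composite = 0;
                break;
            }
        }
        if (composite)
            return 0;
    }
    return 1;
}

For the table-sizing use in this thread, one would simply scan upward from
the desired length until is_prime() succeeds.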
 
James Dow Allen

Hash tables are frequently discussed in this and "nearby"
newsgroups, but it is clear that most programmers are unaware of
innovations like Cuckoo Hash or "Compact Hash", etc.
These aren't just theoretical curiosities but have important
advantages over hash tables described in old textbooks.
you cannot understand that a hash table (to be pedantic,
one without "perfect hashing") cannot be more than
about 88% full before the algorithm becomes pathological.

Cuckoo hash, just to give one example, can operate
well above 88%. ("Pathology" is a matter of semantics
since many methods can encounter O(n) behavior even
at low utilization, with non-zero probability,
due to very bad "luck".) Cuckoo hash, BTW, has
another very important space-savings advantage over
ordinary open-address hashing. I'll leave the
identification of this advantage as an exercise.... :)

And popular "chained hashing" methods simply do
not suffer from the pathology you imply.
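
For readers who have not met it: in cuckoo hashing every key has exactly two
candidate slots, one per hash function, so lookup probes at most two places.
An insert that finds both slots occupied evicts one resident and re-inserts
it into its own alternate slot, and so on, until a free slot turns up or a
bound is hit, at which point the whole table is rebuilt with fresh hash
functions. A minimal sketch with invented names (duplicate-key checking
omitted):

#include <stddef.h>

#define CUCKOO_SLOTS 1024   /* per half; illustrative size */
#define MAX_KICKS    32     /* give up and rebuild after this many evictions */

typedef struct {
    const char *keys[2][CUCKOO_SLOTS];  /* two arrays, one per hash */
    void *vals[2][CUCKOO_SLOTS];
} cuckoo;

/* Two independent string hashes; any reasonable pair will do. */
static size_t h1(const char *s)
{
    size_t h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % CUCKOO_SLOTS;
}

static size_t h2(const char *s)
{
    size_t h = 0;
    while (*s)
        h = h * 131 + (unsigned char)*s++;
    return h % CUCKOO_SLOTS;
}

/* Returns 0 on success, -1 when an eviction cycle is suspected and
   the caller must rebuild the table with new hash functions. */
int cuckoo_add(cuckoo *t, const char *key, void *val)
{
    int side = 0;
    for (int kicks = 0; kicks < MAX_KICKS; kicks++) {
        size_t i = (side == 0) ? h1(key) : h2(key);
        if (t->keys[side][i] == NULL) {     /* free slot: done */
            t->keys[side][i] = key;
            t->vals[side][i] = val;
            return 0;
        }
        /* Evict the resident, take its slot, and carry the evicted
           pair over to its alternate side on the next pass. */
        const char *ek = t->keys[side][i];
        void *ev = t->vals[side][i];
        t->keys[side][i] = key;
        t->vals[side][i] = val;
        key = ek;
        val = ev;
        side = 1 - side;
    }
    return -1;
}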
The test for primality is interesting.

Very interesting, though I know little about it.

The largest *certainly* known non-Mersenne prime is
19249*2^13018586 + 1
a number with almost 4,000,000 digits.
I've no idea how its primality was proved,
but if they...

.... they would have had to perform on the order of 10^1959490
long divisions (very long!). Assuming they'd
recruited a googol monkeys, and each monkey
could do a googol divisions per year, they'd
need on the order of 10^1959290 years to prove primality your way!

I'm no primality expert, but it took me only a
few seconds at Wikipedia to discover:
The first deterministic primality test significantly
faster than the naïve methods was the cyclotomy test;
its runtime can be proven to be
O((log n)^(c log log log n)),
where n is the number to test for primality and
c is a constant independent of n.

Wikipedia (and Google) is your friend! The bad rep
Wikipedia has in recent media reports does not apply,
I think, to most science pages.

I didn't invent Cuckoo Hash (wish I had), learned
of it after Googling in a dialog with another computer
scientist (who hadn't heard of Cuckoo either, even
though he wrote papers on Hash Tables!)

I don't think I'm the only one whose posting
quality would go up if time wasted on Usenet
invective were spent instead on research.

James Dow Allen
 
Malcolm McLean

Hash tables are frequently discussed in this and "nearby"
newsgroups, but it is clear that most programmers are unaware of
innovations like Cuckoo Hash or "Compact Hash", etc.
These aren't just theoretical curiosities but have important
advantages over hash tables described in old textbooks.
Cuckoo hash, just to give one example, can operate
well above 88%. ("Pathology" is a matter of semantics
since many methods can encounter O(n) behavior even
at low utilization, with non-zero probability,
due to very bad "luck".) Cuckoo hash, BTW, has
another very important space-savings advantage over
ordinary open-address hashing. I'll leave the
identification of this advantage as an exercise.... :)

And popular "chained hashing" methods simply do
not suffer from the pathology you imply.
The book is Basic Algorithms. I am aware that there are more sophisticated
methods of hashing, though I am not an expert in them.
However someone who doesn't understand that a simple hash table breaks down
when it becomes too full obviously needs their understanding even of the
basic algorithm raised a notch, as I said.
Very interesting, though I know little about it.
There are certain advantages in making the hash table length prime. However
it is a chapter on hash tables, not prime numbers. So I present a naive test
but mention that there are better ones.
In fact I was wrong about the naive test being the only deterministic one.
Sorry about that. There's no chapter on number theory, in case you are
wondering.
 
websnarf

The hash table is itself a data structure. It works in conjunction with
a hash function.

It sounds like he is asking what *other* data structures could be used
to implement it. Reallocatable arrays, pointers and structs seem
like the best choices to me. :) Yeah, I know, that's basically most
of what C has, but it seems they are all necessary to solve the
problem anyway. He also asked about C++, in which case you might
like to use a vector instead of a reallocatable array, but that's
apparently off topic here anyway.

The real magic of a hash table, of course, is the internal mapping
structure, the hash function and the operating methods that you decide
to allow for the hash table. But he didn't ask about that.
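
Since the hash function is where much of that magic lives, here is one
widely used string hash for concreteness: 32-bit FNV-1a, whose two constants
are the published FNV offset basis and prime.

#include <stdint.h>

/* 32-bit FNV-1a: simple, fast, and well distributed for short keys. */
uint32_t fnv1a(const char *s)
{
    uint32_t h = 2166136261u;            /* FNV offset basis */
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;                  /* FNV prime */
    }
    return h;
}

The result is then reduced modulo the table length (or masked, for a
power-of-two table) to pick a bucket.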
 
Eric Sosman

Malcolm said:
Obviously if you say things like "the function will not return at all
when the table is full" you cannot understand that a hash table (to be
pedantic, one without "perfect hashing") cannot be more than about 88%
full before the algorithm becomes pathological.

Nonsense. Utter uninformed nonsense. "Hash table" covers
a wide variety of data structures and associated algorithms,
with different characteristics. There's an excellent chapter
in TAOCP, and from the above quote there's no evidence that
you've understood any of it.
I think you need to read the chapter again, as a pupil rather than as a
critic.

I have subjected myself to about one and a half chapters
of your awful book already, and will not willingly read more.
The normal thing when you find an error such as
a dereference of a null is to say "your code dereferences a null" rather
than to talk airily about "catastrophic behaviour" in an arrogant manner.

Is dereferencing a null pointer provably non-catastrophic?
Is it even "probably" non-catastrophic? What guarantees apply
to undefined behavior that might limit its damage? What part
of "undefined" are you having trouble with? What part of "un-"
are you failing to grasp?
I am no mathematician, [...]

I'd guessed as much.
 
Ben Bacarisse

Malcolm McLean said:
As for the dereferencing of a null pointer, that will happen if the
constructor fails and returns NULL. That is standard behaviour
throughout the book. If objects cannot be created, the constructing
function returns a null pointer to show they have failed.
I couldn't see another error. The code has been tested, but only on two
systems (a UNIX mainframe and Windows), so that doesn't mean no errors
remain, and of course typographical errors do creep in in the process
of reformatting for print - I wish I had a tool to format source code
automatically but I don't. The normal thing when you find an error such
as a dereference of a null is to say "your code dereferences a null"
rather than to talk airily about "catastrophic behaviour" in an
arrogant manner.

I am sorry you do not like my tone. I have certainly got worked up
about this subject. Your fixed allocator has an error. The code
cannot return a null pointer without causing undefined behaviour.
Calm enough? You decide how seriously you categorise it.

I think you should fix that and the other clear errors[1] that have been
pointed out before you call other people arrogant. You say in the
introduction that bugs can be "corrected very quickly", but you don't
seem to have corrected anything.

One reason I have got so worked up about "details" is that your post
was in c.l.c. The things that I object to most strongly in the book
have nothing to do with C and so a lot of my objections get magnified
through those that *are* topical here. I will try to leave the
subject alone. I think we both know how the other feels about this
subject.
You mean randomised primality testing? It's not absolutely sure.

No. I mean simply that it is not a true statement. There are other
non-probabilistic methods for testing for primality.
I am no mathematician, maybe you could tell us your method?

None is "my" method. They are all easy to find via a quick web
search. The two most famous are the recent Agrawal, Kayal and Saxena
algorithm (that finally proved that primality testing was in P as had
been suspected) and the much older (and currently the fastest) method
based on the work of Adleman, Pomerance and Rumely, improved by Cohen
and Lenstra (the latter is sometimes called the APR-CL test).

[1] By "clear error" I mean things that are not a matter of style or
pedagogy. My list is:

(1) redundant (i.e. provably true) if (ptr > str) in strtrim

(2) array[y*10=x];

(3) if (8str == '}') in checkbalanced

(4) incorrect quotes on some char constants in the above function
(only visible in the book version)

(5) int expy;; in floating add (not an error, just an obvious typo).

(6) if(x & 0x7FFFFFFFF < y & 0x7FFFFFFF) in same (two errors here).

(7) if(tail == 100] in add (for a queue)

(8) no test for queue being full or queue being empty

(9) if(ptr == room->baddies) room->baddies = ptr; in removebaddy

(10) missing ";" after assert( x > 0.0) in squareroot

(11) null pointer dereference in fixedallocate

(12) The statement above about prime testing.
 
Mark McIntyre

Is dereferencing a null pointer provably non-catastrophic?

I guess the point is, when you're commenting on something it's
generally considered more productive to use non-pejorative language if
you want the writer to change. Obviously if your only
intention is to highlight the error and complain about the author, the
other method works fine...
--
Mark McIntyre

"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it."
--Brian Kernighan
 
Malcolm McLean

Ben Bacarisse said:
I am sorry you do not like my tone. I have certainly got worked up
about this subject. Your fixed allocator has an error. The code
cannot return a null pointer without causing undefined behaviour.
Calm enough? You decide how seriously you categorise it.
Yes, that is another bug.
fixedallocate should have an if(answer) before setting top to answer->next.
Why not say that instead of all that vague talk about catastrophic behaviour
when the hash table gets full?
(The other bug complaint about the code getting stuck in a loop when the
table gets full is false. Which I don't mind, we can all make mistakes, but
at least admit that you are wrong. Otherwise people wonder if you really
understand hash tables at all.)
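
The fix described amounts to testing the free-list head before following it.
Here is a sketch of a free-list "fixed" allocator of the kind under
discussion, with that test in place; the names echo the thread but the code
is illustrative, not the book's.

#include <stddef.h>

/* A pre-allocated pool carved into equal-size blocks, handed out and
   taken back in O(1).  While a block is free it stores the list link;
   while allocated, the caller's data occupies it. */
typedef union block {
    union block *next;
} block;

typedef struct {
    block *top;    /* head of the free list; NULL when exhausted */
} fixed_pool;

/* Carve a caller-supplied buffer into nblocks blocks of blocksize
   bytes each; blocksize must be at least sizeof(block) and the buffer
   suitably aligned. */
void fixed_init(fixed_pool *p, void *buf, size_t blocksize, size_t nblocks)
{
    char *c = buf;
    p->top = NULL;
    for (size_t i = 0; i < nblocks; i++) {
        block *b = (block *)(c + i * blocksize);
        b->next = p->top;
        p->top = b;
    }
}

void *fixedallocate(fixed_pool *p)
{
    block *answer = p->top;
    if (answer)                /* the missing test: never follow null */
        p->top = answer->next;
    return answer;             /* NULL signals the pool is exhausted */
}

void fixedfree(fixed_pool *p, void *mem)
{
    block *b = mem;
    b->next = p->top;
    p->top = b;
}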
 
Flash Gordon

Malcolm McLean wrote, On 12/08/07 19:22:
Yes, that is another bug.
fixedallocate should have an if(answer) before setting top to answer->next.

So fix it and all the other reported bugs. You should also add a change
record to the book so that people know what you have fixed.
Why not say that instead of all that vague talk about catastrophic
behaviour when the hash table gets full?

Perhaps he thought you should actually test the code in your book when
you are charging for it? Proper testing would have included filling the
hash table.
(The other bug complaint about the code getting stuck in a loop when the
table gets full is false. Which I don't mind, we can all make mistakes,
but at least admit that you are wrong. Otherwise people wonder if you
really understand hash tables at all.)

Actually, I would say it shows that your implementation makes it harder
for people to see how the algorithm works. A decent teaching
implementation of a hash would, IMHO, include the check rather than
relying on how another non-standard (although provided) function behaves.
 
