hash table - simple implementation


W

wkevin

Hi,
I tried to look for as much as simple implementation of a hash table in "c".

The performance and quality of the hash function do not interest me for this
task.

I found quite a lot of such example. When I reviewed the code, I always found
that something seems to be more complex than needed, and some logic which seems
that can be simplified, some redundant code, etc.

So I tried writing my own.

I tried it on some test cases and it was ok.
I would appreciate if you could review it correctness; I mean, are there
inputs where it will not work correctly?


Here is the code:

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

struct entry {
char *key;
char *val;
struct entry *next;
};


typedef struct entry entry_t;

struct hashtable {
struct entry **table;
int size;
};

typedef struct hashtable hashtable_t;

int hashfn(char *key, hashtable_t *htable)
{
int sum=0, i;
for (i=0; i<strlen(key);i++)
sum+=key;
return sum%htable->size;
}

void insert(hashtable_t *htable, char *key, char *val)
{
int hash;
hash = hashfn(key,htable);
entry_t *entry = htable->table[hash];
entry_t *last;
entry_t *new_entry;
while (entry != NULL) {
// replace
if (strcmp(entry->key,key) == 0) {
entry->val = val;
return;
}
last = entry;
entry = entry->next;
}

new_entry = malloc(sizeof(entry_t));
if (!new_entry) {
printf("could not allocate memory for new_entry\n");
return;
}

new_entry->key = key;
new_entry->val = val;
new_entry->next = NULL;
if (htable->table[hash] == NULL)
htable->table[hash] = new_entry;
else
{
// first
if (entry == htable->table[hash])
htable->table[hash] = new_entry;
else
last->next = new_entry;
}
}

char *get(hashtable_t *htable, char *key)
{
int hash = hashfn(key, htable);
entry_t *entry = htable->table[hash];
while (entry != NULL) {
if (strcmp(entry->key,key) == 0)
return entry->val;
entry = entry->next;
}
return NULL;
}


hashtable_t *ht_create(int size)
{
int i;
hashtable_t *htable = malloc(sizeof(hashtable_t));
if (!htable)
return NULL;
htable->table = malloc(size * sizeof(entry_t *));
if (!htable->table)
return NULL;
for (i=0; i<size; i++)
htable->table = NULL;
htable->size = size;
return htable;
}

int main()
{
hashtable_t *htable = ht_create(10);
if (!htable)
return;
insert(htable,"aba","111");
insert(htable,"bbb","222");
insert(htable,"aab","333");
printf("get(aba)=%s\n", get(htable,"aba"));
printf("get(bbb)=%s\n", get(htable,"bbb"));
printf("get(aab)=%s\n", get(htable,"aab"));
}



rgs,
Kevin
 
Ad

Advertisements

P

Phil Carmody

wkevin said:
int hashfn(char *key, hashtable_t *htable)
{
int sum=0, i;
for (i=0; i<strlen(key);i++)

I know you said that performance didn't interest you, but I'm always
hypersensitive to unnecessary use of strlen. There's no point iterating
down the string twice. You don't need the length, you know when to stop
already - when you reach the \0.

for (i=0; key; i++)
sum+=key;
return sum%htable->size;
}


Phil
 
B

Ben Bacarisse

wkevin said:
Hi,
I tried to look for as much as simple implementation of a hash table in "c".

The performance and quality of the hash function do not interest me for this
task.

I found quite a lot of such example. When I reviewed the code, I
always found that something seems to be more complex than needed, and
some logic which seems that can be simplified, some redundant code,
etc.

I think you have some redundant code too.
So I tried writing my own.

I tried it on some test cases and it was ok.
I would appreciate if you could review it correctness; I mean, are there
inputs where it will not work correctly?

You don't give a specification so there is nothing to verify the code
against. There certainly are parameter values that can be supplied that
make the code go wrong. For example, a zero or negative size for
ht_create, or a null pointer used as a table pointer or as a key. But
the most significant problem is that there are also some key strings
that could cause problems on some systems (mine for example!).

You didn't ask for any code improvement but I've made a couple of
suggestions none the less.
Here is the code:

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

struct entry {
char *key;
char *val;
struct entry *next;
};


typedef struct entry entry_t;

struct hashtable {
struct entry **table;
int size;
};

typedef struct hashtable hashtable_t;

int hashfn(char *key, hashtable_t *htable)

I'd use const where appropriate:

int hashfn(const char *key, const hashtable_t *htable)
{
int sum=0, i;
for (i=0; i<strlen(key);i++)
sum+=key;
return sum%htable->size;


The problem here is that char can be signed so the resulting index may
be negative. All you need do is declare sum as unsigned.
}

void insert(hashtable_t *htable, char *key, char *val)

Again, I'd use const here (three times).

Also, whenever I write a function that does not return a value, I feel
I've probably gone wrong. Is there not some value that could be useful
to the caller? Here I think you could usefully return the old string
(if there is one). If the strings are allocated, your code will
introduce a memory leak because the old pointer gets lost forever.
{
int hash;
hash = hashfn(key,htable);
entry_t *entry = htable->table[hash];
entry_t *last;
entry_t *new_entry;
while (entry != NULL) {
// replace
if (strcmp(entry->key,key) == 0) {
entry->val = val;
return;
}
last = entry;
entry = entry->next;
}

new_entry = malloc(sizeof(entry_t));

I like to write

new_entry = malloc(sizeof *new_entry);

so that I don't have to check the type -- I know the size is right just
be looking here.
if (!new_entry) {
printf("could not allocate memory for new_entry\n");

Errors should be output on stderr. Also, I'd be tempted to exit here.
If you are not going to signal the error to the caller, there's probably
very little point in carrying on.
return;
}

new_entry->key = key;
new_entry->val = val;
new_entry->next = NULL;
if (htable->table[hash] == NULL)
htable->table[hash] = new_entry;
else
{
// first
if (entry == htable->table[hash])
htable->table[hash] = new_entry;
else
last->next = new_entry;

This second "if" is redundant -- only the "else" branch can ever be
taken.

Also, you may be over-complicating the code since you don't have any a
priori reason to suppose that adding the new entry to the end of the
chain is any better that putting it at the front.
}
}

char *get(hashtable_t *htable, char *key)
{
int hash = hashfn(key, htable);
entry_t *entry = htable->table[hash];
while (entry != NULL) {
if (strcmp(entry->key,key) == 0)
return entry->val;
entry = entry->next;
}
return NULL;
}


hashtable_t *ht_create(int size)

I'd copy this prefix and use it everywhere: ht_get, ht_insert etc.

You might want to review your use of signed and unsigned types. If you
want to put the onus of correctness on the caller, you could declare
this parameter to be an unsigned integer type (and I'd then opt for
size_t). It won't prevent problems, of course -- passing -1 will still
be wrong, but it alerts the user of the code to beware. If you leave it
as a signed type, then you should probably check for < 0. In both cases
(signed and unsigned) I'd check for 0.
{
int i;
hashtable_t *htable = malloc(sizeof(hashtable_t));
if (!htable)
return NULL;
htable->table = malloc(size * sizeof(entry_t *));

You can use 'htable->table = malloc(size * sizeof *htable->table);' to,
again, avoid making the reader check the type. (Ditto three lines
above.)
if (!htable->table)
return NULL;

Here you have a memory leak, though maybe not a very significant
one. Still, I'd want to free(htable) before returning NULL.
for (i=0; i<size; i++)
htable->table = NULL;
htable->size = size;
return htable;
}

int main()


int main(void) is preferred in C.
{
hashtable_t *htable = ht_create(10);
if (!htable)
return;

You need to return and int! If your compiler did not tell you about
this, turn up the warning level.
insert(htable,"aba","111");
insert(htable,"bbb","222");
insert(htable,"aab","333");
printf("get(aba)=%s\n", get(htable,"aba"));
printf("get(bbb)=%s\n", get(htable,"bbb"));
printf("get(aab)=%s\n", get(htable,"aab"));

Here, you don't need to return anything if you are using C99 or C11 --
the newer standards permit "falling off the end of main", but you don't
use any other features of C99, so I'd add a return here too.
 
I

Ike Naar

wkevin said:
int hashfn(char *key, hashtable_t *htable)
{
int sum=0, i;
for (i=0; i<strlen(key);i++)

I know you said that performance didn't interest you, but I'm always
hypersensitive to unnecessary use of strlen. There's no point iterating
down the string twice. You don't need the length, you know when to stop
already - when you reach the \0.

for (i=0; key; i++)
sum+=key;
return sum%htable->size;
}


It's even possible that the number of iterations down the string in

is larger than two. The compiler is allowed to compute strlen(key)
for every iteration of the loop, which would make the runtime
quadratic.
 
E

Eric Sosman

Hi,
I tried to look for as much as simple implementation of a hash table in "c".

The performance and quality of the hash function do not interest me for this
task.
Clearly.

I found quite a lot of such example. When I reviewed the code, I always found
that something seems to be more complex than needed, and some logic which seems
that can be simplified, some redundant code, etc.

So I tried writing my own.

I tried it on some test cases and it was ok.
I would appreciate if you could review it correctness; I mean, are there
inputs where it will not work correctly?

Yes, or at least "Yes, possibly."
Here is the code:

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

struct entry {
char *key;
char *val;
struct entry *next;
};


typedef struct entry entry_t;

struct hashtable {
struct entry **table;
int size;
};

typedef struct hashtable hashtable_t;

int hashfn(char *key, hashtable_t *htable)
{
int sum=0, i;
for (i=0; i<strlen(key);i++)
sum+=key;
return sum%htable->size;


If the key is long enough that the sum of its character codes
exceeds the capacity of an `int' (which might be only 32767), you
get undefined behavior. One of the more likely outcomes is that
`sum' will become negative, in which case the returned value will
be negative or zero. Since you use the returned value as an index
into the array of list headers, a negative value would send you
completely off the rails.

Even though you say you're not interested in the quality of
the hash function, this one is so appallingly bad that I feel duty-
bound to opine that the word "quality" doesn't even apply.
}

void insert(hashtable_t *htable, char *key, char *val)
{
int hash;
hash = hashfn(key,htable);
entry_t *entry = htable->table[hash];
entry_t *last;
entry_t *new_entry;
while (entry != NULL) {
// replace
if (strcmp(entry->key,key) == 0) {
entry->val = val;
return;
}
last = entry;
entry = entry->next;
}

new_entry = malloc(sizeof(entry_t));
if (!new_entry) {
printf("could not allocate memory for new_entry\n");
return;
}

new_entry->key = key;
new_entry->val = val;
new_entry->next = NULL;
if (htable->table[hash] == NULL)
htable->table[hash] = new_entry;
else
{
// first
if (entry == htable->table[hash])
htable->table[hash] = new_entry;
else
last->next = new_entry;
}

This should work, but it seems to be an example of the "more
complex than needed" code you set out to eliminate. Why do you
need three cases (including one that never executes!) to add one
item to a linked list?
}

char *get(hashtable_t *htable, char *key)
{
int hash = hashfn(key, htable);
entry_t *entry = htable->table[hash];
while (entry != NULL) {
if (strcmp(entry->key,key) == 0)
return entry->val;
entry = entry->next;
}
return NULL;
}


hashtable_t *ht_create(int size)
{
int i;
hashtable_t *htable = malloc(sizeof(hashtable_t));
if (!htable)
return NULL;
htable->table = malloc(size * sizeof(entry_t *));
if (!htable->table)
return NULL;

Memory leak here.
for (i=0; i<size; i++)
htable->table = NULL;
htable->size = size;
return htable;
}

int main()
{
hashtable_t *htable = ht_create(10);
if (!htable)
return;


Compiler should have squawked here.
insert(htable,"aba","111");
insert(htable,"bbb","222");
insert(htable,"aab","333");
printf("get(aba)=%s\n", get(htable,"aba"));

Good thing it didn't return NULL, eh?
printf("get(bbb)=%s\n", get(htable,"bbb"));
printf("get(aab)=%s\n", get(htable,"aab"));
}

Overall impressions: It's missing features one expects in a
hash table (deletion, iteration, resizing), and it's utterly
devoid of defensive coding. Still, it'll probably "work" if you
don't push it very hard.
 
W

Willem

Ike Naar wrote:
)>> for (i=0; i<strlen(key);i++)
)
) is larger than two. The compiler is allowed to compute strlen(key)
) for every iteration of the loop, which would make the runtime
) quadratic.

ITYM: required.

The compiler is allowed to compute it only once, if it can prove that it is
invariant (i.e. that nothing changes the string during the loop).


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 
Ad

Advertisements

K

Kaz Kylheku

Hi,
I tried to look for as much as simple implementation of a hash table in "c".

The performance and quality of the hash function do not interest me for this
task.

If performance and quality do not interest you, then use an associative
list which is exhaustively searched for items.

Hash tables are used when performance matters.
 
I

Ike Naar

Ike Naar wrote:
)>> for (i=0; i<strlen(key);i++)
)
) is larger than two. The compiler is allowed to compute strlen(key)
) for every iteration of the loop, which would make the runtime
) quadratic.

ITYM: required.

Given the context, no.
The compiler is allowed to compute it only once, if it can prove that it is
invariant (i.e. that nothing changes the string during the loop).

(Nit: if it can prove that nothing changes the *length of the
string* during the loop; one can change a string in ways that leave
the length invariant).

The loop shown in the context of this discussion does not modify
the string, so, in this context, repeated computation of strlen is
not required, but it is allowed.
 
K

Keith Thompson

Phil Carmody said:
wkevin said:
int hashfn(char *key, hashtable_t *htable)
{
int sum=0, i;
for (i=0; i<strlen(key);i++)

I know you said that performance didn't interest you, but I'm always
hypersensitive to unnecessary use of strlen. There's no point iterating
down the string twice. You don't need the length, you know when to stop
already - when you reach the \0.

for (i=0; key; i++)
sum+=key;
return sum%htable->size;
}


And as Ike Naar points out, it doesn't just iterate over the string
twice; it iterates over the string once for each iteration of the loop.
(strlen() has to traverse the string to find the terminating '\0'.)

Checking for the '\0' yourself is a good approach, though I'd write it a
bit more explicitly:

for (i = 0; key != '\0'; i ++) ...

Another option is to call strlen(), but only once:

int sum = 0;
size_t len = strlen(key);
for (size_t i = 0; i < len; i ++) {
sum += key;
}

This does traverse the string twice, but it's not nearly as bad as
recomputing strlen() in the for loop header.
 
E

Eric Sosman

[...]
int hashfn(char *key, hashtable_t *htable)
{
int sum=0, i;
for (i=0; i<strlen(key);i++)
sum+=key;
return sum%htable->size;


If the key is long enough that the sum of its character codes
exceeds the capacity of an `int' [...]


As Ben Bacarisse points out, even a short string will
get you in trouble on some systems. For a concrete example,

insert(htable, "¢", "cent sign");

has key string of only one character, but may quite possibly
produce a negative hash index. If you're using the gcc compiler
suite, build your code with the `-fsigned-char' option, run this
insertion, sit back and watch the fun ...
 
B

BartC

int hashfn(char *key, hashtable_t *htable)
{
int sum=0, i;
for (i=0; i<strlen(key);i++)
sum+=key;


Someone suggested there could be problems when the key is too long. But I
think there will be problems if the key is short too!

It depends what sort of keys you're using, but if they are names and
identifiers (perhaps no more than a dozen characters), then the sum of those
character codes will be too small compared with the hashtable size.

In fact the the sum might only range from 65 ("A") to 1464 ("zzzzzzzzzzzz").
While all 140,000 3-letter identifiers will share hash values from 195 to
366.

But the nature of hashtables means this will still work; the entries will
just be clustered at the bottom end of the table, and there will be many
clashes.

int main()
{
hashtable_t *htable = ht_create(10);
if (!htable)
return;
insert(htable,"aba","111");
insert(htable,"bbb","222");
insert(htable,"aab","333");
printf("get(aba)=%s\n", get(htable,"aba"));
printf("get(bbb)=%s\n", get(htable,"bbb"));
printf("get(aab)=%s\n", get(htable,"aab"));
}

If this is the only test done, then it's not enough. You need to print the
contents of the entire table (empty entries included), to make sure the
entries are spread out.

But first you need to find a better hash function. Unless you aren't
bothered, in which case just have a hash function that returns zero!
 
Ad

Advertisements

B

BartC

I found quite a lot of such example. When I reviewed the code, I always
found
that something seems to be more complex than needed, and some logic which
seems
that can be simplified, some redundant code, etc.
struct entry {
char *key;
char *val;
struct entry *next;
};

I hadn't looked closely at the rest of the code for my first post. To me,
*your* code seems more complex than needed!

For example, I've never used a 'next' field for a hash table (which will
create a cross between a hash table and a linked list!). And what does the
'next' field point to? I wasn't able to figure that out.

(FWIW, my hash tables use a hash function to get a starting point in the
table. Then entries are examined sequentially, until a match is found, or an
empty entry (which means 'not found' when inserting, or 'insert here' when
adding to the table). If the entries are spread evenly and the table is not
too full, you might only need to look at one or two entries.)
hashtable_t *ht_create(int size)
{
int i;
hashtable_t *htable = malloc(sizeof(hashtable_t));
if (!htable)
return NULL;

This is another complication I wouldn't bother with. The hashtable_t struct
is, what, 8 bytes? And the pointer you're returning here is maybe 4 bytes.
Why bother allocating something that small? ht_create() could just return a
hashtable_t struct directly (and saves an extra level of indirection in the
code, and saves having to free the thing later!)
 
M

Malcolm McLean

But first you need to find a better hash function. Unless you aren't
bothered, in which case just have a hash function that returns zero!
For the purposes of testing and debugging, a bad hash function is actually
preferable, because you want lots of clashes to make sure that the system
can cope.
But for real use, ypu need a hash that distributes the data evenly.
 
W

Willem

BartC wrote:
) )> struct entry {
)> char *key;
)> char *val;
)> struct entry *next;
)> };
)
) I hadn't looked closely at the rest of the code for my first post. To me,
) *your* code seems more complex than needed!
)
) For example, I've never used a 'next' field for a hash table (which will
) create a cross between a hash table and a linked list!). And what does the
) 'next' field point to? I wasn't able to figure that out.

Each hash bucket is a linked list. That's a very common type of hash table.
In some respects, it is the easiest type.

) (FWIW, my hash tables use a hash function to get a starting point in the
) table. Then entries are examined sequentially, until a match is found, or an
) empty entry (which means 'not found' when inserting, or 'insert here' when
) adding to the table). If the entries are spread evenly and the table is not
) too full, you might only need to look at one or two entries.)

That is the other common type.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 
B

BartC

Malcolm McLean said:
For the purposes of testing and debugging, a bad hash function is actually
preferable, because you want lots of clashes to make sure that the system
can cope.

I just use a small table size for that. While a bad hash can be emulated by
zeroing out a few extra bits from the hash value.
 
B

BartC

Willem said:
BartC wrote:
) )> struct entry {
)> char *key;
)> char *val;
)> struct entry *next;
)> };
)
) I hadn't looked closely at the rest of the code for my first post. To
me,
) *your* code seems more complex than needed!
)
) For example, I've never used a 'next' field for a hash table (which will
) create a cross between a hash table and a linked list!). And what does
the
) 'next' field point to? I wasn't able to figure that out.

Each hash bucket is a linked list. That's a very common type of hash
table.
In some respects, it is the easiest type.

OK, but I still think it's more complex than needed, despite your saying it
is easier.
 
Ad

Advertisements

E

Eric Sosman

Willem said:
[...]
Each hash bucket is a linked list. That's a very common type of hash
table.
In some respects, it is the easiest type.

OK, but I still think it's more complex than needed, despite your saying
it is easier.

The O.P.'s implementation was more complex than needed,
certainly. But for RAM-resident hash tables I think linked
lists are about the simplest implementation one can imagine.
(It's the first collision resolution technique described by
Knuth, who calls it "perhaps the most obvious" way and further
describes it as "a straightforward combination" of elementary
techniques.)

Linked lists are certainly simpler than open probing with
double hashing!
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top