Hash table in C++? STL?

pembed2003

Hi All,
Does C++/STL have a hashtable where I can do stuff like:

Hashtable h<int>;

h.store("one",1);
h.store("two",2);

and then later retrieve them like:

cout<<h.fetch("one")<<endl;

prints 1

any idea?

Thanks!
 
John Harrison

pembed2003 said:
Hi All,
Does C++/STL have a hashtable where I can do stuff like:

Hashtable h<int>;

h.store("one",1);
h.store("two",2);

and then later retrieve them like:

cout<<h.fetch("one")<<endl;

prints 1

any idea?

Thanks!

It has a map which has the same functionality as above (but store is called
insert and fetch is called find). Map is usually based on a red-black tree
algorithm. Several library implementers also offer a genuine hash table
(called hash_map) and apparently some version of this will make it into a
future version of the standard.
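
For example, a minimal sketch of the std::map equivalent of your snippet (note the key and value types go in the template arguments):

#include <iostream>
#include <map>
#include <string>
#include <utility>

int main() {
    std::map<std::string, int> h;

    h.insert(std::make_pair(std::string("one"), 1));   // or simply h["one"] = 1;
    h["two"] = 2;

    std::map<std::string, int>::iterator it = h.find("one");
    if (it != h.end())
        std::cout << it->second << std::endl;           // prints 1
    return 0;
}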

john
 
Kevin Goodsell

John said:
It has a map which has the same functionality as above (but store is called
insert and fetch is called find). Map is usually based on a red-black tree
algorithm. Several library implementers also offer a genuine hash table
(called hash_map) and apparently some version of this will make it into a
future version of the standard.

Does anyone know what will need to be provided in order to hash on a
particular type? std::map requires a less-than comparison operation --
what operations would a hash-based map need?

-Kevin
 
Andre Kostur

Does anyone know what will need to be provided in order to hash on a
particular type? std::map requires a less-than comparison operation --
what operations would a hash-based map need?

That would depend on your implementation of hash_map. I would expect some
mechanism for generating a hash value from your key type, and probably some
sort of comparison on the hash values as well. You would most likely still
need a comparison (or equality) function for the keys themselves, since they
may need to be compared in the event of hash collisions.
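
As a rough sketch (made-up names; the exact template parameters differ between vendors), an SGI-style hash_map with a user-defined key would want something like:

#include <cstddef>
#include <string>

// A user-defined key type...
struct Person {
    std::string first;
    std::string last;
};

// ...a hash functor for it...
struct PersonHash {
    std::size_t operator()(const Person& p) const {
        std::size_t h = 0;
        for (std::string::size_type i = 0; i < p.last.size(); ++i)
            h = h * 31 + static_cast<unsigned char>(p.last[i]);
        return h;
    }
};

// ...and an equality predicate, used when hash values collide.
struct PersonEqual {
    bool operator()(const Person& a, const Person& b) const {
        return a.first == b.first && a.last == b.last;
    }
};

// With an SGI-style container this might then be used roughly as:
//   hash_map<Person, int, PersonHash, PersonEqual> ages;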
 
Claudio Puviani

Kevin Goodsell said:
Does anyone know what will need to be provided in
order to hash on a particular type? std::map requires
a less-than comparison operation -- what operations
would a hash-based map need?

A hashing function and an equality operation are the minimum requirements, but
requiring a relational comparison (such as less_than) instead of equality would
allow for certain optimizations on the part of the implementers (such as using
a tree-like structure instead of lists for the buckets). I don't know whether
the standard will require equality or a relational comparator.

Sadly, the hashing function is where all the complexity lies and I'd be
pleasantly surprised if more than 1% of programmers knew what constitutes a
good hashing function. With a bad hashing function (and a naive bucket
strategy), performance can quickly degenerate to much worse than a balanced
tree on large data sets.

Claudio Puviani
 
Kevin Goodsell

David said:
On Wed, 14 Apr 2004 20:54:05 GMT in comp.lang.c++, Kevin Goodsell asked the above.

Mainly, a hashing function that takes an argument of type Key
and returns an int.

http://std.dkuug.dk/jtc1/sc22/wg21/docs/papers/2003/n1443.html

(Having only glanced over the link so far...)

I thought maybe they'd supply a hashing function that you can use, for
example if there's an operator<<(ostream &, const Key&). The string
representation could be generated with that and hashed. Since the
proposal includes a hash function for std::string, I suppose it would be
fairly easy to create this yourself.

I don't know how well this would work in practice, though. Obviously the
printed form of the Key would need to be unique.
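
Roughly the idea (hypothetical helper names; hash_string here just stands in for whatever std::string hash the proposal supplies):

#include <cstddef>
#include <sstream>
#include <string>

std::size_t hash_string(const std::string& s) {
    std::size_t h = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i)
        h += static_cast<unsigned char>(s[i]);   // placeholder; use a real string hash
    return h;
}

// Hash any Key that has an operator<<(ostream&, const Key&) by hashing
// its printed representation.
template <typename Key>
std::size_t hash_via_stream(const Key& k) {
    std::ostringstream os;
    os << k;
    return hash_string(os.str());
}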

-Kevin
 
pembed2003

John Harrison said:
It has a map which has the same functionality as above (but store is called
insert and fetch is called find). Map is usually based on a red-black tree
algorithm. Several library implementers also offer a genuine hash table
(called hash_map) and apparently some version of this will make it into a
future version of the standard.

john

Hi John,
Can you point me to the documentation where I can find the map and
hash_map data structures? Thanks for the help!
 
Siemel Naran

A hashing function and an equality operation are the minimum requirements, but
requiring a relational comparison (such as less_than) instead of equality would
allow for certain optimizations on the part of the implementers (such as using
a tree-like structure instead of lists for the buckets). I don't know whether
the standard will require equality or a relational comparator.

Sadly, the hashing function is where all the complexity lies and I'd be
pleasantly surprised if more than 1% of programmers knew what constitutes a
good hashing function. With a bad hashing function (and a naive bucket
strategy), performance can quickly degenerate to much worse than a balanced
tree on large data sets.

I don't think I'm in that 1%. A naive guess is to add up all the letters
and modulo by the bucket size, which they say should be prime, but why is
this?

size_t hash(const char * s, int buckets) {
    size_t h = 0;
    for ( ; *s; ++s) h += *s;
    return h % buckets;
}

But we need to hash only the first maxchars characters.

To answer Kevin's question about the hash function, in general it should
look something like this:

class hash {
public:
    typedef char char_type;
    typedef size_t hash_type;
    explicit hash(size_t maxchars);
    hash_type operator()(const char_type* begin, const char_type* end,
                         size_t numbuckets) const;
private:
    size_t maxchars;
};

The hash table may increase numbuckets dynamically as it grows, so numbuckets is a
parameter to operator(). But the number of chars to hash is a constant, and
should not change as we increase or decrease the number of buckets, so it is an
argument to the constructor and gets stored in the class.

In our code above we rely on an implicit conversion from char_type to
hash_type (which seems not reasonable), and hash_type::operator+= (which
seems reasonable as hash_type is just an unsigned integer).
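
A rough, untested sketch of how the operator() declared above might be filled in (given the class declaration):

hash::hash_type hash::operator()(const char_type* begin, const char_type* end,
                                 size_t numbuckets) const {
    hash_type h = 0;
    size_t n = 0;
    // hash at most maxchars characters, as discussed above
    for (const char_type* p = begin; p != end && n < maxchars; ++p, ++n)
        h += static_cast<unsigned char>(*p);   // implicit char -> hash_type conversion
    return h % numbuckets;
}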
 
Kevin Goodsell

Siemel said:
I don't think I'm in that 1%.

I wouldn't dare claim to be either, without doing some research first.
A naive guess is to add up all the letters
and modulo by the bucket size,

I think you mean table size. Each bucket has a separate size determined
by the number of items that hash to that index.

Or, perhaps you meant "the size measured in number of buckets" and I was
just reading it wrong. I could see that.
which they say should be prime, but why is
this?

I'm not sure I ever completely understood that. The one obvious reason
doesn't apply to most implementations. In "probing" collision-resolution
schemes it makes sense, but most use "chaining" or buckets. What I'm
calling "probing" is the situation where a collision (i.e., a hash to a
location that's already taken) is handled by trying other locations. For
example, a naive approach is to try the next slot, until an empty one is
found. A better technique (I think) would be double hashing, where a
second hash function is used to determine the sequence of slots to try.
So if h1() is the primary hash function and h2() is the secondary,
suppose h1(x) = n but slot n is already used. Then you would try n + h2(x),
n + 2*h2(x), n + 3*h2(x), etc., in all cases modding by the table size.

In this case, having the table size prime ensures that you will
eventually try every table slot -- if there's an open one, you'll find it.
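
A rough sketch of such a lookup (hypothetical code; h1 and h2 stand for the primary and secondary hash functions, and an int sentinel marks empty slots):

#include <cstddef>
#include <vector>

std::size_t find_slot(const std::vector<int>& table, int key,
                      std::size_t (*h1)(int), std::size_t (*h2)(int),
                      int empty_marker) {
    const std::size_t size = table.size();      // prime, so the probe visits every slot
    std::size_t slot = h1(key) % size;
    std::size_t step = h2(key) % size;
    if (step == 0) step = 1;                    // the step must be nonzero
    for (std::size_t i = 0; i < size; ++i) {
        if (table[slot] == empty_marker || table[slot] == key)
            return slot;                        // found an empty slot or the key itself
        slot = (slot + step) % size;
    }
    return size;                                // every slot occupied by other keys
}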

But nobody seems to use schemes like this very often, and it seems
obvious why. Why would you want to take up additional slots in the
table, increasing the chances of future collisions, when you can just
chain things outside the table (in buckets)? Lookup complexity for the
element is no worse (may even be better), and the table is less cluttered.

So why the prime table size? I can't see an obvious reason. If hash
values are approximately uniformly distributed between 0 and N, I'd
think that ideal table sizes would be numbers that evenly divide N,
because any remainder would split the table into two areas with
different probabilities of any particular value hashing into them.
size_t hash(const char * s, int buckets) {
    size_t h = 0;
    for ( ; *s; ++s) h += *s;
    return h % buckets;
}

Kernighan & Pike's _The Practice of Programming_ mentioned a similar
hash function, but it included (if I recall correctly) multiplying the
total by a constant before adding in the next character value. This
constant has been experimentally determined to be effective for ASCII
text. I'm not sure if anyone really knows why it works. I don't recall
what the value was, but I'm sure someone can tell us.
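
Something along these lines, I believe; 31 is one multiplier commonly cited for text, though I can't vouch for it being the exact constant they used:

#include <cstddef>

// Multiply-then-add over the characters, then reduce by the table size.
std::size_t kp_style_hash(const char* s, std::size_t table_size) {
    std::size_t h = 0;
    for (; *s != '\0'; ++s)
        h = h * 31 + static_cast<unsigned char>(*s);
    return h % table_size;
}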
But we need to hash only the first maxchars characters.

Actually, I think K&P also talked about this, and suggested it was a bad
idea. This is from memory again, so I may be wrong, but I believe they
mentioned Java using a scheme like this originally. It turned out to not
work well, and was changed to use the entire string.

-Kevin
 
Claudio Puviani

Kevin Goodsell said:
So why the prime table size? I can't see an obvious reason.
If hash values are approximately uniformly distributed between
0 and N, I'd think that ideal table sizes would be numbers that
evenly divide N, because any remainder would split the table
into two areas with different probabilities of any particular
value hashing into them.

If the hash values are uniformly distributed (which is one characteristic of a
good hashing function), then any table size would be equivalent; i.e., there
would be no "ideal" size. The reason that prime numbers are used for table
sizes is that many hashing functions are NOT good and sometimes generate hash
values that are multiples of one or more numbers. Those values, modulo a
relatively non-prime number will cluster. Those same values modulo a relatively
prime number will uniformly span the entire set of 0..N-1. In other words,
primality is used as insurance against bad hashing functions. You can persuade
yourself of this by modeling it in a spreadsheet.

Claudio Puviani
 
Michiel Salters

Hi John,
Can you point me to the documentation where I can find the map and
hash_map data structure? Thanks for the help!

std::map is documented in quite a number of books, e.g. Accelerated C++
(Koenig&Moo) or Generic Programming and the STL(Austern)

http://www.amazon.com/exec/obidos/ASIN/020170353X/
http://www.amazon.com/exec/obidos/ASIN/0201309564/

hash_map isn't in the standard, so there are multiple different
versions around. Get the documentation from the same source as
your code. Your compiler vendor may have included something.

Regards,
Michiel Salters
 
Kevin Goodsell

Claudio said:
If the hash values are uniformly distributed (which is one characteristic of a
good hashing function), then any table size would be equivalent; i.e., there
would be no "ideal" size. The reason that prime numbers are used for table
sizes is that many hashing functions are NOT good and sometimes generate hash
values that are multiples of one or more numbers. Those values, modulo a
relatively non-prime number will cluster. Those same values modulo a relatively
prime number will uniformly span the entire set of 0..N-1. In other words,
primality is used as insurance against bad hashing functions. You can persuade
yourself of this by modeling it in a spreadsheet.

I thought it might be something like that. Although, I disagree that
there's no "ideal" size with uniformly distributed hash values. I
definitely think some sizes are better than others (though it may be a
small difference). As a trivial example, suppose the range of hash
values is 0 to 75, and the table size is 50. Hash values from 0 to 25,
and also from 50 to 75, end up in the first 26 slots, while only hash
values 26 to 49 land in the remaining 24 slots. As a result, the
probability of a key hashing to one of the first 26 slots is twice as
high as the probability of hashing to any of the remaining ones.

I doubt this makes any difference in practice, but it is what I had in
mind in my previous post.

-Kevin
 
pembed2003

std::map is documented in quite a number of books, e.g. Accelerated C++
(Koenig&Moo) or Generic Programming and the STL(Austern)

http://www.amazon.com/exec/obidos/ASIN/020170353X/
http://www.amazon.com/exec/obidos/ASIN/0201309564/

hash_map isn't in the standard, so there are multiple different
versions around. Get the documentation from the same source as
your code. Your compiler vendor may have included something.

Thanks for everyone's help. So it looks like map is standard and
should be available on any platform, but hash_map is not?

The reason I am asking for a hash table is that I need to write a
function that finds the overlap of two integer arrays. I am currently
doing:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>   /* INT16_MAX */

/* one flag per value in [-INT16_MAX, INT16_MAX), offset by INT16_MAX */
#define SIZE ((sizeof(int)) * (INT16_MAX) * (2))

void find(int* one, int c1, int* two, int c2) {
    int i;
    int* buffer = malloc(SIZE);
    memset((void*)buffer, 0, SIZE);
    for (i = 0; i < c1; i++)
        buffer[one[i] + INT16_MAX] = 1;       /* mark values from the first array */
    for (i = 0; i < c2; i++)
        if (buffer[two[i] + INT16_MAX])       /* value also appears in the second array */
            printf("%d\n", two[i]);
    free(buffer);
}

int main(int argc, char** argv) {
    int a[] = {1, 3, 45, 6, 4, -1, 8};
    int b[] = {4, 2, 4, 222, 8, 45, -1};
    find(a, 7, b, 7);
    return 0;
}

This gives me an O(n) algorithm, but it seems to waste a lot of memory,
so I am thinking of using a hashtable instead. Am I on the right track?
Thanks!
 
Siemel Naran

Claudio Puviani said:
If the hash values are uniformly distributed (which is one characteristic of a
good hashing function), then any table size would be equivalent; i.e., there
would be no "ideal" size. The reason that prime numbers are used for table
sizes is that many hashing functions are NOT good and sometimes generate hash
values that are multiples of one or more numbers. Those values, modulo a
relatively non-prime number will cluster. Those same values modulo a relatively
prime number will uniformly span the entire set of 0..N-1. In other words,
primality is used as insurance against bad hashing functions. You can persuade
yourself of this by modeling it in a spreadsheet.

This makes sense. Can you write the barebones of a C++ program that can
illustrate this? I guess you could build a spreadsheet too, but the
newsgroup only carries plain text. I can test it if you want.

We could call a function hash_function, and we could use my simple
function from the other post, maybe multiplying by the constant Kevin
suggested.
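
Something like this skeleton, perhaps (rough and untested; it uses a trivial integer hash and deliberately bad keys -- multiples of a fixed stride -- just to show the clustering):

#include <cstddef>
#include <iostream>
#include <vector>

std::size_t hash_function(int key, std::size_t buckets) {
    return static_cast<std::size_t>(key) % buckets;   // trivial "identity" hash
}

void show_distribution(std::size_t buckets, int stride, int keys) {
    std::vector<int> count(buckets, 0);
    for (int i = 0; i < keys; ++i)
        ++count[hash_function(i * stride, buckets)];
    std::cout << buckets << " buckets:";
    for (std::size_t b = 0; b < buckets; ++b)
        std::cout << ' ' << count[b];
    std::cout << '\n';
}

int main() {
    show_distribution(12, 4, 120);   // composite size: keys pile into 3 buckets
    show_distribution(13, 4, 120);   // prime size: keys spread across all 13
    return 0;
}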
 
Siemel Naran

Kernighan & Pike's _The Practice of Programming_ mentioned a similar
hash function, but it included (if I recall correctly) multiplying the
total by a constant before adding in the next character value. This
constant has been experimentally determined to be effective for ASCII
text. I'm not sure if anyone really knows why it works. I don't recall
what the value was, but I'm sure someone can tell us.

I did some research, and they said to multiply by the alphabet size (i.e., the
number of valid chars).

Actually, I think K&P also talked about this, and suggested it was a bad
idea. This is from memory again, so I may be wrong, but I believe they
mentioned Java using a scheme like this originally. It turned out to not
work well, and was changed to use the entire string.

Suppose our company coding standard is to prefix namespace and class names
with ABC_. Thus ABC_Contact, ABC_Thing. Imagine if our hash function used
just the first 3 chars!

So I think the number of chars to use depends on the scenario, but limiting it
does intuitively make sense to me. We could perhaps use the last N chars, middle
N chars, etc. A templated hashing function makes all this possible, along the
lines of the sketch below.
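
A rough sketch of what I mean (made-up names; LastNChars is just one possible selection policy):

#include <cstddef>
#include <string>

// Selection policy: which part of the string to feed to the hash.
struct LastNChars {
    explicit LastNChars(std::size_t n) : n(n) {}
    std::string operator()(const std::string& s) const {
        return s.size() > n ? s.substr(s.size() - n) : s;
    }
    std::size_t n;
};

// Hash functor templated on the selection policy.
template <typename Selector>
struct SelectiveHash {
    explicit SelectiveHash(Selector sel) : sel(sel) {}
    std::size_t operator()(const std::string& s) const {
        const std::string part = sel(s);
        std::size_t h = 0;
        for (std::string::size_type i = 0; i < part.size(); ++i)
            h = h * 31 + static_cast<unsigned char>(part[i]);
        return h;
    }
    Selector sel;
};

// usage: SelectiveHash<LastNChars> h(LastNChars(8));  // hash on the last 8 chars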
 
Siemel Naran

pembed2003 said:
#define SIZE ((sizeof(int)) * (INT16_MAX) * (2))

void find(int* one, int c1, int* two, int c2) {
    int i;
    int* buffer = malloc(SIZE);
    memset((void*)buffer, 0, SIZE);
    for (i = 0; i < c1; i++)
        buffer[one[i] + INT16_MAX] = 1;
    for (i = 0; i < c2; i++)
        if (buffer[two[i] + INT16_MAX])
            printf("%d\n", two[i]);
    free(buffer);
}


It seems you don't use the first INT16_MAX entries of buffer, or maybe I'm
reading something wrong.
int main(int argc, char** argv) {
    int a[] = {1, 3, 45, 6, 4, -1, 8};
    int b[] = {4, 2, 4, 222, 8, 45, -1};
    find(a, 7, b, 7);
    return 0;
}

This gives me an O(n) algorithm, but it seems to waste a lot of memory,
so I am thinking of using a hashtable instead. Am I on the right track?
Thanks!

If the array size of 'a' (i.e. 'one') is small then the O(N^2) algorithm is fine.
Either loop over the elements of 'one' or of 'two', then over the elements of the
other array. If you want to use the STL, you can use std::count_if and supply a
predicate that employs std::find (the predicate can have a constructor that takes
the array 'one' as an argument).

Note you can also sort the elements in 'one' and 'two', then loop over both
containers simultaneously. Running time is O(N*lg(N)) for the sort if using
std::sort, which is normally quicksort-based, and O(N) for the scan. The
standard's std::set_intersection does essentially this scan for you once both
ranges are sorted.

You could sort an array in O(N) time using radix sort (but it will use a lot of
memory, which is what you want to avoid), or some other specialized algorithm.
You could also sort only 'one' and loop over the elements of 'two', using
std::binary_search to check whether each element of 'two' exists in 'one'.

If the array size is big, what difference will a hashtable make? You're
using the hashtable to store elements visited, right? If the hashtable has few
buckets then you'll have lots of collisions (lots of elements mapping to the
same bucket) and looking up one element could take linear time (if the collided
elements are stored in a vector). If the hashtable has lots of buckets then you
won't have many collisions, but then you might as well use your original
solution.
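
For completeness, here is a rough sketch of the sort-then-scan approach with std::set_intersection (untested, using the OP's sample arrays):

#include <algorithm>
#include <iostream>
#include <iterator>

int main() {
    int a[] = {1, 3, 45, 6, 4, -1, 8};
    int b[] = {4, 2, 4, 222, 8, 45, -1};

    std::sort(a, a + 7);   // O(N lg N)
    std::sort(b, b + 7);

    // O(N) scan over both sorted ranges, printing the common elements.
    std::set_intersection(a, a + 7, b, b + 7,
                          std::ostream_iterator<int>(std::cout, "\n"));
    return 0;
}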
 
