SGI hash_map


Christian Meier

Hello,

Yes, I know that hash maps aren't in the standard. But I couldn't find any
better newsgroup for this post. (or is there an SGI library newsgroup?)

I am currently testing the hash_map implementation of SGI. And now I am not sure whether what I discovered is really true...
Here is my code:


#include <stdint.h> // for uint64_t
#include <iostream>
using namespace std;

#include <ext/hash_map>
using namespace __gnu_cxx;


const int C_BUCKETS    = 700;
const int C_INSERTIONS = 800;

struct hashFunc {
    size_t operator() (const uint64_t& ujHash) const {
        return 1; /* return static_cast<size_t>(ujHash); */
    }
};

typedef hash_map<uint64_t, uint64_t, hashFunc> MyHashMap;


int main()
{
    MyHashMap myHashMap(C_BUCKETS);

    cout << "bucket_count: " << myHashMap.bucket_count() << endl;
    for (uint64_t uj = 0; uj < C_INSERTIONS; ++uj) {
        myHashMap.insert(make_pair(uj, uj));
    } // for
    cout << "bucket_count: " << myHashMap.bucket_count() << endl;

} // main()


As you can see, my hash function always returns 1. So all the values go into
the same bucket. When I run this program, I get the following output:
bucket_count: 769
bucket_count: 1543

My question is why??? After inserting the 769th element, the number of buckets is doubled. I could understand this behaviour if each element went into its own bucket and all buckets were used. But I use only one bucket because of my hash function. The hash map never has fewer buckets than elements... When I set C_INSERTIONS to (1543 + 1), then bucket_count() returns 3079...
Now, what am I doing wrong?
Or is this really the meaning of the implementation of the SGI hash_map? If
so, why is this done like that?

Thanks for your answers!

Chris
 

Ivan Vecerina

Christian Meier said:
Yes, I know that hash maps aren't in the standard. But I couldn't find any
better newsgroup for this post. (or is there an SGI library newsgroup?)

I am currently testing the hash_map implementation of SGI.
IIRC, hash_map and hash_set (etc.) are expected to enter the next C++ standard library as unordered_map and unordered_set.

[...]
As you can see, my hash function always returns 1. So all the values go into the same bucket. When I run this program, I get the following output:
bucket_count: 769
bucket_count: 1543

My question is why???
Your hash function is obviously malformed, and the container does not
expect it. With a uniform hash function, increasing the number of buckets
will uniformly decrease the number of items per bucket.
Even if all items fall into the same bucket for a given bucket count,
the algorithm can legitimately expect that increasing the bucket count
will lead to a somewhat more uniform distribution of elements.
After inserting the 769th element, the number of buckets is doubled. I could understand this behaviour if each element went into its own bucket and all buckets were used. But I use only one bucket because of my hash function. The hash map never has fewer buckets than elements... When I set C_INSERTIONS to (1543 + 1), then bucket_count() returns 3079...
Now, what am I doing wrong?
A good hash function is essential for these containers to work correctly.
It is essential that the function returns relatively uniformly distributed values. A minimal way to achieve this is to multiply the input value by some large prime number; better is to use one of the many well-studied hash functions you'll find on the web.
For an intro, see for example:
http://www.concentric.net/~Ttwang/tech/inthash.htm
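
For illustration, a minimal functor along those lines might look like the sketch below (a drop-in replacement for the hashFunc in the program above; the multiplier is just a commonly used large odd constant and the xor-shift is extra mixing, so treat it as an example rather than a recommendation for your particular key distribution):

struct mixingHashFunc {
    size_t operator() (const uint64_t& ujHash) const {
        // Multiply by a large odd constant, then fold the high bits back
        // into the low bits so that nearby keys land in different buckets.
        uint64_t h = ujHash * 0x9E3779B97F4A7C15ULL;
        h ^= h >> 32;
        return static_cast<size_t>(h);
    }
};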


I hope this helps,
Ivan
 

Christian Meier

Ivan Vecerina said:
Christian Meier said:
Yes, I know that hash maps aren't in the standard. But I couldn't find any
better newsgroup for this post. (or is there an SGI library newsgroup?)

I am currently testing the hash_map implementation of SGI.
IIRC, hash_map and hash_set (etc.) are expected to enter the next C++ standard library as unordered_map and unordered_set.

[...]
As you can see, my hash function always returns 1. So all the values go into the same bucket. When I run this program, I get the following output:
bucket_count: 769
bucket_count: 1543

My question is why???
Your hash function is obviously malformed, and the container does not
expect it. With a uniform hash function, increasing the number of buckets
will uniformly decrease the number of items per bucket.
Even if all items fall into the same bucket for a given bucket count,
the algorithm can legitimately expect that increasing the bucket count
will lead to a somewhat more uniform distribution of elements.
After inserting the 769th element, the number of buckets is doubled. I could understand this behaviour if each element went into its own bucket and all buckets were used. But I use only one bucket because of my hash function. The hash map never has fewer buckets than elements... When I set C_INSERTIONS to (1543 + 1), then bucket_count() returns 3079...
Now, what am I doing wrong?
A good hash function is essential for these containers to work correctly.

Yes, that's the reason why I didn't delete my original hash function source code:
size_t operator() (const uint64_t& ujHash) const { return 1; /* return static_cast<size_t>(ujHash); */ }

Returning 1 was just for testing purposes.... to be sure that all elements
go into the same bucket.
It is essential that the function returns relatively uniformly distributed values. A minimal way to achieve this is to multiply the input value by some large prime number; better is to use one of the many well-studied hash functions you'll find on the web.
For an intro, see for example:
http://www.concentric.net/~Ttwang/tech/inthash.htm


I hope this helps,
Ivan

Because there is no hash<uint64_t> function, I wrote my own. As the hash<int> value for an int with the value 5435438 is 5435438 and for 123456 it is 123456, I just return the uint64_t value:
return static_cast<size_t>(ujHash);

My values do not have to be multiplied by a prime number because they are distinct values that differ only slightly (not values like 1000000, 2000000 and 3000000).
And before insertion into a bucket, each hash value is reduced with:
hash_val %= bucket_count();
And the number of buckets is always a prime number in the SGI implementation.

In the meantime I looked up the source code of the SGI library. There is the function insert_unique, which is called by the hash map's insert():

pair<iterator, bool> insert_unique(const value_type& __obj)
{
    resize(_M_num_elements + 1);
    return insert_unique_noresize(__obj);
}

This means: each time an element is inserted into the hash map, it is checked for resizing depending on _M_num_elements, which is the number of ALL elements in the map. If I have all elements in the same bucket, the map will still be resized once the element count reaches the number of buckets, although they are all in the same bucket...
I don't know why this is written like this. This implementation is written for hash codes which are unique. Well, this is no problem for numeric data types of smaller size than std::size_t. But this implementation of the hash map would be quite ugly if I wanted to insert large strings, for example...
Well, I could answer my question myself. But I do not really understand why the SGI people want to have as many buckets as elements in every case...
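
To convince myself, I put that policy into a small standalone program (my own sketch, not the library code; the prime list is just the beginning of the one I see in the hashtable header, and the names are mine):

#include <cstddef>
#include <iostream>
using namespace std;

// Sketch of the resize policy as I read it: grow whenever the total number
// of elements would exceed the bucket count, jumping to the next prime in a
// fixed list, no matter how the elements are distributed over the buckets.
const size_t primes[] = { 53, 97, 193, 389, 769, 1543, 3079, 6151, 12289 };
const size_t num_primes = sizeof(primes) / sizeof(primes[0]);

size_t next_prime(size_t n)
{
    for (size_t i = 0; i < num_primes; ++i)
        if (primes[i] >= n) return primes[i];
    return primes[num_primes - 1];
}

int main()
{
    size_t buckets = next_prime(700);                // hash_map(700) starts with 769
    for (size_t elements = 1; elements <= 1544; ++elements) {
        if (elements > buckets) {                    // the check done on each insert
            size_t grown = next_prime(elements);
            cout << elements << " elements: " << buckets
                 << " -> " << grown << " buckets" << endl;
            buckets = grown;
        }
    }
}

This prints "770 elements: 769 -> 1543 buckets" and "1544 elements: 1543 -> 3079 buckets", which matches the bucket counts I see.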

But thanks for your help anyway!

Greets Chris
 

Javier Noval

It's my understanding that the hash_map is working correctly. For rehashing (and thus increasing the number of buckets) it only takes into account the overall usage of the table, not the usage of each individual bucket. As soon as that usage rises over a certain point, the table is rehashed, regardless of whether it is a degenerate case (all elements in the same bucket) or not.

-- Javier
 

Ivan Vecerina

Christian Meier said:
[...]
Because there is no hash<uint64_t> function, I wrote my own. As the hash<int> value for an int with the value 5435438 is 5435438 and for 123456 it is 123456, I just return the uint64_t value:
return static_cast<size_t>(ujHash);

My values do not have to be multiplied by a prime number because they are distinct values that differ only slightly (not values like 1000000, 2000000 and 3000000).
And before insertion into a bucket, each hash value is reduced with:
hash_val %= bucket_count();
And the number of buckets is always a prime number in the SGI implementation.
Depending on how the values are distributed, you may or may not have
a uniform distribution. If you care to check, you could probably write
a program to count the number of buckets that contain multiple items.
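
Something along these lines could do it (a sketch only; it assumes the elems_in_bucket() member from the SGI interface, which I believe the __gnu_cxx version keeps, and it just uses consecutive keys where your real data would go):

#include <stdint.h>
#include <iostream>
using namespace std;

#include <ext/hash_map>
using namespace __gnu_cxx;

// Count how many buckets are empty, hold exactly one item, or hold several
// items, using the identity-style hash from your post.
struct identityHashFunc {
    size_t operator() (const uint64_t& ujHash) const {
        return static_cast<size_t>(ujHash);
    }
};

typedef hash_map<uint64_t, uint64_t, identityHashFunc> MyHashMap;

int main()
{
    MyHashMap myHashMap(700);
    for (uint64_t uj = 0; uj < 800; ++uj)
        myHashMap.insert(make_pair(uj, uj));     // placeholder keys

    size_t empty = 0, single = 0, multiple = 0;
    for (size_t b = 0; b < myHashMap.bucket_count(); ++b) {
        size_t n = myHashMap.elems_in_bucket(b);
        if (n == 0)      ++empty;
        else if (n == 1) ++single;
        else             ++multiple;
    }
    cout << "empty: " << empty << ", one item: " << single
         << ", several items: " << multiple << endl;
}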
In the meantime I looked up the source code of the SGI library. There is the function insert_unique, which is called by the hash map's insert():

pair<iterator, bool> insert_unique(const value_type& __obj)
{
    resize(_M_num_elements + 1);
    return insert_unique_noresize(__obj);
}

This means: each time an element is inserted into the hash map, it is checked for resizing depending on _M_num_elements, which is the number of ALL elements in the map. If I have all elements in the same bucket, the map will still be resized once the element count reaches the number of buckets, although they are all in the same bucket...
I don't know why this is written like this.
This ensures that item search is always as efficient as possible (if this doesn't matter to a program, then std::map may be a better candidate).
Like the resizing of std::vector, the cost of 'rehashings' in hash_map is amortized constant per contained item, so this is normally not a problem. (NB: there are some sophisticated hash table algorithms that dynamically 'redistribute' items, but they only make sense in specific implementations.)
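To put a rough number on it: with the bucket counts seen above (769, 1543, 3079, and, assuming the prime list in the header continues with 6151, 12289, ...), inserting 10000 elements rehashes the table when it reaches 770, 1544, 3080 and 6152 elements. Those rehashes move about 769 + 1543 + 3079 + 6151, roughly 11500, elements in total, i.e. on the order of one extra hash computation per inserted element on top of the insertion itself, which is where the factor of about two per item comes from.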
This implementation is written for hash codes which are unique.
Yes, this is what they are supposed to be!
Well, this is no problem for numeric data types of smaller size than std::size_t. But this implementation of the hash map would be quite ugly if I wanted to insert large strings, for example...
Again, not really a problem, because the number of hash code computations is amortized constant (~2) per item inserted.
Well, I could answer my question myself. But I do not really understand why the SGI people want to have as many buckets as elements in every case...
In non-pathological cases (proper hashing) this is what allows hash_map to perform queries at optimal speed - this is the only benefit of hash_map. Searching (linearly) through multiple items in the same bucket can be quite expensive.
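
If you want to see that cost directly, you could time lookups with your constant hash against a spreading hash, along these lines (a rough sketch; the exact timings will of course vary):

#include <stdint.h>
#include <iostream>
#include <ctime>
using namespace std;

#include <ext/hash_map>
using namespace __gnu_cxx;

// Compare lookup cost when every key lands in one bucket (constant hash)
// against keys spread over many buckets (identity hash).
struct constantHash {
    size_t operator() (const uint64_t&) const { return 1; }
};
struct identityHash {
    size_t operator() (const uint64_t& ujHash) const {
        return static_cast<size_t>(ujHash);
    }
};

template <class Map>
double timeLookups(size_t count)
{
    Map m(700);
    for (uint64_t uj = 0; uj < count; ++uj)
        m.insert(make_pair(uj, uj));
    clock_t start = clock();
    for (uint64_t uj = 0; uj < count; ++uj)
        m.find(uj);   // with the constant hash, every find walks one long chain
    return double(clock() - start) / CLOCKS_PER_SEC;
}

int main()
{
    cout << "constant hash: "
         << timeLookups< hash_map<uint64_t, uint64_t, constantHash> >(10000)
         << "s" << endl;
    cout << "identity hash: "
         << timeLookups< hash_map<uint64_t, uint64_t, identityHash> >(10000)
         << "s" << endl;
}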


Cheers,
Ivan
 
