SGI hash_map

Discussion in 'C++' started by Christian Meier, Mar 21, 2005.

  1. Hello,

    Yes, I know that hash maps aren't in the standard. But I couldn't find any
    better newsgroup for this post. (or is there an SGI library newsgroup?)

    I am currently testing the hash_map implementation of SGI, and I am not
    sure whether what I discovered is really true....
    Here is my code:


    #include <stdint.h> // for uint64_t
    #include <iostream>
    using namespace std;

    #include <ext/hash_map>
    using namespace __gnu_cxx;


    const int C_BUCKETS = 700;
    const int C_INSERTIONS = 800;

    struct hashFunc {
        size_t operator()(const uint64_t& ujHash) const
        { return 1; /* return static_cast<size_t>(ujHash); */ }
    };

    typedef hash_map<uint64_t, uint64_t, hashFunc> MyHashMap;


    int main()
    {
        MyHashMap myHashMap(C_BUCKETS);

        cout << "bucket_count: " << myHashMap.bucket_count() << endl;
        for (uint64_t uj = 0; uj < C_INSERTIONS; ++uj) {
            myHashMap.insert(make_pair(uj, uj));
        } // for
        cout << "bucket_count: " << myHashMap.bucket_count() << endl;

    } // main()


    As you can see, my hash function always returns 1, so all the values go into
    the same bucket. When I run this program, I get the following output:
    bucket_count: 769
    bucket_count: 1543

    My question is why??? After inserting the 769th element, the number of
    buckets is doubled. I could understand this behaviour if each element went
    into its own bucket and all buckets were used. But I use only one bucket
    because of my hash function. The hash map never has fewer buckets than the
    number of elements.... When I set C_INSERTIONS to (1543 + 1), then
    bucket_count returns 3079...
    Now, what am I doing wrong?
    Or is this really the intended behaviour of the SGI hash_map? If so, why is
    it done like that?

    Thanks for your answers!

    Chris
    Christian Meier, Mar 21, 2005
    #1

  2. "Christian Meier" <> wrote in message
    news:d1ltql$kn0$...
    > Yes, I know that hash maps aren't in the standard. But I couldn't find any
    > better newsgroup for this post. (or is there an SGI library newsgroup?)
    >
    > I am currently testing the hash_map implementation of SGI.

    IIRC, hash_map and hash_set (etc.) are expected to enter the next C++
    standard library as unordered_map and unordered_set.

    [...]
    > As you can see, my hash function always returns 1. So all the values go
    > into
    > the same bucket. When I run this program, I get the following output:
    > bucket_count: 769
    > bucket_count: 1543
    >
    > My question is why???

    Your hash function is obviously malformed, and the container does not
    expect it. With a uniform hash function, increasing the number of buckets
    will uniformly decrease the number of items per bucket.
    Even if all items fall into the same bucket for a given bucket count,
    the algorithm can legitimately expect that increasing the bucket count
    will lead to a somewhat more uniform distribution of elements.

    > After inserting the 769th element, the number of
    > buckets is doubled. I could understand this behaviour if each element went
    > into a own bucket and all buckets were used. But I use only one bucket
    > because of my hash function. The hash map never has less buckets than
    > number
    > of elements.... When I set C_INSERTIONS to (1543 + 1) then the
    > bucket_count
    > returns 3079...
    > Now, what am I doing wrong?

    A good hash function is essential for these containers to work correctly.
    It is essential that the function returns a relatively uniformly distributed
    random value. A minimalistic way to achieve this is to multiply an input
    value by some large prime number, better is to use one of the many well-
    studied hash functions you'll find on the web.
    For an intro, see for example:
    http://www.concentric.net/~Ttwang/tech/inthash.htm
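    The multiplicative scheme Ivan describes might look like the sketch below.
    (The functor name and the 64-bit golden-ratio multiplier are illustrative
    choices on my part, not taken from the SGI sources or the linked page.)

```cpp
#include <cstddef>
#include <cstdint>

// Minimal sketch of a multiplicative hash for 64-bit keys: multiply by a
// large odd constant so that the high bits of the input are mixed into the
// result. The constant is the 64-bit golden-ratio multiplier commonly used
// for Fibonacci hashing; any large odd constant of similar quality works.
struct Uint64Hash {
    std::size_t operator()(std::uint64_t x) const {
        return static_cast<std::size_t>(x * 0x9E3779B97F4A7C15ULL);
    }
};
```

    Because the multiplier is odd, the mapping is a bijection modulo 2^64, so
    distinct keys still yield distinct raw hash values on a 64-bit size_t.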


    I hope this helps,
    Ivan
    --
    http://ivan.vecerina.com/contact/?subject=NG_POST <- email contact form
    Ivan Vecerina, Mar 21, 2005
    #2

  3. "Ivan Vecerina" <> schrieb
    im Newsbeitrag news:d1mknc$23p$...
    > "Christian Meier" <> wrote in message
    > news:d1ltql$kn0$...
    > > Yes, I know that hash maps aren't in the standard. But I couldn't find
    > > any better newsgroup for this post. (or is there an SGI library
    > > newsgroup?)
    > >
    > > I am currently testing the hash_map implementation of SGI.

    > IIRC, hash_map and hash_set (etc) are expected to enter the next C++
    > standard
    > library as unordered_map and unordered_set.
    >
    > [...]
    > > As you can see, my hash function always returns 1. So all the values go
    > > into
    > > the same bucket. When I run this program, I get the following output:
    > > bucket_count: 769
    > > bucket_count: 1543
    > >
    > > My question is why???

    > Your hash function is obviously malformed, and the container does not
    > expect it. With a uniform hash function, increasing the number of buckets
    > will uniformly decrease the number of items per bucket.
    > Even if all items fall into the same bucket for a given bucket count,
    > the algorithm can legitimately expect that increasing the bucket count
    > will lead to a somewhat more uniform distribution of elements.
    >
    > > After inserting the 769th element, the number of buckets is doubled. I
    > > could understand this behaviour if each element went into its own bucket
    > > and all buckets were used. But I use only one bucket because of my hash
    > > function. The hash map never has fewer buckets than the number of
    > > elements.... When I set C_INSERTIONS to (1543 + 1) then the bucket_count
    > > returns 3079...
    > > Now, what am I doing wrong?

    > A good hash function is essential for these containers to work correctly.


    Yes, that's the reason why I didn't delete my original hash function source
    code:

    size_t operator()(const uint64_t& ujHash) const
    { return 1; /* return static_cast<size_t>(ujHash); */ }

    Returning 1 was just for testing purposes.... to be sure that all elements
    go into the same bucket.

    > It is essential that the function returns a relatively uniformly
    > distributed random value. A minimalistic way to achieve this is to
    > multiply an input value by some large prime number, better is to use one
    > of the many well-studied hash functions you'll find on the web.
    > For an intro, see for example:
    > http://www.concentric.net/~Ttwang/tech/inthash.htm
    >
    >
    > I hope this helps,
    > Ivan
    > --
    > http://ivan.vecerina.com/contact/?subject=NG_POST <- email contact form
    >
    >


    Because there is no hash<uint64_t> function, I wrote my own. As the
    hash<int> value for an int with the value 5435438 is 5435438, and for
    123456 it is 123456, I just return the uint64_t value:
    return static_cast<size_t>(ujHash);

    My values do not have to be multiplied by a prime number because I get
    different values with small differences (not 1000000, 2000000 and 3000000).
    And before inserting into the map, each hash value is reduced with:
    hash_val %= bucket_count();
    And the number of buckets is always a prime number in the SGI
    implementation.
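    The bucket-selection step described here can be sketched as follows (the
    function name is illustrative, not a function from the SGI sources):

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of the bucket-selection step: the raw hash value is reduced modulo
// the (prime) bucket count to pick the bucket an element lands in.
std::size_t bucket_index(std::uint64_t hash_val, std::size_t bucket_count) {
    return static_cast<std::size_t>(hash_val % bucket_count);
}
```

    For example, with 769 buckets the hash values 5 and 774 both land in
    bucket 5, which is exactly how close-together keys can still collide.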

    In the meantime I looked up the source code of the SGI library. And there is
    the function insert_unique which is called by the hash map function
    ::insert():

    pair<iterator, bool> insert_unique(const value_type& __obj)
    {
        resize(_M_num_elements + 1);
        return insert_unique_noresize(__obj);
    }

    This means: each time an element is inserted into the hash map, a resize
    check is performed depending on _M_num_elements. _M_num_elements is the
    number of ALL elements in the map. If I have all elements in the same
    bucket, the map will be resized after reaching the number of buckets,
    although they are all in the same bucket...
    I don't know why it is written like this. This implementation is written
    for hash codes which are unique. Well, this is no problem for numeric data
    types smaller than std::size_t. But this implementation of the hash map
    would be quite ugly if I wanted to insert large strings, for example.....
    Well, I could answer my question by myself. But I do not really understand
    why the SGI people want to have as many buckets as elements in every
    case....
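    The same behaviour can be reproduced today with std::unordered_map, the
    standardized successor of hash_map mentioned earlier in the thread (a
    sketch; the struct and function names are my own):

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// A degenerate hash puts every element into one bucket, yet the table still
// grows, because rehashing is triggered by the global load factor
// (size() / bucket_count()), not by per-bucket occupancy.
struct DegenerateHash {
    std::size_t operator()(std::uint64_t) const { return 1; }
};

std::size_t final_bucket_count(std::size_t initial_buckets, std::uint64_t n) {
    std::unordered_map<std::uint64_t, std::uint64_t, DegenerateHash>
        m(initial_buckets);
    for (std::uint64_t i = 0; i < n; ++i)
        m.insert({i, i});
    return m.bucket_count();
}
```

    With the default max_load_factor() of 1.0, the final bucket count is
    always at least the number of elements, exactly as observed above.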

    But thanks for your help anyway!

    Greets Chris
    Christian Meier, Mar 21, 2005
    #3
  4. Javier Noval (Guest), replying to Christian Meier:

    It's my understanding that the hash_map is working correctly. For the
    rehashing (and thus increasing the number of buckets) it only takes into
    account the global usage of the table, not the usage of each bucket or
    anything like that. As soon as that usage rises above a certain point,
    the table is rehashed, regardless of whether it is a degenerate case (all
    elements in the same bucket) or not.
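    Under that reading, the resize rule reduces to a simple global check,
    sketched here assuming a threshold of one element per bucket as in the
    insert_unique/resize code quoted elsewhere in the thread (the function
    name is illustrative):

```cpp
#include <cstddef>

// Global-usage rehash rule: grow the table whenever inserting one more
// element would push the overall load above one element per bucket,
// regardless of how the elements are distributed across buckets.
bool needs_rehash(std::size_t num_elements, std::size_t num_buckets) {
    return num_elements + 1 > num_buckets;
}
```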

    -- Javier

    Christian Meier wrote:
    [...]
    > As you can see, my hash function always returns 1. So all the values go into
    > the same bucket. When I run this program, I get the following output:
    > bucket_count: 769
    > bucket_count: 1543
    >
    > My question is why??? After inserting the 769th element, the number of
    > buckets is doubled. [...]
    > Now, what am I doing wrong?
    > Or is this really the meaning of the implementation of the SGI hash_map? If
    > so, why is this done like that?
    Javier Noval, Mar 21, 2005
    #4
  5. "Christian Meier" <> wrote in message
    news:d1mo02$3t7$...
    > "Ivan Vecerina" <>
    > schrieb
    >> It is essential that the function returns a relatively uniformly
    >> distributed
    >> random value. A minimalistic way to achieve this is to multiply an input
    >> value by some large prime number, better is to use one of the many well-
    >> studied hash functions you'll find on the web.
    >> For an intro, see for example:
    >> http://www.concentric.net/~Ttwang/tech/inthash.htm

    [...]
    > Because there is no hash<uint64_t> function, I wrote my own. As the
    > hash<int> value for an int of the value 5435438 is 5435438 and for 123456
    > is
    > 123456, I just return the uint64_t value:
    > return static_cast<size_t>(ujHash);
    >
    > My values do not have to be multiplied by a prime number because I get
    > different values with little difference (not 1000000, 2000000 and
    > 3000000).
    > And before inserting into the map, each hash value is calculated with:
    > hash_val %= bucket_count();
    > And the number of buckets is always a prime number in the SGI
    > implementation.

    Depending on how the values are distributed, you may or may not have
    a uniform distribution. If you care to check, you could probably write
    a program to count the number of buckets that contain multiple items.
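    Such a counting program might be sketched with the bucket interface of
    std::unordered_map, the standardized successor of hash_map (the function
    name is my own, illustrative choice):

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Diagnostic for hash quality: count the buckets that hold more than one
// item. With a well-distributed hash and load factor <= 1, this number
// should stay small relative to the bucket count.
std::size_t buckets_with_collisions(
    const std::unordered_map<std::uint64_t, std::uint64_t>& m) {
    std::size_t crowded = 0;
    for (std::size_t b = 0; b < m.bucket_count(); ++b)
        if (m.bucket_size(b) > 1)
            ++crowded;
    return crowded;
}
```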

    > In the meantime I looked up the source code of the SGI library. And there
    > is
    > the function insert_unique which is called by the hash map function
    > ::insert():
    >
    > pair<iterator, bool> insert_unique(const value_type& __obj)
    > {
    > resize(_M_num_elements + 1);
    > return insert_unique_noresize(__obj);
    > }
    >
    > This means: Each time an element is inserted into the hash map, it will be
    > checked for resizing depending on _M_num_elements. _M_num_elements is the
    > number of ALL elements in the map. If I have all elements in the same
    > bucket, the map will be resized after reaching the number of buckets
    > although
    > they are all in the same bucket...
    > I don't know why this is written like this.

    This ensures that item search is always as efficient as possible (if
    this doesn't matter to a program, then std::map may be a better candidate).
    As with the resizing of std::vector, the number of 'rehashings' in hash_map
    is amortized constant relative to the number of contained items. So
    this is normally not a problem. (NB: there are some sophisticated hash
    table algorithms to dynamically 'redistribute' items, but they only make
    sense in specific implementations.)
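    The amortization argument can be checked empirically with
    std::unordered_map, the standardized successor of hash_map (a sketch;
    the function name is illustrative):

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Count how often the table actually rehashes while inserting n elements.
// Because the bucket count grows roughly geometrically on each rehash, the
// number of rehashes grows only logarithmically in n, so the total rehashing
// cost is amortized constant per insertion.
std::size_t count_rehashes(std::uint64_t n) {
    std::unordered_map<std::uint64_t, std::uint64_t> m;
    std::size_t rehashes = 0;
    std::size_t buckets = m.bucket_count();
    for (std::uint64_t i = 0; i < n; ++i) {
        m.insert({i, i});
        if (m.bucket_count() != buckets) {
            ++rehashes;
            buckets = m.bucket_count();
        }
    }
    return rehashes;
}
```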

    > This implementation is written for hash codes which are unique.

    Yes, this is what they are supposed to be!

    > Well, this is no problem for numeric data
    > types of smaller size than std::size_t. But this implementation of the
    > hash
    > map would be quite ugly if I wanted to insert large strings for
    > example.....

    Again, not really a problem, because the number of hash code computations
    is amortized constant (~2) per item inserted.

    > Well, I could answer my question by myself. But I do not really understand
    > why the SGI people want to have as many buckets as elements in every
    > case....

    In non-pathological cases (proper hashing) this is what allows hash_map
    to perform queries at optimal speed, which is the only benefit of
    hash_map. Searching (linearly) through multiple items in the same
    bucket can be quite expensive.


    Cheers,
    Ivan
    --
    http://ivan.vecerina.com/contact/?subject=NG_POST <- email contact form
    Ivan Vecerina, Mar 22, 2005
    #5
