Hash table in C++? STL?

Discussion in 'C++' started by pembed2003, Apr 14, 2004.

  1. pembed2003

    pembed2003 Guest

    Hi All,
    Does C++/STL have hashtable where I can do stuff like:

    Hashtable h<int>;

    h.store("one",1);
    h.store("two",2);

    and then later retrieve them like:

    cout<<h.fetch("one")<<endl;

    prints 1

    any idea?

    Thanks!
     
    pembed2003, Apr 14, 2004
    #1
    1. Advertising

  2. "pembed2003" <> wrote in message
    news:...
    > Hi All,
    > Does C++/STL have hashtable where I can do stuff like:
    >
    > Hashtable h<int>;
    >
    > h.store("one",1);
    > h.store("two",2);
    >
    > and then later retrieve them like:
    >
    > cout<<h.fetch("one")<<endl;
    >
    > prints 1
    >
    > any idea?
    >
    > Thanks!


    It has a map which has the same functionality as above (but store is called
    insert and fetch is called find). Map is usually based on a red-black tree
    algorithm. Several library implementers also offer a genuine hash table
    (called hash_map) and apparently some version of this will make it into a
    future version of the standard.

    john
     
    John Harrison, Apr 14, 2004
    #2
    1. Advertising

  3. John Harrison wrote:

    >
    > It has a map which has the same functionality as above (but store is called
    > insert and fetch is called find). Map is usually based on a red-black tree
    > algorithm. Several library implementers also offer a genuine hash table
    > (called hash_map) and apparently some version of this will make it into a
    > future version of the standard.
    >


    Does anyone know what will need to be provided in order to hash on a
    particular type? std::map requires a less-than comparison operation --
    what operations would a hash-based map need?

    -Kevin
    --
    My email address is valid, but changes periodically.
    To contact me please use the address from a recent posting.
     
    Kevin Goodsell, Apr 14, 2004
    #3
  4. pembed2003

    Andre Kostur Guest

    Kevin Goodsell <> wrote in
    news:Nxhfc.9574$:

    > John Harrison wrote:
    >
    >>
    >> It has a map which has the same functionality as above (but store is
    >> called insert and fetch is called find). Map is usually based on a
    >> red-black tree algorithm. Several library implementers also offer a
    >> genuine hash table (called hash_map) and apparently some version of
    >> this will make it into a future version of the standard.
    >>

    >
    > Does anyone know what will need to be provided in order to hash on a
    > particular type? std::map requires a less-than comparison operation --
    > what operations would a hash-based map need?


    That would depend on your implementation of hash_map. I would suspect
    some mechanism for generating a hash value from your key type, and I
    suppose some sort of comparison on the hash value as well (and probably
    you'd still need some sort of comparison function for your keys since they
    may need to be compared in the event of hash collisions).
     
    Andre Kostur, Apr 14, 2004
    #4
  5. "Kevin Goodsell" <> wrote
    > Does anyone know what will need to be provided in
    > order to hash on a particular type? std::map requires
    > a less-than comparison operation -- what operations
    > would a hash-based map need?


    A hashing function and an equality operation are the minimum requirements, but
    requiring a relational comparison (such as less_than) instead of equality would
    allow for certain optimizations on the part of the implementers (such as using
    a tree-like structure instead of lists for the buckets). I don't know whether
    the standard will require equality or a relational comparator.

    Sadly, the hashing function is where all the complexity lies and I'd be
    pleasantly surprised if more than 1% of programmers knew what constitutes a
    good hashing function. With a bad hashing function (and a naive bucket
    strategy), performance can quickly degenerate to much worse than a balanced
    tree on large data sets.

    Claudio Puviani
     
    Claudio Puviani, Apr 14, 2004
    #5
  6. pembed2003

    David Harmon Guest

    On Wed, 14 Apr 2004 20:54:05 GMT in comp.lang.c++, Kevin Goodsell
    <> wrote,
    >Does anyone know what will need to be provided in order to hash on a
    >particular type? std::map requires a less-than comparison operation --
    >what operations would a hash-based map need?


    Mainly, a hashing function that takes a argument of type Key
    and returns an int.

    http://std.dkuug.dk/jtc1/sc22/wg21/docs/papers/2003/n1443.html
     
    David Harmon, Apr 15, 2004
    #6
  7. David Harmon wrote:

    > On Wed, 14 Apr 2004 20:54:05 GMT in comp.lang.c++, Kevin Goodsell
    > <> wrote,
    >
    >>Does anyone know what will need to be provided in order to hash on a
    >>particular type? std::map requires a less-than comparison operation --
    >>what operations would a hash-based map need?

    >
    >
    > Mainly, a hashing function that takes a argument of type Key
    > and returns an int.
    >
    > http://std.dkuug.dk/jtc1/sc22/wg21/docs/papers/2003/n1443.html
    >


    (Having only glanced over the link so far...)

    I thought maybe they'd supply a hashing function that you can use, for
    example if there's an operator<<(ostream &, const Key&). The string
    representation could be generated with that and hashed. Since the
    proposal includes a hash function for std::string, I suppose it would be
    fairly easy to create this yourself.

    I don't know how well this would work in practice, though. Obviously the
    printed form of the Key would need to be unique.

    -Kevin
    --
    My email address is valid, but changes periodically.
    To contact me please use the address from a recent posting.
     
    Kevin Goodsell, Apr 15, 2004
    #7
  8. pembed2003

    pembed2003 Guest

    "John Harrison" <> wrote in message news:<c5k82i$2nhvo$-berlin.de>...
    > "pembed2003" <> wrote in message
    > news:...
    > > Hi All,
    > > Does C++/STL have hashtable where I can do stuff like:
    > >
    > > Hashtable h<int>;
    > >
    > > h.store("one",1);
    > > h.store("two",2);
    > >
    > > and then later retrieve them like:
    > >
    > > cout<<h.fetch("one")<<endl;
    > >
    > > prints 1
    > >
    > > any idea?
    > >
    > > Thanks!

    >
    > It has a map which has the same functionality as above (but store is called
    > insert and fetch is called find). Map is usually based on a red-black tree
    > algorithm. Several library implementers also offer a genuine hash table
    > (called hash_map) and apparently some version of this will make it into a
    > future version of the standard.
    >
    > john


    Hi John,
    Can you point me to the documentation where I can find the map and
    hash_map data structure? Thanks for the help!
     
    pembed2003, Apr 15, 2004
    #8
  9. pembed2003

    Siemel Naran Guest

    "pembed2003" <> wrote in message

    > Can you point me to the documentation where I can find the map and
    > hash_map data structure? Thanks for the help!


    It's not in the standard, but lots of compilers have it. Go to the include
    folder like C:\Program Files\Borland\CBuilder6\Include\Stlport and look for
    a file like hash_map. Or try your compiler documentation or
    http://www.sgi.com/tech/stl/HashedAssociativeContainer.html.
     
    Siemel Naran, Apr 15, 2004
    #9
  10. >
    > Hi John,
    > Can you point me to the documentation where I can find the map and
    > hash_map data structure? Thanks for the help!


    http://www.dinkumware.com/refxcpp.html

    but remember because hasp_map is non standard the description here might
    differ from the implementation you have (if you have one at all).

    john
     
    John Harrison, Apr 15, 2004
    #10
  11. pembed2003

    Siemel Naran Guest

    "Claudio Puviani" <> wrote in message news:MMhfc.43722
    > "Kevin Goodsell" <> wrote


    > > Does anyone know what will need to be provided in
    > > order to hash on a particular type? std::map requires
    > > a less-than comparison operation -- what operations
    > > would a hash-based map need?


    > A hashing function and an equality operation are the minimum requirements,

    but
    > requiring a relational comparison (such as less_than) instead of equality

    would
    > allow for certain optimizations on the part of the implementers (such as

    using
    > a tree-like structure instead of lists for the buckets). I don't know

    whether
    > the standard will require equality or a relational comparator.
    >
    > Sadly, the hashing function is where all the complexity lies and I'd be
    > pleasantly surprised if more than 1% of programmers knew what constitutes

    a
    > good hashing function. With a bad hashing function (and a naive bucket
    > strategy), performance can quickly degenerate to much worse than a

    balanced
    > tree on large data sets.


    I don't think I'm in that 1%. A naive guess is to add up all the letters
    and modulo by the bucket size, which they say should be prime, but why is
    this?

    size_t hash(const char * s, int buckets) {
    size_t h = 0;
    for ( ; *s; ++s) h += *s;
    return h%bucket.
    }

    But we need hash only the first maxchars characters.

    To answer Kevin's question about the hash function, it should be like this
    in general

    class hash {
    public:
    typedef char char_type;
    typedef typename std::vector<T>::size_type hash_type;
    explicit hash(size_t maxchars);
    hash_type operator()(char_type begin, char_type end, size_t numbuckets)
    const;
    private:
    size_t maxchars;
    };

    The hash may increase numbuckets dynamically as it grows, so numbuckets is a
    parameter to operator(). But the number of chars to hash is a constant, and
    should not change as we increase or decrease the number of buckets, so it is
    argument to the constructor and gets stored in the class.

    In our code above we rely on an implicit conversion from char_type to
    hash_type (which seems not reasonable), and hash_type::eek:perator+= (which
    seems reasonable as hash_type is just an unsigned integer).
     
    Siemel Naran, Apr 15, 2004
    #11
  12. Siemel Naran wrote:

    >
    > I don't think I'm in that 1%.


    I wouldn't dare claim to be either, without doing some research first.

    > A naive guess is to add up all the letters
    > and modulo by the bucket size,


    I think you mean table size. Each bucket has a separate size determined
    by the number of items that hash to that index.

    Or, perhaps you meant "the size measured in number of buckets" and I was
    just reading it wrong. I could see that.

    > which they say should be prime, but why is
    > this?


    I'm not sure I ever completely understood that. The one obvious reason
    doesn't apply to most implementations. In "probing" collision-resolution
    schemes it makes sense, but most use "chaining" or buckets. What I'm
    calling "probing" is the situation where a collision (i.e., a hash to a
    location that's already taken) is handled by trying other locations. For
    example, a naive approach is to try the next slot, until an empty one is
    found. A better technique (I think) would be double hashing, where a
    second hash function is used to determine the sequence of slots to try.
    So if h1() is the primary hash function and h2() is the secondary,
    suppose h1(x) = n, but n is already used. Then you would try n + h2(x),
    n + 2*h2(x), n + 3*h2(x), etc. In all cases, modding by the table size.

    In this case, having the table size prime ensures that you will
    eventually try every table slot -- if there's an open one, you'll find it.

    But nobody seems to use schemes like this very often, and it seems
    obvious why. Why would you want to take up additional slots in the
    table, increasing the chances of future collisions, when you can just
    chain things outside the table (in buckets)? Lookup complexity for the
    element is no worse (may even be better), and the table is less cluttered.

    So why the prime table size? I can't see an obvious reason. If hash
    values are approximately uniformly distributed between 0 and N, I'd
    think that ideal table sizes would be numbers that evenly divide N,
    because any remainder would split the table into two areas with
    different probabilities of any particular value hashing into them.

    >
    > size_t hash(const char * s, int buckets) {
    > size_t h = 0;
    > for ( ; *s; ++s) h += *s;
    > return h%bucket.
    > }


    Kernighan & Pike's _The Practice of Programming_ mentioned a similar
    hash function, but it included (if I recall correctly) multiplying the
    total by a constant before adding in the next character value. This
    constant has been experimentally determined to be effective for ASCII
    text. I'm not sure if anyone really knows why it works. I don't recall
    what the value was, but I'm sure someone can tell us.

    >
    > But we need hash only the first maxchars characters.


    Actually, I think K&P also talked about this, and suggested it was a bad
    idea. This is from memory again, so I may be wrong, but I believe they
    mentioned Java using a scheme like this originally. It turned out to not
    work well, and was changed to use the entire string.

    -Kevin
    --
    My email address is valid, but changes periodically.
    To contact me please use the address from a recent posting.
     
    Kevin Goodsell, Apr 15, 2004
    #12
  13. "Kevin Goodsell" <> wrote
    > So why the prime table size? I can't see an obvious reason.
    > If hash values are approximately uniformly distributed between
    > 0 and N, I'd think that ideal table sizes would be numbers that
    > evenly divide N, because any remainder would split the table
    > into two areas with different probabilities of any particular
    > value hashing into them.


    If the hash values are uniformly distributed (which is one characteristic of a
    good hashing function), then any table size would be equivalent; i.e., there
    would be no "ideal" size. The reason that prime numbers are used for table
    sizes is that many hashing functions are NOT good and sometimes generate hash
    values that are multiples of one or more numbers. Those values, modulo a
    relatively non-prime number will cluster. Those same values modulo a relatively
    prime number will uniformly span the entire set of 0..N-1. In other words,
    primality is used as insurance against bad hashing functions. You can persuade
    yourself of this by modeling it in a spreadsheet.

    Claudio Puviani
     
    Claudio Puviani, Apr 15, 2004
    #13
  14. (pembed2003) wrote in message news:<>...

    > Hi John,
    > Can you point me to the documentation where I can find the map and
    > hash_map data structure? Thanks for the help!


    std::map is documented in quite a number of books, e.g. Accelerated C++
    (Koenig&Moo) or Generic Programming and the STL(Austern)

    http://www.amazon.com/exec/obidos/ASIN/020170353X/
    http://www.amazon.com/exec/obidos/ASIN/0201309564/

    hash_map isn't in the standard, so there are multiple different
    versions around. Get the documentation from the same source as
    your code. Your compiler vendor may have included something.

    Regards,
    Michiel Salters
     
    Michiel Salters, Apr 15, 2004
    #14
  15. "Kevin Goodsell" <> wrote
    > So why the prime table size? I can't see an obvious reason.
    > If hash values are approximately uniformly distributed between
    > 0 and N, I'd think that ideal table sizes would be numbers that
    > evenly divide N, because any remainder would split the table
    > into two areas with different probabilities of any particular
    > value hashing into them.


    If the hash values are uniformly distributed (which is one characteristic of a
    good hashing function), then any table size would be equivalent; i.e., there
    would be no "ideal" size. The reason that prime numbers are used for table
    sizes is that many hashing functions are NOT good and sometimes generate hash
    values that are multiples of one or more numbers. Those values, modulo a
    relatively non-prime number will cluster. Those same values modulo a relatively
    prime number will uniformly span the entire set of 0..N-1. In other words,
    primality is used as insurance against bad hashing functions. You can persuade
    yourself of this by modeling it in a spreadsheet.

    Claudio Puviani

    [somehow, my server missed the first time I sent this reply]
     
    Claudio Puviani, Apr 15, 2004
    #15
  16. Claudio Puviani wrote:

    > "Kevin Goodsell" <> wrote
    >
    >>So why the prime table size? I can't see an obvious reason.
    >>If hash values are approximately uniformly distributed between
    >>0 and N, I'd think that ideal table sizes would be numbers that
    >>evenly divide N, because any remainder would split the table
    >>into two areas with different probabilities of any particular
    >>value hashing into them.

    >
    >
    > If the hash values are uniformly distributed (which is one characteristic of a
    > good hashing function), then any table size would be equivalent; i.e., there
    > would be no "ideal" size. The reason that prime numbers are used for table
    > sizes is that many hashing functions are NOT good and sometimes generate hash
    > values that are multiples of one or more numbers. Those values, modulo a
    > relatively non-prime number will cluster. Those same values modulo a relatively
    > prime number will uniformly span the entire set of 0..N-1. In other words,
    > primality is used as insurance against bad hashing functions. You can persuade
    > yourself of this by modeling it in a spreadsheet.
    >


    I thought it might be something like that. Although, I disagree that
    there's no "ideal" size with uniformly distributed hash values. I
    definitely think some sizes are better than others (though it may be a
    small difference). As a trivial example, suppose the range of hash
    values is 0 to 75, and the table size is 50. Hash values from 0 to 25,
    and also from 51 to 75, end up in the lower 25 slots, while only those
    hash values in the range 26 to 50 end up in the upper 25 slots. As a
    result, the probability of a key hashing to the first 25 slots is twice
    as high as the probability of hashing to the last 25.

    I doubt this makes any difference in practice, but it is what I had in
    mind in my previous post.

    -Kevin
    --
    My email address is valid, but changes periodically.
    To contact me please use the address from a recent posting.
     
    Kevin Goodsell, Apr 15, 2004
    #16
  17. pembed2003

    pembed2003 Guest

    (Michiel Salters) wrote in message news:<>...
    > (pembed2003) wrote in message news:<>...
    >
    > > Hi John,
    > > Can you point me to the documentation where I can find the map and
    > > hash_map data structure? Thanks for the help!

    >
    > std::map is documented in quite a number of books, e.g. Accelerated C++
    > (Koenig&Moo) or Generic Programming and the STL(Austern)
    >
    > http://www.amazon.com/exec/obidos/ASIN/020170353X/
    > http://www.amazon.com/exec/obidos/ASIN/0201309564/
    >
    > hash_map isn't in the standard, so there are multiple different
    > versions around. Get the documentation from the same source as
    > your code. Your compiler vendor may have included something.
    >


    Thanks for everyone's help so it looks like map is the standard and
    should be available for any platform? But hash_map is not standard.

    The reason I am asking for hash table is because I need to write a
    function that finds the overlap of two integer array. I am currently
    doing:

    #include<stdlib.h>
    #include<string.h>
    #include<sys/int_limits.h>

    #define SIZE ((sizeof(int)) * (INT16_MAX) * (2))

    void find(int* one,int c1,int* two,int c2){
    int i;
    int* buffer = malloc(SIZE);
    memset((void*)buffer,0,SIZE);
    for(i = 0; i < c1; i++)
    buffer[one + INT16_MAX] = 1;
    for(i = 0; i < c2; i++)
    if(buffer[two + INT16_MAX])
    printf("%d\n",two);
    free(buffer);
    }

    int main(int argc,char** argv){
    int a[] = {1,3,45,6,4,-1,8};
    int b[] = {4,2,4,222,8,45,-1};
    find(a,7,b,7);
    }

    this gives me a O(n) algr. but seems to be wasting a lot of memory so
    I am thinking using a hashtable instead. am i in the right track?
    Thanks!
     
    pembed2003, Apr 15, 2004
    #17
  18. pembed2003

    Siemel Naran Guest

    "Claudio Puviani" <> wrote in message news:Hexfc.66079
    > "Kevin Goodsell" <> wrote


    > > So why the prime table size? I can't see an obvious reason.
    > > If hash values are approximately uniformly distributed between
    > > 0 and N, I'd think that ideal table sizes would be numbers that
    > > evenly divide N, because any remainder would split the table
    > > into two areas with different probabilities of any particular
    > > value hashing into them.

    >
    > If the hash values are uniformly distributed (which is one characteristic

    of a
    > good hashing function), then any table size would be equivalent; i.e.,

    there
    > would be no "ideal" size. The reason that prime numbers are used for table
    > sizes is that many hashing functions are NOT good and sometimes generate

    hash
    > values that are multiples of one or more numbers. Those values, modulo a
    > relatively non-prime number will cluster. Those same values modulo a

    relatively
    > prime number will uniformly span the entire set of 0..N-1. In other words,
    > primality is used as insurance against bad hashing functions. You can

    persuade
    > yourself of this by modeling it in a spreadsheet.


    This makes sense. Can you write the barebones of a C++ program that can
    illustrate this? I guess you could write a spreadshet too, but the
    newsgroup only wants free text. I can test it if you want.

    We could call call a function hash_function, and we could use my simple
    function in the other post, maybe multiplying by the constant Kevin
    suggested.
     
    Siemel Naran, Apr 15, 2004
    #18
  19. pembed2003

    Siemel Naran Guest

    "Kevin Goodsell" <> wrote in message
    news:Gnqfc.11299

    > > size_t hash(const char * s, int buckets) {
    > > size_t h = 0;
    > > for ( ; *s; ++s) h += *s;
    > > return h%bucket.
    > > }

    >
    > Kernighan & Pike's _The Practice of Programming_ mentioned a similar
    > hash function, but it included (if I recall correctly) multiplying the
    > total by a constant before adding in the next character value. This
    > constant has been experimentally determined to be effective for ASCII
    > text. I'm not sure if anyone really knows why it works. I don't recall
    > what the value was, but I'm sure someone can tell us.


    I did some research, and they said multiplying by the alphabet size (ie.
    number of valid chars).


    > > But we need hash only the first maxchars characters.

    >
    > Actually, I think K&P also talked about this, and suggested it was a bad
    > idea. This is from memory again, so I may be wrong, but I believe they
    > mentioned Java using a scheme like this originally. It turned out to not
    > work well, and was changed to use the entire string.


    Suppose our company coding standard is to prefix namespace and class names
    with ABC_. Thus ABC_Contact, ABC_Thing. Imagine if our hash function used
    just the first 3 chars!

    So I think the number of chars to use depends on the scenario, but it does
    intuitively make sense to me. We could perhaps use the last N chars, middle
    N chars, etc. A templated hashing function makes all this possible.
     
    Siemel Naran, Apr 15, 2004
    #19
  20. pembed2003

    Siemel Naran Guest

    "pembed2003" <> wrote in message

    > #define SIZE ((sizeof(int)) * (INT16_MAX) * (2))
    >
    > void find(int* one,int c1,int* two,int c2){
    > int i;
    > int* buffer = malloc(SIZE);
    > memset((void*)buffer,0,SIZE);
    > for(i = 0; i < c1; i++)
    > buffer[one + INT16_MAX] = 1;
    > for(i = 0; i < c2; i++)
    > if(buffer[two + INT16_MAX])
    > printf("%d\n",two);
    > free(buffer);
    > }


    It seems you don't use the first INT16_MAX chars of buffer, or maybe I'm
    reading something wrong.

    > int main(int argc,char** argv){
    > int a[] = {1,3,45,6,4,-1,8};
    > int b[] = {4,2,4,222,8,45,-1};
    > find(a,7,b,7);
    > }
    >
    > this gives me a O(n) algr. but seems to be wasting a lot of memory so
    > I am thinking using a hashtable instead. am i in the right track?
    > Thanks!


    If the array size 'a' or 'one' is small then the O(N^2) algorithm is fine.
    Either loop over the chars in 'one' or 'two', then over the chars in the
    other loop. If you want to use STL, you can use std::count_if and supply a
    binary predicate that employs std::find (the binary predicate will have a
    constructor that takes the array 'one' as an argument).

    Note you can also sort the elements in 'one' and 'two', then loop over both
    containers simultaneously. Running time O(lg(N)) for the sort if using
    std::sort which normally uses quicksort, and O(N) for the scan. I don't
    think the standard provides such an algorithm though.

    You could sort the array in O(N) time using radix sort (but it will use so
    much memory, which is what you want to avoid), or some other specialized
    algorithm. You could also sort only 'one' and loop over the chars in 'two',
    but use binary_search to see if a char in 'one' exists in 'two'.

    If the array size is big, what difference will a hashtable make? You're
    using the hashtable to store elements visited, right? If the hashtable
    bucket size is small then you'll have lots of collisions (lots of elements
    mapping to the same element) and looking one element could be linear time
    (if the collided elements are stored in a vector). If the hashtable bucket
    size is big then you won't have lots of collisions, but then might as well
    use your original solution.
     
    Siemel Naran, Apr 15, 2004
    #20
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. David Williams
    Replies:
    2
    Views:
    1,147
    Jacob Yang [MSFT]
    Aug 12, 2003
  2. Rio
    Replies:
    4
    Views:
    1,220
  3. Pieter Claassen
    Replies:
    1
    Views:
    1,129
    CBFalconer
    Aug 4, 2004
  4. rp
    Replies:
    1
    Views:
    557
    red floyd
    Nov 10, 2011
  5. Srijayanth Sridhar
    Replies:
    19
    Views:
    644
    David A. Black
    Jul 2, 2008
Loading...

Share This Page