C++ hash map container

Discussion in 'C++' started by Matthias Käppler, Nov 29, 2004.

  1. Hi,

    I need to store objects of a FileInfo class containing information about a
    specific file. I can uniquely identify these files by their serial number.
    I thought it would be a good idea to use a hash map so I can access the
    file information in constant time, without having to iterate over a
    (possibly very large) list of files.

    As far as I know, std::map does not hash its entries (I guess it takes
    O(nlogn) time to find an entry). There's an extension class called
    __gnu_cxx::hash_map, but it's an SGI extension and I'm not sure to what
    extent it is safe to use in terms of portability.
    I tried browsing boost.org but couldn't find an implementation of a hash map
    either.
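
    To make it concrete, here is roughly the lookup pattern I have in mind,
    written with plain std::map for now; the FileInfo fields and the
    serial-number key type below are just placeholders, not my real code.

        #include <iostream>
        #include <map>
        #include <string>

        // Placeholder file-information record; the real FileInfo is not shown here.
        struct FileInfo {
            std::string path;
            long size;
        };

        int main() {
            // Keyed by serial number; std::map gives O(log n) lookup without hashing.
            std::map<unsigned long, FileInfo> files;

            FileInfo fi = { "/home/user/report.txt", 4096 };
            files[123456789UL] = fi;

            // find() avoids iterating over the whole collection of files.
            std::map<unsigned long, FileInfo>::const_iterator it = files.find(123456789UL);
            if (it != files.end())
                std::cout << it->second.path << " (" << it->second.size << " bytes)\n";
            return 0;
        }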

    Any suggestions?

    Thanks,
    Matthias
    Matthias Käppler, Nov 29, 2004
    #1

  2. "Matthias Käppler" <> wrote in message
    news:cog48r$s0m$02$-online.com...
    > Hi,
    >
    > I need to store objects of a FileInfo class containing information about a
    > specific file. I can uniquely identify these files by their serial number.
    > I thought it would be a good idea to use a hash map so I can access the
    > file information in constant time, without having to iterate over a
    > (possibly very large) list of files.
    >
    > As far as I know, std::map does not hash its entries (I guess it takes
    > O(nlogn) time to find an entry). There's an extension class called
    > __gnu_cxx::hash_map, but it's an SGI extension and I'm not sure to what
    > extent it is safe to use in terms of portability.
    > I tried browsing boost.org but couldn't find an implementation of a hash map
    > either.


    The recently approved library technical report TR1 contains specifications for
    four hashed container templates: unordered_set, unordered_map,
    unordered_multiset and unordered_multimap.

    An implementation can be found in the Boost Yahoo Files section, under the
    peculiar name jtc1-sc22-wg21-2003-n1456. See

    http://groups.yahoo.com/group/boost/files/jtc1-sc22-wg21-2003-n1456.tar.gz

    (You need to sign up for the boost developers list to access this material. See
    http://www.boost.org/more/mailing_lists.htm#main)

    BTW, I'm not sure the above implementation contains the latest changes to the
    TR1 specs, but it shouldn't matter for your purposes.
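
    Usage looks roughly like this. Note that the header and namespace below are
    assumptions on my part; where the containers actually live depends on which
    implementation you pick up (many compilers later shipped them under std::tr1).

        // Header and namespace are implementation-dependent assumptions; adjust
        // them to wherever your unordered_map implementation actually lives.
        #include <tr1/unordered_map>
        #include <iostream>
        #include <string>

        int main() {
            // Serial number -> file name, with average constant-time lookup.
            std::tr1::unordered_map<unsigned long, std::string> files;
            files[123456789UL] = "/home/user/report.txt";

            std::tr1::unordered_map<unsigned long, std::string>::const_iterator it =
                files.find(123456789UL);
            if (it != files.end())
                std::cout << it->second << '\n';
            return 0;
        }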

    Jonathan
    Jonathan Turkanis, Nov 29, 2004
    #2

  3. Matthias Käppler

    Jerry Coffin Guest

    Matthias Käppler <> wrote in message news:<cog48r$s0m$02$-online.com>...

    [ ... ]

    > As far as I know, std::map does not hash its entries (I guess it takes
    > O(nlogn) time to find an entry).


    No -- it's O(log N). Given that containers are (intended to be)
    contained entirely in addressable memory, the difference between
    logarithmic and constant time is small enough that in real life the
    difference will depend on the details of the implementation as much as
    the computational complexity.

    Just for example, if we assume a million items in the collection, a
    map should require roughly 20 operations for a search. A million items
    suggests about four characters to distinguish between items, so we'll
    average comparing 2 bytes for most of those comparisons, meaning we
    look at about 38 bytes plus the length of the key being searched for.

    For hashing, the initial hash requires looking at the entire key being
    searched for, and in a typical hash table we can expect to do about
    three probes to find the right item, meaning we look at about 6 bytes plus
    the length of the key being searched for. If (for example) the key is
    a string that averages 20 bytes, we get 58 bytes looked at for the
    tree, and 29 bytes looked at for the hash table, implying that the
    hash table should be twice as fast.

    Given the exponential growth of a tree, all but the last three or
    four levels are likely to be in the cache. That makes those
    comparisons _essentially_ free, since loading even one item from
    memory will probably be slower than traversing whatever part of the
    tree fits in the cache. In a hash table, if our hash function is good
    at all, we can expect the probes to be distributed almost randomly
    through memory, meaning we get no locality of reference and therefore
    no meaningful use of the cache (unless it's small enough that nearly
    the entire table will fit in the cache).

    This means a million items is an (extremely) rough approximation of a
    break-even point, with the hash table clearly winning if you have
    enough items, but the tree winning at the smaller end.

    It's also worth noting, however, that you can _expect_ this kind of
    behavior out of a tree. Algorithms for hash tables that need (for
    example) to accommodate arbitrary growth aren't nearly as well known,
    and some algorithms are pretty ugly. The algorithms (of which I'm
    aware) that avoid the worst of the ugliness also lose a fair amount in
    average speed.

    The bottom line is that I wouldn't take for granted that hashing is
    necessary to get decent performance in your situation.
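
    If in doubt, measure. A rough timing sketch along these lines (the element
    count and key values are arbitrary placeholders, not your data) will show
    whether std::map lookups are anywhere near being a bottleneck for you:

        #include <cstdio>
        #include <ctime>
        #include <map>

        // Rough timing sketch: how long do a million std::map lookups take here?
        // The element count and keys are arbitrary; substitute the real workload.
        int main() {
            const unsigned long n = 1000000;
            std::map<unsigned long, int> m;
            for (unsigned long i = 0; i < n; ++i)
                m[i * 2654435761UL] = static_cast<int>(i);

            std::clock_t start = std::clock();
            long hits = 0;
            for (unsigned long i = 0; i < n; ++i)
                if (m.find(i * 2654435761UL) != m.end())
                    ++hits;
            std::clock_t stop = std::clock();

            std::printf("%ld hits in %.2f seconds\n", hits,
                        double(stop - start) / CLOCKS_PER_SEC);
            return 0;
        }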

    --
    Later,
    Jerry.

    The universe is a figment of its own imagination.
    Jerry Coffin, Nov 30, 2004
    #3
  4. That was an interesting read, thanks Jerry.
    Maybe I'll just stick with std::map.

    Jerry Coffin wrote:

    > Matthias Käppler <> wrote in message
    > news:<cog48r$s0m$02$-online.com>...
    >
    > [ ... ]
    >
    >> As far as I know, std::map does not hash its entries (I guess it takes
    >> O(nlogn) time to find an entry).

    >
    > No -- it's O(log N). Given that containers are (intended to be)
    > contained entirely in addressable memory, the difference between
    > logarithmic and constant time is small enough that in real life the
    > difference will depend on the details of the implementation as much as
    > the computational complexity.
    >
    > Just for example, if we assume a million items in the collection, a
    > map should require roughly 20 operations for a search. A million items
    > suggests about four characters to distinguish between items, so we'll
    > average comparing 2 bytes for most of those comparisons, meaning we
    > look at about 38 bytes plus the length of the key being searched for.
    >
    > For hashing, the initial hash requires looking at the entire key being
    > searched for, and in a typical hash table we can expect to do about
    > three probes to find the right item, meaning we look at about 6 bytes plus
    > the length of the key being searched for. If (for example) the key is
    > a string that averages 20 bytes, we get 58 bytes looked at for the
    > tree, and 29 bytes looked at for the hash table, implying that the
    > hash table should be twice as fast.
    >
    > Given the exponential growth of a tree, all but the last three or
    > four levels are likely to be in the cache. That makes those
    > comparisons _essentially_ free, since loading even one item from
    > memory will probably be slower than traversing whatever part of the
    > tree fits in the cache. In a hash table, if our hash function is good
    > at all, we can expect the probes to be distributed almost randomly
    > through memory, meaning we get no locality of reference and therefore
    > no meaningful use of the cache (unless it's small enough that nearly
    > the entire table will fit in the cache).
    >
    > This means a million items is an (extremely) rough approximation of a
    > break-even point, with the hash table clearly winning if you have
    > enough items, but the tree winning at the smaller end.
    >
    > It's also worth noting, however, that you can _expect_ this kind of
    > behavior out of a tree. Algorithms for hash tables that need (for
    > example) to accommodate arbitrary growth aren't nearly as well known,
    > and some algorithms are pretty ugly. The algorithms (of which I'm
    > aware) that avoid the worst of the ugliness also lose a fair amount in
    > average speed.
    >
    > The bottom line is that I wouldn't take for granted that hashing is
    > necessary to get decent performance in your situation.
    >
    Matthias Käppler, Nov 30, 2004
    #4
  5. Matthias Käppler

    Dave O'Hearn Guest

    (Jerry Coffin) wrote in message news:<>...
    > Matthias Käppler <> wrote in message news:<cog48r$s0m$02$-online.com>...
    >
    > [ ... ]
    >
    > > As far as I know, std::map does not hash its entries (I guess it takes
    > > O(nlogn) time to find an entry).

    >
    > No -- it's O(log N). Given that containers are (intended to be)
    > contained entirely in addressable memory, the difference between
    > logarithmic and constant time is small enough that in real life the
    > difference will depend on the details of the implementation as much as
    > the computational complexity.
    >
    > Just for example, if we assume a million items in the collection, a
    > map should require roughly 20 operations for a search. A million items
    > suggests about four characters to distinguish between items, so we'll
    > average comparing 2 bytes for most of those comparisons, meaning we
    > look at about 38 bytes plus the length of the key being searched for.


    What about page faults? A tree can cause a page fault on every one of
    those 20 operations. Most likely, the first couple will be cached, but
    once you get down several levels in the tree, you will start getting
    page faults.

    There are special tree implementations that reduce page faults, by
    making B-Trees where each node has many children, like hundreds of
    children, instead of just a 'left' and a 'right'. It's very unlikely
    that std::set is going to use that optimization, as it only makes
    sense on large data sets.

    I would take advantage of generic programming, and implement the
    application first using std::set, and then with whichever hash_set
    extension you like, and profile which actually behaves better in
    practice. With generic code and a few typedefs, possibly only one line
    of code will have to be changed to try out another container type.
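
    Something like the sketch below; the commented-out hash_set lines are only
    illustrative, since the exact header and namespace vary by vendor.

        #include <set>
        // #include <ext/hash_set>   // e.g. GCC's __gnu_cxx::hash_set; spelling varies by vendor

        // The container choice is isolated behind one typedef; trying a hashed
        // container (where your implementation provides one) means changing only this line.
        typedef std::set<unsigned long> SerialSet;
        // typedef __gnu_cxx::hash_set<unsigned long> SerialSet;

        int main() {
            SerialSet serials;
            serials.insert(123456789UL);

            // Everything else is written against the common set interface.
            return serials.find(123456789UL) != serials.end() ? 0 : 1;
        }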

    --
    Dave O'Hearn
    Dave O'Hearn, Dec 1, 2004
    #5
  6. Matthias Käppler

    Jerry Coffin Guest

    Matthias Käppler <> wrote in message news:<coh8nr$63f$04$-online.com>...
    > That was an interesting read, thanks Jerry.
    > Maybe I'll just stick with std::map.


    Well, I'm not sure I'd _stick_ with it, but there's a pretty fair
    chance that I'd at least start with it. If and when a profiler shows a
    reason to switch, that's soon enough to move to something else.

    --
    Later,
    Jerry.

    The universe is a figment of its own imagination.
    Jerry Coffin, Dec 1, 2004
    #6
  7. Matthias Käppler

    Jerry Coffin Guest

    (Dave O'Hearn) wrote in message news:<>...

    [ ... ]

    > What about page faults? A tree can cause a page fault on every one of
    > those 20 operations. Most likely, the first couple will be cached, but
    > once you get down several levels in the tree, you will start getting
    > page faults.


    Remember that a tree grows exponentially. The single bottom level of a
    (balanced) tree accounts for _half_ the data. If only the bottom four
    levels of the tree have been paged out, that means only 1/16th of the
    data is in memory.

    That can only happen in one of two ways: either your memory is HUGELY
    overcommitted, or your tree is used VERY little (or both). If memory
    is overcommitted so badly that only 1/16th of frequently-used data is
    in memory, NOTHING is going to be fast -- and we're still only talking
    about 4 disk accesses, which means the lookup itself is still likely
    to appear instantaneous.

    Alternatively, the data is simply used SO rarely that essentially ALL
    the data has been paged out. First of all, even at that, the penalty
    isn't terrible. With a modern disk a single lookup might be expected
    to take around 15 ms. 20 of those translates to 300 ms total, which is
    long enough for the user to notice a pause, but that's about it. Our
    original postulate to get to this point says this is happening only
    (say) once in something like a matter of hours or so, at which point
    saving milliseconds starts to become pointless.

    In reality, if these lookups are really that rare, chances are the
    user will notice a lot longer pause, but it'll be while the code to DO
    the lookup gets paged in. In this case, optimizing the lookup itself
    will usually make so little difference the user will never even notice
    -- if the code takes (for example) 4 seconds to page in, then we might
    be looking at 4.1 vs. 4.3 seconds, and the average user won't even
    know which was faster.

    In this case, we're clearly optimizing the wrong thing -- eliminating
    the lookup time entirely still makes less difference than a 10%
    improvement in the code paging time.

    > There are special tree implementations that reduce page faults, by
    > making B-Trees where each node has many children, like hundreds of
    > children, instead of just a 'left' and a 'right'. It's very unlikely
    > that std::set is going to use that optimization, as it only makes
    > sense on large data sets.


    It's certainly true that such things exist, but based on what's been
    said about the intended use, I doubt they're suitable. B-trees (and
    their brethren such as B* and B+ trees) normally make little sense
    unless you know up-front that the majority of the data WILL reside on
    disk.

    Given that he said he's storing information about files, that seems
    hard for me to believe -- just for example, the computer I'm writing
    this on has about 350,000 files and 1 Gig. of RAM. I'd guess typical
    users have an even higher ratio of RAM:file-count than I do (at least
    IME, programmers generally have more, smaller files than "normal"
    users). There are certainly file servers with more files, but they
    typically have more memory as well.

    As such, we more or less have to postulate something like using a
    wimpy workstation to store data about all the files on a big server,
    or else storing a LOT of information about each file. The former case
    sounds to me like a bad enough system design that no data structure can
    save it. The latter case simply calls for separating the large data
    from the index that finds it (or only putting data into the leaf
    nodes, rather like a B+ tree).

    > I would take advantage of generic programming, and implement the
    > application first using std::set, and then with whichever hash_set
    > extension you like, and profiling which actually behaves better in
    > practice. With generic code and a few typedefs, possibly only one line
    > of code will have to be changed to try out another container type.


    I quite agree -- in fact, this is pretty much what I said in my
    follow-up.

    --
    Later,
    Jerry.

    The universe is a figment of its own imagination.
    Jerry Coffin, Dec 2, 2004
    #7
  8. Matthias Käppler

    Dave O'Hearn Guest

    (Jerry Coffin) wrote:
    > Remember that a tree grows exponentially. The single bottom level of a
    > (balanced) tree accounts for _half_ the data. If only the bottom four
    > levels of the tree have been paged out, that means only 1/16th of the
    > data is in memory.
    >
    > That can only happen in one of two ways: either your memory is HUGELY
    > overcommitted, or your tree is used VERY little (or both). If memory
    > is overcommitted so badly that only 1/16th of frequently-used data is
    > in memory, NOTHING is going to be fast -- and we're still only talking
    > about 4 disk accesses, which means the lookup itself is still likely
    > to appear instantaneous.


    Ah. I had recently read a paper with measurements on trees vs. skip
    lists, which demonstrated that 2-way trees caused far more paging
    on large data sets. On second look through, "large data sets" were
    quite large indeed; the behavior didn't manifest until 600,000 ints
    were added to the container. Of course a skip list is not a hashtable,
    but if anything, the hashtable would page less than the skip list.

    Still, it is not going to happen unless the memory is hugely
    overcommitted, as you said.

    --
    Dave O'Hearn
    Dave O'Hearn, Dec 3, 2004
    #8
