maximum size of a hash table

Discussion in 'Perl Misc' started by Lee, Feb 24, 2005.

  1. Lee

    Anno Siegel Guest

    It's a simple space/time tradeoff. You only get constant access time
    when collisions are rare, so the number of buckets must be larger than
    the number of keys. Say you expect a million keys and linear search
    is too slow by a factor of 100. Use a hash with 100 buckets (plus a
    few more to compensate for overhead). That gives you the necessary
    speedup with almost no memory overhead, as opposed to a hash of more
    than a million buckets for constant access time. That's a useful hash
    application well within the linear range.
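    The tradeoff Anno describes can be sketched in a few lines. This is
    illustrative Python, not Perl's internals; the class and names are
    made up for the example. With 100 buckets, each chain holds roughly
    n/100 entries, so a linear scan inside a bucket is about 100 times
    shorter than a scan over the whole list.

    ```python
    # Minimal sketch (not Perl internals): a fixed 100-bucket hash whose
    # chains are searched linearly, illustrating the ~100x speedup over
    # a flat linear scan. Names are illustrative.
    class SmallHash:
        def __init__(self, nbuckets=100):
            self.buckets = [[] for _ in range(nbuckets)]

        def _bucket(self, key):
            return self.buckets[hash(key) % len(self.buckets)]

        def insert(self, key, value):
            self._bucket(key).append((key, value))

        def lookup(self, key):
            # Linear scan, but only over ~n/100 entries instead of n.
            for k, v in self._bucket(key):
                if k == key:
                    return v
            return None

    h = SmallHash()
    for i in range(10_000):
        h.insert(f"key{i}", i)

    print(h.lookup("key1234"))             # 1234
    print(max(len(b) for b in h.buckets))  # chains ~100 long, not 10,000
    ```

    Memory overhead is just the 100 bucket headers, which is the point:
    most of the speedup for almost none of the space.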

    Anno Siegel, Feb 27, 2005

  2. Lee

    John Bokma Guest

    All depends on the hash function of course, but if the size is big (n)
    and the number of collisions is small (a constant), you still have O(1).
    It all depends on what you want. If linear search is too slow and you
    need a lot of lookups, it might be much smarter to sort once
    ( O(n log n) ) and use binary search for the lookups ( O(log n) ).
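    The sort-then-search alternative John mentions looks like this as a
    Python sketch (the thread is about Perl, but the idea is
    language-independent): pay O(n log n) once, then each lookup is
    O(log n) via the standard `bisect` module.

    ```python
    # One-time O(n log n) sort, then O(log n) lookups with binary search.
    import bisect

    data = [f"key{i}" for i in range(100_000)]
    data.sort()  # the one-time cost

    def contains(sorted_list, item):
        # bisect_left is O(log n): ~17 comparisons for 100,000 items.
        i = bisect.bisect_left(sorted_list, item)
        return i < len(sorted_list) and sorted_list[i] == item

    print(contains(data, "key12345"))  # True
    print(contains(data, "nope"))      # False
    ```

    The catch, as Anno notes below, is that the list must stay sorted,
    so frequent insertions and deletions erode the advantage.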

    With 100 buckets you get 100 lists of 10,000 items each. So it takes
    5,000 steps on average to find something.

    Compare that with binary search :-D.
    See above.
    John Bokma, Feb 28, 2005

  3. That's not the result of having a lot of *data* - it's the result of having
    a lot of *collisions*. It can be avoided by choosing a hashing function
    that results in fewer collisions.

    BTW - *what* 32-bit limit? Perl uses native ints for hash codes, so on any
    system where sizeof(void*)<=sizeof(int), there can be at least as many
    buckets as there are bytes of memory. Each bucket uses more than one byte
    of memory, so that's a theoretical limit that can't be reached in practice.

    Sherm Pendley, Feb 28, 2005
  4. Lee

    John Bokma Guest

    Caused by a lot of data.
    Again, I was talking about a 32-bit limit. You can probably do the math
    on how much data I am talking about.
    Yeah, that was answered a few days back. Maybe read the whole thread
    before you start some very creative snipping and replying.
    John Bokma, Feb 28, 2005
  5. Is too... Is not... Is too... Is not! :)

    As I (mis?)understand it, the size of a list - *in and of itself* - does not
    cause collisions. Perl's hash codes are native ints. The exact width varies
    with architecture, but it's always wide enough to assign a unique code to
    each and every byte of memory.

    Hence my assertion that if you have duplicate hash codes, it's not the
    result of having more data elements than possible buckets. That, by
    definition, cannot happen with any list that will fit in RAM. So if you
    have duplicates, it must be the result of a hashing function that fails
    to generate as many unique codes as it should for the given data.
    I don't have an education in formal CS theory, and I'm well aware that there
    could be flaws in my logic. If so, enlighten me and I'll thank you for it.
    But please do stick to logic - the attitude's uncalled-for.

    You posted, I replied, you replied back, I replied again. Each time I quoted
    the relevant bits of the message I was replying to. Nothing "creative"
    about that, at least not in the derogatory sense you're implying. And the
    answer from a few days back was mine, where I pointed out the definition of
    U32 in handy.h, and the comment there that says it's not fixed at 32 bits.

    Sherm Pendley, Feb 28, 2005
  6. Lee

    Anno Siegel Guest

    Collisions *can't* be rare when the number of keys exceeds the number
    of buckets, no matter what the hash function.
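    Anno's point is just the pigeonhole principle, which a tiny sketch
    (illustrative Python, not Perl) makes concrete: put more keys than
    buckets into any hash table and at least one bucket must hold two
    or more keys, regardless of the hash function used.

    ```python
    # Pigeonhole in action: 101 keys into 100 buckets guarantees at
    # least one bucket with >= 2 keys, no matter the hash function.
    from collections import Counter

    NBUCKETS = 100
    keys = range(101)  # one more key than there are buckets

    counts = Counter(hash(k) % NBUCKETS for k in keys)
    print(max(counts.values()) >= 2)  # True: a collision is unavoidable
    ```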
    You must keep the list sorted. When insertions and/or deletions are
    frequent, you lose the advantage.

    What's the point? I'm not saying that hashes with large load factors
    are a super trick that solves everything. They can be, and have been
    used that way. You asked for an example.

    Anno Siegel, Feb 28, 2005
  7. Lee

    John Bokma Guest

    It is. I was talking about a situation where a hash table that hashes
    to 32-bit values can stop being efficient because a lot of data causes
    too many collisions.

    Of course there are situations where you can cause a lot of collisions
    by putting the "right" data into a hash table, but that's not what I
    was talking about.
    That was originally unclear; I pinned it at 32 bits, always (but
    remarked that I didn't know if this was the case). And if that were the
    case (which it was not) *then* it is possible to reach the limits of a
    hash table with enough data, at least to me. A hash table that doesn't
    do O(1) look-ups is, to me, not a useful one.
    John Bokma, Feb 28, 2005
  8. Lee

    John Bokma Guest

    You wrote (see above) that "the number of buckets must be larger than
    the number of keys" to make collisions rare. With perfect hashing you
    can hash n elements into n slots without having any collision.

    Moreover, it is possible to have relatively many collisions and still
    have O(1) look-up. As long as the average number of collisions is a
    (small) constant with respect to n, the look-up is O(1).
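    That claim can be checked numerically with a sketch (illustrative
    Python; `avg_probes` is a made-up helper, not a real API): if the
    bucket count scales with n so the load factor stays constant, the
    expected probes per successful lookup stay flat as n grows, which is
    exactly what O(1) means here.

    ```python
    # When buckets scale with n (constant load factor), average probes
    # per successful lookup stay roughly constant even though some
    # collisions occur -- so lookups are O(1).
    def avg_probes(n):
        nbuckets = n  # one bucket per key: load factor stays 1
        chains = [0] * nbuckets
        for i in range(n):
            chains[hash(f"key{i}") % nbuckets] += 1
        # A successful lookup in a chain of length c costs (c + 1) / 2
        # probes on average; weight each chain by its membership.
        return sum(c * (c + 1) / 2 for c in chains) / n

    # Both values land near 1.5, independent of n.
    print(avg_probes(1_000), avg_probes(100_000))
    ```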
    search, insertion and deletion are all O(log n) (at most 20 steps for
    a million items).

    Your proposal has O(n) look-up (5,000 steps on average), O(1)
    insertion, and O(n) deletion of a specific element (since you have to
    look it up first, 5,000 steps on average; otherwise O(1)).

    Technically you are using a hash table with 100 buckets, each bucket
    containing a list. Implementing it that way, and sorting each list,
    would give a major speed-up (at most 14 steps to find an item with
    binary search, at most 14 steps for insert/delete). [1]
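    The hybrid John describes can be sketched as follows (illustrative
    Python, smaller n than the million-item example, same idea): 100
    buckets, each kept sorted, with binary search inside a bucket. At a
    million items that is ~10,000 per bucket, and ceil(log2(10,000)) = 14
    comparisons at most, matching the figure above.

    ```python
    # 100 buckets, each kept sorted; binary search inside the bucket.
    import bisect
    import math

    NBUCKETS = 100
    buckets = [[] for _ in range(NBUCKETS)]
    for i in range(100_000):  # smaller n than 1e6, same structure
        bisect.insort(buckets[hash(i) % NBUCKETS], i)

    def member(x):
        b = buckets[hash(x) % NBUCKETS]
        j = bisect.bisect_left(b, x)  # binary search within the bucket
        return j < len(b) and b[j] == x

    print(member(42_000), member(-1))    # True False
    print(math.ceil(math.log2(10_000)))  # 14: max steps per bucket at n=1e6
    ```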
    I can't think of any advantage of an O(n) look-up hash table; it works
    like an unsorted list, see above.
    John Bokma, Feb 28, 2005
  9. Lee

    Ted Zlatanov Guest

    I have to disagree with the "any" part. O(n) can mean a lot of things
    in practice. If the constant part is large enough, it can make a
    difference in specific real-life situations. Also, if n is known
    to be in specific ranges, an O(n) algorithm may well be better than an
    asymptotically faster one with larger constants.
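    Ted's point about constant factors can be illustrated with a hedged
    sketch (timings are machine-dependent, so nothing here claims which
    side wins; it only sets up the comparison): for tiny n, a plain linear
    scan has an almost-zero constant factor, so big-O alone doesn't
    predict the faster option.

    ```python
    # Illustrative only: compare membership tests on a tiny collection.
    # Timings vary by machine; the point is that at small n the constant
    # factors dominate, not the asymptotic class.
    import timeit

    small = list(range(8))
    as_set = set(small)

    linear = timeit.timeit(lambda: 7 in small, number=100_000)  # O(n) scan
    hashed = timeit.timeit(lambda: 7 in as_set, number=100_000)  # O(1) hash

    print(f"linear: {linear:.4f}s  set: {hashed:.4f}s")
    ```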

    Ted Zlatanov, Mar 1, 2005
  10. Lee

    John Bokma Guest

    Well, I can't think of any :-D
    Example? (in the above context)
    John Bokma, Mar 2, 2005
