maximum size of a hash table

Anno Siegel

John Bokma said:
[hashes]
Fine. My point is that there are useful applications of hash tables
also in the O(n) range.

Uhm, like? (Ok, memory stress testing is one).

It's a simple space/time tradeoff. You only get constant access time
when collisions are rare, so the number of buckets must be larger than
the number of keys. Say you expect a million keys and linear search
is too slow by a factor of 100. Use a hash with 100 buckets (some more
to compensate overhead). That gives you the necessary speedup with
almost no memory overhead, as opposed to a hash of more than a million
buckets for constant access time. That's a useful hash application
well in the linear range.
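
A quick sketch of the idea in Perl (illustrative and untested; the
bucket count and the toy hash function are arbitrary choices of mine):

  use strict;
  use warnings;

  my $NBUCKETS = 100;                  # deliberately far fewer buckets than keys
  my @buckets  = map { [] } 1 .. $NBUCKETS;

  sub bucket_for {
      my ($key) = @_;
      my $h = 0;                       # toy string hash, only for illustration
      $h = ($h * 33 + ord) % $NBUCKETS for split //, $key;
      return $buckets[$h];
  }

  sub store {
      my ($key, $value) = @_;
      my $bucket = bucket_for($key);
      for my $pair (@$bucket) {
          if ($pair->[0] eq $key) { $pair->[1] = $value; return }
      }
      push @$bucket, [ $key, $value ]; # O(1) insertion at the tail
  }

  sub fetch {
      my ($key) = @_;
      my $bucket = bucket_for($key);   # linear scan of one bucket: ~5,000
      for my $pair (@$bucket) {        # steps avg for a million keys, vs
          return $pair->[1]            # ~500,000 for one flat list
              if $pair->[0] eq $key;
      }
      return undef;
  }

Look up inside a bucket is still a linear scan, so the whole thing stays
O(n), but with a constant factor of about 1/100 and only 100 array heads
of memory overhead.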

Anno
 
John Bokma

Anno said:
John Bokma said:
Anno said:
[hashes]
I am happy with a constant avg chain length of c, and hence O(1) look up.

Fine. My point is that there are useful applications of hash
tables also in the O(n) range.

Uhm, like? (Ok, memory stress testing is one).

It's a simple space/time tradeoff. You only get constant access time
when collisions are rare, so the number of buckets must be larger than
the number of keys.

All depends on the hash function of course, but if the size is big (n)
and the number of collisions is small (constant), you still have O(1)
access.
Anno said:
Say you expect a million keys and linear search
is too slow by a factor of 100. Use a hash with 100 buckets (some
more to compensate overhead).

It all depends on what you want. If linear search is too slow and you
need a lot of look-ups, it might be way smarter to sort (O(n log n))
and use binary search for the look-ups (O(log n)).

With 100 buckets you get 100 lists of 10,000 items each. So it takes
on average 5,000 steps to find something.

Compare that with binary search :-D.
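
Untested, but something like this classic binary search is what I mean;
for a million sorted keys it needs at most about 20 comparisons:

  use strict;
  use warnings;

  # Binary search over an ascending array of strings.
  # Returns the index of $key, or undef if it is not present.
  sub bsearch {
      my ($sorted, $key) = @_;
      my ($lo, $hi) = (0, $#$sorted);
      while ($lo <= $hi) {
          my $mid = int(($lo + $hi) / 2);
          my $cmp = $sorted->[$mid] cmp $key;
          if    ($cmp < 0) { $lo = $mid + 1 }
          elsif ($cmp > 0) { $hi = $mid - 1 }
          else             { return $mid }
      }
      return undef;
  }

  my @keys = sort qw(delta alpha echo bravo charlie);
  print bsearch(\@keys, 'charlie'), "\n";   # prints 2
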
Anno said:
That gives you the necessary speedup with
almost no memory overhead, as opposed to a hash of more than a million
buckets for constant access time. That's a useful hash application
well in the linear range.

See above.
 
Sherm Pendley

John said:
The poor result I was talking about was when the 32-bit limit on the hash
code results in O(n) hash look-ups because there is a lot of data in the
hash. :-D.

That's not the result of having a lot of *data* - it's the result of having
a lot of *collisions*. It can be avoided by choosing a hashing function
that results in fewer collisions.

BTW - *what* 32-bit limit? Perl uses native ints for hash codes, so on any
system where sizeof(void*)<=sizeof(int), there can be at least as many
buckets as there are bytes of memory. Each bucket uses more than one byte
of memory, so that's a theoretical limit that can't be reached in practice.
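
You can watch the bucket array directly, by the way (small demo; on
perls of this vintage a hash in scalar context reports used/total
buckets, while 5.26 and later return the key count instead):

  use strict;
  use warnings;

  my %h;
  keys(%h) = 1024;           # hint: preallocate at least 1024 buckets
  $h{$_} = 1 for 1 .. 500;
  print scalar(%h), "\n";    # prints something like "315/1024":
                             # used buckets / total buckets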

sherm--
 
John Bokma

Sherm said:
That's not the result of having a lot of *data* - it's the result of
having a lot of *collisions*.

Caused by a lot of data.
Sherm said:
It can be avoided by choosing a hashing
function that results in fewer collisions.

Again, I was talking about a 32-bit limit. You can probably do the math
on how much data I am talking about.
Sherm said:
BTW - *what* 32-bit limit? Perl uses native ints for hash codes, so on
any system where sizeof(void*)<=sizeof(int), there can be at least as
many buckets as there are bytes of memory. Each bucket uses more than
one byte of memory, so that's a theoretical limit that can't be
reached in practice.

Yeah, that has been answered a few days back. Maybe read the whole thread
before you start some very creative snipping and replying.
 
Sherm Pendley

John said:
Caused by a lot of data.

Is too... Is not... Is too... Is not! :)

As I (mis?)understand it, the size of a list - *in and of itself* - does not
cause collisions. Perl's hash codes are native ints. The exact width varies
with architecture, but it's always wide enough to assign a unique code to
each and every byte of memory.

Hence my assertion that if you have collisions, it's not the result of
having more data elements than possible buckets - that, by definition,
cannot happen with any list that will fit in RAM. So if you have
collisions, it must be the result of a hashing function that's failing
to generate as many unique codes as it should for the given data.
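
If you want to check the premise on your own machine, the standard
Config module will tell you how wide Perl's integers and your pointers
are (a small probe, nothing more):

  use strict;
  use warnings;
  use Config;

  # If Perl's integers are at least as wide as pointers, an int hash
  # code can in principle address every byte of memory, as argued above.
  print "ivsize:  $Config{ivsize} bytes\n";   # width of Perl's IV
  print "uvsize:  $Config{uvsize} bytes\n";   # width of Perl's UV
  print "ptrsize: $Config{ptrsize} bytes\n";  # width of a pointer
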
John said:
Yeah, that has been answered a few days back. Maybe read the whole thread
before you start some very creative snipping and replying.

I don't have an education in formal CS theory, and I'm well aware that there
could be flaws in my logic. If so, enlighten me and I'll thank you for it.
But please do stick to logic - the attitude's uncalled-for.

You posted, I replied, you replied back, I replied again. Each time I quoted
the relevant bits of the message I was replying to. Nothing "creative"
about that, at least not in the derogatory sense you're implying. And the
answer from a few days back was mine, where I pointed out the definition of
U32 in handy.h, and the comment there that says it's not fixed at 32 bits.

sherm--
 
Anno Siegel

John Bokma said:
Anno said:
John Bokma said:
Anno said:
[hashes]

I am happy with a constant avg chain length of c, and hence O(1) look up.

Fine. My point is that there are useful applications of hash
tables also in the O(n) range.

Uhm, like? (Ok, memory stress testing is one).

It's a simple space/time tradeoff. You only get constant access time
when collisions are rare, so the number of buckets must be larger than
the number of keys.

All depends on the hash function of course, but if the size is big (n)
and the number of collisions is small (constant), you still have O(1)
access.

Collisions *can't* be rare when the number of keys exceeds the number
of buckets, no matter what the hash function.
John Bokma said:
It all depends on what you want. If linear search is too slow and you
need a lot of look-ups, it might be way smarter to sort (O(n log n))
and use binary search for the look-ups (O(log n)).

With 100 buckets you get 100 lists of 10,000 items each. So it takes
on average 5,000 steps to find something.

Compare that with binary search :-D.

You must keep the list sorted. When insertions and/or deletions are
frequent, you lose the advantage.

What's the point? I'm not saying that hashes with large load factors
are a super trick that solves everything. They can be, and have been
used that way. You asked for an example.

Anno
 
John Bokma

Sherm said:
Is too... Is not... Is too... Is not! :)

It is. I was talking about a situation in which a hash table that hashes
to 32-bit values can stop being efficient with a lot of data, which
causes too many collisions.

Of course there are situations in which you can cause a lot of collisions
by putting the "right" data into a hash table, but that's not what I was
talking about.
Sherm said:
As I (mis?)understand it, the size of a list - *in and of itself* -
does not cause collisions. Perl's hash codes are native ints. The
exact width varies with architecture, but it's always wide enough to
assign a unique code to each and every byte of memory.

That was originally unclear; I pinned it at 32 bits, always (but
remarked that I didn't know if this was the case). And if that had been
the case (which it was not), *then* it would be possible to reach the
limits of a hash table with enough data, at least as I see it. A hash
table that doesn't do O(1) look ups is, to me, not a useful one.
 
John Bokma

Anno said:
John Bokma said:
Anno said:
[hashes]

I am happy with a constant avg chain length of c, and hence O(1) look up.

Fine. My point is that there are useful applications of hash
tables also in the O(n) range.

Uhm, like? (Ok, memory stress testing is one).

It's a simple space/time tradeoff. You only get constant access
time when collisions are rare, so the number of buckets must be
larger than the number of keys.

All depends on the hash function of course, but if the size is big
(n) and the number of collisions is small (constant), you still have
O(1) access.

Collisions *can't* be rare when the number of keys exceeds the number
of buckets, no matter what the hash function.

You wrote (see above) that "the number of buckets must be larger than
the number of keys" to make collisions rare. With perfect hashing you
can hash n elements into n slots without any collision.

Moreover, it is possible to have relatively many collisions and still
have O(1) look up. As long as the average number of collisions per
bucket is a (small) constant wrt n, the look up is O(1).
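
To put numbers on it (standard chaining arithmetic, my own figures):
with n keys in m buckets the load factor is alpha = n/m, and a
successful look up costs roughly 1 + alpha/2 comparisons on average, so
holding alpha constant holds the cost constant:

  use strict;
  use warnings;

  # Average cost of a successful look up with chaining is roughly
  # 1 + alpha/2 comparisons, where alpha = keys / buckets.
  my $keys = 1_000_000;
  for my $buckets (100, 10_000, 1_000_000) {
      my $alpha = $keys / $buckets;
      printf "%9d buckets: alpha = %9.2f, ~%.1f comparisons\n",
          $buckets, $alpha, 1 + $alpha / 2;
  }
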
Anno said:
You must keep the list sorted. When insertions and/or deletions are
frequent, you lose the advantage.

With a balanced tree, search, insertion and deletion are all O(log n)
(at most about 20 steps for a million items).

Your proposal has O(n) look up (5,000 steps avg), O(1) insertion, and
O(n) deletion (of a specific element, since you have to look it up
first, 5,000 steps avg, otherwise O(1)).

Technically you are then using a hash table with 100 buckets, each
bucket containing a list. Implementing it that way and sorting each
list would give a major speed up (at most 14 steps to find an item
with binary search, since log2(10,000) is about 13.3, and at most 14
steps to locate the insert/delete position). [1]
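
The 14 steps come from log2(10,000), rounded up (my arithmetic, easy to
check):

  perl -e 'printf "%.2f\n", log(10_000)/log(2)'   # 13.29, so at most 14 probes
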
Anno said:
What's the point? I'm not saying that hashes with large load factors
are a super trick that solves everything. They can be, and have been
used that way. You asked for an example.

I can't think of any advantage of an O(n) look-up hash table; it works
like an unsorted list, see above.
 
Ted Zlatanov

John Bokma said:
I can't think of any advantage of an O(n) look-up hash table; it works
like an unsorted list, see above.

I have to disagree with the "any" part. O(n) can mean a lot of things
in practice. If the constant factor is large enough, it can make a
difference in specific real-life situations. Also, if n is known to be
in specific ranges, an O(n) algorithm may well be better than an
O(log n) one.
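
For instance, a toy benchmark (mine, untested on any particular box;
results depend entirely on your machine and your n), comparing a dumb
linear scan with binary search over eight elements:

  use strict;
  use warnings;
  use Benchmark qw(cmpthese);

  my @small = (1 .. 8);    # tiny and sorted: n is known to be small

  sub linear {
      my ($k) = @_;
      for my $x (@small) { return 1 if $x == $k }
      return 0;
  }

  sub binary {
      my ($k) = @_;
      my ($lo, $hi) = (0, $#small);
      while ($lo <= $hi) {
          my $mid = int(($lo + $hi) / 2);
          if    ($small[$mid] < $k) { $lo = $mid + 1 }
          elsif ($small[$mid] > $k) { $hi = $mid - 1 }
          else                      { return 1 }
      }
      return 0;
  }

  # Run each for about one CPU second and compare rates.
  cmpthese(-1, {
      linear => sub { linear(7) },
      binary => sub { binary(7) },
  });

With n that small, the O(n) scan has every chance of winning on
constants alone.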

Ted
 
John Bokma

Ted said:
I have to disagree with the "any" part.

Well, I can't think of any :-D
Ted said:
O(n) can mean a lot of things
in practice. If the constant factor is large enough, it can make a
difference in specific real-life situations. Also, if n is known to be
in specific ranges, an O(n) algorithm may well be better than an
O(log n) one.

Example? (in the above context)
 
