which hash table - chained or open-addressed?

I

Ian Collins

Spiros said:
At this point in time I see 2 posts by Julia not
counting the opening one and neither contains
top-posting.
Must be a news server thing, I see four, over a period of 25 minutes
where the first three are top posted and the fourth corrected.
 
U

user923005

Ben said:
Why did you divide by 4? (4*a)/(4*b) = a/b, not 4*(a/b).

I assumed that it was done four times per iteration to minimize
loop overhead.


What bothers me with sticking a tree or a skiplist in each hash
node is the memory expense. It's going to cost at least an extra
pointer field or two per hash node. When I'm hashing small data
items that seems unduly expensive. For big data, sure, it makes
sense.

If you are concerned about memory, you can create a union, which has a
flag to say "I am an ordinary data item" or "I am a skiplist of data
items" so that the only table entries that have tree structures in them
are those that actually had to overflow.

At any rate, they turn out to be valuable more often than you might
think.
 
U

user923005

Ben said:
Why did you divide by 4? (4*a)/(4*b) = a/b, not 4*(a/b).

Look at the generated assembly.
I assumed that it was done four times per iteration to minimize
loop overhead.

#include <stdio.h>
#include <time.h>

int x, y; /* Extern to keep compiler from optimizing it out. */

int main () {
int i, cc;
clock_t t0, t1, t2;


x = y = 1;


for (;;) {
for (cc = 1000; ; cc+= cc) {
t0 = clock ();
for (i=0; i < cc; i++) {
x %= y;
}
t1 = clock();
for (i=0; i < cc; i++) {
x &= y;
}
t2 = clock();


if (t1 - t0 > 1000 && t2 > t1) {
printf ("Performance ratio: %f\n", (double)(t1 - t0) /
(double)(t2 - t1));
return 0;
}
}
}
return 0;
}
/*
C:\tmp>timetest
Performance ratio: 20.919355
*/
[snip]
 
L

Lane Straatman

Ben Pfaff said:
(e-mail address removed) writes:
[ built Paul's source without adjustment ]
On my machine here, compiled with full optimization, this most
often prints 58 as the ratio.
http://www.billfordx.net/screendumps/cstuff_5.htm
I'll claim to be Joe Average in this discussion, because I don't know a hash
table from a hash pipe. I accidentally built it in c++ and got a lower
number, but not less than 6.1 . I have the same opinion of Mr. Hsieh that I
did of Dan Pop: brilliant, but he might be passionate to a fault. But as
long as the former keeps posting about poker, he'll be alright with me. LS
 
W

websnarf

user923005 said:
Look at the generated assembly.

That is correct. Because a hash table would not have loop overhead
associated with this operation (it would have other overhead, but
that's not what's being compared by this test).
#include <stdio.h>
#include <time.h>

int x, y; /* Extern to keep compiler from optimizing it out. */

int main () {
int i, cc;
clock_t t0, t1, t2;


x = y = 1;


for (;;) {
for (cc = 1000; ; cc+= cc) {
t0 = clock ();
for (i=0; i < cc; i++) {
x %= y;
}
t1 = clock();
for (i=0; i < cc; i++) {
x &= y;
}
t2 = clock();


if (t1 - t0 > 1000 && t2 > t1) {
printf ("Performance ratio: %f\n", (double)(t1 - t0) /
(double)(t2 - t1));
return 0;
}
}
}
return 0;
}
/*
C:\tmp>timetest
Performance ratio: 20.919355
*/

Why did you remove the unrolling? A realistic hash table would not
have a & operation coupled with loop overhead. You can argue that it
remains serial, which is why I wrote it the way I did, but being dogged
by loop overhead is almost certainly wrong (it still shows a massive
advantage, which is what the real story is here, but its a bit
misleading about the original point).
 
W

websnarf

Lane said:
Ben Pfaff said:
(e-mail address removed) writes:
[ built Paul's source without adjustment ]
On my machine here, compiled with full optimization, this most
often prints 58 as the ratio.
http://www.billfordx.net/screendumps/cstuff_5.htm
I'll claim to be Joe Average in this discussion, because I don't know a hash
table from a hash pipe. I accidentally built it in c++ and got a lower
number, but not less than 6.1 .

Ok, go into Project -> <project-name> Properties then select
Configuration: All Configurations. Then go to C/C++ -> Optimization
and under Optimization select "Maximize Speed /O2". Then select Ok,
then in the tab where it says "Debug", scroll down and select
"Release". Oh yeah, and Ben Bfaff's suggestion about changing 1000 to
CLOCKS_PER_SEC is correct and recommended (I was just being lazy).

Then rerun.
 
W

websnarf

user923005 said:
P.S.
This thread is somewhat overly contentious.

Both C. B. Falconer and Paul Hsieh are intelligent and useful
contributors with a long history of helpful posts to all the USENET
groups to which they contribute.

Neither is stupid or lacking knowlege in the area of hash table
creation. I would rank both of them as expert.

If there is to be an argument about the best way to construct a hash
table, I suggest a contest to see which general purpose hash table
performs best in real world situations (short strings, strings of
arbitrary length, integers, long long integers and doubles with a large
class of disparate distributions given as inputs).

Well wait a second. Do you think such a contest could be done fairly?
After all, he's already written his own generic hash library and done
all the work of analyzing it and put it out there for public scrutiny.
What chance would I have?
 
B

Ben Pfaff

user923005 said:
If you are concerned about memory, you can create a union, which has a
flag to say "I am an ordinary data item" or "I am a skiplist of data
items" so that the only table entries that have tree structures in them
are those that actually had to overflow.

At any rate, they turn out to be valuable more often than you might
think.

Here's an idea that might work well, based on that suggestion.
Let each cell in the hash table have this form:

struct cell {
void *link[2];
void *root;
};

When root is null, link[0] and link[1], if nonnull, point to data
items.

When root is non-null, it points to a data item, and link[0] and
link[1] point to left and right child "struct cells".

Thus, you can chain 2 data items in one cell without having to do
any extra allocation. If you can get an extra bit, either by
stealing one from one of the pointers (non-portably) or by
allocating a separate bit vector (portably), you can even chain 3
data items in a cell without extra allocation.

Naively, I'd suggest a splay tree for balancing the binary search
tree.
 
D

Default User

Spiros said:
At this point in time I see 2 posts by Julia not
counting the opening one and neither contains
top-posting.

I suspect that she deleted some of them, which removed them from the
Google archives, but didn't affect the rest of Usenet. Trust me, she
posted about five, some top-posted, some later ones not.




Brian
 
A

Al Balmer

How do you know that I don't ?


Sadly I have no idea how to access this either using
a newsreader or Google.

If you *had* a newsreader, you might find it takes no more than
clicking on the link.
 
R

Richard Heathfield

Spiros Bousbouras said:
How do you know that I don't ?

Weeeeeerrrrrrrlllllllll, let's see about that, shall we? Reading on, ver'
ver' carefully now...
Sadly I have no idea how to access this either using
a newsreader or Google.

Yup! He's guvment all right.
 
R

Richard Heathfield

(e-mail address removed) said:
He did not *ASK* me what the performance of % was (a reasonable thing
to do) he *TOLD* me something that is ridiculously untrue.

If that is your problem with what he said, it would be better to say that
than to raise spurious questions about identity and authority.
Now I
understand that the people in this group necessarily take the position
that they have no idea what the performance of their underlying
hardware is, but he was taking the position that he knows.

I have a lot of time for Chuck on the hardware side. He may be wrong (or he
may not) but he's always worth listening to. If he *is* wrong, it'll be for
interesting reasons.
And that
was in response to someone who really does know -- more to the point,
everyone *KNOWS* that I know.

Well, that's not true. I don't know that you know. Clearly he doesn't know
that you know. And I'd be surprised if there weren't quite a few other
people that don't know that you know. So no, everyone *DOESN'T* know that
you know.
So his position was not only untrue, it
is an assertion that he knows the fact in direct contradiction of what
I (whose raison d'etre, could be considered knowing this sort of thing)
just told him.

Again you suggest that you have some sort of divine right to make assertions
about the performance of % in C. And yet I can find no normative reference
to substantiate the claim.
And what makes you think Dennis Ritchie knows what the performance of %
is?

I have not claimed that he does. I simply mentioned him as being one of very
few people whose actual identity does matter in comp.lang.c.
I mean, if he takes the same position as most of you here in
c.l.c. (and in fact most programmers) then presumably he doesn't know
and remains actively ignorant. But perhaps you know something about
him that I don't.

I know he doesn't appeal to his own authority in an attempt to win
arguments.
 
R

Richard Bos

Spiros Bousbouras said:
How do you know that I don't ?

You post as "Spiros Bousbouras", not as "Senator Bousbouras, Chair,
House Joint Committee On Un-American Usenet Posts". IOW, you may well
_work_ for the US government - I don't have information either way - but
you don't post here in that function.
Sadly I have no idea how to access this either using
a newsreader or Google.

Then I suggest a rummage through Google's Advanced Group Search. Bad as
Google is for posting and painful as it is for reading, it's still quite
useful for searching for posts.

Richard
 
S

santosh

Richard said:
"Spiros Bousbouras" <[email protected]> wrote:

Then I suggest a rummage through Google's Advanced Group Search. Bad as
Google is for posting and painful as it is for reading, it's still quite
useful for searching for posts.

Bad as it was, it was still usable. The new Groups that Google has
rolled out yesterday, is much worse, nearly unusable now. Time to get a
subscription...
 
A

Al Balmer

You post as "Spiros Bousbouras", not as "Senator Bousbouras, Chair,
House Joint Committee On Un-American Usenet Posts". IOW, you may well
_work_ for the US government - I don't have information either way - but
you don't post here in that function.

He may be working under cover.
 
D

Default User

Spiros Bousbouras wrote:

Sadly I have no idea how to access this either using
a newsreader or Google.

The following will be a copy of the message I replied to originally,
with full header. Now STFU.


From: "Julia" <[email protected]>
Subject: Re: which hash table - chained or open-addressed?
Date: 22 Jan 2007 09:30:34 -0800
Message-ID: <[email protected]>
References: <[email protected]>
<[email protected]>
Lines: 37
Path:
uni-berlin.de!fu-berlin.de!postnews.google.com!s34g2000cwa.googlegroups.
com!not-for-mail
Newsgroups: comp.lang.c
Organization: http://groups.google.com
NNTP-Posting-Host: 192.5.156.252
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
X-Trace: posting.google.com 1169487042 14418 127.0.0.1 (22 Jan 2007
17:30:42 GMT)
X-Complaints-To: (e-mail address removed)
NNTP-Posting-Date: Mon, 22 Jan 2007 17:30:42 +0000 (UTC)
In-Reply-To: <[email protected]>
User-Agent: G2/1.0
X-HTTP-UserAgent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;
SV1; Maxthon; .NET CLR 1.1.4322; .NET CLR 2.0.50727),gzip(gfe),gzip(gfe)
Complaints-To: (e-mail address removed)
Injection-Info: s34g2000cwa.googlegroups.com;
posting-host=192.5.156.252;
posting-account=u4JDkAwAAACvsziS7ugRuNEgAx5auEdg
Xref: uni-berlin.de comp.lang.c:822529

I checked the wikipedia link and it helps a lot. In my case, the key
value is relatively continuous but with a few empty slots and a few
outside-range values, for example, 3, 102, 103, 104, 106, 107, 108,
109, 110, 112, 113, 114, 115, 116, ....120, 13000. However, for the
different calling to hash table, the starting value can be very
different, for example, 102 at one calling, and 20333556 at another
calling. I guess open-addressing hash table should be OK for my
application. Am I right?

Best regards,

Julia
 
R

Richard Bos

santosh said:
Bad as it was, it was still usable. The new Groups that Google has
rolled out yesterday, is much worse, nearly unusable now.

I only use the advanced group search, and that works pretty much as it
used to. True, the resulting page is even uglier than before, but as
long as you can search, that's all I want it for anyway.
Time to get a subscription...

Sponsor that lot of Usenet-vandals? Never.

Richard
 
F

Flash Gordon

Richard Bos wrote, On 26/01/07 10:17:
I only use the advanced group search, and that works pretty much as it
used to. True, the resulting page is even uglier than before, but as
long as you can search, that's all I want it for anyway.

That's all I use it for as well, and each update I've seen seems to make
it worse than the last. Were it not for disk space requirements and the
resulting cost I would be inclined to set up my own competing free (and
advertising free) service.
Sponsor that lot of Usenet-vandals? Never.

Santosh did not say how he would get a subscription with. There are
plenty of free or very cheap Usenet providers he could choose from.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,434
Messages
2,571,689
Members
48,796
Latest member
Greg L.

Latest Threads

Top