Help migrating hash_set to c++0x

P

Paulo da Silva

Please correct me if I am wrong ...

I replaced hash_set by unordered_set. It seems that hash_set does not
exist using c++0x.

I am having the following problem:

pair<some_container::iterator,bool> r=ct_dir->insert(de);

When "de" already exists, with hash_table r.second contained false, but
using unordered_set r.second contains true, i.e. it accepts duplicates.

How do I fix this to avoid duplicates in the set and also be aware of that?

Thanks
 
R

red floyd

Please correct me if I am wrong ...

I replaced hash_set by unordered_set. It seems that hash_set does not
exist using c++0x.

Just an FYI. hash_set does not exist in c++98 or c++03 either.
 
P

Paulo da Silva

Em 21-12-2010 02:24, red floyd escreveu:
Just an FYI. hash_set does not exist in c++98 or c++03 either.
It existed, at least as an experimental feature.
The problem is what replaced it? In other words, how to implement a hash
set without duplicates?
 
P

Paulo da Silva

Em 20-12-2010 22:59, Paulo da Silva escreveu:

I just saw (wikipedia for example) that unordered_set should implement
the same behaviour as that of hash_set. Neverthless it is not working in
this code!

#include <memory>
#include <unordered_set>
#include <iostream>

using namespace std;

class Foo
{public:
string s;
Foo(char const * const sc): s(sc) {}
};

class eqf
{public:
inline bool operator()(Foo const &s1,Foo const &s2) const
{ return (s1.s==s2.s);
}
};

class hf
{public:
inline size_t operator()(Foo const &x) const
{ return hash<char const *>()(x.s.c_str());
}
};

// typedef __gnu_cxx::hash_set<Foo,hf,eqf> MSet;
typedef unordered_set<Foo,hf,eqf> MSet;

int main()
{ MSet mc;
pair<MSet::iterator,bool> r;
r=mc.insert(Foo("xxxx"));
// OK expected and obtained
cout << "xxxx " << (r.second?"OK":"BAD") << endl;
mc.insert(Foo("zzzz"));
// Does not allow duplicates ...
r=mc.insert(Foo("zzzz"));
// BAD (duplicate) expected but OK obtained
cout << "zzzz " << (r.second?"OK":"BAD") << endl;
MSet::const_iterator it=mc.find(Foo("xxxx"));
for (it=mc.begin();it!=mc.end();++it)
cout << it->s << endl;
return 0;
}

Anything wrong? Better way to implement?
Thanks for any comments.
 
P

Paulo da Silva

Em 21-12-2010 03:37, Paulo da Silva escreveu:
Em 20-12-2010 22:59, Paulo da Silva escreveu:

I just saw (wikipedia for example) that unordered_set should implement
the same behaviour as that of hash_set. Neverthless it is not working in
this code!

#include <memory>
#include <unordered_set>
#include <iostream>

using namespace std;

class Foo
{public:
string s;
Foo(char const * const sc): s(sc) {}
};

class eqf
{public:
inline bool operator()(Foo const &s1,Foo const &s2) const
{ return (s1.s==s2.s);
}
};

class hf
{public:
inline size_t operator()(Foo const &x) const
{ return hash<char const *>()(x.s.c_str());
}
};

// typedef __gnu_cxx::hash_set<Foo,hf,eqf> MSet;
typedef unordered_set<Foo,hf,eqf> MSet;

int main()
{ MSet mc;
pair<MSet::iterator,bool> r;
r=mc.insert(Foo("xxxx"));
// OK expected and obtained
cout << "xxxx " << (r.second?"OK":"BAD") << endl;
mc.insert(Foo("zzzz"));
// Does not allow duplicates ...
r=mc.insert(Foo("zzzz"));
// BAD (duplicate) expected but OK obtained
cout << "zzzz " << (r.second?"OK":"BAD") << endl;
MSet::const_iterator it=mc.find(Foo("xxxx"));
for (it=mc.begin();it!=mc.end();++it)
cout << it->s << endl;
return 0;
}

Anything wrong? Better way to implement?
Thanks for any comments.

Replacing class hf with this works.
class hf
{public:
inline size_t operator()(Foo const &s) const
{ return hash<string const &>()(s.s);
}
};

I still don't understand why the previous example stopped to work!
 
H

Howard Hinnant

Em 21-12-2010 03:37, Paulo da Silva escreveu:















Replacing class hf with this works.
class hf
{public:
        inline size_t operator()(Foo const &s) const
        {       return hash<string const &>()(s.s);
        }

};

I still don't understand why the previous example stopped to work!

I suspect it was a change in the behavior of the hash<char const*>
function. In C++0x std::hash<char const*> simply hashes the pointer,
not a null terminated string. Therefore using std::hash<char const*>
combined with your eqf would give you the situation that two Foo's
could compare equal but hash to different buckets. This is a
situation that will essentially corrupt your container. Your fix
(using hash<string const &>) is correct.

-Howard
 
P

Paulo da Silva

Em 21-12-2010 16:05, Howard Hinnant escreveu:
I suspect it was a change in the behavior of the hash<char const*>
function. In C++0x std::hash<char const*> simply hashes the pointer,
not a null terminated string. Therefore using std::hash<char const*>
combined with your eqf would give you the situation that two Foo's
could compare equal but hash to different buckets. This is a
situation that will essentially corrupt your container. Your fix
(using hash<string const &>) is correct.
That makes sense and explains the whole thing.
Nevertheless it seems a weird behaviour of hash<char const*>.
Thanks Howard.
 
J

Juha Nieminen

Pete Becker said:
The C++ standard doesn't have experimental features.

One could cynically argue that export templates turned out to be one,
in practice.
 
J

James Kanze

On Dec 20, 11:25 pm, Paulo da Silva [...]
In C++0x std::hash<char const*> simply hashes the pointer,
not a null terminated string.

Why did they do that? I can see hashing the pointer for other
pointer types, but char const*? (Of course, if the indices are
strings, which is usually the case, hash<std::string> seems more
logical.)
 
H

Howard Hinnant

    [...]
In C++0x std::hash<char const*> simply hashes the pointer,
not a null terminated string.

Why did they do that?  I can see hashing the pointer for other
pointer types, but char const*?  (Of course, if the indices are
strings, which is usually the case, hash<std::string> seems more
logical.)

Because it makes the behavior uniform for different pointer types.
This is a big win in generic code. For example storing pointers or
iterators in containers is common practice. Making hash<char*> behave
differently than hash<int*> would have even more severe consequences
than making vector<bool> behave differently than vector<int>. If
there is a silver lining to the vector<bool> cloud, it is that it
taught us a lesson in generic programming. Best we not forget that
lesson.

-Howard
 
J

James Kanze

[...]
In C++0x std::hash<char const*> simply hashes the pointer,
not a null terminated string.
Why did they do that? I can see hashing the pointer for other
pointer types, but char const*? (Of course, if the indices are
strings, which is usually the case, hash<std::string> seems more
logical.)
Because it makes the behavior uniform for different pointer types.

Like in iostream. For better or for worse, it's a different
behavior that most programmers are used to, and even count on.
This is a big win in generic code. For example storing pointers or
iterators in containers is common practice.

You mean an excessively dangerous practice, banned by a lot of
coding guidelines. (Iterators and pointers tend to become
invalid at inconvenient times.)
Making hash<char*> behave differently than hash<int*> would
have even more severe consequences than making vector<bool>
behave differently than vector<int>.

Come now. In both cases, all that's needed is a little
specialization (or overloading, if it's a question of
functions). With the difference that I suspect unordered
containers are fairly rare in generic code, where as vectors
aren't.
If there is a silver lining to the vector<bool> cloud, it is
that it taught us a lesson in generic programming. Best we
not forget that lesson.

The problem with vector<bool> isn't just that it's unorthogonal.
It's that it doesn't meet any of the requirements for a
container. (Arguable, some of the requirements may be
overspecification; it's a bit hard that operator[] must return a
real reference, for example. But that's a different issue.)

If we were designing the language from zero, I'd totally agree
with you. But we're not, and for better or worse, char const*
is, in most people's minds (and in a lot of the interfaces we
have to deal with), a string.

Anyway, I doubt that it's a real problem. Since the standard
doesn't make any quality guarantees (and I don't see how it
could), no one will actually use the standard hash function
anyway. (It suffers from the same problem rand() does: anything
that's good enough to work in all cases will be unnecessarily
slow in a number of frequent cases.)
 
H

Howard Hinnant

    [...]
In C++0x std::hash<char const*> simply hashes the pointer,
not a null terminated string.
Why did they do that?  I can see hashing the pointer for other
pointer types, but char const*?  (Of course, if the indices are
strings, which is usually the case, hash<std::string> seems more
logical.)
Because it makes the behavior uniform for different pointer types.

My apologies, I was probably brusk.
Like in iostream.  For better or for worse, it's a different
behavior that most programmers are used to, and even count on.

Agreed for iostream.
You mean an excessively dangerous practice, banned by a lot of
coding guidelines.  (Iterators and pointers tend to become
invalid at inconvenient times.)

No, I see containers of iterators or pointers into other containers on
a regular basis used in a safe and useful manner. I see this pattern
recommended here on this newsgroup with little or no objections.
Come now.  In both cases, all that's needed is a little
specialization (or overloading, if it's a question of
functions).  With the difference that I suspect unordered
containers are fairly rare in generic code, where as vectors
aren't.

I expect unordered containers to become about as popular as (multi)map/
set, perhaps a little more. Naturally the future is always difficult
to predict.
If there is a silver lining to the vector<bool> cloud, it is
that it taught us a lesson in generic programming.  Best we
not forget that lesson.

The problem with vector<bool> isn't just that it's unorthogonal.
It's that it doesn't meet any of the requirements for a
container.  (Arguable, some of the requirements may be
overspecification; it's a bit hard that operator[] must return a
real reference, for example.  But that's a different issue.)

The problem with vector<bool> is that it has expected semantics based
on experience with vector<int>, vector<char>, vector<char*>, etc. And
those expectations are not always met. This has more to do with the
consistency of the client's experience than it does with a spec that
may or may not be thoroughly read by the majority of the client
authors. Don't get me wrong: I think a dynamically sized array of
bits is a very beautiful and useful data structure. It just shouldn't
have been named vector<bool> because vector<T> for all scalars has a
well defined behavior without the specialization and the
specialization of vector<bool> implementing a dynamic array of bits is
unable to completely mimic the data structure of a dynamic array of
bools. /That/ is why it was a mistake -- a mismatch of reasonable
expectations.
If we were designing the language from zero, I'd totally agree
with you.  But we're not, and for better or worse, char const*
is, in most people's minds (and in a lot of the interfaces we
have to deal with), a string.

For C++0x we are designing std::hash from scratch. And we are also
designing std::unordered_(multi)map/set from scratch. Though
admittedly gratefully guided by a fair amount field experience. That
field experience indicates to me that:

1. std::hash should treat all pointers uniformly.
2. For storing containers of strings in a hash container, std::string,
not char* is the preferred key. The latter is replete with memory
ownership issues that the beginner is ill-equiped to deal with.
3. If the client really wants to store null-terminated char*'s,
dealing with memory ownership himself, then the hash containers should
allow that. They do: write your own hash function. This client must
be knowledgable enough to handle the manual memory management, manual
equality comparison, and manual hash functions. It is all allowed, it
just isn't the default.
Anyway, I doubt that it's a real problem.  Since the standard
doesn't make any quality guarantees (and I don't see how it
could), no one will actually use the standard hash function
anyway.  (It suffers from the same problem rand() does: anything
that's good enough to work in all cases will be unnecessarily
slow in a number of frequent cases.)

rand suffered from a different problem: a implicitly recommended poor
implementation. <random> goes a long way to correct that problem ...
a very long way. <random> is one of the most impressive random number
libraries I've seen for general purpose use. And it took a tremendous
effort and expense from a few very dedicated individuals to guide it
through the standardization process (I was not one of those).

But I agree with you that clients should feel free to write their own
hash functions. And I'm quite glad we didn't try to standardize, or
make an example hashing function for arrays of char. That would have
repeated the problems of rand. When prototyping, try the default
first. If hashing turns up performance problems, then write your own,
else don't. The library anticipates both cases. I expect some
std::hash<std::string> to be better than others, and for the market to
decide which implementation does best in this area. I have no doubt
that all implementations will do their best to become the best.
Vendors of the std::lib's want to please their customers. They have
high motivation to do so.

-Howard
 
J

James Kanze

On Dec 21, 6:32 pm, James Kanze <[email protected]> wrote:

[...]
No, I see containers of iterators or pointers into other containers on
a regular basis used in a safe and useful manner. I see this pattern
recommended here on this newsgroup with little or no objections.

I've not seen it recommended by any serious professional. And
it *is* dangerous: iterators are by design to be ephemeral.
They are more or less the equivalent to pointers to local or
member variables. (As with all "rules", there are some
exceptions. But you won't find them much in application code.)
I expect unordered containers to become about as popular as (multi)map/
set, perhaps a little more. Naturally the future is always difficult
to predict.

I don't doubt that they'll become popular. But I can't imagine
much generic code which deals with hash---if it does, it's not
too generic, since it is limited to unordered containers.
If there is a silver lining to the vector<bool> cloud, it is
that it taught us a lesson in generic programming. Best we
not forget that lesson.
The problem with vector<bool> isn't just that it's unorthogonal.
It's that it doesn't meet any of the requirements for a
container. (Arguable, some of the requirements may be
overspecification; it's a bit hard that operator[] must return a
real reference, for example. But that's a different issue.)
The problem with vector<bool> is that it has expected semantics based
on experience with vector<int>, vector<char>, vector<char*>, etc.

Expected semantics based on the fact that vector is a sequence.
And
those expectations are not always met. This has more to do with the
consistency of the client's experience than it does with a spec that
may or may not be thoroughly read by the majority of the client
authors. Don't get me wrong: I think a dynamically sized array of
bits is a very beautiful and useful data structure. It just shouldn't
have been named vector<bool> because vector<T> for all scalars has a
well defined behavior without the specialization and the
specialization of vector<bool> implementing a dynamic array of bits is
unable to completely mimic the data structure of a dynamic array of
bools. /That/ is why it was a mistake -- a mismatch of reasonable
expectations.

Agreed. Or in more formal terms, the standard defines
a contract for std::vector, then violates it for
For C++0x we are designing std::hash from scratch. And we are also
designing std::unordered_(multi)map/set from scratch. Though
admittedly gratefully guided by a fair amount field experience. That
field experience indicates to me that:
1. std::hash should treat all pointers uniformly.
2. For storing containers of strings in a hash container, std::string,
not char* is the preferred key. The latter is replete with memory
ownership issues that the beginner is ill-equiped to deal with.
3. If the client really wants to store null-terminated char*'s,
dealing with memory ownership himself, then the hash containers should
allow that. They do: write your own hash function. This client must
be knowledgable enough to handle the manual memory management, manual
equality comparison, and manual hash functions. It is all allowed, it
just isn't the default.

I'm not convinced that all of the above are that essential, but
I am beginning to think that treating std::hash<char*> as just
another pointer may be OK. If only because you also want it to
work when the pointer is null. And of course, because the cases
where the container will be indexed by C style strings should be
very exceptional, and the coder should expect to have to do
something exceptional in that case. (On the other hand, I'm
sure we both know programmers who thing char const* is
a string.)
rand suffered from a different problem: a implicitly recommended poor
implementation.

That aggrevated the problem, I agree. But the problem is, IMHO,
more general: if you're doing any serious work, and you need
random numbers, you need a minimum quality for those random
numbers.

Similarly, if you're using an unordered container, it's probably
because of speed as well -- if not, std::vector and std::find
work very well. And speed likely implies that you'll want
a guaranteed minimum quality for the hash. For the data sets
you expect. And while I think it would be possible to define
a hash with a minimum quality for std::string (but
std::basic_string already causes problems, because the hash has
to be consistent with char_traits::eq), but I don't know of any
generic way of specifying quality criteria for all possible
types; for that matter, I don't know how you'd specify the
quality critera other than by specifying an implementation.
<random> goes a long way to correct that problem ...
a very long way.

Some would say too far:). (Over engineering and all that.
Personally, since it's someone else who did the work, I'm all
for over engineering.)
<random> is one of the most impressive random number
libraries I've seen for general purpose use. And it took a tremendous
effort and expense from a few very dedicated individuals to guide it
through the standardization process (I was not one of those).
But I agree with you that clients should feel free to write their own
hash functions. And I'm quite glad we didn't try to standardize, or
make an example hashing function for arrays of char. That would have
repeated the problems of rand. When prototyping, try the default
first. If hashing turns up performance problems, then write your own,
else don't. The library anticipates both cases. I expect some
std::hash<std::string> to be better than others, and for the market to
decide which implementation does best in this area. I have no doubt
that all implementations will do their best to become the best.
Vendors of the std::lib's want to please their customers. They have
high motivation to do so.

Agreed in general, although I suspect that there's one issue
that people will tend to forget: portability of the data. As
long as the unordered containers are entirely in a single
process, or only shared by code using the same implementation
(and those are the only cases which the standard should be
concerned with), using the standard hash is find, even if it is
underspecified. I fear, however, that there are people who will
use it (because it's there) to hash tables on disk, or other
external media. I don't think it's a problem the standard
should address (other than to specify that the hashing
algorithms are "unspecified"), but teachers should at least
mention the issue.
 
J

Jorgen Grahn

Em 20-12-2010 22:59, Paulo da Silva escreveu:

I just saw (wikipedia for example) that unordered_set should implement
the same behaviour as that of hash_set.

Where more exactly? As I recall it, the iterator invalidation rules are
different, and there may be other differences as well. Unfortunately I
don't have any references handy.

/Jorgen
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,755
Messages
2,569,536
Members
45,020
Latest member
GenesisGai

Latest Threads

Top