Help migrating hash_set to c++0x

Paulo da Silva · Dec 20, 2010

Please correct me if I am wrong ...

I replaced hash_set by unordered_set. It seems that hash_set does not
exist using c++0x.

I am having the following problem:

pair<some_container::iterator,bool> r=ct_dir->insert(de);

When "de" already exists, with hash_table r.second contained false, but
using unordered_set r.second contains true, i.e. it accepts duplicates.

How do I fix this to avoid duplicates in the set and also be aware of that?

Thanks

red floyd · Dec 21, 2010

Please correct me if I am wrong ...

I replaced hash_set by unordered_set. It seems that hash_set does not
exist using c++0x.

Just an FYI. hash_set does not exist in c++98 or c++03 either.

Paulo da Silva · Dec 21, 2010

Em 21-12-2010 02:24, red floyd escreveu:

Just an FYI. hash_set does not exist in c++98 or c++03 either.

It existed, at least as an experimental feature.
The problem is what replaced it? In other words, how to implement a hash
set without duplicates?

Paulo da Silva · Dec 21, 2010

Em 20-12-2010 22:59, Paulo da Silva escreveu:

I just saw (wikipedia for example) that unordered_set should implement
the same behaviour as that of hash_set. Neverthless it is not working in
this code!

#include <memory>
#include <unordered_set>
#include <iostream>

using namespace std;

class Foo
{public:
string s;
Foo(char const * const sc): s(sc) {}
};

class eqf
{public:
inline bool operator()(Foo const &s1,Foo const &s2) const
{ return (s1.s==s2.s);
}
};

class hf
{public:
inline size_t operator()(Foo const &x) const
{ return hash<char const *>()(x.s.c_str());
}
};

// typedef __gnu_cxx::hash_set<Foo,hf,eqf> MSet;
typedef unordered_set<Foo,hf,eqf> MSet;

int main()
{ MSet mc;
pair<MSet::iterator,bool> r;
r=mc.insert(Foo("xxxx"));
// OK expected and obtained
cout << "xxxx " << (r.second?"OK":"BAD") << endl;
mc.insert(Foo("zzzz"));
// Does not allow duplicates ...
r=mc.insert(Foo("zzzz"));
// BAD (duplicate) expected but OK obtained
cout << "zzzz " << (r.second?"OK":"BAD") << endl;
MSet::const_iterator it=mc.find(Foo("xxxx"));
for (it=mc.begin();it!=mc.end();++it)
cout << it->s << endl;
return 0;
}

Anything wrong? Better way to implement?
Thanks for any comments.

Paulo da Silva · Dec 21, 2010

Em 21-12-2010 03:37, Paulo da Silva escreveu:

Em 20-12-2010 22:59, Paulo da Silva escreveu:

I just saw (wikipedia for example) that unordered_set should implement
the same behaviour as that of hash_set. Neverthless it is not working in
this code!

#include <memory>
#include <unordered_set>
#include <iostream>

using namespace std;

class Foo
{public:
string s;
Foo(char const * const sc): s(sc) {}
};

class eqf
{public:
inline bool operator()(Foo const &s1,Foo const &s2) const
{ return (s1.s==s2.s);
}
};

class hf
{public:
inline size_t operator()(Foo const &x) const
{ return hash<char const *>()(x.s.c_str());
}
};

// typedef __gnu_cxx::hash_set<Foo,hf,eqf> MSet;
typedef unordered_set<Foo,hf,eqf> MSet;

int main()
{ MSet mc;
pair<MSet::iterator,bool> r;
r=mc.insert(Foo("xxxx"));
// OK expected and obtained
cout << "xxxx " << (r.second?"OK":"BAD") << endl;
mc.insert(Foo("zzzz"));
// Does not allow duplicates ...
r=mc.insert(Foo("zzzz"));
// BAD (duplicate) expected but OK obtained
cout << "zzzz " << (r.second?"OK":"BAD") << endl;
MSet::const_iterator it=mc.find(Foo("xxxx"));
for (it=mc.begin();it!=mc.end();++it)
cout << it->s << endl;
return 0;
}

Anything wrong? Better way to implement?
Thanks for any comments.

Replacing class hf with this works.
class hf
{public:
inline size_t operator()(Foo const &s) const
{ return hash<string const &>()(s.s);
}
};

I still don't understand why the previous example stopped to work!

Howard Hinnant · Dec 21, 2010

Em 21-12-2010 03:37, Paulo da Silva escreveu:

Replacing class hf with this works.
class hf
{public:
inline size_t operator()(Foo const &s) const
{ return hash<string const &>()(s.s);
}

};

I still don't understand why the previous example stopped to work!

I suspect it was a change in the behavior of the hash<char const*>
function. In C++0x std::hash<char const*> simply hashes the pointer,
not a null terminated string. Therefore using std::hash<char const*>
combined with your eqf would give you the situation that two Foo's
could compare equal but hash to different buckets. This is a
situation that will essentially corrupt your container. Your fix
(using hash<string const &>) is correct.

-Howard

Paulo da Silva · Dec 21, 2010

Em 21-12-2010 16:05, Howard Hinnant escreveu:

I suspect it was a change in the behavior of the hash<char const*>
function. In C++0x std::hash<char const*> simply hashes the pointer,
not a null terminated string. Therefore using std::hash<char const*>
combined with your eqf would give you the situation that two Foo's
could compare equal but hash to different buckets. This is a
situation that will essentially corrupt your container. Your fix
(using hash<string const &>) is correct.

That makes sense and explains the whole thing.
Nevertheless it seems a weird behaviour of hash<char const*>.
Thanks Howard.

Juha Nieminen · Dec 21, 2010

Pete Becker said:
The C++ standard doesn't have experimental features.

One could cynically argue that export templates turned out to be one,
in practice.

James Kanze · Dec 21, 2010

On Dec 20, 11:25 pm, Paulo da Silva [...]
In C++0x std::hash<char const*> simply hashes the pointer,
not a null terminated string.

Why did they do that? I can see hashing the pointer for other
pointer types, but char const*? (Of course, if the indices are
strings, which is usually the case, hash<std::string> seems more
logical.)

Howard Hinnant · Dec 21, 2010

[...]

In C++0x std::hash<char const*> simply hashes the pointer,
not a null terminated string.

Click to expand...

Why did they do that? I can see hashing the pointer for other
pointer types, but char const*? (Of course, if the indices are
strings, which is usually the case, hash<std::string> seems more
logical.)

Because it makes the behavior uniform for different pointer types.
This is a big win in generic code. For example storing pointers or
iterators in containers is common practice. Making hash<char*> behave
differently than hash<int*> would have even more severe consequences
than making vector<bool> behave differently than vector<int>. If
there is a silver lining to the vector<bool> cloud, it is that it
taught us a lesson in generic programming. Best we not forget that
lesson.

-Howard

James Kanze · Dec 21, 2010

[...]

In C++0x std::hash<char const*> simply hashes the pointer,
not a null terminated string.

Click to expand...

Why did they do that? I can see hashing the pointer for other
pointer types, but char const*? (Of course, if the indices are
strings, which is usually the case, hash<std::string> seems more
logical.)

Click to expand...

Because it makes the behavior uniform for different pointer types.

Like in iostream. For better or for worse, it's a different
behavior that most programmers are used to, and even count on.

This is a big win in generic code. For example storing pointers or
iterators in containers is common practice.

You mean an excessively dangerous practice, banned by a lot of
coding guidelines. (Iterators and pointers tend to become
invalid at inconvenient times.)

Making hash<char*> behave differently than hash<int*> would
have even more severe consequences than making vector<bool>
behave differently than vector<int>.

Come now. In both cases, all that's needed is a little
specialization (or overloading, if it's a question of
functions). With the difference that I suspect unordered
containers are fairly rare in generic code, where as vectors
aren't.

If there is a silver lining to the vector<bool> cloud, it is
that it taught us a lesson in generic programming. Best we
not forget that lesson.

The problem with vector<bool> isn't just that it's unorthogonal.
It's that it doesn't meet any of the requirements for a
container. (Arguable, some of the requirements may be
overspecification; it's a bit hard that operator[] must return a
real reference, for example. But that's a different issue.)

If we were designing the language from zero, I'd totally agree
with you. But we're not, and for better or worse, char const*
is, in most people's minds (and in a lot of the interfaces we
have to deal with), a string.

Anyway, I doubt that it's a real problem. Since the standard
doesn't make any quality guarantees (and I don't see how it
could), no one will actually use the standard hash function
anyway. (It suffers from the same problem rand() does: anything
that's good enough to work in all cases will be unnecessarily
slow in a number of frequent cases.)

Howard Hinnant · Dec 22, 2010

[...]
In C++0x std::hash<char const*> simply hashes the pointer,
not a null terminated string.
Why did they do that? I can see hashing the pointer for other
pointer types, but char const*? (Of course, if the indices are
strings, which is usually the case, hash<std::string> seems more
logical.)

Click to expand...

Because it makes the behavior uniform for different pointer types.

Click to expand...

My apologies, I was probably brusk.

Like in iostream. For better or for worse, it's a different
behavior that most programmers are used to, and even count on.

Agreed for iostream.

You mean an excessively dangerous practice, banned by a lot of
coding guidelines. (Iterators and pointers tend to become
invalid at inconvenient times.)

No, I see containers of iterators or pointers into other containers on
a regular basis used in a safe and useful manner. I see this pattern
recommended here on this newsgroup with little or no objections.

Come now. In both cases, all that's needed is a little
specialization (or overloading, if it's a question of
functions). With the difference that I suspect unordered
containers are fairly rare in generic code, where as vectors
aren't.

I expect unordered containers to become about as popular as (multi)map/
set, perhaps a little more. Naturally the future is always difficult
to predict.

If there is a silver lining to the vector<bool> cloud, it is
that it taught us a lesson in generic programming. Best we
not forget that lesson.

Click to expand...

The problem with vector<bool> isn't just that it's unorthogonal.
It's that it doesn't meet any of the requirements for a
container. (Arguable, some of the requirements may be
overspecification; it's a bit hard that operator[] must return a
real reference, for example. But that's a different issue.)

The problem with vector<bool> is that it has expected semantics based
on experience with vector<int>, vector<char>, vector<char*>, etc. And
those expectations are not always met. This has more to do with the
consistency of the client's experience than it does with a spec that
may or may not be thoroughly read by the majority of the client
authors. Don't get me wrong: I think a dynamically sized array of
bits is a very beautiful and useful data structure. It just shouldn't
have been named vector<bool> because vector<T> for all scalars has a
well defined behavior without the specialization and the
specialization of vector<bool> implementing a dynamic array of bits is
unable to completely mimic the data structure of a dynamic array of
bools. /That/ is why it was a mistake -- a mismatch of reasonable
expectations.

If we were designing the language from zero, I'd totally agree
with you. But we're not, and for better or worse, char const*
is, in most people's minds (and in a lot of the interfaces we
have to deal with), a string.

For C++0x we are designing std::hash from scratch. And we are also
designing std::unordered_(multi)map/set from scratch. Though
admittedly gratefully guided by a fair amount field experience. That
field experience indicates to me that:

1. std::hash should treat all pointers uniformly.
2. For storing containers of strings in a hash container, std::string,
not char* is the preferred key. The latter is replete with memory
ownership issues that the beginner is ill-equiped to deal with.
3. If the client really wants to store null-terminated char*'s,
dealing with memory ownership himself, then the hash containers should
allow that. They do: write your own hash function. This client must
be knowledgable enough to handle the manual memory management, manual
equality comparison, and manual hash functions. It is all allowed, it
just isn't the default.

Anyway, I doubt that it's a real problem. Since the standard
doesn't make any quality guarantees (and I don't see how it
could), no one will actually use the standard hash function
anyway. (It suffers from the same problem rand() does: anything
that's good enough to work in all cases will be unnecessarily
slow in a number of frequent cases.)

rand suffered from a different problem: a implicitly recommended poor
implementation. <random> goes a long way to correct that problem ...
a very long way. <random> is one of the most impressive random number
libraries I've seen for general purpose use. And it took a tremendous
effort and expense from a few very dedicated individuals to guide it
through the standardization process (I was not one of those).

But I agree with you that clients should feel free to write their own
hash functions. And I'm quite glad we didn't try to standardize, or
make an example hashing function for arrays of char. That would have
repeated the problems of rand. When prototyping, try the default
first. If hashing turns up performance problems, then write your own,
else don't. The library anticipates both cases. I expect some
std::hash<std::string> to be better than others, and for the market to
decide which implementation does best in this area. I have no doubt
that all implementations will do their best to become the best.
Vendors of the std::lib's want to please their customers. They have
high motivation to do so.

-Howard

James Kanze · Dec 22, 2010

On Dec 21, 6:32 pm, James Kanze <[email protected]> wrote:

[...]

No, I see containers of iterators or pointers into other containers on
a regular basis used in a safe and useful manner. I see this pattern
recommended here on this newsgroup with little or no objections.

I've not seen it recommended by any serious professional. And
it *is* dangerous: iterators are by design to be ephemeral.
They are more or less the equivalent to pointers to local or
member variables. (As with all "rules", there are some
exceptions. But you won't find them much in application code.)

I expect unordered containers to become about as popular as (multi)map/
set, perhaps a little more. Naturally the future is always difficult
to predict.

I don't doubt that they'll become popular. But I can't imagine
much generic code which deals with hash---if it does, it's not
too generic, since it is limited to unordered containers.

If there is a silver lining to the vector<bool> cloud, it is
that it taught us a lesson in generic programming. Best we
not forget that lesson.

Click to expand...

The problem with vector<bool> isn't just that it's unorthogonal.
It's that it doesn't meet any of the requirements for a
container. (Arguable, some of the requirements may be
overspecification; it's a bit hard that operator[] must return a
real reference, for example. But that's a different issue.)

Click to expand...

The problem with vector<bool> is that it has expected semantics based
on experience with vector<int>, vector<char>, vector<char*>, etc.

Expected semantics based on the fact that vector is a sequence.

And
those expectations are not always met. This has more to do with the
consistency of the client's experience than it does with a spec that
may or may not be thoroughly read by the majority of the client
authors. Don't get me wrong: I think a dynamically sized array of
bits is a very beautiful and useful data structure. It just shouldn't
have been named vector<bool> because vector<T> for all scalars has a
well defined behavior without the specialization and the
specialization of vector<bool> implementing a dynamic array of bits is
unable to completely mimic the data structure of a dynamic array of
bools. /That/ is why it was a mistake -- a mismatch of reasonable
expectations.

Agreed. Or in more formal terms, the standard defines
a contract for std::vector, then violates it for

For C++0x we are designing std::hash from scratch. And we are also
designing std::unordered_(multi)map/set from scratch. Though
admittedly gratefully guided by a fair amount field experience. That
field experience indicates to me that:

1. std::hash should treat all pointers uniformly.
2. For storing containers of strings in a hash container, std::string,
not char* is the preferred key. The latter is replete with memory
ownership issues that the beginner is ill-equiped to deal with.
3. If the client really wants to store null-terminated char*'s,
dealing with memory ownership himself, then the hash containers should
allow that. They do: write your own hash function. This client must
be knowledgable enough to handle the manual memory management, manual
equality comparison, and manual hash functions. It is all allowed, it
just isn't the default.

I'm not convinced that all of the above are that essential, but
I am beginning to think that treating std::hash<char*> as just
another pointer may be OK. If only because you also want it to
work when the pointer is null. And of course, because the cases
where the container will be indexed by C style strings should be
very exceptional, and the coder should expect to have to do
something exceptional in that case. (On the other hand, I'm
sure we both know programmers who thing char const* is
a string.)

rand suffered from a different problem: a implicitly recommended poor
implementation.

That aggrevated the problem, I agree. But the problem is, IMHO,
more general: if you're doing any serious work, and you need
random numbers, you need a minimum quality for those random
numbers.

Similarly, if you're using an unordered container, it's probably
because of speed as well -- if not, std::vector and std::find
work very well. And speed likely implies that you'll want
a guaranteed minimum quality for the hash. For the data sets
you expect. And while I think it would be possible to define
a hash with a minimum quality for std::string (but
std::basic_string already causes problems, because the hash has
to be consistent with char_traits::eq), but I don't know of any
generic way of specifying quality criteria for all possible
types; for that matter, I don't know how you'd specify the
quality critera other than by specifying an implementation.

<random> goes a long way to correct that problem ...
a very long way.

Some would say too far

. (Over engineering and all that.
Personally, since it's someone else who did the work, I'm all
for over engineering.)

<random> is one of the most impressive random number
libraries I've seen for general purpose use. And it took a tremendous
effort and expense from a few very dedicated individuals to guide it
through the standardization process (I was not one of those).

But I agree with you that clients should feel free to write their own
hash functions. And I'm quite glad we didn't try to standardize, or
make an example hashing function for arrays of char. That would have
repeated the problems of rand. When prototyping, try the default
first. If hashing turns up performance problems, then write your own,
else don't. The library anticipates both cases. I expect some
std::hash<std::string> to be better than others, and for the market to
decide which implementation does best in this area. I have no doubt
that all implementations will do their best to become the best.
Vendors of the std::lib's want to please their customers. They have
high motivation to do so.

Agreed in general, although I suspect that there's one issue
that people will tend to forget: portability of the data. As
long as the unordered containers are entirely in a single
process, or only shared by code using the same implementation
(and those are the only cases which the standard should be
concerned with), using the standard hash is find, even if it is
underspecified. I fear, however, that there are people who will
use it (because it's there) to hash tables on disk, or other
external media. I don't think it's a problem the standard
should address (other than to specify that the hashing
algorithms are "unspecified"), but teachers should at least
mention the issue.

Jorgen Grahn · Dec 30, 2010

Em 20-12-2010 22:59, Paulo da Silva escreveu:

I just saw (wikipedia for example) that unordered_set should implement
the same behaviour as that of hash_set.

Where more exactly? As I recall it, the iterator invalidation rules are
different, and there may be other differences as well. Unfortunately I
don't have any references handy.

/Jorgen

hash_set: how to handle collisions?	5	Jul 11, 2008
STL hash_set problem, SGI impl	1	Nov 18, 2004
TF-IDF	1	Aug 19, 2021
Questions on various string literals in c++0x	1	Dec 7, 2010
How to try a range of hex values in C# code ?	0	Nov 19, 2022
problem with hash set iter	3	Apr 14, 2006
C++ danger to break due to its weight, fragmentation danger - C++0x	14	Apr 19, 2004
Help in C#	0	Mar 23, 2011

Help migrating hash_set to c++0x

Paulo da Silva

red floyd

Paulo da Silva

Paulo da Silva

Paulo da Silva

Howard Hinnant

Paulo da Silva

Juha Nieminen

James Kanze

Howard Hinnant

James Kanze

Howard Hinnant

James Kanze

Jorgen Grahn

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads