hash_set: problem with hash function?

M

Markus Dehmann

I have a class "Data" and I store Data pointers in an STL set. But I have
millions of inserts and many more lookups, and my profiler found that they
cost a lot of runtime.

Therefore, I want to substitute the set<Data*> with a hash_set<Data*>:

typedef hash_set<const Data*, hash<const Data*>, eqData> DataPointerHashSet;
// typedef set<Data*> DataPointerHashSet; // (see complete code below)

But it doesn't work! Everything is fine when I use set<Data*>, but when I
plug in hash_set<...> the program gives a wrong answer and then crashes.

I tried to make a mini example, see code below. But the problem is, the
mini example seems to work fine. Just my original example crashes.

Can anyone see problems with my hash_set usage in the mini example below?
Problems that might occur only in certain situations (collisions)? I don't
really know how to write the hash function, so I adapted it from an example
that I found somewhere.


#include <iostream>

#if __GNUC__ < 3 && __GNUC__ >= 2 && __GNUC_MINOR__ >= 95
# include <hash_set>
# include <functional>
# define gnu_namespace std
#elif __GNUC__ >= 3
# include <ext/hash_set>
# if __GNUC_MINOR__ == 0
# include <functional>
# define gnu_namespace std
# else
# include <ext/functional>
# define gnu_namespace __gnu_cxx
# endif
#else
# include <hash_set.h>
# include <functional.h>
# define gnu_namespace std
#endif
using namespace gnu_namespace;
#include <set>
using namespace std;

class Data {
private:
string someData;
unsigned index;
public:
const unsigned getIndex() const { return index; }
Data(const string& s) : someData(s) {
static unsigned index; // a unique index for each Data obj
this->index = index++;
}
};
bool operator == (const Data& n1, const Data& n2 ){
return n1.getIndex() == n2.getIndex();
}

namespace gnu_namespace {
template<>
struct hash<const Data*> {
hash<unsigned> hasher;
size_t operator()(const Data* n) const {
return hasher(n->getIndex()); // wrong hash function??
}
};
}
struct eqData{
bool operator()(const Data* s1, const Data* s2) const{
return *s1 == *s2;
}
};

typedef hash_set<const Data*, hash<const Data*>, eqData>
DataPointerHashSet;
// typedef set<Data*> DataPointerHashSet;

int main(){
DataPointerHashSet s;
const unsigned size = 10000;

Data* pointers[size];
for(int i = 0; i < size; ++i){
pointers = new Data("bla");
s.insert(pointers);
}
for(int i = 0; i < size; ++i){
if(s.find(pointers) == s.end()){
cerr << "ERROR" << endl;
exit(EXIT_FAILURE);
}
}
for(int i = 0; i < size; ++i){
delete pointers;
}
}
 
A

Alan Johnson

Markus said:
I have a class "Data" and I store Data pointers in an STL set. But I have
millions of inserts and many more lookups, and my profiler found that they
cost a lot of runtime.

Therefore, I want to substitute the set<Data*> with a hash_set<Data*>:

typedef hash_set<const Data*, hash<const Data*>, eqData> DataPointerHashSet;
// typedef set<Data*> DataPointerHashSet; // (see complete code below)

But it doesn't work! Everything is fine when I use set<Data*>, but when I
plug in hash_set<...> the program gives a wrong answer and then crashes.

I tried to make a mini example, see code below. But the problem is, the
mini example seems to work fine. Just my original example crashes.

Can anyone see problems with my hash_set usage in the mini example below?
Problems that might occur only in certain situations (collisions)? I don't
really know how to write the hash function, so I adapted it from an example
that I found somewhere.


#include <iostream>

#if __GNUC__ < 3 && __GNUC__ >= 2 && __GNUC_MINOR__ >= 95
# include <hash_set>
# include <functional>
# define gnu_namespace std
#elif __GNUC__ >= 3
# include <ext/hash_set>
# if __GNUC_MINOR__ == 0
# include <functional>
# define gnu_namespace std
# else
# include <ext/functional>
# define gnu_namespace __gnu_cxx
# endif
#else
# include <hash_set.h>
# include <functional.h>
# define gnu_namespace std
#endif
using namespace gnu_namespace;
#include <set>
using namespace std;

class Data {
private:
string someData;
unsigned index;
public:
const unsigned getIndex() const { return index; }
Data(const string& s) : someData(s) {
static unsigned index; // a unique index for each Data obj
this->index = index++;
}
};
bool operator == (const Data& n1, const Data& n2 ){
return n1.getIndex() == n2.getIndex();
}

namespace gnu_namespace {
template<>
struct hash<const Data*> {
hash<unsigned> hasher;
size_t operator()(const Data* n) const {
return hasher(n->getIndex()); // wrong hash function??
}
};
}
struct eqData{
bool operator()(const Data* s1, const Data* s2) const{
return *s1 == *s2;
}
};

typedef hash_set<const Data*, hash<const Data*>, eqData>
DataPointerHashSet;
// typedef set<Data*> DataPointerHashSet;

int main(){
DataPointerHashSet s;
const unsigned size = 10000;

Data* pointers[size];
for(int i = 0; i < size; ++i){
pointers = new Data("bla");
s.insert(pointers);
}
for(int i = 0; i < size; ++i){
if(s.find(pointers) == s.end()){
cerr << "ERROR" << endl;
exit(EXIT_FAILURE);
}
}
for(int i = 0; i < size; ++i){
delete pointers;
}
}


I don't entirely understand what you are trying to do with the hash
function here. The point of a hash function is to basically to turn
some block of data into an integer. That integer does not have to be
unique, but it does have to be deterministic (that is, the if you hash
the same thing twice, you get the same value). The quality of a hash
function can be measured by how many collisions it produces. A
collision occurs when two or more objects hash to the same value.

In your case, you are storing a set of pointers. I am curious, what is
it that you want to be unique, the pointers, or the things to which they
are pointing? For example, if I did the following:
Data *p1 = new Data("blah") ;
Data *p2 = new Data("blah") ;

Should I be able to insert both p1 and p2 into the set? The pointers
are certainly different (they point to different memory locations), but
the objects to which they point are equivalent. If it is the pointers
themselves you wish to be unique (i.e., both p1 and p2 WOULD be allowed
in the set) then I suggest simply using the pointer value as your hash.
For example:

class data_pointer_hash
{
public :
size_t operator()(const Data *p)
{
return reinterpret_cast<size_t>(p) ;
}
} ;

If it is the objects themselves that are important to you, then you need
to come up with some way to turn your object into an integer. If your
data is a string (as in the mini-example you gave) you can use the
templated hash function already provided by your STL implementation.
For example (note that for this example to work you will have to make
this class a friend of Data):

class data_hash
{
private :
hash<const char *> h ;
public :
size_t operator()(const Data *p)
{
return h(p->someData.c_str()) ;
}
} ;

Given that you are using, in your example, an equality function object
that compares the actual objects (instead of just the pointers), the
latter is likely the more appropriate hash function for you.
 
M

Markus Dehmann

Alan said:
the objects to which they point are equivalent. If it is the pointers
themselves you wish to be unique (i.e., both p1 and p2 WOULD be allowed
in the set) then I suggest simply using the pointer value as your hash.
For example:

class data_pointer_hash
{
public :
size_t operator()(const Data *p)
{
return reinterpret_cast<size_t>(p) ;
}
} ;

Thanks! That's what I wanted. It turned out there was a bug in the original
code that assigned duplicate indexes. Now I am getting rid of the indexes
at all and use your solution with directly hashing the pointers.

Markus
 
M

Markus Dehmann

Markus said:
I have a class "Data" and I store Data pointers in an STL set. But I have
millions of inserts and many more lookups, and my profiler found that they
cost a lot of runtime.

Therefore, I want to substitute the set<Data*> with a hash_set<Data*>:

typedef hash_set<const Data*, hash<const Data*>, eqData>
DataPointerHashSet;
// typedef set<Data*> DataPointerHashSet; // (see complete code below)

Actually, there are only a few hundred insertions and millions of lookups.
Is hash_set the best structure for that? After I fixed my bug I didn't
find evidence that hash_set worked faster than set. Would a vector and
stl::find be the better solution (since there are only a few hundred
elements)?

Markus
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,744
Messages
2,569,484
Members
44,903
Latest member
orderPeak8CBDGummies

Latest Threads

Top