Defining hash-keys?


C

carl

First a short introduction to the problem:

I am using a kd-tree to find the k-nearest neighbors to a point, P, in 3D
euclidean space. The result is a list of 3D vectors corresponding to vectors
in 3D space. Something like:

[12,45,1]
[102,5,231]
[142,45,1]
[0,425,81]
....

Now each of these points are actually position fields in unique objects that
contains additional info. Currently I store these objects in a std::vector.
I then search the vector for objects that contains the positions returned
from the kd-tree. But this sorta kills the idea of using the kd-tree (in
worst case I need to run through the whole vector).

My ideas was therefore to think of the position returned from the kd-tree as
a key used in a hash-table. If I store my objects in a hash-table using the
position as keys I could then get constant time lookup.

This leads me to the following questions:

1) Where do I find a hash-table hash-map for c++? I have heard of the
unordered set/map but where is it located?
2) Is it possible to use a unique 3D vector like [12,3,4] as a key for the
above hash-table?
 
Ad

Advertisements

A

Alf P. Steinbach

* carl:
1) Where do I find a hash-table hash-map for c++? I have heard of the
unordered set/map but where is it located?

If your compiler offers support for TR1 (the first library extension proposal),
then it's down in some funny namespace and header. Your compiler may also offer
its own hash table implementation. They're not yet standard.

But anyway just use Boost's hash tables.

www.boost.org

2) Is it possible to use a unique 3D vector like [12,3,4] as a key for
the above hash-table?

Yes, but you have to hash it. :) That is, you need a function that given such a
vector value produces an integer in some given range (suitable for the hash
table), where the mapping is such that varying the vector value by small amount
varies the integer by apparently randomly large amount. I think, but not sure,
that you'll have to define this hashing function yourself, then the hash table
takes care of using it to determine placement of your value in the table.


Cheers & hth.,

- Alf
 
C

carl

Alf P. Steinbach said:
* carl:
1) Where do I find a hash-table hash-map for c++? I have heard of the
unordered set/map but where is it located?

If your compiler offers support for TR1 (the first library extension
proposal), then it's down in some funny namespace and header. Your
compiler may also offer its own hash table implementation. They're not yet
standard.

But anyway just use Boost's hash tables.

www.boost.org

2) Is it possible to use a unique 3D vector like [12,3,4] as a key for
the above hash-table?

Yes, but you have to hash it. :) That is, you need a function that given
such a vector value produces an integer in some given range (suitable for
the hash table), where the mapping is such that varying the vector value
by small amount varies the integer by apparently randomly large amount. I
think, but not sure, that you'll have to define this hashing function
yourself, then the hash table takes care of using it to determine
placement of your value in the table.

I have seen some examples where the hash keys are defined as strings. Maybe
the components in the vector could be parsed into a string and then use that
for the hash-key:

int key = [2,4,12]

std::string key = "2412"

Each string version of the key would be unique since each integer key (3D
position vector) is unique.

As a consequence I just need a hash-function that works with strings.
 
C

carl

A simpler approach could be to just use a std::map where the keys are string
representations of the space locations.





carl said:
Alf P. Steinbach said:
* carl:
1) Where do I find a hash-table hash-map for c++? I have heard of the
unordered set/map but where is it located?

If your compiler offers support for TR1 (the first library extension
proposal), then it's down in some funny namespace and header. Your
compiler may also offer its own hash table implementation. They're not
yet standard.

But anyway just use Boost's hash tables.

www.boost.org

2) Is it possible to use a unique 3D vector like [12,3,4] as a key for
the above hash-table?

Yes, but you have to hash it. :) That is, you need a function that given
such a vector value produces an integer in some given range (suitable for
the hash table), where the mapping is such that varying the vector value
by small amount varies the integer by apparently randomly large amount. I
think, but not sure, that you'll have to define this hashing function
yourself, then the hash table takes care of using it to determine
placement of your value in the table.

I have seen some examples where the hash keys are defined as strings.
Maybe the components in the vector could be parsed into a string and then
use that for the hash-key:

int key = [2,4,12]

std::string key = "2412"

Each string version of the key would be unique since each integer key (3D
position vector) is unique.

As a consequence I just need a hash-function that works with strings.
 
J

James Kanze

If your compiler offers support for TR1 (the first library
extension proposal), then it's down in some funny namespace
and header. Your compiler may also offer its own hash table
implementation. They're not yet standard.
But anyway just use Boost's hash tables.
2) Is it possible to use a unique 3D vector like [12,3,4] as
a key for the above hash-table?
Yes, but you have to hash it. :) That is, you need a function
that given such a vector value produces an integer in some
given range (suitable for the hash table), where the mapping
is such that varying the vector value by small amount varies
the integer by apparently randomly large amount. I think, but
not sure, that you'll have to define this hashing function
yourself, then the hash table takes care of using it to
determine placement of your value in the table.

Don't forget the most important characteristic of the hash
function: if a == b, then hash(a) == hash(b). This could easily
be a problem with floats (where +/- O might hash differently, if
the hash function was done naïvely) or unordered collections
(where different ordering of the same values might hash
differenty). If his key is a vector of integers (or unsigned
integers), however, hashing it more or less like a string should
work.

(Note too that I'm not sure I agree with your criteria, at least
as the way you formulate it. Something like FNV hashing or my
Mersenne prime algorithms would result in [12,3,4] and [12,3,5]
having hash values differing by one, but they're still about the
best around.)
 
J

James Kanze

I have seen some examples where the hash keys are defined as
strings. Maybe the components in the vector could be parsed
into a string and then use that for the hash-key:
int key = [2,4,12]
std::string key = "2412"
Each string version of the key would be unique since each
integer key (3D position vector) is unique.
As a consequence I just need a hash-function that works with
strings.

That's a lot more work than necessary. The better string hash
functions actually treat the string as a sequence of unsigned
char---small unsigned integers. Depending on the algorithm and
your data types (and values), either just converting the
elements to unsigned (if this results in no loss of information
over the set of data types) or breaking the values up into two
or more unsigned (which taken together result in no loss of
information) should be enough.
 
Ad

Advertisements

J

James Kanze

A simpler approach could be to just use a std::map where the
keys are string representations of the space locations.

std::map will take an std::vector<int> as a key. No need to
convert to string. (The access will still be O(lg n), rather
than O(1), but that's often enough.)
 
A

Alf P. Steinbach

* James Kanze:
If your compiler offers support for TR1 (the first library
extension proposal), then it's down in some funny namespace
and header. Your compiler may also offer its own hash table
implementation. They're not yet standard.
But anyway just use Boost's hash tables.
2) Is it possible to use a unique 3D vector like [12,3,4] as
a key for the above hash-table?
Yes, but you have to hash it. :) That is, you need a function
that given such a vector value produces an integer in some
given range (suitable for the hash table), where the mapping
is such that varying the vector value by small amount varies
the integer by apparently randomly large amount. I think, but
not sure, that you'll have to define this hashing function
yourself, then the hash table takes care of using it to
determine placement of your value in the table.

Don't forget the most important characteristic of the hash
function: if a == b, then hash(a) == hash(b). This could easily
be a problem with floats (where +/- O might hash differently, if
the hash function was done naïvely) or unordered collections
(where different ordering of the same values might hash
differenty). If his key is a vector of integers (or unsigned
integers), however, hashing it more or less like a string should
work.

(Note too that I'm not sure I agree with your criteria, at least
as the way you formulate it. Something like FNV hashing or my
Mersenne prime algorithms would result in [12,3,4] and [12,3,5]
having hash values differing by one, but they're still about the
best around.)

It depends. Like how exactly the hash table handles collisions (search for next
available, overflow list, what?), and whether values tend to be bundled in small
intervals. I think doing the randomization thing is safest, but considering how
it might lead to excessive paging I'm not even sure about that -- depends also
on OS etc.


Cheers,

- Alf
 
J

James Kanze

* James Kanze:
(Note too that I'm not sure I agree with your criteria, at least
as the way you formulate it. Something like FNV hashing or my
Mersenne prime algorithms would result in [12,3,4] and [12,3,5]
having hash values differing by one, but they're still about the
best around.)
It depends. Like how exactly the hash table handles collisions
(search for next available, overflow list, what?), and whether
values tend to be bundled in small intervals. I think doing
the randomization thing is safest, but considering how it
might lead to excessive paging I'm not even sure about that
-- depends also on OS etc.

Yes. It's not an easy issue to quantify. I was just
disagreeing with your formulation. What you want, of course, is
to have a high probability that two different entries hash to
two different values. But that's really pretty vague, and
depending on the scheme used for conflict resolution, doesn't
address the issue of secondary clustering. The main reason you
might want to "similar" entries to have large differences in
their hash code is to avoid secondary clustering. With the most
frequent conflict resolution: using buckets which can contain
more than one entry, secondary clustering is not an issue.
Since this is the solution adopted by the standard, and by all
of the hash_map that come with your compilers today, and by
Boost, I don't think that secondary clustering is an issue. And
from actual benchmarks, run with a number of different hash
codes, FNV or one of the Mersenne prime algorithms seems by far
the best with such hash tables, despite their poor
characteristics with regards to secondary clustering.
 
C

carl

James Kanze said:
std::map will take an std::vector<int> as a key. No need to
convert to string. (The access will still be O(lg n), rather
than O(1), but that's often enough.)

Hm but how would you convert a 3 element vector like:

v = [3,2,55]

into an integer like:

i = 3255

I think it would be easier to convert each element into a string and then
concatenate them.
 
J

James Kanze

Hm but how would you convert a 3 element vector like:
v = [3,2,55]
into an integer like:
I think it would be easier to convert each element into a string and
then concatenate them.

And what is a string other than a vector of small integers? Just use
the same algorithm you use to generate a hash code for a string.
There
are a few precautions to take, e.g. if the elements are floating
point,
or larger than the type used to calculate the hash code, but they're
not
insurmountable. If worst comes to worse, convert each element of the
vector into a base 256 representation, and hash that. But at least
for
the values you show, just using the same algorithm as you would for a
string will work fine.

Note too that if you just convert to a string, without any separators,
you'll get a fairly poor hash function, since things like [11,2,43]
and
[1,12,43] will hash to the same value.
 
Ad

Advertisements

C

carl

James Kanze said:
Hm but how would you convert a 3 element vector like:
v = [3,2,55]
into an integer like:
I think it would be easier to convert each element into a string and
then concatenate them.

And what is a string other than a vector of small integers? Just use
the same algorithm you use to generate a hash code for a string.
There
are a few precautions to take, e.g. if the elements are floating
point,
or larger than the type used to calculate the hash code, but they're
not
insurmountable. If worst comes to worse, convert each element of the
vector into a base 256 representation, and hash that. But at least
for
the values you show, just using the same algorithm as you would for a
string will work fine.

Note too that if you just convert to a string, without any separators,
you'll get a fairly poor hash function, since things like [11,2,43]
and
[1,12,43] will hash to the same value.




Good point but as you also write this can be solved inserting separators:

int_vector0 = [11,2,43]
unique_string0 = "11,2,43"

int_vector1 = [1,12,43]
unique_string1 = "1,12,43"

I can now use unique_string0 and unique_string1 as keys in a std::map.
 
T

Thomas J. Gritzan

carl said:
James Kanze said:
A simpler approach could be to just use a std::map where the keys
are string representations of the space locations.
std::map will take an std::vector<int> as a key. No need to convert
to string. (The access will still be O(lg n), rather than O(1), but
that's often enough.)
Hm but how would you convert a 3 element vector like:
v = [3,2,55]
into an integer like:
I think it would be easier to convert each element into a string and
then concatenate them.

And what is a string other than a vector of small integers? Just use
the same algorithm you use to generate a hash code for a string.
There
are a few precautions to take, e.g. if the elements are floating
point,
or larger than the type used to calculate the hash code, but they're
not
insurmountable. If worst comes to worse, convert each element of the
vector into a base 256 representation, and hash that. But at least
for
the values you show, just using the same algorithm as you would for a
string will work fine.

Note too that if you just convert to a string, without any separators,
you'll get a fairly poor hash function, since things like [11,2,43]
and
[1,12,43] will hash to the same value.




Good point but as you also write this can be solved inserting separators:

int_vector0 = [11,2,43]
unique_string0 = "11,2,43"

int_vector1 = [1,12,43]
unique_string1 = "1,12,43"

I can now use unique_string0 and unique_string1 as keys in a std::map.

But why don't you want to use the three integers directly as keys in the
std::map?
It costs some (small) time to convert these into their string
representation, and comparing strings is surely slower than just
comparing three ints.
So you decrease the performance of the lookup without good reason.

As James said, you could use a std::vector as key in the map. You could
also use a custom class (when your dimension is fixed) with a custom
std::less.
 
J

James Kanze

[...]
Note too that if you just convert to a string, without any
separators, you'll get a fairly poor hash function, since
things like [11,2,43] and [1,12,43] will hash to the same
value.
Good point but as you also write this can be solved
inserting separators:
int_vector0 = [11,2,43]
unique_string0 = "11,2,43"
int_vector1 = [1,12,43]
unique_string1 = "1,12,43"
I can now use unique_string0 and unique_string1 as keys in a
std::map.
But why don't you want to use the three integers directly as
keys in the std::map?

Because he wants a hash table, and not std::map:).

Still, I don't understand this desire to pass by a string at all
costs (and the conversion isn't that cheap, and if performance
isn't an issue, then we'd just use std::map and the already
defined ordering relationship). A string really isn't anything
else but an array of integers, and all of the hash functions I
know treat it as such. So why go through all the bother of
converting, when the hash functions work on arrays of integers,
and we have an array of integers.
As James said, you could use a std::vector as key in the map.
You could also use a custom class (when your dimension is
fixed) with a custom std::less.

Unless there was some good reason not to, I'd probably
encapsulate the data in a class, but the implementation of the
required functions in that class would just forward to std::less
on the vector.
 
T

Thomas J. Gritzan

James said:
carl said:
[...]
Note too that if you just convert to a string, without any
separators, you'll get a fairly poor hash function, since
things like [11,2,43] and [1,12,43] will hash to the same
value.
Good point but as you also write this can be solved
inserting separators:
int_vector0 = [11,2,43]
unique_string0 = "11,2,43"
int_vector1 = [1,12,43]
unique_string1 = "1,12,43"
I can now use unique_string0 and unique_string1 as keys in a
std::map.
But why don't you want to use the three integers directly as
keys in the std::map?

Because he wants a hash table, and not std::map:).

Actually, the OP wrote...
A simpler approach could be to just use a std::map where the keys are
string representations of the space locations.
....in the part you just snipped :)

So I guess any fast lookup (faster than linear search) would be fine.
std::map would be a good default choice if he doesn't have boost or TR1
(or another implementation of a hash map) at hand. If the performance is
not good enough, switching to a hash map is easy.
Still, I don't understand this desire to pass by a string at all
costs [...]
As James said, you could use a std::vector as key in the map.
You could also use a custom class (when your dimension is
fixed) with a custom std::less.

Unless there was some good reason not to, I'd probably
encapsulate the data in a class, but the implementation of the
required functions in that class would just forward to std::less
on the vector.

While std::vector is fine in general, why bother with a resizeable
container when you work in a 3D space? That would do an extra memory
allocation for every 3D "vector".

With TR1, I would think about std::array instead. It surely has
perfectly fine op< and hash functions.
 
Ad

Advertisements

J

James Kanze

James said:
carl schrieb:
news:[email protected]m...
[...]
Still, I don't understand this desire to pass by a string at all
costs [...]
As James said, you could use a std::vector as key in the map.
You could also use a custom class (when your dimension is
fixed) with a custom std::less.
Unless there was some good reason not to, I'd probably
encapsulate the data in a class, but the implementation of
the required functions in that class would just forward to
std::less on the vector.
While std::vector is fine in general, why bother with a
resizeable container when you work in a 3D space? That would
do an extra memory allocation for every 3D "vector".

Yes. I was not looking at it from a higher level: he had a
vector, so I suggested encapsulating a vector, but if the number
of elements is always exactly 3 (likely the case), then a vector
is really overkill (not to mention a lot of extra CPU cycles for
nothing).
With TR1, I would think about std::array instead. It surely
has perfectly fine op< and hash functions.

Yes. (Actually, I don't know about the hash functions. I don't
think that hash functions are provided for the containers.)
 
Ad

Advertisements


Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top