How to get an insertion hint for an unordered associative container?


Pavel

I am trying to figure out if I can use tr1/C++0x std::unordered_map to
speed up an algorithm I previously implemented using an ordered map.

The problem is quite common:

BEGIN
Given a key 'k', if no equivalent of 'k' is in the container, create a
value and insert it; otherwise, update part of the value that is already
in the container in a specific way.

Creating the value (even a "default" one) comes at a cost, so I do not
want to create it if I do not need to insert.
END

With ordered associative containers I solved it via equal_range() or
lower_bound(), depending on the nature of the key, and a subsequent
insert() with the "hint" iterator returned by one of these functions if
'k' was not found. With the unordered containers, lower_bound() is not
there for me, and equal_range() is useless because it returns a pair of
end iterators when 'k' is not found, as opposed to a useful insertion
point, so I would have to look up again in order to insert.

Am I missing any known pattern of use of unordered associative
containers that would help to avoid an extra value-creation? Any help
would be appreciated.

-Pavel
 

Saeed Amrollahi

Pavel wrote:
I am trying to figure out if I can use tr1/C++0x std::unordered_map to
speed-up an algorithm I did before using an ordered map.

The problem is quite common:

BEGIN
Given a key 'k', if no equivalent of 'k' is on the container, create
value and insert it; otherwise, update part of the value that is already
on the container in a specific way.

Creating the value (even a "default" one) comes at cost so I do not want
to create it if I do not need to insert.
END

With ordered ass. containers I solved it via equal_range() or
lower_bound() depending on the nature of the key and subsequent insert()
with "hint" iterator returned by one of these functions if 'k' was not
found. With unordered, lower_bound() is not there for me and
equal_range() is useless as it returns a couple of end iterators if 'k'
is not found as opposed to a useful insertion point so I would have to
lookup again to insert.

Am I missing any known pattern of use of unordered associative
containers that would help to avoid an extra value-creation? Any help
would be appreciated.

-Pavel

Hi Pavel
AFAIK, the unordered associative containers provide fast retrieval of
data based on keys. As you know, an (ordered) map is designed and
implemented using a binary tree, so the complexity of retrieving data
is O(log n). In the case of an unordered map, because it uses a hash
function, no tree traversal is required and the lookup operation is
faster on average. Of course, we need a good hash function with as few
collisions as possible.
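For example, the hash function is just a template parameter. A
deliberately poor one like the sketch below is perfectly legal, but it
pushes most keys into a few buckets and the constant-time lookup
degrades (FirstCharHash is only for illustration):

#include <tr1/unordered_map>
#include <string>
#include <cstddef>

// Hashing only on the first character causes many collisions: all keys
// starting with the same letter end up in the same bucket.
struct FirstCharHash
{
    std::size_t operator()(std::string const& s) const
    {
        return s.empty() ? 0u : static_cast<std::size_t>(s[0]);
    }
};

typedef std::tr1::unordered_map<std::string, int, FirstCharHash> BadMap;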

Regards,
-- Saeed Amrollahi
 

Pavel

Saeed said:
Hi Pavel
AFAIK, the unordered associative containers provide fast retrieval of
data based on keys. As you know, an (ordered) map is designed and
implemented using a binary tree, so the complexity of retrieving data
is O(log n). In the case of an unordered map, because it uses a hash
function, no tree traversal is required and the lookup operation is
faster on average. Of course, we need a good hash function with as few
collisions as possible.

Regards,
-- Saeed Amrollahi

Thanks Saeed,

Unfortunately, this was not my question. My question is whether one can
solve the problem above with only one look-up in the map and without
constructing a value object unless it is actually needed. This is
possible with std::map; the question is whether the same is doable with
std::unordered_map.

-Pavel
 

James Kanze

Pavel wrote:
I am trying to figure out if I can use tr1/C++0x
std::unordered_map to speed-up an algorithm I did before using
an ordered map.
The problem is quite common:
BEGIN
Given a key 'k', if no equivalent of 'k' is on the container,
create value and insert it; otherwise, update part of the
value that is already on the container in a specific way.
Creating the value (even a "default" one) comes at cost so I
do not want to create it if I do not need to insert.
END
With ordered ass. containers I solved it via equal_range() or
lower_bound() depending on the nature of the key and
subsequent insert() with "hint" iterator returned by one of
these functions if 'k' was not found. With unordered,
lower_bound() is not there for me and equal_range() is useless
as it returns a couple of end iterators if 'k' is not found as
opposed to a useful insertion point so I would have to lookup
again to insert.
Am I missing any known pattern of use of unordered associative
containers that would help to avoid an extra value-creation?
Any help would be appreciated.

One "standard" solution even with ordered containers is
something like:

MapType::iterator entry = myMap.find(key);
if ( entry == myMap.end() ) {
    // not found: construct the value and insert it (a second look-up)
} else {
    // found: update entry->second in place
}

This avoids the extra construction, at the cost of two look-ups,
when inserting. The cost of the second look-up can be
attenuated with ordered containers by using lower_bound, rather
than find, and using the return value as a hint---this does cost
an extra comparison in your code, however. The cost of the
second look-up in an unordered container cannot easily be
avoided, but it is O(1), and not O(lg n). The main potentially
avoidable cost would be calculating the hash code twice, but I
don't think that the interfaces currently available have any
provision to support this. (Basically, it would mean
calculating the hash value in user code, and having additional
variants of functions like find and insert which take a
pre-calculated hash value.)
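Spelled out for an unordered_map, the two-look-up version is something
like this (Value and make_value() are just stand-ins):

#include <tr1/unordered_map>
#include <string>

struct Value { int n; };                   // stand-in for the expensive value
Value make_value() { return Value(); }     // the construction to be avoided

typedef std::tr1::unordered_map<std::string, Value> Map;

void add_or_update(Map& m, std::string const& k)
{
    Map::iterator entry = m.find(k);                  // first hash computation
    if ( entry == m.end() )
        m.insert(Map::value_type(k, make_value()));   // second hash computation
    else
        entry->second.n *= 2;                         // or whatever the update is
}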

IIRC, the unordered containers do have variants of insert which
take a "hint", but this is only present for compatibility with
the existing ordered containers; the hint is not used.
 

Pavel

James said:
One "standard" solution even with ordered containers is
something like:

MapType::iterator entry = myMap.find(key);
if ( entry == myMap.end() ) {
    // not found: construct the value and insert it (a second look-up)
} else {
    // found: update entry->second in place
}

This avoids the extra construction, at the cost of two look-ups,
when inserting. The cost of the second look-up can be
attenuated with ordered containers by using lower_bound,
Thanks -- that's what I referred to in the original post. With ordered
containers, I have no problems (some argue equal_range should be used
instead of lower_bound; I usually use lower_bound; but of course for
unordered_.. equal_range() would be the only solution if it were defined
properly -- see below).


rather
than find, and using the return value as a hint---this does cost
an extra comparison in your code, however. The cost of the
second look-up in an unordered container cannot easily be
avoided, but it is O(1), and not O(lg n). The main potentially
avoidable cost would be calculating the hash code twice, but I
don't think that the interfaces currently available have any
provision to support this.
That is actually avoidable in the tr1 unordered map (you can access
bucket(const key_type &) etc.). However, O(1) (on average; the worst
case is linear) is not 1. In my domain area, spending 2*O(1) instead of
O(1) often means "you won, I lost". So, I am trying to get to O(1) from
2*O(1). If one assumes that the lookup algorithm in the hash table is
always like this:

1. Identify the bucket.
2. Linearly search the bucket for an equal key, a free spot, or the
end of the bucket.

(which it is in my version of the GNU implementation of tr1
unordered_map), I could even say my problem is solved (if only they used
the hint, which they don't, as you correctly suggested). But it is not
guaranteed.
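Concretely, the probe itself can be done with a single hash computation
through the bucket interface; it is only the insert() on the miss path
that has to hash again. A sketch (TR1 names, int values just for
illustration):

#include <tr1/unordered_map>
#include <string>

typedef std::tr1::unordered_map<std::string, int> Map;

// Returns a pointer to the mapped value if k is present, 0 otherwise.
// bucket() is the only call that computes the hash.
int* probe(Map& m, std::string const& k)
{
    Map::size_type b = m.bucket(k);                           // step 1: identify the bucket
    for ( Map::local_iterator it = m.begin(b); it != m.end(b); ++it )
        if ( it->first == k )                                 // step 2: linear scan
            return &it->second;
    return 0;                                                 // miss: insert() would hash again
}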


(Basically, it would mean
calculating the hash value in user code, and having additional
variants of functions like find and insert which take a
pre-calculated hash value.)

IIRC, the unordered containers do have variants of insert which
take a "hint", but this is only present for compatibility with
the existing ordered containers; the hint is not used.
Just noticed that... I don't understand why they don't use it: they have
to re-compute the same information in insert() again (at least the
bucket index and the hash code). I hope they will use it in the future,
though, and that the API is there to allow optimization, not just for
compatibility.

But the definition of equal_range() to return two end() iterators on the
"not found" condition is a hog -- I can't understand why this could not
be defined as "return an empty range" of two identical iterators that
could be used as a hint for insertion, consistent with how it is defined
for ordered associative containers. What is the benefit of requiring
them both to be end()? Checking for the "not found" condition costs one
iterator comparison either way... it seems like a waste to me.
 

James Kanze

[...]
That is actually avoidable in tr1/unordered map (you can
access bucket(const key_type &) etc). However, O(1) (on
average, worst case is linear), is not 1. In my domain area,
spending 2*O(1) instead of O(1) often means "you won, I lost".
So, I am trying to get to O(1) from 2 * O(1). If one assumes
that the lookup algorithm in hash is always like this:
1. Identify the bucket.
2. Linearly search in the bucket for the equal key or free spot or
end-of-bucket.

I'm not sure that the (upcoming) standard requires linear search
in the bucket, but it does require buckets, which are accessible
for some things (but I'm not sure what) from the user interface.
I think that the available functionality is only sufficient for
instrumentation---determining how effective your hash function
actually is.
(which it is in my version of GNU implementation of tr1
unordered_map). I could even say my problem if solved (if
only they used the hint, which they don't as you correctly
suggested). But it is not guaranteed.

The "hint" is in the form of an iterator. I'm not sure how they
could use it.

It might be useful if they provided functions for looking up and
inserting into a given bucket (at your own risk, of course);
you could then call c.bucket(key) (which calculates the hash
code), and do all further operations on the returned bucket.
Probably a bit too dangerous, however.
Just noticed that.. Don't understand why they don't use it:

Because it's an iterator, and an iterator doesn't contain any
useful information to use in the case of an unordered container.
they have to re-compute same information in insert() again (at
least bucket index and hash code). I hope they will use it in
the future though and that API is there to allow optimization,
not just for compatibility..

Explain how? The iterator doesn't contain the hash code in any
way.

Most of the hash maps I've written in the past would cache the
last element found, in order to support things like:

if ( m.contains(x) ) {
    m[x] = 2*m[x] ;     // or whatever...
}

efficiently. It still required a comparison for each access, in
order to know whether I could use the cached value, but it did
avoid multiple calculations of the hash code when accessing the
same element several times in succession.
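Very roughly, the caching mechanism looked something like the following
(just the idea, with std::tr1::unordered_map standing in for my own
table and int values for simplicity):

#include <tr1/unordered_map>
#include <string>

// contains(x) remembers the element it found, so that an immediately
// following operator[](x) does not hash x a second time.
class CachingMap
{
    typedef std::tr1::unordered_map<std::string, int> Impl;
    Impl impl_;
    Impl::iterator last_;
    bool lastValid_;
public:
    CachingMap() : lastValid_(false) {}

    bool contains(std::string const& key)
    {
        last_ = impl_.find(key);                  // one hash computation + comparison
        lastValid_ = (last_ != impl_.end());
        return lastValid_;
    }

    int& operator[](std::string const& key)
    {
        if ( lastValid_ && last_->first == key )  // the extra comparison mentioned above
            return last_->second;
        lastValid_ = false;                       // an insert below may invalidate last_
        return impl_[key];                        // full look-up (inserts if absent)
    }
};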
But the equal_range() definition to return two end() iterators
on "not found" condition is a hog -- I can't understand why
this could not be defined as "return an empty range" so either
of two identical iterators that could be used as a hint for
insertion, consistent with how it's defined for ordered ass.
containers.

What does "consistent with how it's defined for ordered
containers" mean here? The value returned from lower_range
defines where any new element should be inserted, according to
the ordering relationship. In an unordered container, there is
no specific place where the new element should be inserted.
What you're asking for really isn't possible.
What is the benefit of requiring them both be end()? Checking
for "not found" condition costs one iterator comparison either
way.. seems like waste to me.

No benefit, perhaps, but no real harm either. Ordered and
unordered containers are fundamentally different beasts.

If you're concerned about the time necessary to calculate the
hash function, you could use a wrapper for the key, which caches
the hash function once it has been calculated, and uses the
cached value if it is present. Most of the time, I suspect that
it won't make a difference (although for strings, if the keys
can be fairly long, maybe). So you end up with something like:

class CachedHashString : private std::string
{
    mutable bool isHashed;          // mutable: hash() is const but fills the cache
    mutable unsigned hashValue;
public:
    // Duplicate any needed constructors, setting isHashed
    // to false.

    // using declarations for whatever functions you want
    // to expose (only const!)
    unsigned hash() const
    {
        if ( ! isHashed ) {
            hashValue = ... ;       // compute the hash of the string contents
            isHashed = true;
        }
        return hashValue;
    }
};

and the hash function you use with the container uses the member
hash function.
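Hooking that into the container might look something like this
(CachedHash is just a name I'm inventing here; it also assumes
CachedHashString exposes an equality operator for the default
key_equal, which the fragment above leaves out):

#include <tr1/unordered_map>
#include <cstddef>

struct CachedHash
{
    std::size_t operator()(CachedHashString const& s) const
    {
        return s.hash();            // uses the cached value after the first call
    }
};

// std::equal_to<CachedHashString> (the default key_equal) needs an
// accessible operator==, which is assumed but not shown above.
typedef std::tr1::unordered_map<CachedHashString, int, CachedHash> Map;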
 

James Kanze

James Kanze wrote:
Unordered containers have member functions begin(size_type n)
and end(size_type n) that give you a pair of iterators that
point to the sequence of elements in bucket n.

But could they be used somehow to optimize lookup and insertion?
Or anything else, for that matter: what use do they have other
than instrumentation?
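For what it's worth, the instrumentation use would look something like
this (dumping which keys landed in which bucket, to judge the hash
function):

#include <tr1/unordered_map>
#include <iostream>
#include <string>

typedef std::tr1::unordered_map<std::string, int> Map;

// Many keys crowded into a few buckets means the hash function is poor.
void dump_buckets(Map const& m)
{
    for ( Map::size_type b = 0; b != m.bucket_count(); ++b ) {
        std::cout << "bucket " << b << ':';
        for ( Map::const_local_iterator it = m.begin(b); it != m.end(b); ++it )
            std::cout << ' ' << it->first;
        std::cout << '\n';
    }
}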
 

Pavel

James said:
James said:
On Nov 21, 8:34 pm, Pavel
[...]
rather than find, and using the return value as a
hint---this does cost an extra comparison in your code,
however. The cost of the second look-up in an unordered
container cannot easily be avoided, but it is O(1), and not
O(lg n). The main potentially avoidable cost would be
calculating the hash code twice, but I don't think that the
interfaces currently available have any provision to support
this.
That is actually avoidable in tr1/unordered map (you can
access bucket(const key_type&) etc). However, O(1) (on
average, worst case is linear), is not 1. In my domain area,
spending 2*O(1) instead of O(1) often means "you won, I lost".
So, I am trying to get to O(1) from 2 * O(1). If one assumes
that the lookup algorithm in hash is always like this:
1. Identify the bucket.
2. Linearly search in the bucket for the equal key or free spot or
end-of-bucket.

I'm not sure that the (upcoming) standard requires linear search
in the bucket, but it does require buckets, which are accessible
for some things (but I'm not sure what) from the user interface.
I think that the available functionality is only sufficient for
instrumentation---determining how effective your hash function
actually is.
(which it is in my version of GNU implementation of tr1
unordered_map). I could even say my problem if solved (if
only they used the hint, which they don't as you correctly
suggested). But it is not guaranteed.

The "hint" is in the form of an iterator. I'm not sure how they
could use it.

It might be useful if they provided functions for looking up and
inserting into a given bucket (at your own risk, of course);
you could then call c.bucket(key) (which calculates the hash
code), and do all further operations on the returned bucket.
Probably a bit too dangerous, however.
Just noticed that.. Don't understand why they don't use it:

Because it's an iterator, and an iterator doesn't contain any
useful information to use in the case of an unordered container.
they have to re-compute same information in insert() again (at
least bucket index and hash code). I hope they will use it in
the future though and that API is there to allow optimization,
not just for compatibility..

Explain how? The iterator doesn't contain the hash code in any
way.
An iterator may contain anything that addresses the element, in
particular the hash code (it does so in one version of the GCC
hashtable; more precisely, it points to the node that contains the hash
code). It has to contain something to allow navigation (++, --). For
example, it could contain:
1. A handle for constant-time access to the bucket, to be able to find
out where the bucket ends (such as an index of the bucket in some
array/vector/deque, or a pointer to it),
2. A handle for constant-time access to the element in the bucket
(another index or whatever),
3. A handle for constant-time access to the container (to know how to
navigate to the next bucket, as you need for ++, --). Again, a pointer
or similar.

If buckets contain pointers to the actual objects (which sounds
feasible, as the Standard guarantees that references and pointers to the
objects are not invalidated during re-hashes), the above is quite enough
to insert the object "at" the given iterator, or at the first available
space in the bucket pointed to by the iterator.

Most of the hash maps I've written in the past would cache the
last element found, in order to support things like:

if ( m.contains(x) ) {
m[x] = 2*m[x] ; // or whatever...
}
I understand you can optimize the container for the usage pattern you
think is common. This comes at a cost even for this use, however, as the
comparison may be much more expensive than the hash-function
computation, and you guarantee an extra comparison in m[x] at all times.
We know we can do without it (see above), so why should we live with it?
efficiently. It still required a comparison for each access, in
order to know whether I could use the cached value, but it did
avoid multiple calculations of the hash code when accessing the
same element several times in succession.
You are making assumptions that may well be true for your particular
problem, but I would not rely on them in a general-purpose library like
the STL.
What does "consistent with how it's defined for ordered
containers" mean here? The value returned from lower_range
defines where any new element should be inserted, according to
the ordering relationship. In an unordered container, there is
no specific place where the new element should be inserted.
What you're asking for really isn't possible.
Why not? You can have a bucket and an index into the bucket, and the
bucket knows the index of its last element. If the pointer to the
element stored in the bucket (think of the bucket as an array/vector of
pointers, although it does not have to be one) is NULL, insert your
element into this spot; otherwise, if the bucket has free space, insert
it at the end of the bucket; otherwise, you have to re-hash (but you
would have had to anyway; supposedly, you still get "amortized constant"
average insertion time).
No benefit, perhaps, but no real harm either. Ordered and
unordered containers are fundamentally different beasts.
If you returned two identical iterators pointing to a free spot in the
correct bucket (say the first free spot), you would know to insert your
element there without any computations whatsoever.


If you're concerned about the time necessary to calculate the
hash function, you could use a wrapper for the key, which caches
the hash function once it has been calculated, and uses the
cached value if it is present. Most of the time, I suspect that
it won't make a difference (although for strings, if the keys
can be fairly long, maybe). So you end up with something like:

class CachedHashString : private std::string
{
bool isHashed;
unsigned hashValue;
public:
// Duplicate any needed constructors, setting isHashed
// to false.

// using declarations for whatever functions you want
// to expose (only const!)
unsigned hash() const
{
if ( ! isHashed ) {
hashValue = ... ;
isHashed = true;
}
return hashValue;
}
};

and the hash function you use with the container uses the member
hash function.
I know... but as per the above, I do not think this is necessary. The
cost of insertion can be anything -- in case of a collision it may even
be a secondary hash, or a linear or non-linear search in an overflow
area, or similar. The Standard does not define exactly how overflows are
handled.

To summarize my point:

The Standard recognizes that the hint may be useful (it is still in the
API for the unordered associative containers), and I am OK with the GCC
STL not using it now -- it or another implementation may do so in the
future, or I can write it myself, and the client code will continue to
rely on a Standard-compliant unordered associative container while
enjoying the faster implementation.

But I do have an issue with the requirement that equal_range() return
two end() iterators on the "not found" condition instead of such a hint.
I think it is a defect in the Standard that limits possible
optimizations without good reason.
 

James Kanze

James said:
James Kanze wrote:
On Nov 21, 8:34 pm, Pavel
[...]
they have to re-compute same information in insert() again (at
least bucket index and hash code). I hope they will use it in
the future though and that API is there to allow optimization,
not just for compatibility..
Explain how? The iterator doesn't contain the hash code in any
way.
Iterator may contain anything that addresses the element, in
particular hash_code (it does so in one version of GCC
hashtable, more precisely, points to the node that contains
hash_code).

That's true, but how would that help? The hint has to be
checked; it's not an absolute. So the insertion code would
still have to calculate the hash code for the element to be
inserted. And having done that, having the iterator will not
make much difference.
It has to contain something to allow navigation (++, --). For
example, it could contain:
1. A handle for constant-time access to the bucket, to be able to find
out where the bucket ends (like an index of the bucket in some
array/vector/deque or a pointer to it),
2. A handle for constant-time access to element in the bucket (another
index or whatever)
3. A handle for constant-time access to the container (to know how to
navigate to the next bucket as you need for ++, --). Again, a pointer or
similar.
If buckets contain pointers to actual objects (which sounds
feasible as the Standard guarantees the references and
pointers to the objects are not invalidated during re-hashes),
the above is quite enough to insert the object "at" given
iterator or at first available space in the bucket pointed to
by the iterator.

You seem to be forgetting that the iterator is only a hint, and
that the insertion has to work correctly even if the hint isn't
appropriate. Which means that the inserter would still have to
verify that the element to be inserted belongs in the bucket
designated by the hint. If it doesn't, it has to find the
correct bucket; if it does, it can use the bucket designated by
the iterator. But finding the correct bucket is O(1), with very
low constant time; using the bucket designated by the iterator
doesn't really gain enough to make the extra tests worthwhile.

Note how this is different for ordered containers. The
implementation can verify in constant time whether the insertion
should be adjacent to the given iterator, where as finding the
correct position from scratch is O(lg n). In a hash table,
everything is O(1), and the only potentially expensive operation
which you'd like to eliminate is recalculating the hash code.
To do this, the implementation would have to consider the hint
as guaranteed, and just insert the object in the bucket obtained
from the iterator. Which wouldn't be conforming, and probably
wouldn't work in one of the more frequent uses of the hinted
version of insert: in an insert_iterator. (I suspect that
supporting insert_iterator is a major motivation for having the
hinted inserter in the hashed containers. You can't use
back_insert_iterator, since for obvious reasons, the container
doesn't support push_back, and you do want to be able to use
std::copy to insert elements into the container.)
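That use looks something like this (std::inserter builds an
insert_iterator whose assignment calls the hinted insert on the
container):

#include <tr1/unordered_set>
#include <algorithm>
#include <iterator>
#include <vector>

void copy_into(std::vector<int> const& src,
               std::tr1::unordered_set<int>& dst)
{
    // Each assignment through the insert_iterator does
    // dst.insert(hint, value); without the hinted overload this
    // would not compile.
    std::copy(src.begin(), src.end(), std::inserter(dst, dst.begin()));
}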
Most of the hash maps I've written in the past would cache
the last element found, in order to support things like:
if ( m.contains(x) ) {
m[x] = 2*m[x] ; // or whatever...
}
I understand you can optimize the set for the usage pattern
you think is common. This comes at cost even for this use,
however, as the comparision may be much more expensive than
hash-function computation and you guarantee you will have an
extra comparison in m[x] at all times. We know we can do
without (see above) so why should we live with it.

It's a common usage pattern with my hashed containers: since []
has a precondition that the entry exists, you often end up
writing things like the above (an if ( m.contains(x) ) followed by
the use). With the STL idiom, of course, the above would probably be
written:

Map::iterator elem = m.find(x);
if ( elem != m.end() ) {
    elem->second = 2 * elem->second ;   // or whatever...
}

The STL avoids the multiple look-up, at the cost of (IMHO) a less
natural syntax. Given my syntax, however, caching will be a
win: you do have the comparison each time, but you at least
avoid the extra calculation of the hash code.
You are making assumptions that may be very true for your
particular problem but I would not use these in a
general-purpose library like STL.

I think it is more a question of the idioms associated with the
container than whether it is general purpose or not. With the
STL, you use find and then the returned iterator; with mine, you
use contains(), and then [].
Why not? You can have a bucket and an index in the bucket and
the bucket has the index if the last element. If a pointer to
the element stored the bucket (think of the bucket as an
array/vector of pointers although it does not have to be it)
is NULL, insert your element into this spot; otherwise, if the
bucket has free space insert it at the end of the bucket;
otherwise, you have to re-hash (but you would have anyway;
supposedly, you still have "amortized constant" for the
average insertion time)

Several points, but the most important one is what I mentioned
before: the function must work even if the hint is incorrect.
And typically, the bucket doesn't have a fixed length: you don't
rehash because there are more than n elements in a single
bucket; you rehash because size()/bucket_count() passes a
pre-established limit.

With regards to the original poster's problem (he didn't want to
construct a complete object if the entry was already there, and
he didn't want to calculate the hash code multiple times), the
interface could be extended with an additional function,
insert_into_bucket, so that he could get the bucket using
bucket, then iterate over it to see if his target was already
present, and if not use this new function for the insertion.
Whether it's worth it is another question: it's a very
specialized use, and most users would probably be content with
just insert (which only inserts if not present), despite having
to create a new complete object.
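Usage of such a (purely hypothetical) insert_into_bucket extension is
sketched below; since it does not exist, the last line falls back to the
standard insert(), which is exactly where the second hash computation
happens (Value and make_value() are stand-ins):

#include <tr1/unordered_map>
#include <string>

struct Value { int n; };                   // stand-in for the expensive value
Value make_value() { return Value(); }     // the construction to be avoided

typedef std::tr1::unordered_map<std::string, Value> Map;

void add_or_update(Map& m, std::string const& k)
{
    Map::size_type b = m.bucket(k);        // the only hash computation we control
    for ( Map::local_iterator it = m.begin(b); it != m.end(b); ++it )
        if ( it->first == k ) {
            it->second.n += 1;             // found: update in place, nothing constructed
            return;
        }
    // A hypothetical m.insert_into_bucket(b, ...) would insert right here
    // without re-hashing; the standard interface forces us back to
    // insert(), which hashes k a second time.
    m.insert(Map::value_type(k, make_value()));
}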
If you returned two identical iterators pointing to a free
spot in the correct bucket (say the first free spot), you
would know to insert your element there without any
computations whatsoever.

You still need to calculate the hash code of the object being
inserted, to validate the hint (and in a unique hash table, do
some comparisons to ensure that the element isn't already
present). And once you've done that, there's not much
calculation left anyway.

[...]
The cost of
insertion can be anything -- in case of conflict it may even be some
secondary hash or linear or non-linear search in the overflow area or
similar. The Standard does not define how exactly the overflows are
processed.

It does require buckets, since it exposes this detail at the
interface level. The guarantees with regards to this part of
the interface are rather vague, but I would expect that if
(double)(m.size() + 1) / m.bucket_count() <= m.max_load_factor(), then:
size_t b = m.bucket(obj);
m.insert(obj);
assert(b == m.bucket(obj));
should hold. If so, that pretty much defines how overflows are
handled.
To summarize my point:
The Standard recognizes the hint may be useful (it is still in
the API for unordered ass. containers)

Does it recognize a utility, other than compatibility of
interface (so that, e.g., you can instantiate an insert_iterator on
an unordered associative container)? (I'm not sure I understand the
meaning of "The iterator q is a hint pointing to where the
search should start" in the standard. It doesn't make too much
sense when buckets are used for collision handling, at least to
me.)
and I am ok that GCC STL does not use it now -- it or another
implementation may do it in the future or I can write it
myself and the client code will continue to rely on the
Standard-compliant unordered ass. container while enjoying
the faster implementation.

I'd be interested in seeing an actual implementation which does
use it somehow.
But, I do have an issue with the requirement to return two
end() iterators in equal_range() on "not found" condition
instead of such a hint. I think it is a defect in the Standard
that limits possible optimizations without good reason.

I would agree that it seems to be a case of overspecification.
Even if I don't see any possible advantages in returning
something else at present, that doesn't mean that they can't
exist, if not now, in the future, and there's no real advantage
in excluding them.
 

Pavel

James said:
James said:
On Nov 24, 5:24 am, Pavel
James Kanze wrote:
On Nov 21, 8:34 pm, Pavel
[...]
they have to re-compute same information in insert() again (at
least bucket index and hash code). I hope they will use it in
the future though and that API is there to allow optimization,
not just for compatibility..
Explain how? The iterator doesn't contain the hash code in any
way.
Iterator may contain anything that addresses the element, in
particular hash_code (it does so in one version of GCC
hashtable, more precisely, points to the node that contains
hash_code).

That's true, but how would that help? The hint has to be
checked; it's not an absolute. So the insertion code would
still have to calculate the hash code for the element to be
inserted. And having done that, having the iterator will not
make much difference.
It has to contain something to allow navigation (++, --). For
example, it could contain:
1. A handle for constant-time access to the bucket, to be able to find
out where the bucket ends (like an index of the bucket in some
array/vector/deque or a pointer to it),
2. A handle for constant-time access to element in the bucket (another
index or whatever)
3. A handle for constant-time access to the container (to know how to
navigate to the next bucket as you need for ++, --). Again, a pointer or
similar.
If buckets contain pointers to actual objects (which sounds
feasible as the Standard guarantees the references and
pointers to the objects are not invalidated during re-hashes),
the above is quite enough to insert the object "at" given
iterator or at first available space in the bucket pointed to
by the iterator.

You seem to be forgetting that the iterator is only a hint, and
that the insertion has to work correctly even if the hint isn't
appropriate.
Point taken -- the hint-using code must still compute the hash code.
Most of the hash maps I've written in the past would cache
the last element found, in order to support things like:
if ( m.contains(x) ) {
m[x] = 2*m[x] ; // or whatever...
}
I understand you can optimize the set for the usage pattern
you think is common. This comes at cost even for this use,
however, as the comparision may be much more expensive than
hash-function computation and you guarantee you will have an
extra comparison in m[x] at all times. We know we can do
without (see above) so why should we live with it.

It's a common usage pattern with my hashed containers: since []
has a precondition that the entry exists,
You mean -- in your container, not STL? I don't like STL's inserting
things into maps behind my back either, e.g.:

std::map<int, int> m;
int i = m[5];
/* it's not exactly intuitive that we have just created a (5, 0) entry
on the map.. */

That's why I mostly use explicit insert, find etc.

you often end up
writing things like the above (and if (m.contains) followed by
use. With the STL idiom, of course, the above would probably be
written:

Map::iterator elem = m.find(x);
if ( elem != m.end() ) {
    elem->second = 2 * elem->second ;   // or whatever...
}

The STL avoids the multiple look-up, at the cost of (IMHO) a less
natural syntax. Given my syntax, however, caching will be a
win: you do have the comparison each time, but you at least
avoid the extra calculation of the hash code.
Sometimes hash code computation is more expensive than comparison, and
sometimes it isn't. It depends on how you calculate the hash code:
sometimes it is feasible to use only a part of the whole key that is
known to usually be more distinctive.
You are making assumptions that may be very true for your
particular problem but I would not use these in a
general-purpose library like STL.

I think it more a question of the idioms associated with the
container, than whether it is general purpose or not. With the
STL, you use find and then the returned iterator; with mine, you
use contain(), and then [].
I meant the assumption that hash-code calculation is the most expensive
part of the search. Value comparison can sometimes be more expensive,
and walking the bucket may take some time (a bucket of size 5 or 6 is
still traversable in O(1), but the constant becomes too big :)).
Several points, but the most important one is what I mentioned
before: the function must work even if the hint is incorrect.
True, as admitted above. Still, some gain should be possible.
And typically, the bucket doesn't have a fixed length: you don't
rehash because there are more than n elements in a single
bucket; you rehash because size()/bucket_count() passes a
pre-established limit.
Yes, point taken... What I was alluding to was a map implementation that
was not STL-compliant; the essence does not change, though: please
replace "re-hash" with "add more space to the end of the bucket".
With regards to the original poster's problem (he didn't want to
construct a complete object if the entry was already there, and
he didn't want to calculate the hash code multiple times)
I was the original poster BTW :)

, the
interface could be extended with a additional function,
insert_into_bucket, so that he could get the bucket using
bucket, then iterate over it to see if his target was already
present, and if not use this new function for the insertion.
Whether it's worth it is another question: it's a very
specialized use, and most users would probably be content with
just insert (which only inserts if not present), despite having
to create a new complete object.
That's what I ended up with for a while (sigh). But it's quite probable
I will have to return to it, as the lookup in this map seems to take the
lion's share of CPU time, and the nature of the problem is such that 99%
of all inserted values get removed by the end of day, so I never know
whether a given key is already in or not. I need to measure more to see
what actually takes the time within this lookup.
You still need to calculate the hash code of the object being
inserted, to validate the hint (and in a unique hash table, do
some comparisons to ensure that the element isn't already
present). And once you've done that, there's not much
calculation left anyway.
As I said above, there is some: comparison and potential navigation in
the bucket. It is usually linear, and the unpleasant part is that a
reasonable bucket length is only guaranteed on average; where selecting
a good hash function is difficult (not the case in my current problem),
it is possible to accidentally push some 50% of the whole key population
into one bucket and then spend n/4 iterations on average for each
lookup, while the average number of elements per bucket is still below
whatever the implementation believes is a reasonable limit.
[...]
The cost of
insertion can be anything -- in case of conflict it may even be some
secondary hash or linear or non-linear search in the overflow area or
similar. The Standard does not define how exactly the overflows are
processed.

It does require buckets, since it exposes this detail at the
interface level. The guarantees with regards to this part of
the interface are rather vague, but I would expect that if
(double)(m.size() + 1) / m.bucket_count() <= m.max_load_factor(), then:
size_t b = m.bucket(obj);
m.insert(obj);
assert(b == m.bucket(obj));
My reading is opposite:

It says (1) "Keys with the same hash code appear in the same bucket" and
(2) "The number of buckets is automatically increased as elements are
added to an unordered associative container"

So, if there is one bucket now, m.bucket(obj) has to return 0;
if there must be two buckets after the insertion as per (2), the
inserted element may have gone to bucket 1 along with its peers (a
bucket may contain elements with more than one hash code; it is just
that if two elements have the same hash code they must appear in the
same bucket per (1), so if obj had hash code 2 and the 0th bucket
contained all elements with codes 1 and 2, all the twos may need to go
to the 1st bucket during the insertion).
should hold. If so, that pretty much defines how overflows are
handled.


Does it recognize a utility, other than compatibility of
interface (so e.g. you can instantiation an insert_iterator on
an unordered ass. container)? (I'm not sure I understand the
meaning of "The iterator q is a hint pointing to where the
search should start" in the standard. It doesn't make too much
sense when buckets are used for collision handling, at least to
me.)


I'd be interested in seeing an actual implementation which does
use it somehow.
I will post a notice here if I make the change in the GNU implementation
(I will be legally obliged to contribute it back anyway). But I will
only work on it if I figure out it would make a difference for my
particular problem, so it's not guaranteed.
-Pavel
 

James Kanze

James said:
James Kanze wrote:
On Nov 24, 5:24 am, Pavel
James Kanze wrote:
On Nov 21, 8:34 pm, Pavel
[...]
Most of the hash maps I've written in the past would cache
the last element found, in order to support things like:
if ( m.contains(x) ) {
m[x] = 2*m[x] ; // or whatever...
}
I understand you can optimize the set for the usage pattern
you think is common. This comes at cost even for this use,
however, as the comparision may be much more expensive than
hash-function computation and you guarantee you will have an
extra comparison in m[x] at all times. We know we can do
without (see above) so why should we live with it.
It's a common usage pattern with my hashed containers: since
[] has a precondition that the entry exists,
You mean -- in your container, not STL?
Yes.

I don't like STL's inserting things into maps behind my back
either, e.g.:
std::map<int, int> m;
int i = m[5];
/* it's not exactly intuitive that we have just created a (5, 0) entry
on the map.. */

My earliest versions of a hash table did this as well---they
were based on my experience with AWK. (In fact, I've come to
the conclusion that it's a very bad idea. AWK and C++ are
different languages, and what's the best solution in one isn't
necessarily a good solution in the other.) But even then, I
found myself often using contains to avoid accidentally
inserting a new element. All too often, you need to proceed
differently depending on whether the element is present or not.
That's why I mostly use explicit insert, find etc.

Agreed. I've found almost no use for std::map's operator[]
either. I rather think it was an afterthought---that the design
itself is to use find for access, and insert for insertion, and
that operator[] was added with an AWK-like semantic as a
convenience function, for the cases where you need or want an
AWK-like semantic.

If you use find and insert, then of course, there's no point in
caching the last value, since the iterator find returns contains
it. I have a lot of code along the lines of:

Map::const_iterator elem = map.find(key);
return elem == map.end()
    ? someDefaultValue
    : elem->second;

With my earlier implementations, this was:

return map.contains(key)
    ? map[key]
    : someDefaultValue;

(which I still prefer, but the STL usage above is more than
acceptable as well).
[...]
The cost of insertion can be anything -- in case of
conflict it may even be some secondary hash or linear or
non-linear search in the overflow area or similar. The
Standard does not define how exactly the overflows are
processed.
It does require buckets, since it exposes this detail at the
interface level. The guarantees with regards to this part
of the interface are rather vague, but I would expect that
if (double)(m.size() + 1) / m.bucket_count() <=
m.max_load_factor(), then:
size_t b = m.bucket(obj);
m.insert(obj);
assert(b == m.bucket(obj));
My reading is opposite:
It says (1) "Keys with the same hash code appear in the same
bucket" and (2) "The number of buckets is automatically
increased as elements are added to an unordered associative
container"
So, if there is one bucket now, m.bucket(obj) has to return 0;
if there must be two buckets after after the insertion as per
(2), the inserted element may have gone to the bucket 1 with
other its peers (the bucket may contain elements with more
than one hash code it is just that if two elements have same
hash code they must appear in the same bucket per (1) so if
obj had hashcode 2 and 0th bucket contained all elements with
codes 1 and 2, all twos may need to go to the 1st bucket
during the insertion).

That's why I had the precondition.

As I said, I don't think this is 100% clear in the draft, but
you do have "The number of buckets is automatically increased as
elements are added to an unordered associative container, so
that the average number of elements per bucket is kept below a
bound. Rehashing invalidates iterators, changes ordering between
elements, and changes which buckets elements appear in, but does
not invalidate pointers or references to elements." There seems
(to me, at least) to be a strong presumption that if you don't
rehash, iterators aren't invalidated, and that the container
doesn't rehash unless it needs to, so that you can count on
iterators not being invalidated if you can ensure that the load
factor will not be exceeded. I may be reading too much into
this, since I don't see any practical way of guaranteeing a
sufficient number of buckets beforehand; e.g. something like
reserve in std::vector. Something like:
map.max_load_factor(0.5F);
map.rehash();
map.max_load_factor(0.8F);
would probably work in practice, but I'm not sure that it is
guaranteed. (The argument of max_load_factor is only a hint.
From a QoI point of view, I would expect that it result in a
max_load_factor <= the given value, and I don't understand the
standard not requiring it.) I think some function along the
lines of reserve or reserve_buckets would be in order---if I
know I'm going to insert 100000 elements, it's rather a bother
not to be able to tell the hash table this beforehand, so that
it doesn't have to rehash.
 

Pavel

James Kanze wrote:
[snipped]
That's why I had the precondition.
I missed that, sorry.
As I said, I don't think this is 100% clear in the draft, but
you do have "The number of buckets is automatically increased as
elements are added to an unordered associative container, so
that the average number of elements per bucket is kept below a
bound. Rehashing invalidates iterators, changes ordering between
elements, and changes which buckets elements appear in, but does
not invalidate pointers or references to elements." There seems
(to me, at least) to be a strong presumption that if you don't
rehash, iterators aren't invalidated,
23.2.5-12
"
The insert members shall not affect the validity of references to
container elements, but may invalidate all
iterators to the container. The erase members shall invalidate only
iterators and references to the erased
elements.
"

They seem to be determined not to imply anything specific about when
rehashing may happen (except that it does not happen in erase(), so the
hash table never shrinks)... it can happen in any insert(). Not that I
think it is a useful degree of freedom not to specify that you only
auto-rehash when the load factor is exceeded... the usual C++
Standardese.


and that the container
doesn't rehash unless it needs to, so that you can count on
iterators not being invalidated if you can ensure that the load
factor will not be exceeded. I may be reading too much into
this, since I don't see any practical way of guaranteeing a
sufficient number of buckets beforehand; e.g. something like
reserve in std::vector. Something like:
map.max_load_factor(0.5F);
map.rehash();
map.max_load_factor(0.8F);
would probably work in practice, but I'm not sure that it is
guaranteed. (The argument of max_load_factor is only a hint.
From a QoI point of view, I would expect that it result in a
max_load_factor<= the given value, and I don't understand the
standard not requiring it.)
Me neither. I would understand it if there were a danger of
back-and-forth "jitter" if the unordered containers were auto-shrinkable,
but this does not seem to be the case (see above).
I think some function along the
lines of reserve or reserve_buckets would be in order---if I
know I'm going to insert 100000 elements, it's rather a bother
not to be able to tell the hash table this beforehand, so that
it doesn't have to rehash.
I think you can give it as the first argument to the constructor. Also,
if the container is empty, rehash(n) should essentially do a reserve(n)
(there is nothing to rehash); if it is not empty, rehashing is probably
necessary anyway if you want to reserve more.
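Something like this, assuming the usual TR1 names (the numbers are
arbitrary):

#include <tr1/unordered_map>
#include <string>

typedef std::tr1::unordered_map<std::string, int> Map;

Map make_presized()
{
    Map m(100000);       // minimum initial bucket count given to the constructor
    return m;
}

void presize(Map& m)
{
    // On an empty (or nearly empty) container this is effectively a
    // reserve(): guarantee at least this many buckets so that a bulk
    // insert need not rehash.
    m.rehash(200000);
}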
 

James Kanze

James Kanze wrote:
[snipped]
As I said, I don't think this is 100% clear in the draft, but
you do have "The number of buckets is automatically increased as
elements are added to an unordered associative container, so
that the average number of elements per bucket is kept below a
bound. Rehashing invalidates iterators, changes ordering between
elements, and changes which buckets elements appear in, but does
not invalidate pointers or references to elements." There seems
(to me, at least) to be a strong presumption that if you don't
rehash, iterators aren't invalidated,

23.2.5-12
"
The insert members shall not affect the validity of references
to container elements, but may invalidate all iterators to the
container. The erase members shall invalidate only iterators
and references to the erased elements.
"
They seem to be determined not to imply anything specific
about when rehashing may happen (except that it does not
happen in erase() so hashtable never shrinks)... can happen in
any insert(). Not that I think it is useful degree of freedom
not to define that you only auto-rehash when the load_factor
is exceeded.. usual C++ Standardese.

Yes. They provide hooks that an implementation can use to
possibly offer something useful, but isn't required to.

With regards to rehashing when not required by the load factor
(or some explicit user function), the wording I've seen does
seem to allow it; from a QoI point of view, however, I'm more
sceptical.
Me neither. I would understand it if there were a danger of
back-n-forth "jitter" in case unordereds were auto-shrinkable
but this does not seem to be the case (see above)
I think you can give it as a first argument to the
constructor. Also, if a container is empty, rehash(n) should
essentially do reserve(n) (there is nothing to rehash); if
it's not empty, rehashing is probably necessary anyway if you
want to reserve more..

Yes. I'd missed the fact that rehash had a parameter specifying
the (minimum) number of buckets.
 
