STL: how to get the sequence number of a newly added item into a set

B

bilgekhan

After doing a succcessful insert() or find() on a set<T> container
is it possible to get the item number of this item in the set?
(ie. its zero-based sequence number (position/location/rank/index)
within the sorted collection)

Additional info:
insert() returns a pair<T> and find() returns an iterator.
Using any of these two objects is it possible to get the
sequence number of the item these objects point to?
After the insert() or find() the container won't be modified.
 
J

Jerry Coffin

After doing a succcessful insert() or find() on a set<T> container
is it possible to get the item number of this item in the set?
(ie. its zero-based sequence number (position/location/rank/index)
within the sorted collection)

Additional info:
insert() returns a pair<T> and find() returns an iterator.
Using any of these two objects is it possible to get the
sequence number of the item these objects point to?

your_iterator = yourset.find(whatever);
std::distance(yourset.begin(), your_iterator);
 
B

bilgekhan

Jerry Coffin said:
(e-mail address removed) says...

your_iterator = yourset.find(whatever);
std::distance(yourset.begin(), your_iterator);

Yes, distance() does what I wanted, but it is VERY slow!
It seems each time it walks all items from the beginning to the
item in question, ie. it practically does a find().
That's unfortunately too inefficient for my needs.
Isn't there a better (a direct) method without traversing the set ?

FYI: If I program this without STL (and then even using an unordered array
--> ie. each time doing a sequential search) then it is about 7 times
faster than the STL version that uses set<T> and distance() !
How come? :)
The culprit is IMO the distance() call. One needs a better, a direct,
method without traversing the tree.

A better way would have been if the insert() and find() functions would
optionally deliver the distance value on the fly;
ie. STL needs to extend these 2 functions with a direct 'distance' option.
 
B

bilgekhan

Daniel T. said:
set<int>::iterator it = s.insert( x ).first;
// or = s.find( x );

int seqNumb = distance( s.begin(), it );


Yes, distance() does what I wanted, but it is VERY slow!
It seems each time it walks all items from the beginning to the
item in question, ie. it practically does a find().
That's unfortunately too inefficient for my needs.
Isn't there a better (a direct) method without traversing the set ?

FYI: If I program this without STL (and then even using an unordered array
--> ie. each time doing a sequential search) then it is about 7 times
faster than the STL version that uses set<T> and distance() !
How come? :)
The culprit is IMO the distance() call. One needs a better, a direct,
method without traversing the tree.

A better way would have been if the insert() and find() functions would
optionally deliver the distance value on the fly;
ie. STL needs to extend these 2 functions with a direct 'distance' option.
 
J

Jerry Coffin

[ ... ]
Yes, distance() does what I wanted, but it is VERY slow!

That's not terribly surprising.
It seems each time it walks all items from the beginning to the
item in question, ie. it practically does a find().

Yes, that's correct.
That's unfortunately too inefficient for my needs.
Isn't there a better (a direct) method without traversing the set ?

None of which I'm aware.
FYI: If I program this without STL (and then even using an unordered array
--> ie. each time doing a sequential search) then it is about 7 times
faster than the STL version that uses set<T> and distance() !
How come? :)

A set is composed of nodes connected by pointers. Each node is
separately allocated via the allocator for the set (which will use new,
unless you supply an allocator of your own). Those nodes will usually
have fairly poor locality of reference, which leads to poor cache usage
(among other things). The result is relatively slow traversal.

An array is required to be allocated as a single, contiguous block of
storage. This leads to relatively good locality of reference, better
cache usage, and (generally) relatively fast traversal.
The culprit is IMO the distance() call. One needs a better, a direct,
method without traversing the tree.

A better way would have been if the insert() and find() functions would
optionally deliver the distance value on the fly;
ie. STL needs to extend these 2 functions with a direct 'distance' option.

This requirement is sufficiently unusual that I doubt that'll happen.
Anything that supports random access iterators can do distance in
constant time.

Since a set is typically implemented as a balanced tree, you could
theoretically improve the speed of distance() by having each pointer in
a node also record the number of nodes in the subtree at which it points
(or at least in its left sub-tree). This would avoid traversing at least
part of the tree most of the time -- OTOH, maintaining those counts
would slow down insertions and deletions. TANSTAAFL.
 
J

James Kanze

There are no "sequence numbers" for members of a set.
Individual members of sets are accessed via iterators.

A set has a defined order, so there is a defined and guaranteed
bijection between elements of the set and the set of integers in
[0...s.size()).
No, because there is no such a concept as a "sequence number".

Anytime you have a defined order, there is a concept of sequence
number. Set doesn't support it directly, for reasons explained
by Jerry Coffin, but the concept definitly exists.
 
J

James Kanze

[ ... ]
Yes, distance() does what I wanted, but it is VERY slow!
That's not terribly surprising.
Yes, that's correct.
None of which I'm aware.

Perhaps the solution to his problem would be to use an
std::vector, and keep it sorted (using std::lower_bound to find
the insertion point). If insertions aren't too frequent, or
copying is fairly cheap, this can be even faster than an
std::set. And of course, given an iterator to an element, it's
trivial and very fast to find the elements sequence number.
 
J

Jerry Coffin

[ ... ]
Perhaps the solution to his problem would be to use an
std::vector, and keep it sorted (using std::lower_bound to find
the insertion point). If insertions aren't too frequent, or
copying is fairly cheap, this can be even faster than an
std::set. And of course, given an iterator to an element, it's
trivial and very fast to find the elements sequence number.

Right -- I'd add, however, that my mention of random access iterators
instead of vectors was intentional. Under the circumstances at hand, a
deque might well be a better choice, since it halves the number of
elements that need to be shuffled for an average insertion.
 
B

bilgekhan

Daniel T. said:
How else could it be done?

See below.
Your comparing apples to oranges. In one implementation you are using an
unordered array where all the elements are contiguous, in the other
implementation the elements are ordered but not contiguous. But before
abandoning the set idea, make sure you have all optimizations turned on.
Many standard libraries have extensive error checking built in and make
a lot of calls that can be inlined, but the former doesn't get removed
and the latter doesn't get inlined unless optimizations are turned on.


Like how?

See below.
They would have to traverse the tree to do that.

Yes, of course! Therefore the traversion does not need
to be done again in distance() ! That's the point I'm trying
to make clear. If you reread my inital posting then you will see that
I said that I need the distance value immediately after an insert()
or find() call... Do you see what I mean? :)
Currently the tree is traversed twice: once in insert() and find(),
and the second time in distance(). The second traversion is
IMO unneccessary if insert() and find() would store this value
somewhere internally, and distance() then just returns that value
w/o traversing the tree.
Maybe for this one can make a new function member like
"distance_last_added()" or something that. Another option would be
if insert() and find() would deliver it in a overloaded version...
As others have pointed out, "sequence numbers" aren't all that useful in
sets. It sounds like you would be better served by using a std::vector
and sorting it (using std::sort,) then you can do lookups with
std::lower_bound, upper_bound, equal_range, and binary_search and
getting the "sequence number" is simply a matter of a little pointer
math.

I need to get the seqnbr for each item that gets added to
the sorted collection, so then especially the sorting would be an overkill.

FYI: I'm doing this for the table-lookup routine (ie. a dictionary) of
a special data compressor for data records with common structure;
ie. it's not general purpose.

I just want to make this last point clear: I just have pointed out
IMO a shortcoming, performance issues or missing features
of the language that I in the field have encountered.
It's of course up to the language and library designers to
understand the problem and to fix or implement the functionality.
I'm just informing the people who are more involved in such issues.
If I were the designer of the library I would do it the way I described above.
 
K

Kai-Uwe Bux

bilgekhan said:
See below.


See below.

Notice the "would".
Yes, of course! Therefore the traversion does not need
to be done again in distance() ! That's the point I'm trying
to make clear. If you reread my inital posting then you will see that
I said that I need the distance value immediately after an insert()
or find() call... Do you see what I mean? :)
Currently the tree is traversed twice: once in insert() and find(),

No. Insert and find do _not_ traverse the tree. They only run through a path
from the root to a leave in the tree. During this process, the sequence
number is not established as a by-product.
and the second time in distance().

It's the first time.

[snip]


Best

Kai-Uwe Bux
 
J

Jerry Coffin

[ ... ]
Yes, of course! Therefore the traversion does not need
to be done again in distance() ! That's the point I'm trying
to make clear. If you reread my inital posting then you will see that
I said that I need the distance value immediately after an insert()
or find() call... Do you see what I mean? :)

I see what you mean, but I think you're wrong. Consider (for example),
inserting an item that's going to be in the right subtree. insert()
compares the new item to the root node, finds that the new item is
larger than the root, and proceeds to the node to the right of the root.
In doing so, it has ignored the _entire_ left sub-tree.

When you use distance to find the position of the new node in the tree,
it cannot ignore the left subtree. Instead, it traverses the entire left
subtree counting each node as it goes. It then proceeds to count through
the right subtree until it reaches the point at which the new node was
inserted.
Currently the tree is traversed twice: once in insert() and find(),
and the second time in distance(). The second traversion is
IMO unneccessary if insert() and find() would store this value
somewhere internally, and distance() then just returns that value
w/o traversing the tree.

See above. The traversals are entirely _different_ from each other. The
traversal involved in an insertion or deletion is NOT sufficient to
determine the overall position of the new node in the tree.

You _could_ determine an _extremely_ rough estimate of the overall
position in the tree with only a little extra work. A set, map, etc., is
normally implemented as a balanced binary tree (e.g. red-black or AVL
tree). To maintain balancing, such a tree stores some information about
the _relative_ depths of the left and right subtrees in each node. They
limit the difference in height between the left and right subtrees
(though r-b trees and AVL trees impose slightly different restrictions).

Based upon that height difference, and the depth of the tree you _did_
traverse, you could very quickly compute upper and lower bounds for the
overall position of the new node in the tree -- but those bounds may
easily differ from each other by a factor of 2 or so. Getting the
correct number is going to be linear.

[ ... ]
I just want to make this last point clear: I just have pointed out
IMO a shortcoming, performance issues or missing features
of the language that I in the field have encountered.
It's of course up to the language and library designers to
understand the problem and to fix or implement the functionality.
I'm just informing the people who are more involved in such issues.
If I were the designer of the library I would do it the way I described above.

Maybe so -- but if you did, I doubt much of anybody would use your
library. In fact, except for this specific situation, even YOU probably
wouldn't use it. Making things work as you seem to want would involve a
_large_ slowdown in common operations in the hope of achieving a _small_
speedup for a relatively rare sequence of operations.
 
F

Frank Birbacher

Hi!
I need to get the seqnbr for each item that gets added to
the sorted collection, so then especially the sorting would be an overkill.

Your answer doesn't make sense to me. I guess you misunderstood the
advise. The advise propses what I had done if I were you:

DO NOT USE A STD::SET.

It doesn't fit your needs. The fact that you need the "index" of some
element shows it.

FYI: I'm doing this for the table-lookup routine (ie. a dictionary) of
a special data compressor for data records with common structure;
ie. it's not general purpose. [snip]
If I were the designer of the library I would do it the way I described above.

Maybe all people writing a non general purpose data compressor with a
table-lookup routine will use your library then.

I probably wouldn't. Don't be offensed. It is just that std::set will
never have a "index_last_added" method. It's by design. If you need a
function like that you need to implement a new set class of your own.
Having done that come back and tell us how easy it was to determine the
exact size of a tree branch which is never visited or how to keep the
counts up-to-date which tell you the exact size, how you then managed to
keep the insert in O(log n). I think this is too complicated compared to
a sorted vector solution.

Frank
 
D

Daniel Pitts

Frank said:
Hi!
I need to get the seqnbr for each item that gets added to
the sorted collection, so then especially the sorting would be an
overkill.

Your answer doesn't make sense to me. I guess you misunderstood the
advise. The advise propses what I had done if I were you:

DO NOT USE A STD::SET.

It doesn't fit your needs. The fact that you need the "index" of some
element shows it.

FYI: I'm doing this for the table-lookup routine (ie. a dictionary) of
a special data compressor for data records with common structure;
ie. it's not general purpose. [snip]
If I were the designer of the library I would do it the way I
described above.

Maybe all people writing a non general purpose data compressor with a
table-lookup routine will use your library then.

I probably wouldn't. Don't be offensed. It is just that std::set will
never have a "index_last_added" method. It's by design. If you need a
function like that you need to implement a new set class of your own.
Having done that come back and tell us how easy it was to determine the
exact size of a tree branch which is never visited or how to keep the
counts up-to-date which tell you the exact size, how you then managed to
keep the insert in O(log n). I think this is too complicated compared to
a sorted vector solution.

Frank
I agree with your point, but feel like pointing out that it is possible
(almost trivial) to have every node on a tree keep track of how many
children it has. An insert operation would increment the value on the
return-trip if the insert was successful.

My question to the OP is, do you need to update the insert-position of
the elements *after* the last inserted element? say I have an existing set:

<1, 5, 6> so, 5 and 6 are at position 1 and 2 respectively...
I insert 3:
<1, 3, 5, 6> and determine that 3 is at position 1, do I need to update
the associated data with 5 and 6 to say they are now at position 2 and 3
respectively?

Also to the OP:
If you design your library for your specific use-cases, but in such a
way that it can easily be extended, you'll find yourself much happier.
 
D

Daniel Pitts

Daniel said:
I'm not so sure if an extra unsigned long per element can be considered
"trivial".
I said "almost trivial", which refers to difficulty, not space complexity.

Also, it could easily be a compile-time decision if you needed it or not.
 
B

bilgekhan

Daniel Pitts said:
I agree with your point, but feel like pointing out that it is possible
(almost trivial) to have every node on a tree keep track of how many
children it has. An insert operation would increment the value on the
return-trip if the insert was successful.

My question to the OP is, do you need to update the insert-position of
the elements *after* the last inserted element? say I have an existing set:

<1, 5, 6> so, 5 and 6 are at position 1 and 2 respectively...
I insert 3:
<1, 3, 5, 6> and determine that 3 is at position 1, do I need to update
the associated data with 5 and 6 to say they are now at position 2 and 3
respectively?

From point of application code it is sufficient to know the position of
just only the *last* successful insert() operation and of the
*last* successful find() operation.
Ie. in the application code there is no need to keep track of the positions
of the other items; just that of the latest successful insert() or find().
 
B

bilgekhan

Daniel Pitts said:
I agree with your point, but feel like pointing out that it is possible
(almost trivial) to have every node on a tree keep track of how many
children it has. An insert operation would increment the value on the
return-trip if the insert was successful.

My question to the OP is, do you need to update the insert-position of
the elements *after* the last inserted element? say I have an existing set:

<1, 5, 6> so, 5 and 6 are at position 1 and 2 respectively...
I insert 3:
<1, 3, 5, 6> and determine that 3 is at position 1, do I need to update
the associated data with 5 and 6 to say they are now at position 2 and 3
respectively?

From point of application code it is sufficient to know the position of
just only the *last* successful insert() operation and of the
*last* successful find() operation.
Ie. in the application code there is no need to keep track of the positions
of the other items; just that of the latest successful insert() or find().

In the example above, after inserting 3 into this set
it just shall be possible to get the position
of this latest new item (ie. here the position of 3, ie. 1).
There is no need for the library to keep track or to update
anything defined in the application code that references items of the set.
Ie. there are no such references (at least not in my code).
 
K

Kai-Uwe Bux

Daniel said:
I'm not so sure if an extra unsigned long per element can be considered
"trivial".

It's not really "extra": Any balanced tree has to store some additional
information. This information can be _traded_ for the unsigned long since
size information can be used to rebalance. Then, alignment and memory
quantization from the allocator will very likely imply that

2 pointers + 1 unsigned long + user data

is simply indistinguishable from

2 pointers + balancing info + user data



Best

Kai-Uwe Bux
 
B

bilgekhan

bilgekhan said:
From point of application code it is sufficient to know the position of
just only the *last* successful insert() operation and of the
*last* successful find() operation.
Ie. in the application code there is no need to keep track of the positions
of the other items; just that of the latest successful insert() or find().

In the example above, after inserting 3 into this set
it just shall be possible to get the position
of this latest new item (ie. here the position of 3, ie. 1).
There is no need for the library to keep track or to update
anything defined in the application code that references items of the set.
Ie. there are no such references (at least not in my code).

Besides the above functionality a furthergoing super solution
would be if the traversing of the container would be possible
for all these 3 cases:
1) traverse in default (sorted) order
2) traverse in raw (physical) order
3) traverse in insert order

The 1st one is of course already possible.

The 2nd case could also efficiently be realized in application code
if only the above mentioned functionality would be possible. Ie. one then
would need to store and update the returned pos of the inserted item
in a std::vector<int> at position pos. But before doing this one would need
to move (reversely) all items from pos to the end by 1 position,
and while doing this incrementing by 1 those items that have a
value >= the pos to insert...

The 3rd could be done for example like this:
cursor c; // stores last walked pos etc.
while ((it = s.next(c)) != s.end())
do_something with *it

Ie. one would need to implement a "next(cursor& c)" method in the library...

These are just some ideas of mine...
At the moment I unfortunately don't have time to implement these
IMO useful functionalities myself, besides this I don't know the STL
that well, much less its internals. It seems one would need to
extend the underlying code of the RB or AVL tree and add
a uint data member to each block and update it with each insert, delete etc.
Long time ago I had worked on Btree implementations and I can imagine
that the most complicated case would be when a block would need
to be splitted or when two blocks need to be joined...

People interessted in implementing the above features
can also take a look at this interessting project:

"STL containers map, set, and list with fast array access" by Ray Virzi
"Source code for STL compliant container classes that add
fast indexing capability to existing container types":
http://www.codeproject.com/KB/stl/indexlist.aspx
 
B

bilgekhan

<This a fix to my prev. posting>

bilgekhan said:
From point of application code it is sufficient to know the position of
just only the *last* successful insert() operation and of the
*last* successful find() operation.
Ie. in the application code there is no need to keep track of the positions
of the other items; just that of the latest successful insert() or find().

In the example above, after inserting 3 into this set
it just shall be possible to get the position
of this latest new item (ie. here the position of 3, ie. 1).
There is no need for the library to keep track or to update
anything defined in the application code that references items of the set.
Ie. there are no such references (at least not in my code).

Besides the above functionality a furthergoing super solution
would be if the traversing of the container would be possible
for all these 3 cases:
1) traverse in default (sorted) order
2) traverse in insert order
3) traverse in raw (physical) order

The 1st one is of course already possible.

The 2nd case could also efficiently be realized in application code
if only the above mentioned functionality would be possible. Ie. one then
would need to store and update the returned pos of the inserted item
in a std::vector<uint> at position pos. But before doing this one would need
to move (reversely) all items from pos to the end by 1 position,
and while doing this incrementing by 1 those items that have a
value >= the pos to insert...
But one would need a direct access (array access) to the items of the set via a uint handle...

The 3rd could be done for example like this:
cursor c; // stores last walked pos etc.
while ((it = s.next(c)) != s.end())
do_something with *it

Ie. one would need to implement a "next(cursor& c)" method in the library...

These are just some ideas of mine...
At the moment I unfortunately don't have time to implement these
IMO useful functionalities myself, besides this I don't know the STL
that well, much less its internals. It seems one would need to
extend the underlying code of the RB or AVL tree and add
a uint data member to each block and update it with each insert, delete etc.
Long time ago I had worked on Btree implementations and I can imagine
that the most complicated case would be when a block would need
to be splitted or when two blocks need to be joined...

People interessted in implementing the above features
can also take a look at this interessting project:

"STL containers map, set, and list with fast array access" by Ray Virzi
"Source code for STL compliant container classes that add
fast indexing capability to existing container types":
http://www.codeproject.com/KB/stl/indexlist.aspx
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,755
Messages
2,569,537
Members
45,021
Latest member
AkilahJaim

Latest Threads

Top