Inserting into std::set

M

Marcus Kwok

I am not sure if this is something that is covered by the Standard, or
if it's an implementation detail of my Standard Library.

I am reading in a large amount of data into a std::set. There is an
overload for std::set::insert() that takes in an iterator as a hint as
to where the new value should be inserted, and my implementation
(Dinkumware) says that if the hint is good (meaning the iterator points
immediately before or after where the inserted value should go) then the
insertion can happen in amortized constant time rather than logarithmic
time.

I would like to take advantage of this fact. My input data should
already be in sorted order, therefore I think I can use my_set.end() as
the hint to insert(). Is this a valid assumption, or does this depend
on how std::set is actually implemented?
 
D

David Harmon

On Thu, 26 Oct 2006 18:48:57 +0000 (UTC) in comp.lang.c++,
(e-mail address removed) (Marcus Kwok) wrote,
I would like to take advantage of this fact. My input data should
already be in sorted order, therefore I think I can use my_set.end() as
the hint to insert(). Is this a valid assumption, or does this depend
on how std::set is actually implemented?

Sounds valid to me. However, I think that I would prefer to use the
iterator returned by the previous insert() call.
 
V

Victor Bazarov

Marcus said:
I am not sure if this is something that is covered by the Standard, or
if it's an implementation detail of my Standard Library.

I am reading in a large amount of data into a std::set. There is an
overload for std::set::insert() that takes in an iterator as a hint as
to where the new value should be inserted, and my implementation
(Dinkumware) says that if the hint is good (meaning the iterator
points immediately before or after where the inserted value should
go) then the insertion can happen in amortized constant time rather
than logarithmic time.

I would like to take advantage of this fact. My input data should
already be in sorted order, therefore I think I can use my_set.end()
as the hint to insert(). Is this a valid assumption, or does this
depend on how std::set is actually implemented?

Why don't you try it both ways and compare the time it takes your program
to form your 'set'?

V
 
M

Marcus Kwok

Marcus Kwok wrote:
[using a hint when inserting into std::set]

Victor Bazarov said:
Why don't you try it both ways and compare the time it takes your program
to form your 'set'?

OK, so I tested three different versions of my code: one where no hint
is supplied, one where I use my_set.end() as the hint, and one where I
use the iterator returned from the previous iteration as the hint. All
of them were pretty close. Interestingly, the one with no hint was
actually the fastest, but only very slightly faster than using end() as
the hint (I will consider them the same due to the resolution of my
timing results). Also interesting was that the one using the previous
insertion's iterator was the slowest, but only by about 3%.

Since the timing differences are fairly negligible (and also not
definitive since I had other processes running at the same time), I
think I will just stick to the simple version of insert(), since it is
the clearest.
 
V

Victor Bazarov

Marcus said:
Marcus Kwok wrote:
[using a hint when inserting into std::set]

Victor Bazarov said:
Why don't you try it both ways and compare the time it takes your
program to form your 'set'?

OK, so I tested three different versions of my code: one where no hint
is supplied, one where I use my_set.end() as the hint, and one where I
use the iterator returned from the previous iteration as the hint.
All of them were pretty close. Interestingly, the one with no hint
was actually the fastest, but only very slightly faster than using
end() as the hint (I will consider them the same due to the
resolution of my timing results). Also interesting was that the one
using the previous insertion's iterator was the slowest, but only by
about 3%.

Since the timing differences are fairly negligible (and also not
definitive since I had other processes running at the same time), I
think I will just stick to the simple version of insert(), since it is
the clearest.

While doing that, have you tried looking at the implementation of
the 'insert' without the argument? Could it be that it actually falls
back onto 'insert' with a hint and gives 'end()' as the hint? Not that
it should make much of a difference for you, of course. Just curious,
I guess.

V
 
M

Marcus Kwok

Victor Bazarov said:
While doing that, have you tried looking at the implementation of
the 'insert' without the argument? Could it be that it actually falls
back onto 'insert' with a hint and gives 'end()' as the hint? Not that
it should make much of a difference for you, of course. Just curious,
I guess.

Well, it looks like on my implementation, std::set is implemented in
terms of a Red-Black tree. As far as I can tell, the overload without a
hint will just traverse the tree to find the right spot. I can't really
see what exactly the version with a hint is doing... all those leading
underscores make my eyes hurt :)
 
M

Mark P

Marcus said:
Marcus Kwok wrote:
[using a hint when inserting into std::set]

Victor Bazarov said:
Why don't you try it both ways and compare the time it takes your program
to form your 'set'?

OK, so I tested three different versions of my code: one where no hint
is supplied, one where I use my_set.end() as the hint, and one where I
use the iterator returned from the previous iteration as the hint. All
of them were pretty close. Interestingly, the one with no hint was
actually the fastest, but only very slightly faster than using end() as
the hint (I will consider them the same due to the resolution of my
timing results). Also interesting was that the one using the previous
insertion's iterator was the slowest, but only by about 3%.

Since the timing differences are fairly negligible (and also not
definitive since I had other processes running at the same time), I
think I will just stick to the simple version of insert(), since it is
the clearest.

Keep in mind that inserting elements into a set in sorted order may be
far from optimal and that the run time may be dominated by the
rebalancing required to maintain the red-black tree.
 
M

Marcus Kwok

Mark P said:
Keep in mind that inserting elements into a set in sorted order may be
far from optimal and that the run time may be dominated by the
rebalancing required to maintain the red-black tree.

Hmm, OK, in that case let me add something to my situation. I have a
large list of data that I am adding to my container in sorted order.
There is a possibility of duplicates, but we do not want to add
duplicates. Initially, I was pushing the data to the back of a vector,
but checking if the element was already in the vector before pushing it
back. This repeated searching was very inefficient and was what caused
me to redesign it using a set, since I can just insert without needing
to check before. Using a set instead of the previous method with vector
caused a 4-5x speed increase.

Do you have another suggestion for something I can try?
 
P

Pete Becker

Marcus said:
Hmm, OK, in that case let me add something to my situation. I have a
large list of data that I am adding to my container in sorted order.
There is a possibility of duplicates, but we do not want to add
duplicates. Initially, I was pushing the data to the back of a vector,
but checking if the element was already in the vector before pushing it
back. This repeated searching was very inefficient and was what caused
me to redesign it using a set, since I can just insert without needing
to check before. Using a set instead of the previous method with vector
caused a 4-5x speed increase.

Do you have another suggestion for something I can try?

std::unique or std::unique_copy.

--

-- Pete

Author of "The Standard C++ Library Extensions: a Tutorial and
Reference." For more information about this book, see
www.petebecker.com/tr1book.
 
D

David Harmon

Keep in mind that inserting elements into a set in sorted order may be
far from optimal and that the run time may be dominated by the
rebalancing required to maintain the red-black tree.

Hmm, OK, in that case let me add something to my situation. I have a
large list of data that I am adding to my container in sorted order.
There is a possibility of duplicates, but we do not want to add
duplicates. Initially, I was pushing the data to the back of a vector,
but checking if the element was already in the vector before pushing it
back. This repeated searching was very inefficient and was what caused
me to redesign it using a set, since I can just insert without needing
to check before. Using a set instead of the previous method with vector
caused a 4-5x speed increase.

Do you have another suggestion for something I can try?[/QUOTE]

If your data is sorted, then you only need to check if each item is
the same as the previous, right? And at the same time, a check if
the item is "less than" the previous will verify that it really is
sorted.
 
M

Marcus Kwok

David Harmon said:
If your data is sorted, then you only need to check if each item is
the same as the previous, right? And at the same time, a check if
the item is "less than" the previous will verify that it really is
sorted.

That's a good idea. Unfortunately, I should have said that our data
"should be" sorted before reading in, but there is the possibility that
something might come in out of order... though this possibility is
rather low, I still need to handle it properly in case it isn't.

Thanks for your input.
 
M

Mark P

Marcus said:
That's a good idea. Unfortunately, I should have said that our data
"should be" sorted before reading in, but there is the possibility that
something might come in out of order... though this possibility is
rather low, I still need to handle it properly in case it isn't.

Thanks for your input.

So apply std::sort to the initial data, then apply std::unique or
std::unique_copy as Pete Becker suggested.

As a rule of thumb, if the data only needs to be sorted once, then a
dynamically sorted container such as std::set is probably overkill
(read: inefficient). std::set is nice if you need to sort "online", but
red-black tree algorithms are much more complex than simple sorting.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,261
Messages
2,571,041
Members
48,769
Latest member
Clifft

Latest Threads

Top