vector::insert performance tip.

kostas

Hi,
Trying to dump a set<int> (s) into a vector (v), my guess was that
v.insert(v.end(), s.begin(), s.end())
would be at least as efficient as
copy(s.begin(), s.end(), back_inserter(v))
It turns out to be almost two times slower (even without using
reserve() before copy), and that's because it traverses the set twice.
I'm posting it as an exception to the usual (and usually correct)
advice "Prefer member functions to algorithms".

Regards
Kostas
 
Victor Bazarov

kostas said:
Trying to dump a set<int> (s) into a vector (v), my guess was that
v.insert(v.end(), s.begin(), s.end())
would be at least as efficient as
copy(s.begin(), s.end(), back_inserter(v))
It turns out to be almost two times slower (even without using
reserve() before copy), and that's because it traverses the set twice.

I don't see why it would need to traverse the set twice. Any hint
as to why it was happening?
I'm posting it as an exception to the usual (and usually correct)
advice "Prefer member functions to algorithms".

I don't think the advice has anything to do with performance. The
usual (and always correct) advice WRT performance is "measure first,
then try to optimize".

V
 
kostas

I don't see why it would need to traverse the set twice. Any hint
as to why it was happening?

The STL implementation I use first needs to calculate the
distance from s.begin() to s.end() in order to reserve the
appropriate space. Is this not the standard (or the usual)
implementation of vector::insert?
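Roughly, what such an implementation does for forward (or stronger) iterators is something like the sketch below; the function name is made up and it is not taken from any particular library, but the two passes over the range are the point:

    #include <iterator>
    #include <vector>

    // Sketch only: the shape of a typical range insert at the end.
    template <class T, class FwdIt>
    void range_insert_at_end(std::vector<T>& v, FwdIt first, FwdIt last)
    {
        // First pass over [first,last): count the elements so that at
        // most one reallocation is needed.
        typename std::iterator_traits<FwdIt>::difference_type n =
            std::distance(first, last);
        v.reserve(v.size() + n);

        // Second pass: copy the elements into the reserved space.
        for (; first != last; ++first)
            v.push_back(*first);
    }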
I don't think the advice has anything to do with performance.

With what, then?
 
Roland Pibinger

The STL implementation I use first needs to calculate the
distance from s.begin() to s.end() in order to reserve the
appropriate space. Is this not the standard (or the usual)
implementation of vector::insert?

It's probably the 'normal' implementation, which is faster for
random-access iterators but slower in some cases of bidirectional
iterators. I guess your number of elements is very small. Otherwise
the vector relocations in the copy(..., back_inserter(v)) variant would
be more time-consuming than the double traversal.
 
kostas

I guess your number of elements is very small.
Wrong guess. Actually, in order to measure the difference I used one
million ints and more.
Otherwise
the vector relocations in the copy(..., back_inserter(v)) variant would
be more time-consuming than the double traversal.

Both the vector expansion and the set traversal are amortized linear
operations in the number of items. It just happens that, in this case
of built-in types, the first coefficient is smaller than the second.
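For what it's worth, in this particular case one can take the relocations out of the picture with a single traversal of the set, since set knows its own size; a sketch:

    // Reserve once using s.size() (constant time for std::set), then
    // copy: only one traversal of the set, and at most one reallocation.
    v.reserve(v.size() + s.size());
    std::copy(s.begin(), s.end(), std::back_inserter(v));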
 
James Kanze

I don't see why it would need to traverse the set twice. Any hint
as to why it was happening?

It's probably an "optimization" :). If the number of elements
in the set is large, and the elements are expensive to copy,
it's faster to traverse the set a first time to calculate how
many elements are needed, ensure sufficient capacity, and then
traverse the set a second time to copy them.
I don't think the advice has anything to do with performance. The
usual (and always correct) advice WRT performance is "measure first
then try to optimize".

And this turns out to be a good example of why:). Off hand, I
would have imagined that traversing the set twice would be
faster, because it avoids unnecessary allocations. Apparently,
I'd have guessed wrong. And apparently, the author of the STL
he's using guessed wrong as well.

(I suspect which form is faster will depend on any number of
external factors, but if the number of elements is large...
std::set has very poor locality, so you're likely to get some
paging, or at least cache misses, in each pass.)
 
James Kanze

The STL implementation I use first needs to calculate the
distance from s.begin() to s.end() in order to reserve the
appropriate space. Is this not the standard (or the usual)
implementation of vector::insert?

It's the required implementation: "Complexity: The constructor
template <class InputIterator> vector(InputIterator first,
InputIterator last) makes only N calls to the copy constructor
of T (where N is the distance between first and last) and no
reallocations if iterators first and last are of forward,
bidirectional, or random access categories." The apparent
assumption in the standard is that copies and allocations are
expensive, and iteration is cheap. Copying basic types,
however, is very, very cheap; modern allocators aren't that
expensive, and iterating over node based containers can be
extremely expensive due to their poor locality.
 
James Kanze

It's probably the 'normal' implementation, which is faster for
random-access iterators but slower in some cases of bidirectional
iterators. I guess your number of elements is very small. Otherwise
the vector relocations in the copy(..., back_inserter(v)) variant would
be more time-consuming than the double traversal.

I just did the test with 10 million members in the set. I get
similar results. I suspect that the lack of locality in node
based containers is the reason.
 
James Kanze

[...]
Both the vector expansion and the set traversal are amortized linear
operations in the number of items.

Which means very little on a modern machine. The time it takes
to execute a single operation can vary in significant
proportions depending on whether the required variables are in
registers, in cache, in main memory, or must be paged in. On
the machines I use, something like iter++, where iter is an
std::set<>::iterator, can vary from less than a microsecond
(everything in registers) to somewhere around 10 milliseconds
(if I get a page miss)---4 orders of magnitude.
 
Stuart Golodetz

[...]
Which means very little on a modern machine. The time it takes
to execute a single operation can vary in significant
proportions depending on whether the required variables are in
registers, in cache, in main memory, or must be paged in. On
the machines I use, something like iter++, where iter is an
std::set<>::iterator, can vary from less than a microsecond
(everything in registers) to somewhere around 10 milliseconds
(if I get a page miss)---4 orders of magnitude.

[...]

I'm not sure if I'm missing the point here, but for the purposes of
complexity analysis, does it actually matter how long each individual
instruction is taking? If we've got a constant (i.e. independent of the
number of items) upper bound for the time taken for an instruction (however
large), then an operation which requires a number of instructions that is
linear in terms of the problem size will run in linear time. For example, if
I've got an operation on n items that requires no more than (say) 2n
instructions, with each instruction taking at most k milliseconds, then the
maximum time taken on a problem of size n is 2nk milliseconds, which (for k
not a function of n) is linear in n. The actual value of k isn't relevant to
the complexity analysis, even if it's relevant to the person implementing
the code.

Regards,
Stu
 
kostas

I just did the test with 10 million members in the set. I get
similar results. I suspect that the lack of locality in node
based containers is the reason.

It's not only memory locality issues that are involved. It's also
algorithmic complexity. Set traversal is amortized linear, but not as
"linear" as list traversal, for example.
 
James Kanze

"James Kanze" <[email protected]> wrote in message
[...]
Both the vector expansion and the set traversal are amortized linear
operations in the number of items.
Which means very little on a modern machine. The time it takes
to execute a single operation can vary in significant
proportions depending on whether the required variables are in
registers, in cache, in main memory, or must be paged in. On
the machines I use, something like iter++, where iter is an
std::set<>::iterator, can vary from less than a microsecond
(everything in registers) to somewhere around 10 milliseconds
(if I get a page miss)---4 orders of magnitude.

I'm not sure if I'm missing the point here, but for the purposes of
complexity analysis, does it actually matter how long each individual
instruction is taking?

You're missing the point that what the original poster measured
was execution time. My point is, precisely, that complexity has
little relationship to execution time. It's a useful indicator
as to whether something will scale, but even then, for example,
an O(n) algorithm will suddenly become much, much slower if page
faults occur, and for very large data sets, locality can become
as important a consideration as complexity.
If we've got a constant (i.e. independent of the number of
items) upper bound for the time taken for an instruction
(however large), then an operation which requires a number of
instructions that is linear in terms of the problem size will
run in linear time.

No. That's only true if the time taken for an instruction is
independent of the size of the data set. Paging and caching
introduce a non-linear component into the execution time of
instructions, which can depend on the size of the data set.
For example, if
I've got an operation on n items that requires no more than (say) 2n
instructions, with each instruction taking at most k milliseconds, then the
maximum time taken on a problem of size n is 2nk milliseconds, which (for k
not a function of n) is linear in n. The actual value of k isn't relevant to
the complexity analysis, even if it's relevant to the person implementing
the code.

Once k has any dependence on n, your complexity analysis no
longer makes reasonable predictions about the running time.
Saying, for example, that ++i takes a maximum of 10
milliseconds, and calculating bounds from that, doesn't mean
much if most of the time it only takes 1 or 2 microseconds, and
if it will only take more than 5 milliseconds when the number of
elements exceeds, say, 10 million. You still do get asymptotic
performance with enough elements, but the number of elements
needed to get there is so high that it won't occur in actual
practice, for performance reasons if nothing else: once you
start getting page faults, your performance will end up being
unacceptable.
 
James Kanze

It's not only memory locality issues involved. It's also algorithmic
complexity. Set traversal is amortized linear but not so "linear" as
list traversal for example.

Both can play a role. Typically, allocators will place
successively allocated objects close to one another, which means
that locality when iterating over a set will be better if the
set was created by insertions in order. A quick test of 50
million elements on a Linux based PC shows that iterating over a
set where the elements were inserted in a random order takes
roughly 8 or 9 times longer than iterating over one where they
were inserted in the final order (and that iterating over an
std::list is considerably faster than either---about 3 or 4
times faster than the sorted set). This is probably due
partially to a simpler algorithm (i.e. the elementary operation
of iter++ is faster, even if everything is already in the
cache), and partially due to better locality: the node of a list
is smaller, so more of them fit into each cache line.
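A rough sketch of that kind of test (not the exact program; it assumes a C++11 compiler, and the checksums are printed only so the compiler cannot optimise the traversals away):

    #include <algorithm>
    #include <chrono>
    #include <iostream>
    #include <list>
    #include <numeric>
    #include <random>
    #include <set>
    #include <vector>

    template <class Container>
    double seconds_to_traverse(const Container& c)
    {
        auto t0 = std::chrono::steady_clock::now();
        long long sum = std::accumulate(c.begin(), c.end(), 0LL);
        auto t1 = std::chrono::steady_clock::now();
        std::cout << "(checksum " << sum << ") ";
        return std::chrono::duration<double>(t1 - t0).count();
    }

    int main()
    {
        const int n = 50000000;   // 50 million; scale down if memory is tight
        std::vector<int> values(n);
        std::iota(values.begin(), values.end(), 0);

        // Set built by inserting the elements in their final order.
        std::set<int> ordered(values.begin(), values.end());

        // Set built by inserting the same elements in a random order.
        std::shuffle(values.begin(), values.end(), std::mt19937(42));
        std::set<int> shuffled(values.begin(), values.end());

        // A list for comparison.
        std::list<int> lst(values.begin(), values.end());

        std::cout << "ordered set:  " << seconds_to_traverse(ordered)  << " s\n";
        std::cout << "shuffled set: " << seconds_to_traverse(shuffled) << " s\n";
        std::cout << "list:         " << seconds_to_traverse(lst)      << " s\n";
    }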
 
Stuart Golodetz

James said:
"James Kanze" <[email protected]> wrote in message
Both the vector expansion and the set traversal are amortized linear
operations in the number of items.
Which means very little on a modern machine. The time it takes
to execute a single operation can vary in significant
proportions depending on whether the required variables are in
registers, in cache, in main memory, or must be paged in. On
the machines I use, something like iter++, where iter is an
std::set<>::iterator, can vary from less than a microsecond
(everything in registers) to somewhere around 10 milliseconds
(if I get a page miss)---4 orders of magnitude.

I'm not sure if I'm missing the point here, but for the purposes of
complexity analysis, does it actually matter how long each individual
instruction is taking?

You're missing the point that what the original poster measured
was execution time. My point is, precisely, that complexity has
little relationship to execution time. It's a useful indicator
as to whether something will scale, but even then, for example,
an O(n) algorithm will suddenly become much, much slower if page
faults occur, and for very large data sets, locality can become
as important a consideration as complexity.
Agreed.
If we've got a constant (i.e. independent of the number of
items) upper bound for the time taken for an instruction
(however large), then an operation which requires a number of
instructions that is linear in terms of the problem size will
run in linear time.

No. That's only true if the time taken for an instruction is
independent of the size of the data set.

I think that's what I said, actually, assuming we equate "number of
items" and "size of the data set".
Paging and caching
introduce a non-linear component into the execution time of
instructions, which can depend on the size of the data set.

Ah, this is the point which I missed. If the execution time for an
instruction is dependent on the problem size then it's a different
matter entirely. Thanks for clarifying :)
Once k has any dependence on n, your complexity analysis no
longer makes reasonable predictions about the running time.
Saying, for example, that ++i takes a maximum of 10
milliseconds, and calculating bounds from that, doesn't mean
much if most of the time it only takes 1 or 2 microseconds, and
if it will only take more than 5 milliseconds when the number of
elements exceeds, say, 10 million. You still do get asymptotic
performance with enough elements, but the number of elements
needed to get there is so high that it won't occur in actual
practice, for performance reasons if nothing else: once you
start getting page faults, your performance will end up being
unacceptable.

Once again, agreed. I think we seem to have a clear case of a discussion
heading along the lines of:

"If and only if X (about which I personally know little), then Y."
"Well, actually, not X."
"Ok, then we're agreed that not Y."

(In this particular instance, I think what I'm trying to say is that X
is "k has no dependence on n" and Y is "the complexity analysis given
above makes sense".)

No worries, anyway, was just curious what you were driving at :)

Regards,
Stu

[...]
 
James Kanze

[...]
Ah, this is the point which I missed. If the execution time for an
instruction is dependent on the problem size then it's a different
matter entirely. Thanks for clarifying :)

It's not necessarily directly dependent on the problem size, but
if the time for an instruction can vary over several orders of
magnitude, depending on locality of access, it starts to mean that
locality of access can be more significant than complexity with
regards to execution time.

Strictly speaking, complexity still wins out in the end; at
some point, you end up getting page faults even with the fastest
algorithm, and the individual operations become enormously
expensive everywhere. In practice, however, we aren't concerned
with such cases, since when (and if) we get there, the program
will take forever anyway; the solution isn't to find a faster
algorithm, but to buy more main memory:).

BTW: I wasn't so much disagreeing with what you said, but
putting it into the practical context which was being discussed.
 
kostas

Hi,
Trying to dump a set<int> (s) into a vector (v), my guess was that
v.insert(v.end(), s.begin(), s.end())
would be at least as efficient as
copy(s.begin(), s.end(), back_inserter(v))
It turns out to be almost two times slower (even without using
reserve() before copy), and that's because it traverses the set twice.
I'm posting it as an exception to the usual (and usually correct)
advice "Prefer member functions to algorithms".

So, would it be a good idea to add an insert variant that, instead of
the last iterator of the [first,last) insertion range, would take as
argument the number of steps from the first iterator?
In this case of back insertion it would not make any difference, but
what if you wanted to insert the set at v.begin()?
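Something like the free function sketched below is what I have in mind; insert_n is a made-up name, not an existing standard function, and the rotate-based body is just one possible implementation:

    #include <algorithm>
    #include <vector>

    // Hypothetical helper: insert 'count' elements read from 'first'
    // into 'v' before position 'pos_index', without ever needing the
    // past-the-end iterator of the source range.
    template <class T, class InIt>
    void insert_n(std::vector<T>& v,
                  typename std::vector<T>::size_type pos_index,
                  InIt first,
                  typename std::vector<T>::size_type count)
    {
        const typename std::vector<T>::size_type old_size = v.size();
        v.reserve(old_size + count);        // at most one reallocation
        for (typename std::vector<T>::size_type i = 0; i < count; ++i, ++first)
            v.push_back(*first);            // single pass over the source
        // Move the newly appended elements into their final position.
        std::rotate(v.begin() + pos_index, v.begin() + old_size, v.end());
    }

With that, inserting the whole set at the front would be
insert_n(v, 0, s.begin(), s.size()), and no second traversal of the
set is needed.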

Regards
Kostas
 
