vector::insert performance tip.

Discussion in 'C++' started by kostas, Apr 28, 2007.

  1. kostas

    kostas Guest

    Hi,
    Trying to dump a set<int> (s) in a vector (v) my guess was that
    v.insert(v.end(),s.begin(),s.end())
    would be at least as efficient as
    copy(s.begin(), s.end(), back_inserter(v))
    It turns out that it is almost two times slower (even without using
    reserve() before copy) and that's because it traverses the set twice.
    I'm posting it as an exception to the usual (and usually correct)
    advice
    "Prefer member functions to algorithms".

    Regards
    Kostas
    kostas, Apr 28, 2007
    #1
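
    A minimal sketch of the comparison being described, for reference (the
    element count, the use of <chrono> for timing, and the output text are
    illustrative assumptions; absolute numbers will vary with compiler,
    standard library, and machine):

    #include <algorithm>
    #include <chrono>
    #include <iostream>
    #include <iterator>
    #include <set>
    #include <vector>

    int main()
    {
        std::set<int> s;
        for (int i = 0; i < 1000000; ++i)
            s.insert(i);

        {   // member function: vector::insert with an iterator range
            std::vector<int> v;
            auto t0 = std::chrono::steady_clock::now();
            v.insert(v.end(), s.begin(), s.end());
            auto t1 = std::chrono::steady_clock::now();
            std::cout << "insert: "
                      << std::chrono::duration<double>(t1 - t0).count()
                      << " s\n";
        }
        {   // algorithm: std::copy through a back_inserter
            std::vector<int> v;
            auto t0 = std::chrono::steady_clock::now();
            std::copy(s.begin(), s.end(), std::back_inserter(v));
            auto t1 = std::chrono::steady_clock::now();
            std::cout << "copy:   "
                      << std::chrono::duration<double>(t1 - t0).count()
                      << " s\n";
        }
    }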

  2. kostas wrote:
    > Trying to dump a set<int> (s) in a vector (v) my guess was that
    > v.insert(v.end(),s.begin(),s.end())
    > would be at least as efficient as
    > copy(s.begin(), s.end(), back_inserter(v))
    > It turns out that it is almost two times slower (even without using
    > reserve() before copy) and that's because it traverses the set twice.


    I don't see why it would need to traverse the set twice. Any hint
    to why it was happening?

    > I'm posting it as an exception to the usual (and usually correct)
    > advice
    > "Prefer member functions to algorithms".


    I don't think the advice has anything to do with performance. The
    usual (and always correct) advice WRT performance is "measure first
    then try to optimize".

    V
    --
    Please remove capital 'A's when replying by e-mail
    I do not respond to top-posted replies, please don't ask
    Victor Bazarov, Apr 28, 2007
    #2

  3. kostas

    kostas Guest

    On Apr 28, 10:49 pm, "Victor Bazarov" <> wrote:
    > I don't see why it would need to traverse the set twice. Any hint
    > to why it was happening?


    The STL implementation I use first needs to calculate the distance
    s.end() - s.begin() in order to reserve the appropriate space. Is this
    not the standard (or the usual) implementation of vector::insert?

    > I don't think the advice has anything to do with performance.


    With what, then?
    kostas, Apr 28, 2007
    #3
  4. On 28 Apr 2007 14:32:55 -0700, kostas <> wrote:
    >On Apr 28, 10:49 pm, "Victor Bazarov" <> wrote:
    >> I don't see why it would need to traverse the set twice. Any hint
    >> to why it was happening?

    >
    >The STL implementation I use first needs to calculate the
    >distance s.end() - s.begin() in order to reserve the appropriate
    >space. Is this not the standard (or the usual) implementation
    >of vector::insert ?


    It's probably the 'normal' implementation which is faster for
    random-access iterators but slower in some cases of bidirectional
    iterators. I guess your number of elements is very small. Otherwise
    the vector relocations in the copy(...,back_inserter(v)) variant would
    be more time-consuming than the double traversal.


    --
    Roland Pibinger
    "The best software is simple, elegant, and full of drama" - Grady Booch
    Roland Pibinger, Apr 29, 2007
    #4
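
    For reference, the reallocation cost of the copy/back_inserter variant
    can be removed entirely by reserving up front, since the set already
    knows its own size; a minimal sketch, written as a free function with a
    made-up name (set_to_vector):

    #include <algorithm>
    #include <iterator>
    #include <set>
    #include <vector>

    // The copy/back_inserter variant with the reallocation cost removed by
    // reserving the exact size in advance; the set is still traversed only
    // once.
    std::vector<int> set_to_vector(const std::set<int>& s)
    {
        std::vector<int> v;
        v.reserve(s.size());                     // one allocation, no relocations
        std::copy(s.begin(), s.end(),
                  std::back_inserter(v));        // single pass over s
        return v;
    }
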
  5. kostas

    kostas Guest

    On Apr 29, 12:09 pm, (Roland Pibinger) wrote:
    > I guess your number of elements is very small.

    Wrong guess. Actually, in order to measure the difference I used one
    million ints and more.

    > Otherwise
    > the vector relocations in the copy(...,back_inserter(v)) variant would
    > be more time-consuming than the double traversal.


    Both the vector expansion and the set traversal are amortized linear
    operations in the number of items. It just happens that in this case
    of built-in types the constant factor of the expansion is smaller than
    that of the traversal.
    kostas, Apr 29, 2007
    #5
  6. James Kanze

    James Kanze Guest

    On Apr 28, 9:49 pm, "Victor Bazarov" <> wrote:
    > kostas wrote:
    > > Trying to dump a set<int> (s) in a vector (v) my guess was that
    > > v.insert(v.end(),s.begin(),s.end())
    > > would be at least as efficient as
    > > copy(s.begin(), s.end(), back_inserter(v))
    > > It turns out that it is almost two times slower (even without using
    > > reserve() before copy) and that's because it traverses the set twice.


    > I don't see why it would need to traverse the set twice. Any hint
    > to why it was happening?


    It's probably an "optimization":). If the number of elements
    in the set is large, and the elements are expensive to copy,
    it's faster to traverse the set a first time to calculate how
    many elements are needed, ensure sufficient capacity, and then
    traverse the set a second time to copy them.

    > > I'm posting it as an exception to the usual (and usually correct)
    > > advice
    > > "Prefer member functions to algorithms".


    > I don't think the advice has anything to do with performance. The
    > usual (and always correct) advice WRT performance is "measure first
    > then try to optimize".


    And this turns out to be a good example of why:). Off hand, I
    would have imagined that traversing the set twice would be
    faster, because it avoids unnecessary allocations. Apparently,
    I'd have guessed wrong. And apparently, the author of the STL
    he's using guessed wrong as well.

    (I suspect which form is faster will depend on any number of
    external factors, but if the number of elements is large...
    std::set has very poor locality, so you're likely to get some
    paging, or at least cache misses, in each pass.)

    --
    James Kanze (Gabi Software) email:
    Conseils en informatique orientée objet/
    Beratung in objektorientierter Datenverarbeitung
    9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
    James Kanze, Apr 29, 2007
    #6
  7. James Kanze

    James Kanze Guest

    On Apr 28, 11:32 pm, kostas <> wrote:
    > On Apr 28, 10:49 pm, "Victor Bazarov" <> wrote:


    > > I don't see why it would need to traverse the set twice. Any hint
    > > to why it was happening?


    > The STL implementation I use first needs to calculate
    > the distance s.end() - s.begin() in order to reserve the
    > appropriate space. Is this not the standard (or the usual)
    > implementation of vector::insert ?


    It's the required implementation: "Complexity: The constructor
    template <class InputIterator> vector(InputIterator first,
    InputIterator last) makes only N calls to the copy constructor
    of T (where N is the distance between first and last) and no
    reallocations if iterators first and last are of forward,
    bidirectional, or random access categories." The apparent
    assumption in the standard is that copies and allocations are
    expensive, and iteration is cheap. Copying basic types,
    however, is very, very cheap; modern allocators aren't that
    expensive, and iterating over node based containers can be
    extremely expensive due to their poor locality.

    --
    James Kanze (Gabi Software) email:
    Conseils en informatique orientée objet/
    Beratung in objektorientierter Datenverarbeitung
    9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
    James Kanze, Apr 29, 2007
    #7
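
    A rough sketch of how a range insert can meet that requirement for
    forward, bidirectional, or random-access iterators; the std::distance
    pass is what produces the second traversal of a std::set. It is written
    here as a free function appending at the end (append_range is a made-up
    name); a real vector::insert must also handle insertion in the middle,
    input iterators, and exception safety.

    #include <iterator>
    #include <vector>

    template <typename T, typename ForwardIt>
    void append_range(std::vector<T>& v, ForwardIt first, ForwardIt last)
    {
        // First pass over [first, last): count the elements, so that at
        // most one reallocation is needed and each element is copied
        // exactly once.
        typename std::iterator_traits<ForwardIt>::difference_type n =
            std::distance(first, last);
        v.reserve(v.size() + n);

        // Second pass: copy the elements. For a std::set this means
        // walking the tree twice in total.
        for (; first != last; ++first)
            v.push_back(*first);
    }
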
  8. James Kanze

    James Kanze Guest

    On Apr 29, 11:09 am, (Roland Pibinger) wrote:
    > On 28 Apr 2007 14:32:55 -0700, kostas <> wrote:


    > >On Apr 28, 10:49 pm, "Victor Bazarov" <> wrote:
    > >> I don't see why it would need to traverse the set twice. Any hint
    > >> to why it was happening?


    > >The STL implementation I use first needs to calculate the
    > >distance s.end() - s.begin() in order to reserve the appropriate
    > >space. Is this not the standard (or the usual) implementation
    > >of vector::insert ?


    > It's probably the 'normal' implementation which is faster for
    > random-access iterators but slower in some cases of bidirectional
    > iterators. I guess your number of elements is very small. Otherwise
    > the vector relocations in the copy(...,back_inserter(v)) variant would
    > be more time-consuming than the double traversal.


    I just did the test with 10 million members in the set. I get
    similar results. I suspect that the lack of locality in node
    based containers is the reason.


    --
    James Kanze (Gabi Software) email:
    Conseils en informatique orientée objet/
    Beratung in objektorientierter Datenverarbeitung
    9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
    James Kanze, Apr 29, 2007
    #8
  9. James Kanze

    James Kanze Guest

    On Apr 29, 1:28 pm, kostas <> wrote:

    [...]
    > Both the vector expansion and the set traversal are amortized linear
    > operations in the number of items.


    Which means very little on a modern machine. The time it takes
    to execute a single operation can vary in significant
    proportions depending on whether the required variables are in
    registers, in cache, in main memory, or must be paged in. On
    the machines I use, something like iter++, where iter is an
    std::set<>::iterator, can vary from less than a microsecond
    (everything in registers) to somewhere around 10 milliseconds
    (if I get a page miss)---4 orders of magnitude.


    --
    James Kanze (Gabi Software) email:
    Conseils en informatique orientée objet/
    Beratung in objektorientierter Datenverarbeitung
    9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
    James Kanze, Apr 29, 2007
    #9
  10. "James Kanze" <> wrote in message
    news:...
    On Apr 29, 1:28 pm, kostas <> wrote:

    [...]
    >> Both the vector expansion and the set traversal are amortized linear
    >> operations in the number of items.


    >Which means very little on a modern machine. The time it takes
    >to execute a single operation can vary in significant
    >proportions depending on whether the required variables are in
    >registers, in cache, in main memory, or must be paged in. On
    >the machines I use, something like iter++, where iter is an
    >std::set<>::iterator, can vary from less than a microsecond
    >(everything in registers) to somewhere around 10 milliseconds
    >(if I get a page miss)---4 orders of magnitude.


    [...]

    I'm not sure if I'm missing the point here, but for the purposes of
    complexity analysis, does it actually matter how long each individual
    instruction is taking? If we've got a constant (i.e. independent of the
    number of items) upper bound for the time taken for an instruction (however
    large), then an operation which requires a number of instructions that is
    linear in terms of the problem size will run in linear time. For example, if
    I've got an operation on n items that requires no more than (say) 2n
    instructions, with each instruction taking at most k milliseconds, then the
    maximum time taken on a problem of size n is 2nk milliseconds, which (for k
    not a function of n) is linear in n. The actual value of k isn't relevant to
    the complexity analysis, even if it's relevant to the person implementing
    the code.

    Regards,
    Stu
    Stuart Golodetz, Apr 30, 2007
    #10
  11. kostas

    kostas Guest

    On Apr 29, 9:22 pm, James Kanze <> wrote:
    > I just did the test with 10 million members in the set. I get
    > similar results. I suspect that the lack of locality in node
    > based containers is the reason.


    It's not only memory locality issues that are involved. It's also
    algorithmic complexity. Set traversal is amortized linear, but not as
    "linear" as list traversal, for example.
    kostas, Apr 30, 2007
    #11
  12. James Kanze

    James Kanze Guest

    On Apr 30, 2:30 am, "Stuart Golodetz"
    <> wrote:
    > "James Kanze" <> wrote in message


    > news:...
    > On Apr 29, 1:28 pm, kostas <> wrote:


    > [...]


    > >> Both the vector expansion and the set traversal are amortized linear
    > >> operations in the number of items.

    > >Which means very little on a modern machine. The time it takes
    > >to execute a single operation can vary in significant
    > >proportions depending on whether the required variables are in
    > >registers, in cache, in main memory, or must be paged in. On
    > >the machines I use, something like iter++, where iter is an
    > >std::set<>::iterator, can vary from less than a microsecond
    > >(everything in registers) to somewhere around 10 milliseconds
    > >(if I get a page miss)---4 orders of magnitude.


    > [...]


    > I'm not sure if I'm missing the point here, but for the purposes of
    > complexity analysis, does it actually matter how long each individual
    > instruction is taking?


    You're missing the point that what the original poster measured
    was execution time. My point is, precisely, that complexity has
    little relationship to execution time. It's a useful indicator
    as to whether something will scale, but even then, for example,
    an O(n) algorithm will suddenly become much, much slower if page
    faults occur, and for very large data sets, locality can become
    as important a consideration as complexity.

    > If we've got a constant (i.e. independent of the number of
    > items) upper bound for the time taken for an instruction
    > (however large), then an operation which requires a number of
    > instructions that is linear in terms of the problem size will
    > run in linear time.


    No. That's only true if the time taken for an instruction is
    independent of the size of the data set. Paging and caching
    introduce a non-linear component into the execution time of
    instructions, which can depend on the size of the data set.

    > For example, if
    > I've got an operation on n items that requires no more than (say) 2n
    > instructions, with each instruction taking at most k milliseconds, then the
    > maximum time taken on a problem of size n is 2nk milliseconds, which (for k
    > not a function of n) is linear in n. The actual value of k isn't relevant to
    > the complexity analysis, even if it's relevant to the person implementing
    > the code.


    Once k has any dependence on n, your complexity analysis no
    longer makes reasonable predictions concerning the size.
    Saying, for example, that ++i takes a maximum of 10
    milliseconds, and calculating bounds from that, doesn't mean
    much if most of the time, it only takes 1 or 2 microseconds, and
    if it will only take more than 5 milliseconds if the number of
    elements exceeds, say, 10 million. You still do get asymptotic
    performance with enough elements, but the number of elements
    needed to get there is so high that it won't occur in actual
    practice. For performance reasons, if nothing else; once you
    start getting page faults, your performance will end up being
    unacceptable.

    --
    James Kanze (GABI Software) email:
    Conseils en informatique orientée objet/
    Beratung in objektorientierter Datenverarbeitung
    9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
    James Kanze, Apr 30, 2007
    #12
  13. James Kanze

    James Kanze Guest

    On Apr 30, 8:57 am, kostas <> wrote:
    > On Apr 29, 9:22 pm, James Kanze <> wrote:


    > > I just did the test with 10 million members in the set. I get
    > > similar results. I suspect that the lack of locality in node
    > > based containers is the reason.


    > It's not only memory locality issues involved. It's also algorithmic
    > complexity. Set traversal is amortized linear but not so "linear" as
    > list traversal for example.


    Both can play a role. Typically allocators will allocate
    successively allocated objects close to one another, which means
    that locality when iterating over a set will be better if the
    set was created by insertions in order. A quick test of 50
    million elements on a Linux based PC shows that iterating over a
    set where the elements were inserted in a random order takes
    roughly 8 or 9 times longer than iterating over one where they
    were inserted in the final order (and that iterating over an
    std::list is considerably faster than either---about 3 or 4
    times faster than the sorted set). This is probably due
    partially to a simpler algorithm (i.e. the elementary operation
    of iter++ is faster, even if everything is already in the
    cache), and partially due to better locality: the node of a list
    is smaller, so more of them fit into each cache line.

    --
    James Kanze (GABI Software) email:
    Conseils en informatique orientée objet/
    Beratung in objektorientierter Datenverarbeitung
    9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
    James Kanze, Apr 30, 2007
    #13
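
    A sketch of the kind of test being described: time a full traversal of
    a set built by in-order insertion, a set built by random-order
    insertion, and a list with the same contents. The element count, the
    <chrono>/<random> machinery and the function name are illustrative
    assumptions; the ratios, not the absolute times, are the point.

    #include <algorithm>
    #include <chrono>
    #include <iostream>
    #include <list>
    #include <numeric>
    #include <random>
    #include <set>
    #include <vector>

    template <typename Container>
    double traversal_seconds(const Container& c)
    {
        auto t0 = std::chrono::steady_clock::now();
        long long sum = 0;
        for (int x : c)
            sum += x;                   // keep the traversal from being optimized away
        auto t1 = std::chrono::steady_clock::now();
        volatile long long sink = sum;  // force 'sum' to be used
        (void)sink;
        return std::chrono::duration<double>(t1 - t0).count();
    }

    int main()
    {
        const int n = 5000000;
        std::vector<int> values(n);
        std::iota(values.begin(), values.end(), 0);

        std::set<int> ordered(values.begin(), values.end());   // inserted in final order

        std::shuffle(values.begin(), values.end(), std::mt19937(42));
        std::set<int> shuffled(values.begin(), values.end());  // inserted in random order

        std::list<int> lst(ordered.begin(), ordered.end());

        std::cout << "ordered set:  " << traversal_seconds(ordered)  << " s\n"
                  << "shuffled set: " << traversal_seconds(shuffled) << " s\n"
                  << "list:         " << traversal_seconds(lst)      << " s\n";
    }
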
  14. James Kanze wrote:
    > On Apr 30, 2:30 am, "Stuart Golodetz"
    > <> wrote:
    >> "James Kanze" <> wrote in message

    >
    >> news:...
    >> On Apr 29, 1:28 pm, kostas <> wrote:

    >
    >> [...]

    >
    >>>> Both the vector expansion and the set traversal are amortized linear
    >>>> operations in the number of items.
    >>> Which means very little on a modern machine. The time it takes
    >>> to execute a single operation can vary in significant
    >>> proportions depending on whether the required variables are in
    >>> registers, in cache, in main memory, or must be paged in. On
    >>> the machines I use, something like iter++, where iter is an
    >>> std::set<>::iterator, can vary from less than a microsecond
    >>> (everything in registers) to somewhere around 10 milliseconds
    >>> (if I get a page miss)---4 orders of magnitude.

    >
    >> [...]

    >
    >> I'm not sure if I'm missing the point here, but for the purposes of
    >> complexity analysis, does it actually matter how long each individual
    >> instruction is taking?

    >
    > You're missing the point that what the original poster measured
    > was execution time. My point is, precisely, that complexity has
    > little relationship to execution time. It's a useful indicator
    > as to whether something will scale, but even then, for example,
    > an O(n) algorithm will suddenly become much, much slower if page
    > faults occur, and for very large data sets, locality can become
    > as important a consideration as complexity.


    Agreed.

    >> If we've got a constant (i.e. independent of the number of
    >> items) upper bound for the time taken for an instruction
    >> (however large), then an operation which requires a number of
    >> instructions that is linear in terms of the problem size will
    >> run in linear time.

    >
    > No. That's only true if the time taken for an instruction is
    > independent of the size of the data set.


    I think that's what I said, actually, assuming we equate "number of
    items" and "size of the data set".

    > Paging and caching
    > introduce a non-linear component into the execution time of
    > instructions, which can depend on the size of the data set.


    Ah, this is the point which I missed. If the execution time for an
    instruction is dependent on the problem size then it's a different
    matter entirely. Thanks for clarifying :)

    >> For example, if
    >> I've got an operation on n items that requires no more than (say) 2n
    >> instructions, with each instruction taking at most k milliseconds, then the
    >> maximum time taken on a problem of size n is 2nk milliseconds, which (for k
    >> not a function of n) is linear in n. The actual value of k isn't relevant to
    >> the complexity analysis, even if it's relevant to the person implementing
    >> the code.

    >
    > Once k has any dependence on n, your complexity analysis no
    > longer makes reasonable predictions concerning the size.
    > Saying, for example, that ++i takes a maximum of 10
    > milliseconds, and calculating bounds from that, doesn't mean
    > much if most of the time, it only takes 1 or 2 microseconds, and
    > if it will only take more than 5 milliseconds if the number of
    > elements exceeds, say, 10 million. You still do get asymptotic
    > performance with enough elements, but the number of elements
    > needed to get there is so high that it won't occur in actual
    > practice. For performance reasons, if nothing else; once you
    > start getting page faults, your performance will end up being
    > unacceptable.


    Once again, agreed. I think we seem to have a clear case of a discussion
    heading along the lines of:

    "If and only if X (about which I personally know little), then Y."
    "Well, actually, not X."
    "Ok, then we're agreed that not Y."

    (In this particular instance, I think what I'm trying to say is that X
    is "k has no dependence on n" and Y is "the complexity analysis given
    above makes sense".)

    No worries, anyway, was just curious what you were driving at :)

    Regards,
    Stu

    [...]
    Stuart Golodetz, Apr 30, 2007
    #14
  15. James Kanze

    James Kanze Guest

    On Apr 30, 5:14 pm, Stuart Golodetz
    <> wrote:
    > James Kanze wrote:
    > > On Apr 30, 2:30 am, "Stuart Golodetz"
    > > <> wrote:
    > >> "James Kanze" <> wrote in message


    [...]
    > > Paging and caching
    > > introduce a non-linear component into the execution time of
    > > instructions, which can depend on the size of the data set.


    > Ah, this is the point which I missed. If the execution time for an
    > instruction is dependent on the problem size then it's a different
    > matter entirely. Thanks for clarifying :)


    It's not necessarily directly dependent on the problem size, but
    if the time for an instruction can vary over several magnitudes,
    depending on locality of access, it starts to mean that locality
    of access can be more significant than complexity with regard to
    execution time.

    Strictly speaking, complexity still wins out in the end; at
    some point, you end up getting page faults even with the fastest
    algorithm, and the individual operations become enormously
    expensive everywhere. In practice, however, we aren't concerned
    with such cases, since when (and if) we get there, the program
    will take forever anyway; the solution isn't to find a faster
    algorithm, but to buy more main memory:).

    BTW: I wasn't so much disagreeing with what you said, but
    putting it into the practical context which was being discussed.


    --
    James Kanze (Gabi Software) email:
    Conseils en informatique orientée objet/
    Beratung in objektorientierter Datenverarbeitung
    9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
    James Kanze, May 1, 2007
    #15
  16. kostas

    kostas Guest

    On Apr 28, 7:53 pm, kostas <> wrote:
    > Hi,
    > Trying to dump a set<int> (s) in a vector (v) my guess was that
    > v.insert(v.end(),s.begin(),s.end())
    > would be at least as efficient as
    > copy(s.begin(), s.end(), back_inserter(v))
    > It turns out that it is almost two times slower (even without using
    > reserve() before copy) and that's because it traverses the set twice.
    > I'm posting it as an exception to the usual (and usually correct)
    > advice
    > "Prefer member functions to algorithms".


    So, would it be a good idea to add an insert variant that, instead of
    the last iterator of the [first,last) insertion range, would take as
    an argument the number of steps from the first iterator?
    In the case of back insertion it would not make any difference, but
    what if you wanted to insert the set at v.begin()?

    Regards
    Kostas
    kostas, May 1, 2007
    #16
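
    A sketch of what the proposed interface might look like as a non-member
    helper (insert_n is a made-up name; nothing like it exists in
    std::vector). Because the caller supplies the count, no pass over the
    source range is needed to size the destination, even for bidirectional
    iterators. For simplicity this sketch builds a new vector and swaps; a
    real member function could shift the existing tail in place instead.

    #include <vector>

    template <typename T, typename InputIt>
    typename std::vector<T>::iterator
    insert_n(std::vector<T>& v,
             typename std::vector<T>::iterator pos,
             InputIt first,
             typename std::vector<T>::size_type n)
    {
        typename std::vector<T>::difference_type offset = pos - v.begin();

        // The caller supplies n, so the destination can be sized without
        // an extra pass over the source range.
        std::vector<T> tmp;
        tmp.reserve(v.size() + n);
        tmp.insert(tmp.end(), v.begin(), pos);      // existing prefix
        for (typename std::vector<T>::size_type i = 0; i < n; ++i, ++first)
            tmp.push_back(*first);                  // the n new elements
        tmp.insert(tmp.end(), pos, v.end());        // existing suffix
        v.swap(tmp);
        return v.begin() + offset;                  // first inserted element
    }

    // Usage, e.g. to insert a whole set at the front of a vector:
    //     insert_n(v, v.begin(), s.begin(), s.size());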
