Meyers's preference for vectors over maps

Discussion in 'C++' started by Fred Ma, Jan 28, 2004.

  1. Fred Ma

    Hello,

    I was looking at Meyers's "Effective STL", item 23
    about choosing between vectors and maps (at least
    that's the choice for me). In many cases, using
    sorted vectors is faster for lookups. The two
    reasons are (1) that vector elements are smaller,
    since map elements need at least 3 pointers, and
    (2) contiguity of vector elements in memory ensures
    fewer page faults.

    For me, I will have a container of objects that are much
    bigger than pointers, so reason (1) doesn't apply.

    Reason (2) is what I'm trying to understand better.
    Let me describe my application.

    In my case, I will have N items in the container,
    will look up n (n<N) items, do some processing,
    then replace the n worst items in the container
    with the n new items from processing. Here, "worst"
    is defined by the sorting function. The problem
    is that n can vary from 1 to N, making the advantages
    of vector vs. map very unclear. For small n,
    map is better because I'm basically interspersing
    1 modification of the database with 1 lookup.
    For large n, vector is better because it's faster
    to sort a whole bunch of objects at once rather
    than building a map in a piecemeal fashion.

    Since I can't get a clear answer from that line of
    reasoning, I looked more carefully at Meyers's
    reason (2). I will be using binary search for
    lookup, and he claims that contiguity of elements
    is good for binary search in particular. More
    specifically, I understood him as meaning that
    closeness of elements in memory should match
    closeness of elements in the traversal of
    container elements. However, my understanding
    of binary search is that it doesn't traverse
    elements in sequential order. So why would the
    contiguity of elements in a vector be of any
    advantage?

    Thanks for clarifying. And if there are any
    other considerations that would make it easier
    to decide upon a proper container, I'd appreciate
    those too. Right now, I'm tempted to go for a
    vector because I haven't used a map before. But
    for that same reason, I'm also thinking it wouldn't
    be a bad idea to dabble in maps, too.

    Fred
    --
    Fred Ma
    Dept. of Electronics, Carleton University
    1125 Colonel By Drive, Ottawa, Ontario
    Canada, K1S 5B6
     
    Fred Ma, Jan 28, 2004
    #1

  2. Fred Ma wrote:

    > In my case, I will have N items in the container,
    > will look up n (n<N) items, do some processing,
    > then replace the n worst items in the container
    > with the n new items from processing. Here, "worst"
    > is defined by the sorting function. The problem
    > is that n can vary from 1 to N, making the advantages
    > of vector vs. map very unclear. For small n,
    > map is better because I'm basically interspersing
    > 1 modification of the database with 1 lookup.
    > For large n, vector is better because it's faster
    > to sort a whole bunch of objects at once rather
    > than building a map in a piecemeal fashion.


    That looks like a good candidate for usage of a priority queue.
    Removing items from both a vector and a map is slow, and since you need
    a sorting criterion, a normal deque will not suffice.

    --
    To get my real email adress, remove the two onkas
    --
    Dipl.-Inform. Hendrik Belitz
    Central Institute of Electronics
    Research Center Juelich
     
    Hendrik Belitz, Jan 28, 2004
    #2

  3. Fred Ma

    Hendrik Belitz wrote:
    >
    > Fred Ma wrote:
    >
    > > In my case, I will have N items in the container,
    > > will look up n (n<N) items, do some processing,
    > > then replace the n worst items in the container
    > > with the n new items from processing. Here, "worst"
    > > is defined by the sorting function. The problem
    > > is that n can vary from 1 to N, making the advantages
    > > of vector vs. map very unclear. For small n,
    > > map is better because I'm basically interspersing
    > > 1 modification of the database with 1 lookup.
    > > For large n, vector is better because it's faster
    > > to sort a whole bunch of objects at once rather
    > > than building a map in a piecemeal fashion.

    >
    > That looks like a good candidate for usage of a priority queue.
    > Removing items from both a vector and a map is slow, and since you need
    > a sorting criteria, a normal deque will not suffice.



    Hi, Hendrik,

    I looked up priority queues in Josuttis's "The C++
    Standard Library". It's built on top of the standard
    iterator-supporting containers, with vector as the
    default. Wouldn't I still be faced with the same
    vector-versus-map tradeoffs for my usage? My
    impression is that the special container adapters
    simplify the interface rather than changing the
    underlying mechanism of data storage and retrieval.
    Admittedly, this could be a misimpression on my part.

    Fred
     
    Fred Ma, Jan 28, 2004
    #3
  4. Fred Ma wrote:

    > Hi, Hendrik,
    >
    > I looked up priority queues in Josuttis's "The C++
    > Standard Library". It's built on top of the standard
    > iterator-supporting containers, with the vector as a
    > default. Wouldn't I still be faced with the comparison
    > of issues between vectors and maps for my usage? My
    > impression is that the special container adapters
    > simplify the interface rather than changing the
    > underlying mechanism of data storage and retrieval.
    > Admittedly, this could be a misimpression on my part.


    Uh well, I didn't even realize that priority queues are available as an
    STL adapter. Normally, priority queues are implemented as Fibonacci heaps
    and are very efficient. Maybe you should look for some implementation which
    does it that way instead of using an STL container/adapter.

    --
    To get my real email adress, remove the two onkas
    --
    Dipl.-Inform. Hendrik Belitz
    Central Institute of Electronics
    Research Center Juelich
     
    Hendrik Belitz, Jan 28, 2004
    #4
  5. tom_usenet

    On 28 Jan 2004 08:13:37 GMT, Fred Ma <> wrote:

    >Hello,
    >
    >I was looking at Meyers's "Effective STL", item 23
    >about choosing between vectors and maps (at least
    >that's the choice for me). In many cases, using
    >sorted vectors is faster for lookups. The two
    >reasons are (1) that vector elements are smaller,
    >since map elements need at least 3 pointers, and
    >(2) contiguity of vector elements in memory ensures
    >fewer page faults.
    >
    >For me, I will have a container of objects that are much
    >bigger than pointers, so reason (1) doesn't apply.
    >
    >Reason (2) is what I'm trying to understand better.
    >Let me describe my application.
    >
    >In my case, I will have N items in the container,


    Is N fixed at runtime? IOW, is the size of the container constant?

    >will look up n (n<N) items, do some processing,
    >then replace the n worst items in the container
    >with the n new items from processing. Here, "worst"
    >is defined by the sorting function. The problem
    >is that n can vary from 1 to N, making the advantages
    >of vector vs. map very unclear. For small n,
    >map is better because I'm basically interspersing
    >1 modification of the database with 1 lookup.
    >For large n, vector is better because it's faster
    >to sort a whole bunch of objects at once rather
    >than building a map in a piecemeal fashion.


    I would recommend a vector. Keep another empty vector handy. Start by
    sorting your main vector. Calculate your new range of n items to add and
    sort it (in O(n log n)), then merge that with your old vector into the
    spare vector.

    tempvec.reserve(v.size());
    std::sort(newrange.begin(), newrange.end());
    // merge the N-n remaining elements from v with the new range:
    std::merge(v.begin() + n, v.end(), newrange.begin(), newrange.end(),
               std::back_inserter(tempvec));

    // swap
    tempvec.swap(v);
    tempvec.clear();

    etc. That way each iteration costs O(n log n) for the sort plus O(N) for
    the merge, rather than O(N log N) for a full re-sort.
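
    In case it helps, a complete toy version of that cycle might look like
    the following (the element type, the sample numbers and the assumption
    that the n worst items sort to the front are all just for illustration):

    #include <algorithm>
    #include <iostream>
    #include <iterator>
    #include <vector>

    int main()
    {
        // N = 8 items, kept sorted ascending; assume "worst" = smallest.
        const double init[] = { 5, 1, 4, 2, 8, 7, 3, 6 };
        std::vector<double> v(init, init + 8);
        std::sort(v.begin(), v.end());

        // n = 3 replacement items produced by the processing step.
        const double fresh[] = { 2.5, 9.5, 0.5 };
        std::vector<double> newrange(fresh, fresh + 3);
        const std::vector<double>::size_type n = newrange.size();

        std::vector<double> tempvec;
        tempvec.reserve(v.size());

        std::sort(newrange.begin(), newrange.end());   // O(n log n)
        // merge the N-n surviving elements of v with the new range: O(N)
        std::merge(v.begin() + n, v.end(),
                   newrange.begin(), newrange.end(),
                   std::back_inserter(tempvec));

        tempvec.swap(v);   // v is sorted again and has the same size

        std::copy(v.begin(), v.end(),
                  std::ostream_iterator<double>(std::cout, " "));
        std::cout << '\n';
        return 0;
    }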

    >
    >Since I can't get a clear answer from that line of
    >reasoning, I looked more carefully at Meyers's
    >reason (2). I will be using binary search for
    >lookup, and he claims that contiguity of elements
    >is good for binary search in particular. More
    >specifically, I understood him as meaning that
    >closeness of elements in memory should match
    >closeness of elements in the traversal of
    >container elements. However, my understanding
    >of binary search is that it doesn't traverse
    >elements in sequential order. So why would the
    >contiguity of elements in a vector be of any
    >advantage?


    Because they are all close to each other in memory. If a 4k chunk of
    the vector is pulled into cache memory, you'll get a lot of cache
    hits.

    >Thanks for clarifying. And if there are any
    >other considerations that would make it easier
    >to decide upon a proper container, I'd appreciate
    >those too. Right now, I'm tempted to go for a
    >vector because I haven't used a map before. But
    >for that reason, I'm also thinking it's not a bad
    >reason to dabble in maps, too.


    Given the fixed size, vector seems a good choice. If your elements are
    expensive to copy, I'd go for a vector of smart pointers, or just
    pointers (with manual memory management).

    Tom

    C++ FAQ: http://www.parashift.com/c++-faq-lite/
    C FAQ: http://www.eskimo.com/~scs/C-faq/top.html
     
    tom_usenet, Jan 28, 2004
    #5
  6. Daniel T.

    Fred Ma <> wrote:

    > I was looking at Meyers's "Effective STL", item 23
    > about choosing between vectors and maps (at least
    > that's the choice for me). In many cases, using
    > sorted vectors is faster for lookups. The two
    > reasons are (1) that vector elements are smaller,
    > since map elements need at least 3 pointers, and
    > (2) contiguity of vector elements in memory ensures
    > fewer page faults.


    I guess the obvious answer is to create an adaptor that will allow you
    to easily switch between the two. Then profile both and pick the most
    efficient one.
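
    Even a crude timing run tells you a lot. Something along these lines (a
    rough sketch, not a careful benchmark; exact-match hits on random doubles
    will be rare, the point is only to time the lookups):

    #include <algorithm>
    #include <cstdlib>
    #include <ctime>
    #include <iostream>
    #include <map>
    #include <utility>
    #include <vector>

    int main()
    {
        const int N = 100000;
        std::vector<double> vec;
        std::map<double, int> mp;
        for (int i = 0; i < N; ++i) {
            const double key = std::rand() / (RAND_MAX + 1.0);
            vec.push_back(key);
            mp.insert(std::make_pair(key, i));
        }
        std::sort(vec.begin(), vec.end());

        const int lookups = 1000000;
        long hits = 0;

        std::clock_t t0 = std::clock();
        for (int i = 0; i < lookups; ++i)
            hits += std::binary_search(vec.begin(), vec.end(),
                                       std::rand() / (RAND_MAX + 1.0));
        std::clock_t t1 = std::clock();
        for (int i = 0; i < lookups; ++i)
            hits += (mp.find(std::rand() / (RAND_MAX + 1.0)) != mp.end());
        std::clock_t t2 = std::clock();

        std::cout << "vector: " << double(t1 - t0) / CLOCKS_PER_SEC << "s, "
                  << "map: "    << double(t2 - t1) / CLOCKS_PER_SEC << "s"
                  << " (" << hits << " hits)\n";
        return 0;
    }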
     
    Daniel T., Jan 28, 2004
    #6
  8. Fred Ma <> wrote:
    > I looked up priority queues in Josuttis's "The C++
    > Standard Library". It's built on top of the standard
    > iterator-supporting containers, with the vector as a
    > default. Wouldn't I still be faced with the comparison
    > of issues between vectors and maps for my usage?


    If the operations provided for a priority queue fit your needs, it is almost
    certainly the better choice: this data structure is specialized for finding
    the smallest (or biggest; depends on the direction of your predicate)
    element as fast as possible. That is, insertion into a priority queue just
    does the necessary work to provide fast access to the smallest element.
    Likewise for removal.

    If you need to access or change elements other than the smallest one, things
    change quite a lot, however. The priority queue in the standard library is
    not at all suited for this, but there are priority queues which are helpful
    here, too. However, the difference from other data structures like, e.g., maps
    is reduced.

    > My
    > impression is that the special container adapters
    > simplify the interface rather than changing the
    > underlying mechanism of data storage and retrieval.


    This is true for stack and queue but not for priority_queue. The latter
    implements the logic of a 2-heap (a special case of a d-heap which is just
    a balanced tree with d children for each inner node). This is quite different
    from doing, e.g., a binary search. Also, the number of elements changed
    during inserts or removals is not linear in the size of the vector but
    logarithmic.
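
    For illustration, here is the adapter with the predicate reversed so that
    top() yields the smallest element (whether "smallest" means "worst" is, of
    course, up to your comparison function):

    #include <functional>
    #include <iostream>
    #include <queue>
    #include <vector>

    int main()
    {
        // std::greater turns the default max-heap into a min-heap, so top()
        // is the smallest element; push() and pop() are O(log N).
        std::priority_queue<double, std::vector<double>,
                            std::greater<double> > worst_first;

        worst_first.push(3.5);
        worst_first.push(1.25);
        worst_first.push(7.0);

        std::cout << "worst: " << worst_first.top() << '\n';  // 1.25
        worst_first.pop();                                    // drop it
        worst_first.push(0.5);                                // insert a replacement
        std::cout << "worst: " << worst_first.top() << '\n';  // 0.5
        return 0;
    }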

    Concerning Scott's comment about the advantages of vector over map: at first
    sight, a map looks superior to a vector. The first insight why it is not so
    in all situations is the amount of work necessary to traverse the map's
    internal tree compared to a binary search within the vector, which just
    effectively uses an "ad-hoc" binary tree. To move from one node to its child,
    the map has to dereference a pointer whereas the vector simply does a clever
    addition or subtraction. This is often much faster. Towards the end of the
    search where most of the work is going on and the steps become smaller,
    the "nodes" (in case of the map the explicit tree nodes, in case of the vector
    the implicit current location) are more likely to sit on one page for the
    vector because they are [nearly] adjacent to each other. A page miss is
    pretty expensive. Also, the vector's nodes are much smaller (just the data
    you store there) compared to the map's nodes (which typically contain at least
    three pointers for the map's maintenance in addition to the data you store)
    increasing the chances to be on one page even more. Especially for small user
    data, the vector's movements are likely to eventually move around only in a
    cache-line of the CPU while the map is still causing page misses or at least
    cache misses.

    For searching it is pretty obvious that you want to use a vector in favour of
    a map. If you can go with an unordered structure, a hash is an even better
    choice, BTW.
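
    The vector lookup amounts to no more than std::lower_bound() (or
    std::equal_range()) on the sorted range. A small sketch, with a made-up
    element type and values:

    #include <algorithm>
    #include <iostream>
    #include <vector>

    int main()
    {
        const double init[] = { 1.5, 2.5, 4.0, 4.0, 7.25 };
        std::vector<double> v(init, init + 5);   // already sorted

        // O(log N) comparisons, and every probe lands in one contiguous block.
        std::vector<double>::const_iterator it =
            std::lower_bound(v.begin(), v.end(), 4.0);
        if (it != v.end() && !(4.0 < *it))
            std::cout << "found " << *it << " at index "
                      << (it - v.begin()) << '\n';
        return 0;
    }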

    Things become less obvious if you need to change the data: if the number of
    searches compared to the number of modifications (insertions and removals;
    a change is effectively a removal followed by an insertion for both a map and
    a sorted vector, although it can be improved a little bit) becomes small, i.e.
    if you have only a few searches between modifications, things start to change:
    a huge collection of data will cause a lot of moving for vectors while it will
    cause only a few pointer adjustments for maps. On the other hand, the pointer
    adjustments are more involved than simply moving a set of small PODs in a
    fairly small vector. The tricky part is figuring out where one approach is
    better than the other in such a situation because apart from the CPU cycles
    things like the memory foot-print also have an effect. This normally results
    in vectors with even a few hundred elements being faster than maps.

    The details depend on the exact setting, including the size of elements, the
    size of the container, the total amount of memory used, the access patterns,
    etc. Thus, it is normally reasonable to implement a working solution using an
    appropriate class and later profile the whole application to find out whether
    this part is critical at all (normally the critical parts are not the ones one
    suspects...) and if so, replace the solution by alternatives to measure them,
    too. It may be reasonable to use a class with an appropriate interface which
    is internally implemented by one of the choices and whose implementation is
    later changed as needed.
    --
    <mailto:> <http://www.dietmar-kuehl.de/>
    Phaidros eaSE - Easy Software Engineering: <http://www.phaidros.com/>
     
    Dietmar Kuehl, Jan 28, 2004
    #8
  9. Hendrik Belitz wrote:
    > Uh well, I didn't even realize that priority queues are available as an
    > STL adapter. Normally, priority queues are implemented as Fibonacci heaps
    > and are very efficient.


    Are they, now? Well, the 'std::priority_queue' adapter is an order
    of magnitude faster than the best Fibonacci heap implementation I could
    do. Of course, this alone does not say much, but I could also improve on
    the adapter implementation from the SGI STL, although only by something like
    5%.

    The advantage of Fibonacci heaps over the 2-heap normally used by the
    priority queue adapter is when it comes down to modifying elements:
    the adapter does not support this operation at all, while it is the
    operation Fibonacci heaps are good at. If this feature is needed, Fibonacci
    heaps are on par with d-heaps (d == 2 is actually not the best choice here;
    I think d == 5 was the optimum in my tests) extended to provide this
    feature, for all numbers of elements I tested it with (which was up to
    10 million if I remember correctly).

    > Maybe you should look for some implementation
    > which does it that way instead of using a stl container/adapter.


    Have a look at <http://www.dietmar-kuehl.de/cxxrt/heaps.tar.gz>. This
    is a collection of priority queues I have implemented. It includes
    quite a few different ones, including Fibonacci heaps.
    --
    <mailto:> <http://www.dietmar-kuehl.de/>
    Phaidros eaSE - Easy Software Engineering: <http://www.phaidros.com/>
     
    Dietmar Kuehl, Jan 28, 2004
    #9
  10. Fred Ma

    Dietmar Kuehl wrote:
    >
    > If you need to access or change other elements than the smallest one, things
    > change quite a lot, however. The priority queue in the standard library is
    > not at all suited for this but there are priority queues which are helpful
    > here, too. However, the difference to other data structures like eg. maps
    > is reduced.


    See, that's the problem. In each outer-most iteration, I look up n items
    that can occur anywhere in the N-length vector, not just at the ends. If
    n were either small or almost the same as N, then I could say whether I have
    lots of lookups per re-sorting of the vector (after replacing the n
    worst elements with new elements created from the n elements that were
    looked up). (It's a genetic algorithm, in case that's of interest to
    anyone.) If the standard C++ library's priority queue is not optimized
    for this, then I should probably choose between a vector and a map. I'm
    not averse to exploring more custom solutions, but for my first time
    wandering away from the ol' vector, I prefer a smaller step, with lots of
    published documentation (books, website references, news articles, FAQs).

    <<SNIP>>

    > Concerning Scott's comment about the advantages of vector over map: at first
    > sight, a map looks superior to a vector. The first insight why it is not so
    > in all situations is the amount of work necessary to traverse the map's
    > internal tree compared to a binary search within the vector, which just
    > effectively uses an "ad-hoc" binary tree. To move from one node to its child,
    > the map has to dereference a pointer whereas the vector simply does a clever
    > addition or subtraction. This is often much faster. Towards the end of the
    > search where most of the work is going on and the steps become smaller,
    > the "nodes" (in case of the map the explicit tree nodes, in case of the vector
    > the implicit current location) are more likely to sit on one page for the
    > vector because they are [nearly] adjacent to each other. A page miss is
    > pretty expensive. Also, the vector's nodes are much smaller (just the data
    > you store there) compared to the map's nodes (which typically contain at least
    > three pointers for the map's maintenance in addition to the data you store)
    > increasing the chances to be on one page even more. Especially for small user
    > data, the vector's movements are likely to eventually move around only in a
    > cache-line of the CPU while the map is still causing page misses or at least
    > cache misses.
    >
    > For searching it is pretty obvious that you want to use a vector in favour of
    > a map. If you can go with an unordered structure, a hash is an even better
    > choice, BTW.
    >
    > Things become less obvious if you need to change the data: if the number of
    > searches compared to the number of modifications (insertions and removals;
    > a change is effectively a removal followed by an insertion for both a map and
    > a sorted vector, although it can be improved a little bit) becomes small, i.e.
    > if you have only a few searches between modifications, things start to change:
    > a huge collection of data will cause a lot of moving for vectors while it will
    > cause only a few pointer adjustments for maps. On the other hand, the pointer
    > adjustments are more involved than simply moving a set of small PODs in a
    > fairly small vector. The tricky part is figuring out where one approach is
    > better than the other in such a situation because apart from the CPU cycles
    > things like the memory foot-print also have an effect. This normally results
    > in vectors with even a few hundred elements being faster than maps.
    >
    > The details depend on the exact setting, including the size of elements, the
    > size of the container, the total amount of memory used, the access patterns,
    > etc. Thus, it is normally reasonable to implement a working solution using an
    > appropriate class and later profile the whole application to find out whether
    > this part is critical at all (normally the critical parts are not the ones one
    > suspects...) and if so, replace the solution by alternatives to measure them,
    > too. It may be reasonable to use a class with an appropriate interface which
    > is internally implemented by one of the choices and whose implementation is
    > later changed as needed.


    Yes, I suspected that Meyers's reasoning about page misses would be more
    applicable toward the end of the search, since you are bisecting smaller
    pieces of your vector. I just wasn't sure whether the benefit is enough
    if it only kicks in toward the end. Also, how much "toward the end"
    matters depends on the element sizes, as you mentioned, as well as the
    cache size. I have a vector of 24 doubles, plus various overhead, and I
    was trying to keep away from having to consider hardware parameters,
    e.g. cache size. But it seems there's no way to avoid it. Profiling
    seems to be the only sure way. I will put that off for now, since I'm
    trying to get some results for a deadline (no one else's problem, of course).

    I also get the point about the 3 or more pointers per node in a map.
    At first, I thought that the issue here was the size of the pointers,
    which is much smaller than the container elements in my case. However,
    you clarified that the dereferencing overhead also mattered. Thanks for
    clearing that up.

    I think I'll stick with the ol' vector this time. Expanding my horizons
    will happen another day, even such a small expansion as to maps. And, of
    course, to a comparison between maps and vectors.

    Thanks, all, for your elaborations.

    Fred
    --
    Fred Ma
    Dept. of Electronics, Carleton University
    1125 Colonel By Drive, Ottawa, Ontario
    Canada, K1S 5B6
     
    Fred Ma, Jan 28, 2004
    #10
  11. Fred Ma

    tom_usenet wrote:
    >
    > Given the fixed size, vector seems a good choice. If your elements are
    > expensive to copy, I'd go for a vector of smart pointers, or just
    > pointers (with manual memory management).



    Yeah, vectors seem to have a strong case behind them.
    Thanks. (The vector length is constant).

    Fred
    --
    Fred Ma
    Dept. of Electronics, Carleton University
    1125 Colonel By Drive, Ottawa, Ontario
    Canada, K1S 5B6
     
    Fred Ma, Jan 28, 2004
    #11
  12. Fred Ma

    "Daniel T." wrote:
    >
    > I guess the obvious answer is to create an adaptor that will allow you
    > to easily switch between the two. Then profile both and pick the most
    > efficient one.


    Daniel,

    I agree. Implementing both and profiling will certainly
    determine the answer. And I would certainly like to
    learn to use a profiler some time (it always seems like such
    a luxury, time-wise, but it would immediately answer
    lots of questions that can't be figured out otherwise).

    For starters, I will hammer out a vector implementation.

    Thanks.

    Fred
    --
    Fred Ma
    Dept. of Electronics, Carleton University
    1125 Colonel By Drive, Ottawa, Ontario
    Canada, K1S 5B6
     
    Fred Ma, Jan 28, 2004
    #12
  13. Jerry Coffin

    In article <>,
    says...

    [ ... ]

    > In my case, I will have N items in the container,
    > will look up n (n<N) items, do some processing,
    > then replace the n worst items in the container
    > with the n new items from processing. Here, "worst"
    > is defined by the sorting function. The problem
    > is that n can vary from 1 to N, making the advantages
    > of vector vs. map very unclear. For small n,
    > map is better because I'm basically interspersing
    > 1 modification of the database with 1 lookup.
    > For large n, vector is better because it's faster
    > to sort a whole bunch of objects at once rather
    > than building a map in a piecemeal fashion.


    Finding n smallest (or largest) items in a container can be done
    considerably more quickly than simply sorting the container and then
    taking the first (or last) n items. The latter requires a sort, which
    is O(N lg N), whereas the former can be done in approximately O(n lg N)
    time -- a big advantage if n is quite a bit smaller than N, and a
    progressively smaller one as n approaches N.

    The usual way of doing this is basically a modification of a Quick Sort.
    In a Quick Sort, you pick a pivot, partition your elements, and then
    recursively do the same with each partition.

    To pick n elements instead, you simply pick the partition those elements
    will fall into, and only recurse into that partition, ignoring the other.
    For example, assume you want the 10 smallest elements out of 100. For
    the moment I'll assume you pick perfect pivot elements every time, so
    the first time you divide it into the 50 smallest and the 50 largest.
    Clearly the 10 smallest aren't in the 50 largest, so you simply ignore
    the right hand partition, and continue with the left partition. The
    next time, you divide it into two groups of 25 elements apiece. Again,
    you ignore the larger elements, and concentrate only on the smaller
    ones. The next time around, you're down to 12 elements in the left
    partition -- from there, it's probably easiest to just search linearly
    to find the two largest of what's left, and discard them, leaving the 10
    that you care about.
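
    (In the standard library, std::nth_element() does essentially this
    partition-and-recurse-on-one-side selection, with linear average
    complexity; the element type and values below are just for illustration.)

    #include <algorithm>
    #include <iostream>
    #include <iterator>
    #include <vector>

    int main()
    {
        const int init[] = { 42, 7, 19, 3, 88, 23, 61, 5, 77, 11 };
        std::vector<int> v(init, init + 10);

        const int n = 3;
        // Afterwards the n smallest elements occupy v[0..n-1], in
        // unspecified order; everything from v[n] onwards is >= them.
        std::nth_element(v.begin(), v.begin() + n, v.end());

        std::cout << "3 smallest: ";
        std::copy(v.begin(), v.begin() + n,
                  std::ostream_iterator<int>(std::cout, " "));
        std::cout << '\n';
        return 0;
    }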

    IIRC, whether this is better or worse than a heap depends on the
    relative frequency of updating elements vs. inserting entirely new
    elements. If (for example) you start with 100 elements, and every time
    around, you insert 50 entirely new elements, find the 10 worst out of
    the entire array, and so on iteratively, then chances are that this will
    work better than a heap. At the opposite extreme, if the only updating
    on the data is due to the processing you do after finding your n
    elements, then chances are much better that a heap will work better.

    A heap puts in a bit more work up-front, and does a partial sorting on
    the entire array, so if you can put that to use, it will usually be a
    win. OTOH, if you just search for n elements, you put less work into
    arranging the data (so it's faster the first time) but you're leaving
    the rest of the array relatively random, so you haven't gained much for
    the next iteration.

    --
    Later,
    Jerry.

    The universe is a figment of its own imagination.
     
    Jerry Coffin, Jan 29, 2004
    #13
  14. Daniel T.

    Fred Ma <> wrote:

    > "Daniel T." wrote:
    > >
    > > I guess the obvious answer is to create an adaptor that will allow you
    > > to easily switch between the two. Then profile both and pick the most
    > > efficient one.

    >
    > I agree. Implementing both and profiling will certainly
    > determine the answer. And I would certainly like to
    > learn the profiler some time (it always seems like such
    > a luxury, time-wise, but it would immediately answer
    > lots of questions that can't be figured out otherwise).
    >
    > For starters, I will hammer out a vector implementation.


    For starters, you should design an interface that will make this special
    queue of yours easy to use for the classes that use it. Then make it so
    that you can switch out which kind of queue gets used by simply changing
    one or two lines of code. *Then* hammer out a vector implementation.

    This way, the endeavor isn't such a waste of time. Instead it is an
    opportunity to improve your design.
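
    Something as simple as a typedef buried in one small class is often enough.
    A sketch (the class and member names are made up; only the typedef and the
    two member bodies would change if you switched the storage):

    #include <algorithm>
    #include <vector>

    // Minimal facade: callers only see add/contains/size, so the storage
    // decision stays in one place.
    class Population {
    public:
        typedef std::vector<double> Storage;   // the line you would swap out

        void add(double x) {
            // keep the vector sorted on insertion
            storage_.insert(
                std::lower_bound(storage_.begin(), storage_.end(), x), x);
        }
        bool contains(double x) const {
            return std::binary_search(storage_.begin(), storage_.end(), x);
        }
        Storage::size_type size() const { return storage_.size(); }

    private:
        Storage storage_;   // sorted ascending
    };

    int main()
    {
        Population p;
        p.add(3.0); p.add(1.0); p.add(2.0);
        return p.contains(2.0) ? 0 : 1;
    }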
     
    Daniel T., Jan 29, 2004
    #14
  15. Fred Ma

    "Daniel T." wrote:
    >
    > Fred Ma <> wrote:
    >
    > > "Daniel T." wrote:
    > > >
    > > > I guess the obvious answer is to create an adaptor that will allow you
    > > > to easily switch between the two. Then profile both and pick the most
    > > > efficient one.

    > >
    > > I agree. Implementing both and profiling will certainly
    > > determine the answer. And I would certainly like to
    > > learn the profiler some time (it always seems like such
    > > a luxury, time-wise, but it would immediately answer
    > > lots of questions that can't be figured out otherwise).
    > >
    > > For starters, I will hammer out a vector implementation.

    >
    > For starters, you should design an interface that will make this special
    > queue of yours easy to use for the classes that use it. Then make it so
    > that you can switch out which kind of queue gets used by simply changing
    > one or two lines of code. *Then* hammer out a vector implementation.
    >
    > This way, the endeavor isn't such a time waste. Instead it is a
    > opportunity to improve your design.



    I believe it is not a waste of time either.
    However, I have to prioritize the things I investigate.
    I also have a deadline which I will miss if I try to
    pursue too much in terms of expanding my C++ abilities
    per se. I appreciate all the response, and my question
    was posed more for a better understanding of the relevant
    issues rather than solely determining the implementation
    path I will choose. In fact, after some more reflection,
    I think it's better not to even use a binary search (be it via
    a map or binary searching a vector). I will first sort
    the n elements I am searching for, then scan the N-element
    lookup table for those elements (n<N). After finding
    the first sought-for element in the lookup table, I will
    scan for the next sought-for element starting from that
    point in the lookup table. On average, n=N/4, so that
    means I scan an average of 4 consecutive elements before
    encountering the next sought-for element. I can live with
    that.
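
    In code, the scan I have in mind is roughly the following (a sketch only;
    the element type and the sample values are made up):

    #include <algorithm>
    #include <iostream>
    #include <vector>

    int main()
    {
        // Sorted N-element lookup table and the n keys to find in it.
        const double tbl[]  = { 1, 2, 3, 4, 5, 6, 7, 8 };
        const double want[] = { 6, 2, 8, 5 };
        std::vector<double> table(tbl, tbl + 8);
        std::vector<double> keys(want, want + 4);
        std::sort(keys.begin(), keys.end());    // sort the n keys first

        // One forward pass over the table; each key resumes the scan where
        // the previous key stopped, so the table is read sequentially.
        std::vector<double>::const_iterator pos = table.begin();
        for (std::vector<double>::const_iterator k = keys.begin();
             k != keys.end(); ++k) {
            while (pos != table.end() && *pos < *k)
                ++pos;
            if (pos != table.end() && *pos == *k)
                std::cout << "found " << *k << " at index "
                          << (pos - table.begin()) << '\n';
        }
        return 0;
    }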

    I'm sure there are more sophisticated ways that change
    depending on the sizes of n and N, but again, the sort
    and lookup are a tiny part of a much bigger problem, all
    of whose aspects I will shortly have to address in
    full. Thus the compromise in thoroughness of
    investigating the sort/lookup, though I do appreciate
    the thoughts on the things that matter for that problem.
    There will certainly be situations in future in which I
    will have to rely on the ideas that I've picked up in
    this thread.

    Fred
    --
    Fred Ma
    Dept. of Electronics, Carleton University
    1125 Colonel By Drive, Ottawa, Ontario
    Canada, K1S 5B6
     
    Fred Ma, Jan 30, 2004
    #15
  16. Fred Ma

    Jerry Coffin wrote:
    >
    > In article <>,
    > says...
    >
    > [ ... ]
    >
    > > In my case, I will have N items in the container,
    > > will look up n (n<N) items, do some processing,
    > > then replace the n worst items in the container
    > > with the n new items from processing. Here, "worst"
    > > is defined by the sorting function. The problem
    > > is that n can vary from 1 to N, making the advantages
    > > of vector vs. map very unclear. For small n,
    > > map is better because I'm basically interspersing
    > > 1 modification of the database with 1 lookup.
    > > For large n, vector is better because it's faster
    > > to sort a whole bunch of objects at once rather
    > > than building a map in a piecemeal fashion.

    >
    > Finding n smallest (or largest) items in a container can be done
    > considerably more quickly than simply sorting the container and then
    > taking the first (or last) n items. The latter requires a sort, which
    > is O(N lg N), whereas the former can be done in approximately O(n lg N)
    > time -- a big advantage if n is quite a bit smaller than N, and a
    > progressively smaller one as n approaches N.
    >
    > The usual way of doing this is basically a modification of a Quick Sort.
    > In a Quick Sort, you pick a pivot, partition your elements, and then
    > recursively do the same with each partition.
    >
    > To pick n elements instead, you simply pick the partition those elements
    > will fall into, and only recurse into that partition, ignoring the other.
    > For example, assume you want the 10 smallest elements out of 100. For
    > the moment I'll assume you pick perfect pivot elements every time, so
    > the first time you divide it into the 50 smallest and the 50 largest.
    > Clearly the 10 smallest aren't in the 50 largest, so you simply ignore
    > the right hand partition, and continue with the left partition. The
    > next time, you divide it into two groups of 25 elements apiece. Again,
    > you ignore the larger elements, and concentrate only on the smaller
    > ones. The next time around, you're down to 12 elements in the left
    > partition -- from there, it's probably easiest to just search linearly
    > to find the two largest of what's left, and discard them, leaving the 10
    > that you care about.
    >
    > IIRC, whether this is better or worse than a heap depends on the
    > relative frequency of updating elements vs. inserting entirely new
    > elements. If (for example) you start with 100 elements, and every time
    > around, you insert 50 entirely new elements, find the 10 worst out of
    > the entire array, and so on iteratively, then chances are that this will
    > work better than a heap. At the opposite extreme, if the only updating
    > on the data is due to the processing you do after finding your n
    > elements, then chances are much better that a heap will work better.
    >
    > A heap puts in a bit more work up-front, and does a partial sorting on
    > the entire array, so if you can put that to use, it will usually be a
    > win. OTOH, if you just search for n elements, you put less work into
    > arranging the data (so it's faster the first time) but you're leaving
    > the rest of the array relatively random, so you haven't gained much for
    > the next iteration.


    Jerry, I am replacing the worst n of N elements (n<N) with new elements,
    and I realize that partitioning is the way to go for that step (I am
    going to use the partition() in the STL for convenience). But prior to
    this step, I look up n out of N elements, and these are not the same
    as the n worst elements. They can occur anywhere. I also neglected
    to mention details which ultimately decided that binary searching
    (either via a tree like a map, or a binary search of a sorted vector)
    is not the best way to perform those lookups. That is, n will average
    N/4. What I will do is sort
    the n elements to be looked up, then look for each one in turn by linearly
    scanning the N elements of the lookup table. When one of the n elements
    is found, the next one will be scanned for from that position in the lookup
    table. So there will be an average of 3 skipped elements in the lookup
    table for every one found. This takes better advantage of cache locality
    than a binary search would, since the lookup table is traversed
    sequentially, and the overhead of 3 skipped elements is acceptable because
    of fast incrementing between adjacent vector elements.
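
    (A rough sketch of the replace-the-worst-n step, simplified so that a
    single double stands in for both the lookup key and the fitness;
    std::nth_element() is used here instead of partition() only because it
    finds the cut-off on its own, whereas partition() needs a threshold value
    for its predicate. The numbers are made up.)

    #include <algorithm>
    #include <iostream>
    #include <iterator>
    #include <vector>

    int main()
    {
        const double init[]  = { 0.9, 0.2, 0.7, 0.1, 0.5, 0.8, 0.3, 0.6 };
        const double fresh[] = { 0.45, 0.95, 0.65 };
        std::vector<double> pop(init, init + 8);        // the N individuals
        std::vector<double> offspring(fresh, fresh + 3);
        const std::vector<double>::size_type n = offspring.size();

        // Move the n worst (here: smallest) individuals to the front
        // without a full sort; average cost is linear in N.
        std::nth_element(pop.begin(), pop.begin() + n, pop.end());

        // Overwrite them with the n new individuals, then restore the order
        // needed for the sequential lookups.
        std::copy(offspring.begin(), offspring.end(), pop.begin());
        std::sort(pop.begin(), pop.end());

        std::copy(pop.begin(), pop.end(),
                  std::ostream_iterator<double>(std::cout, " "));
        std::cout << '\n';
        return 0;
    }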

    I guess I'm exploiting a situation-specific detail to get around the
    larger question of profiling vector/map implementations for the time
    being. The heap is a bit of a step that I'd prefer to find out more
    about, but in the future. The reason is that my background is not CS, and
    I posed my original choice as a vector/map decision because it's
    readily available in standardized form via the STL, and I've got a
    nice explanatory book about the formalities of their use in Josuttis.
    So restricting my choices is a tradeoff; certainly, I won't have the
    best solution, but I can choose the best one for which I can allot
    the time.

    Thanks for the explanation of partitioning, and the examples comparing it to
    the heap (though I have to admit, I need to follow it up with more
    textbook reading--I've briefly read about heaps in the past when
    pondering the sort/lookup problem).

    Fred
    --
    Fred Ma
    Dept. of Electronics, Carleton University
    1125 Colonel By Drive, Ottawa, Ontario
    Canada, K1S 5B6
     
    Fred Ma, Jan 30, 2004
    #16
