simple list efficiency question

J

Jason

Hi,

I have a few million data items being received by my program and I wish to
store them in a list, with each item being inserted in any position in the
list. Any performance tips so that my program runs ok?

I have checked out the STL and have thought of using vector, set, list,
deque and map. I dont know how these things are implemented so not sure
which is best to use or if i should implement another data structure.
Perhaps any or all of them can be used with no big problem.

I am just worried in case I waste effort coding with them and then finding i
need to implement something else because it runs so slow. I'm a novice
using c++ so it would be nice to know up front before I start learning how
to write it.

Thanks for your help.
jason
 
K

Karl Heinz Buchegger

Chris said:
Hello,




If I understand correctly you want to store these objects in a sorted
fashion. For a number like a few million my guess would be that you can
get away with storing them in a vector and after they are all read in
sorting them. Because sorting is (if the order in which the objects are
coming in is not too bad) an N*log(N) operation this should not take
prohibitively long. Furthermore, it is difficult to imagine that you
could be faster than N*log(N).

I would cross check with a map also.
 
C

Chris Dams

Hello,

Jason said:
I have a few million data items being received by my program and I wish to
store them in a list, with each item being inserted in any position in the
list. Any performance tips so that my program runs ok?
I have checked out the STL and have thought of using vector, set, list,
deque and map. I dont know how these things are implemented so not sure
which is best to use or if i should implement another data structure.
Perhaps any or all of them can be used with no big problem.

If I understand correctly you want to store these objects in a sorted
fashion. For a number like a few million my guess would be that you can
get away with storing them in a vector and after they are all read in
sorting them. Because sorting is (if the order in which the objects are
coming in is not too bad) an N*log(N) operation this should not take
prohibitively long. Furthermore, it is difficult to imagine that you
could be faster than N*log(N).

Good luck,
Chris Dams
 
T

tom_usenet

Hi,

I have a few million data items being received by my program and I wish to
store them in a list, with each item being inserted in any position in the
list. Any performance tips so that my program runs ok?

How do you decide where to insert? If it is based on some kind of
ordering, then set or map would be ideal.

If you are inserting in the middle, then vector and deque are not
good.

If you are inserting at a particular "index", then list isn't great,
since finding that index is O(n), but it will probably be better than
vector or deque.

So, basically, it depends on the details of the algorithm...

Tom

C++ FAQ: http://www.parashift.com/c++-faq-lite/
C FAQ: http://www.eskimo.com/~scs/C-faq/top.html
 
P

Peter van Merkerk

I have a few million data items being received by my program and I wish
to
store them in a list, with each item being inserted in any position in the
list. Any performance tips so that my program runs ok?

I have checked out the STL and have thought of using vector, set, list,
deque and map.

I dont know how these things are implemented so not sure
which is best to use or if i should implement another data structure.
Perhaps any or all of them can be used with no big problem.

I am just worried in case I waste effort coding with them and then finding i
need to implement something else because it runs so slow. I'm a novice
using c++ so it would be nice to know up front before I start learning how
to write it.

When selecting a container the choice should be based on:
1. How are elements accessed (sequentially, random, by key...);
2. Which operations are to be performed on the container;
3. How often will each of those operations be performed;
4. The number of elements will be stored in the container (if the number is
low just pick the most convenient one).

The C++ standard doesn't directly dictate how the standard containers
should be implemented, but it does specify the complexity of the operations
on the container (which limits the possible choices for a container
implementation). These complexity guarantees can be used as guideline for
selecting the right container for the job.

Given that the container should hold a few million elements and that
elements will be inserted at arbitrary positions makes the std::vector
class a poor choice; if an element needs to be inserted at the first
position all those million elements need to be moved in memory. OTOH the
advantage of the std::vector class is that random access of elements is
very fast.

The std::list class (typically implemented as doubly linked list) may be a
good choice if insert performance is very important for your application.
However the std::list class doesn't support random access to individual
elements, they can only be accessed sequentially. If you need to access
element 1923402, std::list is not a good choice. The std::deque class would
have been a choice given that it does support random access, but its insert
performance is only good when inserting elements at the beginning or the
end.

If you need to access elements by key the std::map class may be a good
choice (this class is often based on red-black tree algorithm). With the
std::map you don't have control over the ordering of elements, as the
ording of the elements will be such that a key lookup can be done in O(log
N). An alternative for std::map may be the non-standard (but common)
hash_map class; with proper hashing it can do key lookups in O(1).

You may just start with one of the standard containers. If none of those
containers meet your performance requirements you may eventually replace
the standard container with your own container later. Note that one of the
nice features of standard containers is that they can be easilly exchanged
for each other.
 
C

Chris Dams

If I understand correctly you want to store these objects in a sorted
fashion. For a number like a few million my guess would be that you can
get away with storing them in a vector and after they are all read in
sorting them. Because sorting is (if the order in which the objects are
coming in is not too bad) an N*log(N) operation this should not take
prohibitively long. Furthermore, it is difficult to imagine that you
could be faster than N*log(N).

Forgot to mention that if you use a vector you should not forget to
call the .reserve() method with a sensible number.
 
J

Jason

Sorry, I didnt make myself too clear. I mean when I receive items they need
to be inserted into a vector in any position, not sorted once they are all
there. For example, item 1 is processed and placed at position 0. Item 2
is processed and placed at position 0 causing the item already there to be
shifted up to position 1 and so on, each new item causing a new insert
operation as opposed to being added at the end of the list.

so if I had a million data items already in and the next one needed to go in
somewhere in the middle or even at the beginning, thinking about it as
shifting everything up one then leads to a terrible situation of gross
inefficiency.

I was wondering for example if I could get away with using
vector.insert(nextitem, positionRequired) on such a huge list without
causing the program to be massively slow? Or would a set, list or map be
any better?

Thanks in advance
 
J

Jason

Thanks, reading your answer, because accessing elements is not such a worry
and insertion performance is needed I should use List. thank god it's that
easy :) phew
 
D

Dan W.

Sorry, I didnt make myself too clear. I mean when I receive items they need
to be inserted into a vector in any position, not sorted once they are all
there. For example, item 1 is processed and placed at position 0. Item 2
is processed and placed at position 0 causing the item already there to be
shifted up to position 1 and so on, each new item causing a new insert
operation as opposed to being added at the end of the list.

so if I had a million data items already in and the next one needed to go in
somewhere in the middle or even at the beginning, thinking about it as
shifting everything up one then leads to a terrible situation of gross
inefficiency.

I was wondering for example if I could get away with using
vector.insert(nextitem, positionRequired) on such a huge list without
causing the program to be massively slow? Or would a set, list or map be
any better?

( People around here are going to hang me by the balls for saying
this, yet, someone has to say it ;-)

What you should *really* use is std::hashmap

It is NOT part of the Standard Template Library, but if you have such
large quantities of objects that need to be inserted in sorted order,
and especially if you need to find them quickly, nothing beats it.

One of the good things about hashmap is that, as with list, elements
are never moved around in memory, and your iterators are not
invalidated when elements are inserted or removed.

Hashmap takes a bit of work to setup: you need to give it functors for
magnitude and equality comparisons, as well as a hash function.
A word of advice, though: use a rather large bucket table, if you
have so many objects!, because after identifying a bucket by the hash
function, searching is slower across the elements in that bucket.
So, if you have "millions" of elements, I'd use a bucket count as
large as tens of thousands or more.

Cheers!
 
P

Pete Becker

Jason said:
Hi,

I have a few million data items being received by my program and I wish to
store them in a list, with each item being inserted in any position in the
list. Any performance tips so that my program runs ok?

It depends on some more details of how this container is going to be
used.

If the container is going to be initialized all at once and nothing
added later, then appending to a list or a vector and sorting after all
the elements have been added is faster. If there will be lookups
interspersed with additions (i.e. the container has to be in sorted
order while the elements are being added) then a set or map will be
slightly faster (and much simpler to use).

In both cases, though, having millions of elements raises the issue of
memory consumption. If the elements are large it doesn't matter much --
the container's overhead will be dwarfed by the size of the elements.
But if they're down near 100 bytes or less then memory overhead becomes
significant. A set or map typically holds three pointers plus a couple
of bits in each node, in addition to the actual data. A list holds two
pointers in each node. A vector doesn't have any per node overhead (but
it ties up a large contiguous block of memory).
 
E

EventHelix.com

I have a few million data items being received by my program and I wish to
store them in a list, with each item being inserted in any position in the
list. Any performance tips so that my program runs ok?

Since you need to insert data at any position, vector and deque will
give very poor performance.

If you need random access to any entry and the order of entries
is not important, use map.

If the order of entries is important and you do not need random
access to any entry, use list.

Sandeep
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,764
Messages
2,569,565
Members
45,041
Latest member
RomeoFarnh

Latest Threads

Top