Efficiency of std::sort on std::deque, plus an idiom issue

Stephen Horne

Is sorting a deque using std::sort O(n log n) as for an array or
vector?

It's a fairly common problem to generate/receive a large number of
keys and related data unordered, potentially with recurring keys, and
later on to need both to find particular keys randomly, and to iterate
through all keys in order several times.

Would your first choice be to...

1. Store the keys and data in an std::map, so that duplicate keys are
detected and handled as soon as they occur, and the data is always
sorted.

2. Store the keys and data in an std::vector or std::deque. Once all
data is in, sort it and handle duplicate keys.

3. Store the data in a priority queue. Once all data is in, extract
it in order, handling duplicate keys, writing to the container
most appropriate for later use (std::vector/std::deque/std::list).

4. Something else

Assume the data is substantial, but not enough to cause out-of-memory
problems on even quite old machines like mine. Efficiency may be an
issue, but not the only issue.

Just questioning a common habit of mine. My reflex is the first
choice, but the second/third have some appealing features WRT
separating phases of processing - first get the data, then sort/handle
duplicates/generally process the data, then use the data. With the
first approach, being forced to handle duplicate keys in the first
loop tends to mean the first two phases get merged into one.

Obviously the generating/collecting and processing of the data is
still easily separated into different functions as needed, but maybe
that splitting into phases would be better.
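
A rough sketch of what option 3 might look like, assuming a hypothetical
Record type with an int key and a string payload, and an arbitrary
concatenate-on-duplicate policy:

#include <queue>
#include <string>
#include <vector>

// Hypothetical record type, just for illustration.
struct Record {
    int key;
    std::string data;
};

// Order the queue so the smallest key comes out first.
struct ByKeyGreater {
    bool operator()(const Record& a, const Record& b) const {
        return a.key > b.key;
    }
};

std::vector<Record> drain_in_key_order(
    std::priority_queue<Record, std::vector<Record>, ByKeyGreater>& pq)
{
    std::vector<Record> out;
    while (!pq.empty()) {
        Record r = pq.top();
        pq.pop();
        // Duplicates emerge adjacently in key order, so merge them here.
        if (!out.empty() && out.back().key == r.key)
            out.back().data += r.data;
        else
            out.push_back(r);
    }
    return out;
}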
 
Bo Persson

Stephen said:
Is sorting a deque using std::sort O(n log n) as for an array or
vector?

Yes, but perhaps with a larger constant factor, as the iterators have
another level of indirection. Sorting a deque might take roughly twice
as long.
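
A trivial example, just to show that std::sort accepts deque iterators
directly:

#include <algorithm>
#include <deque>

void sort_deque(std::deque<int>& d)
{
    // deque iterators are random access, so std::sort accepts them and
    // still does O(n log n) comparisons; each element access just goes
    // through one more level of indirection than with a vector.
    std::sort(d.begin(), d.end());
}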

Stephen said:
It's a fairly common problem to generate/receive a large number of
keys and related data unordered, potentially with recurring keys,
and later on to need both to find particular keys randomly, and to
iterate through all keys in order several times.

Would your first choice be to...

1. Store the keys and data in an std::map, so that duplicate keys
are detected and handled as soon as they occur, and the data is
always sorted.

2. Store the keys and data in an std::vector or std::deque. Once
all data is in, sort it and handle duplicate keys.

3. Store the data in a priority queue. Once all data is in, extract
it in order, handling duplicate keys, writing to the container
most appropriate for later use
(std::vector/std::deque/std::list).

4. Something else

I would go with 2, the vector, unless you have found some reason not to.
It definitely uses less memory than the other choices, unless there is
an excessive number of duplicate keys. In that case, a map might be
better.

Stephen said:
Assume the data is substantial, but not enough to cause
out-of-memory problems on even quite old machines like mine.
Efficiency may be an issue, but not the only issue.

If you read all the data up front, loading a vector and sorting it
might be amazingly efficient. A map might be better when data is
constantly added and deleted.
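
For instance, something along these lines, with a made-up Record type
standing in for the real key and payload:

#include <algorithm>
#include <string>
#include <vector>

// Made-up record type, just for illustration.
struct Record {
    int key;
    std::string data;
};

void sort_by_key(std::vector<Record>& records)
{
    // All records were push_back'ed in arrival order; one sort at the
    // end puts them into key order, and contiguous storage keeps the
    // constant factor small.
    std::sort(records.begin(), records.end(),
              [](const Record& a, const Record& b) { return a.key < b.key; });
}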



Bo Persson
 
James Kanze

Stephen said:
Is sorting a deque using std::sort O(n log n) as for an array
or vector?

It's a fairly common problem to generate/receive a large
number of keys and related data unordered, potentially with
recurring keys, and later on to need both to find particular
keys randomly, and to iterate through all keys in order
several times.

I'm assuming by this that you get all of the data up front.

Stephen said:
Would your first choice be to...
1. Store the keys and data in an std::map, so that duplicate keys are
detected and handled as soon as they occur, and the data is always
sorted.
2. Store the keys and data in an std::vector or std::deque. Once all
data is in, sort it and handle duplicate keys.
3. Store the data in a priority queue. Once all data is in, extract
it in order, handling duplicate keys, writing to the container
most appropriate for later use (std::vector/std::deque/std::list).
4. Something else
Assume the data is substantial, but not enough to cause
out-of-memory problems on even quite old machines like mine.
Efficiency may be an issue, but not the only issue.
Just questioning a common habit of mine. My reflex is the
first choice, but the second/third have some appealing
features WRT separating phases of processing - first get the
data, then sort/handle duplicates/generally process the data,
then use the data. With the first approach, being forced to
handle duplicate keys in the first loop tends to mean the
first two phases get merged into one.

The first is the obvious choice. It's the simplest. (As you
say, it handles the case of duplicates automatically.) I
wouldn't bother with even considering the others unless you
actually had performance problems. If performance (or memory
size) is an issue, and all of the data are read before any
processing occurs, 2 could be a fairly simple optimization;
another variant on 2 could be to use lower_bound to insert each
element in the correct location in the vector immediately.
(This is probably slower than sorting after the fact, but it
makes handling of duplicates somewhat easier, since they never
get into the data set. Of course, if the duplicates really are
duplicates (same key AND same data, so that it doesn't matter
which one you keep), then std::sort followed by std::unique is
pretty simple.)
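
A rough sketch of that lower_bound variant, with a made-up Record type
and an arbitrary merge policy standing in for the real ones:

#include <algorithm>
#include <string>
#include <vector>

// Made-up record type, just for illustration.
struct Record {
    int key;
    std::string data;
};

// Insert each incoming record at its sorted position so duplicates
// never enter the container: O(log n) to find the spot, O(n) to shift
// elements, so usually slower overall than one sort at the end.
void insert_sorted(std::vector<Record>& records, const Record& r)
{
    auto pos = std::lower_bound(records.begin(), records.end(), r,
        [](const Record& a, const Record& b) { return a.key < b.key; });

    if (pos != records.end() && pos->key == r.key)
        pos->data += r.data;            // duplicate key: merge in place
    else
        records.insert(pos, r);
}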

Stephen said:
Obviously the generating/collecting and processing of the data
is still easily separated into different functions as needed,
but maybe that splitting into phases would be better.

If you didn't have a tool already available which did it all at
once, maybe. Given the availability of std::map, however, it
seems like just extra effort.
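
With a map, the whole collection phase can be as small as something
like this (the int key and string payload are just placeholders):

#include <map>
#include <string>
#include <utility>

void add_record(std::map<int, std::string>& table,
                int key, const std::string& data)
{
    // insert() reports whether the key was new; if it wasn't, merge
    // the new payload into the existing entry instead.
    auto result = table.insert(std::make_pair(key, data));
    if (!result.second)
        result.first->second += data;   // duplicate key: merge payloads
}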
 
Stephen Horne

James said:
I'm assuming by this that you get all of the data up front.

Sort of. Data may be processed either as part of collecting/generating
or as part of duplicate handling or both.

To give a simple example, imagine generating a report from a database
which doesn't have the right indexes to iterate directly in the needed
order. Phase 1, you scan the database, maybe doing simple row-by-row
calculations, but mostly just picking out the fields you need and
saving the results.

Phase 2 (which in the std::map case isn't a separate phase), you handle
output-order 'key' collisions by merging items. Maybe you keep totals,
maybe you want an average - whatever you need for your report.

Phase 3+, you iterate in output-key order, calculate subtotals etc,
and dump out the output.
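
In the sort-afterwards version, phase 2 would be roughly the following,
with a made-up Row type and a running total standing in for whatever
merge the report actually needs:

#include <algorithm>
#include <vector>

// Made-up row shape: a grouping key plus an amount to total per key.
struct Row {
    int key;
    double amount;
};

// After sorting, equal keys are adjacent, so one linear pass merges
// each run of duplicates into a single totalled row.
std::vector<Row> merge_duplicates(std::vector<Row> rows)
{
    std::sort(rows.begin(), rows.end(),
              [](const Row& a, const Row& b) { return a.key < b.key; });

    std::vector<Row> merged;
    for (const Row& r : rows) {
        if (!merged.empty() && merged.back().key == r.key)
            merged.back().amount += r.amount;   // running total per key
        else
            merged.push_back(r);
    }
    return merged;
}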

James said:
If you didn't have a tool already available which did it all at
once, maybe. Given the availability of std::map, however, it
seems like just extra effort.

But I'm not doing quite such trivial things. The std::map or whatever
is important, but there is other work to be abstracted out.
 
