sorting 5 million rows of minute data: choose std::list or std::vector


MM

I have about 5 million rows like these (day, minute, and 4 doubles)

YYYYMMDD, HHMM, double, double, double, double

which I parse from a file and store into

struct one_line {
    boost::uint32_t day;
    boost::uint32_t minute;
    double d1, d2, d3, d4;
};

I may need to sort the container once the data is parsed, sorting by
day and then minute.

Would sorting a std::list<one_line> be faster than sorting a
std::vector<one_line>? Or would it be best to use std::map<Key, Value> where
Key is std::pair<boost::uint32_t, boost::uint32_t> (day and minute)?

regards,
 

MM

Leigh Johnston said:
On 02/02/2011 15:09, Leigh Johnston wrote:
On VC++ I find sorting a std::vector of your struct to be twice as fast as
sorting a std::list of your struct.

/Leigh
Strange; I guess it depends on how unordered the list was.
Both sorts are O(N log N), aren't they?
I'd have thought moving pointers would be faster than moving 4 doubles and 2 ints by value.

thanks,
 

Juha Nieminen

MM said:
Strange; I guess it depends on how unordered the list was.
Both sorts are O(N log N), aren't they?
I'd have thought moving pointers would be faster than moving 4 doubles and 2 ints by value.

A doubly-linked list takes significantly more memory than a vector
(especially for such a relatively small element type), and the elements
of the list will usually end up being significantly more dispersed in
RAM while the elements in a vector will be at contiguous memory locations.
This means that sorting (or even just traversing) the list will usually
cause significantly more cache misses than a vector.
 

Bo Persson

MM said:
Strange; I guess it depends on how unordered the list was.
Both sorts are O(N log N), aren't they?
I'd have thought moving pointers would be faster than moving 4 doubles
and 2 ints by value.
thanks,

Note that you have to update 6 pointers to move a node in a doubly
linked list. :)

Often std::vector has an edge by being more cache friendly. The
elements are adjacent. And smaller than list nodes.



Bo Persson
 

Jorgen Grahn

Is it really that hard to test it?

AOL. Test it and report your results here. We don't really know what
will work best, or whether the difference is big enough to matter.

/Jorgen
 

Jorgen Grahn

I have about 5 million rows like these (day, minute, and 4 doubles)

YYYYMMDD, HHMM, double, double, double, double

which I parse from a file and store into

struct one_line {
    boost::uint32_t day;
    boost::uint32_t minute;
    double d1, d2, d3, d4;
};

Nitpick: don't call HHMM 'minute', or you'll be surprised one day when
you forget that 102 doesn't mean 102 minutes but 62 minutes.
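A small conversion helper makes that distinction explicit; the name
hhmm_to_minutes is invented here for illustration, not taken from the
original code.

```cpp
#include <cstdint>

// Convert an HHMM-encoded time (e.g. 102 meaning 1:02) to minutes since
// midnight, so 102 becomes 62 rather than being mistaken for 102 minutes.
std::uint32_t hhmm_to_minutes(std::uint32_t hhmm) {
    return (hhmm / 100) * 60 + hhmm % 100;
}
```

Converting once at parse time and storing minutes-since-midnight also makes
the (day, minute) comparisons in the sort trivially correct.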

/Jorgen
 

James Kanze

Additional nitpick: don't use a bit-width-specified integer type unless you
*need* to know the size of the integer in bits. A normal "unsigned int"
would have been perfectly fine here.

Normal int would have been perfectly fine here. On the other
hand, if he's got 5 million of them, short or even signed char
might be more appropriate (although in practice, alignment
requirements will probably mean that he won't gain anything by
using a type smaller than int).
 

Juha Nieminen

Leigh Johnston said:
Eschewing unsigned integral types is sometimes irrational. *If* it
makes no sense to have negative days or minutes then why use a signed
integral type?

Because if you are calculating the difference of two dates, you will
have to deal with the wrap-around problem if you are using unsigned
types, usually by explicitly casting the integrals being compared to
signed (which kind of defeats the whole purpose of them being unsigned
in the first place).

There might not be negative days in dates, but there may well be when
calculating differences between dates.
 

gwowen

Eschewing unsigned integral types is sometimes irrational.  *If* it
makes no sense to have negative days or minutes then why use a signed
integral type?

There are sometimes good reasons. Unsigned overflow has defined
semantics; signed does not. If the bare silicon of the target
platform's registers does not follow the defined unsigned semantics,
using unsigned can result in a lot of unnecessary code generation
(checking for overflow and implementing the correct behaviour).

I've written code for compilers/platforms for which

int find(char foo, const char arr[], size_t sz) {
    for (int q = 0; q < sz; ++q) {
        if (arr[q] == foo) return q;
    }
    return sz;
}

produces *considerably* smaller code than

int find(char foo, const char arr[], size_t sz) {
    for (unsigned q = 0; q < sz; ++q) {
        if (arr[q] == foo) return q;
    }
    return sz;
}

Of course, the latter has a larger domain of defined behaviour, but
in reality you may well know that you're never going to hit either
limit.
 

gwowen

I've written code for compilers/platforms for which [snip]
[Your correct code fix snipped]
 In which case on VC++ I get:

Then I think we can safely say that VC++ was not the platform to which
I was referring, can't we? Sheesh, some people are so keen to be
thought of as smarter than everyone else in the whole world, they end
up looking like idiots.

Stick to arguing with Paul, at least you *are* probably smarter than
him.
 

gwowen

Here the only difference is the Intel opcode jl versus the Intel opcode jb,
which IIRC have the same performance characteristics.

I'm not talking about an Intel processor, so the efficacy of Intel
opcodes is utterly irrelevant.
 

gwowen

It was simply a balanced, pertinent counter-example to your claim of
multiple platforms

If you think "one platform agrees with me" provides a counterexample...
Like it or not, the Intel platform is quite pervasive (not some esoteric
platform that nobody cares much about).

I know that. In fact, that was entirely my point. I was talking
about an esoteric platform, specifically a custom DSP ASIC whose
default signed integer arithmetic saturated (because on embedded DSPs,
that is almost always what you want signed arithmetic to do: clip
your signal rather than wrap it around).

So every time you do an unsigned addition on that processor, the C
standard insists that you check for overflow and correct the results
if overflow is about to occur. If you use signed arithmetic, the
compiler just loads the registers and calls the appropriate
instruction. (In fact, if memory serves, unsigned arithmetic was done
using a subroutine while signed arithmetic was a single opcode, although I
may have misremembered that.)

As to specific cases where the compiler can use "overflow is
undefined" to optimize, with the generated assembler shown, your attention
is drawn to http://www.airs.com/blog/archives/120
 

gwowen

The implementation specific behaviour of some esoteric platform (or any
specific platform for that matter) should not necessarily direct us as
to what constitutes good C++ coding practice.

Given that all I originally said was "There are sometimes good
reasons" -- the thing which you claimed to have disproved with a
counterexample [not sure how you disprove a "sometimes" with a
counterexample, but hey ho] -- it's good to see you've come around.

Unsigned integral types - I use them a lot, they're handy, especially
when I'm writing code for Intel and other general purpose processors
[though I bet there are some GPUs that work like DSPs for similar
reasons]. Eschewing (good word, by the way) them without a good reason
is pretty silly, but as I originally -- and correctly, as you now
agree -- said *THERE ARE SOMETIMES GOOD REASONS*.
 

Juha Nieminen

Leigh Johnston said:
Which is why I said "*If*" in my reply. On the implementation I use,
static_cast<int> has no overhead, and unlike some people I don't mind the
verbosity of the C++-style casts.

It's not about verbosity. It's about clarity and the meaning of your
code. If you say that a variable is unsigned and then later you cast it
to signed, you *lied* in your variable declaration. That can only cause
confusion.
 

Juha Nieminen

Leigh Johnston said:
Wrong. The cast is sometimes required when using the unsigned variable
in an expression involving signed variables (as clearly indicated in the
example code I posted); a cast is not required in other contexts. It
seems to me that the only confusion is your understanding of the issue.

But that's exactly why you need to make the original variable signed:
because it can be used in signed expressions. It makes no sense to make
it unsigned when it will be used as a signed value anyway.

Whenever you need to cast an unsigned variable to a signed one in an
expression, that's a sign of bad design. You are using the variable in
a way that contradicts its original declaration.
 

Bo Persson

Leigh said:
It was simply to add some balance to your initial reply, in which you
mention compilers/platforms (plural) producing *considerably*
smaller code when not using unsigned, which I initially viewed as
an argument.

And now we know one platform (singular) where the code is not
considerably smaller?


Bo Persson
 

Kaba

Leigh said:
Wrong; it is not a sign of bad design at all. If a variable is used as
an unsigned value more often than as a signed value and it logically
makes sense for it to be unsigned (e.g. it is never negative) then
making the variable unsigned makes perfect sense.
--8x--

You might be in the "use int everywhere" camp but I am not. Perhaps
Java is a more suitable language for you rather than C++.

For what it's worth, here's my take on this. This is a piece of text I
have written earlier (which explains its form). I am in the 'use int
everywhere' camp. :) Ideas can't and shouldn't be forced, but they can be
shared; the following text reflects how I reason my position, not that
the other camps are wrong.

In this article I present two problems in using unsigned
integers in a program. These are the _zero boundary problem_,
and the _extended positive range problem_.

Zero boundary problem
---------------------

Most of the time we are assuming that, away from boundaries caused
by the finite representation, an integer object works like an element
of the integer ring ZZ. In addition, it seems plausible that most
integer calculations are concentrated around a neighborhood of zero.

The _zero-boundary problem_ (of unsigned integers) is that the zero
is on the boundary beyond which the assumption of working with integers
falls apart. Thus, the probability of introducing errors in computations
increases greatly. Furthermore, those errors cannot be caught, since
every value is legal.

### Looping with unsigned integers

Looping over values from n to zero backwards demonstrates the zero
boundary problem with unsigned integers:

for (unsigned int i = n; i >= 0; --i)
{
    // Do something.
}

Since there are no negative values to fail the loop test,
this loop never finishes. In contrast, with signed integers the problem
does not exist, since the boundaries are located far away from zero:

for (int i = n; i >= 0; --i)
{
    // Do something.
}

Extended positive range problem
-------------------------------

Conversion between an unsigned integer and a signed integer
is an information destroying process in either direction.
The only way to avoid this is to use one or the other
consistently.

If the normal arithmetic is the intention, then a signed integer
represents a more general concept than an unsigned integer: the former
covers both the negative and positive integers, whereas the latter
only covers non-negative integers. In programming terms, it is
possible to create a program using signed integers alone, however,
the same can't be said about the unsigned integers. Therefore, if
only one type is to be used consistently, the choice should be a
signed integer. However, let us do some more analysis.

Since any non-trivial program must use signed integers, the use of
unsigned integers eventually leads to an unsigned-signed
conversion. In particular, because `std::size_t` in the Standard
Library is an unsigned integer, there are few programs that can
escape the unsigned-signed conversions.

Despite these conversions, programs still seem to work. The reason
for this, I reflect, is that the unsigned integers are normally
not taken into the extended positive range they allow.
The _extended positive range problem_ is that if the unsigned integers
are taken to their extended positive range, then the signed-unsigned
conversions become a reality. Ensuring correctness under such a threat
is hard, if not practically impossible.

Conclusion
 

Kaba

Leigh said:
It is debatable if the probability of introducing errors increases
greatly when using unsigned integral types; this is certainly an
assertion on your part, do you have any evidence to back it up?

Consider any function taking in an unsigned integer (e.g.
std::vector::resize()). It is possible that the computation of a new
size, as a bug, ends up being a negative number, which is then wrapped
to a large positive number. This bug can't be caught, since every value
is legal.

Something similar can happen with signed integers: a large negative
number wraps to large positive number. But this is much less probable:
the action is concentrated around zero.

Correctness is the most important property of a program. In the
described case using a signed integer _makes visible_ an often-occurring
bug.

The problem of unchecked bugs gets worse with increasing levels of
abstraction. In the worst case you go through a deep layer of functions,
with an unexplained crash, although the actual problem was that an
argument of the top-most function did not fulfill its precondition (e.g.
size < 0). It's quite comparable to the thousand-line template errors
when using Boost.Spirit.

for (unsigned int i = n + 1; i-- > 0;)
{
    // Do something.
}

Nice. :) However, the point is that to carry out the task you have to be
careful to avoid a bug. That's something that should not be left to
humans (because, by Murphy's law...).
You ensure correctness as you would ensure correctness using any other
language feature.

I don't think it can be afforded, because no one can put in the effort to
check each unsigned-signed conversion for trouble. The practical way,
which I also use, is to close one's eyes and avoid exercising a program at
its extremes (e.g. 2^32 - 1 objects).

Interestingly, the 64-bit integers help with the extended range problem
_in practice_: the extreme values are then often avoided naturally (e.g.
no one creates an std::vector with 2^64 elements). But the zero-boundary
problem remains.
Bugs can be created using any language feature therefore it is, IMO,
wrong to eschew a particular language feature based on the possibility
of bug creation unless there is clear evidence that the language feature
in question has a high probability of bug creation (for some definition
of "high").

The possibility of bug creation is not a problem. The problem is that
using unsigned integers prevents me from catching bugs, and that way
decreases my confidence in program correctness.
What you have presented is little more than conjecture IMO.

Critical thinking appreciated.
 
