sorting 5 million minute data: choose std::list or std::vector


James Kanze

If you are in the "use int everywhere" camp, the unsigned size_type
of the standard containers is arguably a mistake (or
non-optimal design, at best).

Many people think so.
You only need the extra range if you have things like a char
array larger than half the addressable space. And when you do, *that*
is probably a mistake too.

Practically speaking, you never need the extra range. Today at
least---I think the choice of unsigned was made when development
was on a 16 bit MS-DOS machine, and I've personally written code
which used arrays of over 32K char on such machines.
Having size unsigned is a problem if you try to compute it by
subtracting two pointers, because that result is signed. You then
quickly end up mixing signed and unsigned arithmetic, which is
definitely *not* good.

Historically, there was a very difficult problem to solve.
size_t must be big enough to contain the size of any possible
object, which usually means you either make it unsigned, or make
it have more bits than a pointer has. Given its role, the choice
is obvious, but it does have the problem you mention: if
pointers are 32 bits, and you can use most of the memory (more
than half) in a single object, then you need 33 bits to
represent the difference between two pointers. But doing this
only affects a very small minority of programs, and comes at
a not insignificant runtime cost (or did, on older 32 bit
machines). The choice made by the C committee (sometime before
1989) represents a valid compromise, even if it has some serious
drawbacks.

Using size_t as the default size_type in the C++ standard
library is a different issue, and it's definitely not a good
choice.
 

James Kanze

Usually the only situation where using unsigned integers (of
any size) is legitimate is when you are playing with bits,
rather than with integers (in other words, when your
variable is more or less used as a bit container).
What utter nonsense; when dealing with just *positive*
integers the unsigned integral types are fine.

No they're not. Unsigned integers are a very poor abstraction
of a cardinal type.
More nonsense; it is not fuzzy at all; the C++ draft Standard
provides a clear and definitive answer:
std::tr1::array::size_type is std::size_t (array indices).

std::array::size_type is size_t because that is the default
size_type for all of the earlier containers. It's a bad
decision we have to live with.
 

Paul

Juha Nieminen said:
You didn't answer my question.

3 people slagging me off on this thread in which I wasn't even
participating, and you suggest I have some attitude problem. I think my
reply was sufficient.
 

James Kanze

On 10/02/2011 19:15, James Kanze wrote:
Live with the fact that C++ provides unsigned integral types.

Live with the fact that they were designed for very specific
uses, and not as a generalized cardinal type. (At least
according to Kernighan and Ritchie.)
 

Paul

James Kanze said:
Live with the fact that they were designed for very specific
uses, and not as a generalized cardinal type. (At least
according to Kernighan and Ritchie.)
But this is appealing to authority and Leigh will call you a troll now.
 

Juha Nieminen

James Kanze said:
Live with the fact that they were designed for very specific
uses, and not as a generalized cardinal type. (At least
according to Kernighan and Ritchie.)

I suppose that signed and unsigned ints exist in C (and hence in C++)
because the most common CPU architectures of the time could handle the
CPU registers as either signed or unsigned integers (usually using the
2's complement format, which was the easiest and most logical way of
handling binary registers from the point of view of the hardware).
This idea of the same registers being handled as either signed or
unsigned became so ubiquitous that basically all CPU designs today do
so (I'm not aware of any exceptions; if there are, they are probably
extremely exotic and probably obsolete nowadays).

From a program design point of view whether to use signed or unsigned
integers is an interesting and very ambiguous question. Usually there is
no speed difference between the two, so the question is purely one of
program design and/or need.

The two opposing arguments could perhaps be summarized as:

1) The signedness of the integral type should be chosen according to
the meaning of the variable. If negative values are nonsensical, then
an unsigned type should be used to reflect this.

2) When dealing with integer math, signed values are the natural choice.
Unsigned integers should generally be used for more low-level, more
"hardware-level" programming, such as direct bit manipulation or
situations where the modulo behavior of hardware registers is actually
intended and explicitly desired (one good example of this being the
implementation of a linear congruential generator of the size of the
natural register size of the architecture).

In my personal experience argument 1 presents practical problems
because such unsigned integrals are often used (sometimes inadvertently)
in expressions which mix them with signed types, causing implicit
promotions with their inherent problems, requiring explicit casting
to avoid them.

I have already mentioned this example, but I think it's rather telling,
and it has happened to me in practice on several occasions: image
dimensions are something that one could argue can never be negative and
hence unsigned integrals should be used. The problem is that these
dimensions are then very easily used in expressions where they get mixed
with signed types (such as screen coordinates), the result of which then
gets promoted to floating point. You easily end up in a situation where
your coordinates end up accidentally in the billions, and this is
something that might not be evident immediately, but only after
extensive testing, or perhaps not even then.

The problem is easily solved by having all the variables as signed.
There's little reason to have image dimensions as unsigned, as there's
seldom any benefit (an image with dimensions so large that it cannot be
represented with a signed integer would usually be so large that it
would not fit in RAM).
 

gwowen

  The problem is easily solved by having all the variables as signed.
There's little reason to have image dimensions as unsigned, as there's
seldom any benefit (an image with dimensions so large that it cannot be
represented with a signed integer would usually be so large that it
would not fit in RAM).

I wonder if anyone has formulated a CPU-version of Zipf's law, along
the lines of "90% of all integers stored in computers are (in
absolute value) between 0 and 9, of the remainder 90% are between 10 and
100, of the remainder 90% are between 100 and 1000, etc.". I'm not
counting pointers, bit masks etc. I've no evidence to back that up,
but it seems plausible.
 

Robert Hairgrove

[huge snip]
Nonsense; see posts else-thread.

/Leigh

Now what does this say about the poster when he quotes dozens of lines
from previously posted messages, only to add "nonsense" at the end?

Troll ... *plonk*
 

Paul

Leigh Johnston said:
Who cares when it came along? I live in the present not the past.

Array indices are never negative. You cannot allocate or have a negative
number of container elements. Join the dots.

std::size_t

std::tr1::array::size_type

std::allocator::size_type
std::vector::size_type

/Leigh
This problem stems from the introduction of the Standard library which
created different types of C++ programmers:
a) The old style programmers who never needed the standard library.
b) The programmers who choose to learn the standard library as part of the
C++ language.

I consider the std lib as an extension to the C++ language, some people
consider it as part of the C++ language. To me the C++ language is/was a
low/mid level programming language. The introduction of the std lib raises
the level to mid/high level programming IMHO.

The C++ language now seems a bit confused nowadays and without a definite
direction. I liked the old C++ where you had the power to do almost anything
you like within the realms of computer programming, for example you could
define a std::size_t if you wanted to but this was not forced upon you as
part of the language.

Leigh appears as newbie because he knows all the ins and outs of the std
lib, he reckons he has 18 years programming experience but the programs he
has created ref: his website, suggests he is quite noobish.

I think Leigh is probably *generally* right regarding the use of unsigned
for array indexing and counters, but this may not always be the case on all
implementations. I do not pretend to understand the exact details of all
microchips and how they handle integers, but I think this is the key to
understanding which is more efficient.
 

Juha Nieminen

Leigh Johnston said:
Array indices are never negative.

Do you know what value std::string::npos usually is? That's right,
a value which would otherwise be a completely valid index, but which
is just agreed to mean "not any valid index". Usually this value is
size_t(-1). In other words, the natural value for npos would be -1,
but since the index type is unsigned, it cannot be used.

Even if we conceded that array indices are never negative, the result
of subtracting two array indices can be, in which case you have to do the
manual casting to avoid problems. (Why do you think ptrdiff_t is signed?)
 

Paul

Leigh Johnston said:
Array indices are never negative; the result of subtracting array indices
cannot be used as an array index if the result would otherwise be negative
(if signed).

char str[24];
char* p_str = str+12;
p_str[-12] = 'A';
 

Miles Bader

James Kanze said:
Live with the fact that ...blah blah blah...

How many times a month does this particular religious war pop up in new
threads anyway?

-miles
 

Sjouke Burry

In fortran there is no problem with those indices.
If you want arr(-5:20)? No problem.
 

SG

I have about 5 million rows like these (day, minute, and 4 doubles)

YYYYMMDD, HHMM, double, double, double, double

which I parse from a file and store into

struct one_line {
  boost::uint32_t day;
  boost::uint32_t minute;
  double d1, d2, d3, d4;
};

I have to agree with the others about the way you store a date and a
time. This looks error-prone.

About vector<one_line> versus list<one_line> w.r.t. sorting speed:
Test it. I would expect the vector version to be faster. Here's why:
one_line is "trivially copyable/assignable" (in C++0x terms) which
means that a good compiler will copy the raw bytes as fast as
possible. The difference in time between copying a raw block of bytes
of this size and adjusting a couple of pointers is probably
insignificant in comparison to the differences in the memory access
patterns and the resulting number of cache misses. Consider the memory
access patterns of quicksort given a linear memory layout versus a
mergesort on a doubly linked list whose elements are more or less
arbitrarily distributed in memory. Keep in mind that linear memory
accesses (forwards and backwards) are typically very fast whereas
random access is slow.

Cheers!
SG
 

James Kanze

char str[24];
char* p_str = str+12;
p_str[-12] ='A';

Touché.

You've actually raised a significant point with regards to
Leigh's argument: it's fine to say that "array indices are never
negative", except that in C and in C++ there aren't really any
such things as array indices, since there is no array
indexation; only pointer arithmetic.

Of course, user defined classes may (and generally do) override
[] to signify indexation. But there's certainly no requirement
there that array indices cannot be negative---it's purely a
question of how the author defines his class. (There are a few
algorithms where allowing arbitrary upper and lower bounds can
be quite useful.)
 

Paul

Leigh Johnston said:
"Leigh Johnston"<[email protected]> wrote in message
On 12/02/2011 19:27, Juha Nieminen wrote:
Array indices are never negative.
Do you know what value std::string::npos usually is? That's right,
a value which would otherwise be a completely valid index, but which
is just agreed to mean "not any valid index". Usually this value is
size_t(-1). In other words, the natural value for npos would be -1,
but since the index type is unsigned, it cannot be used.
Even if we conceded that array indices are never negative, the result
of subtracting two array indices can be, in which case you have to do the
manual casting to avoid problems. (Why do you think ptrdiff_t is signed?)
Array indices are never negative; the result of subtracting array indices
cannot be used as an array index if the result would otherwise be negative
(if signed).
char str[24];
char* p_str = str+12;
p_str[-12] ='A';

Touché.

Touché? I don't think so. Pointers are not arrays. In C and C++ array
indexes are never negative. An array is a distinct concept in C/C++ and
has a distinct type; "p_str" above could subsequently be made to point to
something else (other than an array).
Most arrays are stored on the heap. So how else can we have a dynamic array
if it is not a pointer?

int* p_arr = new int[10];
p_arr += 5;

p_arr[-5];

The above is a negative array index; there is no other reasonable
description for it.
Not true; an array is a distinct concept, a pointer is a distinct concept:

template <typename T>
void is_same_i_dont_think_so(const T&, const T&) {}
...
is_same_i_dont_think_so(str, p_str);

Just because an array can decay to a pointer does not mean that arrays are
the same as pointers.
The above argument is complete nonsense because even two *local* arrays can
fail this test if they are different types/sizes.
And IIRC, is there not a rule in C++ about not referencing arrays?
There is no other way to create a dynamic array, except by using a pointer.
It doesn't *decay* into a pointer it begins life as a pointer.
Of course, user defined classes may (and generally do) override
[] to signify indexation. But there's certainly no requirement
there that array indices cannot be negative---it's purely a
question of how the author defines his class. (There are a few
algorithms where allowing arbitrary upper and lower bounds can
be quite useful.)

A user defined class is not a C/C++ array and overriding operator[] in a
user defined class is mostly irrelevant as I am talking about C/C++
arrays.

std::tr1::array is the standard array container class in C++0x and its
index type is std::size_t as array indices are never negative.
Um so its ok to use class types, such as std containers, when they support
your argument but not when they are against you?

The fact that Kanze chooses to act as a troll and chime in with Paul The
Troll is rather sad and pathetic. Kanze is now precariously close to my
killfile.
This is totally predictable Leigh behaviour, when he is losing an argument
he always resorts to this. He is like a baby throwing the dummy-tit out of
the pram.
 

Paul

Leigh Johnston said:
On 12/02/2011 19:27, Juha Nieminen wrote:
Array indices are never negative.

Do you know what value std::string::npos usually is? That's right,
a value which would otherwise be a completely valid index, but which
is just agreed to mean "not any valid index". Usually this value is
size_t(-1). In other words, the natural value for npos would be -1,
but since the index type is unsigned, it cannot be used.

Even if we conceded that array indices are never negative, the result
of subtracting two array indices can be, in which case you have to do the
manual casting to avoid problems. (Why do you think ptrdiff_t is signed?)

Array indices are never negative; the result of subtracting array indices
cannot be used as an array index if the result would otherwise be negative
(if signed).

char str[24];
char* p_str = str+12;
p_str[-12] ='A';

Touché.

Touché? I don't think so. Pointers are not arrays. In C and C++ array
indexes are never negative. An array is a distinct concept in C/C++ and
has a distinct type; "p_str" above could subsequently be made to point
to something else (other than an array).

Even if one considers the dynamic array case (and dynamic arrays are old
hat in C++ land, prefer std::vector instead) the *array* index can still
not be negative:
So now he implies that a dynamic array is obsolete; this is obviously a
joke.
int* pa = new int[10](); // (1)
pa += 5;
pa[-5] = 42; // (2)
pa -= 5;
delete[] pa;

In (1) above the pointer pa *points to* an array. In (2) above the pa
pointer no longer refers to an array; it just refers to an array element.
In (2) above the *array* index is actually 0 not -5 (5 + -5 = 0).

No such arithmetic takes place re: (5 + (-5) = 0); this is complete nonsense.
If it were calculated as 0, it would refer to some out-of-bounds memory
and be an error.
I repeat: arrays and pointers are distinct entities in C and C++.
An array on the heap, which most arrays are, is a pointer to memory.
Arrays can decay into pointers but that does not mean that arrays are
pointers.
It doesn't *decay* at all, it is born as a pointer.
A pointer can point to an array element but that does not mean that a
pointer is an array.
Nobody said a pointer is an array and that isn't the argument. The argument
is that an array cannot have a negative index. This displays his attempt to
shift the focus of the argument.
He said 'an array index cannot be negative', I think we have, once again,
proven him to be incorrect.
 

Paul

Leigh Johnston said:
On 13/02/2011 12:47, James Kanze wrote:

On 12/02/2011 19:27, Juha Nieminen wrote:
Array indices are never negative.

Do you know what value std::string::npos usually is? That's right,
a value which would otherwise be a completely valid index, but which
is just agreed to mean "not any valid index". Usually this value is
size_t(-1). In other words, the natural value for npos would be -1,
but since the index type is unsigned, it cannot be used.

Even if we conceded that array indices are never negative, the result
of subtracting two array indices can be, in which case you have to do the
manual casting to avoid problems. (Why do you think ptrdiff_t is signed?)

Array indices are never negative; the result of subtracting array indices
cannot be used as an array index if the result would otherwise be negative
(if signed).

char str[24];
char* p_str = str+12;
p_str[-12] ='A';

Touché.

Touché? I don't think so. Pointers are not arrays. In C and C++ array
indexes are never negative. An array is a distinct concept in C/C++ and
has a distinct type; "p_str" above could subsequently be made to point
to something else (other than an array).

Even if one considers the dynamic array case (and dynamic arrays are old
hat in C++ land, prefer std::vector instead) the *array* index can still
not be negative:

int* pa = new int[10](); // (1)
pa += 5;
pa[-5] = 42; // (2)
pa -= 5;
delete[] pa;

In (1) above the pointer pa *points to* an array. In (2) above the pa
pointer no longer refers to an array; it just refers to an array
element. In (2) above the *array* index is actually 0 not -5 (5 + -5 =
0).

For the confused trolls (Paul), the stupid trolls (Paul) and the trolls
with nothing better to do (Kanze):

I am not implying that 5 + -5 is actually emitted by the compiler but that
the *underlying array index* of the array element that pa[-5] refers to is
0 and a compiler might even optimize accordingly reflecting this.

I never claimed that *pointer* arithmetic can not involve negative values
(std::ptrdiff_t is a signed integral type) instead I claimed that you
cannot have negative *array* indices.

In (2) above the pointer pa no longer refers to an array, as it no longer
points to the start of an array; instead it just refers to an array element.
Hence the -5 in pa[-5] is not an array index but a pointer offset; the
array index is 0.

I repeat for the third and final time: arrays and pointers are distinct
C/C++ concepts.
An array index is the value inside the square brackets that you use to index
an array.

He can repeat as many times as he likes, that an array is not a pointer but
this doesn't change the fact he is wrong to state that an array index cannot
be negative.

As far as dynamic arrays vs std::vector is concerned std::vector is
superior to dynamic arrays making dynamic arrays more or less obsolete in
C++. Even a container allocator will likely be calling the scalar form of
operator new (which will likely call malloc) rather than using the array
form of operator new to allocate an array of bytes into which the
container elements can be constructed.

Vector is not superior to a dynamic array. A vector is an STL container and
is used in completely different situations. The fact that Leigh seems to
compare the two as the same thing just shows his noobness.
 

Bjarke Hammersholt Roune

  Do you know what value std::string::npos usually is? That's right,
a value which would otherwise be a completely valid index, but which
is just agreed to mean "not any valid index". Usually this value is
size_t(-1). In other words, the natural value for npos would be -1,
but since the index type is unsigned, it cannot be used.
In this case -1 is merely a shorter way of saying
std::numeric_limits<size_t>::max(), and once you've typed that a few
times you come to appreciate (size_t)-1. Once you realize that you
also realize that in fact npos could never be a valid index since
valid indexes are strictly less than the size of the array (or in this
case string) you are indexing into, and the maximum possible number of
addressable bytes is precisely std::numeric_limits<size_t>::max(), or
npos. So (size_t)-1 is a reasonable value for npos to have - it is the
one and only unsigned value that could not be a valid index in any
imaginable system. Using signed indexes would, on the other hand, make
50% of otherwise perfectly good bit patterns unusable as indexes. I
don't know that that's a big problem, but your argument for preserving
an "otherwise completely valid index" comes down on the opposite side
than you intended.
 
