Superbasic efficiency question

aaronfude

I'm working on a scientific computing application. I need a class
called Element which is no more than a collection of integers, or
"nodes", and has only one method: int getNode(int i).

I would like to implement it in the most efficient way possible. So I
summoned up my programming intellect and asked myself: Do I want to
have members such as int a, b, c, d, e or a single member such as int
a[5]? So I wrote the following snippet and compiled it with the -O3 flag:

int main(int argc, char *argv[]) {

    /*
    int a, b, c, d, e;
    for (int i = 0; i < 1000000000; i++) {
        a = 1;
        b = 2;
        c = 3;
        d = 4;
        e = 5;
    }
    */

    int a[5];
    for (int i = 0; i < 1000000000; i++) {
        a[0] = 1;
        a[1] = 2;
        a[2] = 3;
        a[3] = 4;
        a[4] = 5;
    }

    return 0;
}

The first (commented out) version ran twice as fast. (For doubles
instead of ints, it was a factor of 4). So the simpleton part of me
thinks that that answers my question. The remaining part tells me that
it is never that simple. Finally, the cynical part of me thinks that it
all doesn't matter and other parts of the program are bound to be far
more time consuming.

Please help me resolve my internal struggle.
Many thanks in advance!

Aaron Fude
 

Joseph Turian

Aaron,

Performance on the toy code snippet is not informative of anything more
than performance on the toy code snippet.
Please help me resolve my internal struggle.

10 Write the code in the clearest, most maintainable way possible.
20 Run the code under a profiler.
30 Optimize the actual performance bottleneck (as opposed to, say,
imaginary performance bottlenecks).
40 GOTO 20

If you have any problems with step 30, get back to the group. Until then,
we're all just shooting in the dark.

Joseph
 

David White

Joseph Turian said:
Aaron,

Performance on the toy code snippet is not informative of anything more
than performance on the toy code snippet.


10 Write the code in the clearest, most maintainable way possible.
20 Run the code under a profiler.
30 Optimize the actual performance bottleneck (as opposed to, say,
imaginary performance bottlenecks).

Optimizing might require significant changes, even major design changes. If
you know from the outset that speed is important, then consider it from the
outset. Toy code snippets can help you determine early on how you'll need to
go about doing certain things.
40 GOTO 20

If you have any problems with step 30, get back to the group. Until then,
we're all just shooting in the dark.

DW
 

Sethalicious

I'm thinking that you're getting a silent error in the code. I don't
believe an integer can handle the value 1000000000 in the following:
for (int i = 0; i < 1000000000; i++)

So what's happening is your integer "i" is overflowing and getting set
to an unpredictable value (possibly negative). You might have better
luck using an "unsigned long" or even "std::size_t".

I'm not sure what the standard says, but I think most compilers
implement 32 bit integers. So an "unsigned" int is 0 to 2^32, and the
"signed" int is -2^16 to 2^16. (Subtract or add a 1 where needed).

Check your compiler's docs to see the sizes of the integral data types.


Otherwise, there should be no real difference in the performance of the
code with optimizations on. When you measure the time, remember to
record CPU clock cycles instead of time in seconds.
Hope this helps!

Chris J. (aka Sethalicious)
 

davidrubin

My feeling is that when you try to implement 'getNode', you will find
that the array-based implementation is faster than a switch statement.
You might be able to squeeze out some more performance though with a
variable-based implementation if you templatize 'getNode' on i, and
specialize for values of i. (Presumably there is some more generic code
in your application which makes 'getNodeA', 'getNodeB', etc.,
prohibitive.)
 

Gianni Mariani

I'm working on a scientific computing application. I need a class
called Element which is no more than a collection of integers, or
"nodes", and has only one method: int getNode(int i).

I would like to implement it in the most efficient way possible. So I
summoned up my programming intellect and asked myself: Do I want to
have members such as int a, b, c, d, e or a single member such as int
a[5]? So I wrote the following snippet and compiled it with the -O3 flag:

int main(int argc, char *argv[]) {

    /*
    int a, b, c, d, e;
    for (int i = 0; i < 1000000000; i++) {
        a = 1;
        b = 2;
        c = 3;
        d = 4;
        e = 5;
    }
    */

    int a[5];
    for (int i = 0; i < 1000000000; i++) {
        a[0] = 1;
        a[1] = 2;
        a[2] = 3;
        a[3] = 4;
        a[4] = 5;
    }

    return 0;
}

The first (commented out) version ran twice as fast. (For doubles
instead of ints, it was a factor of 4). So the simpleton part of me
thinks that that answers my question. The remaining part tells me that
it is never that simple. Finally, the cynical part of me thinks that it
all doesn't matter and other parts of the program are bound to be far
more time consuming.

It is more than likely that the compiler re-arranged your code to something like:

int a, b, c, d, e;
a = 1;
b = 2;
c = 3;
d = 4;
e = 5;
for (int i = 0; i < 1000000000; i++) {
}

Or, perhaps the compiler placed the values in registers.

There is a deeper design question for you.

Are these values really related? Do you operate on them in
tandem? Would you ever want to write a template function taking a
"pointer to member" to one of these values?

I would go with the 5 separate values if they are truly separate. That
way it will be harder to run into problems such as going past the array
bounds or using the wrong index.

Anyhow, below is an example where the compiler can't (easily) make the
optimization above. The results are essentially identical with:
gcc version 4.0.0 20050102 (experimental)


#include <ostream>
#include <iostream>

struct X
{
    virtual void F() = 0; // hard for the compiler to optimize this away
    virtual ~X() {}
};

struct A
{
    int a, b, c, d, e;
};

struct B
{
    int a[5];
};


struct Av
    : A, X
{
    Av()
        : A()
    {
    }

    virtual void F()
    {
        a = 1;
        b = 2;
        c = 3;
        d = 4;
        e = 5;
    }
};

struct Bv
    : B, X
{
    Bv()
        : B()
    {
    }

    virtual void F()
    {
        a[0] = 1;
        a[1] = 2;
        a[2] = 3;
        a[3] = 4;
        a[4] = 5;
    }
};

int main( int argc, char ** argv )
{
    X * x;
    if ( argc >= 2 )
    {
        std::cout << "Making an A\n";
        x = new Av;
    }
    else
    {
        std::cout << "Making a B\n";
        x = new Bv;
    }

    for (int i = 0; i < 1000000000; i++)
    {
        x->F();
    }

    delete x;
}


$ g++ -O3 -o ./optimal_array_or_members ./optimal_array_or_members.cpp
$ time ./optimal_array_or_members
Making a B
6.900u 0.000s 0:06.92 99.7% 0+0k 0+0io 216pf+0w
$ time ./optimal_array_or_members d
Making an A
6.770u 0.000s 0:06.78 99.8% 0+0k 0+0io 216pf+0w
$ time ./optimal_array_or_members
Making a B
6.960u 0.010s 0:06.96 100.1% 0+0k 0+0io 216pf+0w
$ time ./optimal_array_or_members s
Making an A
6.920u 0.000s 0:06.92 100.0% 0+0k 0+0io 216pf+0w
$ time ./optimal_array_or_members
Making a B
7.010u 0.000s 0:07.00 100.1% 0+0k 0+0io 216pf+0w
$ time ./optimal_array_or_members s
Making an A
6.950u 0.000s 0:06.95 100.0% 0+0k 0+0io 216pf+0w
$ time ./optimal_array_or_members
Making a B
6.770u 0.000s 0:06.76 100.1% 0+0k 0+0io 216pf+0w
$ time ./optimal_array_or_members s
Making an A
6.850u 0.000s 0:06.84 100.1% 0+0k 0+0io 216pf+0w
 

David White

[snip]
int main(int argc, char *argv[]) {

    /*
    int a, b, c, d, e;
    for (int i = 0; i < 1000000000; i++) {
        a = 1;
        b = 2;
        c = 3;
        d = 4;
        e = 5;
    }
    */

    int a[5];
    for (int i = 0; i < 1000000000; i++) {
        a[0] = 1;
        a[1] = 2;
        a[2] = 3;
        a[3] = 4;
        a[4] = 5;
    }

    return 0;
}

The first (commented out) version ran twice as fast. (For doubles
instead of ints, it was a factor of 4). So the simpleton part of me
thinks that that answers my question. The remaining part tells me that
it is never that simple. Finally, the cynical part of me thinks that it
all doesn't matter and other parts of the program are bound to be far
more time consuming.

The most likely reason for the difference is that the compiler can keep the
individual variables in registers but places the array in memory. I don't
see how this test helps with your particular problem, though. You need your
collection to be indexable, so you don't really have the option of using
individual named variables. Although test programs like this can be useful,
you have to determine the reason for the performance difference and consider
whether the result will hold for your program. If the reason is register
use, then the test probably doesn't mean much.

DW
 
 
 
 

E. Robert Tisdale

I'm working on a scientific computing application.
I need a class called Element
which is no more than a collection of integers, or "nodes",
and has only one method: int getNode(int i).
I would like to implement it in the most efficient way possible. So I
summoned up my programming intellect and asked myself,
Do I want to have members such as int a, b, c, d and e?
Or a single member such as int a[5]?
So I wrote the following snippet and compiled it with the -O3 flag:
cat main.cc
#include <cstdlib>
#include <iostream>

int main(int argc, char* argv[]) {

    if (1 < argc) {
        const std::size_t n = std::atoi(argv[1]);
#ifdef INNER
        int a[5];
        for (std::size_t j = 0; j < n; ++j) {
            a[0] = 1;
            a[1] = 2;
            a[2] = 3;
            a[3] = 4;
            a[4] = 5;
        }
#else //INNER
        int a, b, c, d, e;
        for (std::size_t j = 0; j < n; ++j) {
            a = 1;
            b = 2;
            c = 3;
            d = 4;
            e = 5;
        }
#endif//INNER
    }
    else {
        std::cerr << argv[0] << " <iterations>" << std::endl;
    }
    return 0;
}
g++ --version
g++ (GCC) 3.4.1
uname -srmpio
Linux 2.6.10-1.9_FC2smp i686 i686 i386 GNU/Linux
g++ -DINNER -Wall -ansi -pedantic -O3 -o main main.cc
time ./main 1000000000
3.362u 0.004s 0:03.36 100.0% 0+0k 0+0io 0pf+0w
time ./main 1000000000
3.357u 0.003s 0:03.36 99.7% 0+0k 0+0io 0pf+0w
time ./main 1000000000
3.388u 0.003s 0:03.39 99.7% 0+0k 0+0io 0pf+0w
time ./main 1000000000
3.388u 0.003s 0:03.39 99.7% 0+0k 0+0io 0pf+0w
time ./main 1000000000
3.396u 0.002s 0:03.40 99.7% 0+0k 0+0io 0pf+0w
g++ -Wall -ansi -pedantic -O3 -o main main.cc

time ./main 1000000000
3.396u 0.005s 0:03.40 99.7% 0+0k 0+0io 0pf+0w
time ./main 1000000000
3.360u 0.003s 0:03.37 99.7% 0+0k 0+0io 0pf+0w
time ./main 1000000000
3.371u 0.004s 0:03.37 100.0% 0+0k 0+0io 0pf+0w
time ./main 1000000000
3.388u 0.003s 0:03.39 99.7% 0+0k 0+0io 0pf+0w
time ./main 1000000000
3.371u 0.003s 0:03.37 100.0% 0+0k 0+0io 0pf+0w
The first (commented out) version ran twice as fast.

Mine doesn't.
(For doubles instead of ints, it was a factor of 4).
So the simpleton part of me thinks that that answers my question.
The remaining part tells me that it is never that simple.

We call that good intuition.
Finally, the [skeptical] part of me thinks that it all doesn't matter
and other parts of the program are bound to be far more time consuming.

Probably correct.
Please help me resolve my internal struggle.

Try upgrading your compiler.
 

Gianni Mariani

Sethalicious said:
I'm thinking that you're getting a silent error in the code. I don't
believe an integer can handle the value 1000000000 in the following:
for (int i = 0; i < 1000000000; i++)

So what's happening is your integer "i" is overflowing and getting set
to an unpredictable value (possibly negative). You might have better
luck using an "unsigned long" or even "std::size_t".

Not in this case - 1 billion is less than 2^31.
I'm not sure what the standard says, but I think most compilers
implement 32 bit integers. So an "unsigned" int is 0 to 2^32, and the
"signed" int is -2^16 to 2^16. (Subtract or add a 1 where needed).

Think again.

int has a range from -2^31 to (2^31 - 1).
 

Gianni Mariani

My feeling is that when you try to implement 'getNode', you will find
that the array-based implementation is faster than a switch statement.

That's true. A pointer to member might be an alternative.
You might be able to squeeze out some more performance though with a
variable-based implementation if you templatize 'getNode' on i, and
specialize for values of i. (Presumably there is some more generic code
in your application which makes 'getNodeA', 'getNodeB', etc.,
prohibitive.)

Templates can also take pointers to members.
 
 

Jonathan Turkanis

I'm working on a scientific computing application. I need a class
called Element which is no more than a collection of integers, or
"nodes", and has only one method: int getNode(int i).

I would like to implement it in the most efficient way possible. So I
summoned up my programming intellect and asked myself: Do I want to
have members such as int a, b, c, d, e or a single member such as int
a[5]? So I wrote the following snippet and compiled it with the -O3
flag:

Make sure that Element is well encapsulated so that you can switch easily
between different implementations. Then, take Joseph's advice and don't worry
about which implementation is better until you can do performance measurements
in real-world scenarios.

Jonathan
 
 

Sethalicious

Think again.
int has a range from - 2^31 to (2^31-1)

Whoops. Totally forgot about 2's complement. :-o I stand corrected!
Thanks!


Chris J. (aka Sethalicious)
 
