Performance measurement and optimization levels


Alex Vinokur

For instance, suppose we need to measure the performance
of the assignment 'ch1 = ch2', where ch1 and ch2 are of type char.
We need to do that for different optimization levels of the same compiler.


Here is a test program.


Environment
-----------
Windows 2000
Intel (R) Celeron (R) CPU 1.70 GHz
GNU g++ 3.3.1 (cygming special), MINGW



========== C++ code : foo.cpp : BEGIN ==========
// Note. To simplify this demo program
// the clock() return value isn't checked
// ---------------------------------------------
#include <ctime>
#include <iostream>
using namespace std;

int main()
{
    clock_t t0, tn;
    unsigned long i = 0;
    char ch;

#define REPETITIONS 100000000

    t0 = clock ();
    for (i = 0; i < REPETITIONS; i++) {}
    tn = clock ();
    cout << "Do noting : " << (tn - t0) << " ticks" << endl;

    t0 = clock ();
    for (i = 0; i < REPETITIONS; i++) ch = 'a';
    tn = clock ();

    cout << "Do something : " << (tn - t0) << " ticks" << endl;

    return 0;
}
========== C++ code : foo.cpp : END ============


========= Compilation : BEGIN =========

$ g++ --version
g++ (GCC) 3.3.1 (cygming special)
[---omitted---]

$ g++ -mno-cygwin foo.cpp -o a0

$ g++ -mno-cygwin -O1 foo.cpp -o a1

$ g++ -mno-cygwin -O2 foo.cpp -o a2

$ g++ -mno-cygwin -O3 foo.cpp -o a3

$ wc *.exe
394 5333 424460 a0.exe
398 5294 424460 a1.exe
397 5293 424460 a2.exe
396 5303 424478 a3.exe
1585 21223 1697858 total

========= Compilation : END ===========


========= Run : BEGIN =========

$ a0
Do noting : 250 ticks
Do something : 371 ticks

$ a1
Do noting : 120 ticks
Do something : 130 ticks

$ a2
Do noting : 120 ticks
Do something : 120 ticks

$ a3
Do noting : 120 ticks
Do something : 120 ticks

========= Run : END ===========


We can see that only a0 produces believable results.
Most probably, the assignment ch = 'a' in a1, a2 and a3 is performed without the loop.

So, how should one measure performance in the program above for optimization levels O1, O2 and O3?
 

Owen Jacobson

> For instance, suppose we need to measure the performance
> of the assignment 'ch1 = ch2', where ch1 and ch2 are of type char.
> We need to do that for different optimization levels of the same compiler.

Very likely this specific operation will be the same at all levels -- some
approximation of mov ch1, ch2.

....

(Source and build commands kept for context)
> t0 = clock ();
> for (i = 0; i < REPETITIONS; i++) {}
> tn = clock ();
> cout << "Do noting : " << (tn - t0) << " ticks" << endl;
> t0 = clock ();
> for (i = 0; i < REPETITIONS; i++) ch = 'a';
> tn = clock ();
> cout << "Do something : " << (tn - t0) << " ticks" << endl;
>
> $ g++ -mno-cygwin foo.cpp -o a0
> $ a0
> Do noting : 250 ticks
> Do something : 371 ticks
> $ g++ -mno-cygwin -O1 foo.cpp -o a1
> $ a1
> Do noting : 120 ticks
> Do something : 130 ticks ....
> $ g++ -mno-cygwin -O3 foo.cpp -o a3
> $ a3
> Do noting : 120 ticks
> Do something : 120 ticks
> We can see that only a0 produces believable results.
> Most probably, the assignment ch = 'a' in a1, a2 and a3 is performed without the loop.
>
> So, how should one measure performance in the program above for optimization levels O1, O2 and O3?

What, exactly, were you expecting the optimizer to do? *Not* optimize
your program?
 

Peter van Merkerk

Alex said:
> For instance, suppose we need to measure the performance
> of the assignment 'ch1 = ch2', where ch1 and ch2 are of type char.
> We need to do that for different optimization levels of the same compiler.


> We can see that only a0 produces believable results.

a1, a2 and a3 are IMHO believable too. In fact with a good optimizer I
would expect results close to 0 ticks, because with this code the 'for'
loops can be completely eliminated.
> Most probably, the assignment ch = 'a' in a1, a2 and a3 is performed without the loop.
>
> So, how should one measure performance in the program above for optimization levels O1, O2 and O3?

Keep in mind that code that has no observable effects can be completely
removed by the optimizer. Since in your code 'ch' is assigned to
but never used, the optimizer can replace the assignment with nothing.
To prevent this optimization you could, for example, output the ch
variable after the loop has completed.

Also the 'for' loop can be replaced with something that has the same
effect (which may be nothing). For example:

    for (i = 0; i < REPETITIONS; i++) ch = 'a';

can be replaced with:

    ch = 'a';

MSVC can do this optimization, and can handle even more complex cases.
For example, with optimization enabled the following code:

    int main()
    {
        int i = 10;

        for (int j = 0; j < 10; ++j)
        {
            i += 10;
        }

        return i;
    }

will produce the equivalent of:

    int main()
    {
        return 110;
    }

As I said in another thread, making a good benchmark is extremely
tricky. Artificial code like the code you posted is prone to producing
non-representative benchmark results.
 

Siemel Naran

> int main()
> {
> clock_t t0, tn;
> unsigned long i = 0;
> char ch;
>
> #define REPETITIONS 100000000
>
> t0 = clock ();
> for (i = 0; i < REPETITIONS; i++) {}

A good optimizer will optimize the above out of existence. It does nothing
anyway.

> tn = clock ();
> cout << "Do noting : " << (tn - t0) << " ticks" << endl;
>
> t0 = clock ();
> for (i = 0; i < REPETITIONS; i++) ch = 'a';

A good compiler will optimize the above loop to { ch = 'a'; }, just one
assignment.

> tn = clock ();
>
> cout << "Do something : " << (tn - t0) << " ticks" << endl;
>
> return 0;
> }
> $ a0
> Do noting : 250 ticks
> Do something : 371 ticks
>
> $ a1
> Do noting : 120 ticks
> Do something : 130 ticks
>
> $ a2
> Do noting : 120 ticks
> Do something : 120 ticks
>
> $ a3
> Do noting : 120 ticks
> Do something : 120 ticks
>
> ========= Run : END ===========
>
>
> We can see that only a0 produces believable results.
> Most probably, the assignment ch = 'a' in a1, a2 and a3 is performed without the loop.
>
> So, how should one measure performance in the program above for
> optimization levels O1, O2 and O3?

We need to have side effects, or fool the optimizer into thinking there are
side effects, by calling external functions. There might be other ways too.
 
