Performance of std::vector<double>, double[] and uBlas::vector on different CPUs


Ingo Nolden

Dear Group,



I am a little confused by the results of some code that should give me
information about CPU cache effects.

I wrote a function that performs some flops on a vector/array of doubles
or floats of varying size. While playing around and trying different
things I compared the use of a plain C array with std::vector and the
vector from the Boost uBLAS library.
The program was compiled with the VC++ 7.1 compiler with the default
release settings, and later also with whole program optimization and
global optimization activated (which made no difference).

On my laptop (Intel P4 2.6 GHz, 512 MByte RAM) the result was extremely
surprising: the raw C array performed as expected, in a range between
300 and 350 MFlops (if my flops calculation is right).
The other arrays, however, were about 80 times (!!!) slower.
I had expected them to be perhaps some percentage slower.
The assembly code (as far as I can guess what it means) seemed to be
doing the same thing, however.
This made me think that it must be an issue with memory access. It
cannot be due to main memory size, because the difference occurs at
every array size, starting from 1 MByte.
Next I wanted to prove that it is not a compiler/optimization-dependent
issue, so I ran the same executable on a different machine, an AMD
Athlon 2400+ desktop with 1 GB RAM. On this machine I got an even more
surprising result:
std::vector and uBlas::vector performed well and even surpassed the
plain C array.

I usually don't care that much about performance, but 8000% is worth
thinking about.

Below is my source. If you don't have uBLAS at hand, you can comment
out the two lines that use it and it should still work.

Also, to get back to my original intention, I want to change from
sequential access to arbitrary (random) access of the vector items.
Does anyone know a good/standard way to do so? It should put additional
load on the CPU.
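
One way I can imagine doing it (only a sketch, not wired into the
benchmark below, and the helper name SumShuffled is made up): build a
separate index array, shuffle it with std::random_shuffle, and then walk
the data through those indices, so the access pattern is no longer
sequential and the prefetcher gets no help.

#include <algorithm>   // std::random_shuffle
#include <vector>

template< typename ArrayT >
double SumShuffled( ArrayT &data, unsigned length )
{
    std::vector<unsigned> index( length );
    for( unsigned i = 0; i < length; ++i )
        index[ i ] = i;                                  // 0, 1, ..., length-1
    std::random_shuffle( index.begin( ), index.end( ) ); // permute the visiting order

    double sum = 0.0;
    for( unsigned i = 0; i < length; ++i )
        sum += data[ index[ i ] ];                       // arbitrary, cache-unfriendly access
    return sum;
}

The index array itself costs memory and bandwidth, so it is not a pure
measurement, but it does turn sequential access into arbitrary access.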

So, here goes my code:


#include <iostream>
#include <fstream>

#include <vector>
#include <list>

// keep windows.h from defining min()/max() macros that would collide
// with my::max below
#define NOMINMAX
#include <windows.h>

#include <boost/numeric/ublas/vector.hpp>

using namespace std;

ofstream trash( "trash.txt" );

namespace my
{
    template< typename ValueT >
    inline ValueT const& max( ValueT const& l, ValueT const& r )
    {
        return ( l > r ) ? l : r;
    }

    template< typename ValueT >
    inline ValueT const& min( ValueT const& l, ValueT const& r )
    {
        return ( l < r ) ? l : r;
    }
}


// NB: 'array' is passed by value. For the raw pointer case this still fills
// the caller's buffer, but for std::vector / uBlas::vector only a temporary
// copy is initialized and the caller's container keeps the values it was
// constructed with.
template< typename ValueT, typename ArrayT > inline
void InitializeArray( ArrayT array, unsigned &length, ValueT &initializer )
{
    for( unsigned i = 0; i < length; ++i )
        array[ i ] = initializer;
}

template< typename ValueT > inline
ValueT ProcessMixedOps( ValueT &value )
{
    // 5 floating point operations per element
    return static_cast<ValueT>( (1.0 + value) * (1.5 - value) / value );
}

template< typename ValueT, typename ArrayT > inline
ValueT ProcessMixedArray( ArrayT &array, unsigned &length, unsigned &loops )
{
    ValueT result = 1;
    for( unsigned j = 0; j < loops; ++j )
        for( unsigned i = 0; i < length; ++i )
            result *= ProcessMixedOps( array[ i ] );
    return result;
}


// Allocation helper. Note: explicit specialization of member templates
// inside the class body is a Microsoft extension accepted by VC++ 7.1;
// other compilers may reject it.
template< typename ValueT >
class Memory
{
public:
    template< typename ArrayT >
    ArrayT Alloc( unsigned &length )
    {
        return ArrayT( length );
    }

    template<>
    ValueT* Alloc< ValueT* >( unsigned &length )
    {
        return new ValueT[ length ];
    }

    template< typename ArrayT >
    void Dealloc( ArrayT &array )
    {
        //array.clear( );
    }

    template<>
    void Dealloc< ValueT* >( ValueT* &array )
    {
        delete [] array;
    }
};


template< typename ValueT, typename ArrayT >
double Test( ValueT init, unsigned memLength )
{
    unsigned length = memLength / sizeof( ValueT );

    ArrayT Vector = Memory<ValueT>( ).Alloc<ArrayT>( length );

    InitializeArray( Vector, length, init );

    unsigned loops = my::max( 10000000 / length, (unsigned)1 );

    unsigned tick = GetTickCount( );

    double res = ProcessMixedArray<ValueT>( Vector, length, loops );

    tick = GetTickCount( ) - tick;

    Memory<ValueT>( ).Dealloc<ArrayT>( Vector );
    double dSec = (double) tick / (double) 1000;

    trash << res << endl; // output and forget result

    double dFlops = (double) length * 5.0 * loops;  // 5 flops per element per pass
    double dMFlops = dFlops / 1000000.0;

    return dMFlops / dSec;
}

int main2( )
{
    unsigned min_size_p = 1000000;      // smallest array size in bytes ( 1 MByte )
    unsigned max_size_p = 200000000;    // largest array size in bytes ( 200 MByte )

    cout << "Max Vector memory length: ";
    cout << max_size_p;
    cout << endl;

    // derive the value 1 at run time so the compiler cannot treat
    // 'number' as a compile-time constant
    DWORD dwNumber = GetTickCount( );
    dwNumber = GetTickCount( ) / dwNumber;

    short number = static_cast<short>( dwNumber );
    cout << "number: " << number << endl;

    cout << "\tdouble* 1\tvector<double>\tuBlas::vector<double>\n";

    for( unsigned v_size_p = min_size_p; v_size_p < max_size_p; v_size_p += 2000000 )
    {
        unsigned size = v_size_p;
        cout << fixed << size << "\t";
        cout << Test<double, double*>( number, size ) << "\t";
        cout << Test<double, vector<double> >( number, size ) << "\t";
        cout << Test<double, boost::numeric::ublas::vector<double> >( number, size ) << "\t";
        //cout << Test<double, list<double> >( number, size ) << "\t";
        //cout << Test<float, float*>( number, size ) << "\t";
        //cout << Test<float, vector<float> >( number, size ) << "\t";
        //cout << Test<int, int*>( number, size ) << "\t";

        cout << "\n";
    }

    cout << endl;

    return 0;
}

int main( )
{
    return main2( );
}
 

Axter

Ingo said:
I am a little confused by the results of some code that should give me
information about CPU cache effects.
....

When you did the test that showed vector being slower, did you run that
test in DEBUG mode? If you did, then your test is invalid. You should
run all performance tests in release mode only.
I've performed tests with vector vs. C-style array, and in my tests the
vector outperformed the C-style array.
My tests used VC++ 6.0 and VC++ 7.1.
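
One cheap sanity check (just a sketch, relying on the usual MSVC
convention that the debug runtimes /MDd and /MTd define _DEBUG): have
the benchmark print which configuration it was actually built with, so
a debug build cannot slip into the timings unnoticed.

#include <iostream>

int main( )
{
    // _DEBUG is defined by the Microsoft debug runtime libraries;
    // NDEBUG is the usual "asserts disabled" release convention
#ifdef _DEBUG
    std::cout << "built against the DEBUG runtime - timings are meaningless\n";
#else
    std::cout << "release build\n";
#endif
    return 0;
}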
 

Ingo Nolden

When you did the test that showed vector being slower, did you run that
test in DEBUG mode? If you did, then your test is invalid. You should
run all performance tests in release mode only.
I've performed tests with vector vs. C-style array, and in my tests the
vector outperformed the C-style array.
My tests used VC++ 6.0 and VC++ 7.1.

Hi Axter,

thank you for your reply.

As I wrote, I did the test in release mode, and I explained in detail
what settings I used, so I didn't leave any room for guesses. If it had
been a debug build, the result wouldn't have surprised me.
Your result doesn't surprise me too much either: as I wrote, on my AMD
CPU I got the same result as you *** with the same exe build as on the
Intel machine ***.

But what I would like to know is: what CPU do you have?

As long as nobody comes up with an idea of what is going wrong, I could
try to work out on which types of CPU I get which behaviour.

thanks
Ingo
 

Uenal Mutlu

....
Next I wanted to prove that it is not a compiler/optimization-dependent
issue, so I ran the same executable on a different machine, an AMD
Athlon 2400+ desktop with 1 GB RAM. On this machine I got an even more
surprising result:
std::vector and uBlas::vector performed well and even surpassed the
plain C array.
How is that ever possible? I guess it is mostly due to your code and/or CPU caching.
....

Try this framework:

/*
Measuring array access overhead STL vs. RAW
Written by uenal.mutlu at t-online.de

Compiler: VC++6, but should work with any compiler
AppType: Console
Compile and Link: CL /GX /W3 /Od PerfTest.cpp
Sample output:
int clkTicksSTL: 1492 clkTicksRAW: 1412
float clkTicksSTL: 1482 clkTicksRAW: 1412
double clkTicksSTL: 1492 clkTicksRAW: 1412

Result: the performance penalty for STL is about 5%.
This is IMO negligible.
*/

#include <iostream>
#include <vector>
#include <cstdlib>   // rand()
#include <ctime>     // clock()

template <typename T>
void PerfTestArrayAccess_STL_vs_RAW(const char* const pszTypename,
                                    const size_t& nelems,
                                    const size_t& niterations,
                                    clock_t& retClkTicksSTL,
                                    clock_t& retClkTicksRAW,
                                    bool AfDump = true)
{
    retClkTicksSTL = retClkTicksRAW = 0;
    size_t i, j;

    // Timing STL array access:
    std::vector<T> vect(nelems);
    clock_t clkTicksStart = clock();
    unsigned dummycounter = 0;
    for (i = 0; i < niterations; i++)
        for (j = 0; j < nelems; j++)
        {
            // note: rand() * rand() may overflow int if RAND_MAX is large
            unsigned ix = (rand() * rand()) % nelems;
            if (!(unsigned(vect[ix]) % 2))
                dummycounter++;
        }
    retClkTicksSTL = clock() - clkTicksStart;
    vect.clear();

    // Timing RAW array access:
    T* pa = new T[nelems];
    clkTicksStart = clock();
    dummycounter = 0;
    for (i = 0; i < niterations; i++)
        for (j = 0; j < nelems; j++)
        {
            unsigned ix = (rand() * rand()) % nelems;
            if (!(unsigned(pa[ix]) % 2))
                dummycounter++;
        }
    retClkTicksRAW = unsigned(clock() - clkTicksStart);
    delete [] pa;

    if (AfDump)
        std::cout << pszTypename << " "
                  << "clkTicksSTL: " << retClkTicksSTL << " "
                  << "clkTicksRAW: " << retClkTicksRAW << std::endl;
}

int main(int argc, char* argv[])
{
    const size_t nelems = 1000000;
    const size_t niterations = 5;

    clock_t clkTicksSTL, clkTicksRAW;

    PerfTestArrayAccess_STL_vs_RAW<int>(   "int   ", nelems, niterations, clkTicksSTL, clkTicksRAW, true);
    PerfTestArrayAccess_STL_vs_RAW<float>( "float ", nelems, niterations, clkTicksSTL, clkTicksRAW, true);
    PerfTestArrayAccess_STL_vs_RAW<double>("double", nelems, niterations, clkTicksSTL, clkTicksRAW, true);

    return 0;
}
 

block111

Not sure if I'm right, but I think the reason the C array appears to
be slower when the size is > 1 MB is that a C array is stack-based, and
the default stack size for Windows apps compiled with VC is 1 MB. A
vector doesn't store its values on the stack (maybe except for short
arrays, but that's not important in this case), so with large C arrays
there might be some sort of overhead from handling the extra stack
size. I didn't check your long source code, but from what I read I have
no other idea for such strange results...
 

block111

Your code doesn't compile cleanly (at least for me).
It seems that you use dynamically allocated C arrays, so I was wrong
about the stack-based overhead.
Why don't you want to use std::min and std::max defined in <algorithm>?
 

Uenal Mutlu

Your code doesn't compile cleanly (at least for me).

Which compiler?
What error does it report?
Why don't you want to use std::min and std::max defined in <algorithm>?

Sorry, I don't know what you mean. There was no need to use them in
the code I posted.

BTW, in case you don't know: it is possible to quote the relevant
portions of the posting one replies to. That helps readers understand
what the writer meant.
 

block111

Does it matter which compiler, as long as windows.h defines macros for
min and max and you DO use my::max in your code?
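
(For what it's worth, the usual workarounds, sketched here with
arbitrary numbers: define NOMINMAX before including windows.h so the
min/max macros are never created, or put the call in parentheses so a
function-like macro cannot expand.)

#define NOMINMAX            // stop windows.h from defining min/max macros
#include <windows.h>
#include <algorithm>
#include <iostream>

int main( )
{
    unsigned length = 25000;
    // the extra parentheses around std::max defeat a function-like macro,
    // so this line survives even without NOMINMAX
    unsigned loops = (std::max)( 10000000u / length, 1u );
    std::cout << loops << '\n';
    return 0;
}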
There were some portions of the code that seemed quite strange.
What does this part of the code do:
DWORD dwNumber = GetTickCount( );
dwNumber = GetTickCount( ) / dwNumber;

Perhaps, for questions completely unrelated to any platform, you would
want to avoid windows.h and use <ctime>. For example, a timer could be:

#include <ctime>

class timer {
    std::clock_t t;
public:
    timer() : t(std::clock()) {}
    // CLOCKS_PER_SEC is the standard name; the old CLK_TCK macro is obsolete
    double stop() { return static_cast<double>(std::clock() - t) / CLOCKS_PER_SEC; }
};
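
A quick usage sketch (the busy loop is made up; it assumes the timer
class above is visible):

#include <iostream>

int main( )
{
    timer t;                          // starts timing in the constructor
    volatile double sink = 0.0;
    for ( long i = 0; i < 10000000L; ++i )
        sink += 1.0;                  // some work to measure
    std::cout << "elapsed: " << t.stop( ) << " s\n";
    return 0;
}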


And when I compiled the code I didn't get any unexpected results; they
were all in a reasonable range, with vector/ublas::vector being a bit
better than C-style arrays. This is probably a result of your coding
and not any sort of optimization, IMO.
 

block111

and yes, I know about quoting :)
I just use google::groups and am not subscribed via any Usenet client.
With the Google Groups interface, to quote I'd have to manually copy
your message and add "> " to each line.
 

Uenal Mutlu

Does it matter which compiler, as long as windows.h defines macros for
min and max and you DO use my::max in your code?
There were some portions of the code that seemed quite strange.
What does this part of the code do:
DWORD dwNumber = GetTickCount( );
dwNumber = GetTickCount( ) / dwNumber;

You are not referring to my code; that is Ingo Nolden's code.
Mine does not use any Windows stuff.
Here is a slightly updated version of my code:

/*
PerfTest.cpp v1.01

Measuring array access overhead STL vs. RAW
Written by uenal.mutlu at t-online.de

Compiler: VC++6, but should work with any compiler
AppType: Console
Compile and Link: CL /GX /W3 /Od PerfTest.cpp
Sample output:
int clkTicksSTL: 1492 clkTicksRAW: 1412
float clkTicksSTL: 1482 clkTicksRAW: 1412
double clkTicksSTL: 1492 clkTicksRAW: 1412

Result: the performance penalty for STL is about 5%.
This is IMO negligible.
*/

#include <iostream>
#include <vector>
#include <cstdlib>   // rand()
#include <ctime>     // clock()

template <typename T>
void PerfTestArrayAccess_STL_vs_RAW(const char* const pszTypename,
                                    const size_t& nelems,
                                    const size_t& niterations,
                                    clock_t& retClkTicksSTL,
                                    clock_t& retClkTicksRAW,
                                    bool AfDump = true)
{
    retClkTicksSTL = retClkTicksRAW = 0;
    size_t i, j;

    // Timing STL array access:
    std::vector<T> vect(nelems);
    clock_t clkTicksStart = clock();
    unsigned dummycounter = 0;
    for (i = 0; i < niterations; i++)
        for (j = 0; j < nelems; j++)
        {
            // note: rand() * rand() may overflow int if RAND_MAX is large
            unsigned ix = (rand() * rand()) % nelems;
            if (!(unsigned(vect[ix]) % 2))
                dummycounter++;
        }
    retClkTicksSTL = clock() - clkTicksStart;
    vect.clear();

    // Timing RAW array access:
    T* pa = new T[nelems];
    clkTicksStart = clock();
    dummycounter = 0;
    for (i = 0; i < niterations; i++)
        for (j = 0; j < nelems; j++)
        {
            unsigned ix = (rand() * rand()) % nelems;
            if (!(unsigned(pa[ix]) % 2))
                dummycounter++;
        }
    retClkTicksRAW = clock() - clkTicksStart;
    delete [] pa;

    if (AfDump)
        std::cout << pszTypename << " "
                  << "clkTicksSTL: " << retClkTicksSTL << " "
                  << "clkTicksRAW: " << retClkTicksRAW << std::endl;
}

int main(int argc, char* argv[])
{
    const size_t nelems = 1000000;
    const size_t niterations = 5;

    clock_t clkTicksSTL, clkTicksRAW;

    PerfTestArrayAccess_STL_vs_RAW<int>(   "int   ", nelems, niterations, clkTicksSTL, clkTicksRAW, true);
    PerfTestArrayAccess_STL_vs_RAW<float>( "float ", nelems, niterations, clkTicksSTL, clkTicksRAW, true);
    PerfTestArrayAccess_STL_vs_RAW<double>("double", nelems, niterations, clkTicksSTL, clkTicksRAW, true);

    return 0;
}
 
B

block111

Yeah, I thought I was talking with the original poster. I din't
try/look at your code and cannot comment on that :)
 

Richard Herring

In an earlier message, block111 said:
and yes, I know about quoting :)
I just use google::groups and am not subscribed via any Usenet client.
With the Google Groups interface, to quote I'd have to manually copy
your message and add "> " to each line.
Not true, unless they've changed the interface *again* :-( If you click
on "show options" at the top of the message, and then on the "followup"
link that is revealed there, Google will quote the message for you.
 

Ingo Nolden

I wanted to get a 1 without the compiler being aware of its value.

Uenal Mutlu said:
You are not referring to my code; that is Ingo Nolden's code.
Mine does not use any Windows stuff.
Here is a slightly updated version of my code:

/*
PerfTest.cpp v1.01

Measuring array access overhead STL vs. RAW
Written by uenal.mutlu at t-online.de

Compiler: VC++6, but should work with any compiler
AppType: Console
Compile and Link: CL /GX /W3 /Od PerfTest.cpp

And compiled with optimization disabled, what are you going to do with
it? The performance penalty of anything is quite irrelevant if it is
not compiled with the settings that I use for production code, is it?
The assembly of my code looked quite similar for both the dynamic C
array and std::vector. This is what I actually hoped to see, but the
type of memory used seems to differ.

Sample output:
int clkTicksSTL: 1492 clkTicksRAW: 1412
float clkTicksSTL: 1482 clkTicksRAW: 1412
double clkTicksSTL: 1492 clkTicksRAW: 1412

Result: the performance penalty for STL is about 5%.
This is IMO negligible.

Against which runtime are you linking? I get results comparable to yours
if I link against the non-debug runtime but, as you did, with
optimization disabled (/Od).

Ingo
 

Ingo Nolden

And when I compiled the code I didn't get any unexpected results; they
were all in a reasonable range, with vector/ublas::vector being a bit
better than C-style arrays. This is probably a result of your coding
and not any sort of optimization, IMO.

Could you tell me what CPU you have?
 

Uenal Mutlu

I wanted to get a 1 without the compiler being aware of its value.


And compiled with optimization disabled, what are you going to do with
it? The performance penalty of anything is quite irrelevant if it is
not compiled with the settings that I use for production code, is it?
The assembly of my code looked quite similar for both the dynamic C
array and std::vector. This is what I actually hoped to see, but the
type of memory used seems to differ.



Against which runtime are you linking? I get results comparable to yours
if I link against the non-debug runtime but, as you did, with
optimization disabled (/Od).

I tested on W2kP.
Using /Ox (Full Optimization) gives:

int clkTicksSTL: 1191 clkTicksRAW: 1202
float clkTicksSTL: 1202 clkTicksRAW: 1201
double clkTicksSTL: 1202 clkTicksRAW: 1202

So, using full optimization there is virtually no overhead.
 

Jerry Coffin

Uenal Mutlu wrote:

[ ... ]
I tested on W2kP.
Using /Ox (Full Optimization) gives:

int clkTicksSTL: 1191 clkTicksRAW: 1202
float clkTicksSTL: 1202 clkTicksRAW: 1201
double clkTicksSTL: 1202 clkTicksRAW: 1202

So, using full optimization there is virtually no overhead.

I'd go even further: the numbers are sufficiently similar that there's
probably no statistically significant difference at all. In fact, with
the right set of optimizations (/O2b2 /G6ry) I'm consistently getting
results like this:

int clkTicksSTL: 125 clkTicksRAW: 140
float clkTicksSTL: 125 clkTicksRAW: 140
double clkTicksSTL: 125 clkTicksRAW: 141

I haven't looked at the assembly code to figure out exactly why, but in
this particular case, vectors are a bit _faster_ than arrays -- and in
this case, the differences are sufficiently large and consistent to be
significant.

Optimizing for the Pentium 4 (/G7) improves the raw array to
approximately equal the vector:

int clkTicksSTL: 109 clkTicksRAW: 125
float clkTicksSTL: 125 clkTicksRAW: 109
double clkTicksSTL: 109 clkTicksRAW: 110

OTOH, the noise in the measurement appears larger than any difference
we might be measuring. Bumping up the iteration count gives:

int clkTicksSTL: 4671 clkTicksRAW: 4672
float clkTicksSTL: 4657 clkTicksRAW: 4671
double clkTicksSTL: 4657 clkTicksRAW: 4656

This still gives pretty inconclusive results -- we'd have to run the
tests quite a few times to be sure whether there was a difference or
not -- but if it's this close, the vector is an easy choice.
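
A minimal sketch of what "running the tests quite a few times" could
look like (the workload and the run count are made up for illustration):
time the same job repeatedly and report the mean and the sample standard
deviation, so two configurations can be compared against the spread of
the measurements instead of a single number.

#include <cmath>
#include <ctime>
#include <cstddef>
#include <iostream>
#include <vector>

// Hypothetical workload standing in for one benchmark run.
double run_once()
{
    std::clock_t start = std::clock();
    volatile double sink = 0.0;
    for (long i = 1; i <= 5000000L; ++i)
        sink += 1.0 / i;                       // keep the CPU busy
    return double(std::clock() - start) / CLOCKS_PER_SEC;
}

int main()
{
    const std::size_t runs = 10;
    std::vector<double> t(runs);
    for (std::size_t i = 0; i < runs; ++i)
        t[i] = run_once();

    double mean = 0.0;
    for (std::size_t i = 0; i < runs; ++i)
        mean += t[i];
    mean /= runs;

    double var = 0.0;                          // sample variance
    for (std::size_t i = 0; i < runs; ++i)
        var += (t[i] - mean) * (t[i] - mean);
    var /= (runs - 1);

    std::cout << "mean " << mean << " s, stddev " << std::sqrt(var) << " s\n";
    return 0;
}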
 
