performance of std::vector<double>, double[] and uBlas::vector on different CPU

Discussion in 'C++' started by Ingo Nolden, Apr 24, 2005.

  1. Ingo Nolden

    Ingo Nolden Guest

    Dear Group,



    I am a little confused by the results of some code that was meant to give
    me information about CPU cache effects.

    I wrote a function that performs some flops on a vector/array of doubles
    or floats of varying size. While playing around and trying different
    things, I compared a standard C array with a std::vector and the vector
    from the Boost library (uBlas).
    The program was compiled with the VC++ 7.1 compiler using the default
    release settings, and later also with whole program optimization and
    global optimization enabled (which made no difference).

    On my laptop (Intel P4 2.6 GHz, 512 MB RAM) the result was extremely
    surprising: the raw C array performed as expected, in a range between
    300 and 350 MFlops (if my flops calculation is right).
    The other arrays, however, were about 80 times (!) slower.
    I had expected them to be perhaps a few percent slower.
    The asm code (as far as I can guess what it means) seemed to be doing
    the same thing in all cases, however.
    This made me think that it must be a memory access issue. It cannot be
    due to main memory size, because the difference occurs at any array
    size, beginning from 1 MB.
    Next I wanted to prove that it is not a compiler/optimization dependent
    issue. I ran the same executable on a different machine, an AMD
    Athlon 2400+ desktop with 1 GB RAM. On this machine I got an even more
    surprising result:
    The std::vector and uBlas::vector performed well and even outperformed
    the plain C array.

    I usually don't care so much about performance, but 8000% is worth
    thinking about.

    Below is my source. If you have no uBlas at hand, you can comment out the
    two uBlas lines and it should work.

    Also, to get back to my original intention, I want to change from
    sequential access to arbitrary access of the vector items. Does anyone
    know a good/standard way to do so? It should put additional load on
    the CPU.

    So, here goes my code:


    #include <iostream>
    #include <fstream>

    #include <vector>
    #include <list>

    #include <windows.h>

    //#include <math.h>
    //#include <float.h>

    #include <boost/numeric/ublas/vector.hpp>

    using namespace std;

    ofstream trash( "trash.txt" );

    namespace my
    {
        template< typename ValueT >
        inline ValueT const& max( ValueT const& l, ValueT const& r )
        {
            return ( l > r ) ? l : r;
        }

        template< typename ValueT >
        inline ValueT const& min( ValueT const& l, ValueT const& r )
        {
            return ( l < r ) ? l : r;
        }
    }


    // Note: 'array' is taken by value, so a container argument is copied here.
    template< typename ValueT, typename ArrayT > inline
    void InitializeArray( ArrayT array, unsigned &length, ValueT &initializer )
    {
        for( unsigned i = 0; i < length; ++i )
            array[ i ] = initializer;
    }

    template< typename ValueT > inline
    ValueT ProcessMixedOps( ValueT &value )
    {
        return static_cast<ValueT>( (1.0 + value) * (1.5 - value) / value );
    }

    template< typename ValueT, typename ArrayT > inline
    ValueT ProcessMixedArray( ArrayT &array, unsigned &length, unsigned &loops )
    {
        ValueT result = 1;
        for( unsigned j = 0; j < loops; ++j )
            for( unsigned i = 0; i < length; ++i )
                result *= ProcessMixedOps( array[ i ] );
        return result;
    }


    template< typename ValueT >
    class Memory
    {
    public:
        template< typename ArrayT >
        ArrayT Alloc( unsigned &length )
        {
            return ArrayT( length );
        }

        // Explicit specialization at class scope: VC++ accepts this,
        // other compilers may reject it.
        template<>
        ValueT* Alloc< ValueT* >( unsigned &length )
        {
            return new ValueT[ length ];
        }

        template< typename ArrayT >
        void Dealloc( ArrayT &array )
        {
            //array.clear( );
        }
        template<>
        void Dealloc< ValueT* >( ValueT* &array )
        {
            delete[] array; // array form, matching new ValueT[ length ]
        }
    };


    template< typename ValueT, typename ArrayT >
    double Test( ValueT init, unsigned memLength )
    {
        unsigned length = memLength / sizeof( ValueT );

        ArrayT Vector = Memory<ValueT>( ).Alloc<ArrayT>( length );

        InitializeArray( Vector, length, init );

        unsigned loops = my::max( 10000000 / length, (unsigned)1 );

        unsigned tick = GetTickCount( );

        double res = ProcessMixedArray<ValueT>( Vector, length, loops );

        tick = GetTickCount( ) - tick;

        Memory<ValueT>( ).Dealloc<ArrayT>( Vector );
        double dSec = (double) tick / (double) 1000;

        trash << res << endl; // output and forget result

        // 5 flops per element: add, subtract, multiply, divide, multiply into result
        double dFlops = (double) length * 5.0 * loops;
        double dMFlops = dFlops / 1000000.0;

        return dMFlops / dSec;
    }

    int main2( )
    {
        // Buffer sizes are given directly in bytes.
        unsigned min_size_p = 1000000;   // first buffer size: 1,000,000 bytes
        unsigned max_size_p = 200000000; // upper bound: 200,000,000 bytes

        cout << "Max Vector memory length: ";
        cout << max_size_p;
        cout << endl;

        // Evaluates to 1 at run time without the compiler being able to know it.
        DWORD dwNumber = GetTickCount( );
        dwNumber = GetTickCount( ) / dwNumber;

        short number = static_cast<short>( dwNumber );
        cout << "number: " << number << endl;

        //cout << "double \t float \t int\n";

        cout << "\tdouble* 1\tvector<double>\tuBlas::vector<double>\n";

        for( unsigned v_size_p = min_size_p; v_size_p < max_size_p; v_size_p += 2000000 )
        {
            unsigned size = v_size_p;//(unsigned)pow( 2, v_size_p );
            cout << fixed << size << "\t";
            cout << Test<double, double*>( number, size ) << "\t";
            cout << Test<double, vector<double> >( number, size ) << "\t";
            cout << Test<double, boost::numeric::ublas::vector< double> >( number, size ) << "\t";
            //cout << Test<double, list<double> >( number, size ) << "\t";
            //cout << Test<float, float*>( number, size ) << "\t";
            //cout << Test<float, vector<float> >( number, size ) << "\t";
            //cout << Test<int, int*>( number, size ) << "\t";

            cout << "\n";
        }

        //cout << dMFlops << " / " << dSec << " = " << dMFlops / dSec;

        cout << endl;

        return 0;
    }
    int main( )
    {
        return main2( );
    }
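One detail of the listing above may be worth checking before looking at caches: InitializeArray receives its array argument by value. For a double* the copy still points at the same buffer, but a std::vector argument is copied, so the caller's elements stay at the 0.0 that std::vector gives them on construction, and the kernel then divides by zero. The sketch below reproduces only that effect; the helper names are hypothetical and not part of the posted program. Whether the resulting non-finite values account for the slowdown seen on the P4 is an assumption that nothing in this thread confirms.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Same shape as InitializeArray in the post: the array is taken BY VALUE.
template <typename ArrayT, typename ValueT>
void init_by_value(ArrayT array, std::size_t length, ValueT init)
{
    for (std::size_t i = 0; i < length; ++i)
        array[i] = init;   // fills the local copy when ArrayT is a container
}

// Hypothetical variant that takes the array by reference instead.
template <typename ArrayT, typename ValueT>
void init_by_reference(ArrayT& array, std::size_t length, ValueT init)
{
    for (std::size_t i = 0; i < length; ++i)
        array[i] = init;
}

// The per-element expression from ProcessMixedOps.
inline double mixed_ops(double value)
{
    return (1.0 + value) * (1.5 - value) / value;
}

// The asserts document what each call leaves behind.
inline void demo_by_value_vs_reference()
{
    std::vector<double> v(4);            // elements start at 0.0
    init_by_value(v, v.size(), 1.0);
    assert(v[0] == 0.0);                 // only the temporary copy was filled

    double* raw = new double[4]();       // zero-initialized heap array
    init_by_value(raw, 4, 1.0);
    assert(raw[0] == 1.0);               // the pointer copy shares the buffer
    delete[] raw;

    init_by_reference(v, v.size(), 1.0);
    assert(v[0] == 1.0);                 // now the caller's vector is filled

    assert(mixed_ops(1.0) == 1.0);       // (2.0 * 0.5) / 1.0
}
```

With an initializer of 1 every element contributes a factor of exactly 1.0, while an element left at 0.0 produces a division by zero, which on an IEEE 754 platform evaluates to infinity. If this is what happens in the vector runs, the three containers are not doing the same arithmetic, so their timings would not be directly comparable.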
     
    Ingo Nolden, Apr 24, 2005
    #1

  2. Ingo Nolden

    Axter Guest

    Re: performance of std::vector<double>, double[] and uBlas::vector on different CPU

    When you ran the test that showed vector being slower, did you run it
    in DEBUG mode?
    If you did, then your test is invalid.
    You should run all performance tests in release mode only.
    I've run tests comparing vector with a C-style array, and in my tests the
    vector outperformed the C-style array.
    My tests used VC++ 6.0 and VC++ 7.1.
     
    Axter, Apr 24, 2005
    #2

  3. Ingo Nolden

    Ingo Nolden Guest

    Re: performance of std::vector<double>, double[] and uBlas::vector on different CPU


    >
    > When you did the test that showed vector being slower, did you do that
    > test in DEBUG mode?
    > If you did, then your test is invalid.
    > You should perform all performance test in release mode only.
    > I've perform test with vector VS C-Style array, and in my test the
    > vector out performance the C-Style array.
    > My test used VC++ 6.0 and VC++ 7.1
    >


    Hi Axter,

    thank you for your reply,

    as I wrote, I did the test in Release mode, and I explained in detail
    what settings I used, so I didn't leave any room for guesses.
    If it had been in debug mode, it wouldn't have surprised me.
    Your result doesn't surprise me much either. As I wrote, on my AMD CPU I
    got the same result as you, *** with the same exe build as on the Intel
    machine ***.

    So what I would like to know is: what CPU do you have?

    As long as nobody comes up with an idea of what is going wrong, I
    could try to examine which type of CPU gives which behaviour.

    thanks
    Ingo
     
    Ingo Nolden, Apr 24, 2005
    #3
  4. Ingo Nolden

    Uenal Mutlu Guest

    Re: performance of std::vector<double>, double[] and uBlas::vector on different CPU

    "Ingo Nolden" wrote
    >
    > I am a little confused by the result of a code that should give me some
    > information about CPU cache effects.
    >
    > I wrote a function that performs some flops on a vector/array of doubles
    > or floats and of changing size . While playing around and trying
    > different things I compared the use of a standard c array with a
    > std::vector and the vector from the boost library.
    > The program was compiled with a VC++ 7.1 comiler with the default
    > release settings, and later with also whole program optimization and
    > global optimization activated ( which made no difference ).
    >
    > On my laptop intel P4 2.6GHz and 512 Mbyte RAM the result was extremely
    > surprising as the raw c array performed as expected in a range between
    > 300 and 350 MFlops ( if my Flops calculation is right ).
    > The other arrays however were about 80 times !!!! slower.
    > I had them expected to be probably some percentage slower.
    > Investigating the asm code ( as far as I can guess what it means ) seemd
    > to be doing the same thing however.
    > This made me think that it must be an issue about memory access. It can
    > not be due to main memory size because the difference occurs at any
    > array size, beginning from 1 Mbyte.
    > Now I wanted to prove that it is not an compiler/optimization dependant
    > issue. I ran the executable on a different machine, which is a AMD
    > Athlon 2400+ Desktop with 1Gb RAM. On this machine I got an even more
    > surprising result:
    > The std::vector and uBlas::vector performed well and even superceded the
    > plain c array.


    How is that ever possible? I guess it is mostly due to your code and/or CPU caching.

    > I usually don't care so much about performance, but 8000% is worth
    > thinking about it.
    >
    > Below is my source. If one has no uBlas at hand, he can comment the two
    > lines and it should work.
    >
    > Also, to get back to my original intention, I want to change from
    > sequential access to arbitrary access of the vector items. Does anyone
    > know a good/standard way to do so? It should put additional effort on
    > the CPU.

    ....

    Try this framework:

    /*
    Measuring array access overhead STL vs. RAW
    Written by uenal.mutlu at t-online.de

    Compiler: VC++6, but should work with any compiler
    AppType: Console
    Compile and Link: CL /GX /W3 /Od PerfTest.cpp
    Sample output:
    int clkTicksSTL: 1492 clkTicksRAW: 1412
    float clkTicksSTL: 1482 clkTicksRAW: 1412
    double clkTicksSTL: 1492 clkTicksRAW: 1412

    Result: the performance penalty for STL is about 5%.
    This is IMO negligible.
    */

    #include <iostream>
    #include <vector>
    #include <ctime>
    #include <cstdlib>   // rand()

    template <typename T>
    void PerfTestArrayAccess_STL_vs_RAW(const char* const pszTypename,
                                        const size_t& nelems,
                                        const size_t& niterations,
                                        clock_t& retClkTicksSTL,
                                        clock_t& retClkTicksRAW,
                                        bool AfDump = true)
    {
        retClkTicksSTL = retClkTicksRAW = 0;
        size_t i, j;

        // Timing STL array access.
        // Note: rand() runs inside both timed loops, so its cost is part of
        // both measurements.
        std::vector<T> vect(nelems);
        clock_t clkTicksStart = clock();
        unsigned dummycounter = 0;
        for (i = 0; i < niterations; i++)
            for (j = 0; j < nelems; j++)
            {
                unsigned ix = (rand() * rand()) % nelems;
                if (!(unsigned(vect[ix]) % 2))
                    dummycounter++;
            }
        retClkTicksSTL = clock() - clkTicksStart;
        vect.clear();

        // Timing RAW array access.
        // The trailing () zero-initializes the elements, as the vector does.
        T* pa = new T[nelems]();
        clkTicksStart = clock();
        dummycounter = 0;
        for (i = 0; i < niterations; i++)
            for (j = 0; j < nelems; j++)
            {
                unsigned ix = (rand() * rand()) % nelems;
                if (!(unsigned(pa[ix]) % 2))
                    dummycounter++;
            }
        retClkTicksRAW = unsigned(clock() - clkTicksStart);
        delete[] pa;   // array form, matching new T[nelems]

        if (AfDump)
            std::cout << pszTypename << " "
                      << "clkTicksSTL: " << retClkTicksSTL << " "
                      << "clkTicksRAW: " << retClkTicksRAW << std::endl;
    }

    int main(int argc, char* argv[])
    {
        const size_t nelems = 1000000;
        const size_t niterations = 5;

        clock_t clkTicksSTL, clkTicksRAW;

        PerfTestArrayAccess_STL_vs_RAW<int>( "int ", nelems, niterations, clkTicksSTL, clkTicksRAW, true);
        PerfTestArrayAccess_STL_vs_RAW<float>( "float ", nelems, niterations, clkTicksSTL, clkTicksRAW, true);
        PerfTestArrayAccess_STL_vs_RAW<double>("double", nelems, niterations, clkTicksSTL, clkTicksRAW, true);

        return 0;
    }
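The framework above draws its random index with rand() inside the timed loop, so the generator's cost is measured along with the memory access. One possible alternative for the "arbitrary access" asked about in the first post is to build a shuffled index table once, before timing starts, and then walk the data in that order. This is only a sketch under assumptions: it needs a C++11 compiler (newer than the compilers used in this thread), and the function names are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Build a random permutation of 0..n-1 once, outside the timed region.
inline std::vector<std::size_t> make_shuffled_indices(std::size_t n, unsigned seed)
{
    std::vector<std::size_t> idx(n);
    std::iota(idx.begin(), idx.end(), std::size_t(0));
    std::mt19937 gen(seed);                      // fixed seed: repeatable order
    std::shuffle(idx.begin(), idx.end(), gen);
    return idx;
}

// Visit the elements in the precomputed order. Works for a raw array and
// for std::vector storage alike, since both are contiguous.
inline double sum_in_order(const double* data, std::size_t n,
                           const std::vector<std::size_t>& order)
{
    double sum = 0.0;
    for (std::size_t k = 0; k < order.size(); ++k) {
        assert(order[k] < n);                    // index must stay inside the buffer
        sum += data[order[k]];
    }
    return sum;   // returning the sum keeps the loop from being optimized away
}
```

Timing only the call to sum_in_order, once with `v.data()` of a std::vector and once with a pointer from new[], would compare the two containers on identical access patterns. Note that the index table is itself read sequentially and competes for cache space, so it is not a pure random-access measurement either.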
     
    Uenal Mutlu, Apr 25, 2005
    #4
  5. Ingo Nolden

    Guest

    Re: performance of std::vector<double>, double[] and uBlas::vector on different CPU

    Not sure if I'm right, but I think the reason the C array appears to
    be slower when the size is > 1 MB is that the C array is stack-based and
    the default stack size for Windows apps compiled with VC is 1 MB. Vector
    doesn't store its values on the stack (except perhaps for short arrays, but
    that's not important in this case), so with large C arrays there might be
    some sort of overhead from handling the extra stack size. I didn't check
    your long source code, but from what I read I have no other idea for such
    strange results...
     
    , Apr 25, 2005
    #5
  6. Ingo Nolden

    Guest

    Re: performance of std::vector<double>, double[] and uBlas::vector on different CPU

    Your code doesn't compile cleanly (at least for me).
    It seems that you use dynamic C arrays, so I was wrong about the
    stack-based overhead.
    Why don't you want to use std::min and std::max defined in <algorithm>?
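For context on the std::max question: <windows.h> defines min and max as function-like macros unless NOMINMAX is defined before the header is included, and those macros collide with the names in <algorithm>. A common workaround is to parenthesize the function name so the preprocessor does not see a macro call. The sketch below is an illustration, not the poster's code: it defines a stand-in macro itself so that it builds without <windows.h>, and loop_count is a hypothetical name for the computation done in Test().

```cpp
#include <algorithm>
#include <cassert>

// Stand-in for the macro that <windows.h> defines when NOMINMAX is not set.
// Defined here only so the sketch builds on any platform.
#define max(a, b) (((a) > (b)) ? (a) : (b))

// Same computation as in Test(): 10000000 / length, but at least one loop.
inline unsigned loop_count(unsigned length)
{
    assert(length > 0);   // a zero length would divide by zero
    // The parentheses around std::max keep the preprocessor from expanding
    // the function-like macro; both arguments are unsigned, so the template
    // argument is deduced without a cast.
    return (std::max)(10000000u / length, 1u);
}

#undef max   // do not leak the stand-in macro into later code
```

Defining NOMINMAX before including <windows.h> removes the macros altogether, which makes the my::max helper unnecessary.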
     
    , Apr 25, 2005
    #6
  7. Ingo Nolden

    Uenal Mutlu Guest

    Re: performance of std::vector<double>, double[] and uBlas::vector on different CPU

    <> wrote

    > Your code doesn't compile well (at least for me)


    Which compiler?
    What error does it report?

    > Why don't you want to use std::min and std::max defined in <algorithm>?


    Sorry, I don't know what you mean. There was no necessity
    to use them in the posted code of mine.

    BTW, in case you don't know: there is a possibility to quote the relevant
    portions of a posting one replies to. This helps to understand what
    the writer might have meant.
     
    Uenal Mutlu, Apr 25, 2005
    #7
  8. Ingo Nolden

    Guest

    Re: performance of std::vector<double>, double[] and uBlas::vector on different CPU

    Does it matter which compiler, as long as windows.h defines macros for
    min and max and you DO use my::max in your code?
    Some portions of the code seem quite strange.
    What does this part of the code do:

        DWORD dwNumber = GetTickCount( );
        dwNumber = GetTickCount( ) / dwNumber;

    Perhaps, for questions not tied to any platform, you would
    want to avoid windows.h and use <ctime> instead.
    For example, a timer could be:

    #include <ctime>

    class timer {
        std::clock_t t;
    public:
        timer() : t(std::clock()) {}
        // elapsed time in seconds, as reported by std::clock()
        double stop() { return static_cast<double>(std::clock() - t) / CLOCKS_PER_SEC; }
    };


    And when I compiled the code I didn't get any unexpected results; they
    were all in a reasonable range, with vector/ublas::vector being a bit
    better than C-style arrays. This is probably a result of your coding
    and not any sort of optimization, IMO.
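As a variation on the timer idea, here is a minimal sketch that uses std::chrono::steady_clock, assuming a C++11 compiler (newer than the ones discussed in this thread). The names stopwatch and mflops are hypothetical. The helper mirrors the MFlops arithmetic in Test() and makes its hidden precondition explicit: GetTickCount typically advances in steps of roughly 10 to 16 ms, so a short run can report an elapsed time of zero.

```cpp
#include <cassert>
#include <chrono>

// Wall-clock stopwatch based on a monotonic clock.
class stopwatch {
    std::chrono::steady_clock::time_point start_;
public:
    stopwatch() : start_(std::chrono::steady_clock::now()) {}

    // Seconds elapsed since construction.
    double seconds() const
    {
        const std::chrono::duration<double> d =
            std::chrono::steady_clock::now() - start_;
        return d.count();
    }
};

// MFlops from a flop count and an elapsed time, as computed in Test().
inline double mflops(double flop_count, double elapsed_seconds)
{
    assert(elapsed_seconds > 0.0);   // a zero reading would divide by zero
    return flop_count / 1000000.0 / elapsed_seconds;
}
```

A run would construct a stopwatch just before the loop and pass `length * 5.0 * loops` together with `seconds()` to mflops afterwards.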
     
    , Apr 25, 2005
    #8
  9. Ingo Nolden

    Guest

    Re: performance of std::vector<double>, double[] and uBlas::vector on different CPU

    and yes, I know about quoting :)
    I just use google::groups and do not subscribe to any usenet etc etc.
    With google groups interface to have quoting I'd need to manually copy
    your message and add "> " for each line
     
    , Apr 25, 2005
    #9
  10. Ingo Nolden

    Uenal Mutlu Guest

    Re: performance of std::vector<double>, double[] and uBlas::vector on different CPU

    <> wrote
    > does it matter which compiler as long as windows.h defines macros for
    > min and max and you DO use my::max in your code.
    > There were some portions of code that seem quite strange.
    > What does this part of code do:
    > DWORD dwNumber = GetTickCount( );
    > dwNumber = GetTickCount( ) / dwNumber;


    You are not referring to my code; that is Ingo Nolden's code.
    Mine does not use Windows stuff.
    Here is a slightly updated version of my code:

    /*
    PerfTest.cpp v1.01

    Measuring array access overhead STL vs. RAW
    Written by uenal.mutlu at t-online.de

    Compiler: VC++6, but should work with any compiler
    AppType: Console
    Compile and Link: CL /GX /W3 /Od PerfTest.cpp
    Sample output:
    int clkTicksSTL: 1492 clkTicksRAW: 1412
    float clkTicksSTL: 1482 clkTicksRAW: 1412
    double clkTicksSTL: 1492 clkTicksRAW: 1412

    Result: the performance penalty for STL is about 5%.
    This is IMO negligible.
    */

    #include <iostream>
    #include <vector>
    #include <ctime>
    #include <cstdlib>   // rand()

    template <typename T>
    void PerfTestArrayAccess_STL_vs_RAW(const char* const pszTypename,
                                        const size_t& nelems,
                                        const size_t& niterations,
                                        clock_t& retClkTicksSTL,
                                        clock_t& retClkTicksRAW,
                                        bool AfDump = true)
    {
        retClkTicksSTL = retClkTicksRAW = 0;
        size_t i, j;

        // Timing STL array access.
        // Note: rand() runs inside both timed loops, so its cost is part of
        // both measurements.
        std::vector<T> vect(nelems);
        clock_t clkTicksStart = clock();
        unsigned dummycounter = 0;
        for (i = 0; i < niterations; i++)
            for (j = 0; j < nelems; j++)
            {
                unsigned ix = (rand() * rand()) % nelems;
                if (!(unsigned(vect[ix]) % 2))
                    dummycounter++;
            }
        retClkTicksSTL = clock() - clkTicksStart;
        vect.clear();

        // Timing RAW array access.
        // The trailing () zero-initializes the elements, as the vector does.
        T* pa = new T[nelems]();
        clkTicksStart = clock();
        dummycounter = 0;
        for (i = 0; i < niterations; i++)
            for (j = 0; j < nelems; j++)
            {
                unsigned ix = (rand() * rand()) % nelems;
                if (!(unsigned(pa[ix]) % 2))
                    dummycounter++;
            }
        retClkTicksRAW = clock() - clkTicksStart;
        delete[] pa;   // array form, matching new T[nelems]

        if (AfDump)
            std::cout << pszTypename << " "
                      << "clkTicksSTL: " << retClkTicksSTL << " "
                      << "clkTicksRAW: " << retClkTicksRAW << std::endl;
    }

    int main(int argc, char* argv[])
    {
        const size_t nelems = 1000000;
        const size_t niterations = 5;

        clock_t clkTicksSTL, clkTicksRAW;

        PerfTestArrayAccess_STL_vs_RAW<int>( "int ", nelems, niterations, clkTicksSTL, clkTicksRAW, true);
        PerfTestArrayAccess_STL_vs_RAW<float>( "float ", nelems, niterations, clkTicksSTL, clkTicksRAW, true);
        PerfTestArrayAccess_STL_vs_RAW<double>("double", nelems, niterations, clkTicksSTL, clkTicksRAW, true);

        return 0;
    }
     
    Uenal Mutlu, Apr 25, 2005
    #10
  11. Ingo Nolden

    Guest

    Re: performance of std::vector<double>, double[] and uBlas::vector on different CPU

    Yeah, I thought I was talking with the original poster. I didn't
    try/look at your code and cannot comment on that :)
     
    , Apr 25, 2005
    #11
  12. Re: performance of std::vector<double>, double[] and uBlas::vector on different CPU

    In message <>,
    writes
    >and yes, I know about quoting :)
    >I just use google::groups and do not subscribe to any usenet etc etc.
    >With google groups interface to have quoting I'd need to manually copy
    >your message and add "> " for each line
    >

    Not true. Unless they've changed the interface *again* :-( If you click
    on "show options" at the top of the message, and then on the "followup"
    that is revealed there, Google will quote the message for you.


    --
    Richard Herring
     
    Richard Herring, Apr 26, 2005
    #12
  13. Ingo Nolden

    Ingo Nolden Guest

    Re: performance of std::vector<double>, double[] and uBlas::vector on different CPU

    Uenal Mutlu wrote:
    > <> wrote
    >
    >>does it matter which compiler as long as windows.h defines macros for
    >>min and max and you DO use my::max in your code.
    >>There were some portions of code that seem quite strange.
    >>What does this part of code do:
    >>DWORD dwNumber = GetTickCount( );
    >>dwNumber = GetTickCount( ) / dwNumber;

    >

    I wanted to get a 1 without the compiler being aware.
    But true, I should use standard stuff like <ctime> and will do so from now.

    >
    > You are refering not to my code, it is the code of Ingo Nolden.
    > Mine does not use Windows stuff.
    > Here is a slightly updated version of my code:
    >
    > /*
    > PerfTest.cpp v1.01
    >
    > Measuring array access overhead STL vs. RAW
    > Written by uenal.mutlu at t-online.de
    >
    > Compiler: VC++6, but should work with any compiler
    > AppType: Console
    > Compile and Link: CL /GX /W3 /Od PerfTest.cpp


    And compiled with optimization disabled, what are you going to do with
    it? The performance penalty of anything is quite irrelevant if it is
    not compiled with the settings that I use for production code, is it?
    The assembly of my code looked quite similar for both the dynamic C array
    and std::vector. This is what I actually hoped to see, but the type of
    memory used seems different.


    > Sample output:
    > int clkTicksSTL: 1492 clkTicksRAW: 1412
    > float clkTicksSTL: 1482 clkTicksRAW: 1412
    > double clkTicksSTL: 1492 clkTicksRAW: 1412
    >
    > Result: the performance penalty for STL is about 5%.
    > This is IMO neglectable.


    Against which runtime are you linking? I get results comparable to yours
    if I link against the non-debug runtime but, as you did, disable
    optimization (/Od).

    Ingo
     
    Ingo Nolden, Apr 29, 2005
    #13
  14. Ingo Nolden

    Ingo Nolden Guest

    Re: performance of std::vector<double>, double[] and uBlas::vector on different CPU


    >
    > And when I compiled the code I didn't have any unexpected results they
    > were all in a reasonable range, with vector/ublas::vector being a bit
    > better than c-style arrays. This probably is a result of your coding
    > and not any sort of optimization, IMO
    >


    Can you tell me what CPU you have?
     
    Ingo Nolden, Apr 29, 2005
    #14
  15. Ingo Nolden

    Uenal Mutlu Guest

    Re: performance of std::vector<double>, double[] and uBlas::vector on different CPU

    "Ingo Nolden" wrote
    > Uenal Mutlu wrote:
    > > <> wrote
    > >
    > >>does it matter which compiler as long as windows.h defines macros for
    > >>min and max and you DO use my::max in your code.
    > >>There were some portions of code that seem quite strange.
    > >>What does this part of code do:
    > >>DWORD dwNumber = GetTickCount( );
    > >>dwNumber = GetTickCount( ) / dwNumber;

    > >

    > I wanted to get a 1 without the compiler being aware.
    > But true, I should use standard stuff like <ctime> and will do so from now.
    >
    > > You are refering not to my code, it is the code of Ingo Nolden.
    > > Mine does not use Windows stuff.
    > > Here is a slightly updated version of my code:
    > >
    > > /*
    > > PerfTest.cpp v1.01
    > >
    > > Measuring array access overhead STL vs. RAW
    > > Written by uenal.mutlu at t-online.de
    > >
    > > Compiler: VC++6, but should work with any compiler
    > > AppType: Console
    > > Compile and Link: CL /GX /W3 /Od PerfTest.cpp

    >
    > and compiled with optimization disabled, what are you going to do with
    > it? The performance penalty of anything is quite irrelevant, if it is
    > not compiled with the settings that I use for production code, is it?
    > The assembly of my code looked quite similar for both dynamic c-array
    > and std::vector. This is what I actually hoped to see, but the type of
    > memory used seems different.
    >
    >
    > > Sample output:
    > > int clkTicksSTL: 1492 clkTicksRAW: 1412
    > > float clkTicksSTL: 1482 clkTicksRAW: 1412
    > > double clkTicksSTL: 1492 clkTicksRAW: 1412
    > >
    > > Result: the performance penalty for STL is about 5%.
    > > This is IMO neglectable.

    >
    > against wich runtime are you linking. I get results comparable to yours
    > if I link against no-debug runtime - but as you did - disabled
    > optimization /Od


    I tested on W2kP.
    Using /Ox (Full Optimization) gives:

    int clkTicksSTL: 1191 clkTicksRAW: 1202
    float clkTicksSTL: 1202 clkTicksRAW: 1201
    double clkTicksSTL: 1202 clkTicksRAW: 1202

    So, using full optimization there is virtually no overhead.
     
    Uenal Mutlu, Apr 30, 2005
    #15
  16. Ingo Nolden

    Jerry Coffin Guest

    Re: performance of std::vector<double>, double[] and uBlas::vector on different CPU

    Uenal Mutlu wrote:

    [ ... ]

    > I tested on W2kP.
    > Using /Ox (Full Optimization) gives:
    >
    > int clkTicksSTL: 1191 clkTicksRAW: 1202
    > float clkTicksSTL: 1202 clkTicksRAW: 1201
    > double clkTicksSTL: 1202 clkTicksRAW: 1202
    >
    > So, using full optimization there is virtually no overhead.


    I'd go even further: the numbers are sufficiently similar that there's
    probably no statistically significant difference at all. In fact, with
    the right set of optimizations (/O2b2 /G6ry) I'm consistently getting
    results like this:

    int clkTicksSTL: 125 clkTicksRAW: 140
    float clkTicksSTL: 125 clkTicksRAW: 140
    double clkTicksSTL: 125 clkTicksRAW: 141

    I haven't looked at the assembly code to figure out exactly why, but in
    this particular case, vectors are a bit _faster_ than arrays -- and in
    this case, the differences are sufficiently large and consistent to be
    significant.

    Optimizing for the Pentium 4 (/G7) improves the raw array to
    approximately equal the vector:

    int clkTicksSTL: 109 clkTicksRAW: 125
    float clkTicksSTL: 125 clkTicksRAW: 109
    double clkTicksSTL: 109 clkTicksRAW: 110

    OTOH, the noise in the measurement appears larger than any difference
    we might be measuring. Bumping up the iteration count gives:

    int clkTicksSTL: 4671 clkTicksRAW: 4672
    float clkTicksSTL: 4657 clkTicksRAW: 4671
    double clkTicksSTL: 4657 clkTicksRAW: 4656

    This still gives pretty inconclusive results -- we'd have to run the
    tests quite a few times to be sure whether there was a difference or
    not -- but if it's this close, the vector is an easy choice.
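The point about running the tests several times can be sketched as follows: collect one timing per run and compare a robust summary such as the median instead of a single reading. This is only an illustration, assuming a C++11 compiler; time_runs and median are hypothetical helper names, and the workload is whatever callable the caller passes in.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Median of a set of timings. Takes a copy so the caller's order is kept.
inline double median(std::vector<double> samples)
{
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();
    return (n % 2 == 1) ? samples[n / 2]
                        : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
}

// Call 'work' the given number of times and record each run's wall-clock
// duration in seconds.
template <typename Work>
std::vector<double> time_runs(Work work, int runs)
{
    std::vector<double> seconds;
    for (int r = 0; r < runs; ++r) {
        const auto t0 = std::chrono::steady_clock::now();
        work();
        const auto t1 = std::chrono::steady_clock::now();
        seconds.push_back(std::chrono::duration<double>(t1 - t0).count());
    }
    return seconds;
}
```

Comparing the medians of the vector and raw-array timings, together with the spread between the fastest and slowest run of each, gives a rough idea of whether a difference is larger than the measurement noise.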

    --
    Later,
    Jerry.

    The universe is a figment of its own imagination.
     
    Jerry Coffin, Apr 30, 2005
    #16