Problem with STL vector performance, benchmarks included

S

StephQ

I found that old post:
http://groups.google.com/group/comp...&q=vector+no+surprise&rnum=2#15519204726d01e8

I just erased the #include <kubux.....> lines.

****** old post for your convenience ********
You are right:

#include <vector>
#include <iostream>
#include <algorithm> // for std::fill_n
#include <ctime>
#include <memory>

#include <kubux/bits/allocator.cc>
#include <kubux/bits/new_delete_allocator.cc>
#include <kubux/bits/malloc_free_allocator.cc>

template < typename T, typename Alloc = std::allocator<T> >
class stupid {
public:

typedef Alloc allocator;
typedef typename allocator::value_type value_type;
typedef typename allocator::size_type size_type;
typedef typename allocator::difference_type difference_type;
typedef typename allocator::pointer pointer;
typedef typename allocator::const_pointer const_pointer;
typedef typename allocator::reference reference;
typedef typename allocator::const_reference const_reference;

typedef pointer iterator;
typedef const_pointer const_iterator;
typedef std::reverse_iterator< iterator > reverse_iterator;
typedef std::reverse_iterator< const_iterator > const_reverse_iterator;

private:

pointer ptr;
size_type the_size;

public:

// Allocate raw memory through the allocator: the destructor below
// deallocates with the allocator, so pairing it with new[] here
// would be undefined behaviour.
stupid ( size_type length ) :
ptr ( allocator().allocate( length ) ),
the_size ( length )
{
for ( iterator iter = this->ptr;
iter != this->ptr + the_size;
++ iter ) {
::new( static_cast<void*>(iter) ) T();
}
}

~stupid ( void ) {
iterator iter = ptr + the_size;
while ( iter > ptr ) {
-- iter;
iter->~T();
}
{
allocator alloc;
alloc.deallocate( ptr, the_size );
}
the_size = 0;
}

reference operator[] ( size_type index ) {
return( this->ptr[ index ] );
}

const_reference operator[] ( size_type index ) const {
return( this->ptr[ index ] );
}

}; // stupid

int main ( void ) {
const unsigned long l = 50000000;
{
std::vector< int > v ( l );
std::clock_t loop_start = std::clock();
for ( unsigned long i = 0; i < l; ++i ) {
v[ i ] = 5;
}
std::clock_t loop_end = std::clock();
std::cout << "vector: " << loop_end - loop_start << std::endl;
}
{
int* v = new int [ l ];
std::fill_n(v, l, 0);
std::clock_t loop_start = std::clock();
for ( unsigned long i = 0; i < l; ++i ) {
v[ i ] = 5;
}
std::clock_t loop_end = std::clock();
std::cout << "array: " << loop_end - loop_start << std::endl;
delete [] v;
}
{
stupid< int, std::allocator<int> > v ( l );
std::clock_t loop_start = std::clock();
for ( unsigned long i = 0; i < l; ++i ) {
v[ i ] = 5;
}
std::clock_t loop_end = std::clock();
std::cout << "stupid: " << loop_end - loop_start << std::endl;
}
{
std::vector<int> v ( l );
std::clock_t loop_start = std::clock();
for ( std::vector<int>::iterator i = v.begin();
i != v.end(); ++i ) {
*i = 5;
}
std::clock_t loop_end = std::clock();
std::cout << "iterator: " << loop_end - loop_start << std::endl;
}
{
int* v = new int [ l ];
std::fill_n(v, l, 0);
std::clock_t loop_start = std::clock();
for ( int* i = v; i < v+l; ++i ) {
*i = 5;
}
std::clock_t loop_end = std::clock();
std::cout << "ptr: " << loop_end - loop_start << std::endl;
delete [] v;
}

}

vector: 320000
array: 320000
stupid: 350000
iterator: 340000
ptr: 340000

No surprises anymore.

Thanks

Kai-Uwe Bux
***************************************************

I ran the reported test on Visual Studio Professional 2005 with its
standard STL implementation, which should be supplied by Dinkumware.
My CPU is a dual-core T2500 with 2 GB of DDR2.

I tried both the Intel 9.1 compiler and the Microsoft one.
In both cases I used the O3 optimizations, release mode, and with the
Intel one I also tried the /Qansi_alias /Qipo options.

Results:

Microsoft:
vector: 141
array: 94
stupid: 93
ptr: 172
ptr: 78

Intel:
vector: 312
array: 156 // becomes 45 if I require P4 extensions; the other values
remain nearly the same
stupid: 157
ptr: 1047
ptr: 156

I admit I'm quite disappointed with the results obtained with the
Intel compiler.
Is there any fault in the way the test was conducted or with the
source code I posted?
If everything is correct, how could I investigate where the problem
is?

Cheers
StephQ
 
R

Roland Pibinger

I ran the reported test on Visual Studio Professional 2005 with its
standard STL implementation, which should be supplied by Dinkumware.
My CPU is a dual-core T2500 with 2 GB of DDR2.
I tried both the Intel 9.1 compiler and the Microsoft one.
In both cases I used the O3 optimizations, release mode, and with the
Intel one I also tried the /Qansi_alias /Qipo options.

Have you turned off checked iterators? (see:
http://www.codeproject.com/vcpp/stl/checkediterators.asp)
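For later readers: on VC8 the checked iterators described in that article are controlled by the `_SECURE_SCL` macro. A minimal sketch of how they are disabled (the macro is specific to the Microsoft standard library; other compilers simply ignore it):

```cpp
// Disable VC8's checked iterators. The macro must be defined before
// the first standard library header is included (or passed on the
// command line as /D_SECURE_SCL=0), otherwise it has no effect.
#define _SECURE_SCL 0
#include <vector>
```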
 
S

StephQ

Have you turned off checked iterators? (see: http://www.codeproject.com/vcpp/stl/checkediterators.asp)

Thank you for the very useful suggestion. I didn't know that checked
iterators were turned on by default in VC8, even in release mode.

The new results (with checked iterators turned off) are:

Microsoft:
vector: 94
array: 94
stupid: 94
ptr: 141
ptr: 96

Intel:
vector: 141
array: 141 // 62 if I enable SSE2
stupid: 141 // 62 if I enable SSE2 and disable exception handling
ptr: 141
ptr: 140

The situation is now much better.
However, it seems that the Microsoft compiler is still doing 35% better
in all the situations except the "vector iterator" one.

Do you have any other suggestion to try?
I know nothing of low-level instructions, but if I post the
assembler-like code here, would it be of any help to you?

Thank you

Cheers
StephQ
 
S

StephQ

Thank you for the very useful suggestion. I didn't know that checked
iterators were turned on by default in VC8, even in release mode.

The new results (with checked iterators turned off) are:

Microsoft:
vector: 94
array: 94
stupid: 94
ptr: 141
ptr: 96

Intel:
vector: 141
array: 141 // 62 if I enable SSE2
stupid: 141 // 62 if I enable SSE2 and disable exception handling
ptr: 141
ptr: 140

The situation is now much better.
However, it seems that the Microsoft compiler is still doing 35% better
in all the situations except the "vector iterator" one.

Do you have any other suggestion to try?
I know nothing of low-level instructions, but if I post the
assembler-like code here, would it be of any help to you?

Thank you

Cheers
StephQ

I reply to myself just to tell you that I am not going to investigate
these issues any further.
I ran the test using doubles instead of ints and the results are very
similar, with the Microsoft compiler being something like 3% faster.

However the Stepanov Abstraction test favours the intel compiler by a
large margin.
Abstraction penalty with Intel:
0.85
0.68 with sse2

With Microsoft:
1.11

A curiosity: how is it possible to get an abstraction penalty
below 1?

Cheers
StephQ
 
P

peter koch

Could you clue me in on what a "Stepanov Abstraction" test is?
Socks

The Stepanov abstraction penalty is a benchmark made by Alexander
Stepanov, the man behind the STL. Google if you want to know more.
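In brief, the benchmark times the same numeric loop written directly on built-in types and again through layers of abstraction (wrapper classes, iterators), and reports the ratio of the two times. A rough sketch of the idea (my own simplification, not Stepanov's actual code):

```cpp
#include <cstddef>

// A thin wrapper around double -- the kind of abstraction the
// benchmark checks the compiler can optimize away.
struct Wrapped {
    double v;
    Wrapped(double d = 0.0) : v(d) {}
};
inline Wrapped operator+(Wrapped a, Wrapped b) { return Wrapped(a.v + b.v); }

// Direct version: plain doubles.
double sum_direct(const double* p, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s = s + p[i];
    return s;
}

// Abstracted version: the same computation through the wrapper.
double sum_wrapped(const Wrapped* p, std::size_t n) {
    Wrapped s(0.0);
    for (std::size_t i = 0; i < n; ++i) s = s + p[i];
    return s.v;
}

// Penalty = time(sum_wrapped) / time(sum_direct); a ratio of 1.0
// means the abstraction was optimized away completely.
```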

/Peter
 
P

peter koch

I reply to myself just to tell you that I am not going to investigate
these issues any further.
I ran the test using doubles instead of ints and the results are very
similar, with the Microsoft compiler being something like 3% faster.

However the Stepanov Abstraction test favours the intel compiler by a
large margin.
Abstraction penalty with Intel:
0.85
0.68 with sse2

With Microsoft:
1.11

A curiosity: how is it possible to get an abstraction penalty
below 1?

Perhaps because you had a bad test? Rerun the benchmarks more than one
time and remember that caching has a huge effect on results (I believe
a factor of ten is quite normal). So you should know how to e.g. clear
(or fill) the cache as appropriate.
Writing a good benchmark is not easy.

/Peter
 
S

StephQ

Perhaps because you had a bad test? Rerun the benchmarks more than one
time and remember that caching has a huge effect on results (I believe
a factor of ten is quite normal). So you should know how to e.g. clear
(or fill) the cache as appropriate.
Writing a good benchmark is not easy.

/Peter

I'm quite a newbie....
Do you suggest that the initial run is the "right" one, while
subsequent runs get distorted by caching, or the opposite?
By caching you mean that the objects of interest are loaded into the
L1/L2 cache, right?
But I obtained these results in a stable way across different runs....

I remember that caching influences the results of subsequent runs of
benchmarks, but I don't understand why. Isn't cache/memory freed after
the software exits?

Anyway, I increased the number of calculations in the test because it
was taking too little time to run.

StephQ
 
R

Roland Pibinger

I'm quite a newbie....
Do you suggest that the initial run is the "right" one, while
subsequent runs get distorted by caching, or the opposite?
By caching you mean that the objects of interest are loaded into the
L1/L2 cache, right?
But I obtained these results in a stable way across different runs....

It may be some 'position effect'. Divide the test into functions (one
for each test) and call the functions several (many) times in
randomized order. Include a 'warm-up' at the beginning.
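A minimal sketch of that setup (the function and variable names are my own; the counters just make the sketch checkable, and the real timing code would go inside the test functions):

```cpp
#include <algorithm> // std::swap
#include <cstddef>
#include <cstdlib>   // std::rand, std::srand
#include <ctime>     // std::time
#include <vector>

typedef void (*TestFn)();

// Call counters stand in for the actual measurements here.
int calls_vector = 0;
int calls_array = 0;

void test_vector() { ++calls_vector; /* time the std::vector loop here */ }
void test_array()  { ++calls_array;  /* time the raw array loop here */ }

// Warm up once, then run every test 'rounds' times, reshuffling the
// order each round so no test consistently benefits from what its
// predecessor left in the cache.
void run_benchmarks(int rounds) {
    std::vector<TestFn> tests;
    tests.push_back(&test_vector);
    tests.push_back(&test_array);

    // Warm-up pass: run everything once, discarding the results.
    for (std::size_t i = 0; i < tests.size(); ++i) tests[i]();

    std::srand(static_cast<unsigned>(std::time(0)));
    for (int r = 0; r < rounds; ++r) {
        // Fisher-Yates shuffle via std::rand, to stay in 1998-era C++.
        for (std::size_t i = tests.size(); i > 1; --i)
            std::swap(tests[i - 1], tests[std::rand() % i]);
        for (std::size_t i = 0; i < tests.size(); ++i) tests[i]();
    }
}
```

The medians (or minima) of the per-round timings are then a far more stable statistic than a single run.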
 
P

peter koch

I'm quite a newbie....
Do you suggest that the initial run is the "right" one, while
subsequent runs get distorted by caching, or the opposite?
By caching you mean that the objects of interest are loaded into the
L1/L2 cache, right?

Yes.

But I obtained these results in a stable way across different runs....

I remember that caching influences the results of subsequent runs of
benchmarks, but I don't understand why. Isn't cache/memory freed after
the software exits?

No. Caching takes place at the hardware level, so no freeing takes
place, just as freeing memory does not remove physical memory.

Anyway, I increased the number of calculations in the test because it
was taking too little time to run.

Right. But try to follow Roland Pibinger's advice and see if that
explains anything.

/Peter
 
M

Markus Schoder

peter said:
No. Caching takes place at the hardware level, so no freeing takes
place, just as freeing memory does not remove physical memory.

Every sane operating system will clear the memory handed out to a new
process; otherwise you could accidentally read what another process,
possibly run by another user, stored in memory.

I know that Linux does this and I am pretty sure that Windows does it
too nowadays. So memory caching between program runs should never
occur.
 
