Problem with STL vector performance, benchmarks included

S

StephQ

I found that old post:
http://groups.google.com/group/comp...&q=vector+no+surprise&rnum=2#15519204726d01e8

I just erased the #include <kubux.....> lines.

****** old post for your convenience ********
You are right:

#include <vector>
#include <iostream>
#include <algorithm> // for std::fill_n
#include <ctime>
#include <memory>

#include <kubux/bits/allocator.cc>
#include <kubux/bits/new_delete_allocator.cc>
#include <kubux/bits/malloc_free_allocator.cc>

template < typename T, typename Alloc = std::allocator<T> >
class stupid {
public:

typedef Alloc allocator;
typedef typename allocator::value_type value_type;
typedef typename allocator::size_type size_type;
typedef typename allocator::difference_type difference_type;
typedef typename allocator::pointer pointer;
typedef typename allocator::const_pointer const_pointer;
typedef typename allocator::reference reference;
typedef typename allocator::const_reference const_reference;

typedef pointer iterator;
typedef const_pointer const_iterator;
typedef std::reverse_iterator< iterator > reverse_iterator;
typedef std::reverse_iterator< const_iterator > const_reverse_iterator;

private:

pointer ptr;
size_type the_size;

public:

// Allocate raw memory through the allocator: the destructor below
// deallocates with the allocator, so pairing it with new[] here
// would be undefined behaviour.
stupid ( size_type length ) :
ptr ( allocator().allocate( length ) ),
the_size ( length )
{
for ( iterator iter = this->ptr;
iter != this->ptr + the_size;
++ iter ) {
::new( static_cast<void*>(iter) ) T();
}
}

~stupid ( void ) {
iterator iter = ptr + the_size;
while ( iter > ptr ) {
-- iter;
iter->~T();
}
{
allocator alloc;
alloc.deallocate( ptr, the_size );
}
the_size = 0;
}

reference operator[] ( size_type index ) {
return( this->ptr[ index ] );
}

const_reference operator[] ( size_type index ) const {
return( this->ptr[ index ] );
}

}; // stupid

int main ( void ) {
const unsigned long l = 50000000;
{
std::vector< int > v ( l );
std::clock_t loop_start = std::clock();
for ( unsigned long i = 0; i < l; ++i ) {
v[ i ] = 5;
}
std::clock_t loop_end = std::clock();
std::cout << "vector: " << loop_end - loop_start << std::endl;
}
{
int* v = new int [ l ];
std::fill_n(v, l, 0);
std::clock_t loop_start = std::clock();
for ( unsigned long i = 0; i < l; ++i ) {
v[ i ] = 5;
}
std::clock_t loop_end = std::clock();
std::cout << "array: " << loop_end - loop_start << std::endl;
delete [] v;
}
{
stupid< int, std::allocator<int> > v ( l );
std::clock_t loop_start = std::clock();
for ( unsigned long i = 0; i < l; ++i ) {
v[ i ] = 5;
}
std::clock_t loop_end = std::clock();
std::cout << "stupid: " << loop_end - loop_start << std::endl;
}
{
std::vector<int> v ( l );
std::clock_t loop_start = std::clock();
for ( std::vector<int>::iterator i = v.begin();
i != v.end(); ++i ) {
*i = 5;
}
std::clock_t loop_end = std::clock();
std::cout << "iterator: " << loop_end - loop_start << std::endl;
}
{
int* v = new int [ l ];
std::fill_n(v, l, 0);
std::clock_t loop_start = std::clock();
for ( int* i = v; i < v+l; ++i ) {
*i = 5;
}
std::clock_t loop_end = std::clock();
std::cout << "ptr: " << loop_end - loop_start << std::endl;
delete [] v;
}

}

vector: 320000
array: 320000
stupid: 350000
iterator: 340000
ptr: 340000

No surprises anymore.

Thanks

Kai-Uwe Bux
***************************************************

I ran the reported test on Visual Studio Professional 2005 with its
standard STL implementation, which should be supplied by Dinkumware.
My CPU is a dual-core T2500 with 2 GB of DDR2.

I tried both the Intel 9.1 compiler and the Microsoft one.
In both cases I used the O3 optimizations, release mode, and with the
Intel one I also tried the /Qansi_alias /Qipo options.

Results:

Microsoft:
vector: 141
array: 94
stupid: 93
ptr: 172
ptr: 78

Intel:
vector: 312
array: 156 // becomes 45 if I require P4 extensions; the other values
remain nearly the same
stupid: 157
ptr: 1047
ptr: 156

I admit I'm quite disappointed with the results obtained with the
Intel compiler.
Is there any fault in the way the test was conducted or with the
source code I posted?
If everything is correct, how could I investigate where the problem
is?

Cheers
StephQ
 
R

Roland Pibinger

I ran the reported test on Visual Studio Professional 2005 with its
standard STL implementation, which should be supplied by Dinkumware.
My CPU is a dual-core T2500 with 2 GB of DDR2.
I tried both the Intel 9.1 compiler and the Microsoft one.
In both cases I used the O3 optimizations, release mode, and with the
Intel one I also tried the /Qansi_alias /Qipo options.

Have you turned off checked iterators? (see:
http://www.codeproject.com/vcpp/stl/checkediterators.asp)
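For later readers: on VC8 the checked iterators described in that article are controlled by the `_SECURE_SCL` macro. A minimal sketch of how they are disabled (the macro is specific to the Microsoft standard library; other compilers simply ignore it):

```cpp
// Disable VC8's checked iterators. The macro must be defined before
// the first standard library header is included (or passed on the
// command line as /D_SECURE_SCL=0), otherwise it has no effect.
#define _SECURE_SCL 0
#include <vector>
```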
 
S

StephQ

Have you turned off checked iterators? (see: http://www.codeproject.com/vcpp/stl/checkediterators.asp)

Thank you for the very useful suggestion. I didn't know that checked
iterators were turned on by default in VC8, even in release mode.

The new results (with checked iterators turned off) are:

Microsoft:
vector: 94
array: 94
stupid: 94
ptr: 141
ptr: 96

Intel:
vector: 141
array: 141 // 62 if I enable SSE2
stupid: 141 // 62 if I enable SSE2 and disable exception handling
ptr: 141
ptr: 140

The situation is now much better.
However, it seems that the Microsoft compiler is still doing 35% better
in all the situations except the "vector iterator" one.

Do you have any other suggestion to try?
I know nothing of low-level instructions, but if I post the
assembler-like code here, would it be of any help to you?

Thank you

Cheers
StephQ
 
S

StephQ

Thank you for the very useful suggestion. I didn't know that checked
iterators were turned on by default in VC8, even in release mode.

The new results (with checked iterators turned off) are:

Microsoft:
vector: 94
array: 94
stupid: 94
ptr: 141
ptr: 96

Intel:
vector: 141
array: 141 // 62 if I enable SSE2
stupid: 141 // 62 if I enable SSE2 and disable exception handling
ptr: 141
ptr: 140

The situation is now much better.
However, it seems that the Microsoft compiler is still doing 35% better
in all the situations except the "vector iterator" one.

Do you have any other suggestion to try?
I know nothing of low-level instructions, but if I post the
assembler-like code here, would it be of any help to you?

Thank you

Cheers
StephQ

I reply to myself just to tell you that I am not going to investigate
these issues any further.
I ran the test using doubles instead of ints and the results are very
similar, with the Microsoft compiler being something like 3% faster.

However the Stepanov Abstraction test favours the intel compiler by a
large margin.
Abstraction penalty with Intel:
0.85
0.68 with sse2

With Microsoft:
1.11

A curiosity: how is it possible to get an abstraction penalty
below 1?

Cheers
StephQ
 
P

peter koch

Could you clue me in on what a "Stepanov Abstraction" test is?
Socks

The Stepanov abstraction penalty is a benchmark made by Alexander
Stepanov, the man behind the STL. Google if you want to know more.
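In brief, the benchmark times the same numeric loop written directly on built-in types and again through layers of abstraction (wrapper classes, iterators), and reports the ratio of the two times. A rough sketch of the idea (my own simplification, not Stepanov's actual code):

```cpp
#include <cstddef>

// A thin wrapper around double -- the kind of abstraction the
// benchmark checks the compiler can optimize away.
struct Wrapped {
    double v;
    Wrapped(double d = 0.0) : v(d) {}
};
inline Wrapped operator+(Wrapped a, Wrapped b) { return Wrapped(a.v + b.v); }

// Direct version: plain doubles.
double sum_direct(const double* p, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s = s + p[i];
    return s;
}

// Abstracted version: the same computation through the wrapper.
double sum_wrapped(const Wrapped* p, std::size_t n) {
    Wrapped s(0.0);
    for (std::size_t i = 0; i < n; ++i) s = s + p[i];
    return s.v;
}

// Penalty = time(sum_wrapped) / time(sum_direct); a ratio of 1.0
// means the abstraction was optimized away completely.
```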

/Peter
 
P

peter koch

I reply to myself just to tell you that I am not going to investigate
these issues any further.
I ran the test using doubles instead of ints and the results are very
similar, with the Microsoft compiler being something like 3% faster.

However the Stepanov Abstraction test favours the intel compiler by a
large margin.
Abstraction penalty with Intel:
0.85
0.68 with sse2

With Microsoft:
1.11

A curiosity: how is it possible to get an abstraction penalty
below 1?

Perhaps because you had a bad test? Rerun the benchmarks more than one
time and remember that caching has a huge effect on results (I believe
a factor of ten is quite normal). So you should know how to e.g. clear
(or fill) the cache as appropriate.
Writing a good benchmark is not easy.

/Peter
 
S

StephQ

Perhaps because you had a bad test? Rerun the benchmarks more than one
time and remember that caching has a huge effect on results (I believe
a factor of ten is quite normal). So you should know how to e.g. clear
(or fill) the cache as appropriate.
Writing a good benchmark is not easy.

/Peter

I'm quite a newbie....
Do you suggest that the initial run is the "right" one, while
subsequent runs get distorted by caching, or the opposite?
By caching you mean that the objects of interest are loaded into the
L1/L2 cache, right?
But I obtained these results in a stable way across different runs....

I remember that caching influences the results of subsequent runs of
benchmarks, but I don't understand why. Isn't cache/memory freed after
the software exits?

Anyway, I increased the number of calculations in the test because it
was taking too little time to run.

StephQ
 
R

Roland Pibinger

I'm quite a newbie....
Do you suggest that the initial run is the "right" one, while
subsequent runs get distorted by caching, or the opposite?
By caching you mean that the objects of interest are loaded into the
L1/L2 cache, right?
But I obtained these results in a stable way across different runs....

It may be some 'position effect'. Divide the test into functions (one
for each test) and call the functions several (many) times in
randomized order. Include a 'warm-up' at the beginning.
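A minimal sketch of that setup (the function and variable names are my own; the counters just make the sketch checkable, and the real timing code would go inside the test functions):

```cpp
#include <algorithm> // std::swap
#include <cstddef>
#include <cstdlib>   // std::rand, std::srand
#include <ctime>     // std::time
#include <vector>

typedef void (*TestFn)();

// Call counters stand in for the actual measurements here.
int calls_vector = 0;
int calls_array = 0;

void test_vector() { ++calls_vector; /* time the std::vector loop here */ }
void test_array()  { ++calls_array;  /* time the raw array loop here */ }

// Warm up once, then run every test 'rounds' times, reshuffling the
// order each round so no test consistently benefits from what its
// predecessor left in the cache.
void run_benchmarks(int rounds) {
    std::vector<TestFn> tests;
    tests.push_back(&test_vector);
    tests.push_back(&test_array);

    // Warm-up pass: run everything once, discarding the results.
    for (std::size_t i = 0; i < tests.size(); ++i) tests[i]();

    std::srand(static_cast<unsigned>(std::time(0)));
    for (int r = 0; r < rounds; ++r) {
        // Fisher-Yates shuffle via std::rand, to stay in 1998-era C++.
        for (std::size_t i = tests.size(); i > 1; --i)
            std::swap(tests[i - 1], tests[std::rand() % i]);
        for (std::size_t i = 0; i < tests.size(); ++i) tests[i]();
    }
}
```

The medians (or minima) of the per-round timings are then a far more stable statistic than a single run.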
 
P

peter koch

I'm quite a newbie....
Do you suggest that the initial run is the "right" one, while
subsequent runs get distorted by caching, or the opposite?
By caching you mean that the objects of interest are loaded into the
L1/L2 cache, right?

Yes.

But I obtained these results in a stable way across different runs....

I remember that caching influences the results of subsequent runs of
benchmarks, but I don't understand why. Isn't cache/memory freed after
the software exits?

No. Caching takes place at the hardware level, so no freeing takes
place, just as freeing memory does not remove physical memory.

Anyway, I increased the number of calculations in the test because it
was taking too little time to run.

Right. But try to follow Roland Pibinger's advice and see if that
explains anything.

/Peter
 
M

Markus Schoder

peter said:
No. Caching takes place at the hardware level, so no freeing takes
place, just as freeing memory does not remove physical memory.

Every sane operating system will clear the memory handed out to a new
process; otherwise you could accidentally read what another process,
possibly run by another user, stored in memory.

I know that Linux does this and I am pretty sure that Windows does it
too nowadays. So memory caching between program runs should never
occur.
 
