object on stack/heap performance problems

Obnoxious User

Hi!

I was developing some number-crunching algorithms for my university,
and I put the processor into a class.
While testing, I found quite a *severe performance problem* when the
object was created on the stack.

I uploaded a test archive here: http://digitus.itk.ppke.hu/~oroba/stack_test.zip

Inside you'll find the number cruncher class (CNN in cnn.h and
cnn.cpp), as well as two test files: test_slow.cpp and test_fast.cpp.
They differ ONLY in where the processor object is created. In one, it
is created on the stack, in the other, it is created on the heap. Yet,
when I call the member function process(), the performance difference
is 5x!!!

Can someone with deeper knowledge of object layout and the like
tell me why this is happening?

Weird. Compiled with g++ on my system yields a difference of almost 20x.
 
orobalage

Hi!

I was developing some number-crunching algorithms for my university,
and I put the processor into a class.
While testing, I found quite a *severe performance problem* when the
object was created on the stack.

I uploaded a test archive here: http://digitus.itk.ppke.hu/~oroba/stack_test.zip

Inside you'll find the number cruncher class (CNN in cnn.h and
cnn.cpp), as well as two test files: test_slow.cpp and test_fast.cpp.
They differ ONLY in where the processor object is created. In one, it
is created on the stack, in the other, it is created on the heap. Yet,
when I call the member function process(), the performance difference
is 5x!!!

Can someone with deeper knowledge of object layout and the like
tell me why this is happening?
 
benben

Well, it seems that I cannot reproduce what you just described:

benben@watersidem $ g++ main_fast.cpp cnn.cpp -O2 -o fast
benben@watersidem $ g++ main_slow.cpp cnn.cpp -O2 -o slow
benben@watersidem $ ./fast
520000
benben@watersidem $ ./slow
520000

Theoretically there shouldn't be any difference between performance of
operations on an object on the stack and the same operation on an object
on the heap. At least on my machine I cannot reproduce such difference.

In a highly unlikely event the stack memory may be swapped out before
the process() call, so that it has to be swapped back in. But this
is unlikely judging by the straightforward manner of your program, and
swapping is just as likely to happen to heap memory anyway...

Regards,
benben
 
Erik Wikström

Results when I compile/run your code with Visual C++ Codename Orcas
Express Beta1 (Visual C++ 2008)

Debug:
heap: 12868
stack: 13118
Release:
heap: 38666
stack: 4383

That's a difference of about 8.8 times faster when using the stack. I
have not used any profilers or such, but there is some stuff in your
code that I find highly dubious, especially the allocation for the
RowMatrix. From what I can understand of the code you do some "magic" to
make sure the data is aligned properly, but does it work? Are you sure
your computer (or the one it will run on) really works best with 32-byte
boundaries? This also makes your code totally unportable; I had to change
data = (float*) ((((long)(real_data))+31L) & (-32L));
to
data = (float*) ((((long long)(real_data))+31L) & (-32L));
before my compiler would let it through, and I'm still not sure what you
are trying to achieve with it.
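
For what it's worth, the cast through long is only safe where long is as wide as a pointer, which is not the case on 64-bit Windows. Below is a minimal sketch of the same round-up-to-32 trick using std::uintptr_t, assuming a C++11 compiler. The names align_up, AlignedFloats and allocate_aligned are hypothetical and not part of cnn.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper (not from cnn.cpp): round a pointer up to the next
// multiple of `alignment`, which must be a power of two.
inline float* align_up(void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    // std::uintptr_t is an unsigned integer type able to hold a pointer
    // value, whereas long may be narrower than a pointer.
    const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    return reinterpret_cast<float*>((addr + mask) & ~mask);
}

// Hypothetical pair mirroring real_data / data from the post.
struct AlignedFloats {
    void* real_data;  // what malloc returned; this is what goes to free()
    float* data;      // first 32-byte-aligned address inside the buffer
};

// Over-allocate by 31 bytes so an aligned start always fits.
inline AlignedFloats allocate_aligned(std::size_t count) {
    AlignedFloats a;
    a.real_data = std::malloc(count * sizeof(float) + 31);
    a.data = a.real_data ? align_up(a.real_data, 32) : nullptr;
    return a;
}
```

The caller still has to release the buffer with std::free(a.real_data), never with a.data, since only real_data is the address malloc handed out.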

Another thing that strikes me is that you use malloc, and while I'm no
expert I think this will cause your program to use two heaps, one for
new'ed memory and one for malloc'ed, this might slow things down.

I'm not sure what your number-crunching algorithm is supposed to do, so
I can't give you any better advice than to try to make the RowMatrix
simpler and try again.
 
Kai-Uwe Bux

Obnoxious said:
Weird. Compiled with g++ on my system yields a difference of almost 20x.

I am using g++, too; but I cannot confirm your observation. I do:

stack_test> ls
Makefile cnn.h main_fast.cpp main_slow.cpp test_fast.exe
cnn.cpp cnn.o main_fast.o main_slow.o test_slow.exe
stack_test> make
g++ -Wall -O3 cnn.cpp -c
g++ -Wall -O3 main_slow.cpp -c
g++ -s -o test_slow.exe cnn.o main_slow.o
g++ -Wall -O3 main_fast.cpp -c
g++ -s -o test_fast.exe cnn.o main_fast.o
stack_test> time test_slow.exe
640000

real 0m0.713s
user 0m0.648s
sys 0m0.012s
stack_test> time test_fast.exe
640000

real 0m0.705s
user 0m0.644s
sys 0m0.008s
stack_test>


Best

Kai-Uwe Bux
 
Obnoxious User

~/stack_test$ make
g++ -Wall -O3 cnn.cpp -c
g++ -Wall -O3 main_slow.cpp -c
g++ -s -o test_slow.exe cnn.o main_slow.o
g++ -Wall -O3 main_fast.cpp -c
g++ -s -o test_fast.exe cnn.o main_fast.o
~/stack_test$ time ./test_slow.exe
8200000

real 0m8.209s
user 0m8.205s
sys 0m0.000s
~/stack_test$ time ./test_fast.exe
310000

real 0m0.315s
user 0m0.316s
sys 0m0.000s
~/stack_test$

However, the results for test_slow.exe vary somewhat,
between 4160000 and 8640000, mostly towards the upper end,
while test_fast.exe produces stable results.
 
orobalage

Thanks for your comments Erik.

I wrote that code almost 2 years ago. Actually, you don't need to be
sure about what the algorithm does :). The point is simply that the
exact same code performs very differently depending on whether the
object is on the stack or on the heap. What makes it more complicated
is that for some people the heap is faster, for others the stack, and
for others it makes no difference at all.

I want to have an answer to why this happens.
 
orobalage

And by the way: if I change the code to use new instead of malloc, and
I remove all the "magic" from the RowMatrix code, the stack version is
still slower on my computer. You can do that too, just replace
(cnn.cpp around line 40)

real_data = (float*) malloc( size * sizeof(float) + 31L );
data = (float*) ((((long)(real_data))+31L) & (-32L));

with

real_data = new float[size];
data = real_data;

and replace (cnn.cpp around line 20)

free( real_data );

with

delete[] real_data;


In the meantime, I did some profiling with gprof, and it shows, that
for me, the stack version spends a LOT more time in
CNN::NonLinearity().

Still I'm puzzled, perhaps this has something to do with how the
member variables come one after another in the class definition?
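
A side note on the malloc-to-new swap above: neither malloc nor new float[size] gives the elements a defined value, so the swap changes the allocator but not what the buffer initially contains. Here is a small sketch of an allocation that does start from zeros. The helper new_zeroed is hypothetical, not something in cnn.cpp.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: allocate `size` floats that all start at 0.0f.
// The trailing () value-initialises the array; without it the
// elements would have indeterminate values.
inline float* new_zeroed(std::size_t size) {
    assert(size > 0);
    return new float[size]();
}
```

The result is released with delete[] as before. A std::vector<float>(size) would zero its elements as well and free them automatically.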
 
Martijn van Buul

* (e-mail address removed):
I wrote that code almost 2 years ago. Actually, you don't need to be
sure about what the algorithm does :)

I tend to disagree. It wouldn't surprise me if what you're seeing is the
result of uninitialised data. Most importantly, CNN::lower_limit,
CNN::upper_limit, CNN::Z and a few others never seem to be initialised,
but I could be wrong. Have you tried checking the outcome (whatever
that may be) of Process() in both cases?

Most of your calculations are done on the .data members of the various
rows. Since these are always malloc-ed (in a nasty premature-optimisation
way), the location of the instance of the CNN class has little influence.

Uninitialised data (along with subtle out-of-bounds errors) would also
explain why some people don't seem to be having this "problem", while
others do.
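
To illustrate the kind of fix this points at, here is a minimal sketch of a class whose members all get a defined value in the constructor, so that it behaves the same whether it lives on the stack or on the heap. Processor is a hypothetical stand-in for CNN. The member names echo the ones listed above, but their types, the default values and the saturate() function are assumptions, not the real code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the CNN class. Types and defaults are assumed.
class Processor {
public:
    explicit Processor(std::size_t size)
        : lower_limit(-1.0f),   // every scalar member gets a defined value
          upper_limit(1.0f),
          Z(0.0f),
          state(size, 0.0f)     // the vector fills its elements with 0.0f
    {
        assert(lower_limit <= upper_limit);
    }

    // Assumed shape of a saturating non-linearity: clamp x to the limits.
    // With uninitialised limits, this branch pattern would depend on
    // whatever happened to be in memory before.
    float saturate(float x) const {
        if (x < lower_limit) return lower_limit;
        if (x > upper_limit) return upper_limit;
        return x;
    }

    float z() const { return Z; }
    std::size_t size() const { return state.size(); }

private:
    float lower_limit;
    float upper_limit;
    float Z;
    std::vector<float> state;
};
```

With this, Processor p(8); and new Processor(8) start from identical state, so any remaining difference between the two test programs would have to come from somewhere else.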

(fwiw:

atlas(1):~/stacktest/stack_test% time ./test_fast.exe
56
./test_fast.exe 0.57s user 0.01s system 97% cpu 0.600 total
atlas(1):~/stacktest/stack_test% time ./test_slow.exe
358
./test_slow.exe 3.59s user 0.02s system 98% cpu 3.687 total
atlas(1):~/stacktest/stack_test% gcc --version
gcc (GCC) 4.1.2 20070110 prerelease (NetBSD nb1 20070603)
[..]
atlas(1):~/stacktest/stack_test% uname -a
NetBSD atlas 4.99.20 NetBSD 4.99.20 (ATLAS) #0: Sat Jun 9 02:53:14 CEST 2007 martijnb@atlas:/usr/obj/sys/arch/amd64/compile/ATLAS amd64
)
 
Martijn van Buul

* Obnoxious User:
However, the results for test_slow.exe vary somewhat,
between 4160000 and 8640000, mostly towards the upper end,
while test_fast.exe produces stable results.

A tell-tale sign of uninitialised data. The "result" these test programs
print is the CPU time used for the calculation, measured with clock(), which
*should* be the actual CPU time used by this very process. While other activity
on the system will skew the results a little, a variation in runtime like this
in an algorithm that takes no user input, does no I/O, doesn't do wild
memory allocation and always uses the same arguments clearly indicates that
it's *not* doing the same job on every invocation.
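
The reasoning above can be sketched as a small check: time a deterministic workload with clock() and compare both the computed result and the CPU time between runs. The names workload(), Run, timed_run() and seconds() are hypothetical, made up for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <ctime>
#include <vector>

// Hypothetical deterministic workload standing in for process():
// fixed input, no I/O, every value it reads has been written first.
inline double workload(std::size_t n) {
    std::vector<double> v(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) v[i] = 0.5 * static_cast<double>(i);
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) sum += v[i];
    return sum;
}

struct Run {
    double result;       // what the workload computed
    std::clock_t ticks;  // CPU time consumed, in clock() units
};

inline Run timed_run(std::size_t n) {
    const std::clock_t start = std::clock();
    // clock() returns (clock_t)(-1) when CPU time is unavailable.
    assert(start != static_cast<std::clock_t>(-1));
    Run r;
    r.result = workload(n);
    r.ticks = std::clock() - start;
    return r;
}

// Convert clock() units to seconds.
inline double seconds(std::clock_t ticks) {
    return static_cast<double>(ticks) / CLOCKS_PER_SEC;
}
```

If two calls to timed_run() with the same n give different result values, or CPU times that differ by a large factor, the code is not doing the same work each time, which is the symptom described above.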
 
Kai-Uwe Bux

Martijn said:
* (e-mail address removed):

I tend to disagree. It wouldn't surprise me if what you're seeing is the
result of uninitialised data. Most importantly, CNN::lower_limit,
CNN::upper_limit, CNN::Z and a few others never seem to be initialised,
but I could be wrong. Have you tried checking the outcome (whatever
that may be) of Process() in both cases?

Most of your calculations are done on the .data members of the various
rows. Since these are always malloc-ed (in a nasty
premature-optimisation way), the location of the instance of the CNN
class has little influence.

Uninitialised data (along with subtle out-of-bounds errors) would also
explain why some people don't seem to be having this "problem", while
others do.

That is an interesting idea. This is what valgrind has to say:

==15156== by 0x414C825: (below main) (in /lib/libc-2.5.so)
==15156==
==15156== Conditional jump or move depends on uninitialised value(s)
==15156== at 0x8049328: (within /home/bux/bux/todo/towrite/c++/experiments/news_group/cnn/stack_test/test_fast.exe)
==15156== by 0x8049AD6: (within /home/bux/bux/todo/towrite/c++/experiments/news_group/cnn/stack_test/test_fast.exe)
==15156== by 0x414C825: (below main) (in /lib/libc-2.5.so)
==15156==
==15156== Conditional jump or move depends on uninitialised value(s)
==15156== at 0x80491EC: (within /home/bux/bux/todo/towrite/c++/experiments/news_group/cnn/stack_test/test_fast.exe)
==15156== by 0x8049AD6: (within /home/bux/bux/todo/towrite/c++/experiments/news_group/cnn/stack_test/test_fast.exe)
==15156== by 0x414C825: (below main) (in /lib/libc-2.5.so)
==15156==
==15156== Conditional jump or move depends on uninitialised value(s)

And it goes on like this forever. The output for test_slow.exe is similar.



Best

Kai-Uwe Bux
 
orobalage

Thanks Martijn!

Indeed, it was uninitialized data!
The problem seems to be solved, and I have another big lesson in my bag,
thanks for it! :)

What made it somewhat obscure for me is that rearranging the
members sometimes made it fast too.

Anyway, we all make mistakes, and I made a huge one; I hope others will
learn from this too.

Good day to you, and thanks again everyone!
 
Martijn van Buul

* (e-mail address removed):
Thanks Martijn!

Indeed, it was uninitialized data!
Problem is solved it seems, and another big experience in my bag,
thanks for it! :)

Glad I could help.
 
jmoy

* (e-mail address removed):


I tend to disagree. It wouldn't surprise me if what you're seeing is the
result of uninitialised data.

I wrote some functions to compare the structures member-by-member,
comparing the data arrays element by element.

The answers come out different when Process() is called with the same
initial data on a CNN allocated on the stack and a CNN allocated on
the heap. Also, the differences are in different places and the values
from each method itself are different on different runs.

So I would tend to agree that it is uninitialized data and/or maybe
some other bug which is the culprit.

Jyotirmoy
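
The comparison functions mentioned above were not posted. Here is a rough sketch of what an element-by-element comparison of two data arrays might look like, with diff_indices as a hypothetical name. It compares bit patterns rather than values, which is an assumption made so that garbage such as NaN is still detected.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: return every index at which two float arrays of
// length n differ. Comparing the raw bytes with memcmp means two NaNs
// with the same bit pattern count as equal, and +0.0f and -0.0f count
// as different.
inline std::vector<std::size_t> diff_indices(const float* a,
                                             const float* b,
                                             std::size_t n) {
    assert(n == 0 || (a != nullptr && b != nullptr));
    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < n; ++i) {
        if (std::memcmp(&a[i], &b[i], sizeof(float)) != 0) {
            out.push_back(i);
        }
    }
    return out;
}
```

Running this on the data arrays of a stack-allocated and a heap-allocated object after the same call, and again across separate runs, would show whether the mismatches stay in the same places or move around.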
 
