Problems with performance


Leandro

I'm writing an FDTD code (an electromagnetic simulation) and I'm having some trouble with the performance of the code.

We have two versions. The first one runs just the calculation (version1). The second one is the whole application (a GUI version, using wxWidgets) with the same calculation routine (version2). The problem is that version2 runs almost twice as slowly as version1, and I can't understand why.

The FDTD calculation is a lot of loops (one loop in time and three in the x, y and z directions). So I removed all of them but one and tried "Very Sleepy" to profile the code. It shows me that the exact same piece of code runs at very different speeds in the two versions, reproduced below (the variables accessed with [] are of type TNT::Array3D - http://math.nist.gov/tnt/overview.html)

Here are the results, compiled with g++:

Version1:

0.15s void FdtdSolver::CalculateDx()
{
int i, j, k;
double curl_h;
// Calculate the Dx field
for(i = 1; i < ia; i++)
{
0.01s for(j = 1; j < Ny; j++)
{
0.06s for(k = 1; k < Nz; k++)
{
curl_h = cay[j]*(Hz[i][j][k] - Hz[i][j-1][k]) -
0.38s caz[k]*(Hy[i][j][k] - Hy[i][j][k-1]);
0.10s idxl[i][j][k] = idxl[i][j][k] + curl_h;
Dx[i][j][k] = gj3[j]*gk3[k]*Dx[i][j][k] +
0.29s gj2[j]*gk2[k]*(curl_h + gi1*idxl[i][j][k]);
}
}
}

// Other loops with the same behavior...
}

Version2:

0.01s void FDTDEngine::CalculateDx()
{
int i, j, k;
double curl_h;
// Calculate the Dx field
for(i = 1; i < ia; i++)
{
0.00s for(j = 1; j < Ny; j++)
{
0.06s for(k = 1; k < Nz; k++)
{
0.01s curl_h = cay[j]*(Hz[i][j][k] - Hz[i][j-1][k]) -
0.53s caz[k]*(Hy[i][j][k] - Hy[i][j][k-1]);
0.10s idxl[i][j][k] = idxl[i][j][k] + curl_h;
0.02s Dx[i][j][k] = gj3[j]*gk3[k]*Dx[i][j][k] +
0.36s gj2[j]*gk2[k]*(curl_h + gi1*idxl[i][j][k]);
}
}
}

// Other loops with the same behavior...
}

The question is: What kind of thing can I do to solve this problem?

Thanks!

P.S.: Sorry for the language; I'm not a native speaker.
 

Öö Tiib

"We have two versions. The first one runs just the calculation (version1). The second one is the whole application (a GUI version, using wxWidgets) with the same calculation routine (version2). The problem is that version2 runs almost twice as slowly as version1, and I can't understand why."

Without seeing the full code base we can only speculate.

Things that are cut out of context tend to run a lot faster. That is
1) because of better memory locality (both for data and for code), and
2) because the compiler has fewer opportunities to waste precious resources (like registers) on improving less important things.

"The FDTD calculation is a lot of loops (one loop in time and three in the x, y and z directions). So I removed all of them but one and tried 'Very Sleepy' to profile the code."

"Very Sleepy" is often good enough, but there are plenty of other, somewhat more accurate and detailed profiling tools.

"The question is: What kind of thing can I do to solve this problem?"

What problem? The difference is normal. I have always observed similar differences.

As for the performance in general ... what have you tried to improve it?
 

Rui Maciel

Leandro said:
Here are the results, compiled with g++:
The question is: What kind of thing can I do to solve this problem?

Without access to the code it's hard, if not impossible, to tell. If it
isn't possible to access a fully working example which reproduces the
problem you're experiencing, then no one can really say anything about
what's going on in your code.


Rui Maciel
 

Leandro

I'll start from the end...

"What problem? The difference is normal. I have always observed similar differences. "

When I simulate a big problem using the console application, the console version runs in, for example, 3 hours. The GUI version runs in more than 5 hours. So, this may be normal, but it's a problem for me (or for my available time...).

"As for the performance in general ... what have you tried to improve it? "

Nothing yet. The performance is important only during the simulation. When opening a file, it doesn't matter if it takes 1 or 2 seconds. So, I'm focusing only on the simulation. Maybe I'm doing it wrong. What kind of performance improvement do you mean?

"Very Sleepy is often good enough. There are plenty of other bit more accurate and detailed profiling tools. "

Can you suggest other profiling tools? The project runs on Windows, and this is not commercial software, so it would be great if the tool were free.

"When not seeing full code base we can only speculate.

Things that are cut out of context tend to run lot faster. That is
1) because of better memory locality (both for data and for code)
2) because the compiler has less opportunities to waste precious resources
(like registers) to improve some less important things. "

Unfortunately I'm not allowed to show the code of the whole system. But your tips gave me some hope. Is there any software that allows me to check the two versions and see if the difference is due to better memory locality? Is it possible to rearrange things in the class declaration to get faster code? The simulation code is in a class that inherits from others, and there are a few relations to other classes. Can this fact slow down the main code? If I extract the simulation code to an external file/library, is it possible it will run faster? Do you have any good reference to suggest on this subject?

Later I'll try to isolate the GUI and run just the simulation code. Maybe the problem is with my class structure.

I know that without the whole code you guys can't discover the problem, but I'm not asking for that. In fact, I want some sort of tip just like yours: "this might be yyy", "read about zzz in the book/link www", "try tool kkk to check if there is zzz", and so on and so forth. In fact, I must learn how to debug this kind of "problem".

So, thanks for your answers and patience! I really appreciate it.

Best regards.
 

Stuart

On 25.02.13 11:35, Leandro wrote:
I'm writing an FDTD code (an electromagnetic simulation) and I'm having some trouble with the performance of the code.

We have two versions. The first one runs just the calculation (version1). The second one is the whole application (a GUI version, using wxWidgets) with the same calculation routine (version2). The problem is that version2 runs almost twice as slowly as version1, and I can't understand why.

The FDTD calculation is a lot of loops (one loop in time and three in the x, y and z directions). So I removed all of them but one and tried "Very Sleepy" to profile the code. It shows me that the exact same piece of code runs at very different speeds in the two versions, reproduced below (the variables accessed with [] are of type TNT::Array3D - http://math.nist.gov/tnt/overview.html)

Here are the results, compiled with g++:

Version1:

0.15s void FdtdSolver::CalculateDx()
{
int i, j, k;
double curl_h;
// Calculate the Dx field
for(i = 1; i < ia; i++)
{
0.01s for(j = 1; j < Ny; j++)
{
0.06s for(k = 1; k < Nz; k++)
{
curl_h = cay[j]*(Hz[i][j][k] - Hz[i][j-1][k]) -
0.38s caz[k]*(Hy[i][j][k] - Hy[i][j][k-1]);
0.10s idxl[i][j][k] = idxl[i][j][k] + curl_h;
Dx[i][j][k] = gj3[j]*gk3[k]*Dx[i][j][k] +
0.29s gj2[j]*gk2[k]*(curl_h + gi1*idxl[i][j][k]);
}
}
}

// Other loops with the same behavior...
}

Version2:

0.01s void FDTDEngine::CalculateDx()
{
int i, j, k;
double curl_h;
// Calculate the Dx field
for(i = 1; i < ia; i++)
{
0.00s for(j = 1; j < Ny; j++)
{
0.06s for(k = 1; k < Nz; k++)
{
0.01s curl_h = cay[j]*(Hz[i][j][k] - Hz[i][j-1][k]) -
0.53s caz[k]*(Hy[i][j][k] - Hy[i][j][k-1]);
0.10s idxl[i][j][k] = idxl[i][j][k] + curl_h;
0.02s Dx[i][j][k] = gj3[j]*gk3[k]*Dx[i][j][k] +
0.36s gj2[j]*gk2[k]*(curl_h + gi1*idxl[i][j][k]);
}
}
}

// Other loops with the same behavior...
}

The question is: What kind of thing can I do to solve this problem?


Just a guess:
Check whether both executables are built using the same compiler
settings. For example, if the GUI executable is built with DEBUG
#defined, your code may use a version of operator[] which does bounds
checking (only if you are using std::vector instead of plain arrays).

If nothing helps, you could put the relevant code into a library and
compare once again (this should rule out that you are profiling two
differently compiled versions of the same code).

Another guess: Your second version is running inside a GUI app, so it is
probably running on a worker thread. Maybe the system (either the
library or the OS) lowers the priority of background threads?

Regards,
Stuart
 

Leandro

Many thanks, Stuart!

I've already checked item 1 and it's OK. In fact, with DEBUG #defined, operator[] of TNT::Array3D slows down by a factor of 2 (bounds checking), i.e., the code ends up almost 4 times slower.

"Another guess: Your second version is running inside a GUI app, so it is probably running on a worker thread. Maybe the system (either the library or the OS) lowers the priority of background threads?"

In this case, when I start the simulation, I also start a wxProgressDialog, so the user can cancel the simulation. I tried this same idea in version1 (using a Frame and a wxProgressDialog), but the performance difference holds.

"If nothing helps, you could put the relevant code into a library and compare once again (this should rule out that you are profiling two differently compiled versions of the same code). "

I'll try this and profile the code again. Maybe it helps. Thanks a lot.

Best regards

On Monday, February 25, 2013 at 12:37:18 UTC-3, Stuart wrote:
 

Öö Tiib

I'll start from the end...

"What problem? The difference is normal. I have always observed similar
differences. "

When I simulate a big problem using the console application, the console
version runs in, for example, 3 hours. The GUI version runs in more than 5
hours. So, this can be normal, but it's a problem for me (or for my available
time...).

Measuring without changing does not help much.
"As for the performance in general ... what have you tried to improve it? "

What kind of performance improvement do you mean?

A change that leaves the code's behavior the same but makes it (hopefully)
faster, or that helps the compiler to optimize where it fails. For example,
you have code like this:

// stuff
// ...
for(j = 1; j < Ny; j++)
{
for(k = 1; k < Nz; k++)
{
curl_h = cay[j]*(Hz[j][k] - Hz[j-1][k])
- caz[k]*(Hy[j][k] - Hy[j][k-1]);
// ...
// rest of the stuff
}
}

Now you replace it (assuming the arrays contain doubles) with this:

// stuff
// ...
for(j = 1; j < Ny; j++)
{
double cay_j = cay[j];
double (&Hz_i_j)[Nz] = Hz[j];
double (&Hz_i_j_1)[Nz] = Hz[j-1];
double (&Hy_i_j)[Nz] = Hy[j];

for(k = 1; k < Nz; k++)
{
curl_h = cay_j*(Hz_i_j[k] - Hz_i_j_1[k])
- caz[k]*(Hy_i_j[k] - Hy_i_j[k-1]);
// ...
// rest of the stuff
}
}

That may speed it up, or it may not; reading the produced assembler or
testing can show.
"Very Sleepy is often good enough, but there are plenty of other, somewhat
more accurate and detailed profiling tools."

Can you suggest other profiling tools? The project runs in Windows,
and this is not a comercial software, so it would be great if the
tool were free.

MS compilers can do profile-guided optimizations themselves, from the
"Build/Profile Guided Optimization" menu.
Unfortunately I'm not allowed to show the code of the whole system.

It is not commercial, but the source can't be shown, so it is not free
either. What remains ... ? Criminal or military goals? Those can pay even
better than commercial work. Anyway, we do not care about your whole code.
Cut out the part that you want to optimize in a way that we can compile
and run.
I know that without the whole code you guys can't discover the problem,
but I'm not asking for this. In fact, I want some sort of tip just like
yours: "read about zzz in the book/link www".

The Internet is full of suggestions on how to "optimize C and C++ code". Some are
better, some worse, none perfect. You will get similar suggestions here.
 

Jorgen Grahn

I'm writing a FDTD code (electromagnetic simulation) and I'm having
some troubles with the performance of the code.

It's not my area, but all the loops and foo[m][n] look like linear
algebra to me. If it is, you might want to check out linear-algebra or
matrix libraries. They can (I'm told) make your code more readable /and/
make it run significantly faster.

/Jorgen
 

Alain Ketterlin

Leandro said:
We have two versions. The first one runs just the calculus (version1).
The second one is the whole application (GUI version, using
wxWidgets), with the same calculus routine (version2). The problem is
that version2 runs almost twice slower than version1, and I can't
understand why.

As far as I can see, the two codes are identical. If that is the case,
there is no reason why they should behave differently, except for
memory-related causes (like alignment, fragmentation, etc.). Make sure
you allocate your data under similar conditions (e.g., allocate everything
at program start).
The FDTD calculation is a lot of loops (one loop in time and three in
x, y and z direction). So I removed all of them but one and tried
"Very Sleepy" to profile the code. It shows me that the exact same piece
of code runs at very different speeds in the two versions, reproduced
below (the variables accessed with [] are of type TNT::Array3D -
http://math.nist.gov/tnt/overview.html)

I had a quick look at this, and it doesn't seem very efficient.
In particular, allocating a 3D array of doubles (I guess) allocates a 2D
array of pointers (to rows of doubles), which itself causes the
allocation of a 1D array of pointers (to rows of pointers). The only
reason for this is to support the [i][j][k] notation. All this will
fragment your memory, and may blow the cache (an access to [i][j][k]
requires accessing 3 arrays). If your code is that simple, use linear
arrays and do the index arithmetic yourself. You'll save a lot.

But there is no reason this changes from version 1 to version 2, except
if your array allocations are interleaved with other allocations.

-- Alain.
 

Seungbeom Kim

Change that leaves code same but makes it (hopefully) faster. Help the
compiler to optimize where it fails? For example you have such
code:

// stuff
// ...
for(j = 1; j < Ny; j++)
{
for(k = 1; k < Nz; k++)
{
curl_h = cay[j]*(Hz[j][k] - Hz[j-1][k])
- caz[k]*(Hy[j][k] - Hy[j][k-1]);
// ...
// rest of the stuff
}
}

Now you replace it ... (assuming the arrays contain doubles) with that:

// stuff
// ...
for(j = 1; j < Ny; j++)
{
double cay_j = cay[j];
double (&Hz_i_j)[Nz] = Hz[j];
double (&Hz_i_j_1)[Nz] = Hz[j-1];
double (&Hy_i_j)[Nz] = Hy[j];

for(k = 1; k < Nz; k++)
{
curl_h = cay_j*(Hz_i_j[k] - Hz_i_j_1[k])
- caz[k]*(Hy_i_j[k] - Hy_i_j[k-1]);
// ...
// rest of the stuff
}
}

That maybe speeds it maybe not. Reading produced assembler or testing can
show.


I'm just speculating, but considering that references just create aliases,
I doubt that using references like this will affect the generated code
in any significant way. (Creating a reference will not even trigger a
prefetch, will it?) And if it actually does, the compiler must not have
been doing a very good job even at merely recognizing common subexpressions.

On the other hand, value copying instead of just aliasing (as for cay_j
above) may have a better chance of improvement.
 

Öö Tiib

I'm just speculating, but considering that references just create aliases,
I doubt that using references like this will affect the generated code
in any significant way. (Creating a reference will not even trigger a
prefetch, will it?) And if it actually does, the compiler must not have
been doing a very good job even at merely recognizing common subexpressions.

I asked for real code and for what optimizations he has tried. He asked in
return what I mean by optimizations. The above was an example of what
optimizations may look like.
On the other hand, value copying instead of just aliasing (as for cay_j
above) may have a better chance of improvement.

Real compilers manage to do that also pretty well.
 

Leandro

(e-mail address removed):

"As others have mentioned, we're not seeing enough to really be able to
tell, but... "

OK, you're right. I tried to clean up the code of my class, removing all data members and leaving just what is necessary to do the calculation. As a result, both versions run at the same speed (the difference was negligible).

"Certainly things like (un)fortuitous memory alignment in one case or
the other case can make a big difference. You may want to check how
that's different between the codes. You shouldn't really be seeing
that much of a performance hit just from being in the GUI application,
I assume the system is mostly not doing anything other than running
your thread and the GUI is mostly just sitting there, so there should
not be too much extra cache contention, although it's something to
consider."

I don't think the problem is with the GUI. I also tried to run the console version with a frame, but the difference holds. Another thing I can do is insert the "good" code into the GUI version and check the performance.

Considering the 30-day trial, I'll download Intel VTune and look for a tutorial. Thanks.

----------------------------
Öö Tiib:

"Change that leaves code same but makes it (hopefully) faster. Help the
compiler to optimize where it fails? For example you have such
code:..."

I haven't tried this yet. First I'm trying to understand why the same code runs at completely different speeds in the two versions. But you're right. In fact, someone said that TNT is not efficient, so I'll check another library (maybe Armadillo or Blitz++, I don't know yet).

"It is not commercial, but the source can't be shown, so it is not free
either. What remains ... ? Criminal or military goals? Those can pay even
better than commercial work. Anyway, we do not care about your whole code.
Cut out the part that you want to optimize in a way that we can compile
and run."

This is an academic project, but there are other people who own the project. I cut out the part I want to optimize, but then both versions ran at the same speed. I'll try other things and, if that doesn't work, I'll cut out the part I want to optimize (leaving out most of the parts unnecessary for a test).

----------------------------
Alain Ketterlin

"I had a quick look at this, and it doesn't seem very efficient.
In particular, allocating a 3D array of doubles (I guess) allocates a 2D
array of pointers (to rows of doubles), which itself causes the
allocation of a 1D array of pointers (to rows of pointers)."

Thanks. You're right, I'll look for a better library.
 

Alain Ketterlin

Seungbeom Kim said:
On 2013-02-25 09:39, Öö Tiib wrote: [...]
for(j = 1; j < Ny; j++)
{
for(k = 1; k < Nz; k++)
{
curl_h = cay[j]*(Hz[j][k] - Hz[j-1][k])
- caz[k]*(Hy[j][k] - Hy[j][k-1]);
// ...
// rest of the stuff
}
}

Now you replace it ... (assuming the arrays contain doubles) with that: [...]
for(j = 1; j < Ny; j++)
{
double cay_j = cay[j];
double (&Hz_i_j)[Nz] = Hz[j];
double (&Hz_i_j_1)[Nz] = Hz[j-1];
double (&Hy_i_j)[Nz] = Hy[j];

for(k = 1; k < Nz; k++)
{
curl_h = cay_j*(Hz_i_j[k] - Hz_i_j_1[k])
- caz[k]*(Hy_i_j[k] - Hy_i_j[k-1]);
// ...
// rest of the stuff
}
}

I'm just speculating, but reflecting that references just create aliases,
I doubt that using references like this will affect the generated code
in any significant way.

It will, and a lot. In the fragment above you have more array references
than actual computations (like + - etc.). Consider a loop like:

for ( i=... )
for ( j=... )
... T.at(i,j) ...

(I use at(i,j) to abstract away the effective access). Now assume T is a
linearized array. This access is equivalent to:

T.data[i*N+j]

Calculating i*N on every iteration of the j loop costs something. What
Öö suggests is moving this computation out of the j loop.

Note that if the array is stored as a 1D array of pointers to 1D arrays
of doubles, the same argument applies. T[i][j] is:

*(*(T.rows+i)+j)

again, *(T.rows+i) could be hoisted out of the j loop (if that location
is not overwritten during the run of the loop -- unfortunately it is
difficult for a compiler to realize this, and that's where restrict is
useful).

There is a large amount of computation going on during array accesses,
and factoring as much of it as possible out of the loop is one of
the major optimization opportunities.

(To the OP: the library you use just makes array accesses much more
costly than they should be, and probably prevents this kind of
optimization.)
(Creating a reference will not even trigger a prefetch, will it?)
No.

And if it actually does, the compiler must not have been doing a very
good job even at merely grasping common subexpressions.

Common subexpressions are important; loop-invariant code motion is even
more so, because it reduces the amount of work by a factor equal to the
number of iterations of the loop.
On the other hand, value copying instead of just aliasing (as for cay_j
above) may have a better chance of improvement.

This is actually no different from extracting a partial array access:
same idea, same potential gain.

-- Alain.
 

Leandro

OK, so I'm trying to improve the performance of the code. I would like to replace the TNT library with another, more efficient one (as suggested by Alain Ketterlin). So I tried to use direct pointers to store the data. Unfortunately, the code with TNT runs ~3 times faster than using pointers. How could I change this code to get better performance than TNT? I've followed the suggestions of Öö Tiib and Alain Ketterlin (hoisting some array accesses into the outer loop) and the performance improved by 10%~15%, but it is still worse than using TNT (http://math.nist.gov/tnt/overview.html).

Code with 3 tests:

#include "tnt/tnt.h"
#include "TicTac.h"

using namespace std;
using namespace TNT;

inline double getCM(double* array, int m0, int m1, int i, int j, int k) {
return array[i + m0*j + m0*m1*k];
}
inline void setCM(double* array, int m0, int m1, int i, int j, int k, double value) {
array[i + m0*j + m0*m1*k] = value;
}
inline void addCM(double* array, int m0, int m1, int i, int j, int k, double value) {
array[i + m0*j + m0*m1*k] += value;
}
int main() {

int Npml = 10;
int NxTOTAL = 100;
int NyTOTAL = 150;
int NzTOTAL = 200;
int TMAX = 1000;
int jb = NyTOTAL - Npml - 1;
int jyh;
int i, j, k, t;
double curl_h;


/* TNT - TEST 1 */
// Array1D<double> gi3 = Array1D<double>(NxTOTAL, 1.0);
// Array1D<double> cax = Array1D<double>(NxTOTAL, 0.0);
// Array1D<double> cay = Array1D<double>(NyTOTAL, 0.0);
// Array1D<double> caz = Array1D<double>(NzTOTAL, 0.0);
//
// Array3D<double> Hx = Array3D<double>(NxTOTAL,NyTOTAL,NzTOTAL, 0.0);
// Array3D<double> Hy = Array3D<double>(NxTOTAL,NyTOTAL,NzTOTAL, 0.0);
// Array3D<double> Hz = Array3D<double>(NxTOTAL,NyTOTAL,NzTOTAL, 0.0);
//
// Array1D<double> gk2 = Array1D<double>(NzTOTAL, 1.0);
// Array1D<double> gk3 = Array1D<double>(NzTOTAL, 1.0);
//
// Array3D<double> idyh = Array3D<double>(NxTOTAL, Npml, NzTOTAL, 0.0);
//
// Array3D<double> Dy = Array3D<double>(NxTOTAL,NyTOTAL,NzTOTAL, 0.0);
//
// Array1D<double> gi2 = Array1D<double>(NxTOTAL, 1.0);
// Array1D<double> gj1 = Array1D<double>(NyTOTAL, 0.0);
//
// TicTac tt;
// tt.Tic("TNT: ");
// for (t = 0; t < TMAX; t++) {
// for(i = 1; i < NxTOTAL; i++) {
// for(j = jb+1; j < NyTOTAL; j++) {
// jyh = j - jb - 1;
// for(k = 1; k < NzTOTAL; k++)
// {
// curl_h = caz[k]*(Hx[i][j][k] - Hx[i][j][k-1]) -
// cax[i]*(Hz[i][j][k] - Hz[i-1][j][k]);
// idyh[i][jyh][k] = idyh[i][jyh][k] + curl_h;
// Dy[i][j][k] = gi3[i]*gk3[k]*Dy[i][j][k] +
// gi2[i]*gk2[k]*(curl_h + gj1[j]*idyh[i][jyh][k]);
// }
// }
// }
// }
// tt.Tac();



/* C-ARRAY - TEST 2 */
// double* gi3 = new double[NxTOTAL];
// double* cax = new double[NxTOTAL];
// double* cay = new double[NyTOTAL];
// double* caz = new double[NzTOTAL];
//
// double* Hx = new double[NxTOTAL*NyTOTAL*NzTOTAL];
// double* Hy = new double[NxTOTAL*NyTOTAL*NzTOTAL];
// double* Hz = new double[NxTOTAL*NyTOTAL*NzTOTAL];
//
// double* gk2 = new double[NzTOTAL];
// double* gk3 = new double[NzTOTAL];
//
// double* idyh = new double[NxTOTAL*NyTOTAL*NzTOTAL];
// double* Dy = new double[NxTOTAL*NyTOTAL*NzTOTAL];
//
// double* gi2 = new double[NxTOTAL];
// double* gj1 = new double[NyTOTAL];
//
// TicTac tt;
// tt.Tic("Array");
// for (t = 0; t < TMAX; t++) {
// for(i = 1; i < NxTOTAL; i++) {
// for(j = jb+1; j < NyTOTAL; j++) {
// jyh = j - jb - 1;
// for(k = 1; k < NzTOTAL; k++)
// {
// curl_h = caz[k]*(getCM(Hx, NxTOTAL, NyTOTAL, i, j, k) - getCM(Hx, NxTOTAL, NyTOTAL, i, j, k-1)) -
// cax[i]*(getCM(Hz, NxTOTAL, NyTOTAL, i, j, k) - getCM(Hz, NxTOTAL, NyTOTAL, i-1, j, k));
//
// addCM(idyh, NxTOTAL, NyTOTAL, i, jyh, k, curl_h);
//
// double value = gi3[i]*gk3[k]*getCM(Dy, NxTOTAL, NyTOTAL, i, j, k) +
// gi2[i]*gk2[k]*(curl_h + gj1[j]*getCM(idyh, NxTOTAL, NyTOTAL, i, jyh, k));
// setCM(Dy, NxTOTAL, NyTOTAL, i, j, k, value);
//
// }
// }
// }
// }
// tt.Tac();
//
// delete[] gi3;
// delete[] cax;
// delete[] cay;
// delete[] caz;
//
// delete[] Hx;
// delete[] Hy;
// delete[] Hz;
//
// delete[] gk2;
// delete[] gk3;
//
// delete[] idyh;
// delete[] Dy;
//
// delete[] gi2;
// delete[] gj1;




/* C-ARRAY (IMPROVED) - TEST 3*/
double* gi3 = new double[NxTOTAL];
double* cax = new double[NxTOTAL];
double* cay = new double[NyTOTAL];
double* caz = new double[NzTOTAL];

double* Hx = new double[NxTOTAL*NyTOTAL*NzTOTAL];
double* Hy = new double[NxTOTAL*NyTOTAL*NzTOTAL];
double* Hz = new double[NxTOTAL*NyTOTAL*NzTOTAL];

double* gk2 = new double[NzTOTAL];
double* gk3 = new double[NzTOTAL];

double* idyh = new double[NxTOTAL*NyTOTAL*NzTOTAL];
double* Dy = new double[NxTOTAL*NyTOTAL*NzTOTAL];

double* gi2 = new double[NxTOTAL];
double* gj1 = new double[NyTOTAL];

TicTac tt;
tt.Tic("Array (improved): ");
for (t = 0; t < TMAX; t++) {
for(i = 1; i < NxTOTAL; i++) {
for(j = jb+1; j < NyTOTAL; j++) {
double* Dy_ij = Dy + (i + j*NxTOTAL);
double* Hx_ij = Hx + (i + j*NxTOTAL);
double* Hz_ij = Hz + (i + j*NxTOTAL);
double* Hz_i1_j = Hz + (i - 1 + j*NxTOTAL);

jyh = j - jb - 1;
for(k = 1; k < NzTOTAL; k++)
{
double curl_h = caz[k]*(Hx_ij[NxTOTAL*NyTOTAL*k] - Hx_ij[NxTOTAL*NyTOTAL*(k-1)]) -
cax[i]*(Hz_ij[NxTOTAL*NyTOTAL*k] - Hz_i1_j[NxTOTAL*NyTOTAL*k]);

addCM(idyh, NxTOTAL, NyTOTAL, i, jyh, k, curl_h);

double value = gi3[i]*gk3[k]*Dy_ij[NxTOTAL*NyTOTAL*k] +
gi2[i]*gk2[k]*(curl_h + gj1[j]*getCM(idyh, NxTOTAL, NyTOTAL, i, jyh, k));
Dy_ij[NxTOTAL*NyTOTAL*k] = value;
}
}
}
}
tt.Tac();

delete[] gi3;
delete[] cax;
delete[] cay;
delete[] caz;

delete[] Hx;
delete[] Hy;
delete[] Hz;

delete[] gk2;
delete[] gk3;

delete[] idyh;
delete[] Dy;

delete[] gi2;
delete[] gj1;
}
----------------
TicTac just measures the elapsed time and prints it:

#include <ctime>
#include <iostream>
#include <string>
using namespace std;

class TicTac {
private:
string message;
double tstart;
double tend;
public:
void Tic(string msg);
void Tac();
};

void TicTac::Tic(string msg) {
message = msg;
tstart = (double)clock()/CLOCKS_PER_SEC;
}

void TicTac::Tac() {
tend = (double)clock()/CLOCKS_PER_SEC;
cout << message << ". TIME: " << (tend-tstart) << "s" << endl;
}


Thanks
 

Victor Bazarov

OK, so I'm trying to improve the performance of the code. I would
like to replace the TNT library with another, more efficient one (as
suggested by Alain Ketterlin). So I tried to use direct pointers to
store the data. Unfortunately, the code with TNT runs ~3 times faster
than using pointers. How could I change this code to get better
performance than TNT? I've followed the suggestions of Öö Tiib and
Alain Ketterlin (hoisting some array accesses into the outer loop) and
the performance improved by 10%~15%, but it is still worse than using
TNT (http://math.nist.gov/tnt/overview.html).

Give it another thought. Take a look at the following.

double a[N*M] = {};
for (int i = 0; i < N; i++)
{
for (int j = 0; j < M; j++)
{
do_something( a[i*M + j] ); // pay attention here
}
}

And compare it to

double a[N*M] = {};
for (int i = 0; i < N; i++)
{
double* pai = a + i*M;
for (int j = 0; j < M; j++)
{
do_something( pai[j] ); // pay attention here too
}
}

What do you see different here?

I do not want to get your hopes too much up, so to speak, but IME tricks
like that *have shown* improvement (although not three-fold) over the
use of full indexing arithmetic in the innermost loop.

V
 

Öö Tiib

OK, so I'm trying to improve the performance of the code.
I would like to replace the TNT library with another, more efficient
one (as suggested by Alain Ketterlin). So I tried to use direct pointers
to store the data. Unfortunately, the code with TNT runs ~3 times
faster than using pointers. How could I change this code to get
better performance than TNT? I've followed the suggestions of Öö Tiib
and Alain Ketterlin (hoisting some array accesses into the outer loop)
and the performance improved by 10%~15%, but it is still worse than
using TNT (http://math.nist.gov/tnt/overview.html).

If you want to beat someone, then you need to master the whole thing;
that is impossible to achieve in a single post on comp.lang.c++. For
example, here is how to replace multiplication with addition:

for ( t = 0; t < TMAX; t++ ) {
for ( i = 1; i < NxTOTAL; i++ ) {
// multiply once
double* Dy_ij = Dy + (i + jb*NxTOTAL);
double* Hx_ij = Hx + (i + jb*NxTOTAL);
double* Hz_ij = Hz + (i + jb*NxTOTAL);
double* Hz_i1_j = Hz + (i - 1 + jb*NxTOTAL);
for ( j = jb+1; j < NyTOTAL; j++ ) {
// just add the rest of the time
Dy_ij += NxTOTAL;
Hx_ij += NxTOTAL;
Hz_ij += NxTOTAL;
Hz_i1_j += NxTOTAL;
// ... rest of the loop body ...
}
}
}

In general ... profile -> read books & optimize -> profile, in a cycle,
until you cannot make it any better. ;)
 

Alain Ketterlin

Leandro said:
inline double getCM(double* array, int m0, int m1, int i, int j, int k) {
return array[i + m0*j + m0*m1*k];
}

Leandro, this is "wrong". C/C++ stores arrays in row-major order, so if
the array is of size N*M*P, element [i][j][k] is at i*M*P + j*P + k (and the
problem is the same inside the loop).

Having it the wrong way around completely kills spatial locality, so it is
not surprising you get worse results, even though your numbers
are probably right.

Also, in a first experiment, leave it to the compiler to move code out of
the loops (compile with -O1 at least; gcc should be able to cope).

-- Alain.
 

Seungbeom Kim

Consider a loop like:

for ( i=... )
for ( j=... )
... T.at(i,j) ...

(I use at(i,j) to abstract away the effective access). Now assume T is a
linearized array. This access is equivalent to:

T.data[i*N+j]

Calculating i*N on every iteration of the j loop costs something. What
Öö suggests is moving this computation out of the j loop.

I have experimented with a simple program that calculates element-wise
multiplication of two matrices.

// Version A:
for (size_t i = 0; i < M; ++i) {
for (size_t j = 0; j < N; ++j) {
C[N * i + j] = A[N * i + j] * B[N * i + j];
}
}

// Version B:
for (size_t i = 0; i < M; ++i) {
const T* A_i = &A[N * i];
const T* B_i = &B[N * i];
T* C_i = &C[N * i];
for (size_t j = 0; j < N; ++j) {
C_i[j] = A_i[j] * B_i[j];
}
}

(T=double, M=N=256, repeating for 10000 times)

95% confidence intervals of the running time based on 100 trials are:

A B
g++ -O2: 1.594±0.005 1.592±0.005
g++ -O3: 1.648±0.005 1.607±0.005
(using g++ 4.6.3 on x86_64 Linux)

Under g++ -O2, the difference between the two versions is negligible.
However, g++ -O3 makes both worse, and makes Version A much worse.

Anyway, this shows that compilers may not always be as smart as I wish
them to be. :-(
 

James Kanze

On 2013-02-25 09:39, Öö Tiib wrote: [...]
for(j = 1; j < Ny; j++)
{
for(k = 1; k < Nz; k++)
{
curl_h = cay[j]*(Hz[j][k] - Hz[j-1][k])
- caz[k]*(Hy[j][k] - Hy[j][k-1]);
// ...
// rest of the stuff
}
}
Now you replace it ... (assuming the arrays contain doubles) with that: [...]
for(j = 1; j < Ny; j++)
{
double cay_j = cay[j];
double (&Hz_i_j)[Nz] = Hz[j];
double (&Hz_i_j_1)[Nz] = Hz[j-1];
double (&Hy_i_j)[Nz] = Hy[j];
for(k = 1; k < Nz; k++)
{
curl_h = cay_j*(Hz_i_j[k] - Hz_i_j_1[k])
- caz[k]*(Hy_i_j[k] - Hy_i_j[k-1]);
// ...
// rest of the stuff
}
}

I'm just speculating, but considering that references just
create aliases, I doubt that using references like this will
affect the generated code in any significant way.

It will, and a lot.

If it does, then there's something seriously wrong with the
quality of the compiler. Hoisting loop invariants was
a standard optimization technique twenty or thirty years ago.
In the fragment above you have more array references
than actual computations (like + - etc.). Consider a loop like:
for ( i=... )
for ( j=... )
... T.at(i,j) ...
(I use at(i,j) to abstract away the effective access). Now assume T is a
linearized array. This access is equivalent to:
T.data[i*N+j]

Calculating i*N on every iteration of the j loop costs
something. What Öö suggests is moving this computation out of
the j loop.

Except that pretty much every compiler in the world will do this
for you. And typically, when the compiler does it, it can do it
better than if you try to do it by hand, since it knows the
finality of the references and pointers it creates.
Note that if the array is stored as a 1D array of pointers to 1D arrays
of doubles, the same argument applies. T[j] is:
*(*(T.rows+i)+j)

again, *(T.rows+i) could be hoisted out of the j loop (if that location
is not overwritten during the run of the loop -- unfortunately it is
difficult for a compiler to realize this, and that's where restrict is
useful).

There is a large amount of computation going on during array accesses,
and factoring as much as possible to put it out of the loop is one of
the major optimization opportunity.
(To the OP: the library you use just makes array accesses much more
costly than they should be, and probably prevents this kind of
optimization.)
Common subexpressions are important, loop-invariant code motion is even
more, because it reduces the amount of code by a factor equal to the
number of iterations of the loop.
This is actually no different from extracting a partial array access:
same idea, same potential gain.

Extracting the value has an important benefit: when the compiler
must access the values, it has to take into account possible
aliasing. Thus, in `idxl[...] = ...` and `Dx[...] = ...`,
the compiler will probably have to assume that any
expression that references into any of the other arrays might
have a different result after each assignment.

This depends on the definitions of the other arrays. If e.g.
`Hz` and `idxl` are both C style arrays (not pointers
which point to the first element, but the actual data
definitions), then the compiler can know that they don't
alias one another. Otherwise, it has to assume the worst, and
reread the elements each time through the loop. Manually
hoisting the reads of `gj3[j]` etc. out of the innermost loop
might make a significant difference, because the compiler
probably cannot do this (since if all it has are pointers, it
has to assume that one of the other assignments in the innermost
loop might modify the value). Depending on the dimensions, it
might actually be worth copying `Hz[j]` et al. into one-dimensional
local arrays (which the compiler can see aren't being
modified).

Even better, of course, would be if the compiler has extensions
which allow you to tell it that there isn't any aliasing. C++11
didn't adopt C's `restrict` keyword, but this is one place where
it could help enormously (and some C++ compilers might support
it as an extension).
 
