Problems with performance

Discussion in 'C++' started by Leandro, Feb 25, 2013.

  1. Leandro

    Leandro Guest

I'm writing an FDTD code (electromagnetic simulation) and I'm having some trouble with the performance of the code.

We have two versions. The first one runs just the calculation (version1). The second one is the whole application (GUI version, using wxWidgets), with the same calculation routine (version2). The problem is that version2 runs almost twice as slowly as version1, and I can't understand why.

The FDTD calculation consists of a lot of loops (one loop in time and three in the x, y and z directions). So I removed all of them but one and tried "Very Sleepy" to profile the code. It shows me that the exact same piece of code runs at very different speeds in the two versions, reproduced below (variables accessed with [] are of type TNT::Array3D - http://math.nist.gov/tnt/overview.html)

    Here are the results, compiled with g++:

    Version1:

    0.15s void FdtdSolver::CalculateDx()
          {
              int i, j, k;
              double curl_h;
              // Calculate the Dx field
              for(i = 1; i < ia; i++)
              {
    0.01s         for(j = 1; j < Ny; j++)
                  {
    0.06s             for(k = 1; k < Nz; k++)
                      {
                          curl_h = cay[j]*(Hz[i][j][k] - Hz[i][j-1][k]) -
    0.38s                     caz[k]*(Hy[i][j][k] - Hy[i][j][k-1]);
    0.10s                 idxl[i][j][k] = idxl[i][j][k] + curl_h;
                          Dx[i][j][k] = gj3[j]*gk3[k]*Dx[i][j][k] +
    0.29s                     gj2[j]*gk2[k]*(curl_h + gi1*idxl[i][j][k]);
                      }
                  }
              }

              // Other loops with the same behavior...
          }

    Version2:

    0.01s void FDTDEngine::CalculateDx()
          {
              int i, j, k;
              double curl_h;
              // Calculate the Dx field
              for(i = 1; i < ia; i++)
              {
    0.00s         for(j = 1; j < Ny; j++)
                  {
    0.06s             for(k = 1; k < Nz; k++)
                      {
    0.01s                 curl_h = cay[j]*(Hz[i][j][k] - Hz[i][j-1][k]) -
    0.53s                     caz[k]*(Hy[i][j][k] - Hy[i][j][k-1]);
    0.10s                 idxl[i][j][k] = idxl[i][j][k] + curl_h;
    0.02s                 Dx[i][j][k] = gj3[j]*gk3[k]*Dx[i][j][k] +
    0.36s                     gj2[j]*gk2[k]*(curl_h + gi1*idxl[i][j][k]);
                      }
                  }
              }

              // Other loops with the same behavior...
          }

    The question is: What kind of thing can I do to solve this problem?

    Tks!

    ps.: Sorry for the language. Non native speaker...
     
    Leandro, Feb 25, 2013
    #1

  2. Leandro

    Öö Tiib Guest

    On Monday, 25 February 2013 12:35:38 UTC+2, Leandro wrote:
    > We have two versions. The first one runs just the calculus (version1). The
    > second one is the whole application (GUI version, using wxWidgets), with
    > the same calculus routine (version2). The problem is that version2 runs
    > almost twice slower than version1, and I can't understand why.


    Without seeing the full code base we can only speculate.

    Things that are cut out of context tend to run a lot faster. That is
    1) because of better memory locality (both for data and for code)
    2) because the compiler has fewer opportunities to waste precious resources
    (like registers) on less important things.

    > The FDTD calculation is a lot of loops (one loop in time and three in x,
    > y and z direction). So I removed all of them but one and tried
    > "Very Sleepy" to profile the code.


    "Very Sleepy" is "often good enough". There are plenty of other bit more
    accurate and detailed profiling tools.

    > The question is: What kind of thing can I do to solve this problem?


    What problem? The difference is normal. I have always observed similar
    differences.

    As for the performance in general ... what have you tried to improve it?
     
    Öö Tiib, Feb 25, 2013
    #2

  3. Leandro

    Rui Maciel Guest

    Leandro wrote:

    > Here are the results, compiled with g++:
    >

    <snip/>
    >
    > The question is: What kind of thing can I do to solve this problem?


    Without access to the code it's hard, if not impossible, to tell. If it
    isn't possible to access a fully working example which reproduces the
    problem you're experiencing, then no one can really say anything about
    what's going on in your code.


    Rui Maciel
     
    Rui Maciel, Feb 25, 2013
    #3
  4. Leandro

    Leandro Guest

    I'll start from the end...

    "What problem? The difference is normal. I have always observed similar differences. "

    When I simulate a big problem using the console application, the console version runs in, for example, 3 hours. The GUI version runs in more than 5 hours. So, this may be normal, but it's a problem for me (or for my available time...).

    "As for the performance in general ... what have you tried to improve it? "

    No. The performance is important only during the simulation. When opening a file, it doesn't matter if it takes 1 or 2 seconds. So, I'm focusing only on the simulation. Maybe I'm doing it wrong. What kind of performance improvement do you mean?

    "Very Sleepy is often good enough. There are plenty of other bit more accurate and detailed profiling tools. "

    Can you suggest other profiling tools? The project runs on Windows, and this is not commercial software, so it would be great if the tool were free.

    "When not seeing full code base we can only speculate.

    Things that are cut out of context tend to run a lot faster. That is
    1) because of better memory locality (both for data and for code)
    2) because the compiler has fewer opportunities to waste precious resources
    (like registers) on less important things. "

    Unfortunately I'm not allowed to show the code of the whole system. But your tips gave me some hope. Is there any software that allows me to check the two versions and see if the difference is due to better memory locality? Is it possible to rearrange things in the class declaration to get faster code? The simulation code is in a class that inherits from others, and there are a few relations with other classes. Can this fact slow down the main code? If I extract the simulation code to an external file/library, is it possible that it will run faster? Do you have any good reference to suggest on this subject?

    Later I'll try to isolate the GUI and run just the simulation code. Maybe the problem is with my class structure.

    I know that without the whole code you guys can't discover the problem, but I'm not asking for that. In fact, I want some sort of tip just like yours: "this might be yyy", "read about zzz in the book/link www", "try tool kkk to check if there is zzz", and so on and so forth. In fact, I must learn how to debug this kind of "problem".

    So, thanks for your answers and patience! I really appreciate it.

    Best regards.
     
    Leandro, Feb 25, 2013
    #4
  5. Leandro

    Stuart Guest

    On 25.02.13 11:35, Leandro wrote:
    > I'm writing a FDTD code (electromagnetic simulation) and I'm having some troubles with the performance of the code.
    >
    > We have two versions. The first one runs just the calculus (version1). The second one is the whole application (GUI version, using wxWidgets), with the same calculus routine (version2). The problem is that version2 runs almost twice slower than version1, and I can't understand why.
    >
    > The FDTD calculation consists of a lot of loops (one loop in time and three in the x, y and z directions). So I removed all of them but one and tried "Very Sleepy" to profile the code. It shows me that the exact same piece of code runs at very different speeds in the two versions, reproduced below (variables accessed with [] are of type TNT::Array3D - http://math.nist.gov/tnt/overview.html)
    >
    <snip/>
    >
    > The question is: What kind of thing can I do to solve this problem?


    Just a guess:
    Check whether both executables are built using the same compiler
    settings. For example, if the GUI executable is built with DEBUG
    defined, your code may use a version of operator[] which does bounds
    checking (only if you are using std::vector instead of plain arrays).
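    To illustrate that point with a sketch (a hypothetical bounds-checked
    accessor, not TNT's or std::vector's actual code): in a debug-style build
    every element access pays for a range test, while defining NDEBUG compiles
    the check away entirely.

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <vector>

    // Hypothetical bounds-checked accessor: with NDEBUG undefined, every
    // access performs a range test; with -DNDEBUG the assert disappears
    // and the access is a bare load. Debug builds of array classes that
    // check bounds this way can easily be severalfold slower.
    double checked_get(const std::vector<double>& v, std::size_t i) {
        assert(i < v.size());   // removed entirely when NDEBUG is defined
        return v[i];
    }

    int main() {
        std::vector<double> v{1.0, 2.0, 3.0};
        assert(checked_get(v, 2) == 3.0);
        return 0;
    }
    ```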

    If nothing helps, you could put the relevant code into a library and
    compare once again (this should rule out that you are profiling two
    differently compiled versions of the same code).

    Another guess: Your second version is running inside a GUI app, so it is
    probably running on a worker thread. Maybe the system (either the
    library or the OS) lowers the priority of background threads?

    Regards,
    Stuart
     
    Stuart, Feb 25, 2013
    #5
  6. Leandro

    Leandro Guest

    Many thanks, Stuart!

    I've already checked item 1 and it's OK. In fact, with the DEBUG define enabled, operator[] of TNT::Array3D slows down by a factor of 2 (bounds checking), i.e., the code becomes almost 4 times slower.

    "Another guess: Your second version is running inside a GUI app, so it is probably running on a worker thread. Maybe the system (either the library or the OS) lowers the priority of background threads?"

    In this case, when I start the simulation, I also start a wxProgressDialog, so the user can cancel the simulation. I tried this same idea with version1 (using a Frame and a wxProgressDialog), but the performance difference holds.

    "If nothing helps, you could put the relevant code into a library and compare once again (this should rule out that you are profiling two differently compiled versions of the same code). "

    I'll try this and profile the code again. Maybe it helps. Thanks a lot.

    Best regards

    On Monday, 25 February 2013 at 12:37:18 UTC-3, Stuart wrote:

    > Just a guess:
    > Check whether both executables are built using the same compiler
    > settings. For example, if the GUI executable is built with DEBUG
    > defined, your code may use a version of operator[] which does bounds
    > checking (only if you are using std::vector instead of plain arrays).
    >
    > If nothing helps, you could put the relevant code into a library and
    > compare once again (this should rule out that you are profiling two
    > differently compiled versions of the same code).
    >
    > Another guess: Your second version is running inside a GUI app, so it is
    > probably running on a worker thread. Maybe the system (either the
    > library or the OS) lowers the priority of background threads?
    >
    > Regards,
    > Stuart
     
    Leandro, Feb 25, 2013
    #6
  7. Leandro

    Öö Tiib Guest

    On Monday, 25 February 2013 14:50:06 UTC+2, Leandro wrote:
    > I'll start from the end...
    >
    > "What problem? The difference is normal. I have always observed similar
    > differences. "
    >
    > When I simulate a big problem using the console application, the console
    > version runs in, for example, 3 hours. The GUI version runs in more than 5
    > hours. So, this can be normal, but it's a problem for me (or for my available
    > time...).


    Measuring without changing does not help much.

    > "As for the performance in general ... what have you tried to improve it? "
    >
    > What king of performance improvement do you mean?


    A change that leaves the code's meaning the same but makes it (hopefully)
    faster. Help the compiler to optimize where it fails. For example, you have
    code like this:

    // stuff
    // ...
    for(j = 1; j < Ny; j++)
    {
        for(k = 1; k < Nz; k++)
        {
            curl_h = cay[j]*(Hz[i][j][k] - Hz[i][j-1][k])
                   - caz[k]*(Hy[i][j][k] - Hy[i][j][k-1]);
            // ...
            // rest of the stuff
        }
    }

    Now you replace it ... (assuming the arrays contain doubles) with this:

    // stuff
    // ...
    for(j = 1; j < Ny; j++)
    {
        double cay_j = cay[j];
        double (&Hz_i_j)[Nz] = Hz[i][j];
        double (&Hz_i_j_1)[Nz] = Hz[i][j-1];
        double (&Hy_i_j)[Nz] = Hy[i][j];

        for(k = 1; k < Nz; k++)
        {
            curl_h = cay_j*(Hz_i_j[k] - Hz_i_j_1[k])
                   - caz[k]*(Hy_i_j[k] - Hy_i_j[k-1]);
            // ...
            // rest of the stuff
        }
    }

    That may speed it up, or maybe not. Reading the produced assembler or
    testing can show.

    > "Very Sleepy is often good enough. There are plenty of other bit more
    > accurate and detailed profiling tools. "
    >
    > Can you suggest other profiling tools? The project runs in Windows,
    > and this is not a comercial software, so it would be great if the
    > tool were free.


    MS compilers can do profile guided optimizations themselves. From
    "Build/Profile Guided Optimization" menu.

    > Unfortunately I'm not allowed to show the code of the whole system.


    It is not commercial, but the source can't be shown, so it isn't free
    either. What remains ... ? Criminal or military goals? Those can pay even
    better than commercial ones. Anyway, we do not care about your whole code.
    Cut out the part that you want to optimize in a way that we can compile
    and run it.

    > I know that without the whole code you guys can't discover the problem,
    > but I'm not asking for this. In fact, I want some sort of tip just like
    > yours: "read about zzz in the book/link www".


    The Internet is full of suggestions on how to "optimize C and C++ code".
    Some are better, some worse, none perfect. You will get similar
    suggestions here.
     
    Öö Tiib, Feb 25, 2013
    #7
  8. Leandro

    Jorgen Grahn Guest

    On Mon, 2013-02-25, Leandro wrote:
    > I'm writing a FDTD code (electromagnetic simulation) and I'm having
    > some troubles with the performance of the code.


    It's not my area, but all the loops and foo[m][n] seem like linear
    algebra to me. If it is, you might want to check for linalg or matrix
    libraries. They can (I'm told) make your code more readable /and/
    make it run significantly faster.

    /Jorgen

    --
    // Jorgen Grahn <grahn@ Oo o. . .
    \X/ snipabacken.se> O o .
     
    Jorgen Grahn, Feb 25, 2013
    #8
  9. Leandro <> writes:

    > We have two versions. The first one runs just the calculus (version1).
    > The second one is the whole application (GUI version, using
    > wxWidgets), with the same calculus routine (version2). The problem is
    > that version2 runs almost twice slower than version1, and I can't
    > understand why.


    As far as I can see, the two codes are identical. If that is the case,
    there is no reason why they should behave differently, except for
    memory-related causes (like alignment, fragmentation, etc.) Make sure
    you allocate your data in similar conditions (e.g., allocate everything
    at program start).

    > The FDTD calculation is a lot of loops (one loop in time and three in
    > x, y and z direction). So I removed all of them but one and tried
    > "Very Sleepy" to profile the code. It shows me that the exactly piece
    > of code runs with very different speed in the two versions, reproduced
    > below (variable with [] are of type TNT::Array3D -
    > http://math.nist.gov/tnt/overview.html)


    I had a quick look at this, and it doesn't seem very efficient.
    In particular, allocating a 3D array of doubles (I guess) allocates a 2D
    array of pointers (to rows of doubles), which itself causes the
    allocation of a 1D array of pointers (to rows of pointers). The only
    reason for this is the support of the [j][k] notation. All this will
    fragment your memory, and may blow the cache (an access to [j][k]
    requires accessing 3 arrays). If your code is that simple, use linear
    arrays and do the index arithmetic yourself. You'll save a lot.

    But there is no reason this changes from version 1 to version 2, except
    if your array allocations are interleaved with other allocations.
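    A minimal sketch of such a linear array (illustrative only; the class name
    and layout here are assumptions, not TNT code): one contiguous buffer plus
    index arithmetic, so an element access touches a single allocation instead
    of chasing two levels of pointers.

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <vector>

    // Flat 3D array: one contiguous allocation, index arithmetic in at().
    // No pointer-to-pointer indirection, and elements adjacent in k are
    // adjacent in memory (cache friendly).
    struct Flat3D {
        int nx, ny, nz;
        std::vector<double> data;

        Flat3D(int nx_, int ny_, int nz_)
            : nx(nx_), ny(ny_), nz(nz_),
              data(static_cast<std::size_t>(nx_) * ny_ * nz_, 0.0) {}

        // Element (i,j,k): a single multiply-add chain, one memory load.
        double& at(int i, int j, int k) {
            return data[(static_cast<std::size_t>(i) * ny + j) * nz + k];
        }
    };

    int main() {
        Flat3D a(4, 5, 6);
        a.at(1, 2, 3) = 42.0;
        assert(a.at(1, 2, 3) == 42.0);
        // Consecutive k indices are contiguous in memory:
        assert(&a.at(1, 2, 4) == &a.at(1, 2, 3) + 1);
        return 0;
    }
    ```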

    -- Alain.
     
    Alain Ketterlin, Feb 25, 2013
    #9
  10. On 2013-02-25 09:39, Öö Tiib wrote:
    >
    > Change that leaves code same but makes it (hopefully) faster. Help the
    > compiler to optimize where it fails? For example you have such
    > code:
    >
    > // stuff
    > // ...
    > for(j = 1; j < Ny; j++)
    > {
    > for(k = 1; k < Nz; k++)
    > {
    > curl_h = cay[j]*(Hz[j][k] - Hz[j-1][k])
    > - caz[k]*(Hy[j][k] - Hy[j][k-1]);
    > // ...
    > // rest of the stuff
    > }
    > }
    >
    > Now you replace it ... (assuming the arrays contain doubles) with that:
    >
    > // stuff
    > // ...
    > for(j = 1; j < Ny; j++)
    > {
    > double cay_j = cay[j];
    > double (&Hz_i_j)[Nz] = Hz[j];
    > double (&Hz_i_j_1)[Nz] = Hz[j-1];
    > double (&Hy_i_j)[Nz] = Hy[j];
    >
    > for(k = 1; k < Nz; k++)
    > {
    > curl_h = cay_j*(Hz_i_j[k] - Hz_i_j_1[k])
    > - caz[k]*(Hy_i_j[k] - Hy_i_j[k-1]);
    > // ...
    > // rest of the stuff
    > }
    > }
    >
    > That maybe speeds it maybe not. Reading produced assembler or testing can
    > show.


    I'm just speculating, but reflecting that references just create aliases,
    I doubt that using references like this will affect the generated code
    in any significant way. (Creating a reference will not even trigger a
    prefetch, will it?) And if it actually does, the compiler must not have
    been doing a very good job even at merely grasping common subexpressions.

    On the other hand, value copying instead of just aliasing (as for cay_j
    above) may have a better chance of improvement.

    --
    Seungbeom Kim
     
    Seungbeom Kim, Feb 26, 2013
    #10
  11. Leandro

    Öö Tiib Guest

    On Tuesday, 26 February 2013 09:05:11 UTC+2, Seungbeom Kim wrote:
    > On 2013-02-25 09:39, Öö Tiib wrote:
    > >
    > > Change that leaves code same but makes it (hopefully) faster. Help the
    > > compiler to optimize where it fails? For example you have such
    > > code:
    > > Now you replace it ... (assuming the arrays contain doubles) with that:
    > > That maybe speeds it maybe not. Reading produced assembler or testing can
    > > show.

    >
    > I'm just speculating, but reflecting that references just create aliases,
    > I doubt that using references like this will affect the generated code
    > in any significant way. (Creating a reference will not even trigger a
    > prefetch, will it?) And if it actually does, the compiler must not have
    > been doing a very good job even at merely grasping common subexpressions.


    I asked for real code and what optimizations he had tried. He asked in
    return what I meant by optimizations. The above was an example of what
    optimizations may look like.

    > On the other hand, value copying instead of just aliasing (as for cay_j
    > above) may have a better chance of improvement.


    Real compilers manage to do that pretty well too.
     
    Öö Tiib, Feb 26, 2013
    #11
  12. Leandro

    Leandro Guest


    "As others have mentioned, we're not seeing enough to really be able to
    tell, but... "

    OK, you're right. I tried to clean up the code of my class, removing all data members and leaving just what is necessary to do the calculation. As a result, both versions run at the same speed (the difference was irrelevant).

    "Certainly things like (un)fortuitous memory alignment in one case or
    the other case can make a big difference. You may want to check how
    that's different between the codes. You shouldn't really be seeing
    that much of a performance hit just from being in the GUI application,
    I assume the system is mostly not doing anything other than running
    your thread and the GUI is mostly just sitting there, so there should
    not be too much extra cache contention, although it's something to
    consider."

    I don't think the problem is with the GUI. I also tried to run the console version with a frame, but the difference holds. Another thing I can do is insert the "good" code into the GUI version and check the performance.

    Considering the 30-day trial, I'll download Intel VTune and look for a tutorial. Tks.

    ----------------------------
    Öö Tiib:

    "Change that leaves code same but makes it (hopefully) faster. Help the
    compiler to optimize where it fails? For example you have such
    code:..."

    I've not tried this yet. First I'm trying to understand why the same code runs at completely different speeds in the two versions. But you're right. In fact, someone said that TNT is not efficient, so I'll check out another library (maybe Armadillo or Blitz++, I don't know yet).

    "t is not commercial but source can't be shown so free neither. What
    remains ... ? Criminal or military goals? Those can pay even better
    than commercial. Anyway we do not care about your whole code. Cut out
    the part that you want to optimize in a way that we can compile it and
    run. "

    This is an academic project, but there are other people who own the project. I cut out the part I want to optimize, but then both versions run at the same speed. I'll try other things and, if they don't work, I'll cut out the part I want to optimize (leaving out most of the parts unnecessary for a test).

    ----------------------------
    Alain Ketterlin

    "I had a quick look at this, and it doesn't seem very efficient.
    Especially, allocating a 3D arrays of doubles (I guess) allocates a 2D
    array of pointers (to rows of doubles), which itself causes the
    allocation of a 1D array of pointer (to rows of pointers). "

    Tks. You're right, I'll look for a better library.
     
    Leandro, Feb 26, 2013
    #12
  13. Seungbeom Kim <> writes:

    > On 2013-02-25 09:39, Öö Tiib wrote:

    [...]
    >> for(j = 1; j < Ny; j++)
    >> {
    >> for(k = 1; k < Nz; k++)
    >> {
    >> curl_h = cay[j]*(Hz[j][k] - Hz[j-1][k])
    >> - caz[k]*(Hy[j][k] - Hy[j][k-1]);
    >> // ...
    >> // rest of the stuff
    >> }
    >> }
    >>
    >> Now you replace it ... (assuming the arrays contain doubles) with that:

    [...]
    >> for(j = 1; j < Ny; j++)
    >> {
    >> double cay_j = cay[j];
    >> double (&Hz_i_j)[Nz] = Hz[j];
    >> double (&Hz_i_j_1)[Nz] = Hz[j-1];
    >> double (&Hy_i_j)[Nz] = Hy[j];
    >>
    >> for(k = 1; k < Nz; k++)
    >> {
    >> curl_h = cay_j*(Hz_i_j[k] - Hz_i_j_1[k])
    >> - caz[k]*(Hy_i_j[k] - Hy_i_j[k-1]);
    >> // ...
    >> // rest of the stuff
    >> }
    >> }


    > I'm just speculating, but reflecting that references just create aliases,
    > I doubt that using references like this will affect the generated code
    > in any significant way.


    It will, and a lot. In the fragment above you have more array references
    than actual computations (like + - etc.). Consider a loop like:

    for ( i=... )
        for ( j=... )
            ... T.at(i,j) ...

    (I use at(i,j) to abstract away the effective access). Now assume T is a
    linearized array. This access is equivalent to:

    T.data[i*N+j]

    Calculating i*N on every iteration of the j loop costs something. What
    Öö suggests is moving this computation out of the j loop.

    Note that if the array is stored as a 1D array of pointers to 1D arrays
    of doubles, the same argument applies. T[i][j] is:

    *(*(T.rows+i)+j)

    again, *(T.rows+i) could be hoisted out of the j loop (if that location
    is not overwritten during the run of the loop -- unfortunately it is
    difficult for a compiler to realize this, and that's where restrict is
    useful).

    There is a large amount of computation going on during array accesses,
    and factoring as much of it as possible out of the loop is one of
    the major optimization opportunities.
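    Sketched concretely (hypothetical names; T is a flat N×M array), the
    hoisting described above looks like this:

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <vector>

    // Fill an N x M flat array, hoisting the row offset (i*M) out of the
    // inner loop. Compilers often do this themselves, but opaque operator[]
    // chains in pointer-of-pointer array classes can prevent it.
    std::vector<double> fill_rows(int N, int M) {
        std::vector<double> T(static_cast<std::size_t>(N) * M, 0.0);
        for (int i = 0; i < N; ++i) {
            // Loop-invariant: the row base is computed once per i,
            // not once per (i, j) pair.
            double* row = &T[static_cast<std::size_t>(i) * M];
            for (int j = 0; j < M; ++j) {
                row[j] = i * 10.0 + j;   // same element as T[i*M + j]
            }
        }
        return T;
    }

    int main() {
        std::vector<double> T = fill_rows(3, 4);
        assert(T[1 * 4 + 2] == 12.0);   // element (1,2)
        assert(T[2 * 4 + 3] == 23.0);   // element (2,3)
        return 0;
    }
    ```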

    (To the OP: the library you use just makes array accesses much more
    costly than they should be, and probably prevents this kind of
    optimization.)

    > (Creating a reference will not even trigger a prefetch, will it?)


    No.

    > And if it actually does, the compiler must not have been doing a very
    > good job even at merely grasping common subexpressions.


    Common subexpressions are important; loop-invariant code motion is even
    more so, because it reduces the amount of computation by a factor equal
    to the number of iterations of the loop.

    > On the other hand, value copying instead of just aliasing (as for cay_j
    > above) may have a better chance of improvement.


    This is actually no different from extracting a partial array access:
    same idea, same potential gain.

    -- Alain.
     
    Alain Ketterlin, Feb 26, 2013
    #13
  14. Leandro

    Leandro Guest

    OK, so I'm trying to improve the performance of the code. I would like to replace the TNT library with another, more efficient one (as suggested by Alain Ketterlin). So I tried to use direct pointers to store the data. Unfortunately, the code with TNT runs ~3 times faster than the version using pointers. How could I change this code to get better performance than with TNT? I've followed the suggestions of Öö Tiib and Alain Ketterlin (hoisting some array accesses into the outer loop) and the performance improved 10%~15%, but it is still worse than using TNT (http://math.nist.gov/tnt/overview.html).

    Code with 3 tests:

    #include "tnt/tnt.h"
    #include "TicTac.h"

    using namespace std;
    using namespace TNT;

    inline double getCM(double* array, int m0, int m1, int i, int j, int k) {
        return array[i + m0*j + m0*m1*k];
    }
    inline void setCM(double* array, int m0, int m1, int i, int j, int k, double value) {
        array[i + m0*j + m0*m1*k] = value;
    }
    inline void addCM(double* array, int m0, int m1, int i, int j, int k, double value) {
        array[i + m0*j + m0*m1*k] += value;
    }
    int main() {

    int Npml = 10;
    int NxTOTAL = 100;
    int NyTOTAL = 150;
    int NzTOTAL = 200;
    int TMAX = 1000;
    int jb = NyTOTAL - Npml - 1;
    int jyh;
    int i, j, k, t;
    double curl_h;


    /* TNT - TEST 1 */
    // Array1D<double> gi3 = Array1D<double>(NxTOTAL, 1.0);
    // Array1D<double> cax = Array1D<double>(NxTOTAL, 0.0);
    // Array1D<double> cay = Array1D<double>(NyTOTAL, 0.0);
    // Array1D<double> caz = Array1D<double>(NzTOTAL, 0.0);
    //
    // Array3D<double> Hx = Array3D<double>(NxTOTAL,NyTOTAL,NzTOTAL, 0.0);
    // Array3D<double> Hy = Array3D<double>(NxTOTAL,NyTOTAL,NzTOTAL, 0.0);
    // Array3D<double> Hz = Array3D<double>(NxTOTAL,NyTOTAL,NzTOTAL, 0.0);
    //
    // Array1D<double> gk2 = Array1D<double>(NzTOTAL, 1.0);
    // Array1D<double> gk3 = Array1D<double>(NzTOTAL, 1.0);
    //
    // Array3D<double> idyh = Array3D<double>(NxTOTAL, Npml, NzTOTAL, 0.0);
    //
    // Array3D<double> Dy = Array3D<double>(NxTOTAL,NyTOTAL,NzTOTAL, 0.0);
    //
    // Array1D<double> gi2 = Array1D<double>(NxTOTAL, 1.0);
    // Array1D<double> gj1 = Array1D<double>(NyTOTAL, 0.0);
    //
    // TicTac tt;
    // tt.Tic("TNT: ");
    // for (t = 0; t < TMAX; t++) {
    //     for(i = 1; i < NxTOTAL; i++) {
    //         for(j = jb+1; j < NyTOTAL; j++) {
    //             jyh = j - jb - 1;
    //             for(k = 1; k < NzTOTAL; k++)
    //             {
    //                 curl_h = caz[k]*(Hx[i][j][k] - Hx[i][j][k-1]) -
    //                          cax[i]*(Hz[i][j][k] - Hz[i-1][j][k]);
    //                 idyh[i][jyh][k] = idyh[i][jyh][k] + curl_h;
    //                 Dy[i][j][k] = gi3[i]*gk3[k]*Dy[i][j][k] +
    //                               gi2[i]*gk2[k]*(curl_h + gj1[j]*idyh[i][jyh][k]);
    //             }
    //         }
    //     }
    // }
    // tt.Tac();



    /* C-ARRAY - TEST 2 */
    // double* gi3 = new double[NxTOTAL]();
    // double* cax = new double[NxTOTAL]();
    // double* cay = new double[NyTOTAL]();
    // double* caz = new double[NzTOTAL]();
    //
    // double* Hx = new double[NxTOTAL*NyTOTAL*NzTOTAL]();
    // double* Hy = new double[NxTOTAL*NyTOTAL*NzTOTAL]();
    // double* Hz = new double[NxTOTAL*NyTOTAL*NzTOTAL]();
    //
    // double* gk2 = new double[NzTOTAL]();
    // double* gk3 = new double[NzTOTAL]();
    //
    // double* idyh = new double[NxTOTAL*NyTOTAL*NzTOTAL]();
    // double* Dy = new double[NxTOTAL*NyTOTAL*NzTOTAL]();
    //
    // double* gi2 = new double[NxTOTAL]();
    // double* gj1 = new double[NyTOTAL]();
    //
    // TicTac tt;
    // tt.Tic("Array");
    // for (t = 0; t < TMAX; t++) {
    //     for(i = 1; i < NxTOTAL; i++) {
    //         for(j = jb+1; j < NyTOTAL; j++) {
    //             jyh = j - jb - 1;
    //             for(k = 1; k < NzTOTAL; k++)
    //             {
    //                 curl_h = caz[k]*(getCM(Hx, NxTOTAL, NyTOTAL, i, j, k) - getCM(Hx, NxTOTAL, NyTOTAL, i, j, k-1)) -
    //                          cax[i]*(getCM(Hz, NxTOTAL, NyTOTAL, i, j, k) - getCM(Hz, NxTOTAL, NyTOTAL, i-1, j, k));
    //
    //                 addCM(idyh, NxTOTAL, NyTOTAL, i, jyh, k, curl_h);
    //
    //                 double value = gi3[i]*gk3[k]*getCM(Dy, NxTOTAL, NyTOTAL, i, j, k) +
    //                                gi2[i]*gk2[k]*(curl_h + gj1[j]*getCM(idyh, NxTOTAL, NyTOTAL, i, jyh, k));
    //                 setCM(Dy, NxTOTAL, NyTOTAL, i, j, k, value);
    //             }
    //         }
    //     }
    // }
    // tt.Tac();
    //
    // delete[] gi3;
    // delete[] cax;
    // delete[] cay;
    // delete[] caz;
    //
    // delete[] Hx;
    // delete[] Hy;
    // delete[] Hz;
    //
    // delete[] gk2;
    // delete[] gk3;
    //
    // delete[] idyh;
    // delete[] Dy;
    //
    // delete[] gi2;
    // delete[] gj1;




    /* C-ARRAY (IMPROVED) - TEST 3 */
    double* gi3 = new double[NxTOTAL]();
    double* cax = new double[NxTOTAL]();
    double* cay = new double[NyTOTAL]();
    double* caz = new double[NzTOTAL]();

    double* Hx = new double[NxTOTAL*NyTOTAL*NzTOTAL]();
    double* Hy = new double[NxTOTAL*NyTOTAL*NzTOTAL]();
    double* Hz = new double[NxTOTAL*NyTOTAL*NzTOTAL]();

    double* gk2 = new double[NzTOTAL]();
    double* gk3 = new double[NzTOTAL]();

    double* idyh = new double[NxTOTAL*NyTOTAL*NzTOTAL]();
    double* Dy = new double[NxTOTAL*NyTOTAL*NzTOTAL]();

    double* gi2 = new double[NxTOTAL]();
    double* gj1 = new double[NyTOTAL]();

    TicTac tt;
    tt.Tic("Array (improved): ");
    for (t = 0; t < TMAX; t++) {
        for(i = 1; i < NxTOTAL; i++) {
            for(j = jb+1; j < NyTOTAL; j++) {
                double* Dy_ij = Dy + (i + j*NxTOTAL);
                double* Hx_ij = Hx + (i + j*NxTOTAL);
                double* Hz_ij = Hz + (i + j*NxTOTAL);
                double* Hz_i1_j = Hz + (i - 1 + j*NxTOTAL);

                jyh = j - jb - 1;
                for(k = 1; k < NzTOTAL; k++)
                {
                    double curl_h = caz[k]*(Hx_ij[NxTOTAL*NyTOTAL*k] - Hx_ij[NxTOTAL*NyTOTAL*(k-1)]) -
                                    cax[i]*(Hz_ij[NxTOTAL*NyTOTAL*k] - Hz_i1_j[NxTOTAL*NyTOTAL*k]);

                    addCM(idyh, NxTOTAL, NyTOTAL, i, jyh, k, curl_h);

                    double value = gi3[i]*gk3[k]*Dy_ij[NxTOTAL*NyTOTAL*k] +
                                   gi2[i]*gk2[k]*(curl_h + gj1[j]*getCM(idyh, NxTOTAL, NyTOTAL, i, jyh, k));
                    Dy_ij[NxTOTAL*NyTOTAL*k] = value;
                }
            }
        }
    }
    tt.Tac();

    delete[] gi3;
    delete[] cax;
    delete[] cay;
    delete[] caz;

    delete[] Hx;
    delete[] Hy;
    delete[] Hz;

    delete[] gk2;
    delete[] gk3;

    delete[] idyh;
    delete[] Dy;

    delete[] gi2;
    delete[] gj1;
    }
    ----------------
    TicTac just measures time and prints:

    #include <ctime>
    #include <iostream>
    #include <string>

    using namespace std;

    class TicTac {
    private:
        string message;
        double tstart;
        double tend;
    public:
        void Tic(string msg);
        void Tac();
    };

    void TicTac::Tic(string msg) {
        message = msg;
        tstart = (double)clock()/CLOCKS_PER_SEC;
    }

    void TicTac::Tac() {
        tend = (double)clock()/CLOCKS_PER_SEC;
        cout << message << ". TIME: " << (tend-tstart) << "s" << endl;
    }


    Thanks
     
    Leandro, Feb 26, 2013
    #14
  15. On 2/26/2013 11:04 AM, Leandro wrote:
    > OK, so I'm trying to improve the performance of the code. I would
    > like to replace the TNT library with another, more efficient one (as
    > suggested by Alain Ketterlin). So I tried to use direct pointers to
    > store the data. Unfortunately, the code with TNT runs ~3 times faster
    > than the pointer version. How could I change this code to get better
    > performance than with TNT? I've followed the suggestions of Öö Tiib
    > and Alain Ketterlin (hoist some arrays into the outer loop) and the
    > performance improved by 10%~15%, but it is still worse than with TNT
    > (http://math.nist.gov/tnt/overview.html).


    Give it another thought. Take a look at the following.

    double a[N*M] = {};
    for (int i = 0; i < N; i++)
    {
    for (int j = 0; j < M; j++)
    {
    do_something( a[i*M + j] ); // pay attention here
    }
    }

    And compare it to

    double a[N*M] = {};
    for (int i = 0; i < N; i++)
    {
    double* pai = a + i*M;
    for (int j = 0; j < M; j++)
    {
    do_something( pai[j] ); // pay attention here too
    }
    }

    What do you see different here?

    I do not want to get your hopes up too much, so to speak, but in my
    experience tricks like that *have shown* improvement (although not
    three-fold) over the use of full indexing arithmetic in the innermost loop.

    > [..]


    V
    --
    I do not respond to top-posted replies, please don't ask
     
    Victor Bazarov, Feb 26, 2013
    #15
  16. Leandro

    Öö Tiib Guest

    On Tuesday, 26 February 2013 18:04:08 UTC+2, Leandro wrote:
    > OK, so I'm trying to improve the performance of the code.
    > I would like to replace the TNT library with another, more efficient
    > one (as suggested by Alain Ketterlin). So I tried to use direct
    > pointers to store the data. Unfortunately, the code with TNT runs
    > ~3 times faster than the pointer version. How could I change this
    > code to get better performance than with TNT? I've followed the
    > suggestions of Öö Tiib and Alain Ketterlin (hoist some arrays into
    > the outer loop) and the performance improved by 10%~15%, but it is
    > still worse than with TNT (http://math.nist.gov/tnt/overview.html).


    If you want to beat something, you need to master the whole thing;
    that is impossible to achieve through just a single thread on
    comp.lang.c++. As one example, here is how to replace multiplication
    with addition:

    for ( t = 0; t < TMAX; t++ ) {
    for ( i = 1; i < NxTOTAL; i++ ) {
    // multiply once
    double* Dy_ij = Dy + (i + jb*NxTOTAL);
    double* Hx_ij = Hx + (i + jb*NxTOTAL);
    double* Hz_ij = Hz + (i + jb*NxTOTAL);
    double* Hz_i1_j = Hz + (i - 1 + jb*NxTOTAL);
    for ( j = jb+1; j < NyTOTAL; j++ ) {
    // just add rest of the time
    Dy_ij += NxTOTAL;
    Hx_ij += NxTOTAL;
    Hz_ij += NxTOTAL;
    Hz_i1_j += NxTOTAL;

    In general: profile -> read books & optimize -> profile, repeating the
    cycle until you cannot make it any better. ;)
     
    Öö Tiib, Feb 26, 2013
    #16
  17. Leandro <> writes:

    > inline double getCM(double* array, int m0, int m1, int i, int j, int k) {
    > return array[i + m0*j + m0*m1*k];
    > }


    Leandro, this is "wrong". C/C++ stores arrays in row-major order, so if
    the array is of size N*M*P, element [i][j][k] is at i*M*P + j*P + k (the
    problem is the same inside the loop).

    Having it the wrong way completely kills spatial locality; it is
    therefore not surprising that you get worse results, even though your
    numbers are probably right.

    Also, in a first experiment, leave it to the compiler to move code out
    of the loops (compile with -O1 at least; gcc should be able to cope).

    -- Alain.
     
    Alain Ketterlin, Feb 27, 2013
    #17
  18. On 2013-02-26 04:00, Alain Ketterlin wrote:
    > Consider a loop like:
    >
    > for ( i=... )
    > for ( j=... )
    > ... T.at(i,j) ...
    >
    > (I use at(i,j) to abstract away the effective access). Now assume T is a
    > linearized array. This access is equivalent to:
    >
    > T.data[i*N+j]
    >
    > Calculating i*N on every iteration of the j loop costs something. What
    > Öö suggests is moving this computation out of the j loop.


    I have experimented with a simple program that calculates element-wise
    multiplication of two matrices.

    // Version A:
    for (size_t i = 0; i < M; ++i) {
    for (size_t j = 0; j < N; ++j) {
    C[N * i + j] = A[N * i + j] * B[N * i + j];
    }
    }

    // Version B:
    for (size_t i = 0; i < M; ++i) {
    const T* A_i = &A[N * i];
    const T* B_i = &B[N * i];
    T* C_i = &C[N * i];
    for (size_t j = 0; j < N; ++j) {
    C_i[j] = A_i[j] * B_i[j];
    }
    }

    (T=double, M=N=256, repeating for 10000 times)

    95% confidence intervals of the running time based on 100 trials are:

                    A            B
    g++ -O2:   1.594±0.005  1.592±0.005
    g++ -O3:   1.648±0.005  1.607±0.005
    (using g++ 4.6.3 on x86_64 Linux)

    Under g++ -O2, the difference between the two versions is negligible.
    However, g++ -O3 makes both slower, and Version A noticeably more so.

    Anyway, this shows that compilers may not always be as smart as I wish
    them to be. :-(

    --
    Seungbeom Kim
     
    Seungbeom Kim, Apr 8, 2013
    #18
  19. Leandro

    James Kanze Guest

    On Tuesday, 26 February 2013 12:00:40 UTC, Alain Ketterlin wrote:
    > Seungbeom Kim <> writes:


    > > On 2013-02-25 09:39, Öö Tiib wrote:

    > [...]
    > >> for(j = 1; j < Ny; j++)
    > >> {
    > >> for(k = 1; k < Nz; k++)
    > >> {
    > >> curl_h = cay[j]*(Hz[j][k] - Hz[j-1][k])
    > >> - caz[k]*(Hy[j][k] - Hy[j][k-1]);
    > >> // ...
    > >> // rest of the stuff
    > >> }
    > >> }


    > >> Now you replace it ... (assuming the arrays contain doubles) with that:

    > [...]
    > >> for(j = 1; j < Ny; j++)
    > >> {
    > >> double cay_j = cay[j];
    > >> double (&Hz_i_j)[Nz] = Hz[j];
    > >> double (&Hz_i_j_1)[Nz] = Hz[j-1];
    > >> double (&Hy_i_j)[Nz] = Hy[j];


    > >> for(k = 1; k < Nz; k++)
    > >> {
    > >> curl_h = cay_j*(Hz_i_j[k] - Hz_i_j_1[k])
    > >> - caz[k]*(Hy_i_j[k] - Hy_i_j[k-1]);
    > >> // ...
    > >> // rest of the stuff
    > >> }
    > >> }


    > > I'm just speculating, but reflecting that references just
    > > create aliases, I doubt that using references like this will
    > > affect the generated code in any significant way.


    > It will, and a lot.


    If it does, then there's something seriously wrong with the
    quality of the compiler. Hoisting loop invariants was
    a standard optimization technique twenty or thirty years ago.

    > In the fragment above you have more array reference
    > that actual computations (like + - etc.). Consider a loop like:


    > for ( i=... )
    > for ( j=... )
    > ... T.at(i,j) ...


    > (I use at(i,j) to abstract away the effective access). Now assume T is a
    > linearized array. This access is equivalent to:


    > T.data[i*N+j]


    > Calculating i*N on every iteration of the j loop costs
    > something. What Öö suggests is moving this computation out of
    > the j loop.


    Except that pretty much every compiler in the world will do this
    for you. And typically, when the compiler does it, it can do it
    better than if you try to do it by hand, since it knows the
    finality of the references and pointers it creates.

    > Note that if the array is stored as a 1D array of pointers to 1D arrays
    > of doubles, the same argument applies. T[j] is:


    > *(*(T.rows+i)+j)


    > again, *(T.rows+i) could be hoisted out of the j loop (if that location
    > is not overwritten during the run of the loop -- unfortunately it is
    > difficult for a compiler to realize this, and that's where restrict is
    > useful).


    > There is a large amount of computation going on during array accesses,
    > and factoring as much as possible to put it out of the loop is one of
    > the major optimization opportunity.


    > (To the OP: the library you use just makes array accesses much more
    > costly than they should be, and probably prevents this kind of
    > optimization.)


    > > (Creating a reference will not even trigger a prefetch, will it?)


    > No.


    > > And if it actually does, the compiler must not have been doing a very
    > > good job even at merely grasping common subexpressions.


    > Common subexpressions are important, loop-invariant code motion is even
    > more, because it reduces the amount of code by a factor equal to the
    > number of iterations of the loop.


    > > On the other hand, value copying instead of just aliasing (as for cay_j
    > > above) may have a better chance of improvement.


    > This is actually no different from extracting a partial array access:
    > same idea, same potential gain.


    Extracting the value has an important benefit: when the compiler
    must access the values, it has to take into account possible
    aliasing. Thus, in `idx[j][k] = ...` and `D[j][k] = ...`,
    the compiler will probably have to assume that any expression
    that references into any of the other arrays might yield
    different results after each store.

    This depends on the definitions of the other arrays. If e.g.
    `Hz` and `idx` are both C style arrays -- _not_ pointers
    which point to the first element, but the actual data
    definitions -- then the compiler can know that they don't
    alias one another. Otherwise, it has to assume the worst, and
    reread the elements each time through the loop. Manually
    hoisting the reads of `gj3[j]` etc. out of the innermost loop
    might make a significant difference, because the compiler
    probably cannot do this (since if all it has are pointers, it
    has to assume that one of the other assignments in the innermost
    loop might modify this value). Depending on the dimensions, it
    might actually be worth copying `Hz[j]` et al. into one
    dimensional local arrays (which the compiler can see aren't being
    modified).

    Even better, of course, would be if the compiler has extensions
    which allow you to tell it that there isn't any aliasing. C++11
    didn't adopt C's `restrict` keyword, but this is one place where
    it could help enormously (and some C++ compilers might support
    it as an extension).

    --
    James
     
    James Kanze, Apr 9, 2013
    #19