# Writing an efficient normal distribution function (for learning purposes)


#### Luca Cerone

Dear all, I'm a beginner in C++, so I'm sorry if my question seems silly.
I'm also new to the group, so hello to everyone in it.

I've written a function that computes the normal distribution, but when comparing the time to create 1000 vectors
of 1000 points each, I get the same performance as the analogous Matlab built-in function.
I'd like to improve my code for speed.
I'm doing this mainly to learn how to write efficient C++ code, so please don't just tell me
that libraries X and Y do this faster than my code, but help me understand why and how to improve
my code.

I need both a single value and a vector version of this function,
so this is how I have defined them.

// this is the single point version of the function:
double normdist(const double& px, const double& mu, const double& csigma) {
    // evaluating the formula taken from http://en.wikipedia.org/wiki/Normal_distribution
    double val = sqrt(2*M_PI)*csigma;
    val = 1/val;
    val = val*exp(-pow(px-mu,2)/(2*pow(csigma,2)));
    return val;
}

//this is the vector version. It takes as input a vector of double and gives the vector of corresponding values as output:

vector<double> normdist(const vector<double>& vx, const double& mu, const double& csigma) {
    vector<double> gv(vx.size()); // allocates the space for the vector gv.
    for (int i=0; i<vx.size(); i++) {
        gv[i] = normdist(vx[i], mu, csigma);
    }
    return gv;
}

As I said, I'm a beginner, so I guess there is plenty of room for improving the speed and the quality of the code
(just writing to you, I've realized that the initialization of gv might not be optimal, since I create values I don't need).

Thank you very much to all of you!
Luca


#### Asger Joergensen

Hi Luca

If You can live with fixed precision, Your code will run much faster if You use
ints or __int64s and then multiply/divide as much as You need to get the
precision You need.

Best regards
Asger-P


#### lucacerone

Let me check whether I've understood: by fixed precision you mean numbers
with a fixed number of digits? If so, yes, I can live with that,
as long as I have 8 exact decimal places.
If You can live with fixed precision You code will run much faster if You use
int's or __int64's and then multiply /devide as much as You need to get Your the
precision You need.

I'm sorry, but I don't understand. How would you suggest using ints?
What type is __int64?

Thanks a lot for you help in any case,
Luca


#### Asger Joergensen

Hi lucacerone
Let me understand if I've understood.. by fixed precision you mean having
number with a fixed number of digits? If so, yes I can live with it
as far as I have 8 exact decimal points.

int64_t val = DoubleVal * 100000000;

do the rest of the work with the int and then You divide by the same number
when You need the result.
I'm sorry but I don't understand you. How would you suggest to use ints?
What type is it __int64?

A 64-bit integer. My Borland compiler likes __int64, but I think int64_t is the
cross-compiler way to write it.

I'm not into math so I don't know if You can use this method for Your project.

I do a lot of gradient graphics, so when I calculate color steps I use this
method, and it is a lot faster than using floating-point values like float and
double.

Best regards
Asger-P


#### Juha Nieminen

Asger Joergensen said:
If You can live with fixed precision You code will run much faster if You use
int's or __int64's and then multiply /devide as much as You need to get Your the
precision You need.

This isn't the early 90's. Processors have got a bit faster at
calculating with floating point values in the meantime.


#### Juha Nieminen

Luca Cerone said:
I'd like to improve my code for speed.

First of all, are you sure you are compiling with optimizations / in
release mode? It's a surprisingly common mistake not to do that.
double val = sqrt(2*M_PI)*csigma;

A compiler might or might not be able to calculate that "sqrt(2*M_PI)"
at compile time (compilers oftentimes do not perform floating-point
calculations at compile time, for a myriad of reasons; of course it
depends on the compiler and its version). If you want to shave a few
clock cycles off, it might be a safer bet to write the value directly.
val = val*exp(-pow(px-mu,2)/(2*pow(csigma,2)));

Likewise, a compiler might or might not be able to optimize a
"pow(x,2)" into "x*x". It may be a safer bet to write it explicitly.
(std::pow is significantly slower than multiplication.)
vector<double> normdist(const vector<double>& vx,const double& mu,const double& csigma){
vector<double> gv(vx.size()) ; //allocates the space for the vector gv.

How many times is this function called in your program? Are you
measuring its speed by calling this function thousands/millions of
times and seeing how long it takes overall?

If yes, then there's your major problem. Memory allocation is
(unfortunately) a slow operation in C++ (not really C++'s fault,
but that's another story). You might want to pass the return value
as a reference parameter to the function and have it write there
instead of allocating a new vector each time. Then you can reuse the
same vector object for each call, avoiding the memory allocation
(except for the very first time it's called, of course).


#### Victor Bazarov

Dear all, I'm a beginner in C++ so I'm sorry if my question might seem silly.
I'm also new to the group so I just say hello to all the people in it.

I've written a function that computes the Normal Distribution, but when comparing the time to create 1000 vectors
each with 1000 points I've the same performance than using the analogous Matlab built-in function.
I'd like to improve my code for speed.
I'm doing this mainly to learn how to write efficient C++ code, so please just don't tell me
that library X and Y do this faster than my code, but help me understanding why and how to improve
my code.

I need both a single value and a vector version of this function,
so this is how I have defined them.

// this is the single point version of the function:
double normdist(const double& px,const double& mu,const double& csigma){
// evaluating the formula taken from http://en.wikipedia.org/wiki/Normal_distribution
double val = sqrt(2*M_PI)*csigma;
val = 1/val;
val = val*exp(-pow(px-mu,2)/(2*pow(csigma,2)));
return val ;

Some compilers are actually good at parsing expressions and converting
them into efficient machine code. You could try simply writing

return exp( -pow(px-mu,2)/2/pow(csigma,2) ) / sqrt(2*M_PI) / csigma;

And now you can see that there are two subexpressions that don't have to
be calculated on every call. pow(csigma,2) could be calculated once per
the entire loop and passed into 'normdist' as a const double&. The same
goes for sqrt(2*M_PI), which is just a constant you could define in the
file scope and just use here.
}

//this is the vector version. It takes as input a vector of double and gives the vector of corresponding values as output:

vector<double> normdist(const vector<double>& vx,const double& mu,const double& csigma){
vector<double> gv(vx.size()) ; //allocates the space for the vector gv.
for (int i=0 ; i<vx.size() ; i++){

While it might not matter in your particular case, get into the habit of
not calling any function in your 'for' condition if you know its result is
not going to change. I have yet to see a compiler that knows to eliminate
it. Try this:

for (int i=0, s=vx.size(); i<s; ++i)
gv[i] = normdist(vx[i], mu, csigma);
}

Most modern compilers are good with indexing, but it's a function call
for the vector, so you might want to try to eliminate it as well by
using pointers:

double* pgv = &gv[0];
const double* pvx = &vx[0];

for (...
{
*pgv++ = normdist(*pvx++, mu, csigma);
}
return gv ;
}

As I told I'm a beginner so I guess there is plenty of room for improving the speed and the quality of the code
(just writing to you I have realized that the initialization of gv might not be optimal, since I create values I don't need).

Good luck!

V


#### Asger Joergensen

Hi Juha

Juha said:
This isn't the early 90's. Processors have got a bit faster at
calculating with floating point values in the meantime.

Not nearly as fast as they can do it with int's though, in my experience.

Best regards
Asger-P


#### Joshua Maurice

Dear all, I'm a beginner in C++ so I'm sorry if my question might seem silly.
I'm also new to the group so I just say hello to all the people in it.

I've written a function that computes the Normal Distribution, but when comparing the time to create 1000 vectors
each with 1000 points I've the same performance than using the analogous Matlab built-in function.
I'd like to improve my code for speed.
I'm doing this mainly to learn how to write efficient C++ code, so please just don't tell me
that library X and Y do this faster than my code, but help me understanding why and how to improve
my code.

I need both a single value and a vector version of this function,
so this is how I have defined them.

// this is the single point version of the function:
double normdist(const double& px,const double& mu,const double& csigma){
// evaluating the formula taken from http://en.wikipedia.org/wiki/Normal_distribution
double  val = sqrt(2*M_PI)*csigma;
val = 1/val;
val = val*exp(-pow(px-mu,2)/(2*pow(csigma,2)));
return val ;

}

//this is the vector version. It takes as input a vector of double and gives the vector of corresponding values as output:

vector<double> normdist(const vector<double>& vx,const double& mu,const double& csigma){
vector<double> gv(vx.size()) ; //allocates the space for the vector gv.
for (int i=0 ; i<vx.size() ; i++){
gv[i] = normdist(vx[i], mu, csigma);
}
return gv ;

}

As I told I'm a beginner so I guess there is plenty of room for improving the speed and the quality of the code
(just writing to you I have realized that the initialization of gv might not be optimal, since I create values I don't need).

Thank you very much to all of you!
Luca

Are you compiling with all of the proper optimization flags? Are you using
MSVC, and have you disabled checked iterators?


#### Victor Bazarov

[..]
I've written a function that computes the Normal Distribution, but when comparing the time to create 1000 vectors
each with 1000 points I've the same performance than using the analogous Matlab built-in function.
[.. code using vector and indexing snipped ..]

Are you compiling with all of the proper optimization flags? Are you using
MSVC, and have you disabled checked iterators?

There seem to be no iterators in that code.

If the code's speed is the same as Matlab's, there is probably not much
hope of improving it significantly. I doubt that the Matlab people release
their library/application with sub-par performance.

V


#### glen herrmannsfeldt

(snip)
pow(x,2) should be replaced by x*x as mentioned by others, it is not
given that the optimizer is smart enough to do that by itself.

Fortran compilers have been known to optimize x**2 for many, many years.
Being an operator, instead of a function, makes it more obvious.

In addition, there is no pow(double, int) which would help in
optimizing many other cases.

-- glen

#### Miles

glen herrmannsfeldt said:
known to optimize x**2 for many many years. Being an operator,
instead of a function, makes it more obvious.

In addition, there is no pow(double, int) which would help in
optimizing many other cases.

All the C++ compilers I've used _do_ optimize "pow(x,2)" to "x*x"...
Not very hard as optimizations go...

-miles


#### lucacerone

Hi Asger,
unfortunately this solution is not feasible for me, because the function I want to approximate is nonlinear. Thanks a lot for the advice in any case.


#### lucacerone

Many issues have arisen, so I'll try to answer them in this post.

As for performance, I measured it by evaluating vectors of 1000 points one hundred thousand times, so I think it is a reliable measurement.

As for compiler optimization, I'm using g++ (the GNU compiler) on an Ubuntu machine. The way I compile the code is simply:

g++ -c -o mylib.o mylib.cpp
g++ -c -o main.o main.cpp
g++ -o executablename main.o mylib.o

I have absolutely no idea on how to optimize compilation...

Among the other suggestions, I read that I should substitute pow(x,2) with x*x. I'll certainly do that.
Also, it is true that in the vector version of the function certain calculations can be done only once; I'll change that as well.

Victor suggested using pointers rather than indexing.
I have to be honest: while studying C++ I haven't reached the
pointers chapter yet, which is why I haven't used them, and why I haven't fully understood your answer.
Also, I don't understand why accessing with pointers might be faster than direct indexing.

I'm sorry if I'm asking easy questions; I really am a novice and need to learn a lot. Cheers, Luca


#### Asger Joergensen

Hi lucacerone
Victor suggested to use pointers rather than indexing
I have to be honest, while studying C++ I haven't arrived to the
pointers chapter yet, that's why I haven't used them... and that's why I haven't

Have a look in this thread in this NG:
Re: loop optimization 01.19.01
Also I don't understand why accessing with pointers might be faster than direct
indexing.

When You say direct indexing, I don't think there is much difference, if any,
when it is a single array; but when You have [][] or [][][] there starts to be
a difference, though it all depends on the compiler and various little things.
If You use vector and its indexing there is a 30% difference, at least when
compiled with my CodeGear 2009 compiler. I don't know about gcc, but You can
easily test it Yourself using the example from the thread I mentioned above.

P.S. Do remember that most of the time the indexing is a very small part of
the total time used; in very many cases You will not be able to see the
difference in the overall picture.

I'm sorry if I'm asking easy question, I really am a novice and need to learn a lot but communicating is better, thats my opinion.

Best regards
Asger-P

#### Miles

lucacerone said:
As for the compiler optimization, I'm using g++ (GNU compiler) on a
Ubuntu machine. The way I compile the code is simply:

g++ -c -o mylib.o mylib.cpp
g++ -c -o main.o main.cpp
g++ -o executablename main.o mylib.o

I have absolutely no idea on how to optimize compilation...

As a start, you should turn on optimization -- the compiler does no
optimization unless you tell it to, and in many cases non-optimized code
is much, _much_ slower.

To turn on optimization, use the "-O" option, optionally followed by a
digit giving the level of optimization; with gcc, the most common choice
is "-O2", which optimizes somewhat more than plain "-O".

You also want to tell it what processor you're using, because by default
it will produce "generic" code that can run on almost any processor in
the same family, but can be significantly slower than code which is
precisely targeted at the processor you're using. To do this, use the
"-march=..." option; typically you can use "-march=native", which says
"produce code targeted at the processor in this machine."

So I'd suggest:

g++ -c -o mylib.o -O2 -march=native mylib.cpp
g++ -c -o main.o -O2 -march=native main.cpp
g++ -o executablename main.o mylib.o

-Miles


#### Joshua Maurice

So i'd suggest:

g++ -c -o mylib.o -O2 -march=native mylib.cpp
g++ -c -o main.o -O2 -march=native  main.cpp
g++ -o executablename main.o mylib.o

Why no -O3?

http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Optimize-Options

-O3
Optimize yet more. -O3 turns on all optimizations specified by -O2
and also turns on the -finline-functions, -funswitch-loops and
-fipa-cp-clone options.

Function inlining is pretty important for C++.


#### Marc

Joshua said:
Why no -O3?

http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Optimize-Options

-O3
Optimize yet more. -O3 turns on all optimizations specified by -O2
and also turns on the -finline-functions, -funswitch-loops and
-fipa-cp-clone options.

It is fairly safe to say that -O2 will generate better code than -O1 (which
generates better code than -O0). With -O3, things are much less clear:
sometimes you get better results, but it is still quite common for
-O3 code to run slower than -O2 code. The generated code is often much larger
(and compilation takes longer).
Function inlining is pretty important for C++.

There is already some inlining at -O2 (exactly how much depends on the
version). And -O3 is not always enough to get the useful inlining,
sometimes you need -fprofile-generate/use or -finline-limit=... to
convince gcc.

This is not to say that -O3 should not be tried, of course. And
possibly -ffast-math if the OP is happy with imprecise (aka wrong)
results.

#### Miles

Joshua Maurice said:
Why no -O3?

http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Optimize-Options

-O3
Optimize yet more. -O3 turns on all optimizations specified by -O2
and also turns on the -finline-functions, -funswitch-loops and
-fipa-cp-clone options.

Function inlining is pretty important for C++.

-O2 does comprehensive inlining as well -- including functions not
marked "inline" -- using gcc's inlining heuristics, which are quite
good these days. [What "-finline-functions" does is act as if _every
function_ were marked "inline", meaning gcc will be more likely to
inline stuff, even when it doesn't think it's a good idea.]

-O2 is a good default choice if you want good optimization but don't want
to spend a lot of time benchmarking. -O3 includes optimizations that
may be very expensive at compile time or may result in excessive code
bloat (and thus can have a negative effect on the icache etc.). -O3
can _sometimes_ help, but it can also sometimes hurt, so it's a good
idea to do some benchmarking to determine which is better.

-Miles