multithread on multicore processor on linux

finecur

Hi, I am writing a C++ program. The program needs a lot of
computational power, so I am thinking about using the "multi-core"
benefit of my processor. I have two dual-core x86-64 processors on my
Linux box, so I am thinking of starting 4 threads in my program, each one
taking care of a part of the computation. I have two questions:

Will this approach make my program four times faster than a
single-threaded program?
What multithreading library should I use? I know multithreading is not
supported in native C++...

Thank you very much,

ff
 
Tim H

Hi, I am writing a C++ program. The program needs a lot of
computational power, so I am thinking about using the "multi-core"
benefit of my processor. I have two dual-core x86-64 processors on my
Linux box, so I am thinking of starting 4 threads in my program, each one
taking care of a part of the computation. I have two questions:

Will this approach make my program four times faster than a
single-threaded program?

Perfect speedup is rare. Realistically, shoot for about 75% efficiency per
additional core.

Good luck
 
AnonMail2005

Hi, I am writing a C++ program. The program needs a lot of
computational power, so I am thinking about using the "multi-core"
benefit of my processor. I have two dual-core x86-64 processors on my
Linux box, so I am thinking of starting 4 threads in my program, each one
taking care of a part of the computation. I have two questions:

Will this approach make my program four times faster than a
single-threaded program?
What multithreading library should I use? I know multithreading is not
supported in native C++...

Thank you very much,

ff

Some suggestions...

Speedup from threading depends on how much of your algorithm you can
parallelize.

The Butenhof book is a very good threading book. It describes pthreads,
but it discusses concepts that are applicable to threads in general.

pthreads are "native" for Unix variants. Consider using Boost threads
if you need cross-platform support.
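
For instance, a minimal pthreads sketch of splitting a computation across four
threads might look like this (the worker function, the Range struct, and the
numbers are purely illustrative, not from the original program):

#include <pthread.h>
#include <cstdio>

const int NUM_THREADS = 4;
const int N = 1000000;

double partial[NUM_THREADS];   // one result slot per thread, no locking needed

struct Range { int begin; int end; int slot; };

// Each thread sums its own slice of the index range.
void* worker(void* arg)
{
    Range* r = static_cast<Range*>(arg);
    double s = 0.0;
    for (int i = r->begin; i < r->end; ++i)
        s += i * 0.5;          // stand-in for the real computation
    partial[r->slot] = s;
    return 0;
}

int main()
{
    pthread_t tid[NUM_THREADS];
    Range ranges[NUM_THREADS];

    for (int t = 0; t < NUM_THREADS; ++t) {
        ranges[t].begin = t * (N / NUM_THREADS);
        ranges[t].end   = (t + 1) * (N / NUM_THREADS);
        ranges[t].slot  = t;
        pthread_create(&tid[t], 0, worker, &ranges[t]);
    }

    double total = 0.0;
    for (int t = 0; t < NUM_THREADS; ++t) {
        pthread_join(tid[t], 0);
        total += partial[t];
    }
    std::printf("total = %f\n", total);
    return 0;
}

Build with g++ prog.cpp -pthread. Each thread writes only its own slot of
partial, so the main thread can simply add the four results after joining.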

HTH
 
ciccio

Will this approach make my program four times faster than a
single-threaded program?
What multithreading library should I use? I know multithreading is not
supported in native C++...


As stated by the others, a 4x speedup will almost certainly not be possible, but
if you program cleverly, you can still gain a lot.

Have a look at OpenMP. It is supported by g++ as well as the Intel compiler, and
most likely by others too.

Info on this can be found at www.openmp.org, where you can download a
very readable standard, and there is a simple tutorial at

https://computing.llnl.gov/tutorials/openMP/
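
To give a feel for how little code this takes, here is a minimal sketch (the
array names and sizes are made up) that parallelizes one loop with a single
pragma; build with g++ -O3 -fopenmp:

#include <cmath>
#include <cstdio>

int main()
{
    const int n = 1000000;
    static double in[n], out[n];   // static: keeps the big arrays off the stack

    for (int i = 0; i < n; ++i)
        in[i] = i * 0.001;

    // The pragma splits the loop iterations across the available cores.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        out[i] = std::sin(in[i]);

    std::printf("out[42] = %f\n", out[42]);
    return 0;
}

Without -fopenmp the pragma is simply ignored and the loop runs serially, so
the same source works either way.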

Enjoy
 
red floyd

finecur said:
Hi, I am writing a C++ program. The program needs a lot of
computational power, so I am thinking about using the "multi-core"
benefit of my processor. I have two dual-core x86-64 processors on my
Linux box, so I am thinking of starting 4 threads in my program, each one
taking care of a part of the computation. I have two questions:

Will this approach make my program four times faster than a
single-threaded program?
What multithreading library should I use? I know multithreading is not
supported in native C++...

As for what sort of speedup you can get, see Amdahl's Law.

http://en.wikipedia.org/wiki/Amdahl's_law
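
In short: if a fraction p of the runtime can be parallelized across n cores,
the best possible speedup is

speedup(n) = 1 / ((1 - p) + p / n)

As a made-up example, with p = 0.9 and n = 4:

speedup(4) = 1 / (0.1 + 0.9 / 4) = 1 / 0.325, or roughly 3.1x rather than 4x.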
 
Gianni Mariani

ciccio said:
As stated by the others, a 4x speedup will almost certainly not be possible, but
if you program cleverly, you can still gain a lot.

Have a look at OpenMP. It is supported by g++ as well as the Intel compiler, and
most likely by others too.

Info on this can be found at www.openmp.org, where you can download a
very readable standard, and there is a simple tutorial at

https://computing.llnl.gov/tutorials/openMP/

For a newbie who just wants a smattering of MP, this is the fastest way
to do it. It does not give you all the flexibility you may think you
need, but more than likely you don't need it (depending on your app).

I second the recommendation.
 
Gianni Mariani

finecur said:
Hi, I am writing a C++ program. The program needs a lot of
computational power, so I am thinking about using the "multi-core"
benefit of my processor. I have two dual-core x86-64 processors on my
Linux box, so I am thinking of starting 4 threads in my program, each one
taking care of a part of the computation. I have two questions:

Will this approach make my program four times faster than a
single-threaded program?

This depends heavily on cache utilization and how much of the code can
be truly parallelized. I've heard of some parallelizations improving by
more than the number of CPUs and some that don't improve much at all.

In some cases the improvement was "superlinear" because the problem
ended up fitting into the combined cache of the CPUs.

I have seen code speed up by 50x simply by making sure that the data in
the cache didn't thrash.
What multithreading library should I use? I know multithreading is not
supported in native C++...

OpenMP is a very good start; it is portable and designed for computational
applications.

Boost, Austria C++, ACE, etc. all provide cross-platform thread interfaces
to raw threading primitives.
 
finecur

For a newbie who just wants a smattering of MP, this is the fastest way
to do it. It does not give you all the flexibility you may think you
need, but more than likely you don't need it (depending on your app).

I second the recommendation.

Thanks, everyone.

Here is my code using OpenMP:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <time.h>
#include <math.h>

#define CHUNKSIZE 10000
#define N 100000
#define NUM 1

float compute(float a, float b)
{
    return sin(a) * sin(b);
}

int main()
{
    int start, end;
    int nthreads, tid;
    int i, j, k, m, n, chunk;
    float a[N], b[N], c[N], sum;

    /* Some initializations */
    for (i = 0; i < N; i++){
        a[i] = b[i] = i * 1.0;
    }
    chunk = CHUNKSIZE;

    sum = 0;
    start = clock();
    for (i = 0; i < N; i++){
        c[i] = compute(a[i], b[i]);
        sum = sum + c[i];
    }
    end = clock();
    printf("time=%d, sum=%f\n", end - start, sum);

    sum = 0;
    start = clock();
    #pragma omp parallel shared(a,b,c,chunk) private(i)
    {
        #pragma omp for schedule(dynamic,chunk) nowait
        for (i = 0; i < N; i++){
            c[i] = compute(a[i], b[i]);
            sum = sum + c[i];
        }
    }
    end = clock();
    printf("time=%d, sum=%f\n", end - start, sum);
}

I found, however, that the output is

time=10000, sum=49999.394531
time=20000, sum=49999.394531

Which means OpenMP is even slower!!!!
How come?

I am using
g++ -O3 -march=x86-64 -mfpmath=sse -funroll-loops -fomit-frame-pointer -pthread -o t main.cc

to compile the program.

help....
 
Ian Collins

finecur said:
Thanks, everyone.

Here is my code using OpenMP:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <time.h>
#include <math.h>

#define CHUNKSIZE 10000
#define N 100000
#define NUM 1
You appear to be writing C, not C++.
end = clock();
printf("time=%d, sum=%f\n", end - start, sum);

sum = 0;
start = clock();
#pragma omp parallel shared(a,b,c,chunk) private(i)
{
#pragma omp for schedule(dynamic,chunk) nowait
for (i = 0; i < N; i++){
    c[i] = compute(a[i], b[i]);
    sum = sum + c[i];
}

}
end = clock();
printf("time=%d, sum=%f\n", end - start, sum);
}

I found however, the output is

time=10000, sum=49999.394531
time=20000, sum=49999.394531

You don't tell OpenMP how many threads to use. Your loop does not lend
itself to parallelization as written; all of the threads will contend for locks on c.

Which means OpenMP is even slower!!!!
How come?

clock() gives CPU time, so if you have two threads, the return will be
doubled for a unit of real time.
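
For what it's worth, one common way to cut down the contention in that loop is
to let OpenMP combine per-thread partial sums with a reduction clause. A minimal
sketch along the lines of the original code (build with g++ -O3 -fopenmp):

#include <cmath>
#include <cstdio>

#define N 100000

float compute(float a, float b)
{
    return std::sin(a) * std::sin(b);
}

static float a[N], b[N], c[N];   // file scope, so the arrays are not on the stack

int main()
{
    for (int i = 0; i < N; i++)
        a[i] = b[i] = i * 1.0f;

    float sum = 0;
    // Each thread accumulates its own private sum; OpenMP adds them at the end.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        c[i] = compute(a[i], b[i]);
        sum += c[i];
    }
    std::printf("sum = %f\n", sum);
    return 0;
}

With the reduction there is no shared write inside the loop, so the threads do
not fight over sum on every iteration.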
 
Gianni Mariani

finecur wrote:
....
to compile the program.

help....

You have a number of issues:

a) you're not compiling with OpenMP enabled - use -fopenmp
b) your computation is not long enough to measure adequately - it needs to
run for at least several seconds - preferably more than 5.
c) if you did make N large enough, it would be too big for your stack
d) clock() has nowhere near enough resolution to measure the relative
performance of either version.
e) sharing sum and updating it on every iteration would cause a lot of
contention on the shared add - best to remove that.
f) This is comp.lang.c++ so use C++ !!!

Try this code:

to compile without openmp use:
g++ -O3 -march=x86-64 -mfpmath=sse -funroll-loops -fomit-frame-pointer
xxx_openmp3.cpp -o xxx_openmp3

... and with openmp use:
g++ -O3 -march=x86-64 -mfpmath=sse -funroll-loops -fomit-frame-pointer
xxx_openmp3.cpp -o xxx_openmp3 -fopenmp

==========
#include <omp.h>
#include <cmath>
#include <iostream>

#define CHUNKSIZE 100000
#define NX 10000000
#define NUM 1

template <typename T>
T compute(T a, T b)
{
    return std::sin(a) * std::sin(b);
}

template <typename T, int N>
struct Compute
{
    T a[N], b[N], c[N], sum;

    Compute()
      : sum()
    {
        for (int i=0; i < N; i++){
            a[i] = b[i] = i * 1.0;
        }
    }

    void DoWork()
    {
        int i, chunk = CHUNKSIZE;

        // maintain references to arrays - speeds things up a tad
        T (&la)[N] = a;
        T (&lb)[N] = b;
        T (&lc)[N] = c;
        T lsum = T();
        sum = T();

        #pragma omp parallel shared(chunk) private(i) firstprivate(lsum)
        {
            #pragma omp for schedule(dynamic,chunk) nowait
            for (i = 0; i < N; i++){
                lsum += lc[i] = compute(la[i], lb[i]);
            }

            #pragma omp critical
            {
                sum += lsum;
            }
        }
    }
};

Compute<double,NX> nx;

int main()
{
    nx.DoWork();

    std::cout.precision(16);
    std::cout << "sum = " << nx.sum << "\n";
}
==========

time ./xxx_openmp3
sum = 5000000.034065126

real 0m0.992s
user 0m1.588s
sys 0m0.152s

..... without -fopenmp

time ./xxx_openmp3
sum = 5000000.034065164

real 0m1.806s
user 0m1.632s
sys 0m0.152s


..... note the result is different due to round-off !!! It will even
change every run because the order of the summation is different. You
can fix this, but you need to save the intermediate sums. In theory, you
should really add the values in order, starting with the ones with the
smallest absolute value, which requires sorting the values in array c.
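
A sketch of that idea (ordered_sum is just an illustrative helper, not part of
the code above): copy the saved per-element results, sort them by magnitude,
and add them in that fixed order, so the total no longer depends on how the
threads divided up the work:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Compare by absolute value so the smallest terms are added first.
static bool by_magnitude(double x, double y)
{
    return std::fabs(x) < std::fabs(y);
}

// Sum the saved per-element results in a fixed, magnitude-sorted order.
double ordered_sum(std::vector<double> values)
{
    std::sort(values.begin(), values.end(), by_magnitude);
    double total = 0.0;
    for (std::size_t i = 0; i < values.size(); ++i)
        total += values[i];
    return total;
}

Feeding the contents of c through something like this gives the same total on
every run, at the cost of an extra sort.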

Here is what you get now:
./xxx_openmp3 & ./xxx_openmp3 & ./xxx_openmp3 ; sleep 3
[1] 30411
[2] 30412
sum = 5000000.034065094
sum = 5000000.03406521
sum = 5000000.034065165

....
 
finecur

Gianni Mariani said:
....
time ./xxx_openmp3
sum = 5000000.034065126

real 0m0.992s
user 0m1.588s
sys 0m0.152s
....

Thank you very much.
Can you tell me how you time the program?
 
