multithread on multicore processor on linux

finecur

Hi, I am writing a C++ program. The program needs a lot of
computational power, so I am thinking about using the "multi-core"
benefit of my processor. I have two dual-core x86-64 processors on my
Linux box, so I am thinking of starting 4 threads in my program, each one
taking care of a part of the computation. I have two questions:

Will this approach make my program four times faster than a
single-threaded program?
What multithreading library should I use? I know multithreading is not
supported in native C++...

Thank you very much,

ff
 
Tim H

Hi, I am writing a C++ program. The program needs a lot of
computational power, so I am thinking about using the "multi-core"
benefit of my processor. I have two dual-core x86-64 processors on my
Linux box, so I am thinking of starting 4 threads in my program, each one
taking care of a part of the computation. I have two questions:

Will this approach make my program four times faster than a
single-threaded program?

Perfect speedup is rare. Realistically, shoot for about 75% efficiency per
additional core.

Good luck
 
AnonMail2005

Hi, I am writing a C++ program. The program needs a lot of
computational power, so I am thinking about using the "multi-core"
benefit of my processor. I have two dual-core x86-64 processors on my
Linux box, so I am thinking of starting 4 threads in my program, each one
taking care of a part of the computation. I have two questions:

Will this approach make my program four times faster than a
single-threaded program?
What multithreading library should I use? I know multithreading is not
supported in native C++...

Thank you very much,

ff

Some suggestions...

Speedup from threading depends on how much of your algorithm you can
parallelize.

The Butenhof book is a very good threading book. It describes pthreads,
but it discusses concepts that are applicable to threads in general.

pthreads are "native" for Unix variants. Consider using Boost threads
if you need cross-platform support.
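
For instance, a minimal pthreads sketch of splitting a computation across four
threads might look like this (the worker function, the Range struct, and the
numbers are purely illustrative, not from the original program):

#include <pthread.h>
#include <cstdio>

const int NUM_THREADS = 4;
const int N = 1000000;

double partial[NUM_THREADS];   // one result slot per thread, no locking needed

struct Range { int begin; int end; int slot; };

// Each thread sums its own slice of the index range.
void* worker(void* arg)
{
    Range* r = static_cast<Range*>(arg);
    double s = 0.0;
    for (int i = r->begin; i < r->end; ++i)
        s += i * 0.5;          // stand-in for the real computation
    partial[r->slot] = s;
    return 0;
}

int main()
{
    pthread_t tid[NUM_THREADS];
    Range ranges[NUM_THREADS];

    for (int t = 0; t < NUM_THREADS; ++t) {
        ranges[t].begin = t * (N / NUM_THREADS);
        ranges[t].end   = (t + 1) * (N / NUM_THREADS);
        ranges[t].slot  = t;
        pthread_create(&tid[t], 0, worker, &ranges[t]);
    }

    double total = 0.0;
    for (int t = 0; t < NUM_THREADS; ++t) {
        pthread_join(tid[t], 0);
        total += partial[t];
    }
    std::printf("total = %f\n", total);
    return 0;
}

Build with g++ prog.cpp -pthread. Each thread writes only its own slot of
partial, so the main thread can simply add the four results after joining.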

HTH
 
ciccio

Will this approach make my program four times faster than a
single-threaded program?
What multithreading library should I use? I know multithreading is not
supported in native C++...


As stated by the others, a 4x speedup will almost certainly not be possible, but
if you program cleverly, you can still gain a lot.

Have a look at OpenMP. It is supported by g++ as well as the Intel compiler, and
most likely by others too.

Info on this can be found at www.openmp.org, where you can download a
very readable standard, and there is a simple tutorial at

https://computing.llnl.gov/tutorials/openMP/
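
To give a feel for how little code this takes, here is a minimal sketch (the
array names and sizes are made up) that parallelizes one loop with a single
pragma; build with g++ -O3 -fopenmp:

#include <cmath>
#include <cstdio>

int main()
{
    const int n = 1000000;
    static double in[n], out[n];   // static: keeps the big arrays off the stack

    for (int i = 0; i < n; ++i)
        in[i] = i * 0.001;

    // The pragma splits the loop iterations across the available cores.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        out[i] = std::sin(in[i]);

    std::printf("out[42] = %f\n", out[42]);
    return 0;
}

Without -fopenmp the pragma is simply ignored and the loop runs serially, so
the same source works either way.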

Enjoy
 
red floyd

finecur said:
Hi, I am writing a C++ program. The program needs a lot of
computational power, so I am thinking about using the "multi-core"
benefit of my processor. I have two dual-core x86-64 processors on my
Linux box, so I am thinking of starting 4 threads in my program, each one
taking care of a part of the computation. I have two questions:

Will this approach make my program four times faster than a
single-threaded program?
What multithreading library should I use? I know multithreading is not
supported in native C++...

As for what sort of speedup you can get, see Amdahl's Law.

http://en.wikipedia.org/wiki/Amdahl's_law
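
In short: if a fraction p of the runtime can be parallelized across n cores,
the best possible speedup is

speedup(n) = 1 / ((1 - p) + p / n)

As a made-up example, with p = 0.9 and n = 4:

speedup(4) = 1 / (0.1 + 0.9 / 4) = 1 / 0.325, or roughly 3.1x rather than 4x.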
 
Gianni Mariani

ciccio said:
As stated by the others, a 4x speedup will almost certainly not be possible, but
if you program cleverly, you can still gain a lot.

Have a look at OpenMP. It is supported by g++ as well as the Intel compiler, and
most likely by others too.

Info on this can be found at www.openmp.org, where you can download a
very readable standard, and there is a simple tutorial at

https://computing.llnl.gov/tutorials/openMP/

For a newbie who just wants a smattering of MP, this is the fastest way
to do it. It does not give you all the flexibility you may think you
need, but more than likely you don't need it (depending on your app).

I second the recommendation.
 
Gianni Mariani

finecur said:
Hi, I am writing a C++ program. The program needs a lot of
computational power, so I am thinking about using the "multi-core"
benefit of my processor. I have two dual-core x86-64 processors on my
Linux box, so I am thinking of starting 4 threads in my program, each one
taking care of a part of the computation. I have two questions:

Will this approach make my program four times faster than a
single-threaded program?

This depends heavily on cache utilization and how much of the code can
be truly parallelized. I've heard of some parallelizations improving by
more than the number of CPUs and some that don't improve much at all.

In some cases the improvement was "superlinear" because the problem
ended up fitting into the combined cache of the CPUs.

I have seen code speed up by 50x simply by making sure that the data in
the cache didn't thrash.
What multithreading library should I use? I know multithreading is not
supported in native C++...

OpenMP is a very good start; it is portable and designed for computational
applications.

Boost, Austria C++, ACE, etc. all provide cross-platform thread interfaces
to raw threading primitives.
 
finecur

For a newbie who just wants a smattering of MP, this is the fastest way
to do it. It does not give you all the flexibility you may think you
need, but more than likely you don't need it (depending on your app).

I second the recommendation.

Thanks, everyone.

Here is my code using OpenMP:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <time.h>
#include <math.h>

#define CHUNKSIZE 10000
#define N 100000
#define NUM 1

float compute(float a, float b)
{
    return sin(a) * sin(b);
}

int main()
{
    int start, end;
    int nthreads, tid;
    int i, j, k, m, n, chunk;
    float a[N], b[N], c[N], sum;

    /* Some initializations */
    for (i = 0; i < N; i++){
        a[i] = b[i] = i * 1.0;
    }
    chunk = CHUNKSIZE;

    sum = 0;
    start = clock();
    for (i = 0; i < N; i++){
        c[i] = compute(a[i], b[i]);
        sum = sum + c[i];
    }
    end = clock();
    printf("time=%d, sum=%f\n", end - start, sum);

    sum = 0;
    start = clock();
    #pragma omp parallel shared(a,b,c,chunk) private(i)
    {
        #pragma omp for schedule(dynamic,chunk) nowait
        for (i = 0; i < N; i++){
            c[i] = compute(a[i], b[i]);
            sum = sum + c[i];
        }
    }
    end = clock();
    printf("time=%d, sum=%f\n", end - start, sum);
}

I found, however, that the output is

time=10000, sum=49999.394531
time=20000, sum=49999.394531

Which means OpenMP is even slower!!!!
How come?

I am using
g++ -O3 -march=x86-64 -mfpmath=sse -funroll-loops -fomit-frame-pointer -pthread -o t main.cc

to compile the program.

help....
 
Ian Collins

finecur said:
Thanks, everyone.

Here is my code using OpenMP:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <time.h>
#include <math.h>

#define CHUNKSIZE 10000
#define N 100000
#define NUM 1
You appear to be writing C, not C++.
end = clock();
printf("time=%d, sum=%f\n", end - start, sum);

sum = 0;
start = clock();
#pragma omp parallel shared(a,b,c,chunk) private(i)
{
#pragma omp for schedule(dynamic,chunk) nowait
for (i = 0; i < N; i++){
    c[i] = compute(a[i], b[i]);
    sum = sum + c[i];
}

}
end = clock();
printf("time=%d, sum=%f\n", end - start, sum);
}

I found however, the output is

time=10000, sum=49999.394531
time=20000, sum=49999.394531

You don't tell OpenMP how many threads to use. Your loop does not lend
itself to parallelization as written; all of the threads will contend for locks on c.

Which means OpenMP is even slower!!!!
How come?

clock() gives CPU time, so if you have two threads, the return will be
doubled for a unit of real time.
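
For what it's worth, one common way to cut down the contention in that loop is
to let OpenMP combine per-thread partial sums with a reduction clause. A minimal
sketch along the lines of the original code (build with g++ -O3 -fopenmp):

#include <cmath>
#include <cstdio>

#define N 100000

float compute(float a, float b)
{
    return std::sin(a) * std::sin(b);
}

static float a[N], b[N], c[N];   // file scope, so the arrays are not on the stack

int main()
{
    for (int i = 0; i < N; i++)
        a[i] = b[i] = i * 1.0f;

    float sum = 0;
    // Each thread accumulates its own private sum; OpenMP adds them at the end.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        c[i] = compute(a[i], b[i]);
        sum += c[i];
    }
    std::printf("sum = %f\n", sum);
    return 0;
}

With the reduction there is no shared write inside the loop, so the threads do
not fight over sum on every iteration.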
 
Gianni Mariani

finecur wrote:
....
to compile the program.

help....

You have a number of issues:

a) you're not compiling with OpenMP enabled - use -fopenmp
b) your computation is not long enough to measure adequately - it needs to
run for at least several seconds - preferably more than 5.
c) if you did make N large enough, it would be too big for your stack
d) clock() has nowhere near enough resolution to measure the relative
performance of either version.
e) sharing sum and updating it on every iteration would cause a lot of
contention on the shared add - best to remove that.
f) This is comp.lang.c++ so use C++ !!!

Try this code:

to compile without openmp use:
g++ -O3 -march=x86-64 -mfpmath=sse -funroll-loops -fomit-frame-pointer
xxx_openmp3.cpp -o xxx_openmp3

... and with openmp use:
g++ -O3 -march=x86-64 -mfpmath=sse -funroll-loops -fomit-frame-pointer
xxx_openmp3.cpp -o xxx_openmp3 -fopenmp

==========
#include <omp.h>
#include <cmath>
#include <iostream>

#define CHUNKSIZE 100000
#define NX 10000000
#define NUM 1

template <typename T>
T compute(T a, T b)
{
    return std::sin(a) * std::sin(b);
}

template <typename T, int N>
struct Compute
{
    T a[N], b[N], c[N], sum;

    Compute()
      : sum()
    {
        for (int i=0; i < N; i++){
            a[i] = b[i] = i * 1.0;
        }
    }

    void DoWork()
    {
        int i, chunk = CHUNKSIZE;

        // maintain references to arrays - speeds things up a tad
        T (&la)[N] = a;
        T (&lb)[N] = b;
        T (&lc)[N] = c;
        T lsum = T();
        sum = T();

        #pragma omp parallel shared(chunk) private(i) firstprivate(lsum)
        {
            #pragma omp for schedule(dynamic,chunk) nowait
            for (i = 0; i < N; i++){
                lsum += lc[i] = compute(la[i], lb[i]);
            }

            #pragma omp critical
            {
                sum += lsum;
            }
        }
    }
};

Compute<double,NX> nx;

int main()
{
    nx.DoWork();

    std::cout.precision(16);
    std::cout << "sum = " << nx.sum << "\n";
}
==========

time ./xxx_openmp3
sum = 5000000.034065126

real 0m0.992s
user 0m1.588s
sys 0m0.152s

..... without -fopenmp

time ./xxx_openmp3
sum = 5000000.034065164

real 0m1.806s
user 0m1.632s
sys 0m0.152s


..... note the result is different due to round-off !!! It will even
change every run because the order of the summation is different. You
can fix this, but you need to save the intermediate sums. In theory, you
should really add the values in order, starting with the ones with the
smallest absolute value, which requires sorting the values in array c.
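
A sketch of that idea (ordered_sum is just an illustrative helper, not part of
the code above): copy the saved per-element results, sort them by magnitude,
and add them in that fixed order, so the total no longer depends on how the
threads divided up the work:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Compare by absolute value so the smallest terms are added first.
static bool by_magnitude(double x, double y)
{
    return std::fabs(x) < std::fabs(y);
}

// Sum the saved per-element results in a fixed, magnitude-sorted order.
double ordered_sum(std::vector<double> values)
{
    std::sort(values.begin(), values.end(), by_magnitude);
    double total = 0.0;
    for (std::size_t i = 0; i < values.size(); ++i)
        total += values[i];
    return total;
}

Feeding the contents of c through something like this gives the same total on
every run, at the cost of an extra sort.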

Here is what you get now:
./xxx_openmp3 & ./xxx_openmp3 & ./xxx_openmp3 ; sleep 3
[1] 30411
[2] 30412
sum = 5000000.034065094
sum = 5000000.03406521
sum = 5000000.034065165

....
 
finecur

Gianni Mariani said:
....
time ./xxx_openmp3
sum = 5000000.034065126

real 0m0.992s
user 0m1.588s
sys 0m0.152s
....

Thank you very much.
Can you tell me how you time the program?
 
