multithread on multicore processor on linux

Discussion in 'C++' started by finecur, Feb 1, 2008.

  1. finecur

    finecur Guest

    Hi, I am writing a C++ program. The program needs a lot of
    computational power, so I am thinking about utilizing the "multi-core"
    benefit of my processor. I have two dual-core x86-64 processors in my
    Linux box, so I am thinking of starting 4 threads in my program, each
    one taking care of a part of the computation. I have two questions:

    Will this approach make my program four times faster than a
    single-threaded program?
    What multithreading library should I use? I know multithreading is
    not supported in native C++...

    Thank you very much,

    ff
     
    finecur, Feb 1, 2008
    #1

  2. finecur

    Tim H Guest

    On Feb 1, 9:53 am, finecur <> wrote:
    > Hi, I am writing a C++ program. The program needs a lot of
    > computational power, so I am thinking about utilizing the "multi-core"
    > benefit of my processor. I have two dual-core x86-64 processors in my
    > Linux box, so I am thinking of starting 4 threads in my program, each
    > one taking care of a part of the computation. I have two questions:
    >
    > Will this approach make my program four times faster than a
    > single-threaded program?


    Perfect speedup is rare. Realistically, shoot for about 75%
    efficiency per additional core.
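
    (Back-of-the-envelope with that rule of thumb: on 4 cores you would
    expect roughly 1 + 3 * 0.75 = 3.25x, not 4x.)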

    Good luck
     
    Tim H, Feb 1, 2008
    #2

  3. finecur

    Guest

    On Feb 1, 12:53 pm, finecur <> wrote:
    > Hi, I am writing a C++ program. The program needs a lot of
    > computational power, so I am thinking about utilizing the "multi-core"
    > benefit of my processor. I have two dual-core x86-64 processors in my
    > Linux box, so I am thinking of starting 4 threads in my program, each
    > one taking care of a part of the computation. I have two questions:
    >
    > Will this approach make my program four times faster than a
    > single-threaded program?
    > What multithreading library should I use? I know multithreading is
    > not supported in native C++...
    >
    > Thank you very much,
    >
    > ff


    Some suggestions...

    Speedup from threading depends on how much of your algorithm you can
    parallelize.

    The Butenhof book ("Programming with POSIX Threads") is a very good
    threading book. It describes pthreads, but the concepts it covers
    apply to threads in general.

    pthreads are "native" on Unix variants. Consider using Boost threads
    if you need cross-platform support; a sketch of the pthreads pattern
    is below.
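
    The usual pattern is: give each thread its own slice of the data and
    its own output slot, then join and combine. A minimal sketch (the
    worker function and the 4-way split are made up for illustration):

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define N 1000000

    static double partial[NTHREADS];

    /* Each worker sums its own quarter of the range - no shared writes. */
    static void *worker(void *arg)
    {
        long id = (long)arg;
        long lo = id * (N / NTHREADS), hi = lo + N / NTHREADS;
        double s = 0;
        for (long i = lo; i < hi; i++)
            s += i * 0.5;            /* stand-in for the real computation */
        partial[id] = s;
        return 0;
    }

    int main()
    {
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], 0, worker, (void *)i);
        double sum = 0;
        for (long i = 0; i < NTHREADS; i++) {
            pthread_join(t[i], 0);     /* wait, then fold in the result */
            sum += partial[i];
        }
        printf("sum = %f\n", sum);
    }

    Compile with g++ -pthread.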

    HTH
     
    , Feb 1, 2008
    #3
  4. finecur

    ciccio Guest

    > Will this approach make my program four times faster than a
    > single-threaded program?
    > What multithreading library should I use? I know multithreading is
    > not supported in native C++...



    As stated by the others, a full 4x speedup is unlikely, but if you
    program cleverly you can certainly gain a lot.

    Have a look at OpenMP. It is supported by g++ as well as Intel's
    compiler, and most likely others too.

    Info on this can be found on www.openmp.org, where you can download
    a very readable standard, and a simple tutorial is at

    https://computing.llnl.gov/tutorials/openMP/
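
    For a taste of how little code a basic OpenMP loop takes, here is a
    minimal sketch (the loop body is just a placeholder, not your real
    computation):

    #include <cmath>
    #include <cstdio>

    int main()
    {
        const int n = 1000000;
        double sum = 0;

        // One pragma parallelizes the loop; reduction(+:sum) gives each
        // thread a private sum and combines them at the end.
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += std::sin(i * 0.001);

        std::printf("sum = %f\n", sum);
    }

    Build with g++ -fopenmp; without that flag the pragma is simply
    ignored and you get the serial version.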

    Enjoy
     
    ciccio, Feb 1, 2008
    #4
  5. finecur

    vkc Guest

    > What multithread library should I use? I know multithread is not
    > support in native c++...


    For a parallel implementation of algorithms without having to worry
    about threads explicitly, you might want to take a look at Intel's
    Threading Building Blocks.
    http://www.intel.com/cd/software/products/asmo-na/eng/294797.htm
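
    With TBB you express the work as a splittable "body" and let the
    library carve up the range and schedule it. A rough sketch of that
    style (untested here; the SumSines name is just illustrative, and
    details may vary between TBB versions):

    #include <cstdio>
    #include <cmath>
    #include "tbb/parallel_reduce.h"
    #include "tbb/blocked_range.h"
    #include "tbb/task_scheduler_init.h"

    // TBB splits the range, runs operator() on the pieces in parallel,
    // and merges the partial results through join().
    struct SumSines {
        double sum;
        SumSines() : sum(0) {}
        SumSines(SumSines &, tbb::split) : sum(0) {}
        void operator()(const tbb::blocked_range<int> &r) {
            for (int i = r.begin(); i != r.end(); ++i)
                sum += std::sin(i * 0.001);   // placeholder work
        }
        void join(const SumSines &other) { sum += other.sum; }
    };

    int main()
    {
        tbb::task_scheduler_init init;   // starts the worker threads
        SumSines body;
        tbb::parallel_reduce(tbb::blocked_range<int>(0, 1000000), body);
        std::printf("sum = %f\n", body.sum);
    }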

    Vilas
     
    vkc, Feb 1, 2008
    #5
  6. finecur

    red floyd Guest

    finecur wrote:
    > Hi, I am writing a C++ program. The program needs a lot of
    > computational power, so I am thinking about utilizing the "multi-core"
    > benefit of my processor. I have two dual-core x86-64 processors in my
    > Linux box, so I am thinking of starting 4 threads in my program, each
    > one taking care of a part of the computation. I have two questions:
    >
    > Will this approach make my program four times faster than a
    > single-threaded program?
    > What multithreading library should I use? I know multithreading is
    > not supported in native C++...
    >


    As for what sort of speedup you can get, see Amdahl's Law.

    http://en.wikipedia.org/wiki/Amdahl's_law
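
    The short version: if a fraction p of the run time can be
    parallelized over n cores, the best possible speedup is
    1 / ((1 - p) + p/n). With p = 0.9 and n = 4 that comes to about
    3.1x - and it can never exceed 1 / (1 - p) = 10x no matter how many
    cores you add.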
     
    red floyd, Feb 1, 2008
    #6
  7. ciccio wrote:
    >> Will this approach make my program four times faster than a
    >> single-threaded program?
    >> What multithreading library should I use? I know multithreading is
    >> not supported in native C++...

    >
    >
    > As stated by the others, a full 4x speedup is unlikely, but if you
    > program cleverly you can certainly gain a lot.
    >
    > Have a look at OpenMP. It is supported by g++ as well as Intel's
    > compiler, and most likely others too.
    >
    > Info on this can be found on www.openmp.org, where you can download
    > a very readable standard, and a simple tutorial is at
    >
    > https://computing.llnl.gov/tutorials/openMP/


    For a newbie who just wants a smattering of MP, this is the fastest
    way to do it. It does not give you all the flexibility you may think
    you need, but more than likely you don't need it (depending on your
    app).

    I second the recommendation.
     
    Gianni Mariani, Feb 1, 2008
    #7
  8. finecur wrote:
    > Hi, I am writing a C++ program. The program needs a lot of
    > computational power, so I am thinking about utilizing the "multi-core"
    > benefit of my processor. I have two dual-core x86-64 processors in my
    > Linux box, so I am thinking of starting 4 threads in my program, each
    > one taking care of a part of the computation. I have two questions:
    >
    > Will this approach make my program four times faster than a
    > single-threaded program?


    This depends heavily on cache utilization and on how much of the code
    can be truly parallelized. I've heard of some parallelizations
    improving by more than the number of CPUs and some not improving much
    at all.

    The "superlinear" cases happen when the problem ends up fitting into
    the combined cache of the CPUs.

    I have seen code speed up by 50x simply by making sure that the data
    in the cache didn't thrash.

    > What multithreading library should I use? I know multithreading is
    > not supported in native C++...


    OpenMP is a very good start: it is portable and designed for
    computational applications.

    Boost, Austria C++, ACE, etc. all provide cross-platform interfaces
    to the raw threading primitives; a small Boost sketch is below.
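
    For what the Boost interface feels like, a minimal sketch (the worker
    function is hypothetical; boost::bind keeps it portable across older
    Boost versions):

    #include <boost/thread.hpp>
    #include <boost/bind.hpp>
    #include <iostream>

    // Each worker writes to its own slot, so no locking is needed.
    void worker(double *out, int id)
    {
        double s = 0;
        for (int i = id; i < 1000000; i += 4)   // strided 4-way split
            s += i * 0.5;
        *out = s;
    }

    int main()
    {
        double partial[4] = { 0, 0, 0, 0 };
        boost::thread t0(boost::bind(worker, &partial[0], 0));
        boost::thread t1(boost::bind(worker, &partial[1], 1));
        boost::thread t2(boost::bind(worker, &partial[2], 2));
        boost::thread t3(boost::bind(worker, &partial[3], 3));
        t0.join(); t1.join(); t2.join(); t3.join();
        std::cout << partial[0] + partial[1] + partial[2] + partial[3]
                  << "\n";
    }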
     
    Gianni Mariani, Feb 1, 2008
    #8
  9. finecur

    finecur Guest

    On Feb 1, 2:09 pm, Gianni Mariani <> wrote:
    > ciccio wrote:
    > >> Will this approach make my program four times faster than a
    > >> single-threaded program?
    > >> What multithreading library should I use? I know multithreading
    > >> is not supported in native C++...

    >
    > > As stated by the others, a full 4x speedup is unlikely, but if you
    > > program cleverly you can certainly gain a lot.

    >
    > > Have a look at OpenMP.  It is supported by g++ as well as Intel's
    > > compiler, and most likely others too.

    >
    > > Info on this can be found on www.openmp.org, where you can
    > > download a very readable standard, and a simple tutorial is at

    >
    > > https://computing.llnl.gov/tutorials/openMP/

    >
    > For a newbie who just wants a smattering of MP, this is the fastest
    > way to do it.  It does not give you all the flexibility you may think
    > you need, but more than likely you don't need it (depending on your
    > app).
    >
    > I second the recommendation.


    Thanks, everyone.

    Here is my code using OpenMP:

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>
    #include <time.h>
    #include <math.h>

    #define CHUNKSIZE 10000
    #define N 100000
    #define NUM 1

    float compute(float a, float b)
    {
        return sin(a) * sin(b);
    }

    int main()
    {
        int start, end;
        int i, chunk;
        float a[N], b[N], c[N], sum;

        /* Some initializations */
        for (i = 0; i < N; i++){
            a[i] = b[i] = i * 1.0;
        }
        chunk = CHUNKSIZE;

        sum = 0;
        start = clock();
        for (i = 0; i < N; i++){
            c[i] = compute(a[i], b[i]);
            sum = sum + c[i];
        }
        end = clock();
        printf("time=%d, sum=%f\n", end - start, sum);

        sum = 0;
        start = clock();
        #pragma omp parallel shared(a,b,c,chunk) private(i)
        {
            #pragma omp for schedule(dynamic,chunk) nowait
            for (i = 0; i < N; i++){
                c[i] = compute(a[i], b[i]);
                sum = sum + c[i];
            }
        }
        end = clock();
        printf("time=%d, sum=%f\n", end - start, sum);
    }

    I found however, the output is

    time=10000, sum=49999.394531
    time=20000, sum=49999.394531

    Which means OpenMP is even slower!!!!
    How come?

    I am using
    g++ -O3 -march=x86-64 -mfpmath=sse -funroll-loops -fomit-frame-pointer
    -pthread -o t main.cc

    to compile the program.

    help....
     
    finecur, Feb 1, 2008
    #9
  10. finecur

    Ian Collins Guest

    finecur wrote:
    >
    > Thanks for everyone.
    >
    > Here is my code using OpenMP:
    >
    > #include <stdio.h>
    > #include <stdlib.h>
    > #include <omp.h>
    > #include <time.h>
    > #include <math.h>
    >
    > #define CHUNKSIZE 10000
    > #define N 100000
    > #define NUM 1
    >

    You appear to be writing C, not C++.
    >
    > end = clock();
    > printf("time=%d, sum=%f\n", end - start, sum);
    >
    > sum = 0;
    > start = clock();
    > #pragma omp parallel shared(a,b,c,chunk) private(i)
    > {
    > #pragma omp for schedule(dynamic,chunk) nowait
    > for (i = 0; i < N; i++){
    > c[i] = compute(a[i], b[i]);
    > sum = sum + c[i];
    > }
    >
    > }
    > end = clock();
    > printf("time=%d, sum=%f\n", end - start, sum);
    > }
    >
    > I found however, the output is
    >
    > time=10000, sum=49999.394531
    > time=20000, sum=49999.394531
    >

    You don't tell OpenMP how many threads to use, and your loop does not
    lend itself to parallelization as written: all of the threads will
    contend for the shared variables.

    > Which means OpenMP is even slower!!!!
    > How come?
    >

    clock() gives CPU time, so if you have two threads, the value
    returned will be doubled for a given amount of real time.
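
    If you want wall-clock time instead, OpenMP provides omp_get_wtime().
    A small sketch of the difference (the busy loop is just filler):

    #include <omp.h>
    #include <ctime>
    #include <cstdio>

    int main()
    {
        double w0 = omp_get_wtime();     // wall-clock seconds
        std::clock_t c0 = std::clock();  // CPU time, summed over threads

        #pragma omp parallel
        {
            volatile double s = 0;       // keep every core busy
            for (long i = 0; i < 100000000L; i++) s += i;
        }

        std::printf("wall: %f s\n", omp_get_wtime() - w0);
        std::printf("cpu:  %f s\n",
                    (double)(std::clock() - c0) / CLOCKS_PER_SEC);
    }

    Compile with -fopenmp. On a multi-core run the "cpu" figure will be
    roughly the "wall" figure times the number of busy cores.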

    --
    Ian Collins.
     
    Ian Collins, Feb 2, 2008
    #10
  11. finecur wrote:
    ....
    > to compile the program.
    >
    > help....


    You have a number of issues:

    a) you're not compiling with OpenMP enabled - use -fopenmp
    b) your computation is not long enough to measure adequately - it
    needs to run for at least a few seconds, preferably more than 5.
    c) if you did make N large enough, it would be too big for your stack
    d) clock() has nowhere near enough resolution to measure the relative
    performance of either version.
    e) updating the shared sum on every iteration would cause a lot of
    contention on that shared add - best to remove it from the loop.
    f) This is comp.lang.c++, so use C++ !!!

    Try this code:

    to compile without OpenMP use:
    g++ -O3 -march=x86-64 -mfpmath=sse -funroll-loops -fomit-frame-pointer
    xxx_openmp3.cpp -o xxx_openmp3

    ... and with OpenMP use:
    g++ -O3 -march=x86-64 -mfpmath=sse -funroll-loops -fomit-frame-pointer
    xxx_openmp3.cpp -o xxx_openmp3 -fopenmp

    ==========
    #include <omp.h>
    #include <cmath>
    #include <iostream>

    #define CHUNKSIZE 100000
    #define NX 10000000
    #define NUM 1

    template <typename T>
    T compute(T a, T b)
    {
        return std::sin(a) * std::sin(b);
    }

    template <typename T, int N>
    struct Compute
    {
        T a[N], b[N], c[N], sum;

        Compute()
          : sum()
        {
            for (int i = 0; i < N; i++) {
                a[i] = b[i] = i * 1.0;
            }
        }

        void DoWork()
        {
            int i, chunk = CHUNKSIZE;

            // maintain references to the arrays - speeds things up a tad
            T (&la)[N] = a;
            T (&lb)[N] = b;
            T (&lc)[N] = c;
            T lsum = T();
            sum = T();

            // firstprivate gives each thread its own lsum, initialized to 0
            #pragma omp parallel shared(chunk) firstprivate(lsum) private(i)
            {
                #pragma omp for schedule(dynamic,chunk) nowait
                for (i = 0; i < N; i++) {
                    lsum += lc[i] = compute(la[i], lb[i]);
                }

                // each thread folds its private sum into the shared total
                #pragma omp critical
                {
                    sum += lsum;
                }
            }
        }
    };

    Compute<double,NX> nx;

    int main()
    {
        nx.DoWork();

        std::cout.precision(16);
        std::cout << "sum = " << nx.sum << "\n";
    }
    ==========

    time ./xxx_openmp3
    sum = 5000000.034065126

    real 0m0.992s
    user 0m1.588s
    sys 0m0.152s

    ... without -fopenmp

    time ./xxx_openmp3
    sum = 5000000.034065164

    real 0m1.806s
    user 0m1.632s
    sys 0m0.152s


    ... note the result is different due to round-off !!! It will even
    change every run because the order of the summation is different. You
    can fix this, but you need to save the intermediate sums. In theory,
    you should really add the values in order starting with the ones with
    the smallest absolute value, which requires sorting the values in
    array c; a sketch of that is below.
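
    The shape of that idea (not tuned, just an illustration): copy the
    values out, sort by magnitude, then accumulate.

    #include <algorithm>
    #include <cmath>
    #include <numeric>
    #include <vector>

    static bool abs_less(double x, double y)
    {
        return std::fabs(x) < std::fabs(y);
    }

    // Summing in order of increasing magnitude fixes the order (so runs
    // are reproducible) and tends to reduce round-off error.
    double ordered_sum(std::vector<double> c)
    {
        std::sort(c.begin(), c.end(), abs_less);
        return std::accumulate(c.begin(), c.end(), 0.0);
    }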

    Here is what you get now:
    ./xxx_openmp3 & ./xxx_openmp3 & ./xxx_openmp3 ; sleep 3
    [1] 30411
    [2] 30412
    sum = 5000000.034065094
    sum = 5000000.03406521
    sum = 5000000.034065165

    ...
     
    Gianni Mariani, Feb 2, 2008
    #11
  12. finecur

    finecur Guest

    On Feb 1, 5:25 pm, Gianni Mariani <> wrote:
    > [long quoted post snipped]

    Thank you very much.
    Can you tell me how you timed the program?
     
    finecur, Feb 3, 2008
    #12
  13. finecur wrote:
    >> time ./xxx_openmp3


    > can you tell me how you time the program?


    He did already. On Unix systems there's a utility called 'time' for
    this exact purpose.
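
    Of the three numbers it prints, "real" is wall-clock time, "user" is
    CPU time summed across all threads, and "sys" is time spent in the
    kernel. That's why in the -fopenmp run above user (1.588s) exceeds
    real (0.992s): more than one core was busy at once.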
     
    Juha Nieminen, Feb 3, 2008
    #13
