Concurrent code performance

Discussion in 'C++' started by Saeed Amrollahi, Oct 9, 2011.

  1. Hi All C++ developers

    I learned (from theory and practice) the most important benefit of
    qualified multi-threaded programming is better performance.
    Here it is a (not well-written) concurrent (two worker threads)
    and sequential version
    of Summation of [0, 2000000[ interval.
    Of course my concurrent code is not good. It's for exposition only.

    // concurrent code
    #include <thread>
    #include <iostream>

    using namespace std;
    long long s1 = 0, s2 = 0;
    void Sum(int first, int last, long long& res) // calculate the sum in
    [first, last[ interval
    {
    while (first < last)
    res += first++;
    }

    int main()
    {
    long long r1 = 0, r2 = 0;
    thread t1(Sum, 0, 1000000, std::ref(r1));
    thread t2(Sum, 1000000, 2000000, std::ref(r2));
    t1.join(); t2.join();
    cout << r1 + r2 << '\n';

    return 0;
    }

    // sequential code
    #include <iostream>

    using namespace std;
    void Sum(int first, int last, long long& res) // calculate the sum in
    [first, last[ interval
    {
    while (first < last)
    res += first++;
    }

    int main()
    {
    long long r = 0;
    Sum(0, 2000000, r);

    cout << r<< '\n';

    return 0;
    }

    I compiled and ran two codes using the following commands:
    $ g++ -std=c++0x -pthread -pedantic -o2 concurrent_sum.c++ -o
    concurrent_sum
    $ g++ -std=c++0x -pedantic -o2 sequential_sum.c++ -o sequential_sum
    $ time ./concurrent_sum
    1999999999000
    real 0m0.014s
    user 0m0.016s
    sys 0m0.004s
    $ time ./sequential_sum
    1999999999000
    real 0m0.021s
    user 0m0.020s
    sys 0m0.000s
    Of course the time command differs in several
    execution, but honestly, I didn't see so much difference between two
    execution.
    In addition the generated code for sequential_sum is about 6 KB, but
    the size
    of concurrent_code is about 45 KB.
    Is there any problem in my measurement? Why the concurrent object code
    is 7+ times bigger than sequential version? How do you explain it?

    FYI, I use GCC 4.6.0 under Ubuntu on a Dell single processor notebook.
    Please shed some light.

    -- Saeed Amrollahi
     
    Saeed Amrollahi, Oct 9, 2011
    #1
    1. Advertising

  2. Saeed Amrollahi

    Marc Guest

    Saeed Amrollahi wrote:

    > Hi All C++ developers
    >
    > I learned (from theory and practice) the most important benefit of
    > qualified multi-threaded programming is better performance.
    > Here it is a (not well-written) concurrent (two worker threads)
    > and sequential version
    > of Summation of [0, 2000000[ interval.
    > Of course my concurrent code is not good. It's for exposition only.
    >
    > // concurrent code
    > #include <thread>
    > #include <iostream>
    >
    > using namespace std;
    > long long s1 = 0, s2 = 0;
    > void Sum(int first, int last, long long& res) // calculate the sum in
    > [first, last[ interval
    > {
    > while (first < last)
    > res += first++;
    > }
    >
    > int main()
    > {
    > long long r1 = 0, r2 = 0;
    > thread t1(Sum, 0, 1000000, std::ref(r1));
    > thread t2(Sum, 1000000, 2000000, std::ref(r2));


    I would make Sum return a long long instead of updating a reference.
    If you are unlucky, r1 and r2 end up on the same cache line, you have
    false sharing and the concurrent version ends up slower than the
    sequential one.

    > t1.join(); t2.join();
    > cout << r1 + r2 << '\n';
    >
    > return 0;
    > }


    > In addition the generated code for sequential_sum is about 6 KB, but
    > the size
    > of concurrent_code is about 45 KB.
    > Is there any problem in my measurement? Why the concurrent object code
    > is 7+ times bigger than sequential version? How do you explain it?


    Your basic code is very simple: a loop, a couple additions, almost
    nothing. It is not really surprising that code meant to handle
    multiple threads is several times larger. Now it might be possible for
    libstdc++ to factor more of that code into the shared library, but
    that's a different, more complicated question.
     
    Marc, Oct 9, 2011
    #2
    1. Advertising

  3. On Oct 9, 4:29 pm, Marc <> wrote:
    > Saeed Amrollahi  wrote:
    > > Hi All C++ developers

    >
    > > I learned (from theory and practice) the most important benefit of
    > > qualified multi-threaded programming is better performance.
    > > Here it is a (not well-written) concurrent (two worker threads)
    > > and sequential version
    > > of Summation of [0, 2000000[ interval.
    > > Of course my concurrent code is not good. It's for exposition only.

    >
    > > // concurrent code
    > > #include <thread>
    > > #include <iostream>

    >
    > > using namespace std;
    > > long long s1 = 0, s2 = 0;
    > > void Sum(int first, int last, long long& res) // calculate the sum in
    > > [first, last[ interval
    > > {
    > >   while (first < last)
    > >     res += first++;
    > > }

    >
    > > int main()
    > > {
    > >   long long r1 = 0, r2 = 0;
    > >   thread t1(Sum, 0, 1000000, std::ref(r1));
    > >   thread t2(Sum, 1000000, 2000000, std::ref(r2));

    >
    > I would make Sum return a long long instead of updating a reference.
    > If you are unlucky, r1 and r2 end up on the same cache line, you have
    > false sharing and the concurrent version ends up slower than the
    > sequential one.


    I know. As I noted, my code is not well-written. but I tried
    to write two equivalent code.
    Can you explain your answer in more detail? the cache line thing
    and false sharing ...
    >
    > >   t1.join(); t2.join();
    > >   cout << r1 + r2 << '\n';

    >
    > >   return 0;
    > > }
    > > In addition the generated code for sequential_sum is about 6 KB, but
    > > the size
    > > of concurrent_code is about 45 KB.
    > > Is there any problem in my measurement? Why the concurrent object code
    > > is 7+ times bigger than sequential version? How do you explain it?

    >
    > Your basic code is very simple: a loop, a couple additions, almost
    > nothing. It is not really surprising that code meant to handle
    > multiple threads is several times larger. Now it might be possible for
    > libstdc++ to factor more of that code into the shared library, but
    > that's a different, more complicated question.


    Why it's not surprising? is the size of my threads big?
    Would you explain more.

    TIA,
    -- Saeed Amrollahi
     
    Saeed Amrollahi, Oct 9, 2011
    #3
  4. Saeed Amrollahi

    Marc Guest

    Saeed Amrollahi wrote:

    > Can you explain your answer in more detail? the cache line thing
    > and false sharing ...


    There's a wikipedia article on this...

    >> Your basic code is very simple: a loop, a couple additions, almost
    >> nothing. It is not really surprising that code meant to handle
    >> multiple threads is several times larger. Now it might be possible for
    >> libstdc++ to factor more of that code into the shared library, but
    >> that's a different, more complicated question.

    >
    > Why it's not surprising? is the size of my threads big?
    > Would you explain more.


    Size of thread: basically how much you call operator new.
    Size of object file: basically length of (used) source code.

    The basic code is self contained and fits in less than 10 lines. The
    threaded code includes the <thread> header, which is rather long...
     
    Marc, Oct 9, 2011
    #4
  5. Saeed Amrollahi

    Pavel Guest

    Saeed Amrollahi wrote:
    > Hi All C++ developers

    ....

    > I compiled and ran two codes using the following commands:
    > $ g++ -std=c++0x -pthread -pedantic -o2 concurrent_sum.c++ -o
    > concurrent_sum
    > $ g++ -std=c++0x -pedantic -o2 sequential_sum.c++ -o sequential_sum

    ....

    > Is there any problem in my measurement?

    Not that I can see. You had a decent speed-up in real-time of 50% (or 33%,
    depending on how you count). Speed-up factor would almost never be exactly 2 for
    2 threads as there are lots of resources in your system that can become
    bottlenecks (you may consider reading "What Every Programmer Should Know About
    Memory" by Ulrich Drepper, available, for example, at
    www.akkadia.org/drepper/cpumemory.pdf); also, context switches would take some
    time. Probably your single-CPU notebook has hyper-threading turned on (try 'cat
    /proc/cpuinfo' to say for sure); else I can't explain that you saw the reduction
    in real time at all.

    > Why the concurrent object code
    > is 7+ times bigger than sequential version? How do you explain it?

    .....
    The concurrent code has thread library code linked in whereas the sequential one
    doesn't. As you use GCC, you can run 'objdump -d' on your concurrent binary and
    see what exactly functions were added as compared to the sequential version.

    >
    > -- Saeed Amrollahi



    Hope this helps,
    -Pavel
     
    Pavel, Oct 10, 2011
    #5
  6. Saeed Amrollahi

    Goran Guest

    On Oct 9, 2:42 pm, Saeed Amrollahi <> wrote:
    > Hi All C++ developers
    >
    > I learned (from theory and practice) the most important benefit of
    > qualified multi-threaded programming is better performance.
    > Here it is a (not well-written) concurrent (two worker threads)
    > and sequential version
    > of Summation of [0, 2000000[ interval.
    > Of course my concurrent code is not good. It's for exposition only.
    >
    > // concurrent code
    > #include <thread>
    > #include <iostream>
    >
    > using namespace std;
    > long long s1 = 0, s2 = 0;
    > void Sum(int first, int last, long long& res) // calculate the sum in
    > [first, last[ interval
    > {
    >   while (first < last)
    >     res += first++;
    >
    > }
    >
    > int main()
    > {
    >   long long r1 = 0, r2 = 0;
    >   thread t1(Sum, 0, 1000000, std::ref(r1));
    >   thread t2(Sum, 1000000, 2000000, std::ref(r2));
    >   t1.join(); t2.join();
    >   cout << r1 + r2 << '\n';
    >
    >   return 0;
    >
    > }
    >
    > // sequential code
    > #include <iostream>
    >
    > using namespace std;
    > void Sum(int first, int last, long long& res) // calculate the sum in
    > [first, last[ interval
    > {
    >   while (first < last)
    >    res += first++;
    >
    > }
    >
    > int main()
    > {
    >   long long r = 0;
    >   Sum(0, 2000000, r);
    >
    >   cout << r<< '\n';
    >
    >   return 0;
    >
    > }
    >
    > I compiled and ran two codes using the following commands:
    > $ g++ -std=c++0x -pthread -pedantic -o2 concurrent_sum.c++ -o
    > concurrent_sum
    > $ g++ -std=c++0x  -pedantic -o2 sequential_sum.c++ -o sequential_sum
    > $ time ./concurrent_sum
    > 1999999999000
    > real   0m0.014s
    > user  0m0.016s
    > sys   0m0.004s
    > $ time ./sequential_sum
    > 1999999999000
    > real   0m0.021s
    > user  0m0.020s
    > sys   0m0.000s
    > Of course the time command differs in several
    > execution, but honestly, I didn't see so much difference between two
    > execution.
    > In addition the generated code for sequential_sum is about 6 KB, but
    > the size
    > of concurrent_code is about 45 KB.
    > Is there any problem in my measurement? Why the concurrent object code
    > is 7+ times bigger than sequential version? How do you explain it?


    I don't see a problem. On a single-core machine, I am surprised you
    saw even that much of an improvement. I guess, just like Pavel, that
    hyper-threading is working. Your code should give better performance
    on a multi-core machines. Given what it does, it should hardly touch
    memory, if at all, so memory should not be a bottleneck.

    Your executable is 7 times bigger because you linked in (part of)
    pthreads in it.

    Goran.
     
    Goran, Oct 10, 2011
    #6
  7. On Oct 9, 6:41 pm, Marc <> wrote:
    > Saeed Amrollahi  wrote:
    > > Can you explain your answer in more detail? the cache line thing
    > > and false sharing ...

    >
    > There's a wikipedia article on this...
    >
    > >> Your basic code is very simple: a loop, a couple additions, almost
    > >> nothing. It is not really surprising that code meant to handle
    > >> multiple threads is several times larger. Now it might be possible for
    > >> libstdc++ to factor more of that code into the shared library, but
    > >> that's a different, more complicated question.

    >
    > > Why it's not surprising? is the size of my threads big?
    > > Would you explain more.

    >
    > Size of thread: basically how much you call operator new.
    > Size of object file: basically length of (used) source code.
    >
    > The basic code is self contained and fits in less than 10 lines. The
    > threaded code includes the <thread> header, which is rather long...


    Thank you for your answers Marc.
    -- Saeed
     
    Saeed Amrollahi, Oct 10, 2011
    #7
  8. On Oct 10, 5:57 am, Pavel
    <> wrote:
    > Saeed Amrollahi wrote:
    > > Hi All C++ developers

    >
    > ...
    >
    > > I compiled and ran two codes using the following commands:
    > > $ g++ -std=c++0x -pthread -pedantic -o2 concurrent_sum.c++ -o
    > > concurrent_sum
    > > $ g++ -std=c++0x  -pedantic -o2 sequential_sum.c++ -o sequential_sum

    >
    > ...
    >
    > > Is there any problem in my measurement?

    >
    > Not that I can see. You had a decent speed-up in real-time of 50% (or 33%,
    > depending on how you count). Speed-up factor would almost never be exactly 2 for
    > 2 threads as there are lots of resources in your system that can become
    > bottlenecks

    <Nodded>
    (you may consider reading "What Every Programmer Should Know About
    > Memory" by Ulrich Drepper, available, for example, atwww.akkadia.org/drepper/cpumemory.pdf);

    I know about the paper, but I haven't to read it yet.
    also, context switches would take some
    > time. Probably your single-CPU notebook has hyper-threading turned on (try 'cat
    > /proc/cpuinfo' to say for sure); else I can't explain that you saw the reduction
    > in real time at all.
    >
    > > Why the concurrent object code
    > > is 7+ times bigger than sequential version? How do you explain it?

    >
    > ....
    > The concurrent code has thread library code linked in whereas the sequential one
    > doesn't. As you use GCC, you can run 'objdump -d' on your concurrent binary and
    > see what exactly functions were added as compared to the sequential version.
    >
    >
    >
    > >    -- Saeed Amrollahi

    >
    > Hope this helps,
    > -Pavel


    Thank you very much. Besides concurrency concepts, I learned
    a few UNIX-related commands.
    -- Saeed
     
    Saeed Amrollahi, Oct 10, 2011
    #8
  9. On Oct 10, 10:06 am, Goran <> wrote:
    > On Oct 9, 2:42 pm, Saeed Amrollahi <> wrote:
    >
    >
    >
    >
    >
    >
    >
    >
    >
    > > Hi All C++ developers

    >
    > > I learned (from theory and practice) the most important benefit of
    > > qualified multi-threaded programming is better performance.
    > > Here it is a (not well-written) concurrent (two worker threads)
    > > and sequential version
    > > of Summation of [0, 2000000[ interval.
    > > Of course my concurrent code is not good. It's for exposition only.

    >
    > > // concurrent code
    > > #include <thread>
    > > #include <iostream>

    >
    > > using namespace std;
    > > long long s1 = 0, s2 = 0;
    > > void Sum(int first, int last, long long& res) // calculate the sum in
    > > [first, last[ interval
    > > {
    > >   while (first < last)
    > >     res += first++;

    >
    > > }

    >
    > > int main()
    > > {
    > >   long long r1 = 0, r2 = 0;
    > >   thread t1(Sum, 0, 1000000, std::ref(r1));
    > >   thread t2(Sum, 1000000, 2000000, std::ref(r2));
    > >   t1.join(); t2.join();
    > >   cout << r1 + r2 << '\n';

    >
    > >   return 0;

    >
    > > }

    >
    > > // sequential code
    > > #include <iostream>

    >
    > > using namespace std;
    > > void Sum(int first, int last, long long& res) // calculate the sum in
    > > [first, last[ interval
    > > {
    > >   while (first < last)
    > >    res += first++;

    >
    > > }

    >
    > > int main()
    > > {
    > >   long long r = 0;
    > >   Sum(0, 2000000, r);

    >
    > >   cout << r<< '\n';

    >
    > >   return 0;

    >
    > > }

    >
    > > I compiled and ran two codes using the following commands:
    > > $ g++ -std=c++0x -pthread -pedantic -o2 concurrent_sum.c++ -o
    > > concurrent_sum
    > > $ g++ -std=c++0x  -pedantic -o2 sequential_sum.c++ -o sequential_sum
    > > $ time ./concurrent_sum
    > > 1999999999000
    > > real   0m0.014s
    > > user  0m0.016s
    > > sys   0m0.004s
    > > $ time ./sequential_sum
    > > 1999999999000
    > > real   0m0.021s
    > > user  0m0.020s
    > > sys   0m0.000s
    > > Of course the time command differs in several
    > > execution, but honestly, I didn't see so much difference between two
    > > execution.
    > > In addition the generated code for sequential_sum is about 6 KB, but
    > > the size
    > > of concurrent_code is about 45 KB.
    > > Is there any problem in my measurement? Why the concurrent object code
    > > is 7+ times bigger than sequential version? How do you explain it?

    >
    > I don't see a problem. On a single-core machine, I am surprised you
    > saw even that much of an improvement. I guess, just like Pavel, that
    > hyper-threading is working. Your code should give better performance
    > on a multi-core machines. Given what it does, it should hardly touch
    > memory, if at all, so memory should not be a bottleneck.
    >
    > Your executable is 7 times bigger because you linked in (part of)
    > pthreads in it.
    >
    > Goran.


    Thank you Goran. I just found devising appropriate Multi-threaded
    algorithms is much more difficult than just creating and using several
    threads.
    -- Saeed
     
    Saeed Amrollahi, Oct 10, 2011
    #9
  10. Saeed Amrollahi

    Jorgen Grahn Guest

    On Mon, 2011-10-10, Pavel wrote:
    ....
    > As you use GCC, you can run 'objdump -d' on your concurrent binary


    Nitpick: gcc has nothing to do with it. If the binary is on a format
    which objdump understands, objdump will understand it. (The object
    files generated by gcc happen to be in that category.)

    'nm -nC foo.o' might be a simpler tool for this, by the way.

    /Jorgen

    --
    // Jorgen Grahn <grahn@ Oo o. . .
    \X/ snipabacken.se> O o .
     
    Jorgen Grahn, Oct 10, 2011
    #10
  11. Saeed Amrollahi

    Pavel Guest

    Jorgen Grahn wrote:
    > On Mon, 2011-10-10, Pavel wrote:
    > ...
    >> As you use GCC, you can run 'objdump -d' on your concurrent binary

    >
    > Nitpick: gcc has nothing to do with it. If the binary is on a format
    > which objdump understands, objdump will understand it. (The object
    > files generated by gcc happen to be in that category.)
    >
    > 'nm -nC foo.o' might be a simpler tool for this, by the way.
    >
    > /Jorgen
    >


    Yeah, should have probably say "GNU" or whatever but was lazy during wee hours.

    Gcc had something do to with it, however: the line of thinking was something
    like: uses GCC-on-Linux => should have GNU binutils => has objdump, compressed
    with losses to the first keyword :). The point was not that objdump understood
    the OP's binary format but that the OP should have objdump installed.

    -Pavel
     
    Pavel, Oct 15, 2011
    #11
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. jm
    Replies:
    1
    Views:
    519
    alien2_51
    Dec 12, 2003
  2. Pep
    Replies:
    6
    Views:
    837
  3. Replies:
    1
    Views:
    306
  4. Donkey Hot
    Replies:
    3
    Views:
    4,607
    Chase Preuninger
    Apr 27, 2008
  5. Software Engineer
    Replies:
    0
    Views:
    348
    Software Engineer
    Jun 10, 2011
Loading...

Share This Page