Concurrent code performance

Saeed Amrollahi · Oct 9, 2011

Hi All C++ developers

I learned (from theory and practice) the most important benefit of
qualified multi-threaded programming is better performance.
Here it is a (not well-written) concurrent (two worker threads)
and sequential version
of Summation of [0, 2000000[ interval.
Of course my concurrent code is not good. It's for exposition only.

// concurrent code
#include <thread>
#include <iostream>

using namespace std;
long long s1 = 0, s2 = 0;
void Sum(int first, int last, long long& res) // calculate the sum in
[first, last[ interval
{
while (first < last)
res += first++;
}

int main()
{
long long r1 = 0, r2 = 0;
thread t1(Sum, 0, 1000000, std::ref(r1));
thread t2(Sum, 1000000, 2000000, std::ref(r2));
t1.join(); t2.join();
cout << r1 + r2 << '\n';

return 0;
}

// sequential code
#include <iostream>

using namespace std;
void Sum(int first, int last, long long& res) // calculate the sum in
[first, last[ interval
{
while (first < last)
res += first++;
}

int main()
{
long long r = 0;
Sum(0, 2000000, r);

cout << r<< '\n';

return 0;
}

I compiled and ran two codes using the following commands:
$ g++ -std=c++0x -pthread -pedantic -o2 concurrent_sum.c++ -o
concurrent_sum
$ g++ -std=c++0x -pedantic -o2 sequential_sum.c++ -o sequential_sum
$ time ./concurrent_sum
1999999999000
real 0m0.014s
user 0m0.016s
sys 0m0.004s
$ time ./sequential_sum
1999999999000
real 0m0.021s
user 0m0.020s
sys 0m0.000s
Of course the time command differs in several
execution, but honestly, I didn't see so much difference between two
execution.
In addition the generated code for sequential_sum is about 6 KB, but
the size
of concurrent_code is about 45 KB.
Is there any problem in my measurement? Why the concurrent object code
is 7+ times bigger than sequential version? How do you explain it?

FYI, I use GCC 4.6.0 under Ubuntu on a Dell single processor notebook.
Please shed some light.

-- Saeed Amrollahi

Marc · Oct 9, 2011

Saeed said:
Hi All C++ developers

I learned (from theory and practice) the most important benefit of
qualified multi-threaded programming is better performance.
Here it is a (not well-written) concurrent (two worker threads)
and sequential version
of Summation of [0, 2000000[ interval.
Of course my concurrent code is not good. It's for exposition only.

// concurrent code
#include <thread>
#include <iostream>

using namespace std;
long long s1 = 0, s2 = 0;
void Sum(int first, int last, long long& res) // calculate the sum in
[first, last[ interval
{
while (first < last)
res += first++;
}

int main()
{
long long r1 = 0, r2 = 0;
thread t1(Sum, 0, 1000000, std::ref(r1));
thread t2(Sum, 1000000, 2000000, std::ref(r2));

I would make Sum return a long long instead of updating a reference.
If you are unlucky, r1 and r2 end up on the same cache line, you have
false sharing and the concurrent version ends up slower than the
sequential one.

t1.join(); t2.join();
cout << r1 + r2 << '\n';

return 0;
}

In addition the generated code for sequential_sum is about 6 KB, but
the size
of concurrent_code is about 45 KB.
Is there any problem in my measurement? Why the concurrent object code
is 7+ times bigger than sequential version? How do you explain it?

Your basic code is very simple: a loop, a couple additions, almost
nothing. It is not really surprising that code meant to handle
multiple threads is several times larger. Now it might be possible for
libstdc++ to factor more of that code into the shared library, but
that's a different, more complicated question.

Saeed Amrollahi · Oct 9, 2011

Saeed Amrollahi wrote:

Hi All C++ developers

Click to expand...

I learned (from theory and practice) the most important benefit of
qualified multi-threaded programming is better performance.
Here it is a (not well-written) concurrent (two worker threads)
and sequential version
of Summation of [0, 2000000[ interval.
Of course my concurrent code is not good. It's for exposition only.

Click to expand...

// concurrent code
#include <thread>
#include <iostream>

Click to expand...

using namespace std;
long long s1 = 0, s2 = 0;
void Sum(int first, int last, long long& res) // calculate the sum in
[first, last[ interval
{
while (first < last)
res += first++;
}

Click to expand...

int main()
{
long long r1 = 0, r2 = 0;
thread t1(Sum, 0, 1000000, std::ref(r1));
thread t2(Sum, 1000000, 2000000, std::ref(r2));

Click to expand...

I would make Sum return a long long instead of updating a reference.
If you are unlucky, r1 and r2 end up on the same cache line, you have
false sharing and the concurrent version ends up slower than the
sequential one.

I know. As I noted, my code is not well-written. but I tried
to write two equivalent code.
Can you explain your answer in more detail? the cache line thing
and false sharing ...

Your basic code is very simple: a loop, a couple additions, almost
nothing. It is not really surprising that code meant to handle
multiple threads is several times larger. Now it might be possible for
libstdc++ to factor more of that code into the shared library, but
that's a different, more complicated question.

Why it's not surprising? is the size of my threads big?
Would you explain more.

TIA,
-- Saeed Amrollahi

Marc · Oct 9, 2011

Saeed said:
Can you explain your answer in more detail? the cache line thing
and false sharing ...

There's a wikipedia article on this...

Why it's not surprising? is the size of my threads big?
Would you explain more.

Size of thread: basically how much you call operator new.
Size of object file: basically length of (used) source code.

The basic code is self contained and fits in less than 10 lines. The
threaded code includes the <thread> header, which is rather long...

Pavel · Oct 10, 2011

Saeed said:
Hi All C++ developers ....

I compiled and ran two codes using the following commands:
$ g++ -std=c++0x -pthread -pedantic -o2 concurrent_sum.c++ -o
concurrent_sum
$ g++ -std=c++0x -pedantic -o2 sequential_sum.c++ -o sequential_sum ....

Is there any problem in my measurement?

Not that I can see. You had a decent speed-up in real-time of 50% (or 33%,
depending on how you count). Speed-up factor would almost never be exactly 2 for
2 threads as there are lots of resources in your system that can become
bottlenecks (you may consider reading "What Every Programmer Should Know About
Memory" by Ulrich Drepper, available, for example, at
www.akkadia.org/drepper/cpumemory.pdf); also, context switches would take some
time. Probably your single-CPU notebook has hyper-threading turned on (try 'cat
/proc/cpuinfo' to say for sure); else I can't explain that you saw the reduction
in real time at all.

Why the concurrent object code
is 7+ times bigger than sequential version? How do you explain it?

.....
The concurrent code has thread library code linked in whereas the sequential one
doesn't. As you use GCC, you can run 'objdump -d' on your concurrent binary and
see what exactly functions were added as compared to the sequential version.

-- Saeed Amrollahi

Hope this helps,
-Pavel

Goran · Oct 10, 2011

Hi All C++ developers

I learned (from theory and practice) the most important benefit of
qualified multi-threaded programming is better performance.
Here it is a (not well-written) concurrent (two worker threads)
and sequential version
of Summation of [0, 2000000[ interval.
Of course my concurrent code is not good. It's for exposition only.

// concurrent code
#include <thread>
#include <iostream>

using namespace std;
long long s1 = 0, s2 = 0;
void Sum(int first, int last, long long& res) // calculate the sum in
[first, last[ interval
{
while (first < last)
res += first++;

}

int main()
{
long long r1 = 0, r2 = 0;
thread t1(Sum, 0, 1000000, std::ref(r1));
thread t2(Sum, 1000000, 2000000, std::ref(r2));
t1.join(); t2.join();
cout << r1 + r2 << '\n';

return 0;

}

// sequential code
#include <iostream>

using namespace std;
void Sum(int first, int last, long long& res) // calculate the sum in
[first, last[ interval
{
while (first < last)
res += first++;

}

int main()
{
long long r = 0;
Sum(0, 2000000, r);

cout << r<< '\n';

return 0;

}

I compiled and ran two codes using the following commands:
$ g++ -std=c++0x -pthread -pedantic -o2 concurrent_sum.c++ -o
concurrent_sum
$ g++ -std=c++0x -pedantic -o2 sequential_sum.c++ -o sequential_sum
$ time ./concurrent_sum
1999999999000
real 0m0.014s
user 0m0.016s
sys 0m0.004s
$ time ./sequential_sum
1999999999000
real 0m0.021s
user 0m0.020s
sys 0m0.000s
Of course the time command differs in several
execution, but honestly, I didn't see so much difference between two
execution.
In addition the generated code for sequential_sum is about 6 KB, but
the size
of concurrent_code is about 45 KB.
Is there any problem in my measurement? Why the concurrent object code
is 7+ times bigger than sequential version? How do you explain it?

I don't see a problem. On a single-core machine, I am surprised you
saw even that much of an improvement. I guess, just like Pavel, that
hyper-threading is working. Your code should give better performance
on a multi-core machines. Given what it does, it should hardly touch
memory, if at all, so memory should not be a bottleneck.

Your executable is 7 times bigger because you linked in (part of)
pthreads in it.

Goran.

Saeed Amrollahi · Oct 10, 2011

Saeed Amrollahi wrote:

There's a wikipedia article on this...

Size of thread: basically how much you call operator new.
Size of object file: basically length of (used) source code.

The basic code is self contained and fits in less than 10 lines. The
threaded code includes the <thread> header, which is rather long...

Thank you for your answers Marc.
-- Saeed

Saeed Amrollahi · Oct 10, 2011

Not that I can see. You had a decent speed-up in real-time of 50% (or 33%,
depending on how you count). Speed-up factor would almost never be exactly 2 for
2 threads as there are lots of resources in your system that can become
bottlenecks

Memory" by Ulrich Drepper, available, for example, atwww.akkadia.org/drepper/cpumemory.pdf);

I know about the paper, but I haven't to read it yet.
also, context switches would take some

time. Probably your single-CPU notebook has hyper-threading turned on (try 'cat
/proc/cpuinfo' to say for sure); else I can't explain that you saw the reduction
in real time at all.

....
The concurrent code has thread library code linked in whereas the sequential one
doesn't. As you use GCC, you can run 'objdump -d' on your concurrent binary and
see what exactly functions were added as compared to the sequential version.

Hope this helps,
-Pavel

Thank you very much. Besides concurrency concepts, I learned
a few UNIX-related commands.
-- Saeed

Saeed Amrollahi · Oct 10, 2011

Hi All C++ developers

Click to expand...

I learned (from theory and practice) the most important benefit of
qualified multi-threaded programming is better performance.
Here it is a (not well-written) concurrent (two worker threads)
and sequential version
of Summation of [0, 2000000[ interval.
Of course my concurrent code is not good. It's for exposition only.

Click to expand...

// concurrent code
#include <thread>
#include <iostream>

Click to expand...

using namespace std;
long long s1 = 0, s2 = 0;
void Sum(int first, int last, long long& res) // calculate the sum in
[first, last[ interval
{
while (first < last)
res += first++;

}

Click to expand...

int main()
{
long long r1 = 0, r2 = 0;
thread t1(Sum, 0, 1000000, std::ref(r1));
thread t2(Sum, 1000000, 2000000, std::ref(r2));
t1.join(); t2.join();
cout << r1 + r2 << '\n';

Click to expand...

return 0;

// sequential code
#include <iostream>

Click to expand...

using namespace std;
void Sum(int first, int last, long long& res) // calculate the sum in
[first, last[ interval
{
while (first < last)
res += first++;

}

Click to expand...

int main()
{
long long r = 0;
Sum(0, 2000000, r);

Click to expand...

cout << r<< '\n';

Click to expand...

return 0;

I compiled and ran two codes using the following commands:
$ g++ -std=c++0x -pthread -pedantic -o2 concurrent_sum.c++ -o
concurrent_sum
$ g++ -std=c++0x -pedantic -o2 sequential_sum.c++ -o sequential_sum
$ time ./concurrent_sum
1999999999000
real 0m0.014s
user 0m0.016s
sys 0m0.004s
$ time ./sequential_sum
1999999999000
real 0m0.021s
user 0m0.020s
sys 0m0.000s
Of course the time command differs in several
execution, but honestly, I didn't see so much difference between two
execution.
In addition the generated code for sequential_sum is about 6 KB, but
the size
of concurrent_code is about 45 KB.
Is there any problem in my measurement? Why the concurrent object code
is 7+ times bigger than sequential version? How do you explain it?

Click to expand...

I don't see a problem. On a single-core machine, I am surprised you
saw even that much of an improvement. I guess, just like Pavel, that
hyper-threading is working. Your code should give better performance
on a multi-core machines. Given what it does, it should hardly touch
memory, if at all, so memory should not be a bottleneck.

Your executable is 7 times bigger because you linked in (part of)
pthreads in it.

Goran.

Thank you Goran. I just found devising appropriate Multi-threaded
algorithms is much more difficult than just creating and using several
threads.
-- Saeed

Jorgen Grahn · Oct 10, 2011

.

As you use GCC, you can run 'objdump -d' on your concurrent binary

Nitpick: gcc has nothing to do with it. If the binary is on a format
which objdump understands, objdump will understand it. (The object
files generated by gcc happen to be in that category.)

'nm -nC foo.o' might be a simpler tool for this, by the way.

/Jorgen

Pavel · Oct 15, 2011

Jorgen said:
Nitpick: gcc has nothing to do with it. If the binary is on a format
which objdump understands, objdump will understand it. (The object
files generated by gcc happen to be in that category.)

'nm -nC foo.o' might be a simpler tool for this, by the way.

/Jorgen

Yeah, should have probably say "GNU" or whatever but was lazy during wee hours.

Gcc had something do to with it, however: the line of thinking was something
like: uses GCC-on-Linux => should have GNU binutils => has objdump, compressed
with losses to the first keyword

. The point was not that objdump understood
the OP's binary format but that the OP should have objdump installed.

-Pavel

Fibonacci C code	1	Aug 4, 2018
Trouble with prediction code, for the life of me I can't figure out why it isnt running properly. Help would be appreciated.	0	Jul 8, 2023
threaded dla, _very_ naive, interest test 0...	3	Jun 10, 2014
Adding adressing of IPv6 to program	1	Feb 16, 2023
Drawing missing in bitmap in a pure C win32 program	4	Jun 3, 2023
Performance problem with Pthread code	1	Mar 26, 2010
std::thread constructor does not handle references without std::ref	2	Oct 21, 2012
8 buttons ,3 states and PJON Arduino	0	Jan 15, 2022

Concurrent code performance

Saeed Amrollahi

Marc

Saeed Amrollahi

Marc

Pavel

Goran

Saeed Amrollahi

Saeed Amrollahi

Saeed Amrollahi

Jorgen Grahn

Pavel

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads