Counter-intuitive io vs no-io time readings

  • Thread starter Ney André de Mello Zunino

Ney André de Mello Zunino

Hello.

A discussion at work about how we might need to reconsider our
understanding of basic data structures and their impact on the
performance of algorithms has led me to start writing a simple
benchmarking program.

The full code is pasted below (still a work in progress) and consists
basically of an abstract base class /timed_task/ which takes care of
calling a virtual /do_run/ method and measuring the (wallclock) elapsed
time. The time readings are done with clock_gettime from time.h (I'm
running Ubuntu 13.10 64-bit).

The only test implemented so far consists of sequentially accessing
elements in an int vector and checking whether they are odd or even (see
class /vec_seq_access/).

The test data is dynamically generated and consists of a hundred million
random numbers. The test data generation time is of no importance.

/*** BEGIN CODE ***/
#include <time.h>  // clock_gettime, timespec

#include <cstdint>  // uint32_t
#include <random>
#include <vector>
#include <list>
#include <string>
#include <sstream>
#include <iostream>
#include <iomanip>

const unsigned int TEST_SAMPLE_COUNT = 5;

const unsigned int LIGHT_LOAD = 1000;
const unsigned int MEDIUM_LOAD = 100000;
const unsigned int HEAVY_LOAD = 100000000;

long elapsed_nsec(const timespec& t1, const timespec& t2) {
    return (t2.tv_sec - t1.tv_sec) * 1000000000L + (t2.tv_nsec - t1.tv_nsec);
}

std::string elapsed_info(long nsec) {
    double sec = nsec / 1e9;
    std::ostringstream oss;
    oss << nsec << " nanoseconds ~= " << std::setprecision(5) << sec << " seconds";
    return oss.str();
}

class timed_task {
public:
    virtual ~timed_task() {}
    long run() const {
        timespec t1;
        clock_gettime(CLOCK_REALTIME, &t1);
        do_run();
        timespec t2;
        clock_gettime(CLOCK_REALTIME, &t2);
        return elapsed_nsec(t1, t2);
    }
protected:
    virtual void do_run() const = 0;
};

class vec_seq_access : public timed_task {
public:
    vec_seq_access(const std::vector<unsigned int>& data) : data(data) {
    }
protected:
    virtual void do_run() const {
        int odd_count = 0;
        int even_count = 0;
        for (const auto& i : data) {
            if (i % 2 != 0) ++odd_count; else ++even_count;
        }
        std::cout << odd_count << " odd numbers and " << even_count << " even numbers.\n";
    }
private:
    const std::vector<unsigned int>& data;
};

// TODO
class list_seq_access : public timed_task {
};

auto generate_random_data(int count) {
    timespec t;
    clock_gettime(CLOCK_REALTIME, &t);
    std::mt19937 generator;
    generator.seed(t.tv_nsec);
    std::uniform_int_distribution<uint32_t> dist;
    std::vector<unsigned int> data;
    data.reserve(count);
    for (int k = 0; k < count; ++k) {
        data.push_back(dist(generator));
    }
    return data;
}

void run_test_samples(const std::string& label, const timed_task& task, int count) {
    std::cout << "[TEST] " << label << " (" << count << " runs)\n";
    for (int i = 1; i <= count; ++i) {
        std::cout << "Run " << i << ": " << elapsed_info(task.run()) << '\n';
    }
}

int main() {
    std::cout << "Generating random data..." << std::flush;
    std::vector<unsigned int>&& data = generate_random_data(HEAVY_LOAD);
    std::cout << "\nDone.\n";

    vec_seq_access vsq(data);

    run_test_samples("Vector sequential access", vsq, TEST_SAMPLE_COUNT);
}

/*** END CODE ***/

When I run the program as-is, I get the following output:

Generating random data...
Done.
[TEST] Vector sequential access (5 runs)
50003095 odd numbers and 49996905 even numbers.
Run 1: 1968346953 nanoseconds ~= 1.9683 seconds
50003095 odd numbers and 49996905 even numbers.
Run 2: 1968285632 nanoseconds ~= 1.9683 seconds
50003095 odd numbers and 49996905 even numbers.
Run 3: 1967984546 nanoseconds ~= 1.968 seconds
50003095 odd numbers and 49996905 even numbers.
Run 4: 1968289613 nanoseconds ~= 1.9683 seconds
50003095 odd numbers and 49996905 even numbers.
Run 5: 1968062489 nanoseconds ~= 1.9681 seconds

As you can see, the average time is roughly 1.97 seconds per run. Now,
the counter-intuitive aspect mentioned in the subject came up when I
decided to remove the IO call from the test algorithm, expecting time
readings to decrease a bit. Here is the output with the IO stream
operation removed:

Generating random data...
Done.
[TEST] Vector sequential access (5 runs)
Run 1: 2141563114 nanoseconds ~= 2.1416 seconds
Run 2: 2142123171 nanoseconds ~= 2.1421 seconds
Run 3: 2141130097 nanoseconds ~= 2.1411 seconds
Run 4: 2140915057 nanoseconds ~= 2.1409 seconds
Run 5: 2141052016 nanoseconds ~= 2.1411 seconds

Surprisingly, the average run time has actually increased to about 2.14
seconds. Surely, I must be missing something or maybe there's some weird
combination of factors leading to this outcome. Can anybody see what's
going on?

Compiler: g++ 4.8.1
Compilation command: g++ --std=c++1y -o ds-perf ds-perf.cpp

Thank you in advance,
 

Ian Collins

Ney said:
Hello.

A discussion at work about how we might need to reconsider our
understanding about basic data structures and their impact on the
performance of algorithms has led me to start writing a simple
benchmarking program.

The full code is pasted below (still a work in progress) and consists
basically of an abstract base class /timed_task/ which takes care of
calling a virtual /do_run/ method and measuring the (wallclock) elapsed
time. The time readings are done with clock_gettime from time.h (I'm
running Ubuntu 13.10 64-bit).

The only test implemented so far consists of sequentially accessing
elements in an int vector and checking whether they are odd or even (see
class /vec_seq_access/).

The test data is dynamically generated and consists of a hundred million
random numbers. The test data generation time is of no importance.

/*** BEGIN CODE ***/

class vec_seq_access : public timed_task {
public:
    vec_seq_access(const std::vector<unsigned int>& data) : data(data) {
    }
protected:
    virtual void do_run() const {
        int odd_count = 0;
        int even_count = 0;
        for (const auto& i : data) {
            if (i % 2 != 0) ++odd_count; else ++even_count;
        }
        std::cout << odd_count << " odd numbers and " << even_count << " even numbers.\n";

If this is the line you refer to later, I would expect an optimiser to
remove the body of this function if you remove this line.
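One way to keep the computation alive without doing any I/O is to write the result into a volatile sink so the optimizer has to treat it as observed. A minimal sketch, assuming a hypothetical `sink` variable and `count_odds`/`do_run_no_io` helpers (not names from the original program):

```cpp
#include <vector>

// Writing the result to a volatile "sink" makes it observable, so an
// optimizing compiler cannot delete the counting loop as dead code.
// (sink, count_odds and do_run_no_io are hypothetical names.)
volatile int sink;

int count_odds(const std::vector<unsigned int>& data) {
    int odd_count = 0;
    for (unsigned int i : data) {
        if (i % 2 != 0) ++odd_count;
    }
    return odd_count;
}

void do_run_no_io(const std::vector<unsigned int>& data) {
    sink = count_odds(data);  // result is "used" without any stream I/O
}
```

That way the no-I/O and with-I/O variants time the same loop, and only the stream operation differs.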

auto generate_random_data(int count) {

This auto declaration is missing the return type.

As you can see, the average time is roughly 1.97 seconds per run. Now,
the counter-intuitive aspect mentioned in the subject came up when I
decided to remove the IO call from the test algorithm, expecting time
readings to decrease a bit. Here is the output with the IO stream
operation removed:

The one I highlighted?
Generating random data...
Done.
[TEST] Vector sequential access (5 runs)
Run 1: 2141563114 nanoseconds ~= 2.1416 seconds
Run 2: 2142123171 nanoseconds ~= 2.1421 seconds
Run 3: 2141130097 nanoseconds ~= 2.1411 seconds
Run 4: 2140915057 nanoseconds ~= 2.1409 seconds
Run 5: 2141052016 nanoseconds ~= 2.1411 seconds

Surprisingly, the average run time has actually increased to about 2.14
seconds. Surely, I must be missing something or maybe there's some weird
combination of factors leading to this outcome. Can anybody see what's
going on?

Probably the latter! Did you look at the generated assembly code?
Compiler: g++ 4.8.1
Compilation command: g++ --std=c++1y -o ds-perf ds-perf.cpp

--std=c++1y ?
 
Ney André de Mello Zunino

Ney André de Mello Zunino wrote:
[...]
class vec_seq_access : public timed_task {
public:
    vec_seq_access(const std::vector<unsigned int>& data) : data(data) {
    }
protected:
    virtual void do_run() const {
        int odd_count = 0;
        int even_count = 0;
        for (const auto& i : data) {
            if (i % 2 != 0) ++odd_count; else ++even_count;
        }
        std::cout << odd_count << " odd numbers and " << even_count << " even numbers.\n";

If this is the line you refer to later, I would expect an optimiser
to remove the body of this function if you remove this line.

That's a reasonable assumption. Upon realizing that none of the values
produced in this function are actually read/referenced, the optimizer
could decide that no code needed to be generated. However, that would
make it all even more counter-intuitive, since the program output showed
that the time readings were greater for the no-output case, when
compared to the other case, where the output line was kept (and the code
not optimized away, according to the assumption at hand).
This auto declaration is missing the return type.

Luca Risolia's got it right. I've been away from C++ for a long time and
decided to try out some of its new features.
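For illustration, this is the C++14 feature in question: return type deduction for ordinary functions, which `-std=c++1y` enables. A minimal sketch with a hypothetical `make_squares` helper:

```cpp
#include <vector>

// C++14 return type deduction: the return type is deduced from the
// return statement, so a bare 'auto' is legal here, just as in
// generate_random_data above. (make_squares is an illustrative name.)
auto make_squares(int count) {
    std::vector<int> v;
    for (int k = 0; k < count; ++k) v.push_back(k * k);
    return v;  // deduced as std::vector<int>
}
```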
Probably the latter! Did you look at the generated assembly code?

I did. I've created a repository on BitBucket and pushed the source
code, assembly outputs and a screenshot of the relevant part of the
comparison between them. I've used '-io' and '-no-io' suffixes to refer
to the instances with the 'cout' line enabled and suppressed, respectively.

https://bitbucket.org/Zunino/ds-performance/src
--std=c++1y ?

Again, as Luca Risolia's said, it's g++'s option to enable support for
the latest language features.

P.S.: Just one last note to make matters a bit more puzzling: I've run
the same tests on my laptop at home and, guess what? The results were
more like what I expected, that is, the version without the IO/stream
call did perform better. Not by as much as I'd thought, but it did.
Oh, well...

Thank you all for your time.

Regards,
 
Ike Naar

[...]
Surprisingly, the average run time has actually increased to about 2.14
seconds. Surely, I must be missing something or maybe there's some weird
combination of factors leading to this outcome. Can anybody see what's
going on?

You are measuring elapsed times.
So your measurements depend on the system load.
 

Ney André de Mello Zunino

[...]
Surprisingly, the average run time has actually increased to about 2.14
seconds. Surely, I must be missing something or maybe there's some weird
combination of factors leading to this outcome. Can anybody see what's
going on?

You are measuring elapsed times.
So your measurements depend on the system load.

You are right about that. However, I did several runs of the program,
both with and without the stream I/O call in the benchmark function,
and got consistent (albeit not identical) results.

Nevertheless, I intend to enhance my little test framework to measure
CPU time instead of (or in conjunction with) wallclock time. I'm first
trying to figure out why I did get those strange results.
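A minimal sketch of what that might look like, assuming a hypothetical `cpu_elapsed_nsec` helper built on CLOCK_PROCESS_CPUTIME_ID:

```cpp
#include <ctime>

// Sketch: time a callback with per-process CPU time instead of
// wall-clock time. CLOCK_PROCESS_CPUTIME_ID only counts time this
// process actually spends on a CPU, so other system load does not
// inflate the reading the way it can with CLOCK_REALTIME.
// (cpu_elapsed_nsec is a hypothetical name, not part of the program.)
long cpu_elapsed_nsec(void (*work)()) {
    timespec t1, t2;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t1);
    work();
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t2);
    return (t2.tv_sec - t1.tv_sec) * 1000000000L + (t2.tv_nsec - t1.tv_nsec);
}
```

A captureless lambda converts to the function pointer, so it can be called as `cpu_elapsed_nsec([]{ /* work */ });`. On older glibc versions clock_gettime also requires linking with -lrt.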

Regards,
 
