unit-profiling, similar to unit-testing

Ulrich Eckhardt

Hi!

I'm currently trying to establish a few tests here that evaluate certain
performance characteristics of our systems. As part of this, I found
that these tests are rather similar to unit-tests, only that they are
much more fuzzy and obviously dependent on the systems involved, CPU
load, network load, day of the week (Tuesday is virus scan day) etc.

What I'd just like to ask is how you do such things. Are there tools
available that help? I was considering using the unit testing framework,
but the problem with that is that the results are too hard to interpret
programmatically and too easy to misinterpret manually. Any suggestions?

Cheers!

Uli
 
Roy Smith

Ulrich Eckhardt said:
Hi!

I'm currently trying to establish a few tests here that evaluate certain
performance characteristics of our systems. As part of this, I found
that these tests are rather similar to unit-tests, only that they are
much more fuzzy and obviously dependent on the systems involved, CPU
load, network load, day of the week (Tuesday is virus scan day) etc.

What I'd just like to ask is how you do such things. Are there tools
available that help? I was considering using the unit testing framework,
but the problem with that is that the results are too hard to interpret
programmatically and too easy to misinterpret manually. Any suggestions?

It's really, really, really hard to either control for, or accurately
measure, things like CPU or network load. There's so much stuff you
can't even begin to see. The state of your main memory cache. Disk
fragmentation. What I/O is happening directly out of kernel buffers vs
having to do a physical disk read. How slow your DNS server is today.

What I suggest is instrumenting your unit test suite to record not just
the pass/fail status of every test, but also the test duration. Stick
these into a database as the tests run. Over time, you will accumulate
a whole lot of performance data, which you can then start to mine.
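
To make that concrete, here is a minimal sketch of what such
instrumentation could look like in Python, assuming unittest and an
sqlite3 file; the database name and table layout are made up:

    import sqlite3
    import time
    import unittest

    class TimingTestResult(unittest.TextTestResult):
        """Records each test's wall-clock duration alongside its status."""

        def startTest(self, test):
            super().startTest(test)
            self._t0 = time.time()

        def addSuccess(self, test):
            self._record(test, "pass")
            super().addSuccess(test)

        def addFailure(self, test, err):
            self._record(test, "fail")
            super().addFailure(test, err)

        def addError(self, test, err):
            self._record(test, "error")
            super().addError(test, err)

        def _record(self, test, status):
            elapsed = time.time() - self._t0
            conn = sqlite3.connect("perf_history.db")  # made-up file name
            conn.execute("CREATE TABLE IF NOT EXISTS runs "
                         "(ts REAL, test TEXT, status TEXT, seconds REAL)")
            conn.execute("INSERT INTO runs VALUES (?, ?, ?, ?)",
                         (time.time(), test.id(), status, elapsed))
            conn.commit()
            conn.close()

    if __name__ == "__main__":
        unittest.main(
            testRunner=unittest.TextTestRunner(resultclass=TimingTestResult))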

While you're running the tests, gather as much system performance data
as you can (output of top, vmstat, etc) and stick that into your
database too. You never know when you'll want to refer to the data, so
just collect it all and save it forever.
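
Again just a sketch, assuming a Unix-ish box with vmstat on the PATH
and the same made-up database file as above:

    import sqlite3
    import subprocess
    import time

    def snapshot_system_stats(db_path="perf_history.db"):
        """Grab one vmstat sample and store the raw text next to the timings."""
        raw = subprocess.run(["vmstat", "1", "2"],
                             capture_output=True, text=True).stdout
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS sysstat (ts REAL, vmstat TEXT)")
        conn.execute("INSERT INTO sysstat VALUES (?, ?)", (time.time(), raw))
        conn.commit()
        conn.close()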
 
Ulrich Eckhardt

On 16.11.2011 15:36, Roy Smith wrote:
It's really, really, really hard to either control for, or accurately
measure, things like CPU or network load. There's so much stuff you
can't even begin to see. The state of your main memory cache. Disk
fragmentation. What I/O is happening directly out of kernel buffers vs
having to do a physical disk read. How slow your DNS server is today.

Fortunately, I am in a position where I'm running the tests on one
system (a generic desktop PC) while the system under test is another
one, and there both hardware and software are under my control. Since
that system is rather smallish and embedded, the power and load of the
desktop don't play a significant role; the other side is usually the
bottleneck. ;)

What I suggest is instrumenting your unit test suite to record not just
the pass/fail status of every test, but also the test duration. Stick
these into a database as the tests run. Over time, you will accumulate
a whole lot of performance data, which you can then start to mine.

I'm not sure. I see unit tests as something that makes sure things run
correctly. For performance testing, I have functions to set up and tear
down the environment. Then, I found it useful to have separate code to
prime a cache, which is something done before each test run, but which
is not part of the test run itself. I'm repeating each test run N times,
recording the times and calculating maximum, minimum, average and
standard deviation. Some of this is similar to unit testing (code to set
up/tear down), but other things are too different. Also, sometimes I can
vary tests with a factor F, then I would also want to capture the
influence of this factor. I would even wonder if you can't verify the
behaviour against an expected Big-O complexity somehow.
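
Roughly, the skeleton I have in mind looks like the sketch below; the
setup/prime/run callables are just placeholders for whatever the system
under test needs:

    import statistics
    import time

    def benchmark(run, setup=None, teardown=None, prime=None, repeats=10):
        """Set up once, prime the cache before each timed run (untimed),
        repeat the run N times and report min/max/mean/stdev in seconds."""
        if setup:
            setup()
        times = []
        try:
            for _ in range(repeats):
                if prime:
                    prime()  # e.g. warm a cache; deliberately not timed
                t0 = time.perf_counter()
                run()
                times.append(time.perf_counter() - t0)
        finally:
            if teardown:
                teardown()
        return {"min": min(times), "max": max(times),
                "mean": statistics.mean(times),
                "stdev": statistics.stdev(times)}

    # Varying a factor F would then be a sweep over benchmark() calls, e.g.:
    # results = {f: benchmark(lambda: run_one(f)) for f in (10, 100, 1000)}
    # (run_one is a hypothetical function of the factor F)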

All of this is rather general, not specific to my use case, hence my
question if there are existing frameworks to facilitate this task. Maybe
it's time to create one...

While you're running the tests, gather as much system performance data
as you can (output of top, vmstat, etc) and stick that into your
database too. You never know when you'll want to refer to the data, so
just collect it all and save it forever.

Yes, this is surely something that is necessary, in particular since
there are no clear success/failure outputs like for unit tests and they
require a human to interpret them.


Cheers!

Uli
 
Roy Smith

Ulrich Eckhardt said:
Yes, this is surely something that is necessary, in particular since
there are no clear success/failure outputs like for unit tests and they
require a human to interpret them.

As much as possible, you want to automate things so no human
intervention is required.

For example, let's say you have a test which calls foo() and times how
long it takes. You've already mentioned that you run it N times and
compute some basic (min, max, avg, sd) stats. So far, so good.

The next step is to do some kind of regression against past results.
Once you've got a bunch of historical data, it should be possible to
look at today's numbers and detect any significant change in performance.
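
A crude version of that check, assuming the kind of (test name,
duration) history table sketched earlier, could be as simple as:

    import sqlite3
    import statistics

    def flag_regressions(db_path="perf_history.db", threshold=3.0):
        """Flag any test whose latest run is more than `threshold` standard
        deviations slower than its own historical mean."""
        conn = sqlite3.connect(db_path)
        tests = [row[0] for row in
                 conn.execute("SELECT DISTINCT test FROM runs")]
        for name in tests:
            times = [row[0] for row in conn.execute(
                "SELECT seconds FROM runs WHERE test = ? ORDER BY ts", (name,))]
            if len(times) < 10:
                continue  # not enough history to say anything yet
            history, latest = times[:-1], times[-1]
            mu = statistics.mean(history)
            sigma = statistics.stdev(history)
            if sigma > 0 and (latest - mu) / sigma > threshold:
                print("%s: %.3fs is %.1f sigma above its mean"
                      % (name, latest, (latest - mu) / sigma))
        conn.close()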

Much as I loathe the bureaucracy and religious fervor which has grown up
around Six Sigma, it does have some good tools. You might want to look
into control charts (http://en.wikipedia.org/wiki/Control_chart). You
think you've got the test environment under control, do you? Try
plotting a month's worth of run times for a particular test on a control
chart and see what it shows.
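
For a quick look, something like this plots an individuals chart; it
uses the plain sample standard deviation for the control limits rather
than the usual moving-range estimate, which is good enough for
eyeballing:

    import statistics
    import matplotlib.pyplot as plt

    def control_chart(times, title="run times"):
        """Plot run times for one test with the mean and +/- 3 sigma limits."""
        mu = statistics.mean(times)
        sigma = statistics.stdev(times)
        plt.plot(times, marker="o")
        plt.axhline(mu, color="green", label="mean")
        plt.axhline(mu + 3 * sigma, color="red", linestyle="--", label="UCL")
        plt.axhline(mu - 3 * sigma, color="red", linestyle="--", label="LCL")
        plt.title(title)
        plt.xlabel("run #")
        plt.ylabel("seconds")
        plt.legend()
        plt.show()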

Assuming your process really is under control, I would write scripts
that did the following kinds of analysis (a rough sketch of both follows
the list):

1) For a given test, do a linear regression of run time vs date. If the
line has any significant positive slope, you want to investigate why.

2) You already mentioned, "I would even wonder if you can't verify the
behaviour against an expected Big-O complexity somehow". Of course you
can. Run your test a bunch of times with different input sizes. I
would try something like a 1-2-5 progression over several decades (i.e.
input sizes of 10, 20, 50, 100, 200, 500, 1000, etc.). You will have to
figure out what an appropriate range is, and how to generate useful
input sets. Now, curve fit your performance numbers to various shape
curves and see what correlation coefficient you get.
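
Here's the kind of script I mean for both, just a sketch using numpy;
the candidate curves and cut-offs are up to you:

    import numpy as np

    def trend_slope(days, seconds):
        """Analysis 1: least-squares slope of run time vs. date (e.g. days
        since the first run).  A clearly positive slope means the test has
        been getting slower and deserves a look."""
        slope, _intercept = np.polyfit(days, seconds, 1)
        return slope

    def best_complexity_fit(sizes, seconds):
        """Analysis 2: correlate the measured times against a few candidate
        complexity curves and report the best match."""
        candidates = {
            "O(n)":       lambda n: n,
            "O(n log n)": lambda n: n * np.log(n),
            "O(n^2)":     lambda n: n ** 2,
        }
        sizes = np.asarray(sizes, dtype=float)
        secs = np.asarray(seconds, dtype=float)
        best_name, best_r = None, -1.0
        for name, curve in candidates.items():
            r = np.corrcoef(curve(sizes), secs)[0, 1]
            if r > best_r:
                best_name, best_r = name, r
        return best_name, best_r   # e.g. ("O(n log n)", 0.998)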

All that being said, in my experience, nothing beats plotting your data
and looking at it.
 
Tycho Andersen

Roy Smith said:
It's really, really, really hard to either control for, or accurately
measure, things like CPU or network load. There's so much stuff you
can't even begin to see. The state of your main memory cache. Disk
fragmentation. What I/O is happening directly out of kernel buffers vs
having to do a physical disk read. How slow your DNS server is today.

While I agree there's a lot of things you can't control for, you can
get a more accurate picture by using CPU time instead of wall time
(e.g. the clock() system call). If what you care about is mostly CPU
time, you can control for the "your disk is fragmented", "your DNS
server died", or "my cow-orker was banging on the test machine" this
way.
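
In Python terms that boils down to something like the sketch below
(process_time() and perf_counter() are Python 3.3+; process_time() is
roughly the CPU-time counterpart of C's clock()):

    import time

    def cpu_and_wall(fn, *args, **kwargs):
        """Time a call both ways: process_time() counts only the CPU this
        process used, while perf_counter() is wall time.  A large gap
        between the two usually means waiting on I/O, the network or a
        busy machine rather than burning CPU."""
        cpu0, wall0 = time.process_time(), time.perf_counter()
        result = fn(*args, **kwargs)
        cpu = time.process_time() - cpu0
        wall = time.perf_counter() - wall0
        return result, cpu, wall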

 
spartan.the

Roy Smith said:
As much as possible, you want to automate things so no human
intervention is required.

For example, let's say you have a test which calls foo() and times how
long it takes.  You've already mentioned that you run it N times and
compute some basic (min, max, avg, sd) stats.  So far, so good.

The next step is to do some kind of regression against past results.
Once you've got a bunch of historical data, it should be possible to
look at today's numbers and detect any significant change in performance.

Much as I loathe the bureaucracy and religious fervor which has grown up
around Six Sigma, it does have some good tools.  You might want to look
into control charts (http://en.wikipedia.org/wiki/Control_chart).  You
think you've got the test environment under control, do you?  Try
plotting a month's worth of run times for a particular test on a control
chart and see what it shows.

Assuming your process really is under control, I would write scripts
that did the following kinds of analysis:

1) For a given test, do a linear regression of run time vs date.  If the
line has any significant positive slope, you want to investigate why.

2) You already mentioned, "I would even wonder if you can't verify the
behaviour against an expected Big-O complexity somehow".  Of course you
can.  Run your test a bunch of times with different input sizes.  I
would try something like a 1-2-5 progression over several decades (i.e.
input sizes of 10, 20, 50, 100, 200, 500, 1000, etc.).  You will have to
figure out what an appropriate range is, and how to generate useful
input sets.  Now, curve fit your performance numbers to various shape
curves and see what correlation coefficient you get.

All that being said, in my experience, nothing beats plotting your data
and looking at it.

I strongly agree with Roy, here.

Ulrich, I recommend you explore how Google measures App Engine's
health here: http://code.google.com/status/appengine.

Unit tests are inappropriate here; a single unit test can only answer
PASS or FAIL, yes or no. It can't answer the question "how much". That
is, unless you simply want to use unit tests anyway, in which case
these arguments don't really apply.

I suggest:

1. Decide what you want to measure. Each measurement result should be a
single number in a fixed range (say 0..100 or -5..5) so you can plot it.
2. Write small no-UI programs that produce each number (measure) and
write it to CSV. Run each of them several times, throw away the single
worst and single best result, and average the rest (a rough sketch
follows this list).
3. Collect the data over some period of time.
4. Plot those averages over a time axis (easy to do from CSV).
5. Automate the whole process (batch files or the like) so the plot is
regenerated automatically every hour or every day.
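
A rough sketch of steps 2 and 4, with a made-up ./measure_one program
standing in for your no-UI measuring tool:

    import csv
    import statistics
    import subprocess
    import time
    import matplotlib.pyplot as plt

    def record_measure(csv_path, runs=5):
        """Step 2: run the measuring program several times, drop the best
        and worst result, append the average plus a timestamp to a CSV."""
        values = []
        for _ in range(runs):
            out = subprocess.run(["./measure_one"], capture_output=True,
                                 text=True, check=True).stdout
            values.append(float(out.strip()))  # program prints one number
        trimmed = sorted(values)[1:-1]
        with open(csv_path, "a", newline="") as f:
            csv.writer(f).writerow([time.time(), statistics.mean(trimmed)])

    def plot_history(csv_path):
        """Step 4: plot the averaged measurements over time."""
        with open(csv_path, newline="") as f:
            rows = [(float(ts), float(val)) for ts, val in csv.reader(f)]
        plt.plot([ts for ts, _ in rows], [val for _, val in rows], marker="o")
        plt.xlabel("timestamp")
        plt.ylabel("average measure")
        plt.show()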

And then, after a month, you can decide whether you want to divide your
number ranges into green/yellow/red zones. More often than not you may
find that your measurements are so inaccurate and noisy that you can't
trust them. Then you'll either drop the whole idea or dive into the math
(statistics). You have about a 5% chance of succeeding ;)
 
Roy Smith

Tycho Andersen said:
While I agree there's a lot of things you can't control for, you can
get a more accurate picture by using CPU time instead of wall time
(e.g. the clock() system call). If what you care about is mostly CPU
time [...]

That's a big if. In some cases, CPU time is important, but more often,
wall-clock time is more critical. Let's say I've got two versions of a
program. Here's some results for my test run:

Version   CPU Time    Wall-Clock Time
1         2 hours     2.5 hours
2         1.5 hours   5.0 hours

Between versions, I reduced the CPU time to complete the given task, but
increased the wall clock time. Perhaps I doubled the size of some hash
table. Now I get a lot fewer hash collisions (so I spend less CPU time
re-hashing), but my memory usage went up so I'm paging a lot and my
locality of reference went down so my main memory cache hit rate is
worse.

Which is better? I think most people would say version 1 is better.

CPU time is only important in a situation where the system is CPU bound.
In many real-life cases, that's not at all true. Things can be memory
bound. Or I/O bound (which, when you consider paging, is often the same
thing as memory bound). Or lock-contention bound.

Before you start measuring things, it's usually a good idea to know
what you want to measure, and why :)
 
