micro-benchmarking


Giovanni Azua

Hi all,

I was busy today doing some micro-benchmark performance testing of the
[revised and optimized] PerfectJPattern Visitor implementation and wanted
to ask for advice on whether anything could be improved here.

The benchmark is implemented as a Test Case like this:
http://perfectjpattern.svn.sourcefo...oral/visitor/TestPerformance.java?view=markup

The class being benchmarked is the following:
http://perfectjpattern.svn.sourcefo...oral/visitor/AbstractVisitor.java?view=markup

Model class (requires parameter covariance matching):
http://perfectjpattern.svn.sourcefo...ehavioral/visitor/StyledBody.java?view=markup

Visitors:

(without caching)
http://perfectjpattern.svn.sourcefo...behavioral/visitor/DoVisitor.java?view=markup

(with caching)
http://perfectjpattern.svn.sourcefo...avioral/visitor/PrintVisitor.java?view=markup

I run the test case separately from the command line like this:
$
mvn -DargLine="-XX:+PrintCompilation -verbose:gc -Xbatch -XX:CICompilerCount=1
-Xms256m -Xmx512m" -Dtest=TestPerformance test

I consistently get results like the following:

***************************************************
TestPerformance Classical took '859.0' millis
TestPerformance PerfectJPattern's without caching took '1735.0' millis
***************************************************

***************************************************
INFO TestPerformance Classical took '812.0' millis
INFO TestPerformance PerfectJPattern's with caching took '875.0' millis
***************************************************

The classical version does two separate dispatches, the same as in hand-written
Visitor implementations.
The PerfectJPattern version emulates double dispatch on top of a Delegate.
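
To make the comparison concrete, here is a rough, self-contained sketch of the
two variants. The class and method names are only illustrative, not the actual
PerfectJPattern API (the real code is behind the SVN links above); the cached
flavour would simply memoize the reflective lookup per element type.

// Classical Visitor: double dispatch via accept() + overloaded visit().
interface Element { void accept(ClassicVisitor v); }

interface ClassicVisitor {
    void visit(Paragraph paragraph);
    void visit(Table table);
}

class Paragraph implements Element {
    // first dispatch selects accept() on the element,
    // second dispatch selects the matching visit() overload
    public void accept(ClassicVisitor v) { v.visit(this); }
}

class Table implements Element {
    public void accept(ClassicVisitor v) { v.visit(this); }
}

// Emulated double dispatch: one generic entry point that resolves the matching
// visit method at runtime (a crude stand-in for a Delegate lookup).
abstract class EmulatedVisitor {
    public void visit(Object element) {
        try {
            // exact-type lookup only - the real implementation also needs
            // covariant parameter matching (cf. the StyledBody case above)
            java.lang.reflect.Method target = getClass().getMethod(
                "visit" + element.getClass().getSimpleName(), element.getClass());
            target.invoke(this, element);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}

class PrintingVisitor extends EmulatedVisitor {
    public void visitParagraph(Paragraph paragraph) { /* ... */ }
    public void visitTable(Table table)             { /* ... */ }
}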

If you want to run the test, you can do so by checking out the code like
this:
$ svn checkout
https://perfectjpattern.svn.sourceforge.net/svnroot/perfectjpattern/trunk
perfectjpattern

Thanks in advance,
Best regards,
Giovanni
 

Mark Space

A few quick thoughts:

1. Are your JVM parameters redundant with -server? That flag might
remove the need for a "warm-up time" for all tests. -server might also
allow others to test and compare with your results more easily.

2. Please use System.nanoTime() instead of System.currentTimeMillis().
I think the former is "the standard" for benchmarking anyway (a minimal
warm-up/timing sketch follows this list).

3. Most visitor patterns have many types of visitors. I'd suggest
adding a few more (10 total?) to simulate a "real" use of visitor.
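
To illustrate points 1 and 2: a minimal warm-up plus nanoTime() skeleton might
look like the following. This is just a sketch with a placeholder workload and
arbitrary iteration counts, not your TestPerformance code.

// Minimal warm-up + measurement skeleton (placeholder workload).
public class BenchSkeleton {
    static final int WARMUP = 20000;     // iterations before any timing
    static final int MEASURED = 100000;  // iterations actually timed

    static long sink;                    // keep results alive so the loop is not optimized away

    static void workload() {
        sink += System.identityHashCode(new Object());
    }

    public static void main(String[] args) {
        for (int i = 0; i < WARMUP; i++) {
            workload();                  // give HotSpot a chance to compile the hot path
        }
        long start = System.nanoTime();
        for (int i = 0; i < MEASURED; i++) {
            workload();
        }
        long elapsed = System.nanoTime() - start;
        System.out.println("avg per call: " + (elapsed / MEASURED) + " nanos");
    }
}

Run it with and without -server and compare.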


Good job on following up with this, btw.
 

Roedy Green

Giovanni Azua said:
I was busy today doing some micro-benchmark performance testing of the
[revised and optimized] PerfectJPattern Visitor implementation and wanted
to ask for advice on whether anything could be improved here.

see http://mindprod.com/jgloss/benchmark.html

if you are not familiar with System.nanoTime() see
http://mindprod.com/jgloss/time.html
--
Roedy Green Canadian Mind Products
http://mindprod.com

"Species evolve exactly as if they were adapting as best they could to a changing world, and not at all as if they were moving toward a set goal."
~ George Gaylord Simpson
 

Arne Vajhøj

Mark said:
A few quick thoughts:

1. Are your JVM parameters redundant with -server? That flag might
remove the need for a "warm-up time" for all tests. -server might also
allow others to test and compare with your results more easily.

I don't think -server will remove the need for warm-up.
2. Please use System.nanoTime() instead of System.currentTimeMillis(). I
think the former is "the standard" for benchmarking anyway.

It does not matter. If millisecond granularity is not enough, then the
results are worthless on OSes like Windows and *nix anyway.

Arne
 

Mark Space

Arne said:
It does not matter. If millisecond granularity is not enough, then the
results are worthless on OSes like Windows and *nix anyway.

currentTimeMillis(), on Windows at least, could use a very inaccurate
timer that doesn't even compute milliseconds correctly. It's always off
by like 18 milliseconds one way or the other (I'd have to look up the
exact value). nanoTime() isn't accurate to nanoseconds, but it's
usually accurate to at least 1 millisecond.
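
A quick way to see this on a given machine is to probe both timers directly.
This little program (nothing to do with the benchmark itself) spins until
currentTimeMillis() ticks over and prints the step size:

public class TimerGranularity {
    public static void main(String[] args) {
        // Spin until currentTimeMillis() ticks over and report the step size.
        for (int i = 0; i < 5; i++) {
            long before = System.currentTimeMillis();
            long after;
            while ((after = System.currentTimeMillis()) == before) {
                // busy-wait for the next tick
            }
            System.out.println("currentTimeMillis step: " + (after - before) + " ms");
        }
        // For comparison: the distance between two consecutive nanoTime() calls.
        long t0 = System.nanoTime();
        long t1 = System.nanoTime();
        System.out.println("consecutive nanoTime() calls: " + (t1 - t0) + " ns");
    }
}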
 

Arne Vajhøj

Mark said:
currentTimeMillis(), on Windows at least, could use a very inaccurate
timer that doesn't even compute milliseconds correctly. It's always off
by like 18 milliseconds one way or the other (I'd have to look up the
exact value). nanoTime() isn't accurate to nanoseconds, but it's
usually accurate to at least 1 millisecond.

Sure.

But given the unpredictable scheduling of other processes, you still
cannot meaningfully measure such small intervals.

Arne
 

Tom Anderson

Arne Vajhøj said:
Sure.

But given the unpredictable scheduling of other processes, you still
cannot meaningfully measure such small intervals.

Sure you can, you just have to do a lot of runs. This is why when i do
benchmarks, i do 100 runs, and ignore the five slowest, on the grounds
that they probably took so long because either another process was
scheduled in the middle, or GC kicked in or something. For this to be
effective, you actually want to keep the individual runs quite short, to
minimise the chance of them being disrupted, and do a huge number of them,
and for this, a high-resolution timer is essential.
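
As a sketch, with a placeholder workload and arbitrary counts, the whole thing
boils down to this:

import java.util.Arrays;

// Many short runs, ignore the slowest few (placeholder workload, warm-up omitted).
public class ManyRuns {
    static long sink;

    static void workload() {
        for (int i = 0; i < 10000; i++) { sink += i; }
    }

    public static void main(String[] args) {
        final int runs = 100;
        final int discard = 5;           // drop the 5 slowest runs
        long[] times = new long[runs];

        for (int r = 0; r < runs; r++) {
            long start = System.nanoTime();
            workload();
            times[r] = System.nanoTime() - start;
        }

        Arrays.sort(times);              // slowest runs end up at the top
        long total = 0;
        for (int r = 0; r < runs - discard; r++) {
            total += times[r];
        }
        System.out.println("avg of kept runs: " + (total / (runs - discard)) + " ns");
    }
}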

tom
 

Giovanni Azua

Hi Mark,

Thank you for your nice feedback!

Mark Space said:
1. Are your JVM parameters redundant with -server? That flag might remove
the need for a "warm-up time" for all tests. -server might also allow
others to test and compare with your results more easily.
I am still looking around but have not yet found what -server actually
does ...
2. Please use System.nanoTime() instead of System.currentTimeMillis(). I
think the former is "the standard" for benchmarking anyway.

3. Most visitor patterns have many types of visitors. I'd suggest adding
a few more (10 total?) to simulate a "real" use of visitor.
I have already slightly adapted the performance test for this, but I am
not sure it is useful to add many more Visitors ... do you mean iterating
over many Visitors in one step? I think it is more relevant in this
micro-benchmark to see how it behaves in a single "typical" implementation
using one Visitor, no?

I get the following results now:

Not cached
***************************************************
classical: avg per call '22541' nanos
perfectjpattern (no cache): avg per call '34889' nanos
***************************************************
classical: total elapsed '676' millis
perfectjpattern (no cache): total elapsed '1047' millis
perfectjpattern (no cache) took 154.78% must be lower than 250% the classic
***************************************************

Cached
***************************************************
classical: avg per call '21197' nanos
perfectjpattern (cache): avg per call '22359' nanos
***************************************************
classical: total elapsed '636' millis
perfectjpattern (cache): total elapsed '671' millis
perfectjpattern (cache) took 105.48% must be lower than 115% the classic
***************************************************

A good idea (I think brought up by Tom) would be to measure each iteration
separately and then discard outliers by e.g. discarding those that exceed
the abs diff between the mean and the stddev.
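
Concretely, something like the following (just a sketch; the timings array is
assumed to hold the per-iteration measurements in nanos):

import java.util.ArrayList;
import java.util.List;

class OutlierFilter {
    // Keep only iterations whose distance from the mean is within one stddev.
    static List<Long> discardOutliers(long[] timings) {
        double mean = 0;
        for (long t : timings) mean += t;
        mean /= timings.length;

        double variance = 0;
        for (long t : timings) variance += (t - mean) * (t - mean);
        double stddev = Math.sqrt(variance / timings.length);

        List<Long> kept = new ArrayList<Long>();
        for (long t : timings) {
            if (Math.abs(t - mean) <= stddev) kept.add(t);
        }
        return kept;
    }
}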

Best regards,
Giovanni
 

Lew

Tom said:
Sure you can, you just have to do a lot of runs. This is why when i do
benchmarks, i do 100 runs, and ignore the five slowest, on the grounds
that they probably took so long because either another process was
scheduled in the middle, or GC kicked in or something. For this to be
effective, you actually want to keep the individual runs quite short, to
minimise the chance of them being disrupted, and do a huge number of
them, and for this, a high-resolution timer is essential.

The trouble with short runs is that you defeat HotSpot.
 

Lew

Giovanni said:
Hi Mark,

Thank you for your nice feedback!


I am still looking around but have not yet found what -server actually
does ...

Basically it doubles the speed of Java programs compared to the -client
option, as measured on enterprise-class hardware (e.g., Sun servers) in
real-world applications.

The "-server" option turns on every HotSpot optimization in the JVM - loop
unrolling, lock escape analysis, compilation and decompilation of key code
blocks between bytecode and native code with On-Stack Replacement (OSR), dead
code elimination, and the like. Unlike static optimizers, HotSpot can account
for transitory factors, for example, that a variable isn't changing at this time.
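
For reference, with the Maven invocation from the original post, turning it on
should only require adding the flag to the argLine (assuming the installed JDK
ships the server VM):

$ mvn -DargLine="-server -XX:+PrintCompilation -verbose:gc -Xbatch -XX:CICompilerCount=1
-Xms256m -Xmx512m" -Dtest=TestPerformance test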
Giovanni said:
A good idea (I think brought up by Tom) would be to measure each iteration
separately and then discard outliers by e.g. [sic] discarding those that exceed
the abs [sic] diff [sic] between the mean and the stddev [sic].

That technique doesn't seem statistically valid.

In the first place, you'd have to use the outliers to calculate the mean and
"stddev".

I've seen techniques before that discard the endmost data points, but never
ones that required statistical analysis to decide what to include or reject
for the statistical analysis.
 

Arved Sandstrom

Lew said:
Giovanni Azua wrote:
[ SNIP ]
A good idea (I think brought up by Tom) would be to measure each
iteration separately and then discard outliers by e.g. [sic]
discarding those that exceed the abs [sic] diff [sic] between the mean
and the stddev [sic].

That technique doesn't seem statistically valid.

In the first place, you'd have to use the outliers to calculate the mean
and "stddev".

I've seen techniques before that discard the endmost data points, but
never ones that required statistical analysis to decide what to include
or reject for the statistical analysis.

Doing this is acceptable if it's a step in identifying outliers for
examination, rather than being an automatic elimination step. What
Giovanni suggested might not be the statistical procedure of choice,
however; something like Grubbs' test would be common enough if your
clean data is normally distributed.

What we're really trying to do (data QA is a very well-established
discipline in geophysics & nuclear physics, for example) is _detect_
outliers to see if those data points represent _contamination_. About 15
years back I helped on the programming side with the production of
climatological atlases for bodies of water off the eastern coast of
Canada. One of the first data quality control steps was actually to
apply a bandpass filter - something along the lines of, water
temperature in February in this region is simply not going to be less
than T1 nor higher than T2 (*). There may actually be several ranges,
applied iteratively.

Point being that data QA/QC attempts to determine why a data point
should be rejected. You don't just do it because it's 5 SDs out; you
try to find out if it's bad data. In the case we're examining, I'd
sure like to see a reason why any outliers should be identified as
contamination.

AHS

* Something to be careful with. NASA computers initially failed to
detect the Antarctic ozone hole because they deleted ozone concentration
data below a certain preset.
 

John B. Matthews

Arved Sandstrom said:
Lew said:
Giovanni Azua wrote:
[ SNIP ]
A good idea (I think brought up by Tom) would be to measure each
iteration separately and then discard outliers by e.g. [sic]
discarding those that exceed the abs [sic] diff [sic] between the
mean and the stddev [sic].

That technique doesn't seem statistically valid.

In the first place, you'd have to use the outliers to calculate the
mean and "stddev".

I've seen techniques before that discard the endmost data points,
but never ones that required statistical analysis to decide what to
include or reject for the statistical analysis.

Doing this is acceptable if it's a step in identifying outliers for
examination, rather than being an automatic elimination step. What
Giovanni suggested might not be the statistical procedure of choice,
however; something like Grubbs' test would be common enough if your
clean data is normally distributed.

Indeed, Grubbs' test uses an iterative approach to examine one outlier
at a time:

<http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm>

But I've never looked closely at whether run-times "can be reasonably
approximated by a normal distribution" on a quiet system.
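
For what it's worth, the Grubbs statistic itself is simple to compute; the
delicate part is the critical value, which comes from a t-distribution as
described on the NIST page, so the sketch below leaves it as a caller-supplied
parameter.

class Grubbs {
    // One step of Grubbs' test: G = max |x_i - mean| / s.
    // Returns the index of the most extreme point if G exceeds the
    // caller-supplied critical value, or -1 if no point is flagged.
    static int mostExtremeOutlier(double[] x, double criticalValue) {
        int n = x.length;
        double mean = 0;
        for (double v : x) mean += v;
        mean /= n;

        double ss = 0;
        for (double v : x) ss += (v - mean) * (v - mean);
        double s = Math.sqrt(ss / (n - 1));   // sample standard deviation

        int worst = -1;
        double g = 0;
        for (int i = 0; i < n; i++) {
            double d = Math.abs(x[i] - mean) / s;
            if (d > g) { g = d; worst = i; }
        }
        return g > criticalValue ? worst : -1;
    }
}

Applied iteratively, one removes the flagged point, recomputes, and repeats
until nothing is flagged.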

[...]
 

Joshua Cranmer

Giovanni said:
A good idea (I think brought up by Tom) would be to measure each iteration
separately and then discard outliers by e.g. discarding those that exceed
the abs diff between the mean and the stddev.

In the normal distribution, approximately 32% of data points will be at
least one standard deviation away from the mean.

Since you're trying to figure out which of the two methodologies is better,
it seems that it would be far better to do a statistical hypothesis test:
<http://en.wikipedia.org/wiki/Statistical_hypothesis_testing>.

I don't know how accurate the assumption that the variances are the same
would be; if they are the same, you can simply use the two-sample pooled
t-test, otherwise you'd need the two-sample unpooled t-test.
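
For the unpooled case, the t statistic and the Welch-Satterthwaite degrees of
freedom are easy enough to compute by hand (sketch below); turning them into a
p-value needs a t-distribution CDF, e.g. from a statistics library, so that
step is omitted here.

class WelchTTest {
    // Welch's two-sample t statistic and approximate degrees of freedom.
    static double[] tAndDf(double[] a, double[] b) {
        double ma = mean(a), mb = mean(b);
        double sa = variance(a, ma) / a.length;   // s1^2 / n1
        double sb = variance(b, mb) / b.length;   // s2^2 / n2

        double t = (ma - mb) / Math.sqrt(sa + sb);
        double df = (sa + sb) * (sa + sb)
                  / (sa * sa / (a.length - 1) + sb * sb / (b.length - 1));
        return new double[] { t, df };
    }

    static double mean(double[] x) {
        double m = 0;
        for (double v : x) m += v;
        return m / x.length;
    }

    static double variance(double[] x, double m) {
        double ss = 0;
        for (double v : x) ss += (v - m) * (v - m);
        return ss / (x.length - 1);               // sample variance
    }
}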

I wonder why I didn't want to take statistics earlier...
 

Arne Vajhøj

Tom said:
Sure you can, you just have to do a lot of runs. This is why when i do
benchmarks, i do 100 runs, and ignore the five slowest, on the grounds
that they probably took so long because either another process was
scheduled in the middle, or GC kicked in or something. For this to be
effective, you actually want to keep the individual runs quite short, to
minimise the chance of them being disrupted, and do a huge number of
them, and for this, a high-resolution timer is essential.

1) That is actually very close to what I suggest. You are just making
it rather cumbersome for yourself by having many short runs to add up
instead of a few long runs.

2) And you don't want to ignore GC time, because GC will also hit the
real app.

Arne
 

Tom Anderson

Lew said:
The trouble with short runs is that you defeat HotSpot.

I should explain, i mean making each individual benchmarked unit small,
but still doing lots of them, and plenty of warmup, which should give
hotspot an adequate chance to do its thing. For instance, doing 1000 runs
over 10 000 elements rather than 10 over 1 000 000, with half that again
for warmup - same total time, but divided up more finely. Indeed, i
believe that this is more likely to attract hotspot's attention than the
10-run version.

Of course, varying the problem size will vary the cache dynamics, and may
take things out of the zone of greatest relevance to the real problem.
That's something to keep in mind.

tom
 

Tom Anderson

Lew said:
Giovanni Azua wrote: [ SNIP ]
A good idea (I think brought up by Tom) would be to measure each iteration
separately and then discard outliers by e.g. [sic] discarding those that
exceed the abs [sic] diff [sic] between the mean and the stddev [sic].

That technique doesn't seem statistically valid.

In the first place, you'd have to use the outliers to calculate the mean
and "stddev".

I've seen techniques before that discard the endmost data points, but never
ones that required statistical analysis to decide what to include or reject
for the statistical analysis.

Arved Sandstrom said:
Doing this is acceptable if it's a step in identifying outliers for
examination, rather than being an automatic elimination step. What
Giovanni suggested might not be the statistical procedure of choice,
however; something like Grubbs' test would be common enough if your
clean data is normally distributed.

I would have said Chauvenet's criterion rather than Grubbs' test - but
only because i'm more familiar with the former! Grubbs' test looks more
rigorous to me.

A less aggressive alternative would just be to describe the data by a
median and an interquartile range, thus effectively ignoring all big or
small values. You're not claiming they're 'wrong' in any sense, just not
focusing on them.
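
In code that is about as cheap as it gets; the quartiles below are taken at
simple index positions of the sorted data, which is crude but fine for
eyeballing a benchmark.

import java.util.Arrays;

class RobustSummary {
    // Median and interquartile range of a set of run times.
    static void describe(long[] times) {
        long[] sorted = times.clone();
        Arrays.sort(sorted);
        long q1 = sorted[sorted.length / 4];
        long median = sorted[sorted.length / 2];
        long q3 = sorted[(3 * sorted.length) / 4];
        System.out.println("median = " + median + " ns, IQR = " + (q3 - q1) + " ns");
    }
}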
Arved Sandstrom said:
What we're really trying to do (data QA is a very well-established
discipline in geophysics & nuclear physics, for example) is _detect_ outliers
to see if those data points represent _contamination_. About 15 years back I
helped on the programming side with the production of climatological atlases
for bodies of water off the eastern coast of Canada. One of the first data
quality control steps was actually to apply a bandpass filter - something
along the lines of, water temperature in February in this region is simply
not going to be less than T1 nor higher than T2 (*). There may actually be
several ranges, applied iteratively.

Point being that data QA/QC attempts to determine why a data point should be
rejected. You don't just do it because it's 5 SDs out; you try to find out
if it's bad data. In the case we're examining, I'd sure like to see a
reason why any outliers should be identified as contamination.

In this case, though, i can't see any way to do that. If a run took 150 ms
instead of 100, all you know is that it took 50 ms longer. There's no way
to retrospectively ask 'did GC happen?', 'did the OS preempt us to do some
housekeeping?' etc.

tom
 

Tom Anderson

John B. Matthews said:
But I've never looked closely at whether run-times "can be reasonably
approximated by a normal distribution" on a quiet system.

Off the top of my head, i'd guess not. A normal distribution is what you
get when you take some mean and add on lots and lots of small *but
symmetric* random variables of any kind. Like flipping a load of coins and
adding or subtracting one according to the result. But in timing some
code, there are only things that can be added to the baseline, not taken
away - preemption, GC, page-in, cache miss, etc. Thus, i'd expect the
distribution to be highly asymmetric, with a load of points at some
minimum, then a decreasing number of points at some higher value. A bit
like a power-law curve, but with a minimum. Maybe like a one-tailed
normal?

IANAS, though, so this is probably tosh. It's probably a Poisson
distribution or something.

tom
 

Tom Anderson

Arne Vajhøj said:
1) That is actually very close to what I suggest. You are just making
it rather cumbersome for yourself by having many short runs to add up
instead of a few long runs.

No, the whole point is that more runs give you more numbers to do
statistics on, which gives you a better chance to detect and discard
outliers.
Arne Vajhøj said:
2) And you don't want to ignore GC time, because GC will also hit the
real app.

It will. But if we're talking about microbenchmarking, i don't think you
necessarily want to measure it. If the problem is one which itself
involves allocating memory, then yes, you do, but if not, i think you
don't. If you're doing macrobenchmarking, then yes, you do want to measure
it, of course.

tom
 

Arved Sandstrom

Tom said:
On Sat, 2 May 2009, Arved Sandstrom wrote:
[ SNIP ]
In this case, though, i can't see any way to do that. If a run took 150
ms instead of 100, all you know is that it took 50 ms longer. There's no
way to retrospectively ask 'did GC happen?', 'did the OS preempt us to
do some housekeeping?' etc.

tom

My only suggestion here is the one I always make - to understand your
data, graph it and look at it. :)
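
Even a crude ASCII histogram of the per-run times, printed straight to the
console, will show the sort of shape Tom was speculating about. A sketch, with
an arbitrary bucket count:

class AsciiHistogram {
    // Crude console histogram of benchmark run times (10 equal-width buckets).
    static void plot(long[] times) {
        long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
        for (long t : times) { min = Math.min(min, t); max = Math.max(max, t); }

        int buckets = 10;
        long width = Math.max(1, (max - min) / buckets + 1);
        int[] counts = new int[buckets];
        for (long t : times) {
            counts[(int) Math.min(buckets - 1, (t - min) / width)]++;
        }
        for (int i = 0; i < buckets; i++) {
            StringBuilder bar = new StringBuilder();
            for (int j = 0; j < counts[i]; j++) bar.append('*');
            System.out.printf("%8d ns | %s%n", min + i * width, bar);
        }
    }
}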

AHS
 
