Symbolic benchmark


Mark Thornton

Jon said:
Stop-the-world is exactly the kind of interthread issue that I was referring
to.

If you take this symbolic simplifier, for example, and make it run two
simplifications at a time on my dual core then its performance will not
improve by a factor of two if the GC stops all threads while it runs.

In contrast, the OCaml is not only more than twice as fast to start with but
it will scale linearly with the number of cores in this case because the
GCs do not interfere.

Note that Sun's Java 6 also comes with a parallel garbage collector. I
didn't test that because my home machine is an old single core
processor. I notice from other comments that the performance of OCaml
relative to Java seems to depend quite a bit on the specific
environment. Some people are reporting Java as faster.
Can you send me your code and I'll update it.
Done.

Mark Thornton
 

Roger Lindsjö

Jon said:
Your statement is contrary to the evidence: Java does not allocate fast.

But on my machine it seems quite fast. It's a bit hard to measure the
allocations apart from the rest of the code, but at least Java runs on
par with OCaml.

With the 0 and 1 cache and using longs I get 3 allocations for
simplifying the expression. This gives about 46.4 clock cycles per
allocation (which of course also includes the rest of the logic). I
guess I could try to write a similar program with no allocations, but
who knows what HotSpot would optimize that to?

I'm not sure if tuning the GC would improve the performance, the objects
allocated are quite small and should not migrate out of the nursery. For
now I don't think that's an exercise worth doing.

By the way, the ray tracer you have been referring to, is it the one at
http://www.ffconsultancy.com/languages/ray_tracer/ ?

//Roger Lindsjö
 

Jon Harrop

Lew said:
Message passing as a primitive coordination and concurrency mechanism is
extremely powerful, especially in an O.S. like QNX that reduces
context-switch times to ridiculously small values. I first encountered
it on a minicomputer in 1981. It makes for very straightforward coding, too.

Exactly, yes. However, OCaml does not do such optimizations (AFAIK). To be
fair, I haven't tried it though.
 

Roedy Green

Your statement is contrary to the evidence: Java does not allocate fast.

In a non-GC system, when you allocate an object the allocator looks
around for a suitably sized hole for it. This takes longer than the
single addition needed to allocate the object from one giant contiguous
pool, so Java's allocation is faster.

On free, in a non-GC system, the allocator puts the space back on the
free space chain, and perhaps consolidates it with the free space before
and after. In Java, there is no explicit free; when GC time comes, the
active objects are copied to a new bin. There is ZERO processing on the
dead objects. See
http://mindprod.com/jgloss/garbagecollection.html

The time to complete the GC is proportional to the number of LIVE
objects.

The faster you allocate objects, the more frequently you need GC
passes.

So traditional allocation would be optimal in a system where you
allocate a large number of objects at the start and keep them all.

Java allocation would be optimal when you had almost no long term
objects, but many short lived objects, particularly if they are of
assorted sizes.

GC is a whole field of study in itself. Generational collectors
collect objects of different ages in different bins, and GC them
separately. Concurrent collectors can grind away in the background to
avoid pausing the app.
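Roedy's contrast between searching a free list and bumping a pointer in a contiguous pool can be sketched as follows (a hypothetical, deliberately simplified model for illustration only; a real JVM nursery uses thread-local allocation buffers and raw pointers, not a Java array):

```java
// Sketch of bump-pointer allocation as used inside a copying-collector
// nursery: allocation is one addition plus a bounds check. When the
// pool runs out, a real collector would copy the live objects elsewhere
// and reset `top` to zero, doing no work at all for dead objects.
class BumpAllocator {
    private final byte[] pool;   // the contiguous nursery
    private int top = 0;         // next free offset

    BumpAllocator(int size) {
        pool = new byte[size];
    }

    // Returns the offset of the new block, or -1 when the pool is
    // exhausted (the point at which a GC pass would be triggered).
    int allocate(int size) {
        if (top + size > pool.length) {
            return -1;
        }
        int offset = top;
        top += size;             // the single addition Roedy refers to
        return offset;
    }
}
```

A free-list allocator would instead walk a chain of holes looking for one of a suitable size, which is why per-allocation cost is where a copying collector wins.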
 

Jon Harrop

Roger said:
But on my machine it seems quite fast. It's a bit hard to measure the
allocations apart from the rest of the code, but at least Java runs on
par with Ocaml.

In absolute terms, the fastest program is still the OCaml running on my
machine (0.7s) even though my machine is significantly slower than yours.
Also, we've seen benchmark results from several different machines and the
one where Java is slightly faster than OCaml is an outlier. In most cases,
Java is ~2-3x slower, in at least one case it was 5x slower.

Also, you haven't normalized by the load average, which is 1.0 for OCaml but
~1.4 for Java here.

Incidentally, I suspect memory bandwidth is the problem on most machines. I
think your machine where Java does comparatively well has extremely high
memory bandwidth. My reason for speculating this is that the Java program
is using 100x as much memory as the OCaml.
With the 0 and 1 cache and using longs I get 3 allocations for
simplifying the expression. This gives about 46.4 clock cycles per
allocation (which of course also includes the rest of the logic). I
guess I could try to write a similar program with no allocations, but
who knows what HotSpot would optimize that to?

We should try it rather than speculate.

Also, the improvement in performance noted by Steve that was attributed to
Hotspot seems to be nothing more than GC interactions that periodically
slow the Java code down. I did 1,000 runs of the benchmark and, when the GC
kicks in, the Java code slows down by almost 2x and the load average
increases to 2.0, i.e. Java is maxing out both of my CPUs and is still
several times slower here.
I'm not sure if tuning the GC would improve the performance, the objects
allocated are quite small and should not migrate out of the nursery. For
now I don't think that's an exercise worth doing.

Exactly, yes. There is no point in avoiding the allocation in OCaml code
because it is extremely fast. In contrast, you can often get huge
performance improvements in Java by manually rewriting code to amortize
allocations.
By the way, the raytrace you have been referring to, is it the one on
http://www.ffconsultancy.com/languages/ray_tracer/ ?

Yes. Note how the fastest Java implementation achieves a huge performance
boost by manually working around the allocation of temporaries, which is
very slow in Java.

I'd like to know if the fastest and second fastest implementations of the
Java ray tracer also show a smaller performance gap on your machine (the
one where Java is slightly faster than OCaml).
 

Mark Thornton

Roedy said:
In a non-GC system, when you allocate an object the allocator looks
around for a suitable sized hole for it. This takes longer than an
addition to allocate the object from a single giant contiguous pool.
The allocation is thus faster.

On free, in a non-GC system, the allocator puts the space back on the
free space chain, and perhaps consolidates it with free space before
and after. In Java, there is no explicit free, but when GC times
comes, the active objects are copied to a new bin. There is ZERO
processing on the dead objects. see
http://mindprod.com/jgloss/garbagecollection.html

The time to complete the GC is proportional to the number of LIVE
objects.

The faster you allocate objects, the more frequently you need GC
passes.

However, even if you have almost no live objects, the time for a GC pass
doesn't reach zero. We can reduce the number of GC passes by throwing
memory at the problem --- increasing the size of the nursery area.

Mark Thornton
 

Roger Lindsjö

Jon said:
In absolute terms, the fastest program is still the OCaml running on my
machine (0.7s) even though my machine is significantly slower than yours.
Also, we've seen benchmark results from several different machines and the
one where Java is slightly faster than OCaml is an outlier. In most cases,
Java is ~2-3x slower, in at least one case it was 5x slower.

Also, you haven't normalized by the load average, which is 1.0 for OCaml but
~1.4 for Java here.

I'm not sure what you mean here. When restricting to one core, both
programs used about 100% and I don't think any other processes (or at
least very few) were scheduled on that core.
Incidentally, I suspect memory bandwidth is the problem on most machines. I
think your machine where Java does comparatively well has extremely high
memory bandwidth. My reason for speculating this is that the Java program
is using 100x as much memory as the OCaml.

The memory bandwidth might be one reason, but memory usage was not a
constraint or a measurement point earlier in the discussion. And 100
times the memory? I'm running a longer test with a heap size of 1.5MB;
the JVM runtime itself takes a bit of memory. The resident size of the
Java process is 11MB while that of the OCaml process is 1MB. I'm
measuring the memory use of the Java process, and during the first 10
minutes no data has survived to the old heap; a few objects temporarily
survived a short while in eden while almost everything seems to die
within the nurseries (this is of course expected behavior).
We should try it rather than speculate.

I will try to make a cache for all "Val"; since we have so few different
values a simple switch should solve that, and then we might be able to
measure how much time a "Val" takes to create/destroy. The problem
is that we don't stress the nurseries with "Val" objects, so they
might behave differently.
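The cache Roger describes might look something like this (hypothetical code; the class and method names are made up for illustration and are not taken from the actual benchmark, which caches the constants 0 and 1):

```java
// Sketch of an interning cache for the small constants: reuse one
// shared instance per cached value instead of allocating a fresh
// object on every simplification step.
final class Val {
    final long n;

    private Val(long n) {
        this.n = n;
    }

    private static final Val ZERO = new Val(0);
    private static final Val ONE = new Val(1);

    static Val of(long n) {
        // The "simple switch" from the post: only 0 and 1 are cached.
        if (n == 0) {
            return ZERO;
        }
        if (n == 1) {
            return ONE;
        }
        return new Val(n); // uncached values still allocate
    }
}
```

This is the same idea as `Integer.valueOf`, which caches boxed values for a small range so repeated boxing of common values allocates nothing.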
Also, the improvement in performance noted by Steve that was attributed to
Hotspot seems to be nothing more than GC interactions that periodically
slow the Java code down. I did 1,000 runs of the benchmark and, when the GC
kicks in, the Java code slows down by almost 2x and the load average
increases to 2.0, i.e. Java is maxing out both of my CPUs and is still
several times slower here.

For me the GC does not "kick in" at all; I have had no full GCs at all (so
far). I have 800 minor GCs a second, but you would have to fine-tune the
measuring to be able to see the slowdown while they run.
Exactly, yes. There is no point in avoiding the allocation in OCaml code
because it is extremely fast. In contrast, you can often get huge
performance improvements in Java my manually rewriting code to amortize
allocations.

My experience is the opposite: for small, short-lived objects it has
been better to create-use-forget than to use smart code that reuses
objects. However, once objects grow larger than a few KB they are
directly promoted to the old heap, and for those objects it might be
better to handle them differently. But this could vary between
applications. I normally work with web service applications with a lot of
XML processing and generation.
Yes. Note how the fastest Java implementation achieves a huge performance
boost by manually working around the allocation of temporaries, which is
very slow in Java.

I'd like to know if the fastest and second fastest implementations of the
Java ray tracer also show a smaller performance gap on your machine (the
one where Java is slightly faster than OCaml).

This is my output:
Java 1
real 0m12.624s
user 0m12.264s
sys 0m0.812s

Ocaml 1
real 1m13.171s
user 1m13.112s
sys 0m0.045s

Java 2
real 0m8.911s
user 0m9.186s
sys 0m0.177s

Ocaml 2
real 0m51.302s
user 0m51.261s
sys 0m0.032s

Java 3
real 0m8.012s
user 0m8.304s
sys 0m0.163s

Ocaml 3
real 0m8.200s
user 0m8.181s
sys 0m0.020s

Java 4
real 0m7.962s
user 0m8.255s
sys 0m0.150s

Ocaml 4
real 0m7.402s
user 0m7.378s
sys 0m0.019s

Java 5
real 0m5.557s
user 0m5.837s
sys 0m0.113s

Ocaml 5
real 0m7.350s
user 0m7.342s
sys 0m0.008s

As far as I can see, the slowest Java code takes about 2.5 times longer
to complete than the fastest code.

Oh, and why aren't the OCaml programs saving the image to a file instead
of writing to standard out?

//Roger Lindsjö
 

Roger Lindsjö

Sorry, couldn't help myself; here are also the timings from the C++ programs.

C++ 1
real 0m6.557s
user 0m6.542s
sys 0m0.013s

C++ 2
real 0m5.040s
user 0m5.028s
sys 0m0.010s

C++ 3
real 0m4.299s
user 0m4.287s
sys 0m0.007s

C++ 4
real 0m4.074s
user 0m4.068s
sys 0m0.006s

C++ 5
real 0m4.151s
user 0m4.139s
sys 0m0.011s

//Roger Lindsjö
 

Jon Harrop

Roger said:
I'm not sure what you mean here, when restricting to one core, both
programs used about 100% and I don't think any other processes(or at
least very few) were scheduled on that core.

I'm not sure what you mean by "restricting to one core". The Java is using
both of my cores: one for the main program and the other for the GC.
The memory bandwidth might be one reason but memory usage was not a
constraint or a measurement point earlier in the discussion. And 100
times the memory? I'm running a longer test with heapsize of 1.5MB, the
JVM as runtime system take a bit of memory. The resident size of the
Java process is 11MB while the size of the Ocaml is 1MB.

That's odd. Resident size here is 180MB for the Java and 1.7MB for the
OCaml.
My experience is the opposite, for small, short lived objects it has
been better to create-use-forget than use smart code that reuses
objects. However, once the objects grow larger than a few KB, then they
are directly promoted to the old heap, and for those objects it might be
better to handle them differently. But this could vary between
applications. I normally work with web service application with a lot of
xml processing and generation.

Interesting. Assuming your XML processing functions don't destroy their
input I'd have expected allocation to be something you'd optimize away as
well.
This is my output:
Java 1
real 0m12.624s
user 0m12.264s
sys 0m0.812s

Ocaml 1
real 1m13.171s
user 1m13.112s
sys 0m0.045s

Java 2
real 0m8.911s
user 0m9.186s
sys 0m0.177s

Ocaml 2
real 0m51.302s
user 0m51.261s
sys 0m0.032s

Java 3
real 0m8.012s
user 0m8.304s
sys 0m0.163s

Ocaml 3
real 0m8.200s
user 0m8.181s
sys 0m0.020s

Java 4
real 0m7.962s
user 0m8.255s
sys 0m0.150s

Ocaml 4
real 0m7.402s
user 0m7.378s
sys 0m0.019s

Java 5
real 0m5.557s
user 0m5.837s
sys 0m0.113s

Ocaml 5
real 0m7.350s
user 0m7.342s
sys 0m0.008s

As far as I can see, the slowest Java code takes about 2.5 times longer
to complete than the fastest code.

Interesting. I get:

Java:
18.3124079704
13.3784809113
12.2087609768
12.3188209534
6.49561595917

OCaml:
20.0419669151
11.9526901245
5.76670503616
5.23136401176
3.96777296066

So you're again seeing the exact opposite of what I'd predicted: Java gets
significantly faster but OCaml doesn't on your machine for the last two
implementations.

From the initial OCaml timings I assume you're using 32-bit. I'd be
interested to see 64-bit results as well: they tend to be much faster for
this kind of work.
Oh, and why aren't the OCaml programs saving the image to a file instead
of writing to standard out?

They should both be writing to stdout because the original C++ did. Looks
like a bug in the Java implementations...
 

Jon Harrop

Roger said:
Sorry, couldn't help myself; here are also the timings from the C++
programs.

What version of what compiler?
C++ 1
real 0m6.557s
user 0m6.542s
sys 0m0.013s

C++ 2
real 0m5.040s
user 0m5.028s
sys 0m0.010s

C++ 3
real 0m4.299s
user 0m4.287s
sys 0m0.007s

C++ 4
real 0m4.074s
user 0m4.068s
sys 0m0.006s

C++ 5
real 0m4.151s
user 0m4.139s
sys 0m0.011s

Very interesting. So the only language that runs fast on both of our
computers is C++.

For C++, I get:

7.0633699894
5.21576809883
4.62505602837
4.38395786285
4.0194811821

I also find it amusing that my AMD hardware is faster than your Intel
hardware. ;-)
 

Jon Harrop

Roedy said:
In a non-GC system, when you allocate an object the allocator looks
around for a suitable sized hole for it. This takes longer than an
addition to allocate the object from a single giant contiguous pool.
The allocation is thus faster.

On free, in a non-GC system, the allocator puts the space back on the
free space chain, and perhaps consolidates it with free space before
and after. In Java, there is no explicit free, but when GC times
comes, the active objects are copied to a new bin. There is ZERO
processing on the dead objects. see
http://mindprod.com/jgloss/garbagecollection.html

The time to complete the GC is proportional to the number of LIVE
objects.

The faster you allocate objects, the more frequently you need GC
passes.

So traditional allocation would be optimal in a system where you
allocated a large number of objects to start and keep them all.

Java allocation would be optimal when you had almost no long term
objects, but many short lived objects, particularly if they are of
assorted sizes.

GC is a whole field of study in itself. Generational collectors
collect objects of different ages in different bins, and GC them
separately. Concurrent collectors can grind away in the background to
avoid pausing the app.

Sure. None of that compares Java to other garbage collected languages though
(of which there are many) and, consequently, does not support your claim
that Java is comparatively fast at allocating.
 

Roger Lindsjö

Jon said:
What version of what compiler?

Yes, I should of course have posted that.
g++ (GCC) 4.1.2 20070925 (Red Hat 4.1.2-27)
Very interesting. So the only language that runs fast on both of our
computers is C++.

For C++, I get:

7.0633699894
5.21576809883
4.62505602837
4.38395786285
4.0194811821

I also find it amusing that my AMD hardware is faster than your Intel
hardware. ;-)

Well, at least in one of the tests, and comparing our fastest runs, yours
is a whopping 1.3% faster. Averaging out the different runs, mine is
almost 5% faster ;-).

Of course this is a silly comparison; I think they run very comparably
in speed for the C++ code. Being so similar for this code, it is quite
interesting that we get such different results for Java vs OCaml, but
memory bandwidth could be a reason.

//Roger Lindsjö
 

Roger Lindsjö

Jon said:
I'm not sure what you mean by "restricting to one core". The Java is using
both of my cores: one for the main program and the other for the GC.

That's why I limited it to only use one core, using taskset.
That's odd. Resident size here is 180Mb for the Java and 1.7Mb for the
OCaml.
I also limited the heap size since Java will grab a heap proportional to
system memory. I ran with a heap of 1.5MB.
Interesting. Assuming your XML processing functions don't destroy their
input I'd have expected allocation to be something you'd optimize away as
well.

The application gets a lot of small inputs from different sources; they
are parsed to XML, XSL transformations are done, and then some other
stuff such as user sessions replicated among all servers etc. We used to
have implementations minimizing allocations, and for Java 1.3 this was a
performance boost. However, after migrating to Java 1.5 that was no
longer the case, so we simplified the code.
Interesting. I get:

Java:
18.3124079704
13.3784809113
12.2087609768
12.3188209534
6.49561595917

OCaml:
20.0419669151
11.9526901245
5.76670503616
5.23136401176
3.96777296066

So you're again seeing the exact opposite of what I'd predicted: Java gets
significantly faster but OCaml doesn't on your machine for the last two
implementations.

From the initial OCaml timings I assume you're using 32-bit. I'd be
interested to see 64-bit results as well: they tend to be much faster for
this kind of work.

I'm not about to install a 64-bit version of GNU/Linux; perhaps someone
else with similar hardware already has? But why is 64-bit faster for this
work?

For the first OCaml example we get huge differences in speed: my machine
is taking 3.5 times longer than yours, while for the rest our machines
are a lot closer.
They should both be writing to stdout because the original C++ did. Looks
like a bug in the Java implementations...

Ok, I thought it was because you'd need 3rd party modules ;-). That was
before I tried the C++ version and noticed the same behavior. A
graphics file printed to the console is always a nice treat :)

I could change the Java implementations, but I doubt that would make a
speed difference. Would you like new versions that behave as the others?

//Roger Lindsjö
 

Mark Thornton

Roger said:
The application gets a lot of small inputs from different sources, they
are parsed to XML, XSL transformations are done and then some other
stuff such as user sessions replicated among all servers etc. We used to
have implementations minimizing allocations, and for Java 1.3 this was a
performance boost. However, after migrating to Java 1.5, then that was
no longer the case, so then we simplified the code.
The amount of 'effort' that can usefully be expended on reducing
allocations has gone down. However, I think cases equivalent to basic
complex arithmetic are still noticeably faster if the objects are
eliminated.

Mark Thornton
 

Jon Harrop

Roger said:
Yes, I should of course have posted that.
g++ (GCC) 4.1.2 20070925 (Red Hat 4.1.2-27)

My results are for g++ 4.2.3 and I noticed a significant slow-down moving to
that from 4.1.2. C++ used to be faster than OCaml... :)
Of course this is a silly comparison, I think they run very comparable
in speed for the C++ code. Being so similar for this code it is quite
interesting that we get so different results for Java vs Ocaml, but
memory bandwidth could be a reason.

Possibly. The memory bandwidth on this machine is comparatively low but I
don't notice it with OCaml. I wonder if this makes OCaml's performance more
future proof as the memory gap widens...
 

Jon Harrop

Roger said:
That's why I limited it to only use one core, using taskset.

Aha. :)
I also limited the heap size since Java will grab a heap proportional to
system memory. I ran with a heap of 1.5MB.

Ah, ok. I didn't realise you were tinkering with the settings. I was just
running with default settings. Interestingly, I can improve the performance
of the OCaml by reducing its minor heap size. It then takes 0.67s.
The application gets a lot of small inputs from different sources, they
are parsed to XML, XSL transformations are done and then some other
stuff such as user sessions replicated among all servers etc. We used to
have implementations minimizing allocations, and for Java 1.3 this was a
performance boost. However, after migrating to Java 1.5, then that was
no longer the case, so then we simplified the code.

Very interesting. In that case the JVM might be a much better candidate for
implementing functional programming languages.
I'm not about to install a 64bit version of GNU/Linux, perhaps someone
else with similar hardware already has? But why is 64bit faster for this
work?

Generating efficient assembler for floating point intensive code is much
easier with the x86-64 instruction set than the old x86 instruction set.
Consequently, many numerical programs in different languages get ~2x faster
moving to 64-bit.
For the first Ocaml example we get huge differences in speed, my machine
is taking 3.5 times longer than yours while for the rest our machines
are a lot closer.

Yes. The creators of OCaml put a lot more effort into optimizing for the
more modern architecture.
Ok, I thought it was because you'd need 3rd party modules ;-). That was
before I tried the C++ version and I noticed the same behavior. A
graphics file printed to the console is always a nice treat :)

The programs should probably all check that they're not printing to a
console. :)
I could change the Java implementations, but I doubt that would make a
speed difference. Would you like new versions that behave as the others?

Yes please!
 

Lew

Mark said:
The amount of 'effort' that can usefully be expended on reducing
allocations has gone down. However I think cases equivalent to basic
complex arithmetic are still noticeably faster if the objects are
eliminated.

In the special case of relatively small primitive structures such as

public class Complex
{
    private final double re, im;

    public Complex( double r, double i )
    { re = r; im = i; }

    public final double re() { return re; }
    public final double im() { return im; }
}

you will likely find that HotSpot will eliminate the objects for you.

The scenario in complex arithmetic is lots of intermediate values - these
have very short lifespans and live completely within the stack context.
Object creation and accessor calls almost certainly will inline and
disappear. The values could live in registers, thus avoiding even memory
overhead.

I chose an example of an immutable object, but simple mutable structures also
optimize.
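A sketch of the pattern Lew and Mark are discussing (hypothetical code; whether the allocation actually disappears depends on the JVM's escape-analysis support, which Mark notes below had not yet been implemented in HotSpot at the time):

```java
// Illustration of a short-lived temporary that never escapes its
// method: a JIT with escape analysis may scalar-replace the Complex
// into two doubles in registers, eliminating the heap allocation.
public class EscapeDemo {
    static final class Complex {
        private final double re, im;

        Complex(double r, double i) {
            re = r;
            im = i;
        }

        double re() { return re; }
        double im() { return im; }
    }

    // The Complex below lives entirely within this stack frame; the
    // accessor calls are trivially inlined.
    static double magnitudeSquared(double r, double i) {
        Complex c = new Complex(r, i);
        return c.re() * c.re() + c.im() * c.im();
    }
}
```

The result is identical either way; the question in this thread is only whether the `new Complex` costs an allocation or nothing at all.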
 

Mark Thornton

Lew said:
In the special case of relatively small primitive structures such as

public class Complex
{
    private final double re, im;

    public Complex( double r, double i )
    { re = r; im = i; }

    public final double re() { return re; }
    public final double im() { return im; }
}

you will likely find that HotSpot will eliminate the objects for you.

As far as I know that optimisation hasn't been implemented in HotSpot
yet. It is planned.
The scenario in complex arithmetic is lots of intermediate values -
these have very short lifespans and live completely within the stack
context. Object creation and accessor calls almost certainly will
inline and disappear.
The accessor calls are inlined, but the object creation remains.
I chose an example of an immutable object, but simple mutable structures
also optimize.
Immutable objects where == is a value comparison can be placed directly
in arrays instead of arrays of references. This is important in many
applications of complex values.

Mark Thornton
 
