Symbolic benchmark

L

ldv

Roger said:
No threading other than what the JVM does internally for housekeeping
such as GC. I even ran restricted to one core (using taskset) since I
was told that OCaml was single-threaded.

The code is 3 posts up, the one I was replying to.
http://groups.google.com/group/comp.lang.java.programmer/msg/b5185f59...

//Roger Lindsjö

On my 2GHz Celeron, the executable produced by the latest Excelsior JET
6.0 is much faster than the HotSpot Client VM 1.6.0_3, but slower than
the Server VM, which is almost twice as fast as the Client VM. Here are
the best results, selected from 10 consecutive runs:

HSS : 2.594s
JET : 3.25s
HSC : 5.172s

On a colleague's Core 2 Duo, the situation is more or less the same:

HSS : 1.206s
JET : 1.687s
HSC : 2.118s

Are you guys all using the Server VM in your comparisons?

LDV
 
J

Jon Harrop

Roger said:
I'm using the Server VM by default (or rather, the JVM chooses it); the
output from plain java contains:
The default VM is server,
because you are running on a server-class machine.

Me too.
 
R

Roger Lindsjö

ldv said:
On my 2GHz Celeron, the executable produced by the latest Excelsior JET
6.0 is much faster than the HotSpot Client VM 1.6.0_3, but slower than
the Server VM, which is almost twice as fast as the Client VM. Here are
the best results, selected from 10 consecutive runs:

HSS : 2.594s
JET : 3.25s
HSC : 5.172s

On a colleague's Core 2 Duo, the situation is more or less the same:

HSS : 1.206s
JET : 1.687s
HSC : 2.118s

Are you guys all using the Server VM in your comparisons?

I'm using the Server VM by default (or rather, the JVM chooses it); the
output from plain java contains:
The default VM is server,
because you are running on a server-class machine.


//Roger Lindsjö
 
R

Roger Lindsjö

Jon said:
I think there are now two problems:

1. We're measuring bigint arithmetic but we don't really care about it.

Well, actually I used BigDecimal, but that made no difference for the
test. I also made a version using long instead of BigDecimal. With
longs, 10000000 iterations took about 0.54 seconds, so about twice as
fast. I'm not sure how to correctly rewrite the OCaml program to use
the native int. I'll post my program rewritten for long instead, so if
you also post the OCaml program, then we can both compare relative
performance.
2. The input should be such that the evaluation is irreducible. In
particular, the input must not be hardcoded.

We can address these problems quite easily by using machine-precision ints
(whatever they are, we'll just make sure the problem doesn't overflow them)
and doing something juicier with symbolic computation like a big derivative
or multiplying out polynomials.

I think the current program works; at least, I believe my solution will
not allow the HotSpot VM to optimize the work away.

//Roger Lindsjö

SimplifyLong.java

public class SimplifyLong {
    public static void main(String[] args) {
        int lim = Integer.parseInt(args[0]);
        int nbench = Integer.parseInt(args[1]);

        Expr expr =
            mul(var("x"),
                add(
                    add(mul(rat(12), rat(0)), add(rat(23), rat(8))),
                    var("y")
                )
            );

        System.out.println(expr);
        Expr reduced = null;
        for (int batch = 0; batch < nbench; batch++) {
            long start = System.currentTimeMillis();
            System.err.println("Bench " + batch);
            for (int i = 0; i < lim; i++) {
                reduced = expr.simplify();
            }
            System.err.println("Took " +
                (System.currentTimeMillis() - start) / 1000.0 + " seconds");
        }
        System.out.println(reduced);
    }

    private static Expr mul(Expr a, Expr b) { return new Mul(a, b); }
    private static Expr add(Expr a, Expr b) { return new Add(a, b); }
    private static Expr rat(long a) { return Val.create(a); }
    private static Expr var(String s) { return new Sym(s); }

    public static abstract class Expr {
        public abstract Expr simplify();
    }

    public static final class Add extends Expr {
        private final Expr l, r;

        public Add(Expr l, Expr r) {
            this.l = l;
            this.r = r;
        }

        public Expr simplify() {
            return add(l.simplify(), r.simplify());
        }

        private Expr add(Expr l, Expr r) {
            if (Val.ZERO == l) {
                return r;
            } else if (Val.ZERO == r) {
                return l;
            } else if (l instanceof Val && r instanceof Val) {
                return add((Val) l, (Val) r);
            } else if (r instanceof Add) {
                return add(l, (Add) r);
            } else {
                return new Add(l, r);
            }
        }

        private Expr add(Val l, Val r) {
            return Val.create(l.getValue() + r.getValue());
        }

        private Expr add(Expr l, Add r) {
            return new Add(new Add(l, r.l), r.r).simplify();
        }

        public String toString() {
            return "[+ " + l + " " + r + "]";
        }
    }

    public static class Mul extends Expr {
        private final Expr l, r;

        public Mul(Expr l, Expr r) {
            this.l = l;
            this.r = r;
        }

        public Expr simplify() {
            return mul(l.simplify(), r.simplify());
        }

        private Expr mul(Expr l, Expr r) {
            if (Val.ZERO == l || Val.ZERO == r) {
                return Val.ZERO;
            } else if (Val.ONE == l) {
                return r;
            } else if (Val.ONE == r) {
                return l;
            } else if (l instanceof Val && r instanceof Val) {
                return mul((Val) l, (Val) r);
            } else if (r instanceof Mul) {
                return mul(l, (Mul) r);
            } else {
                return new Mul(l, r);
            }
        }

        private Expr mul(Val l, Val r) {
            return Val.create(l.getValue() * r.getValue());
        }

        private Expr mul(Expr l, Mul r) {
            return new Mul(new Mul(l, r.l), r.r).simplify();
        }

        public String toString() {
            return "[* " + l + " " + r + "]";
        }
    }

    public static class Sym extends Expr {
        private final String symbol;

        public Sym(String symbol) {
            this.symbol = symbol;
        }

        public Expr simplify() {
            return this;
        }

        public String toString() {
            return symbol;
        }
    }

    public static class Val extends Expr {
        public static final Val ZERO = new Val(0);
        public static final Val ONE = new Val(1);

        private final long value;

        private Val(long value) {
            this.value = value;
        }

        public Expr simplify() {
            return this;
        }

        public static Val create(long value) {
            if (0 == value) {
                return ZERO;
            } else if (1 == value) {
                return ONE;
            } else {
                return new Val(value);
            }
        }

        public long getValue() {
            return value;
        }

        public String toString() {
            return String.valueOf(value);
        }
    }
}
 
L

Lew

Jon said:
OCaml simply uses processes for concurrency rather than threads. In terms of
capabilities, there is no difference.

I'm not familiar with OCaml. Does it have a library of concurrency
constructs, such as blocking / non-blocking queues, monitors or other locks,
memory-model coordinating constructs cognate to Java's "volatile" and the like
to coordinate shared data?
 
B

bugbear

Lew said:
This supports the points of those who say, "Watch out for benchmarks,"
and, "Java can be as fast as C++", and. "Java is slower that C++", and,
"It depends".

May I point out the existence of comp.benchmarks?

BugBear
 
J

Jon Harrop

Lew said:
I'm not familiar with OCaml. Does it have a library of concurrency
constructs, such as blocking / non-blocking queues, monitors or other
locks, memory-model coordinating constructs cognate to Java's "volatile"
and the like to coordinate shared data?

No, OCaml has data parallel and message passing libraries instead. These
are higher-level constructs for parallelism and concurrency that
abstract away the need to use low-level constructs like monitors and
locks directly. This approach is easier to use and arguably safer
(although I have no evidence to back up the latter in general, outside
of Erlang's overwhelming victory for massively-concurrent applications).
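
For Java readers, the flavour of that style can be sketched in the
newsgroup's own language, with a blocking queue standing in for a typed
channel. This is only an illustrative analogue of message passing, not
OCaml's actual API:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative analogue only: message-passing style in Java terms.
// The worker receives messages over a channel (here a BlockingQueue)
// instead of sharing mutable state guarded by explicit locks.
public class MessagePassingSketch {
    public static void main(String[] args) throws InterruptedException {
        final BlockingQueue<String> channel =
            new ArrayBlockingQueue<String>(16);

        Thread worker = new Thread(new Runnable() {
            public void run() {
                try {
                    for (;;) {
                        String msg = channel.take(); // blocks until a message arrives
                        if ("STOP".equals(msg)) return;
                        System.out.println("got: " + msg);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        worker.start();

        channel.put("hello"); // the only coordination is send/receive
        channel.put("world");
        channel.put("STOP");
        worker.join();
    }
}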

There are complicated trade-offs involved here but, essentially, OCaml
sacrifices high-performance inter-thread communication (that you can
achieve with) for better scaling because its GC requires no garbage
collection.

There is also a new research language called JoCaml that integrates support
for concurrency directly into the OCaml language using the join calculus.

You may also be interested in related efforts such as the recent support
for hinted parallelism in Haskell and F#'s recently introduced
asynchronous workflows.
 
C

Chris Dollin

Jon said:
There are complicated trade-offs involved here but, essentially, OCaml
sacrifices high-performance inter-thread communication (that you can
achieve with) for better scaling because its GC requires no garbage
collection.

Did something go wrong with that paragraph, possibly just over-ambitious
packing? I can't make consistent sense of it.
 
J

Jon Harrop

Chris said:
Did something go wrong with that paragraph, possibly just over-ambitious
packing? I can't make consistent sense of it.

That was indeed a buggy paragraph. :)

Multithreaded GC (like the JVM and .NET) provides cheap interthread
communication. OCaml's approach (independent GCs per process) can make
interthread communication much more expensive but it scales better.
 
R

Roger Lindsjö

Roger said:
Well, actually I used BigDecimal, but that made no difference for the
test. I also made a version using long instead of BigDecimal. With
longs, 10000000 iterations took about 0.54 seconds, so about twice as
fast. I'm not sure how to correctly rewrite the OCaml program to use
the native int. I'll post my program rewritten for long instead, so if
you also post the OCaml program, then we can both compare relative
performance.

Jon gave me an OCaml version that uses native integers (I tried writing
it myself, but it is a long time since I used a functional language,
and Haskell isn't quite the same as OCaml). Anyway, here are my results
(the OCaml version is posted below).

[roger@pooh Reduce]$ taskset -c 0 \time -v ./simplifyint 10000000 5
[* x [+ [+ [* 12 0] [+ 23 8]] y]]
Bench 1
Took 0.635466s
Bench 2
Took 0.634575s
Bench 3
Took 0.634284s
Bench 4
Took 0.633329s
Bench 5
Took 0.633333s
[* x [+ 31 y]]
Command being timed: "./simplifyint 10000000 5"
User time (seconds): 3.17
System time (seconds): 0.00
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.17
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 0
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 287
Voluntary context switches: 1
Involuntary context switches: 8
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
[roger@pooh Reduce]$ taskset -c 0 \time -v java SimplifyLong 10000000 5
[* x [+ [+ [* 12 0] [+ 23 8]] y]]
Bench 0
Took 0.626 seconds
Bench 1
Took 0.546 seconds
Bench 2
Took 0.541 seconds
Bench 3
Took 0.556 seconds
Bench 4
Took 0.551 seconds
[* x [+ 31 y]]
Command being timed: "java SimplifyLong 10000000 5"
User time (seconds): 2.80
System time (seconds): 0.07
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.90
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 0
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1
Minor (reclaiming a frame) page faults: 17429
Voluntary context switches: 2178
Involuntary context switches: 3483
Swaps: 0
File system inputs: 0
File system outputs: 64
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
[roger@pooh Reduce]$ \time -v ./simplifyint 10000000 5
[* x [+ [+ [* 12 0] [+ 23 8]] y]]
Bench 1
Took 0.612540s
Bench 2
Took 0.610459s
Bench 3
Took 0.611671s
Bench 4
Took 0.613096s
Bench 5
Took 0.611549s
[* x [+ 31 y]]
Command being timed: "./simplifyint 10000000 5"
User time (seconds): 3.05
System time (seconds): 0.00
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.06
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 0
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 287
Voluntary context switches: 1
Involuntary context switches: 19
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
[roger@pooh Reduce]$ \time -v java SimplifyLong 10000000 5
[* x [+ [+ [* 12 0] [+ 23 8]] y]]
Bench 0
Took 0.606 seconds
Bench 1
Took 0.554 seconds
Bench 2
Took 0.557 seconds
Bench 3
Took 0.561 seconds
Bench 4
Took 0.559 seconds
[* x [+ 31 y]]
Command being timed: "java SimplifyLong 10000000 5"
User time (seconds): 2.88
System time (seconds): 0.06
Percent of CPU this job got: 101%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.90
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 0
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1
Minor (reclaiming a frame) page faults: 17509
Voluntary context switches: 2278
Involuntary context switches: 87
Swaps: 0
File system inputs: 0
File system outputs: 64
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

So on my system the Java version is slightly faster.

simplifyint.ml (All credit to Jon Harrop)

type t =
  | Int of int
  | Var of char
  | Add of t * t
  | Mul of t * t

let rec ( +: ) f g = match f, g with
  | Int n, Int m -> Int (n + m)
  | Int 0, e | e, Int 0 -> e
  | f, Add(g, h) -> f +: g +: h
  | f, g -> Add(f, g)

let rec ( *: ) f g = match f, g with
  | Int n, Int m -> Int (n * m)
  | Int 0, e | e, Int 0 -> Int 0
  | Int 1, e | e, Int 1 -> e
  | f, Mul(g, h) -> f *: g *: h
  | f, g -> Mul(f, g)

let rec simplify = function
  | Int _ | Var _ as f -> f
  | Add (f, g) -> simplify f +: simplify g
  | Mul (f, g) -> simplify f *: simplify g

open Format

let rec print ff = function
  | Var x -> fprintf ff "%c" x
  | Int n -> fprintf ff "%d" n
  | Add(f, g) -> fprintf ff "[+ %a %a]" print f print g
  | Mul(f, g) -> fprintf ff "[* %a %a]" print f print g

let e =
  Mul(Var 'x', Add(Add(Mul(Int 12, Int 0), Add(Int 23, Int 8)), Var 'y'))

let n, m =
  try
    match Sys.argv with
    | [|_; n; m|] -> int_of_string n, int_of_string m
    | _ -> raise Exit
  with _ -> 10000000, 10

let () =
  fprintf std_formatter "%a\n%!" print e;
  for i = 1 to m do
    Printf.printf "Bench %d\n%!" i;
    let t = Unix.gettimeofday () in
    for j = 1 to n do
      ignore (simplify e)
    done;
    Printf.printf "Took %fs\n%!" (Unix.gettimeofday () -. t)
  done;
  fprintf std_formatter "%a\n%!" print (simplify e)
 
M

Mark Thornton

Jon said:
That was indeed a buggy paragraph. :)

Multithreaded GC (like the JVM and .NET) provides cheap interthread
communication. OCaml's approach (independent GCs per process) can make
interthread communication much more expensive but it scales better.

With multicore (shared memory) processors now the norm on the desktop,
the advantage of the Java approach is significant for some desktop
applications. The work I'm doing at the moment has threads sharing data
structures which run to hundreds of megabytes. Combined with Doug Lea's
Fork/Join library (jsr166y), this is giving me good speed-up over
sequential computation.

Separate processes not only impose higher communication overhead, but on
some architectures the overhead of process creation is substantial.
Much as I would like to wean customers off Windows, my code has to work
efficiently on that platform.
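
Mark mentions jsr166y; that library later shipped as the fork/join
framework in java.util.concurrent in Java 7. A minimal sketch of the
divide-and-conquer pattern, written against those final names (the
jsr166y preview used a separate jsr166y package; the array-summing
task here is an invented example, not Mark's actual workload):

import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Parallel array sum in fork/join style: split until the slice is
// small, fork the left half, compute the right half, then join.
public class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10000;
    private final long[] data;
    private final int lo, hi;

    SumTask(long[] data, int lo, int hi) {
        this.data = data;
        this.lo = lo;
        this.hi = hi;
    }

    protected Long compute() {
        if (hi - lo <= THRESHOLD) { // small slice: sum sequentially
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += data[i];
            return sum;
        }
        int mid = (lo + hi) >>> 1;
        SumTask left = new SumTask(data, lo, mid);
        SumTask right = new SumTask(data, mid, hi);
        left.fork();                          // schedule left half on the pool
        return right.compute() + left.join(); // do right half here, then wait
    }

    public static void main(String[] args) {
        long[] data = new long[10000000];
        Arrays.fill(data, 1L);
        long sum = new ForkJoinPool().invoke(new SumTask(data, 0, data.length));
        System.out.println(sum); // prints 10000000
    }
}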

Mark Thornton
 
J

Jon Harrop

Mark said:
With multicore (shared memory) processors now the norm on the desktop,
the advantage of the Java approach is significant for some desktop
applications.

I'm interested in seeing evidence either way. From the benchmarks in this
thread, for example, I attribute the (up to) 5x slower allocation rates in
Java to the fact that its allocator and GC must cope with interthread
issues.
The work I'm doing at the moment has threads sharing data
structures which run to hundreds of megabytes. Combined with Doug Lea's
Fork/Join library (jsr166y), this is giving me good speed-up over
sequential computation.

Sure. The work I'm doing at the moment has many processes with large shared
data structures.
Separate processes not only impose higher communication overhead, but on
some architectures the overhead of process creation is substantial.
Much as I would like to wean customers off Windows, my code has to work
efficiently on that platform.

Yes. Process creation in OCaml takes about 10ms here whereas thread creation
in F# takes 1ms.
 
R

Roedy Green

To my surprise, java -server skunked JET, which just edges out java
-client. I gather the run-time knowledge -server can collect is paying
off big time here.

Athlon 64 X2 3800+ dual core, 2 GB RAM.

java -client Simplify 10000000 3
[* x [+ [+ [* 12 0] [+ 23 8]] y]]
Bench 0
Took 4.422 seconds
Bench 1
Took 4.728 seconds
Bench 2
Took 4.408 seconds
[* x [+ 31 y]]

Using the JET-compiled executable, Simplify.exe:
[E:\exper]Simplify 10000000 3
[* x [+ [+ [* 12 0] [+ 23 8]] y]]
Bench 0
Took 4.178 seconds
Bench 1
Took 4.193 seconds
Bench 2
Took 4.303 seconds
[* x [+ 31 y]]

java -server Simplify 10000000 3
[* x [+ [+ [* 12 0] [+ 23 8]] y]]
Bench 0
Took 2.91 seconds
Bench 1
Took 2.84 seconds
Bench 2
Took 2.84 seconds
[* x [+ 31 y]]
 
R

Roedy Green

Jon said:
Then you'll be able to improve the Java implementations of these benchmarks
so that they aren't up to 5x slower than an unoptimized alternative.

That does not follow. A capability of the JVM implies nothing about
my skill or willingness to optimise your code.
 
M

Mark Thornton

Jon said:
I'm interested in seeing evidence either way. From the benchmarks in this
thread, for example, I attribute the (up to) 5x slower allocation rates in
Java to the fact that its allocator and GC must cope with interthread
issues.

The allocator is trivial; almost all the time is in collection. The
comment seen elsewhere that dead objects are free to collect is not
quite true. If you have very high rates of garbage creation, the nursery
area will fill frequently and cause garbage collection to occur often.
There seems to be at least some fixed cost in GC, so even if the live
object count/size is very small, the time for each GC cycle does not
tend to zero. Many cases of high rates of allocating short-lived objects
have a recognisable block structure. The current JVM can detect this
(escape analysis) but so far only uses it to eliminate some types of
locks. Perhaps unfortunately, a lot of production Java code has been
hand-'optimised' to reduce such allocations, which means that when the
compiler people test escape analysis, the returns aren't as good as
hoped (programmers have already done it by hand). This technique is
apparently used in the (not free) JET JVM, and we hope it will finally
appear in Java 7.
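
A schematic example of that block structure (EscapeDemo and Vec are
invented names for illustration, not code from this thread): the
temporary object never escapes its method, so an escape-analysing JIT
could in principle stack-allocate it or scalar-replace its fields,
while hand-'optimised' code simply avoids the allocation.

// Hypothetical illustration: the temporary Vec never escapes
// lengthSquared, so escape analysis could eliminate the heap
// allocation entirely.
public class EscapeDemo {
    static final class Vec {
        final double x, y;
        Vec(double x, double y) { this.x = x; this.y = y; }
    }

    static double lengthSquared(double x, double y) {
        Vec v = new Vec(x, y); // dead as soon as the method returns
        return v.x * v.x + v.y * v.y;
    }

    // The hand-'optimised' equivalent never allocates at all:
    static double lengthSquaredByHand(double x, double y) {
        return x * x + y * y;
    }

    public static void main(String[] args) {
        System.out.println(lengthSquared(3, 4));       // 25.0
        System.out.println(lengthSquaredByHand(3, 4)); // 25.0
    }
}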


As for multithread issues, the default garbage collector (note that
Sun's JVM comes with a number of collectors which you can select using
runtime options) is a stop-the-world variety. As all threads are stopped
(apart from the collector itself), interthread issues are non-existent.
In threaded applications this means that neither allocation nor
collection has threading overheads (no synchronization in either case).
OK, it must take some time to suspend/resume the working threads.
Compared to the costs involved with thread-safe heaps in C/C++ this can
be a significant gain. Note that the JVM can use simplified code when
running on a uniprocessor, i.e. multiprocessor costs only occur if you
actually run the application on such a machine.
Sure. The work I'm doing at the moment has many processes with large
shared data structures.

It is awkward to include pointers in the shared structures unless you
can ensure that the structure is mapped to the same address in each
process. That is easy to arrange in a Unix fork environment, but more
difficult to ensure on Windows, especially if the processes are not
effectively clones.

Incidentally, your ray trace code makes unnecessary use of inner classes
as opposed to static nested classes. Sprinkling a few 'static' keywords
around chops about 10% off the time of the simple version. Another 10%
can be saved by specifying a larger heap than the default (on my machine
the default initial heap is about 5MB). These savings apply to both the
client and server JVMs.
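
Schematically (Scene, InnerVec and Vec are hypothetical names, not the
ray tracer's actual classes), the difference is:

public class Scene {
    // An inner class instance carries a hidden reference to its
    // enclosing Scene, so every allocation pays for one extra field
    // and construction requires an enclosing instance.
    class InnerVec {
        double x, y, z;
    }

    // A static nested class has no hidden outer reference: the
    // objects are smaller and can be created without a Scene in hand.
    static class Vec {
        double x, y, z;
    }
}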

Mark Thornton
 
J

Jon Harrop

Roedy said:
That does not follow. A capability of the JVM implies nothing about
my skill or willingness to optimise your code.

Your statement is contrary to the evidence: Java does not allocate fast.
 
J

Jon Harrop

Mark said:
As for multithread issues, the default garbage collector (note that
Sun's JVM comes with a number of collectors which you can select using
runtime options) is a stop-the-world variety. As all threads are stopped
(apart from the collector itself), interthread issues are non-existent...

Stop-the-world is exactly the kind of interthread issue that I was referring
to.

If you take this symbolic simplifier, for example, and make it run two
simplifications at a time on my dual core, then its performance will not
improve by a factor of two if the GC stops all threads while it runs.

In contrast, the OCaml version is not only more than twice as fast to
start with, but it will scale linearly with the number of cores in this
case because the per-process GCs do not interfere.
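
A sketch of that experiment (assuming Roger's SimplifyLong from earlier
in the thread is on the classpath; SimplifyParallel is an invented name):

// Run Roger's reduction loop on two threads at once. With perfect
// scaling the wall-clock time matches a single-threaded run; GC
// pauses that stop all threads show up as a smaller-than-2x speedup.
public class SimplifyParallel {
    public static void main(String[] args) throws InterruptedException {
        final int lim = Integer.parseInt(args[0]);
        Thread[] workers = new Thread[2];
        long start = System.currentTimeMillis();
        for (int t = 0; t < workers.length; t++) {
            workers[t] = new Thread(new Runnable() {
                public void run() {
                    // the same expression Roger's benchmark reduces
                    SimplifyLong.Expr expr = new SimplifyLong.Mul(
                        new SimplifyLong.Sym("x"),
                        new SimplifyLong.Add(
                            new SimplifyLong.Add(
                                new SimplifyLong.Mul(
                                    SimplifyLong.Val.create(12),
                                    SimplifyLong.Val.create(0)),
                                new SimplifyLong.Add(
                                    SimplifyLong.Val.create(23),
                                    SimplifyLong.Val.create(8))),
                            new SimplifyLong.Sym("y")));
                    for (int i = 0; i < lim; i++) {
                        expr.simplify();
                    }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        System.err.println("2 threads took " +
            (System.currentTimeMillis() - start) / 1000.0 + " seconds");
    }
}
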
It is awkward to include pointers in the shared structures unless you
can ensure that the structure is mapped to the same address in each
process.

There are no pointers in these languages though.
Incidentally, your ray trace code makes unnecessary use of inner classes
as opposed to static nested classes. Sprinkling a few 'static' keywords
around chops about 10% off the time of the simple version. Another 10%
can be saved by specifying a larger heap than the default (on my machine
the default initial heap is about 5MB). These savings apply to both the
client and server JVMs.

Can you send me your code? I'll update it.
 
J

Jon Harrop

Roedy said:
To my surprise, java -server skunked JET, which just edges out java
-client. I gather the run-time knowledge -server can collect is paying
off big time here.

Athlon 64 X2 3800+ dual core, 2 GB RAM.

java -server Simplify 10000000 3
[* x [+ [+ [* 12 0] [+ 23 8]] y]]
Bench 0
Took 2.91 seconds
Bench 1
Took 2.84 seconds
Bench 2
Took 2.84 seconds
[* x [+ 31 y]]

Athlon 64 X2 4400+ dual core, 2 GB RAM

OCaml: 0.695s

Even accounting for CPU speed differences, OCaml is still running 3.5x
faster, primarily because it can allocate much more quickly than Java.

This is also neglecting the facts that Java is using 100x as much RAM
and that it doubles the load average (i.e. it maxes out both of my cores
whereas the OCaml only uses one).
 
L

Lew

Jon said:
No, OCaml has data parallel and message passing libraries instead. These
are higher-level constructs for parallelism and concurrency that
abstract away the need to use low-level constructs like monitors and
locks directly. This approach is easier to use and arguably safer
(although I have no evidence to back up the latter in general, outside
of Erlang's overwhelming victory for massively-concurrent applications).

Message passing as a primitive coordination and concurrency mechanism is
extremely powerful, especially in an OS like QNX that reduces
context-switch times to ridiculously small values. I first encountered
it on a minicomputer in 1981. It makes for very straightforward coding,
too.
 
