Execution speed: Java vs. C


nicolasbock

I write programs at work to do numerical calculations. Those programs
are usually written in C. I recently started looking into Java and
found the language and its features very attractive. I figured that I
should give it a try and see whether it is suitable for numerical work.
So I went ahead and wrote a very simple matrix multiplication program
in C and Java and benchmarked them. To my disappointment, C turned out
to be about 1.5 to 2 times faster than Java.

Below is more detail on what I actually did. I calculated the matrix
product of two random 800x800 matrices for int and double matrices. The
tests were run on a PowerBook G4 1.33GHz. The C compiler was gcc-3.3
and the Java version I used was 1.4.2_05. The C code was compiled with
optimization level 3 (-O3). I couldn't find anything equivalent for
javac, so I compiled the code presumably without optimizations or with
some default level of optimization. The runtimes I found were

int matrices
C (-O3): 15s
C (without -O): 33s
Java: 32s

double matrices
C (-O3): 28s
C (without -O): 47s
Java: 45s

Since the unoptimized C code (without specifying any -O level) runs
about as fast as the Java code, I assume that I am looking at badly
optimized Java code.

Can I force the Java compiler or the runtime engine to optimize my code
further?

Thanks, nick






The C program was

#include <stdio.h>
#include <stdlib.h>

#define N 800
#define M 800

int a [N][M];
int b [N][M];
int c [N][M];

int main ()
{
  int i, j, k;

  printf ("creating matrices a and b\n");

  /* Initialize the two matrices. */
  for (i = 0; i < N; ++i) {
    for (j = 0; j < M; ++j) {
      a[i][j] = (int) ((rand () / (double) RAND_MAX - 0.5) * 10);
      b[i][j] = (int) ((rand () / (double) RAND_MAX - 0.5) * 10);
    }
  }

  /* Multiply the matrices. */

  printf ("multiplying them\n");

  for (i = 0; i < N; ++i) {
    for (j = 0; j < M; ++j) {
      c[i][j] = 0;
    }
  }

  for (i = 0; i < N; ++i) {
    for (j = 0; j < M; ++j) {
      for (k = 0; k < M; ++k) {
        c[i][j] += a[i][k] * b[k][j];
      }
    }
  }

  return 0;
}


The Java code was:

public class MatrixTestDirty
{
  public static void main (String args [])
  {
    int N = 800;
    int M = 800;

    int a [][] = new int [N][M];
    int b [][] = new int [N][M];
    int c [][] = new int [N][M];

    System.out.println ("creating matrices a and b");

    /* Initialize the two matrices. */
    for (int i = 0; i < N; ++i) {
      for (int j = 0; j < M; ++j) {
        a[i][j] = (int) ((Math.random () - 0.5) * 10);
        b[i][j] = (int) ((Math.random () - 0.5) * 10);
      }
    }

    /* Multiply the matrices. */

    System.out.println ("multiplying them");

    for (int i = 0; i < N; ++i) {
      for (int j = 0; j < M; ++j) {
        c[i][j] = 0;
      }
    }

    for (int i = 0; i < N; ++i) {
      for (int j = 0; j < M; ++j) {
        for (int k = 0; k < M; ++k) {
          c[i][j] += a[i][k] * b[k][j];
        }
      }
    }
  }
}
 

Skip

So I went ahead and wrote a very simple matrix multiplication program
in C and Java and benchmarked them. To my disappointment, C turned out
to be about 1.5 to 2 times faster than Java.
int a [][] = new int [N][M];
int b [][] = new int [N][M];
int c [][] = new int [N][M];

In Java every array is a separate object. If you do:
int a [] = new int [N*M];
int b [] = new int [N*M];
int c [] = new int [N*M];

your code will be a LOT faster.

Further: it seems you benchmark only the first run. The HotSpot JIT
seems to optimize a code block only after it has been run a couple of
times (normally the 2nd time).
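
A quick way to see that effect is to pull the multiply out into its own
method and time several passes, something like this (just a sketch; the
method name multiply() is my own, standing in for Nick's triple loop):

for (int run = 0; run < 5; ++run) {
    long start = System.currentTimeMillis();
    multiply(a, b, c);   // Nick's triple loop, pulled out into a method (name assumed)
    System.out.println("run " + run + ": "
            + (System.currentTimeMillis() - start) + " ms");
}

The first pass or two include JIT compilation; the later passes show the
steady-state speed.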

Your plain Java code took 16.4s for me.
After I optimised it myself: 12.3s.

After enabling -server on the java command line: 5.9s (you need the
Java SDK for that).
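(That is, you run it with something like: java -server MatrixTestDirty.)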


5.9s / 16.4s = 36% of the time it took, *2.78x faster*

*So with my own optimisation and -server, your Java app will be faster
than C -O3.*

(even surprises me!)

~~~~~~~~~~

Here is my optimised source code:

private final int N = 800;
private final int M = 800;

private final int a[] = new int[N * M];
private final int b[] = new int[N * M];
private final int c[] = new int[N * M];

public final void method1()
{
  int range = N * M;

  for (int i = 0; i < range; ++i)
  {
    a[i] = (int) ((Math.random() - 0.5) * 10.0);
  }
}

public final void method2()
{
  int range = N * M;

  for (int i = 0; i < range; ++i)
  {
    c[i] = 0;
  }
}

public final void method3()
{
  for (int i = 0; i < N; ++i)
  {
    for (int j = 0; j < M; ++j)
    {
      for (int k = 0; k < M; ++k)
      {
        c[i + j * N] += a[i + k * N] * b[k + j * N];
      }
    }
  }
}
 
Skip

[snipped...]

May I add that the same calculation on float matrices takes only 4.6s (Java
1.4.2+ uses SSE); the results are of course slightly less accurate.
 

Mike Cox

Skip said:
So I went ahead and wrote a very simple matrix multiplication program
in C and Java and benchmarked them. To my disappointment, C turned out
to be about 1.5 to 2 times faster than Java.
int a [][] = new int [N][M];
int b [][] = new int [N][M];
int c [][] = new int [N][M];

In java every array is a separate object, if you do:
int a [] = new int [N*M];
int b [] = new int [N*M];
int c [] = new int [N*M];

your code will be a LOT faster.

further: you benchmark only the first run it seems. the HotSpot JIT seems
to optimize the code only after the same code block is run a couple of
times (normally the 2nd time).

HotSpot JIT - is that Sun's JIT compiler, or is it some third-party code that
compiles to native code?
your plain java-code took 16.4s for me.
after i optimised it myself: 12.3s

after enableding the -server in java commandline 5.9s (you need the java
SDK for that)

In the java command or in javac? I'm testing out Tomcat, so could I compile
a servlet with the server option, like so: javac -server myjava.java

Along the same lines, is it possible to compile a Java servlet into native
code so it runs super fast on Tomcat?
 

Michael Borgwardt

Skip said:
int a [][] = new int [N][M];
int b [][] = new int [N][M];
int c [][] = new int [N][M];


In java every array is a separate object, if you do:

int a [] = new int [N*M];
int b [] = new int [N*M];
int c [] = new int [N*M];


your code will be a LOT faster.

Why? Allocating the arrays happens only once at the beginning of the program.
Should not be relevant.
further: you benchmark only the first run it seems. the HotSpot JIT seems to
optimize the code only after the same code block is run a couple of times
(normally the 2nd time).

That and not using -server seem to be the biggest factors to me.
 

Alex

I agree that HotSpot compilers do much better work than they did years
ago.

Your test is again too simple. Matrix multiplication is nothing. The
compiler option -O3 is not enough to get the maximum out of GCC; it is
only a request to optimize the code, not to make it as fast as possible.
So please turn on the full set of relevant command line options. You
may be surprised.
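(For Nick's gcc 3.3 that means going beyond plain -O3, e.g. something along
the lines of gcc -O3 -funroll-loops -fomit-frame-pointer -ffast-math; those
flags are only a starting point, and any CPU-specific options are worth
looking up in the gcc manual.)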

Try feeding it FFT code or something complex (search for Ooura FFT). I
think you will get more realistic numbers then - Java code at least ~8-10
times slower.

Try:

1. Use the Intel optimizing compiler.
2. Enable optimization for your CPU model.
3. Enable MMX/SSE/SSE2 optimizations.
4. Enable full function inlining (plus interprocedural and intermodular
optimizations).
5. Enable loop unrolling.

We can think of the Java JIT optimiser as a weak (but fast) C compiler,
like Borland Turbo C, which tends to inline most of the code. Nothing more.
The latest C/C++ compilers, when told to do everything they can, are ALWAYS
faster than the Java JIT.
 

Skip

int a [] = new int [N*M];
int b [] = new int [N*M];
int c [] = new int [N*M];


your code will be a LOT faster.

Why? Allocating the arrays happens only once at the beginning of the program.
Should not be relevant.

It's about access times.

IIRC:
int[64] is a block of 64 ints in memory
int[16][4] are 16 blocks of 4 ints.

so the JVM has to find int[x][] first, then int[x][y]
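
Roughly, the two access patterns look like this (just a sketch):

int i = 3, j = 2;                  // example indices
int[][] m2 = new int[16][4];
int twoStep = m2[i][j];            // step 1: load the row object m2[i]; step 2: index into it
int[] m1 = new int[16 * 4];
int oneStep = m1[i * 4 + j];       // one object, plain address arithmetic, one bounds check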

And I proved it's faster:

16.9s ---> 12.3s


I have done this optimization very often, and it has always been much
faster, as you just saw.

HTH
 

Michael Borgwardt

Skip said:
It's about access times.

IIRC:
int[64] is a block of 64 ints in memory
int[16][4] are 16 blocks of 4 ints.

so the JVM has to find int[x][] first, then int[x][y]

That's only relevant if x is different for each access; otherwise
the result should be in the first-level CPU cache and retrieval so fast
that it doesn't matter. So it depends on the access patterns of the
concrete application.
and i proved it's faster.

16.9s ---> 12.3s

It sounded to me like that was the result after the code change *and*
giving the Hotspot compiler time to optimize the method.
i did this optimization very often, and it was always much faster, as you
just saw.

I wouldn't call it *much* faster. A factor of 5, *that's* "much faster",
and not at all an uncommon achievement.

But of course every improvement helps.
 

Michael Borgwardt

Mike said:
HotSpot JIT, is that SUN's JIT compiler or is that some third-part code that
compiles native code?

The former.
In the java command or in javac? I'm testing out Tomcat, so could I compile
a servlet with the server option, like so: javac -server myjava.java

No, it's a runtime option, not a compiler option.
In the same thought, is it possible to compile a java servlet into native
code so it runs super fast on Tomcat?

That's neither possible nor would it run "super fast".
 

Michael Borgwardt

Alex said:
Your test is again too simple. Matrix multiplication is nothing. The
compiler option -O3 is not enough to get the maximum out of GCC; it is
only a request to optimize the code, not to make it as fast as possible.
So please turn on the full set of relevant command line options. You
may be surprised.

Try feeding it FFT code or something complex (search for Ooura FFT). I
think you will get more realistic numbers then - Java code at least ~8-10
times slower.

Those are pretty big claims. Why not do that yourself and show us the numbers?

We can think of the Java JIT optimiser as a weak (but fast) C compiler,
like Borland Turbo C, which tends to inline most of the code. Nothing more.

That is untrue. Java JIT compilers can theoretically do anything that
C compilers can do, and more (since they have more information about
the current machine), and in practice the HotSpot JIT, as of 1.4.1,
performs inlining, loop unrolling, dead code elimination, loop invariant
hoisting, common subexpression elimination, and constant propagation, as
well as optimization of register usage and a lot of Java-specific stuff.

The latest C/C++ compilers, when told to do everything they can, are ALWAYS
faster than the Java JIT.

Perhaps faster than a particular JIT, perhaps even faster than the best
current JIT, but in full generality that statement can only be untrue.
 

joeking

Michael said:
Skip said:
It's about access times.

IIRC:
int[64] is a block of 64 ints in memory
int[16][4] are 16 blocks of 4 ints.

so the JVM has to find int[x][] first, then int[x][y]

That's only relevant if x is different for each access, otherwise
the result should be in first-level CPU cache and retrieval so fast
that it doesn't matter. So it depends on the access patterns of the
concrete application

Is there a first-level CPU cache in JVM interpreted mode? ;-)

To answer the question at large, I think that although Java's speed
is generally no longer a problem, heavy number crunching is perhaps
the one area where Java isn't so hot - but hey, nothing is ever
perfect! That isn't to say that Java cannot do number crunching,
because it quite obviously can, merely that you have to work a bit
harder to get results with Java than you would with a lower level
language like C.

However - this does not mean that you should reject Java outright
(as so many foolishly do). Typically the actual number-crunching
code in such applications accounts for only a small percentage of
the total source. There's I/O code, GUI code, and a host of other
things at which Java performs on a par with other languages. By
writing the bulk of the application in Java, and using JNI to link
it to a handful of tight native routines which do all the math, you
can get the best of both worlds. And while the resulting
application might not be 100% portable, you have at least reduced
to a minimum the amount of code which has to be rebuilt for
each platform.
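
For illustration, the Java side of such a JNI bridge is tiny (the class,
method and library names here are made up for the sketch):

public class NativeMath
{
    static {
        // Loads libnativemath.so / nativemath.dll; the library itself is
        // written in C and built separately for each platform.
        System.loadLibrary("nativemath");
    }

    // Implemented in C; the matching C header can be generated with javah.
    public static native void multiply(int[] a, int[] b, int[] c, int n, int m);
}

Everything else in the application stays pure Java.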


-FISH- ><>
 

Michael Borgwardt

To answer the question at large, I think that although Java's speed
is generally no longer a problem, heavy number crunching is perhaps
the one area where Java isn't so hot - but hey, nothing is ever
perfect! That isn't to say that Java cannot do number crunching,
because it quite obviously can, merely that you have to work a bit
harder to get results with Java than you would with a lower level
language like C.

Even with number crunching, Java can be a big win if it means that
development is faster. If a full test run of the app takes 5 days,
then every bug that doesn't happen counts, and I heard that
scientists often write specialized simulations on the fly that
run only a few times - in those cases, whether the code takes 5 or
10 hours to run matters less than whether it takes 1 or 2 weeks
to write and debug.
 

joeking

Michael Borgwardt wrote:
[snipped...]
Even with number crunching, Java can be a big win if it means that
development is faster. If a full test run of the app takes 5 days,
then every bug that doesn't happen counts, and I heard that
scientists often write specialized simulations on the fly that
run only a few times - in those cases, whether the code takes 5 or
10 hours to run matters less than whether it takes 1 or 2 weeks
to write and debug.


Agreed. Which is why it is so disappointing when programmers
reject Java in favour of C, citing heavy number crunching as
an excuse - when a look at their code would often reveal that
number crunching accounts for only a tiny portion of their
application, and actually (as you correctly pointed out)
development time can be more costly than 'running' time.

Even for applications where the balance is different, and
the software will be used frequently, it is still worth
writing the bulk in Java and limiting C to only those
handful of routines which need to be extremely fast. Fewer
errors and faster turn-around on 95% of your source code
is better than on 0%, after all. ;-)


-FISH- ><>
 

Chris Uppal

Michael said:
int[64] is a block of 64 ints in memory
int[16][4] are 16 blocks of 4 ints.

so the JVM has to find int[x][] first, then int[x][y]

That's only relevant if x is different for each access, otherwise
the result should be in first-level CPU cache and retrieval so fast
that it doesn't matter. So it depends on the access patterns of the
concrete application

Don't forget that you are still pulling more data into the CPU. So if 4 bytes
of L1 cache are in use to hold the reference to the secondary array, then those
four bytes cannot be used to hold the contents of the secondary array. Or --
more likely, I suspect -- in the case where memory accesses to the data in the
secondary arrays are mostly sequential (a necessary pre-requisite for caching
to have a hope of hiding the underlying inefficiency of double-indirection at
all) the data from the secondary arrays may evict the ref to the primary array
from the cache.

Of course, how much difference that will make in practice (not to mention any
other effects) is a matter of measurement. So here are my measurements (made
on a 1.3 GHz WinXP laptop).

First off, a baseline.

Nick's C code (with minor changes to run the multiply in a loop) compiled using
MS Visual Studio 2003 in "debug" mode:
8.7 seconds per multiply.

Same but in "release" mode, plus all the optimisations I could think of turned
on (and array bounds checking turned off !):
4.3 seconds per multiply.

Now Nick's Java code (with mods to run the multiply in a loop, and also
changing N and M into static final ints and the three arrays into static
fields). BTW, I did check that there was no difference in the generated
bytecodes between javac -g and javac -O. In all cases I used JDK 1.5.0.

First using java -client:
13.5 seconds per multiply. (Eek! ;-)

Ok, now see if warming up the JITer will help. Run multiply 5 times before
starting to measure:
13.5 seconds per multiply.
No difference...

Now try java -server (with no warmup):
10.2 seconds per multiply
Hmm, better but still not impressive...

We may as well try the warmup with -server:
9.0 seconds per multiply
So warmup had more effect on -server than -client (in this case).
Interesting...

So, so far, the best we've got is >2 times slower than M$ can make C code
run. Let's try some optimisations.

First, there's an obvious opportunity for hoisting common subexpressions (and
similar) out of the inner loop. We may as well start with that.
Running -server:
9.0 seconds per multiply.
So the -server option is already doing that optimisation.
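
(For concreteness, the sort of hand-hoisting I mean is roughly this, written
against Nick's 2-d arrays - a sketch only:)

for (int i = 0; i < N; ++i) {
    final int[] ai = a[i];    // hoist the row lookups out of the two inner loops
    final int[] ci = c[i];
    for (int j = 0; j < M; ++j) {
        int sum = 0;          // accumulate in a local instead of writing ci[j] each time
        for (int k = 0; k < M; ++k) {
            sum += ai[k] * b[k][j];
        }
        ci[j] += sum;         // c was zeroed beforehand, as in Nick's code
    }
}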

OK, so let's try switching to 1-dimensional arrays and using multiply (but not
optimising common subexpressions):
4.8 seconds per multiply.
Aha ! That's not so bad...

We may as well put the common-subexpression optimisations back in:
4.9 seconds per multiply
Looks like micro-optimising by hand is not a good idea when using -server.
(Actually, I tried breaking it down a bit, and it seems that some of the
hand-optimisations help, others hinder, taken all in all they just about
average out to making no difference at all.)

So, when running the server JVM, the switch to 1-dimensional arrays is very
effective. Indeed it seems to be the /only/ effective optimisation.

Just for interest (as if the foregoing was for any other reason ;-) I tried the
optimisations on the client JVM. Using 1-d arrays brought the time down from
13 to 10 seconds, so the relative (and absolute) difference it made was much
smaller than for the server JVM. Adding in the subexpression optimisation
brought the time down still further, to around 7 seconds. I suppose it makes
sense that hand-optimisation would have more effect on the client JVM than the
server JVM.

For a last datapoint, I folded the hand-optimisations back into the C code.
It didn't make a damn bit of difference, confirming that the MS optimiser
was (for once) doing a decent job.

Bottom line, I suppose, is (insofar as this test is representative):
1) it would appear that java -server /can/ be as fast as well-optimised C.
2) using 1-dimensional arrays /can/ be a big win, but only if everything else
is optimised (e.g. by running -server) to the extent where the array access is
in fact the bottleneck.

-- chris
 

John C. Bollinger

Michael Borgwardt wrote:
[snipped...]
Even with number crunching, Java can be a big win if it means that
development is faster. If a full test run of the app takes 5 days,
then every bug that doesn't happen counts, and I heard that
scientists often write specialized simulations on the fly that
run only a few times - in those cases, whether the code takes 5 or
10 hours to run matters less than whether it takes 1 or 2 weeks
to write and debug.



Agreed. Which is why it is so disappointing when programmers
reject Java in favour of C, citing heavy number crunching as
an excuse - when a look at their code would often reveal that
number crunching accounts for only a tiny portion of their
application, and actually (as you correctly pointed out)
development time can be more costly than 'running' time.

It is _particularly_ disappointing when programmers do that because
their choice of Java alternative is poor for the task. The traditional
language for number crunching is Fortran, and although writing and
debugging Fortran may not be as easy as writing and debugging Java, it's
easier than writing and debugging C (IMO). Optimized Fortran code
generally beats out comparable C code, but not by large margins. Such
comparisons are inherently tricky, however, so YMMV.


John Bollinger
 

Morten Alver

Even with number crunching, Java can be a big win if it means that
development is faster. If a full test run of the app takes 5 days,
then every bug that doesn't happen counts, and I heard that
scientists often write specialized simulations on the fly that
run only a few times - in those cases, whether the code takes 5 or
10 hours to run matters less than whether it takes 1 or 2 weeks
to write and debug.

Absolutely true. I work on mathematical simulations, and I usually
modify the program without too many runs in between. I might get some
performance gain using C or Fortran, but the extra cost in development
time and debugging would be a lot more significant. Also, for my
specific application, using a non-object oriented language would be a
real hassle - and it's probably harder to find an object oriented
language that would consistently perform faster than Java (C++, perhaps,
I'm not sure).
 

Chris Smith

Is there a first-level CPU cache in JVM interpreted mode? ;-)

Based on your smiley, I'll assume you already know this. For the
benefit of others, though, this is a flawed question. Java code is very
rarely interpreted with standard JVM implementations (including Sun's)
on desktop platforms. It only happens in a few cases where the JVM
heuristically determines that it would be faster that way. Performance-
critical code is always JITed to native code and run natively, so of
course CPU cache matters.

Interpreters are largely limited to smaller platforms such as cell
phones and such. On these platforms, the CPU cache still does some
good, but it is watered down a bit by the internal memory accesses of
the interpreter itself.

--
www.designacourse.com
The Easiest Way To Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 

Chris Smith

Alex said:
Try feeding it FFT code or something complex (search for Ooura FFT). I
think you will get more realistic numbers then - Java code at least ~8-10
times slower.

Overall, I am still of the opinion that the average performance of top C
compilers will beat the performance of Sun's Java virtual machine.
However, eight to ten times is a ridiculous number. Remember that
extraordinary claims require extraordinary evidence. You've provided no
evidence at all.

My own experience, by the way, is that it's difficult to do TOO awfully
much better than gcc -O3. In fact, -O2 does reasonably well, and the
difference between -O2 and -O3 is itself relatively small. Intel's
compiler can definitely generate better code on average for Intel
platforms than gcc does, but the margin is *not* going to be several
hundred percent.

--
www.designacourse.com
The Easiest Way To Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 

beliavsky

The following Fortran 95 code, which multiplies two 800x800 double
precision matrices, is shorter and faster than the Java and C code. On a
2.8GHz Pentium Pro machine, compiled with Compaq Visual Fortran 6.6 with
full optimization, it takes 1.0 s. Accounting crudely for the difference
in clock speeds, that's a speed factor of (45/1.0)/(2.8/1.33) = 21 over
Java. The Intel Fortran compiler may be even faster.

Fortran 95, with its array operations and elemental functions, is a
higher level language for numerical work than Java, C, or C++, and
numerical codes should be faster both to write and to run using it. The
recently finalized Fortran 2003 standard adds support for OOP with
inheritance.

program xmatmul_time
  implicit none
  integer, parameter :: n = 800
  real(kind=8)       :: xx(n,n), yy(n,n)
  real               :: t1, t2, xsum
  call random_seed()
  call random_number(xx)
  call random_number(yy)
  call cpu_time(t1)
  xsum = sum(matmul(xx,yy))
  call cpu_time(t2)
  print*, 1000*(t2-t1), xsum
end program xmatmul_time
 

Chris Uppal

The following Fortran 95 code below, which multiplies two 800x800
double precision matrices, is shorter
and faster than the Java and C codes.

Interesting example. I'm pleased to see that the traditional superiority of
Fortran still holds. (Not that I'm a Fortran programmer myself.) Presumably a
32-bit integer version would perform about the same? Or would it be twice as
fast?

On a 2.8GHz Pentium Pro machine,
compiled with Compaq Visual Fortran 6.6 with full optimization, it
takes 1.0 s. Accounting crudely for the difference in clock speeds,
that's a speed factor of (45/1.0)/(2.8/1.33) = 21 over Java. The Intel
Fortran compiler may be even faster.

Um, since Java will do the 32-bit int version in ~5 seconds on my 1.3 GHz
machine, I think you'll find that the real ratio is more like 2x, or even 4x,
than 21x.

Presumably the Fortran implementation uses a library multiplication that has
been carefully tuned to get maximum benefit from cache effects. It'd be
interesting to know how the equivalent code expressed in Java (or C) would
perform.
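
(The kind of cache-blocked loop such a tuned library is built around looks
roughly like this - a sketch only, against a 1-d row-major layout, with the
method name and block size picked ad hoc:)

// Sketch: blocked multiply over 1-d row-major arrays,
// i.e. c[i*n+j] += a[i*n+k] * b[k*n+j].
static void multiplyBlocked(int[] a, int[] b, int[] c, int n) {
    final int B = 50;                                   // 800 divides evenly by 50
    for (int ii = 0; ii < n; ii += B)
        for (int kk = 0; kk < n; kk += B)
            for (int jj = 0; jj < n; jj += B)
                for (int i = ii; i < ii + B; ++i)
                    for (int k = kk; k < kk + B; ++k) {
                        final int aik = a[i * n + k];   // constant across the innermost loop
                        for (int j = jj; j < jj + B; ++j)
                            c[i * n + j] += aik * b[k * n + j];
                    }
}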

-- chris
 
