Sines and Cosines

K

Kenneth P. Turvey

I saw the the monster thread in this group comparing the performance of
a single library function between C++ and Java and it gave me an idea.

In an earlier discussion in this newsgroup we discussed the fact that
Intel hardware might cause Java to perform poorly on transcendental
functions do to the poor accuracy of its implementation of the functions
in it's processor.

I had read some time ago (when I was pulling my hair out working on
Fourier transforms of images) that this was the reason I was getting
such poor performance out of Java on these problems.

I had never tested it. So today I wrote two similar programs, one in C
and one in Java, that just produced random numbers and then calculated
the sine and cosine of them. (Please don't tell anyone in comp.lang.c,
one pointless flame war is enough).

These numbers showed Java to be much less efficient in doing sines
and consines, but it still left the question open as to whether this
performance lag was due to the math involved itself or just some
inefficiency in the way Java was processing the loop or the array
lookups. So I ran another test to see how the performance compared on
the same program when the transcendental functions were replaced with
simple floating point arithmetic. In this case Java out performed C. So
it looks as if the Java compilers are, in fact, having to work around an
Intel bug on this platform. This work around is quite costly as can be
seen in the data below. I had hoped that Intel would have fixed this
problem by now, but I guess not.

For the purposes of comparison would someone please run the test
programs on a platform that does not use an Intel processor? I would
like to know if Java performs better on these platforms when compared to
the C implementation.

Thanks!



The results:

Sine/Cosine
Environment Calculations* Seconds
--------------------------------------------
C 40,000,000 3.118970
sun java 6.0 40,000,000 6.543
ibm java 6.0 40,000,000 12.669
ibm java 5.0 40,000,000 9.82

C 80,000,000 6.155521
sun java 6.0 80,000,000 13.038
ibm java 6.0 80,000,000 25.404
ibm java 5.0 80,000,000 12.351


Multiplication/Division
Environment Calculations* Seconds
--------------------------------------------
C 40,000,000 0.502109
sun java 6.0 40,000,000 0.432
ibm java 6.0 40,000,000 0.424
ibm java 5.0 40,000,000 0.427

C 80,000,000 1.005564
sun java 6.0 80,000,000 0.86
ibm java 6.0 80,000,000 0.855
ibm java 5.0 80,000,000 0.855

* Note that this is double the argument to the program
** These results are all based on the best of 10 runs.




--Java Program---------------------------------------
public class Main {
public static void main(String[] args) {
int nums = 10;
int runs = 1;
Random random = new Random();

try {
if (args.length < 1 && args.length > 3) {
throw new IllegalArgumentException();
}

nums = Integer.parseInt(args[0]);
if (args.length >= 2) {
runs = Integer.parseInt(args[1]);
}

if (args.length == 3) {
random = new Random(Integer.parseInt(args[2]));
}
}
catch(Exception e) {
System.err.println("Usage: Transcendental.jar <nums> [runs] [seed]");
System.exit(-1);
}

double[] values = new double[nums];
double[] sins = new double[nums];
double[] coss = new double[nums];

for (int run = 0; run < runs; run++) {
for (int index = 0; index < nums; index++) {
values[index] = random.nextDouble() * 2.0 * Math.PI;
}

long startMillis = System.currentTimeMillis();
for (int index = 0; index < nums; index++) {
//sins[index] = Math.sin(values[index]);
//coss[index] = Math.cos(values[index]);
sins[index] = values[index] * 17.0;
coss[index] = values[index] / 23.0;
}
long endMillis = System.currentTimeMillis();

System.out.println("Runtime: " + (endMillis - startMillis) / 1000.0
+ " seconds");
int index = random.nextInt(nums);
System.out.println("Example: value: " + values[index] + " Sine: "
+ sins[index] + " Cosine: " + coss[index]);
}
}
}
------------------------------

---C Program------------------
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>

#define PI 3.14159265358979323846

double get_seconds(struct timeval end_time, struct timeval start_time);

int main(int argc, char* argv[]) {
int nums = 10;
int runs = 10;

srand(times(NULL));

if (argc < 2 || argc > 4) {
fprintf(stderr, "Usage: transcendental <num> [runs] [seed]\n");
exit(-1);
}

if (argc >= 2) {
sscanf(argv[1], "%d", &nums);
}
if (argc >= 3) {
sscanf(argv[2], "%d", &runs);
}
if (argc == 4) {
int seed;
sscanf(argv[3], "%d", &seed);
srand(seed);
}

double *values = (double*) malloc(nums * sizeof(double));
double *sins = (double*) malloc(nums * sizeof(double));
double *coss = (double*) malloc(nums * sizeof(double));

int run;
for (run = 0; run < runs; run++) {
int index;
for (index = 0; index < nums; index++) {
values[index] = rand() * 2.0 * PI / RAND_MAX;
}

struct timeval start;
struct timeval end;

gettimeofday(&start, NULL);

for (index = 0; index < nums; index++) {
//sins[index] = sin(values[index]);
//coss[index] = cos(values[index]);
sins[index] = values[index] * 17.0;
coss[index] = values[index] / 23.0;
}

gettimeofday(&end, NULL);

printf("Runtime: %lf seconds\n", get_seconds(end, start));

// Print one out at random so the compiler can't optimize it away.
index = (int) rand() * (double) nums / RAND_MAX;
printf("Example: value: %lf sine: %lf cos: %lf\n",
values[index], sins[index], coss[index]);
}

return 0;
}

double get_seconds(struct timeval end_time, struct timeval start_time) {
double end = end_time.tv_sec + end_time.tv_usec / 1000000.0;
double start = start_time.tv_sec + start_time.tv_usec / 1000000.0;
return end - start;
}
 
E

Eric Sosman

Kenneth said:
[... speed of sin,cos in Java vs C on Intel ...]
I had never tested it. So today I wrote two similar programs, one in C
and one in Java, that just produced random numbers and then calculated
the sine and cosine of them. (Please don't tell anyone in comp.lang.c,
one pointless flame war is enough).
> [...]

Your secret is safe with me.

It seems to me there are still too many variables in the
test, besides the sine and cosine calculations themselves. One
fairly prominent difference is Java's bounds-checked arrays vs.
C's "we don' need no steenkin' bounds!" pointer hooliganism.
It seems a lot of the array stuff is to keep C's optimizer from
throwing code away, but there are other ways to do that, e.g.:

double x = ...;
t0 = ... current time ...;
for (index = 0; index < nums; ++index) {
x = sin(x);
x = cos(x);
}
t1 = ... current time ...;
... print elapsed time and value of x ...

Also, it would be a good idea to mention what C compiler
you used, and with what optimization flags: There's a lot more
variation among cc's than among javac's!
 
M

Mark Thornton

Kenneth said:
These numbers showed Java to be much less efficient in doing sines
and consines, but it still left the question open as to whether this
performance lag was due to the math involved itself or just some
inefficiency in the way Java was processing the loop or the array
lookups. So I ran another test to see how the performance compared on
the same program when the transcendental functions were replaced with
simple floating point arithmetic. In this case Java out performed C. So
it looks as if the Java compilers are, in fact, having to work around an
Intel bug on this platform. This work around is quite costly as can be
seen in the data below. I had hoped that Intel would have fixed this
problem by now, but I guess not.

Intel don't consider it to be a bug. Their processors use a 66 bit
approximation of PI to do argument reduction. This isn't sufficient to
meet Java's accuracy specification for these trig functions. Is the Java
spec unreasonable?

Mark
 
K

Kenneth P. Turvey

It seems to me there are still too many variables in the
test, besides the sine and cosine calculations themselves. One
fairly prominent difference is Java's bounds-checked arrays vs.
C's "we don' need no steenkin' bounds!" pointer hooliganism.

I don't think we need to worry about that since the Java compiler can
optimize away the bounds checking in this case, and it seems to do so when
the results for simple floating point arithmetic are seen.
It seems a lot of the array stuff is to keep C's optimizer from
throwing code away, but there are other ways to do that, e.g.:

That was exactly the point. It doesn't just affect C though. Java would
be affected too, quite possibly.

double x = ...;
t0 = ... current time ...;
for (index = 0; index < nums; ++index) {
x = sin(x);
x = cos(x);
}
t1 = ... current time ...;
... print elapsed time and value of x ...

I think this has a couple problems, but the big one I see right away
without testing is that the domain in the second call is only between -1.0
and 1.0 not 0 and 2*PI.
Also, it would be a good idea to mention what C compiler
you used, and with what optimization flags: There's a lot more
variation among cc's than among javac's!

I'm using gcc with -O, but it shouldn't matter much for this code.
 
K

Kenneth P. Turvey

Intel don't consider it to be a bug. Their processors use a 66 bit
approximation of PI to do argument reduction. This isn't sufficient to
meet Java's accuracy specification for these trig functions. Is the Java
spec unreasonable?

I don't think the spec is unreasonable, but I would like the option of
substituting in a faster version of the methods. I don't see a good way
to do so. Using native methods wouldn't help and there isn't a way to do
in line assembly in Java.
 
E

Eric Sosman

Kenneth said:

Okay, here's my attempt (sources below). On a 3GHz
Pentium 4 running WinXP SP2, I get

Java 1.6.0_05:
nums = 1000000, runs = 10, theta = 0.0, delta = 3.8785094135818516
360 ms, final theta = -0.739085133899082
344 ms, final theta = -0.739085133899082
359 ms, final theta = -0.739085133899082
359 ms, final theta = -0.739085133899082
360 ms, final theta = -0.739085133899082
344 ms, final theta = -0.739085133899082
344 ms, final theta = -0.739085133899082
359 ms, final theta = -0.739085133899082
343 ms, final theta = -0.739085133899082
343 ms, final theta = -0.739085133899082

DJGPP 3.3.3 (elderly), -O3:
nums = 1000000, runs = 10, theta = 0, delta = 3.87851
330 ms, final theta = -0.739085
385 ms, final theta = -0.739085
385 ms, final theta = -0.739085
385 ms, final theta = -0.739085
330 ms, final theta = -0.739085
385 ms, final theta = -0.739085
385 ms, final theta = -0.739085
385 ms, final theta = -0.739085
385 ms, final theta = -0.739085
385 ms, final theta = -0.739085

.... suggesting Java may be just a hair faster than C, but
that it's "in the noise." (Note, too, that the notions of
"time" used in the two programs are not exactly alike.) I'll
report on results from an AMD laptop later this weekend.

Sources:

public class TrigTime {
public static void main(String[] args) {
int nums = 1000000;
if (args.length > 0)
nums = Integer.parseInt(args[0]);
int runs = 10;
if (args.length > 1)
runs = Integer.parseInt(args[1]);
double theta = 0.0;
if (args.length > 2)
theta = Double.parseDouble(args[2]);
double delta = 1.23456789 * Math.PI;
if (args.length > 3)
delta = Double.parseDouble(args[3]);
System.out.println("nums = " + nums + ", runs = " + runs
+ ", theta = " + theta + ", delta = " + delta);
for (int r = 0; r < runs; ++r) {
cleanUpYourAct();
long t0 = System.currentTimeMillis();
for (int n = 0; n < nums; ++n) {
theta = Math.sin(theta + delta);
theta = Math.cos(theta + delta);
}
long t1 = System.currentTimeMillis();
System.out.println((t1 - t0) + " ms, final theta = "
+ theta);
}
}
/** Without this, the timings show wild variations. */
private static void cleanUpYourAct() {
for (int i = 0; i < 3; ++i) {
System.gc();
try {
Thread.sleep(250);
} catch (InterruptedException ex) {
// big deal
}
}
}
}



#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif
int main(int argc, char *argv[])
{
int nums = 1000000;
int runs = 10;
double theta = 0.0;
double delta = 1.23456789 * M_PI;
int r, n;
if (argc > 1)
nums = atoi(argv[1]); /* sloppy, I know ... */
if (argc > 2)
runs = atoi(argv[2]);
if (argc > 3)
theta = atof(argv[3]);
if (argc > 4)
delta = atof(argv[4]);
printf ("nums = %d, runs = %d, theta = %g, delta = %g\n",
nums, runs, theta, delta);
for (r = 0; r < runs; ++r) {
clock_t t0, t1;
t0 = clock();
for (n = 0; n < nums; ++n) {
theta = sin(theta + delta);
theta = cos(theta + delta);
}
t1 = clock();
printf ("%.0f ms, final theta = %g\n",
(double)((t1 - t0) * 1000.0 / CLOCKS_PER_SEC), theta);
}
return 0;
}
 
K

Kenneth P. Turvey

... suggesting Java may be just a hair faster than C, but
that it's "in the noise." (Note, too, that the notions of
"time" used in the two programs are not exactly alike.) I'll
report on results from an AMD laptop later this weekend.

Using your tests on my computer with gcc, I get different results.

kt@lovelace:/tmp$ java TrigTime
nums = 1000000, runs = 10, theta = 0.0, delta = 3.8785094135818516
386 ms, final theta = -0.739085133899082
388 ms, final theta = -0.739085133899082
392 ms, final theta = -0.739085133899082
400 ms, final theta = -0.739085133899082
377 ms, final theta = -0.739085133899082
387 ms, final theta = -0.739085133899082
394 ms, final theta = -0.739085133899082
398 ms, final theta = -0.739085133899082
379 ms, final theta = -0.739085133899082
385 ms, final theta = -0.739085133899082
kt@lovelace:/tmp$ gcc -o test test.c -lm
kt@lovelace:/tmp$ ./test
nums = 1000000, runs = 10, theta = 0, delta = 3.87851
160 ms, final theta = -0.739085
160 ms, final theta = -0.739085
160 ms, final theta = -0.739085
160 ms, final theta = -0.739085
160 ms, final theta = -0.739085
150 ms, final theta = -0.739085
160 ms, final theta = -0.739085
160 ms, final theta = -0.739085
160 ms, final theta = -0.739085
150 ms, final theta = -0.739085

So Java here runs at less than half the speed of C. Oh, and BTW, the
optimization level doesn't matter for this code under gcc.

If I run these with more numbers to get more comparable numbers, I get:


kt@lovelace:/tmp$ ./test 40000000
nums = 40000000, runs = 10, theta = 0, delta = 3.87851
5940 ms, final theta = -0.739085
5920 ms, final theta = -0.739085
5940 ms, final theta = -0.739085
5950 ms, final theta = -0.739085
5910 ms, final theta = -0.739085
5950 ms, final theta = -0.739085
5940 ms, final theta = -0.739085
5920 ms, final theta = -0.739085
5910 ms, final theta = -0.739085
5920 ms, final theta = -0.739085
kt@lovelace:/tmp$ java TrigTime 40000000
nums = 40000000, runs = 10, theta = 0.0, delta = 3.8785094135818516
14650 ms, final theta = -0.739085133899082
14612 ms, final theta = -0.739085133899082
14756 ms, final theta = -0.739085133899082
14628 ms, final theta = -0.739085133899082
14648 ms, final theta = -0.739085133899082
14612 ms, final theta = -0.739085133899082
14623 ms, final theta = -0.739085133899082
14596 ms, final theta = -0.739085133899082
14625 ms, final theta = -0.739085133899082
14592 ms, final theta = -0.739085133899082


The fact that these test end up being so much higher than my tests with
the same number of sine and cosine calculations shows that your test is
probably timing a lot more stuff that we don't care about than mine was.
So, I think I like my test a bit better. ;-)

Now, what is different between our two systems? Operating system wouldn't
really matter that much, would it? Is your's a 64 bit or 32 bit system?

My system is a 1.5 GHz laptop with a T5250 processor and two cores. I'm
running a 32 bit version of Ubuntu Linux. The JVM used above identifies
itself as:

java version "1.6.0_05"
Java(TM) SE Runtime Environment (build 1.6.0_05-b13)
Java HotSpot(TM) Server VM (build 10.0-b19, mixed mode)
 
M

Mark Thornton

Kenneth said:
I don't think the spec is unreasonable, but I would like the option of
substituting in a faster version of the methods. I don't see a good way
to do so. Using native methods wouldn't help and there isn't a way to do
in line assembly in Java.

The simplest way would be to add another class SystemMath which
contained the same methods as Math and StrictMath, but used the usual
implementation for the platform. I think such a change is unlikely to
happen.
 
E

Eric Sosman

Kenneth said:
[...]
Now, what is different between our two systems? Operating system wouldn't
really matter that much, would it? Is your's a 64 bit or 32 bit system?

O/S matters a lot, because the libraries are completely
different: You've got a Linux distribution with ...?... math
library, and I've got the DJGPP environment running atop
Windows. (An elderly DJGPP, too, as I pointed out.)
 
K

Kenneth P. Turvey

The simplest way would be to add another class SystemMath which
contained the same methods as Math and StrictMath, but used the usual
implementation for the platform. I think such a change is unlikely to
happen.

I didn't mean for Sun to do it. I would just like to be able to drop in
my own replacement for the Math.sin and Math.cos methods. I've never used
the Java native interface, but I understood that this wouldn't be suitable
for this problem due to the overhead of passing arguments back and forth.
I could replace the two methods with my own wrappers around the C library
functions.

Would this work? Is there a way to do this quickly?
 
K

Kenneth P. Turvey

O/S matters a lot, because the libraries are completely
different: You've got a Linux distribution with ...?... math
library, and I've got the DJGPP environment running atop
Windows. (An elderly DJGPP, too, as I pointed out.)

The thing is that in this case the real problem is at the hardware level.
I suspect that our two compilers are producing essentially identical code
for the sin and cos, and that our libraries are dutifully translating
these calls to a single assembly language instruction in the Intel
floating point processor.

The thing is that our Java implementations are making different choices
about whether that strategy is accurate enough, with Java on my machine
the JVM is running under the assumption that the answer my floating
processor gives simply isn't accurate enough. Whereas on your machine a
choice is being made to use the floating point processor.

At least, that's what I guess is going on. So, is your machine's floating
point processor more accurate than mine? Is the implementation of Java on
my hardware making the wrong choice? These are the questions I have.

Ok, I found an example of the problem on the web in the initial report of
the problem as reported to Sun in Java. I implemented the first example
of the problem in C and, sure enough, my Intel box still displays the
problem. So my JVM is doing the right thing to work around the inaccuracy.

Try running this on your computer:

kt@lovelace:/tmp$ ./cos
Expected: 3ee4f8b588dd0ce3
Actual: 3ee4f8b588dd0ce1

Notice that the difference isn't in the last bit.

#include <math.h>
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char* argv[]) {
unsigned long long *ivalp, eval;
unsigned long long ival;
sscanf("0x3ff921f0d7e968a9", "%llx", &ival); // input
sscanf("0x3ee4f8b588dd0ce3", "%llx", &eval); // expected
ivalp = &ival;
double *val;
val = (double*) ivalp;
double nval = cos(*val);
unsigned long long *oval;
oval = (unsigned long long*) &nval;
printf("Expected: %llx\n", eval);
printf("Actual: %llx\n", *oval);
}

What result do you get?
 
K

Kenneth P. Turvey

O/S matters a lot, because the libraries are completely
different: You've got a Linux distribution with ...?... math
library, and I've got the DJGPP environment running atop
Windows. (An elderly DJGPP, too, as I pointed out.)
[Snip]

Ok, I found an example of the problem on the web in the initial report of
the problem as reported to Sun in Java. I implemented the first example
of the problem in C and, sure enough, my Intel box still displays the
problem. So my JVM is doing the right thing to work around the inaccuracy.
[Snip]

As a side note, here's the Java code from the original bug report. You
can be sure that this code will execute correctly on your computer. I'm
sure it is in the automated test suite at Sun. If, however, the C code I
provided returns the wrong answer, I would like to see this run, just to
make sure.

public class Test {
public static void main(String [] args) throws Exception {

//this is the output from JCK test:
//cos failed for 3ff921f0d7e968a9 Expected 3ee4f8b588dd0ce3 +/- 1 ulp Got: 3ee4f8b588dd0ce1
//cos failed for 3ff921fb54442d1b Expected bcc5cb3b399d747f +/- 1 ulp Got: bcc5cb4000000000
//sin failed for c01921fb54442d18 Expected 3cb1a62633145c07 +/- 1 ulp Got: 3cb1a60000000000

System.out.println("x1 = " + Double.longBitsToDouble(0x3ff921f0d7e968a9L));
System.out.println("x2 = " + Double.longBitsToDouble(0x3ff921fb54442d1bL));
System.out.println("cos(x1) = " + Math.cos(Double.longBitsToDouble(0x3ff921f0d7e968a9L)));
System.out.println("cos(x2) = " + Math.cos(Double.longBitsToDouble(0x3ff921fb54442d1bL)));
System.out.println("cos(x1) = " + StrictMath.cos(Double.longBitsToDouble(0x3ff921f0d7e968a9L)));
System.out.println("cos(x2) = " + StrictMath.cos(Double.longBitsToDouble(0x3ff921fb54442d1bL)));
System.out.println("x3 = " + Double.longBitsToDouble(0xc01921fb54442d18L));
System.out.println("sin(x3) = " + Math.sin(Double.longBitsToDouble(0xc01921fb54442d18L)));
System.out.println("sin(x3) = " + StrictMath.sin(Double.longBitsToDouble(0xc01921fb54442d18L)));

}
}
 
E

Eric Sosman

Kenneth said:
The thing is that in this case the real problem is at the hardware level.
I suspect that our two compilers are producing essentially identical code
for the sin and cos, and that our libraries are dutifully translating
these calls to a single assembly language instruction in the Intel
floating point processor.

I'll take your word for it.
The thing is that our Java implementations are making different choices
about whether that strategy is accurate enough, with Java on my machine
the JVM is running under the assumption that the answer my floating
processor gives simply isn't accurate enough. Whereas on your machine a
choice is being made to use the floating point processor.

I'll take your word for it.
At least, that's what I guess is going on. So, is your machine's floating
point processor more accurate than mine? Is the implementation of Java on
my hardware making the wrong choice? These are the questions I have.

Ok, I found an example of the problem on the web in the initial report of
the problem as reported to Sun in Java. I implemented the first example
of the problem in C and, sure enough, my Intel box still displays the
problem. So my JVM is doing the right thing to work around the inaccuracy.

Try running this on your computer:
[...]
What result do you get?

Expected: 3ee4f8b588dd0ce3
Actual: 3ee4f8b588dd0ce2

.... that is, off by one ULP.
 
K

Kenneth P. Turvey

Expected: 3ee4f8b588dd0ce3
Actual: 3ee4f8b588dd0ce2

... that is, off by one ULP.

So Intel has changed the way they do things. Are you running a 64 bit OS
or 32 bit? What kind of processor do you have?

This is really interesting.

It means that Java will be competitive with C when I buy my next
computer. Cool!
 
E

Eric Sosman

Kenneth said:
So Intel has changed the way they do things.

It's possible, but I don't think you have enough evidence
to support the conclusion.
Are you running a 64 bit OS
or 32 bit? What kind of processor do you have?

From an earlier message in this thread:
This is really interesting.

It means that Java will be competitive with C when I buy my next
computer. Cool!

Now I *definitely* think you're jumping to conclusions.
But, hey, if it floats your boat ...
 
K

Kenneth P. Turvey

It's possible, but I don't think you have enough evidence
to support the conclusion.

Your floating point processor is returning a different answer for the
same set of inputs to the cosine function. I think that's a lot of
evidence. In addition, your system is not showing a performance
degradation when moving from C to Java on either of our benchmarks. I
think that's a lot of evidence as well.
 
E

Eric Sosman

Kenneth said:
Your floating point processor is returning a different answer for the
same set of inputs to the cosine function. I think that's a lot of
evidence. In addition, your system is not showing a performance
degradation when moving from C to Java on either of our benchmarks. I
think that's a lot of evidence as well.

Well, if it floats your boat. Have a nice life!
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,792
Messages
2,569,639
Members
45,353
Latest member
RogerDoger

Latest Threads

Top