Bignums, object grind, and garbage collection

John Ersatznom

I'm working on a Java project in which I'm wrangling with three
intertwined issues: bignum arithmetic, a need to optimize memory use/gc
behavior, and a need for speed. As such, I've got the following apparent
options, which each have caveats as well as good points:

1. Use BigInteger and BigDecimal
Pros:
* Already done for me.
* Clean and safe.
* Using code will be easy to understand by others.
* 100% Pure Java; platform independent.
Cons:
* Since they're immutable, lengthy calculations will
generate large numbers of "new BigFoo" and heaps of
discarded objects for the GC to process.
* I'm not sure the implementation is as fast as
possible. In particular, it doesn't seem to be
documented anywhere whether these classes switch to
Karatsuba multiplication and then to FFTs at
appropriate bit lengths or just stick with slow, O(n^2)
long multiplication throughout.
2. Use something like the GNU multiple-precision library (GMP) via
JNI; there are libraries out there that use sensible multiplication
implementations at large bit sizes.
Pros:
* Already done for me save creating the wrapper classes
* I can probably make my bignums mutable, so instances
can be assigned to and otherwise recycled reducing
newing and GCing.
* Fast multiplies.
Cons:
* Non-immutable objects will make the using code harder
to understand and more error-prone.
* Not using BigDecimal and BigInteger will make the using
code less clear.
* Going through JNI is slow, so some inner loops would
also have to be implemented in C and called down through
JNI.
* Not 100% Pure Java; will require compiling some C and
some Java to build on any given box. The C can at least
be highly portable, since all the UI/graphics/whatever
would be done in Java.
3. Roll my own mutable bignum classes, presumably in Java.
Pros:
* I can definitely make my bignums mutable, so instances
can be assigned to and otherwise recycled reducing
newing and GCing.
* I can use fast implementations for various bit sizes.
* I can avoid JNI and have 100% Pure Java, platform
independent.
Cons:
* Non-immutable objects will make the using code harder
to understand and more error-prone.
* Not using BigDecimal and BigInteger will make the using
code less clear.
* A lot of work to do myself, some of it doubtless wheel-
reinventing.
* May not be as fast as native code.
* May not be as fast as BigDecimal and BigInteger for
small bit lengths until the effects of object churn are
felt in the form of lots of GC CPU use and a process
size in the nine-figures.
* Implementing operations will be tricky with no easy way to
trap overflows. For efficient adds, there's a fairly
obvious method using arrays of longs with 63 bits of bignum
per long, using the sum becoming negative as a carry flag
(see the sketch after this list). OTOH, multiplies look like
they'd prefer using ints with 31 bits per int and intermediate
longs, for basic long multiplication or Karatsuba. (FFT would
require a wholly different implementation and I'd cross that
bridge when I came to it; my early priority would be getting a
working bignum using just long multiplication.)
4. Find some library or code that resulted from someone else doing 3.
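
As a rough illustration of the 63-bits-per-long addition trick mentioned
under option 3 (a minimal sketch only: the class and method names are
made up, and sign handling and limb-array growth are omitted), the carry
can be read straight off the sign bit:

    // Minimal sketch of the 63-bits-per-long add from option 3.
    // Limbs are little-endian; each long holds 63 significant bits, so
    // the sign bit of an intermediate sum doubles as the carry flag.
    final class MutableBignum {
        static final long MASK = 0x7FFFFFFFFFFFFFFFL; // low 63 bits
        long[] limbs; // little-endian, 63 bits of magnitude per entry

        // this += other; assumes this.limbs is at least as long as
        // other.limbs and that no carry escapes the top limb.
        void addInPlace(MutableBignum other) {
            long carry = 0;
            for (int i = 0; i < limbs.length; i++) {
                long o = (i < other.limbs.length) ? other.limbs[i] : 0;
                long sum = limbs[i] + o + carry; // at most 2^64 - 1
                carry = sum >>> 63;              // sign bit = carry out
                limbs[i] = sum & MASK;           // keep low 63 bits
            }
        }
    }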

The tradeoffs among the above seem to depend at least partially on the
vagaries of the VM, and in particular of the GC. It's not hard to
imagine a VM implementation smart enough to recycle objects of a
frequently used class under the hood. It still seems unlikely that it
can be as efficient as explicitly freeing or assigning to the bignums
under some circumstances.

My determinations so far are:
* Using JNI for anything is a last resort.
* The choice is really down to mutable bignums vs. BigInteger
and BigDecimal.

If the bignums need to be mutable to get efficiently running code that
doesn't bloat up the process to hundreds of megs and spend more time in
GC than in my own code, then option 1 is out. On the other hand, if
getting decent multiplies at large bit sizes isn't going to happen with
BigFoo, then option 1 is out and I may as well make the darn things mutable.

Since I'm not aware of any pre-existing code suitable for 4, and JNI is
a last resort, it's basically down to 1 vs. 3.

What I'd like is:
a) People to weigh in on bignums and garbage collection:
* Would code that generates billions of intermediate
values during calculations but keeps none of them
for very long be seriously bogged down and the
memory requirements bloated by using the immutable
bignums?
* Are the multiplies in the standard BigFoos inefficient
when the values are hundreds or even thousands of bits
long?
b) In case either answer above is "yes" and I need to make my own:
* How best to go about implementing the basic, non-FFT bignum?
Use ints or longs? Tips on implementing multiplication
(and subtraction!)?
* What are sensible thresholds for switching implementations,
in terms of bit lengths, to keep multiplies fast?
* How best to implement a BigDecimal-alike given a
BigInteger-alike?

Thank you.
 
John W. Kennedy

John said:
* I'm not sure the implementation is as fast as
possible. In particular, it doesn't seem to be
documented anywhere whether these classes switch to
Karatsuba multiplication and then to FFTs at
appropriate bit lengths or just stick with slow, O(n^2)
long multiplication throughout.

For whatever it's worth, Ruby finds the first 14 perfect numbers a /lot/
faster than Java does. (Perl is a lot slower. GCLISP is fastest.)
 
Chris Smith

John Ersatznom said:
What I'd like is:
a) People to weigh in on bignums and garbage collection:
* Would code that generates billions of intermediate
values during calculations but keeps none of them
for very long be seriously bogged down and the
memory requirements bloated by using the immutable
bignums?

Because your application will be different from all others, this is hard
to answer in general. What I can do is say that there is good reason to
entertain the possibility that this may not actually be a problem at
all. Implementations of garbage collectors for Java are tuned to be as
efficient as possible with relatively small, short-lived objects; and it
is even theoretically possible that it will be faster than a non-garbage-
collected implementation in this case.

The performance figures here will depend critically on the environmental
factors: how much memory is available as GC slack, the sizes of the
nursery and first intermediate generations relative to the rate of
allocation, and so forth. The best thing to do would be to implement a
non-trivial part of the calculation -- something reasonably
representative of the kinds of calculation that you expect to do in the
final product -- and test it to determine whether the performance is
acceptable. For helpful information, try reading the advice at
http://java.sun.com/docs/hotspot/gc5.0/gc_tuning_5.html. The info on
command-line options for tuning allocation segment sizes should be
especially worthwhile.
* Are the multiplies in the standard BigFoos inefficient
when the values are hundreds or even thousands of bits
long?

The source code for BigInteger and BigDecimal is available in the
src.zip file that comes with every JDK installation. Briefly scanning
the BigInteger implementation, it looks like it's quadratic in the
lengths of the values, which means you're out of luck here.

I'll leave your (b) questions for someone who knows more than I.
 
John Ersatznom

John said:
For whatever it's worth, Ruby finds the first 14 perfect numbers a /lot/
faster than Java does. (Perl is a lot slower. GCLISP is fastest.)

Is Ruby a broadly-ported system with free* development tools, a similar
range of standard classes and APIs including decent GUI and other
functionality, with a decent free* IDE available, and one that's somehow
on-topic for comp.lang.java.programmer? If so, I may look into it.

And I very much doubt any flavor of LISP is faster than C or native
assembly for this kind of thing, absent some kind of contrived
circumstance or using purpose-built LISP-oriented hardware.

* As in speech but just as importantly as in beer
 
John Ersatznom

Chris said:
Because your application will be different from all others, this is hard
to answer in general. What I can do is say that there is good reason to
entertain the possibility that this may not actually be a problem at
all. Implementations of garbage collectors for Java are tuned to be as
efficient as possible with relatively small, short-lived objects; and it
is even theoretically possible that it will be faster than a non-garbage-
collected implementation in this case.

I don't see how it can be faster, at least for the case of using
new...gc versus a mutable object for a temporary, rather than something
long-lived with an indefinite lifecycle.
The info on command-line options for tuning allocation segment sizes
should be especially worthwhile.

Not something I want the users to have to muck about with if I can help
it. For them it should be plug-and-play if at all possible. And without
making global settings changes that will muck up every *other* Java app
or applet they use, either.
The source code for BigInteger and BigDecimal is available in the
src.zip file that comes with every JDK installation. Briefly scanning
the BigInteger implementation, it looks like it's quadratic in the
lengths of the values, which means you're out of luck here.

And BigDecimal? Or is it implemented over BigInteger?
 
John W. Kennedy

John said:
Is Ruby a broadly-ported system with free* development tools, a similar
range of standard classes and APIs including decent GUI and other
functionality, with a decent free* IDE available, and one that's somehow
on-topic for comp.lang.java.programmer? If so, I may look into it.

It's got most of that, in fact. But my point is that, even though Ruby
is interpreted -- not even bytecoded -- its big-number performance is
about five times faster than Java's, HotSpot and all, which rather
implies that Java's big numbers could be a lot sleeker than they are.
And I very much doubt any flavor of LISP is faster than C or native
assembly for this kind of thing, absent some kind of contrived
circumstance or using purpose-built LISP-oriented hardware.

C and Assembler don't have intrinsic support of big numbers.
 
John Ersatznom

John said:
It's got most of that, in fact. But my point is that, even though Ruby
is interpreted -- not even bytecoded -- its big-number performance is
about five times faster than Java's, HotSpot and all, which rather
implies that Java's big numbers could be a lot sleeker than they are.

Sounds like I need to go with my own bignums then.

Any suggestions on implementing the basic
array-of-primitive-integer-type case are welcome, as well as any
pointers to freely reusable libraries for the purpose. (They must be
freely distributable with the calling code. LGPL preferred, and failing
that some kind of open source preferred, but anything that can be called
from GPL and LGPL code and binary distributed with same would do.
Requiring a nonstandard native library is a big minus.)
 
nebulous99

John said:
Any suggestions on implementing the basic
array-of-primitive-integer-type case are welcome, as well as any
pointers to freely reusable libraries for the purpose. (They must be
freely distributable with the calling code. LGPL preferred, and failing
that some kind of open source preferred, but anything that can be called
from GPL and LGPL code and binary distributed with same would do.
Requiring a nonstandard native library is a big minus.)

Here are two, but they might have some downsides.

Apfloat: http://www.apfloat.org/apfloat_java/
Claims to be faster than BigInteger etc. and it's LGPL. But these seem
to be immutable objects. If you're running some kind of science
simulation that generates and discards billions of distinct values in
the space of a few hours there'll be a lot of gc overhead and memory
wastage, I expect (in another post you implied these rates of object
generation would occur in your application). Apfloat also seems to use
the disk for temporary files, for some reason.

JScience: http://en.wikipedia.org/wiki/JScience
Also claims to be faster, and to have some kind of recycling system
under the hood to cut down on GC use, but it is also huge and complex.
Studying it to learn how to effectively use the bits you need for AP
math would probably take a while. It's also open source (BSD license).
See also http://en.wikipedia.org/wiki/Javolution which mentions the
object recycling, and the home pages linked from the wikipedia
articles. JScience is apparently built on it. Javolution is also
BSD-licensed.

Both libraries appear to be 1.5-aware (generics syntax is evident in
their javadocs, e.g. Foo<Bar>) and may require 1.5 (I'm pretty sure
Apfloat does).

If I were you, I'd go with JScience or roll-your-own, with Apfloat a
dark-horse candidate if you have a magic VM that can cope with millions
of allocations a second of small to medium-size objects and arrays
without turning into molasses or selling OutOfMemoryErrors like
hotcakes, or a multi-GB-RAM workstation with 64-bit everything out the
wazoo. If you need speed, the built-in BigFoos are, of course, out of
the question.
 
John Ersatznom

nebulous99 said:
JScience: http://en.wikipedia.org/wiki/JScience
Also claims to be faster, and to have some kind of recycling system
under the hood to cut down on GC use, but it is also huge and complex.
Studying it to learn how to effectively use the bits you need for AP
math would probably take a while. It's also open source (BSD license).
See also http://en.wikipedia.org/wiki/Javolution which mentions the
object recycling, and the home pages linked from the wikipedia
articles. JScience is apparently built on it. Javolution is also
BSD-licensed.

This may be just what the doctor ordered; thank you.

I especially like these bits:

"Operations on instances of this class are quite fast as information
substantially below the precision level (aka noise) is not
processed/stored. There is no limit on a real precision but precision
degenerates (due to numeric errors) and calculations accelerate as more
and more operations are performed." (Real class API.)

"Transparent Object Recycling For example, our benchmark indicates that
adding immutable LargeInteger is up to 8-10x faster than adding
java.math.BigInteger" (Main page. And with semantically-immutable objects?!)

I can use these under the hood, and maybe even use some of the realtime
stuff throughout the project, which may eventually want or need
distributed capabilities and suchlike. XML externalization as an
alternative to Serializable is also attractive on a number of levels.
 
Chris Smith

John Ersatznom said:
I don't see how it can be faster, at least for the case of using
new...gc versus a mutable object for a temporary, rather than something
long-lived with an indefinite lifecycle.

Loosely speaking, it's because the time required for garbage collection
can be made proportional to the size of the objects that survive the
collection (plus the size of the root set), whereas the cost of manual
memory management is always proportional to the number (and perhaps also
the size) of the objects that are freed. If you're using a lot of
temporary objects, the number of objects that don't survive in the
newest generations can be far higher than the number that do.
Not something I want the users to have to muck about with if I can help
it. For them it should be plug-and-play if at all possible. And without
making global settings changes that will muck up every *other* Java app
or applet they use, either.

Presumably, you'll have some kind of script (or Windows .lnk file, or
whatever) that runs your application. That's where you'd put the
command line options. They wouldn't affect anything that's not run with
that script.
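
For example (all of the values below are only illustrative and would
need tuning against your real workload, and the jar name is made up), a
launcher script could look something like:

    #!/bin/sh
    # Illustrative GC/heap settings -- tune for the actual application.
    # They apply only to this launch; other Java programs on the
    # machine are unaffected.
    java -Xms256m -Xmx512m \
         -XX:NewSize=128m -XX:MaxNewSize=128m \
         -XX:SurvivorRatio=8 \
         -XX:+UseParallelGC \
         -jar myapp.jar "$@"
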
And BigDecimal? Or is it implemented over BigInteger?

I don't know. The point was that you can check src.zip and read the
source code.
 
Lew

John said:
"Transparent Object Recycling For example, our benchmark indicates that
adding immutable LargeInteger is up to 8-10x faster than adding
java.math.BigInteger" (Main page. And with semantically-immutable
objects?!)

Be cautious making assumptions about how fast or slow GC makes things.

The GC is indeed fast for multitudes of small objects with short
lifespans. They are quickly discarded and quickly reclaimed.

It is also tunable, and different algorithms are invokable.

Object creation time may be a tad difficult, but I have read that Java's
creation times are faster than most people believe. I have never heard that
Sun's JVMs re-use objects in some sort of creation pool, but that doesn't mean
that they don't.

The Java lib versions may use O(n^2) algorithms, but a hand-rolled BigFoo with
immutable fields and good algorithms may be faster than you suspect.

Do not overlook the possible influence of JIT compilation to optimize away
apparent inefficiencies in your source, especially under load over time.

Anything involving billions of anything will be slow.

You are probably stuck with having to implement prototypes of more than one
approach, and actually profile and measure them. You could start with the LGPL
libraries nebulous pointed out, compared with Java's standard implementations.

You clearly know your algorithms. The right algorithms will have so much more
effect than putative difficulties with GC or object creation times.

You can also throw hardware at the problem, like multiway / multithreaded
multiprocessors.

- Lew
 
John Ersatznom

Lew said:
The GC is indeed fast for multitudes of small objects with short
lifespans. They are quickly discarded and quickly reclaimed.

It is also tunable, and different algorithms are invokable.

I can't expect the end-users to do lots of tuning as part of the
installation, I'm afraid. Unless there's a way to include such tuning in
the distribution, and have the tuning work under different use patterns
by different users, and have it not impact anything else on the
deployment system, including other Java stuff that may live there.
Object creation time may be a tad difficult, but I have read that Java's
creation times are faster than most people believe. I have never heard
that Sun's JVMs re-use objects in some sort of creation pool, but that
doesn't mean that they don't.

It still seems to me that given these two choices, the former must
invariably be faster:

x.add(y) changing the fields in x, whose old value is no longer needed
and where the add can be implemented using either no temporaries, or
reused temporaries.

Foo z = x.add(y), where the fields in z have to be set (the same work
as above), but also:
* a new Foo has to be allocated
* the disused x has to be garbage collected later
(Here the add implementation ends with something like "return new
Foo(args)".)

However little amortized overhead the additional allocation of z and the
collecting of x takes, it isn't zero and it adds up.

The memory usage is clearly also higher.
Do not overlook the possible influence of JIT compilation to optimize
away apparent inefficiencies in your source, especially under load over
time.

Only if you can convince me that the HotSpot server VM JIT is smart
enough to change Foo z = x.add(y) under the hood into effectively
x.add(y), where x is simply changed in content and renamed as z. If x is
a local variable and the VM can *prove* that x holds the sole reference
to its object, and that reference never leaves the method (x being a
temporary in a calculation), then that might be possible. Of course,
that depends on several things: the object x refers to probably has to
be allocated in the same method, rather than coming in as a parameter;
the class must provably not be keeping an internal reference, such as
for a cache...
Anything involving billions of anything will be slow.

True. And it will be more sensitive to inefficiencies. A single
calculation taking a fraction of a second can be 30% slower and no-one
cares, so long as it's isolated and the throughput bottleneck remains
user interaction or I/O. A simulation running for hours taking 130 hours
instead of 100 is taking more than an extra *day*, which is definitely
noticeable! Also, anything involving billions of anything is potentially
going to crash with OOME. (Most VMs, including Sun's, also get unstable
when memory gets scarce; in my experience abnormal termination of the VM
is frequent when memory is tight, especially if the app has already
recovered from an OOME, and quite rare otherwise, at least for Sun VMs.)

If we have a sequential calculation, such as a simulation with
successive states, ideally we only need a process size around twice the
size of the state plus a little -- to hold the existing state, the next
state being built with reference to the existing state, and some
temporaries. At the end of the cycle, the references to the next state
and the existing state can simply be swapped, with the new state
overwriting the two-iterations-ago state as it gets built in turn and
the temporaries overwritten as well.
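
Schematically (a minimal sketch: the class name, state representation
and update rule are just placeholders), the swap looks like this:

    // Sketch of the two-buffer state swap described above. Both arrays
    // are allocated once; each iteration overwrites the older buffer.
    final class TwoStateLoop {
        private static final int STATE_SIZE = 1000000; // placeholder
        private double[] current = new double[STATE_SIZE];
        private double[] next = new double[STATE_SIZE];

        void run(int steps) {
            for (int s = 0; s < steps; s++) {
                computeNext(current, next); // build next from current
                double[] tmp = current;     // swap the references:
                current = next;             // no new allocations, the
                next = tmp;                 // old state is overwritten
            }                               // on the next iteration
        }

        // Placeholder update rule standing in for the real simulation step.
        private void computeNext(double[] from, double[] to) {
            for (int i = 0; i < to.length; i++) {
                to[i] = 0.5 * (from[i] + from[(i + 1) % from.length]);
            }
        }
    }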

Use immutable objects and lots of "new" each cycle, however, and the
picture changes. The size grows with every iteration. Then if you're
lucky the gc begins to use a significant fraction of the CPU and the
size stabilizes, with old state data being deallocated at the same rate
as new state data is allocated. The process is somewhat larger and
(maybe only slightly) slower than the picture painted above, but it's
stable and it's pretty internally rigorous, as there's no danger
something is overwritten whose old value is still being used, creating
errors that creep into the calculations quietly.

If you're unlucky, the gc can't keep up with the sheer rate of object
allocation demand, and either the gc dominates the CPU time (the
simulation is forced to slow down until it only requests new objects at
the maximum rate the VM's gc/allocation code can supply them), or the gc
can't cope and an OOME gets thrown. Then the simulation crashes, or it
tries to save or recover, only to risk the VM crashing.
You are probably stuck with having to implement prototypes of more than
one approach, and actually profile and measure them. You could start
with the LGPL libraries nebulous pointed out, compared with Java's
standard implementations.

Actually they were BSD licensed, but that's even more liberal from the
look of things.
You clearly know your algorithms. The right algorithms will have so much
more effect than putative difficulties with GC or object creation times.

The right algorithms combined with the sleekest memory management will
be better than either by itself. :)
You can also throw hardware at the problem, like multiway /
multithreaded multiprocessors.

The right algorithms, the sleekest memory management, and the fastest
hardware will be better than any two by themselves. ;)

I know it's not crucial to heavily optimize quick, event-driven bits of
code; only stuff that, in total, takes a user-perceptible amount of
time. But the runs may take hours between the user initiating one and
getting a result. A 3-ms routine taking one millisecond longer to
respond to a keystroke is not noticeable. A 3-hour job slowing to take 4
hours is.
 
Lew

John said:
> ... several salient points ...

You make good points. I am curious how well the JVM would keep up with
immutable classes in an app like yours, and how much difference it really
makes to use mutable instances.

Since the Sun classes evidently use a slow algorithm you are thrown into the
world of third parties or roll-yer-own anyway.

- Lew
 
John Ersatznom

Lew said:
You make good points. I am curious how well the JVM would keep up with
immutable classes in an app like yours, and how much difference it
really makes to use mutable instances.

Since the Sun classes evidently use a slow algorithm you are thrown into
the world of third parties or roll-yer-own anyway.

I'm currently leaning toward rolling my own wrapper classes, but using
the JScience LargeInteger under the hood with a PoolContext. I had a
look at the LargeInteger source code, and it uses normal multiplies up
to a point and then Karatsuba multiplies, which looks efficient. OTOH,
the Real class in JScience does pretty much every operation twice to
maintain a scientifically-controlled error bound. For some numerical
stuff this would be important, but for some of my plans I need speed and
behavior similar to a normal float or double that happens to have a huge
mantissa, which means wrapping LargeInteger with a power-of-2 exponent
and doing everything only once. :)

Eventually I want an FFT implementation to throw at floating-point
calculations reaching the several thousands of bits, though, as well,
but I'll cross that one when I come to it.

One thing I'll probably be doing is dropping the LSBs off the more
accurate operand in adds and subtracts. Their effect disappears in the
noise anyway.
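
To make the wrapper idea concrete, here's a minimal sketch (the class
and method names are made up, and it uses java.math.BigInteger for the
mantissa purely to keep the sketch self-contained; the real version
would wrap LargeInteger as described above):

    import java.math.BigInteger;

    // Sketch of a float-alike with a huge mantissa and a power-of-2
    // exponent: value = mantissa * 2^exponent.
    final class BigFloat {
        final BigInteger mantissa;
        final int exponent;

        BigFloat(BigInteger mantissa, int exponent) {
            this.mantissa = mantissa;
            this.exponent = exponent;
        }

        // Multiply: mantissas multiply, exponents add; each operation
        // is done exactly once, with no error-bound bookkeeping.
        BigFloat times(BigFloat other) {
            return new BigFloat(mantissa.multiply(other.mantissa),
                                exponent + other.exponent);
        }

        // Add: align to the smaller exponent by shifting the mantissa
        // with the larger exponent left. (The LSB-dropping mentioned
        // above would shift the more precise operand right instead;
        // omitted here for brevity.)
        BigFloat plus(BigFloat other) {
            if (exponent < other.exponent) {
                return other.plus(this);
            }
            BigInteger aligned = mantissa.shiftLeft(exponent - other.exponent);
            return new BigFloat(aligned.add(other.mantissa), other.exponent);
        }
    }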
 
Lew

John said:
I'm currently leaning toward rolling my own wrapper classes, but using
the JScience LargeInteger under the hood with a PoolContext.

Is a PoolContext an object pool for the LargeInteger objects? If so, it might
have poorer performance than small immutable objects, since long-lived objects
are subject to more expensive GC than short-lived ones.

Or not.

I haven't tried such a thing as yours, and everything I've read about GC makes
it sound unpredictable and mysterious.

Only test runs of your classes with immutable vs. pooled objects will tell us
for sure. With different GC strategies. I wonder if anyone has run comparable
tests before.

- Lew
 
John Ersatznom

Lew said:
Is a PoolContext an object pool for the LargeInteger objects? If so, it
might have poorer performance than small immutable objects, since
long-lived objects are subject to more expensive GC than short-lived ones.

This assumes that a) LargeIntegers are, despite their name, small :)
and b) the pooled objects get gc'd so much more expensively that it
outweighs their getting gc'd much less frequently. (One LargeInteger
that gets 10 uses incurs 1 more-expensive gc per 10 uses instead of 10
less-expensive gcs per 10 uses, so the gc has to be at least N times
more expensive for an object that lives N times as long as usual.)
Only test runs of your classes with immutable vs. pooled objects will
tell us for sure. With different GC strategies. I wonder if anyone has
run comparable tests before.

I'm going to, before too much longer, although there's a somewhat
confounded benchmark at http://jscience.org/doc/benchmark.html --
confounded because it doesn't separate the effects of PoolContexting
from the effects of using LargeInteger in place of BigInteger. Separate
comparisons of LargeInteger (PoolContext) to LargeInteger (no context)
and LargeInteger (no context) to BigInteger would be more informative.
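
A rough skeleton of that separated comparison might look like the
following (naive timing with no JIT warm-up, so the numbers would only
be indicative; the LargeInteger method names valueOf/times/plus are as I
read them in the JScience sources, and the PoolContext run is left out
since its exact API needs checking against the Javolution docs):

    import java.math.BigInteger;
    import org.jscience.mathematics.number.LargeInteger;

    // Naive add benchmark: BigInteger vs. LargeInteger, no context.
    public class AddBench {
        static final int ITERATIONS = 1000000;

        public static void main(String[] args) {
            // Build roughly 1000-bit operands by repeated squaring.
            BigInteger big = BigInteger.valueOf(Long.MAX_VALUE);
            LargeInteger large = LargeInteger.valueOf(Long.MAX_VALUE);
            for (int i = 0; i < 4; i++) {
                big = big.multiply(big);
                large = large.times(large);
            }

            long t0 = System.nanoTime();
            BigInteger bSum = BigInteger.ZERO;
            for (int i = 0; i < ITERATIONS; i++) {
                bSum = bSum.add(big);
            }
            long t1 = System.nanoTime();

            LargeInteger lSum = LargeInteger.valueOf(0);
            for (int i = 0; i < ITERATIONS; i++) {
                lSum = lSum.plus(large);
            }
            long t2 = System.nanoTime();

            System.out.println("BigInteger:   " + (t1 - t0) / 1e6 + " ms");
            System.out.println("LargeInteger: " + (t2 - t1) / 1e6 + " ms");
        }
    }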
 
