Tony Morris
Roedy Green said:
Consider a program where you read a file into RAM in one great String,
process it, create a new version, then if it is different, write it
back in one fell swoop. Repeat for thousands of files.
From GC's point of view, you are creating two whacking huge byte
arrays for the I/O, a huge StringBuilder and two huge String objects,
then discarding them for every file.
Ideally there should be some sort of mini GC where just these objects
are reclaimed, leaving the rest alone.
A generational collector obviously helps. Is there anything simple I
can do to boost this along?
Hello Roedy,
If you are making small modifications to java.lang.String instances before
performing some operation on them, you might consider not creating large
copies of them at all. For example, to change one character of a String you
must copy the data and apply your modification to the copy, because Strings
are immutable. If what you really want is for that String to appear, through
its contract (its methods), to have that character changed, then you run into
an unfortunate limitation of both java.lang.String and arrays: they are
restricted to an in-memory representation of their backing data while
exposing too much contract, and there is no way to work around that
(java.lang.CharSequence is a frivolous attempt to lift this restriction).
It would be better to have that data arbitrarily represented behind a type
that provides an appropriate abstraction, together with some method that
makes the abstraction appear as if a character were changed, without
requiring that a data copy occur. In fact, it would be better if the copy
didn't occur by default and the client were allowed to decide when to copy
for performance reasons.
To illustrate, consider the following, more appropriate abstraction of an
array:
interface Array<E>
{
    int length();
    void set(E e, int index) throws ArrayIndexOutOfBoundsException;
    E get(int index) throws ArrayIndexOutOfBoundsException;
}
An implementation of this interface is not restricted to an in-memory
representation, but suppose, for simplicity, that it were.
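As a rough sketch of such an in-memory implementation (the names
InMemoryArray and InMemoryArrayDemo are made up for illustration and are not
part of any library), one could simply wrap a plain Java array:

```java
// The interface from this post, repeated so the sketch compiles on its own.
interface Array<E> {
    int length();
    void set(E e, int index) throws ArrayIndexOutOfBoundsException;
    E get(int index) throws ArrayIndexOutOfBoundsException;
}

// Hypothetical in-memory implementation backed by a plain Java array.
final class InMemoryArray<E> implements Array<E> {
    private final E[] data;

    @SafeVarargs
    InMemoryArray(E... elements) {
        // Copy once at construction so callers cannot alias the storage.
        this.data = elements.clone();
    }

    public int length() {
        return data.length;
    }

    public void set(E e, int index) {
        // A bad index raises ArrayIndexOutOfBoundsException from the JVM.
        data[index] = e;
    }

    public E get(int index) {
        return data[index];
    }
}

class InMemoryArrayDemo {
    public static void main(String[] args) {
        Array<String> a = new InMemoryArray<String>("x", "y", "z");
        a.set("q", 1);
        System.out.println(a.length() + " " + a.get(1));
    }
}
```

Clients only ever see Array<E>, so the backing store could later be swapped
for something that is not held in memory at all.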
Now if I wanted a version of the array with the element at the nth index
changed, while leaving the original untouched, a typical array would force a
copy with something like System.arraycopy.
You don't have that restriction here: because the abstraction is more
appropriate, no copy is forced on you.
You could write the following interface:
interface ReplaceNthElementOfArray
{
    <E> Array<E> replace(int n, E e, Array<E> a);
}
Implementations are not forced to perform a data copy in order to fulfill
the contract, since this contract carries much less excess than that of a
typical array.
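One way an implementation might avoid the copy is to return a thin view that
overrides a single index and delegates everything else to the original. The
following is only a sketch under that assumption; OverlayReplace,
ArrayBacked and OverlayDemo are hypothetical names chosen for illustration,
not ContractualJ classes:

```java
// Interfaces from this post, repeated so the sketch compiles on its own.
interface Array<E> {
    int length();
    void set(E e, int index) throws ArrayIndexOutOfBoundsException;
    E get(int index) throws ArrayIndexOutOfBoundsException;
}

interface ReplaceNthElementOfArray {
    <E> Array<E> replace(int n, E e, Array<E> a);
}

// Hypothetical implementation: a view over the original, not a copy of it.
final class OverlayReplace implements ReplaceNthElementOfArray {
    public <E> Array<E> replace(final int n, final E e, final Array<E> a) {
        if (n < 0 || n >= a.length()) {
            throw new ArrayIndexOutOfBoundsException(n);
        }
        return new Array<E>() {
            private E override = e; // the only element stored by the view

            public int length() {
                return a.length();
            }

            public E get(int index) {
                return index == n ? override : a.get(index);
            }

            public void set(E value, int index) {
                // Index n stays local to the view; other writes pass through.
                if (index == n) {
                    override = value;
                } else {
                    a.set(value, index);
                }
            }
        };
    }
}

// Hypothetical in-memory Array, present only so the demo has something to wrap.
final class ArrayBacked<E> implements Array<E> {
    private final E[] data;

    @SafeVarargs
    ArrayBacked(E... elements) {
        this.data = elements.clone();
    }

    public int length() { return data.length; }
    public void set(E e, int index) { data[index] = e; }
    public E get(int index) { return data[index]; }
}

class OverlayDemo {
    public static void main(String[] args) {
        Array<Character> base = new ArrayBacked<Character>('c', 'a', 't');
        Array<Character> view = new OverlayReplace().replace(0, 'b', base);
        // The view reads "bat" while the original still reads "cat".
        System.out.println("" + view.get(0) + view.get(1) + view.get(2));
        System.out.println("" + base.get(0) + base.get(1) + base.get(2));
    }
}
```

The replace call costs constant time and space however large the underlying
array is. Letting writes to the other indices pass through to the original is
just one possible design choice; a client that needs an independent snapshot
can still copy explicitly, when it decides the cost is worth it.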
<plug shame="high">
I am not sure if all of this is even relevant to your question (I am a bit
vague on what you really want), but if it is, I can point you to
ContractualJ [http://contractualj.com], which attempts to provide a more
appropriate abstraction of both arrays and Strings by way of
net.tmorris.adt.Sequence. In the case of an ordered sequence of characters,
I prefer to use a Sequence<Character> along with a
net.tmorris.primitives.Charable over a java.lang.String, since this allows
me to perform operations on the contract without the intrinsic excess
contract of most (all?) Java types. As a consequence, this often results in
significant performance improvements, since there is no need to copy data
because of excess type contract.
</plug>