Optimising Garbage Collection

Tony Morris

Roedy Green said:
Consider a program where you read a file into RAM in one great String,
process it, create a new version, then if it is different, write it
back in one fell swoop. Repeat for thousands of files.

From GC's point of view, you are creating two whacking huge byte
arrays for the I/O, a huge StringBuilder and two huge String objects,
then discarding them, for every file.

Ideally there should be some sort of mini GC where just these objects
are reclaimed, leaving the rest alone.

A generational collector obviously helps. Is there anything simple I
can do to boost this along?

Hello Roedy,
If you are making small modifications to instances of java.lang.String
before performing some operation on them, you might consider not creating
large copies of them at all. For example, in order to change one character
of a String, you must perform a data copy that includes your modification.
If what you really want is for that String to appear to have that character
changed through the String contract (its methods), then you run into an
unfortunate limitation of both java.lang.String and arrays: they are tied
to an in-memory representation of their backing data while exposing too
much contract, i.e. one that cannot be worked around (java.lang.CharSequence
is a frivolous attempt to lift this restriction). It would be better to
have that data arbitrarily represented in a type that provides an
appropriate abstraction, along with some method that makes the abstraction
appear as if a character were changed, without requiring that a data copy
occur. In fact, it would be better if that data copy didn't occur at all,
leaving the client to decide when to copy for performance reasons.

To illustrate, consider the following, more appropriate abstraction of an
array:

interface Array<E>
{
    int length();
    void set(E e, int index) throws ArrayIndexOutOfBoundsException;
    E get(int index) throws ArrayIndexOutOfBoundsException;
}

An implementation of this interface is not restricted to an in-memory
representation, but suppose it were anyway, for simplicity.
Now if I wanted to change the element at the nth index, a typical array
would force a copy with something like System.arraycopy.
Since you don't have that restriction here - since you have a more
appropriate abstraction - that enforcement does not hold.

You could write the following interface:

interface ReplaceNthElementOfArray
{
    <E> Array<E> replace(int n, E e, Array<E> a);
}

Implementations are not forced to perform a data copy in order to fulfil
the contract, since it carries much less excess than the contract of a
typical array.
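To make this concrete, here is a minimal sketch (MemArray and ReplacedView are hypothetical helper classes, not from any library): replace can be satisfied by a constant-space view over the original array, so no data copy ever occurs.

```java
// The Array abstraction from above, plus a hypothetical view-based
// "replace" that never copies the backing data.
interface Array<E> {
    int length();
    void set(E e, int index);
    E get(int index);
}

// A plain in-memory implementation, for demonstration only.
class MemArray<E> implements Array<E> {
    private final E[] data;
    MemArray(E[] data) { this.data = data; }
    public int length() { return data.length; }
    public E get(int index) { return data[index]; }
    public void set(E e, int index) { data[index] = e; }
}

// Answers with the replacement at one index and delegates everywhere
// else -- O(1) space, no copy; the client copies only if it chooses to.
class ReplacedView<E> implements Array<E> {
    private final Array<E> original;
    private final int replacedIndex;
    private E replacedValue;

    ReplacedView(Array<E> original, int index, E value) {
        this.original = original;
        this.replacedIndex = index;
        this.replacedValue = value;
    }

    public int length() { return original.length(); }

    public E get(int index) {
        return index == replacedIndex ? replacedValue : original.get(index);
    }

    public void set(E e, int index) {
        if (index == replacedIndex) { replacedValue = e; }
        else { original.set(e, index); }
    }
}
```

An implementation of replace(n, e, a) would then simply return new ReplacedView<>(a, n, e); the original array is untouched and nothing is copied.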

<plug shame="high">
I am not sure whether all of this is even relevant to your question (I am a
bit vague on what you really want), but if it is, I can point you to
ContractualJ [http://contractualj.com], which has attempted to provide a
more appropriate abstraction of both arrays and Strings, by way of
net.tmorris.adt.Sequence. In the case of an ordered sequence of characters,
I prefer to use a Sequence<Character> along with a
net.tmorris.primitives.Charable over a java.lang.String, since this allows
me to perform operations on the contract without the intrinsic excess
contract of most (all?) Java types. As a consequence, this often results in
significant performance improvements - no need to copy data due to excess
type contract.
</plug>
 

Tony Morris

Roedy Green said:
I have a number of applications that process a file in one fell
swoop.

1. seek out the pattern <!-- macro XXXX parm1 parm2 ... --> and
expand the macros, replacing any previous expansion. The macros are
written in Java.

2. compact removing whitespace from HTML that has no effect on
rendering.

3. clean up the "see" section at the bottom of each page, sorting
links, and resolving words into links.

4. remove excess blank lines

5. align in columns.

6. spaces to tabs

7. tabs to spaces

8. search for <DT entries and extract items for indexing.

9. convert & to &amp; appropriately

10. remove duplicate lines

11. reflow text to standard line length.

12. split files into pieces based on embedded markers.

That sounds very much to me like you do indeed need a more appropriate
abstraction than java.lang.String, since performing those operations
individually will incur a performance hit by way of a data copy each time
(as each String instance is created). You could, of course, try to do
something clever with collections, but they too have issues related to
inappropriate abstractions. I don't want to plug myself again, but in my
opinion, I cannot find a better recommendation - except, of course, to
write such an abstraction yourself.

Tony Morris
http://tmorris.net/
Java Questions and Answers
http://jqa.tmorris.net/
 
 

Stefan Ram

Tony Morris said:
In the case of an ordered sequence of characters, I prefer to
use a Sequence<Character> along with a
net.tmorris.primitives.Charable over a java.lang.String, since
this allows me to perform operations on the contract without
the intrinsic excess contract of most (all?) Java types. As a
consequence, this often results in significant performance
improvements - no need to copy data due to excess type
contract.

It somewhat reminds me of the way I implement "toString" in
"de.dclj.ram.type.tuple.ComparableTuple0".

Why do I have to convert my CharSequence to a String if the
client would be happy with a CharSequence?

Therefore, I offer "toCharSequence" for all clients:

/** {@inheritDoc} */
public java.lang.CharSequence toCharSequence()
{
    java.lang.CharSequence result = "()";
    if( this.size > 0 )
    {
        final StringBuilder builder = new StringBuilder();
        /* ... */
        result = builder;
    }
    return result;
}

And then "toString" for clients who really need
a string:

/** {@inheritDoc} */
public java.lang.String toString()
{
    return this.toCharSequence().toString();
}

(The full code is available at:
http://www.purl.org/stefan_ram/html/ram.jar/de/dclj/ram/type/tuple/ComparableTuple.html
http://www.purl.org/stefan_ram/html/ram.jar/de/dclj/ram/type/tuple/ComparableTuple0.html
http://www.purl.org/stefan_ram/java/de/dclj/ram/type/tuple/ComparableTuple.java
http://www.purl.org/stefan_ram/java/de/dclj/ram/type/tuple/ComparableTuple0.java
)
 

Robert Klemme

Roedy said:
Consider a program where you read a file into RAM in one great String,
process it, create a new version, then if it is different, write it
back in one fell swoop. Repeat for thousands of files.

From GC's point of view, you are creating two whacking huge byte
arrays for the I/O, a huge StringBuilder and two huge String objects,
then discarding them, for every file.

You don't need those huge byte arrays if you do conversion during reading
from the stream.
Ideally there should be some sort of mini GC where just these objects
are reclaimed, leaving the rest alone.

A generational collector obviously helps. Is there anything simple I
can do to boost this along?

I agree with Tony: you should think of a different representation of your
file contents in memory. This sounds like a typical application of a
parser. You might get away with a scanner that hacks your input into
tokens, though.

HTH

robert
 

Roedy Green

I agree with Tony: you should think of a different representation of your
file contents in memory. This sounds like a typical application of a
parser. You might get away with a scanner that hacks your input into
tokens, though.

The giant string works well. I do the I/O as fast as theoretically
possible. You can't beat indexOf for finding the parts of the
string I have to process. This way I can really rip through a large
number of files quickly.

My biggest file is only about half a megabyte and most are much
smaller.
 

Alun Harford

Roedy Green said:
The giant string works well. I do the I/O as fast as theoretically
possible. You can't beat indexOf for finding the parts of the
string I have to process. This way I can really rip through a large
number of files quickly.

My biggest file is only about half a megabyte and most are much
smaller.

Why not read it into a StringBuffer, change it, write it back? (You can even
recycle the StringBuffer if you want.) StringBuffer has indexOf, which is
still nice and fast.
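A sketch of that loop (file names come from the command line; the actual edits are elided), shown with StringBuilder, the unsynchronized sibling of StringBuffer: one builder is recycled across all files, so the large allocation happens only once.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;

public class RecycledBuffer {
    public static void main(String[] args) throws IOException {
        char[] chunk = new char[8192];
        StringBuilder sb = new StringBuilder(1 << 20);  // reused across all files
        for (String name : args) {
            sb.setLength(0);                            // clear without reallocating
            try (Reader in = new BufferedReader(new FileReader(name))) {
                int n;
                while ((n = in.read(chunk)) != -1) sb.append(chunk, 0, n);
            }
            int at = sb.indexOf("<!-- macro");          // StringBuilder's own search
            if (at >= 0) {
                // ... expand the macro in place ...
            }
            try (Writer out = new BufferedWriter(new FileWriter(name))) {
                out.append(sb);                         // no intermediate String copy
            }
        }
    }
}
```

Writing with Writer.append(CharSequence) avoids ever materialising the contents as a String, so only the builder's one char[] stays live per file.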

Alun Harford
 

Roedy Green

Why not read it into a StringBuffer, change it, write it back? (You can even
recycle the StringBuffer if you want.) StringBuffer has indexOf, which is
still nice and fast.

I am not changing it in the sense of overlaying a few pieces. I am
taking pieces of it, and replacing other pieces with pieces of a different
length. The most efficient way to do that is to create a new
StringBuilder, rather than deleting and inserting in the old one, at
least in terms of the CPU spent sliding data to make room for the pieces.
Your method should use less RAM, though, so long as StringBuilder
does not go allocating temp space to do its sliding.

I'll put that idea on my todo stack to experiment with, to see what the
net effect is. The program toggles back and forth between consuming
35% and 65% of the CPU cycles on an idle machine. I don't know if that
is mostly in my algorithm or in GC. GC should be quick. There aren't
that many live objects other than classes.
 

Roedy Green

I'll put that idea on my todo stack to experiment with, to see what the
net effect is. The program toggles back and forth between consuming
35% and 65% of the CPU cycles on an idle machine. I don't know if that
is mostly in my algorithm or in GC. GC should be quick. There aren't
that many live objects other than classes.

Your algorithm is actually much better than I first thought. Most of
the time when I expand a macro, the expansion is the same as last
time. There is no need to replace anything, and even if I did, it
would be the same length as the old expansion, and hence
StringBuilder.replace should not need to shift anything.
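That same-length case can be sketched directly: StringBuilder.replace overwrites the range in place, and when the replacement is exactly as long as the range, no characters have to move.

```java
public class SameLengthReplace {
    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder("head[AAAA]tail");
        int start = sb.indexOf("[") + 1;
        int end = sb.indexOf("]");
        // Replacement length equals (end - start), so nothing shifts.
        sb.replace(start, end, "BBBB");
        System.out.println(sb); // head[BBBB]tail
    }
}
```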

So long as my estimate for StringBuilder size is good, this should cut
my memory use in half. Thanks! I never would have thought of this on
my own.

Perhaps I can do this up brown by recycling the StringBuilder. I
noticed there is no StringBuilder.clear or setSize. I wonder what the
most efficient way to clear for reuse is.

I need to chew on the problem of deciding when to shrink a
StringBuilder. There is no point in keeping it huge just because one
file was huge. Maybe I could simply process the files in ascending
size order.
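For what it's worth, StringBuilder does have both of those operations under different names: setLength(0) clears it for reuse while keeping its capacity, and trimToSize() asks it to shrink the backing array. A sketch:

```java
public class ClearAndShrink {
    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder(4 * 1024 * 1024); // sized for the biggest file
        sb.append("...contents of one huge file...");
        // ... process the file, write it out ...

        sb.setLength(0);   // "clear": length drops to 0, the 4 MB char[] is kept,
                           // so the next file appends with no reallocation

        sb.trimToSize();   // "shrink": attempts to cut the backing array to fit,
                           // so one huge file does not pin a huge buffer forever
        System.out.println("capacity after trim: " + sb.capacity());
    }
}
```

Note that trimToSize is only a request ("attempts to reduce storage", per its javadoc), but on typical JVMs it does reallocate down to the current length.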

To make this even faster, I would need a way to do I/O direct to/from
the char[] in the StringBuilder.
 

Thomas Weidenfeller

Roedy said:
The giant string works well. I do the I/O as fast as theoretically
possible. You can't beat indexOf for finding the parts of the
string I have to process.

From your original description of the problem, where you wrote that you
look for certain sequences of characters, I have to disagree! As I
mentioned in previous discussions, look at Boyer-Moore,
Knuth-Morris-Pratt, Rabin-Karp or similar algorithms. You can beat
linear, character-by-character text searches under certain conditions.
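As a sketch of the idea (this is Horspool's simplification of Boyer-Moore, working on a raw char[]): on each mismatch the search window jumps by a precomputed shift, so most text characters are never examined at all.

```java
import java.util.Arrays;

// Boyer-Moore-Horspool search over a raw char[]. On a mismatch the
// window jumps by the shift for the text character aligned with the
// pattern's last position -- often the full pattern length at once.
public class Horspool {
    static int search(char[] text, char[] pat) {
        int m = pat.length;
        if (m == 0) return 0;
        // Shift table over all char values (65536 entries); a 256-entry
        // table indexed by the low byte is a common space saving.
        int[] shift = new int[Character.MAX_VALUE + 1];
        Arrays.fill(shift, m);               // chars not in pattern: full jump
        for (int i = 0; i < m - 1; i++) {
            shift[pat[i]] = m - 1 - i;       // distance from the pattern end
        }
        int pos = 0;
        while (pos <= text.length - m) {
            int j = m - 1;
            while (j >= 0 && text[pos + j] == pat[j]) j--;
            if (j < 0) return pos;           // full match at pos
            pos += shift[text[pos + m - 1]]; // jump ahead
        }
        return -1;                           // not found
    }
}
```

The advantage grows with pattern length; for very short patterns the jumps are small and a plain indexOf scan can be just as fast.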

As for chopping the data into pieces and rebuilding the stuff in
memory: have you considered not copying the data around, but assembling
the result from "chunks" of the original data and your changed data?

Your output would be described as a sequence of "chunks". Each chunk
either refers to original data (e.g. by an offset and length), or to some
new data you just changed.

You have gathering write operations in nio, so writing these chunks
could be relatively fast. "Could", because it depends on the nio
implementation and the operating system. If the operating system
supports scatter/gather I/O, and if nio uses this, it might work and be
somewhat faster.
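A sketch of that chunk idea with NIO (the file name and chunk boundaries are illustrative): the output is described as a sequence of ByteBuffers, some of them zero-copy views over the original bytes and some newly generated, and written with one gathering call.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class GatherDemo {
    public static void main(String[] args) throws IOException {
        ByteBuffer original = ByteBuffer.wrap(
            "<p>old text</p>".getBytes(StandardCharsets.US_ASCII));

        ByteBuffer head = original.duplicate();   // chunk 1: view of bytes 0..2
        head.position(0).limit(3);
        ByteBuffer changed = ByteBuffer.wrap(     // chunk 2: the new data
            "NEW".getBytes(StandardCharsets.US_ASCII));
        ByteBuffer tail = original.duplicate();   // chunk 3: view of bytes 11..14
        tail.position(11).limit(15);

        try (FileChannel out = FileChannel.open(Paths.get("out.html"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            out.write(new ByteBuffer[] { head, changed, tail }); // gathering write
        }
    }
}
```

The duplicate() views share the original's backing array, so the unchanged portions of the file are never copied in Java code; whether the kernel gathers them in one system call is up to the NIO implementation and the operating system.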

/Thomas
 

Roedy Green

You have gathering write operations in nio, so writing these chunks
could be relatively fast.

It would be interesting to see how clever nio can get about copying.
It still has to decode and encode. Decoding does not know how long the
result will be ahead of time. It makes a conservative guess and trims
the result before creating the string. FileReader thus dithers
horribly, copying the array over and over and over before you get your
string.

Your gather algorithm is extremely memory efficient, especially if the
expansions for the most part are the same as last time -- the usual
case.

I wrote a Boyer-Moore. It was only marginally faster than indexOf.
Perhaps it could do better by working on a raw char[].

see http://mindprod.com/jgloss/products1.html#BOYER

I do quite a bit of this sort of programming, so I am interested in
the general case: the best way to write such programs -- how do you get
the file in, modified, and out with minimal fuss and RAM consumption.
 
