Out of memory with file streams

Hendrik Maryns

Hi all,

I have a little proggie that queries large linguistic corpora. To make
the data searchable, I do some preprocessing on the corpus file. I now
start getting into trouble when those files are big. Big means over 40
MB, which isn’t even that big, come to think of it.

So I am on the lookout for a memory leak; however, I can't find it. The
preprocessing method basically does the following (suppose the inFile
and the treeFile are given Files):

final BufferedReader corpus = new BufferedReader(new FileReader(inFile));
final ObjectOutputStream treeOut = new ObjectOutputStream(
        new BufferedOutputStream(new FileOutputStream(treeFile)));
final int nbTrees = TreebankConverter.parseNegraTrees(corpus, treeOut);
try {
    treeOut.close();
} catch (final IOException e) {
    // if it cannot be closed, it wasn't open
}
try {
    corpus.close();
} catch (final IOException e) {
    // if it cannot be closed, it wasn't open
}

parseNegraTrees then does the following: it scans through the input
file, constructs the trees described in it in a text format (NEGRA),
converts those trees to a binary format, and writes them as Java
objects to the treeFile. Each of those trees consists of nodes with a
left daughter, a right daughter and a list of at most 5 strings. And
those are short strings: words or abbreviations. So this shouldn't
take too much memory, I would think.
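
In code, a node is schematically something like this (a simplified
sketch; the real classes have a bit more to them):

import java.io.Serializable;
import java.util.List;

// Simplified: a binary node with two daughters and a few labels.
class BinaryNode implements Serializable {
    BinaryNode leftDaughter;
    BinaryNode rightDaughter;
    List<String> labels; // at most 5 short strings: words or abbreviations
}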

This is also done one by one:

TreebankConverter.skipHeader(corpus);
String bosLine;
while ((bosLine = corpus.readLine()) != null) {
    final StringTokenizer tokens = new StringTokenizer(bosLine);
    final String treeIdLine = tokens.nextToken();
    if (!treeIdLine.equals("%%")) {
        final String treeId = tokens.nextToken();
        final NodeSet forest = parseSentenceNodes(corpus);
        final Node root = forest.toTree();
        final BinaryNode binRoot = root.toBinaryTree(new ArrayList<Node>(), 0);
        final BinaryTree binTree = new BinaryTree(binRoot, treeId);
        treeOut.writeObject(binTree);
    }
}

I see no reason in the above code why the GC wouldn’t discard the trees
that have been constructed before.

So the only place for memory problems I see here is the file access.
However, as I grasp from the Javadocs, both FileReader and
FileOutputStream are indeed streams that do not have to remember what
came before. Is the buffering the problem, maybe?

TIA, H.
--
Hendrik Maryns
http://tcl.sfs.uni-tuebingen.de/~hendrik/
==================
http://aouw.org
Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html


 
Lew

Hendrik Maryns wrote:
....
So I am on the lookout for a memory leak; however, I can't find it. The
preprocessing method basically does the following (suppose the inFile
and the treeFile are given Files):

...

When incomplete code is posted with a question, the answer is pretty much
always in the code not posted. Check through the code you left out of your
post for packratted references.

<http://mindprod.com/jgloss/sscce.html>
<http://mindprod.com/jgloss/packratting.html>
 
Mark Space

Hendrik said:
So I am on the lookout for a memory leak; however, I can't find it. The
preprocessing method basically does the following (suppose the inFile
and the treeFile are given Files):

More likely you have an error in the code and the tree is growing to the
size of the entire file.

Do you have access to a profiler? Most profilers also analyze garbage
collection. Each object that survives garbage collection gets
marked as one "generation" older by the gc, so the trick is to look for
objects which survive many generations.

If you don't have a good debugger/profiler, get one. It's basically
required for any serious development work. The profiler that comes with
NetBeans 6 is excellent, and it's trivial to import an existing project
into NetBeans. Give it a shot.
 
Zig

Hendrik Maryns wrote:
...

So the only place for memory problems I see here is the file access.
However, as I grasp from the Javadocs, both FileReader and
FileOutputStream are indeed streams that do not have to remember what
came before. Is the buffering the problem, maybe?

You are right, FileOutputStream & FileReader are pretty primitive.
ObjectOutputStream, OTOH, is a different matter. ObjectOutputStream will
keep references to objects written to the stream, which enables it to
handle cyclic object graphs and makes repeated references to the same
object behave predictably.

You can force ObjectOutputStream to clean up by using:

treeOut.writeObject(binTree);
treeOut.reset();

This should notify ObjectOutputStream that you will not be re-referencing
any previously written objects, and allow the stream to release its
internal references.
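
If you want to see the effect in isolation, here is a minimal
self-contained sketch (the Tree class is a made-up stand-in for your
BinaryTree, and the output simply discards bytes, so the only memory
growth comes from the stream's own bookkeeping). Run it without the
reset() and the heap fills up; with it, it runs in constant memory:

import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;

public class ResetDemo {

    // Made-up stand-in for a parsed tree: a small serializable payload.
    static class Tree implements Serializable {
        private static final long serialVersionUID = 1L;
        final String id;
        Tree(final String id) { this.id = id; }
    }

    public static void main(final String[] args) throws IOException {
        // Discard all bytes; we only care about the stream's handle table.
        final ObjectOutputStream out = new ObjectOutputStream(new OutputStream() {
            @Override
            public void write(final int b) { /* discard */ }
        });
        for (int i = 0; i < 10000000; i++) {
            out.writeObject(new Tree("tree-" + i));
            // Without this, the handle table keeps a strong reference to
            // every Tree ever written and the heap grows without bound.
            out.reset();
        }
        out.close();
    }
}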

HTH,

-Zig
 
Hendrik Maryns

Mark Space schreef:
More likely you have an error in the code and the tree is growing to the
size of the entire file.

Do you have access to a profiler? Most profilers also analyze garbage
collection. Each object that survives garbage collection gets
marked as one "generation" older by the gc, so the trick is to look for
objects which survive many generations.

If you don't have a good debugger/profiler, get one. It's basically
required for any serious development work. The profiler that comes with
NetBeans 6 is excellent, and it's trivial to import an existing project
into NetBeans. Give it a shot.

I use Eclipse and installed TPTP, but it's a PITA; I didn't get it
running. Maybe I really should give NetBeans another try. However,
Zig's answer was right on the spot, so no more need for now.

Thanks anyway, H.
--
Hendrik Maryns
http://tcl.sfs.uni-tuebingen.de/~hendrik/
==================
http://aouw.org
Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html


 
Hendrik Maryns

Zig schreef:
You are right, FileOutputStream & FileReader are pretty primitive.
ObjectOutputStream, OTOH, is a different matter. ObjectOutputStream will
keep references to objects written to the stream, which enables it to
handle cyclic object graphs and makes repeated references to the same
object behave predictably.

You can force ObjectOutputStream to clean up by using:

treeOut.writeObject(binTree);
treeOut.reset();

This should notify ObjectOutputStream that you will not be
re-referencing any previously written objects, and allow the stream to
release its internal references.

That's exactly what I needed. The API documentation could have been more
informative about the memory implications of this backreferencing
mechanism. The memory footprint is not even mentioned in the Javadoc of
the reset() method.

Thank you very much!
H.
--
Hendrik Maryns
http://tcl.sfs.uni-tuebingen.de/~hendrik/
==================
http://aouw.org
Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html


 
Mark Space

Hendrik said:
The API documentation could have been more informative about the memory
implications of this backreferencing mechanism. The memory footprint is
not even mentioned in the Javadoc of the reset() method.

"Reset" is also a terrible name for a method of an output stream.
"Reset" normally means something totally different when talking about IO
streams.

From collections, "clear" probably would have been better. Maybe
"cleanUpMemory" would have been even better. And maybe using weak
references would have been even better still. Don't make the user deal
with the internal memory of an object.
 
Mark Thornton

Mark said:
"Reset" is also a terrible name for a method of an output stream.
"Reset" normally means something totally different when talking about IO
streams.

From collections, "clear" probably would have been better. Maybe
"cleanUpMemory" would have been even better. And maybe using weak
references would have been even better still. Don't make the user deal
with the internal memory of an object.

WeakReferences would probably be a bad idea --- the overhead is
significant.

Mark Thornton
 
Mark Space

Mark said:
WeakReferences would probably be a bad idea --- the overhead is
significant.


I've seen an article or white paper on this, I think from Sun. They
said to put all references in a regular old Map (HashMap, etc.), then
use one weak or soft reference to the map itself.

SoftReference<Map<String, String>> cache =
        new SoftReference<Map<String, String>>(new HashMap<String, String>());

Then, when the GC needs to collect this, it sees one soft reference,
frees that, and then can automatically free everything in the hash map
without any more checking or involved processing. It's much much
lighter on the GC than using a weak hash map, which makes the GC check
every single weak reference.
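
Something like this, I mean (just a sketch; load() is a hypothetical
stand-in for whatever expensive lookup you are caching):

import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

public class SoftCache {

    // One soft reference guards the whole map, so the GC can drop the
    // entire cache in one step instead of tracking a reference per entry.
    private SoftReference<Map<String, String>> cacheRef =
            new SoftReference<Map<String, String>>(new HashMap<String, String>());

    public String get(final String key) {
        Map<String, String> cache = cacheRef.get();
        if (cache == null) {
            // The GC cleared the cache under memory pressure; start fresh.
            cache = new HashMap<String, String>();
            cacheRef = new SoftReference<Map<String, String>>(cache);
        }
        String value = cache.get(key);
        if (value == null) {
            value = load(key);
            cache.put(key, value);
        }
        return value;
    }

    // Hypothetical expensive computation or lookup being cached.
    private String load(final String key) {
        return "value-for-" + key;
    }
}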

I'll try to find that article, I don't see it right now....
 
Zig

"Reset" is also a terrible name for a method of an output stream.
"Reset" normally means something totally different when talking about IO
streams.

From collections, "clear" probably would have been better. Maybe
"cleanUpMemory" would have been even better. And maybe using weak
references would have been even better still. Don't make the user deal
with the internal memory of an object.


Well, "reset" does reset the ObjectOutputStream back to the state it was
initialized to: all objects are new, no class descriptors have been
written, etc. Using reset() will decrease your memory overhead, but at the
cost of increasing your data size (since the class descriptors have to be
re-serialized).

Weak / Soft references would be slick, but there is an extra "quirk" that
would have to be addressed. reset() puts a RESET marker in the stream so
that ObjectInputStream can recognize that the stream has been reset (at
which point it dumps its references). Even if ObjectOutputStream could
use soft references to determine that an object is no longer referenced,
you would still have to notify ObjectInputStream when it is safe to clear
its references to objects that will not be re-referenced later in the
stream.

For simple objects you can get around all of this by using
ObjectOutputStream.writeUnshared() / ObjectInputStream.readUnshared(), but
those will still track references for the nested objects you'll get in a
tree structure :/
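
To illustrate the marker: the reader needs no special handling, since
readObject() consumes the RESET marker transparently and drops its
handle table at the same point the writer did. A quick sketch:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

public class ResetRoundTrip {
    public static void main(final String[] args)
            throws IOException, ClassNotFoundException {
        final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        final ObjectOutputStream out = new ObjectOutputStream(buffer);
        out.writeObject("first");
        out.reset();             // a RESET marker goes into the stream here
        out.writeObject("second");
        out.close();

        final ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(buffer.toByteArray()));
        System.out.println(in.readObject()); // prints: first
        System.out.println(in.readObject()); // prints: second
        in.close();
    }
}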

Anyway, hope that was interesting food for thought,

-Zig
 
Zig

Hendrik Maryns wrote:

That's exactly what I needed. The API documentation could have been more
informative about the memory implications of this backreferencing
mechanism. The memory footprint is not even mentioned in the Javadoc of
the reset() method.

Glad to help!
 
Hendrik Maryns

Mark Space schreef:
"Reset" is also a terrible name for a method of an output stream.
"Reset" normally means something totally different when talking about IO
streams.

From collections, "clear" probably would have been better. Maybe
"cleanUpMemory" would have been even better. And maybe using weak
references would have been even better still. Don't make the user deal
with the internal memory of an object.

That’s what I think as well, so I created a bug report: 1209747 (not yet
visible).

H.
--
Hendrik Maryns
http://tcl.sfs.uni-tuebingen.de/~hendrik/
==================
http://aouw.org
Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html


 
