After deserialization program occupies about 66% more RAM

setar

My program stores a dictionary of about 100'000 words in RAM. This
dictionary occupies about 380MB of RAM. But when I serialize that
dictionary and then deserialize it, the program occupies about 620MB. The
dictionary is the only variable the program references at the moment of
serialization and after deserialization.

Deserialization works correctly, i.e. after deserialization I obtain the
same dictionary object as I had before serialization (I can verify this
because I can dump the dictionaries to text files and compare them - the
files representing the dictionary before and after serialization are
identical).
After deserialization I close the stream which I used to read the
dictionary object from the file, and I run garbage collection.

Here are the methods I use to serialize and deserialize:
------------------------------------
public class Dictionary implements Serializable {
    ...
    public void serializeTo(String fileName) throws IOException {
        ObjectOutputStream out =
            new ObjectOutputStream(new FileOutputStream(fileName));
        out.writeObject(this);
        out.close();
    }

    public static Dictionary deserializeFrom(String fileName)
            throws IOException, ClassNotFoundException {
        ObjectInputStream in =
            new ObjectInputStream(new FileInputStream(fileName));
        Dictionary dictionary = (Dictionary) in.readObject();
        // the collator hasn't been serialized to the file,
        // so we must recreate it manually
        dictionary.collator = Collator.getInstance(dictionary.getLocale());
        in.close();
        System.gc();
        return dictionary;
    }
}
------------------------------------

Does anybody know what I can do to decrease the amount of memory used
after deserialization?

Thanks for any hints.
 
Oliver Wong

setar said:
My program stores a dictionary of about 100'000 words in RAM. This
dictionary occupies about 380MB of RAM. But when I serialize that
dictionary and then deserialize it, the program occupies about 620MB.
The dictionary is the only variable the program references at the
moment of serialization and after deserialization. [...]

Does anybody know what I can do to decrease the amount of memory used
after deserialization?

Since it looks like the memory is approximately doubled, perhaps you now
have two copies of your dictionary object in memory? Did you release all
references to your first copy after serializing?

- Oliver
 
Thomas Hawtin

setar said:
My program stores a dictionary of about 100'000 words in RAM. This
dictionary occupies about 380MB of RAM. But when I serialize that
dictionary and then deserialize it, the program occupies about 620MB.
The dictionary is the only variable the program references at the moment
of serialization and after deserialization.

My guess is that it's the data held in the object streams that causes
the apparent increase in memory usage. How are you measuring it?

A memory profiler may well help. Even using a basic Sun J2SE 5.0 JDK,
for instance, you can use jmap -histo.

Tom Hawtin
 
setar

User said:
Since it looks like the memory is approximately doubled, perhaps you
now have two copies of your dictionary object in memory? Did you release
all references to your first copy after serializing?

That is not the problem, because my deserialization test program looks
like this:

public static void main(String[] args) throws Exception {
    dd.library.dictionary.Dictionary dictionary =
        dd.library.dictionary.Dictionary.deserializeFrom("dictionary.serialize");
    int i = 0; // here is a breakpoint where I check the amount of memory
               // used after deserialization
}

I use a new program and there is only one variable - one dictionary.
 
setar

User said:
My guess is that it's the data held in the object streams that causes the
apparent increase in memory usage. How are you measuring it?

A memory profiler may well help. Even using a basic Sun J2SE 5.0 JDK, for
instance, you can use jmap -histo.

But I close the stream before measuring memory usage. I measure memory
usage in the Windows task manager (I subtract the amount of memory used by
all programs before my program runs from the amount of memory used by all
programs after my program has built the dictionary).

I have used Java Memory Profiler (www.khelekore.org/jmp/) and it shows
that program objects use more or less the same amount of memory before
serialization and after deserialization (as I remember, in Java Memory
Profiler there is no summary amount of memory; there is only the amount of
memory used by the objects of each class - I checked the classes which use
more than 100kB of memory).
 
setar

I have used Java Memory Profiler (www.khelekore.org/jmp/) and it shows
that program objects use more or less the same amount of memory before
serialization and after deserialization [...]


Sorry - Java Memory Profiler does show the total amount of memory used by
the program, and it is the same before serialization and after
deserialization. I measured it some days ago when the dictionary was
smaller:
* before serialization:
- Windows task manager: 160MB
- Java Memory Profiler: 122.24MB (used by 3'212'143 objects)
* after deserialization:
- Windows task manager: 280MB
- Java Memory Profiler: 120.38MB (used by 3'209'026 objects)
 
Thomas Hawtin

setar said:
But I close the stream before measuring memory usage.

But the stream and all its gubbins is for a moment in memory at the same
time as the entire dictionary. So the maximum allocated heap will rise at
that point (IIRC, in some circumstances it can be handed back to the
operating system, but I don't know all the ins and outs of that).

Exactly what happens is likely to be version dependent. For instance, I
guess that pre-1.5 ObjectInputStream may create String objects with
oversized char arrays.

You may be able to reduce the amount of memory consumed by using a
customised serial form. Key classes should define readObject and
writeObject, in which they should use, for instance, readUnshared and
writeUnshared.
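A customised serial form along those lines might look like the sketch
below. This is only an illustration, not setar's actual code: the Entry
class, its fields, and its accessors are all made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical dictionary entry with a customised serial form that
// writes its strings via writeUnshared/readUnshared, as Thomas suggests.
public class Entry implements Serializable {
    private static final long serialVersionUID = 1L;

    // transient: excluded from the default serial form, written by hand below
    private transient String word;
    private transient String[] synonyms;

    public Entry(String word, String[] synonyms) {
        this.word = word;
        this.synonyms = synonyms;
    }

    public String getWord() { return word; }
    public String[] getSynonyms() { return synonyms; }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        out.writeUnshared(word);   // written without back-reference bookkeeping
        out.writeInt(synonyms.length);
        for (String s : synonyms) {
            out.writeUnshared(s);
        }
    }

    private void readObject(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        word = (String) in.readUnshared();
        int n = in.readInt();
        synonyms = new String[n];
        for (int i = 0; i < n; i++) {
            synonyms[i] = (String) in.readUnshared();
        }
    }
}
```

Note that writeUnshared/readUnshared trade away back-reference sharing in
the stream for lower bookkeeping overhead; whether that helps here depends
on how much sharing the real data actually contains.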
setar said:
I measure memory usage in the Windows task manager (I subtract the amount
of memory used by all programs before my program runs from the amount of
memory used by all programs after building the dictionary).

Such measurements of memory are notoriously misleading.

Tom Hawtin
 
Eric Sosman

setar said:
My program stores a dictionary of about 100'000 words in RAM. This
dictionary occupies about 380MB of RAM. [...]

... thus using an average of 3800 bytes per word! What
are you storing: bit-map images of the printed text?

Whatever it is, my advice is to spend no time at all
trying to tune and adjust and tweak a data structure that is
so grotesquely bloated. Just throw it away and replace it
with something else -- an ArrayList<String> would be orders
of magnitude more efficient.
 
setar

User said:
My program stores a dictionary of about 100'000 words in RAM. This
dictionary occupies about 380MB of RAM. [...]

... thus using an average of 3800 bytes per word! What
are you storing: bit-map images of the printed text?

I store not only the text of the words but also much more information
about them, for example: translation to English, synonyms, hypernyms,
hyponyms (ontology) and language. For each of the mentioned elements
(they are actually phrases, not single words) I also store the phrase
parsed into component words, with information about the type of
connection between the words, and the phrase text generated by
concatenating the parsed words (it can be different).
I will try to decrease the amount of memory used by one word (phrase),
but I estimated that on average one word must occupy at least 700 bytes.
Besides these, I have three indices to be able to search for words.
 
Chris Uppal

setar said:
My program stores a dictionary of about 100'000 words in RAM. This
dictionary occupies about 380MB of RAM. But when I serialize that
dictionary and then deserialize it, the program occupies about 620MB.
The dictionary is the only variable the program references at the
moment of serialization and after deserialization.

Like Thomas, I rather suspect that you may be misinterpreting what you are
seeing. One simple test would be to deserialise the same data say 10 times
(or as many times as you can without running out of memory[*]) and keep
the results in some array (so they don't get reclaimed). Use a new
ObjectInputStream each time. If the memory used keeps going up by the same
unexpectedly large amount each time, then you'll know the problem is real.

([*] you'll probably have to use a somewhat smaller dictionary for these
tests.)
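A sketch of that test could look like the following. The helper name is
invented, the file name "dictionary.serialize" is taken from setar's
earlier post, and the small map in main is only a stand-in for the real
dictionary:

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RepeatTest {

    // Deserialise the same file 'times' times with a fresh
    // ObjectInputStream each round, keep every copy reachable in a list,
    // and report heap usage after each copy.
    static List<Object> deserializeRepeatedly(String fileName, int times)
            throws Exception {
        List<Object> copies = new ArrayList<Object>();
        Runtime rt = Runtime.getRuntime();
        for (int i = 0; i < times; i++) {
            ObjectInputStream in =
                new ObjectInputStream(new FileInputStream(fileName));
            copies.add(in.readObject());
            in.close();
            System.gc();
            long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
            System.out.println("after copy " + (i + 1) + ": " + usedMb + " MB");
        }
        return copies;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in data; the real test would use the serialized dictionary.
        Map<String, String> dict = new HashMap<String, String>();
        dict.put("foo", "bar");
        ObjectOutputStream out =
            new ObjectOutputStream(new FileOutputStream("dict.ser"));
        out.writeObject(dict);
        out.close();

        List<Object> copies = deserializeRepeatedly("dict.ser", 3);
        System.out.println("copies kept: " + copies.size());
    }
}
```

If the reported heap grows by roughly the full dictionary size on every
iteration, the extra memory really is retained by the deserialized copies
rather than by transient stream overhead.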

If the problem is real, then one thing I'd check is the way that String
sharing is affecting your application. If you have one long String and
then create many substrings from it, the substrings will share the
internal char[] array of the main String. If you serialise all the strings
(including the original one) and then deserialise them, the sharing will
be lost, and so the overall amount of memory used will increase.

-- chris
 
setar

I've installed an evaluation version of JProfiler. I will check everything
and write back later.
 
Robert Klemme

User said:
My program stores a dictionary of about 100'000 words in RAM. This
dictionary occupies about 380MB of RAM. [...]
... thus using an average of 3800 bytes per word! What
are you storing: bit-map images of the printed text?

I store not only the text of the words but also much more information
about them, for example: translation to English, synonyms, hypernyms,
hyponyms (ontology) and language. For each of the mentioned elements
(they are actually phrases, not single words) I also store the phrase
parsed into component words, with information about the type of
connection between the words, and the phrase text generated by
concatenating the parsed words (it can be different).
I will try to decrease the amount of memory used by one word (phrase),
but I estimated that on average one word must occupy at least 700 bytes.
Besides these, I have three indices to be able to search for words.

Serialization blows up strings. You can see this with the attached program
if you run it under a debugger (I tested with 1.4.2 and 1.5.0 with
Eclipse). You can see that (1) copies of strings no longer share the char
array and (2) the char array is larger than that of the original even
though only some characters are used (the latter is true for 1.4.2 only,
so Sun actually has improved this).

Kind regards

robert

package serialization;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

public class SharingTest {

    /**
     * @param args
     * @throws IOException in case of error
     * @throws ClassNotFoundException never
     */
    public static void main( String[] args ) throws IOException, ClassNotFoundException {
        String root = "foobar";
        Object[] a1 = { root, root.substring( 3 ) };
        Object[] a2 = { root, root.substring( 3 ) };

        ByteArrayOutputStream byteOut = new ByteArrayOutputStream();
        ObjectOutputStream objectOut = new ObjectOutputStream( byteOut );

        objectOut.writeObject( a1 );
        objectOut.writeObject( a2 );

        objectOut.close();

        ByteArrayInputStream byteIn = new ByteArrayInputStream( byteOut.toByteArray() );
        ObjectInputStream objectIn = new ObjectInputStream( byteIn );

        Object[] c1 = ( Object[] ) objectIn.readObject();
        Object[] c2 = ( Object[] ) objectIn.readObject();

        // breakpoint here
        System.out.println( c1 == c2 );

        for ( int i = 0; i < c1.length; ++i ) {
            System.out.println( i + ": " + ( c1[i] == c2[i] ) );
        }
    }
}
 
Paul Davis

Not so sure about this test. By adding a similar println block after the
first declarations, I get the same results as for the deserialized
objects.
Robert said:
Serialization blows up strings. You can see this with the attached program
if you run it under a debugger (I tested with 1.4.2 and 1.5.0 with
Eclipse). [...]

package serialization;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

public class SharingTest {

    /**
     * @param args
     * @throws IOException in case of error
     * @throws ClassNotFoundException never
     */
    public static void main( String[] args ) throws IOException, ClassNotFoundException {
        String root = "foobar";
        Object[] a1 = { root, root.substring( 3 ) };
        Object[] a2 = { root, root.substring( 3 ) };

        /* This produces the same results as the one below */
        System.out.println( a1 == a2 );

        for ( int i = 0; i < a1.length; ++i ) {
            System.out.println( i + ": " + ( a1[i] == a2[i] ) );
        }

        ByteArrayOutputStream byteOut = new ByteArrayOutputStream();
        ObjectOutputStream objectOut = new ObjectOutputStream( byteOut );

        objectOut.writeObject( a1 );
        objectOut.writeObject( a2 );

        objectOut.close();

        ByteArrayInputStream byteIn = new ByteArrayInputStream( byteOut.toByteArray() );
        ObjectInputStream objectIn = new ObjectInputStream( byteIn );

        Object[] c1 = ( Object[] ) objectIn.readObject();
        Object[] c2 = ( Object[] ) objectIn.readObject();

        // breakpoint here
        System.out.println( c1 == c2 );

        for ( int i = 0; i < c1.length; ++i ) {
            System.out.println( i + ": " + ( c1[i] == c2[i] ) );
        }
    }
}

 
Paul Davis

Robert said:
Serialization blows up strings. You can see this with the attached program
if you run it under a debugger (I tested with 1.4.2 and 1.5.0 with
Eclipse). You can see that (1) copies of strings do not share the char
array any more and (2) that the char array is larger than that of the
original even though only some characters are used (the latter is true
for 1.4.2 only, so Sun actually has improved this). [...]


Changing the code to actually show the interned reference shows that the
deserialized version produces the same results as the one before
serialization.
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

public class SharingTest
{

    /**
     * @param args
     * @throws IOException in case of error
     * @throws ClassNotFoundException never
     */
    public static void main(String[] args)
        throws IOException, ClassNotFoundException
    {
        String root = "foobar";
        String[] a1 = { root, root.substring(3) };
        String[] a2 = { root, root.substring(3) };
        System.out.println(a1 == a2);

        for (int i = 0; i < a1.length; ++i)
        {
            System.out.println(i + ": " + (a1[i].intern() == a2[i].intern()));
        }

        ByteArrayOutputStream byteOut = new ByteArrayOutputStream();
        ObjectOutputStream objectOut = new ObjectOutputStream(byteOut);

        objectOut.writeObject(a1);
        objectOut.writeObject(a2);

        objectOut.close();

        ByteArrayInputStream byteIn =
            new ByteArrayInputStream(byteOut.toByteArray());
        ObjectInputStream objectIn = new ObjectInputStream(byteIn);

        String[] c1 = (String[]) objectIn.readObject();
        String[] c2 = (String[]) objectIn.readObject();
        System.out.println("-----------------------------------");
        // breakpoint here
        System.out.println(c1 == c2);

        for (int i = 0; i < c1.length; ++i)
        {
            System.out.println(i + ": " + (c1[i].intern() == c2[i].intern()));
        }
    }
}
 
Paul Davis

setar said:
I've installed an evaluation version of JProfiler. I will check everything
and write back later.

Could you post the code?
It might be easier to evaluate something that we can actually see.
:)
 
Robert Klemme

Paul said:
Changing the code to actually show the internal reference shows that
the deserialized version produces the same results as the one before
serialization.

What exactly do you mean by "same results"? Of course the string values
remain the same. I was talking about the internal representation (i.e. the
char arrays used). You cannot see that with a Java program alone; you need
a memory profiler or a debugger to actually see those instances and
determine which are identical and which are not.

Also, using String.intern() completely changes the semantics memory-wise.
Of course the comparison returns true because it is actually the
same instance (and thus also the same char[] internally). My point was
that if strings are constructed from each other then serializing and
deserializing can seriously affect memory usage because of the changed
internal representation (no more sharing of char[]).

Using intern() might also be a bad idea for changing data because interned
strings will continuously increase the VM's memory use. This might not be
an issue for short-lived applications but it certainly can be for
long-running apps.
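One way to get the sharing benefit of interning without the permanent
VM-wide pool is a private canonical map. The Canonicalizer class below is
a hypothetical sketch, not something from the thread: duplicates are
collapsed to one instance, but the pool is an ordinary object that can be
garbage-collected when you are done with it.

```java
import java.util.HashMap;
import java.util.Map;

// Private alternative to String.intern(): equal strings are collapsed to
// the first instance seen, and the pool itself is ordinary heap data.
public class Canonicalizer {
    private final Map<String, String> pool = new HashMap<String, String>();

    public String canon(String s) {
        String existing = pool.get(s);  // equals()-based lookup
        if (existing != null) {
            return existing;            // reuse the first instance seen
        }
        pool.put(s, s);
        return s;
    }

    public static void main(String[] args) {
        Canonicalizer c = new Canonicalizer();
        // new String(...) forces two distinct instances with equal values
        String a = c.canon(new String("word"));
        String b = c.canon(new String("word"));
        System.out.println(a == b);  // prints "true": both are the first instance
    }
}
```

Applied to every string as it comes out of readObject, this would restore
the duplicate-collapsing that serialization loses, and the map can simply
be dropped after deserialization finishes.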

Regards

robert
 
Paul Davis

Robert said:
What exactly do you mean by "same results"? [...]

I apologize for being unclear; by "same results" I meant that:

System.out.println(a1 == a2);
for (int i = 0; i < a1.length; ++i)
{
    System.out.println(i + ": " + (a1[i] == a2[i]));
}

produced the same output as:

System.out.println(c1 == c2);
for (int i = 0; i < c1.length; ++i)
{
    System.out.println(i + ": " + (c1[i] == c2[i]));
}

meaning that there is no difference between the original values and the
deserialized ones.

Robert said:
Of course string values remain the same. I was talking about internal
representation (i.e. the char arrays used). You cannot see that with a
Java program alone, you need a memory profiler or a debugger to actually
see those instances and determine which are identical and which not.

The intern() method returns a reference to the string's canonical
instance in the pool (according to the javadoc anyway).

Robert said:
Also, using String.intern() completely changes the semantics memory-wise.
Of course the comparison returns true because it is actually the same
instance (and thus also the same char[] internally). My point was that if
strings are constructed from each other then serializing and
deserializing can seriously affect memory usage because of the changed
internal representation (no more sharing of char[]).

Using intern() might also be a bad idea for changing data because
interned strings will continuously increase the VM's memory. This might
not be an issue for short-lived applications but it certainly can be for
long-running apps.

I agree the intern() method should probably never be used. I was using
it here to demonstrate that the objects were pointing to the same
reference internally.
Regards

robert

Please forgive me, but I don't understand what the example is trying to
demonstrate when the tests performed on the deserialized objects produce
the same output as the tests on the original objects:
false
0: true
1: false
-----------------------------------
false
0: true
1: false
 
Robert Klemme

Paul said:
Robert said:
What exactly do you mean by "same results"? [...]

I apologize for being unclear; by "same results" I meant that:

System.out.println(a1 == a2);
for (int i = 0; i < a1.length; ++i)
{
    System.out.println(i + ": " + (a1[i] == a2[i]));
}

produced the same output as:

System.out.println(c1 == c2);
for (int i = 0; i < c1.length; ++i)
{
    System.out.println(i + ": " + (c1[i] == c2[i]));
}


Ok, now I understand. But that was not the main point of that piece of
code.
meaning that there is no difference between the original values and the
deserialized ones.

With regard to internal relationships between instances, yes. But
deserialized instances are set up differently with regard to size and
sharing of the internal buffer.
The intern() method returns a reference to the internal reference used
by the string object (according to the javadoc anyway).

I am not sure I fully agree; there is no such thing as an "internal
reference" - "interned reference" is probably a bit better.
String.intern() will either return the same reference and store it in its
internal map (or whatever representation Sun chose), or you get a
reference to another instance representing an equivalent string that is
already present in the internal data structure.

Quote:

A pool of strings, initially empty, is maintained privately by the class
String.

When the intern method is invoked, if the pool already contains a string
equal to this String object as determined by the equals(Object) method,
then the string from the pool is returned. Otherwise, this String object
is added to the pool and a reference to this String object is returned.
I agree the intern() method should probably never be used. I was using
it here to demonstrate that the objects were pointing to the same
reference internally.

Not exactly: you interned the strings after deserialization, and thus it
comes as no surprise that they point to the same instance after you did
that.
Please forgive but, I don't understand what the example is trying to
demonstrate when the tests performed on the deserialized objects
produce the same output as the tests on the original objects.

As said, that output was not the main point. As I wrote above, set a
breakpoint at the line indicated and then look at the object identities.
Then you'll see what I meant and have tried to convey from the beginning.

Kind regards

robert
 
Eric Sosman

setar said:
User said:
My program stores a dictionary of about 100'000 words in RAM. This
dictionary occupies about 380MB of RAM. [...]

... thus using an average of 3800 bytes per word! What
are you storing: bit-map images of the printed text?


I store not only the text of the words but also much more information
about them, for example: translation to English, synonyms, hypernyms,
hyponyms (ontology) and language. [...] Besides these, I have three
indices to be able to search for words.

Thanks for the more complete description. It could be
(I can't tell; your description is still only partial) that
it's the "other" data that's inflating the size when you
serialize and deserialize. Perhaps a memory profiler could
point out the pieces of the data structure that grow unusually
large when you do this.
 
vladimirkondratyev

It looks like there is a memory leak (in your code or somewhere inside
the Java core classes). I recommend YourKit Java Profiler
(http://www.yourkit.com) for memory analysis. The "Biggest Objects" and
"Class Tree" tools will immediately show the objects that retain most of
the memory.

BR, Vladimir
 
