An interesting memory release problem?


Kevin

Hi guys,
how can we explain what I just observed?

The task is to count the frequency of words in text. A simple change in
the code can result in a HUGE difference in memory usage.

The program goes like this:

The count data structure is a "Hashtable ht"; it hashes String to myInt
(a tiny class that wraps an int).

The text is read in line by line and broken into tokens as a String[].


String oneLine = read in one text line.
String[] tokens = oneLine.split(MySpliterString);

while (tokens != null)
{
    for (int i = 0; i < tokens.length; i++)
    {
        myInt count = (myInt) ht.get(tokens[index]);
        if (count == null)
        {
            ht.put(new String(tokens[index]), new KDOneInt(1)); // use little memory.
            // ht.put(tokens[index], new KDOneInt(1)); // use A LOT OF memory.
        }
        else
        {
            count.increaseValueByOne();
        }
    }

    // drop references to this line's tokens before reading the next line
    for (int cc = 0; cc < tokens.length; cc++)
    {
        tokens[cc] = null;
    }
    tokens = null;

    oneLine = read in one text line.
    tokens = oneLine.split(MySpliterString);
}

///////////
Now comes the magic part.
If I use the line as above:

ht.put(new String(tokens[index]), new KDOneInt(1)); // use little memory.

the program uses little memory.
But if I instead use the commented-out line:

// ht.put(tokens[index], new KDOneInt(1)); // use A LOT OF memory.

the program uses much, much more memory!

What is the reason for that?

Thanks and have a nice day!
 

Kevin

By the way, I observe the memory usage by using:

Runtime s_runtime = Runtime.getRuntime();
System.out.println((s_runtime.totalMemory() - s_runtime.freeMemory()) / 1000000);

The two versions above, which differ only by a "new String(..)", can
mean the difference between 10M and 60M of memory.
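
Note that readings like this bounce around unless the heap is settled
first; a small sketch, with the caveat that gc() is only a hint to the VM:

Runtime rt = Runtime.getRuntime();
rt.gc(); // request a collection so the numbers better reflect live data
long usedBytes = rt.totalMemory() - rt.freeMemory();
System.out.println(usedBytes / 1000000 + " MB in use");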
 

Kevin McMurtrie

I imagine that your tokenizer is using String.substring() for the
tokens. That method produces a String that shares a char[] buffer with
the original string. Every time you add a new token, you're bringing
along the whole text line it came from. It's very efficient for
temporary objects but it sucks up memory in your case.

The String(String) constructor breaks the buffer sharing; that's its
one and only purpose. It's good to use for long-term cache keys.
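
A sketch of the difference (assume reader is a BufferedReader):

String line = reader.readLine();      // backed by a potentially large char[]
String token = line.substring(0, 5);  // shares line's entire char[]
String compact = new String(token);   // copies only the 5 chars it needs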

If you wanted to optimize a bit more, you could create a custom hashing
map that operated directly on a char[] key and an int value. Converting to
an 8-bit character set like UTF-8 would help even more if there won't be
too many escape codes. The 8-bit conversion will have a computational
cost, of course, but you'll break the substring buffer sharing at the same
time.
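
For illustration, the key-wrapper part of that idea could look like
this; Utf8Key is a made-up name, not a library class:

import java.util.Arrays;

// Wraps UTF-8 bytes so a Hashtable/HashMap can key on them.
final class Utf8Key
{
    private final byte[] bytes;

    Utf8Key(String s) throws java.io.UnsupportedEncodingException
    {
        this.bytes = s.getBytes("UTF-8"); // copies, so no buffer sharing
    }

    public int hashCode()
    {
        return Arrays.hashCode(bytes);
    }

    public boolean equals(Object o)
    {
        return (o instanceof Utf8Key)
            && Arrays.equals(bytes, ((Utf8Key) o).bytes);
    }
}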
 

Thomas G. Marshall

Kevin coughed up:
Sorry, a tiny typo above:
the "KDOneInt" should be "myInt".


The String class actually maintains an internal array of characters. This
array can be much larger than the portion actually used, believe it or not.

If you look in the source, you'll discover that substrings grabbed from a
string use the *same* internal array as the entire string does.

Your source:

1. String oneLine = read in one text line.
2. String[] tokens = oneLine.split(MySpliterString);

In #1, the string's internal character array comes from your read-in
method (a BufferedReader, perhaps?). This internal character array is always
larger than your string's length.

In #2, the split eventually calls String.substring() (if you follow the
calls from String.split(...) -> Pattern.split(...)), which returns a
substring using the same internal character array as the original. It does
this via a special private String constructor, "for speed", that does not
trim away any space.

Long and the short?

1. always new String(input string from BufferedReader and the like)
2. OR always wrap each element of mystring.split(regex) in new String(...), as sketched below.
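
A sketch of guideline #2, using the names from your own code:

String[] tokens = oneLine.split(MySpliterString);
for (int i = 0; i < tokens.length; i++)
{
    tokens[i] = new String(tokens[i]); // copy each token out of the shared buffer
}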


Read through the following, as this was best described by Jon Skeet a long
time ago:

http://groups.google.com/[email protected]

<QUOTE from Jon Skeet, 13-Sep-2001>
Consider:

StringBuffer largeBuffer = new StringBuffer (100000);
largeBuffer.append ("small text");

String largeString = largeBuffer.toString();
map.put (largeString, value); // Whoops, 200K of memory gone

// This copies the contents rather than sharing the buffer,
// so we end up with a much smaller memory footprint
String smallString = new String (largeString);
map.put (smallString, value);

It's the only worthwhile use of the String constructor which takes a
String that I've seen. It's *very* useful when reading large numbers of
strings in from files using a BufferedReader, as the BufferedReader
starts off with an 80 character buffer each time. In the case of reading
a dictionary, you could be wasting vast amounts of memory in unfilled
buffers.
</QUOTE>

Now *note*: in JDK 1.5, StringBuffer.toString() has changed and now actually
performs a "new String(...)" with the actual length used, which makes his
particular StringBuffer example obsolete.

But his points are 100% valid for what you're doing (regex splits calling
substring internally).
 

Thomas G. Marshall

Thomas G. Marshall coughed up:


....[rip]...
Long and the short?

1. always new String(input string from BufferedReader and the like)
2. OR always wrap each element of mystring.split(regex) in new String(...)

Note that these are general guidelines. In your case you will not want to
bother trimming the input string, but you should *always* trim the split
result (#2 above).


....[rip]...
 

John McGrath

for (int i = 0; i < tokens.length; i++)
{
    myInt count = (myInt) ht.get(tokens[index]);

Is this the correct code? You use "index" to index the "tokens" array,
but you never show a declaration of that variable, and the loop variable
is "i".

ht.put(new String(tokens[index]), new KDOneInt(1)); // use little memory.
// ht.put(tokens[index], new KDOneInt(1)); // use A LOT OF memory.

You are using String.split() to produce the tokens array, so the tokens
will be substrings of the String passed to String.split(). That means
that string will not be garbage collected, since the hash table has
references to it. When you use "new String(tokens[index])", you are
creating a new String. As a result, you are not storing references to the
untokenized string, and it can be garbage collected.

The memory performance of the above code will depend on how the strings
are being read in. If the resulting strings were compact, then you would
not see any significant difference in performance. But if you are reading
strings into a large buffer of some kind, and then using that buffer, then
all of the buffers will hang around, since you still have references to
substrings created from it.
 

Kevin

I really appreciate the answers from all of you: Kevin McMurtrie,
Thomas G. Marshall, John McGrath, etc. Thanks a lot!

Several more points to make the whole thing clear:

Yes, the code has one or two typos, because I typed it in rather than
copying it from the program. But the meaning should still be clear.

The text lines are read from a BufferedReader which reads from a gz file,
like:

BufferedReader gzipReader = new BufferedReader(
    new InputStreamReader(
        new GZIPInputStream(new FileInputStream(myFileName))));
//////////
I get the lines of the file by calling:
String currentLine = new String(gzipReader.readLine());
and split each one like:
String[] ret = currentLine.split(splitter);
and then use each item in the String[] ret.
///////////

I still don't feel too confident, because even after that change, when I
run it on large files it still uses more memory than I thought. Here are
the details:

For the data, I need to keep this information:

1) All the line numbers (Integer) for all the lines. I save them into 10
ArrayLists; each item of the lists is an Integer. (Distributing the line
numbers across 10 ArrayLists is because of some data-related issues.)
2) The count hashtable, which hashes String to myInt (a tiny class that
wraps one int).

For the data size I just tested, there are about 100K different words
that need to be hashed (so the .size() of the hashtable is 100K).
And there are 10 million lines in total (so the 10 ArrayLists' sizes add
up to 10 million).

Before I noticed the memory issue, all the above data used about
450M of memory.
After I changed the hashtable.put to use new String(), it uses 250M.

Does 250M sound OK for the data?
I estimate the average length of a token (word) is about 6 chars
(ranging from 1 char to 12 chars). Its count is of course 1 int (4
bytes). So the "raw" data would use 10M*4 + 100K*(6+4) = 41M. Adding
the overhead of the Hashtable, the arrays, the Integers (in the
ArrayLists), and the myInt class, is 250M reasonable? Or should I work
harder to bring the memory down? (Actually, the total data I need to
scan is at least 20X larger than this. Of course, one option is to
break the data up so each run only works on part of it, but let's not
go that way for now.)

Any comments?
Thanks guys.
 

Jesper Nordenberg

Kevin said:
String oneLine = read in one text line.
String[] tokens = oneLine.split(MySpliterString);

I think String.split() does not copy the characters from the split
string; instead it creates substrings that reference into the original
string. As long as those substrings can be reached, the entire original
string remains in memory.

ht.put(new String(tokens[index]), new KDOneInt(1)); // use little memory.

This will copy the characters to a new string, potentially releasing
the original string when 'tokens' goes out of scope.

// ht.put(tokens[index], new KDOneInt(1)); // use A LOT OF memory.

This line puts the substring itself into the hash map, causing the
entire original string never to be released by the GC. Each text line
on which a new word was found will remain in memory when this code is
used. That's probably the source of your memory problem.

/Jesper Nordenberg
 

Jesper Nordenberg

Kevin said:
I still don't feel too confident, because even after that change, when I
run it on large files it still uses more memory than I thought. Here are
the details:

For the data, I need to keep this information:

1) All the line numbers (Integer) for all the lines. I save them into 10
ArrayLists; each item of the lists is an Integer. (Distributing the line
numbers across 10 ArrayLists is because of some data-related issues.)
2) The count hashtable, which hashes String to myInt (a tiny class that
wraps one int).

For the data size I just tested, there are about 100K different words
that need to be hashed (so the .size() of the hashtable is 100K).
And there are 10 million lines in total (so the 10 ArrayLists' sizes add
up to 10 million).

Before I noticed the memory issue, all the above data used about
450M of memory.
After I changed the hashtable.put to use new String(), it uses 250M.

Does 250M sound OK for the data?
I estimate the average length of a token (word) is about 6 chars
(ranging from 1 char to 12 chars). Its count is of course 1 int (4
bytes). So the "raw" data would use 10M*4 + 100K*(6+4) = 41M. Adding
the overhead of the Hashtable, the arrays, the Integers (in the
ArrayLists), and the myInt class, is 250M reasonable? Or should I work
harder to bring the memory down? (Actually, the total data I need to
scan is at least 20X larger than this. Of course, one option is to
break the data up so each run only works on part of it, but let's not
go that way for now.)

A quick calculation of the space required for the word strings:

100k * (6 (char array) + 4 (char array length) + 12 (fields in the String
class) + 4 (reference to the String object)) = 2.6M

which is negligible in comparison. So, to save space, you need to
optimize the ArrayLists of Integers. Each element in an ArrayList will
consume at least 4 (int value) + 4 (reference to the Integer object) = 8
bytes, totalling 80M. Then the ArrayList will have some overhead due
to unused capacity. Using an ArrayList that supports ints will
eliminate the Integer object reference, saving 4 bytes per element. Try,
for example, http://trove4j.sourceforge.net. It also contains an Object
-> int hash map that you can use.
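
If you'd rather not pull in a library, a bare-bones sketch of such a
primitive-int list (no bounds checking, growth factor of 2 for simplicity):

final class IntArrayList
{
    private int[] data = new int[16];
    private int size = 0;

    void add(int value)
    {
        if (size == data.length)
        {
            int[] bigger = new int[data.length * 2];
            System.arraycopy(data, 0, bigger, 0, size);
            data = bigger;
        }
        data[size++] = value;
    }

    int get(int index) { return data[index]; }
    int size() { return size; }
}

Each element then costs exactly 4 bytes plus amortized slack, with no
per-element object.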

/Jesper Nordenberg
 

Thomas G. Marshall

Jesper Nordenberg coughed up:
A quick calculation of the space required for the word strings:

100k * (6 (char array) + 4 (char array length) + 12 (fields in the String
class) + 4 (reference to the String object)) = 2.6M

which is negligible in comparison. So, to save space, you need to
optimize the ArrayLists of Integers. Each element in an ArrayList will
consume at least 4 (int value) + 4 (reference to the Integer object)

Is this reference part true? I don't remember where I read it (which makes
this suspect from the start ;) ), and I cannot find it quickly in the JLS, but
I thought that Java "pointers" were only shown (in Object.toString()) as the
bottom 32 bits of what is really a 64-bit number. Upward scalability, I
guess. Does anyone know for sure?
 

Thomas G. Marshall

Jesper Nordenberg coughed up:
Kevin said:
String oneLine = read in one text line.
String[] tokens = oneLine.split(MySpliterString);

I think String.split() does not copy the characters from the split
string; instead it creates substrings that reference into the original
string. As long as those substrings can be reached, the entire original
string remains in memory.

Right, but that doesn't hurt much unless the substrings in fact keep the
same full-size internal array as the original, and they do.

ht.put(new String(tokens[index]), new KDOneInt(1)); // use little memory.

This will copy the characters to a new string, potentially releasing
the original string when 'tokens' goes out of scope.

// ht.put(tokens[index], new KDOneInt(1)); // use A LOT OF memory.

This line puts the substring itself into the hash map, causing the
entire original string never to be released by the GC. Each text line
on which a new word was found will remain in memory when this code is
used. That's probably the source of your memory problem.

/Jesper Nordenberg
 

John McGrath

I estimate the average length of a token (word) is about 6 chars

Note that characters in Java have a range of '\u0000' to '\uFFFF', so you
should be counting 2 bytes each.
Any comments?

Any analysis of the memory requirements for a Java program will depend on
the JVM being used. The Java Language Specification and the Java Virtual
Machine Specification do not specify how objects will be stored
internally. Instead, they specify behavior.

Still, you can make some reasonable guesses. For example, since ints have
a range of -2,147,483,648 to 2,147,483,647, you can be certain that they
will take at least 32 bits, or 4 bytes.

Determining the size of an Object is quite a bit trickier, since they need
to support object method calls and field references, object introspection,
garbage collection, finalization, object IDs, object monitors, etc. It is
up to the JVM to determine how these features are implemented and exactly
what storage is required to support these features.

There was an interesting article on the subject in JavaWorld a while back.
The author did an empirical study of the memory required to allocate objects:

http://www.javaworld.com/javaworld/javatips/jw-javatip130.html

He found that, in JDK 1.3.1, the baseline memory requirement for an Object
was 8 bytes. For Strings, the requirement was 40 bytes for a String of up
to 3 characters, and then 8 bytes for each additional 4 characters.
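
The gist of that empirical technique, roughly (gc() is only a hint, so
treat the result as an estimate):

Runtime rt = Runtime.getRuntime();
rt.gc();
long before = rt.totalMemory() - rt.freeMemory();

Object[] hold = new Object[100000]; // keep the objects reachable
for (int i = 0; i < hold.length; i++)
{
    hold[i] = new Object(); // or new String(...), etc.
}

rt.gc();
long after = rt.totalMemory() - rt.freeMemory();
System.out.println((after - before) / hold.length + " bytes per object, approx.");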
 

Chris Uppal

Thomas said:
Is this reference part true? I don't remember where I read it (which
makes this suspect from the start ;) ), and I cannot find it quickly in the
JLS, but I thought that Java "pointers" were only shown (in
Object.toString()) as the bottom 32 bits of what is really a 64-bit
number. Upward scalability, I guess. Does anyone know for sure?

The size of a reference is implementation-defined. On a 32-bit JVM it's almost
certainly 32 bits. On a 64-bit JVM it'll almost certainly be 64 bits.
There is no need for future-proofing since the layout of memory inside a JVM is
strictly private (and -- short of a C- or assembler-level debugger --
inaccessible).

Incidentally, I think Jesper's estimate of how much space is wasted by an
ArrayList of Integers is on the low side.

Consider an array of 1000000 ints. The chances are that this will take 4 *
1000000 bytes for the actual data, plus an undefined amount for the 'object
header' of the array. That will (on a 32-bit Sun JVM) probably be a further
12 bytes, making a total of 4000012 bytes.

Now consider an ArrayList holding the 'same' data. That will require the
ArrayList object itself, an Object[] array to hold the references to the
million instances of Integer, plus those Integers themselves. The ArrayList
object itself will be of negligible size (in this context). Let's say that it
holds a reference to the Object[] array, the int index of the next element of
the array to be filled in, and an int modification count. On a 32-bit JVM that
probably totals 12 bytes. Add in the object header, say another 8 bytes (on a
current 32-bit Sun JVM), making 20 bytes. The Object[] array itself (assuming
that the ArrayList is exactly 'full' so there's no wasted space) will take
4000012 bytes, just the same as the int[] array above, totalling 4000032 bytes.
Then there's the space for a million Integer instances. Each of these will
probably have 12 bytes of object header, plus 4 bytes to store the actual int
value, making 16 bytes. So that's 16000000 bytes for the Integers. Overall
total: 20000032 bytes.

I.e. around 16 MBytes that are simply /wasted/ for each million int values.

In fact the wastage may be much higher.

For one thing, the ArrayLists will likely /not/ be exactly full. If they
average 3/4 full, then that's another 1000000 bytes to add (on average),
taking the wasted space up to around 17 MBytes. This estimate does depend on
the implementation of ArrayList. In fact the Sun implementation in JDK1.5.0
increases the size of the underlying array in steps of 1.5, rather than a
(reasonable) step size of 2.0 as I'm assuming here (mainly to simplify the
arithmetic). With an increment of 1.5, the average fullness will be 5/6 rather
than 3/4, so multiply my estimates of slop by 2/3.

For another thing, not all JVM implementations manage to squeeze the object
header down to 8 bytes. Even on a 32 bit machine 12 bytes of object header is
not uncommon.

Lastly, if this code is running on a 64-bit JVM, then object references will
need to be at least 64-bits. That, presumably, also includes the 'hidden'
reference to the object's class (or vtable) in the object header, which will be
12 bytes rather than 8 (or 16 rather than 12 for an array). Doing all the sums
again for a 64-bit JVM gives 4000016 for an int[] array, vs. 28000044 bytes.
Giving a total wastage (per million int values) of around 28Mbytes. Adding in
the, on average, 2000000 bytes 'slop' for the ArrayList takes this up to 30
MBytes.

So, if the OP wants to store 10 million int values, using ArrayLists will
waste anything from 160 MBytes to 300 Mbytes.

And that's ignoring the possibility of fragmentation in "real" memory, not to
mention the effect that those 10 million unnecessary Integer instances will
have on GC...
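
To check these sums empirically, one can allocate each representation and
compare used heap (totalMemory() - freeMemory(), after a gc() hint); exact
numbers vary by JVM:

int[] raw = new int[1000000]; // ~4 MB of payload plus one object header

java.util.ArrayList<Integer> boxed = new java.util.ArrayList<Integer>(1000000);
for (int i = 0; i < 1000000; i++)
{
    boxed.add(new Integer(i)); // a reference slot plus an Integer object per element
}
// Expect roughly 4 MB for raw vs. around 20 MB for boxed on a 32-bit JVM.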

-- chris
 

Chris Uppal

Kevin said:
BufferedReader gzipReader = new BufferedReader(
    new InputStreamReader(
        new GZIPInputStream(new FileInputStream(myFileName))));

I don't think this relates to your memory problems, but the above is not doing
what it should. You should always put the buffering immediately around the raw
I/O, since the purpose of buffering is to reduce the number of individual calls
out to the operating system. (Although, in this case, the buffering that
GZIPInputStream does internally would do something to mitigate the effects of
using the unbuffered FileInputStream).

I think it should be (untested); the outer BufferedReader stays because
readLine() needs it:

BufferedReader gzipReader =
    new BufferedReader(
        new InputStreamReader(
            new GZIPInputStream(
                new BufferedInputStream(
                    new FileInputStream(myFileName)))));

-- chris
 
