String.substring, under the hood

Roedy Green

Here is how String.substring works:

    public String substring(int beginIndex, int endIndex) {
        if (beginIndex < 0) {
            throw new StringIndexOutOfBoundsException(beginIndex);
        }
        if (endIndex > count) {
            throw new StringIndexOutOfBoundsException(endIndex);
        }
        if (beginIndex > endIndex) {
            throw new StringIndexOutOfBoundsException(endIndex - beginIndex);
        }
        return ((beginIndex == 0) && (endIndex == count)) ? this :
            new String(offset + beginIndex, endIndex - beginIndex, value);
    }


Note that it now always creates a new string (unless the substring is
the string itself.) It used to create a view into the underlying
string.

So the efficiencies have changed. Substring no longer pins the
underlying big string. On the other hand, you will create many string
objects by using substring. So be careful with it. It is no longer
free in terms of ram to have many substrings of your big string.
 
Stefan Schulz

Roedy Green wrote:
[...]
Note that it now always creates a new string (unless the substring is
the string itself.) It used to create a view into the underlying
string.

So the efficiencies have changed. Substring no longer pins the
underlying big string. On the other hand, you will create many string
objects by using substring. So be careful with it. It is no longer
free in terms of ram to have many substrings of your big string.

While I consider such things implementation details of
java.lang.String, and therefore not really my concern, the new
behaviour is more in line with the "reasonable expectations" of most
programmers. If I create a new object, I expect its storage to be
allocated somewhere. Also, if I drop a reference to a very long string
but retain a tiny subsection, I expect to be able to drop all but the
small subsection.
 
Stefan Schulz

Also, upon looking at the code again, it still creates a "view" which
is backed by the same char array (same as before!)
 
Chris Uppal

Roedy said:
Note that it now always creates a new string (unless the substring is
the string itself.) It used to create a view into the underlying
string.

The substring created will share the underlying char[] array.

To the best of my memory that has always been the behavior.
Unfortunately, I don't have source from a JDK before 1.4.2 handy to
check.

-- chris
 
Chris Smith

Roedy said:
Note that it now always creates a new string (unless the substring is
the string itself.) It used to create a view into the underlying
string.

This seems to come up every once in a while.

    return ((beginIndex == 0) && (endIndex == count)) ? this :
        new String(offset + beginIndex, endIndex - beginIndex, value);

This is a call to a private constructor inside the String class, which
reuses the underlying char[]. It does not do the same thing as the
public String(String) constructor, which copies the underlying data. So
when people say that "new String" copies the underlying char[], you
should only apply that statement to the String(String) overloaded
constructor, and not to the String(int,int,char[]) private overload used
there.
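
To make the distinction concrete, here is a minimal sketch of the two kinds of constructor. SharedText is a hypothetical stand-in written for illustration, not the JDK source; only the field names value, offset and count and the (int, int, char[]) argument order are taken from the code quoted above.

```java
import java.util.Arrays;

// Hypothetical stand-in for the String internals discussed in this thread.
final class SharedText {
    private final char[] value;  // backing array, possibly shared with other instances
    private final int offset;    // index in value of this text's first character
    private final int count;     // number of characters in this text

    // Copying constructor from a String: gets its own fresh array.
    SharedText(String s) {
        this.value = s.toCharArray();
        this.offset = 0;
        this.count = value.length;
    }

    // Copying constructor, analogous to the public String(String):
    // only the characters actually in use are copied.
    SharedText(SharedText other) {
        this.value = Arrays.copyOfRange(other.value, other.offset,
                                        other.offset + other.count);
        this.offset = 0;
        this.count = value.length;
    }

    // Sharing constructor, analogous to the private String(int, int, char[]):
    // the array reference is stored as-is and nothing is copied.
    private SharedText(int offset, int count, char[] value) {
        this.value = value;
        this.offset = offset;
        this.count = count;
    }

    SharedText substring(int beginIndex, int endIndex) {
        if (beginIndex < 0 || endIndex > count || beginIndex > endIndex) {
            throw new StringIndexOutOfBoundsException();
        }
        return new SharedText(offset + beginIndex, endIndex - beginIndex, value);
    }

    boolean sharesArrayWith(SharedText other) {
        return value == other.value;
    }

    int backingLength() {
        return value.length;
    }

    @Override
    public String toString() {
        return new String(value, offset, count);
    }
}
```

In this sketch a substring reports the backing length of the whole original text, while a copy made through SharedText(SharedText) reports only its own length.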

--
www.designacourse.com
The Easiest Way To Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 
Thomas Hawtin

Stefan said:
Roedy Green wrote:
[...]
Note that it now always creates a new string (unless the substring is
the string itself.) It used to create a view into the underlying
string.

As pointed out in other postings, it doesn't copy. String uses the
rather confusing technique of rearranging arguments in order to give
constructors different semantics. A (package) private constructor does
not do the additional copy.

Pin refers to stopping an object from being moved by the garbage
collector. The new (sub)String (strongly) references the full character
array of the original String.

Stefan said:
While I consider such things implementation details of
java.lang.String, and therefore not really my concern, the new
behaviour is more in line with the "reasonable expectations" of most
programmers. If I create a new object, I expect its storage to be
allocated somewhere. Also, if I drop a reference to a very long string
but retain a tiny subsection, I expect to be able to drop all but the
small subsection.

Performance is externally visible behaviour. It is quite normal for
client code to take it into account.

Tom Hawtin
 
Stefan Schulz

Roedy said:
So the efficiencies have changed. Substring no longer pins the

Thomas said:
Pin refers to stopping an object from being moved by the garbage
collector. The new (sub)String (strongly) references the full character
array of the original String.

This is exactly what I said. I just wonder what the OP meant when they
complained about a new String being created... with Strings being
immutable, you need to create a new copy each time you modify it (for
example, by taking a substring). The backing character array is not
copied, though (which can lead to unexpectedly high memory costs for
small strings).

Thomas said:
Performance is externally visible behaviour. It is quite normal for
client code to take it into account.

That is correct, however the definition of the substring method does
not offer any guarantees about performance. It might be constant time
but possibly waste space (the current method), or it might take time
linear in the length of the substring, or anything else. The method
definition does not tell you one way or another, so you should not rely
on the behaviour. Maybe another JRE will do things completely the other
way around. The actual time needed depends on the implementation, and
without any specified behaviour it is not an external characteristic.
 
Owen Jacobson

Roedy said:
Here is how String.substring works:

....snip Sun's implementation...
    return ... new String(offset + beginIndex, endIndex - beginIndex, value);

Note that it now always creates a new string (unless the substring is
the string itself.) It used to create a view into the underlying
string.

So the efficiencies have changed. Substring no longer pins the
underlying big string. On the other hand, you will create many string
objects by using substring. So be careful with it. It is no longer
free in terms of ram to have many substrings of your big string.

Note which constructor this invokes: String (int, int, char[]). Sun's
implementation of substring, at least as of 1.5.05 and as far back as I've
been using Java, shares the char[] containing the String's characters
on calls to substring. It still pins the underlying char array, and is
still cheap both computationally and memory-wise if the originating
string's lifespan is at least as long as those of the substrings.
 
Owen Jacobson

On Sat, 14 Jan 2006 22:32:15 +0000, Owen Jacobson wrote:

....snip...

edit: ****, beaten.
 
Roedy Green

Here is how String.substring works:

Here is my latest understanding:


substring is clever. It does not make a deep copy of the substring the
way most languages do. It just creates a pointer into the original
immutable String, i.e. points to the value char[] of the base string,
and tracks the starting offset where the substring starts and count of
how long the substring is. This could be confusing if you were
low-level debugging since you would see the whole String, not just the
substring. There were reports of a bug in Microsoft's implementation
of substring. The downside of this cleverness is a tiny substring of a
giant base String could suppress garbage collection of that big String
in memory even if the whole String were no longer needed. (actually
its value char[] array is held in RAM; the String object itself could
be collected.)

It is probably still a good idea to use indexOf( lookFor, offset )
rather than creating a substring first and using indexOf( lookFor )
on that.
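
The two approaches can be compared in a small sketch. IndexOfDemo and its method names are made up for illustration; only String.indexOf and String.substring are real API. Note that the substring version returns an index relative to the substring, so it has to be shifted back by the offset.

```java
final class IndexOfDemo {
    // Searches from 'offset' onward without creating an intermediate String.
    static int findFrom(String text, String lookFor, int offset) {
        return text.indexOf(lookFor, offset);
    }

    // Same search via substring: creates an extra String object, and the
    // result must be converted back to an index into the original text.
    static int findViaSubstring(String text, String lookFor, int offset) {
        int relative = text.substring(offset).indexOf(lookFor);
        return relative < 0 ? -1 : relative + offset;
    }
}
```

For a valid offset both methods return the same index; the first simply avoids allocating the extra String object.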

If you know a tiny substring is holding a giant string in RAM, that
would otherwise be garbage collected, you can break the bond by using
littleString = new String( littleString ) which will create a new
smaller backing char[] with no ties to the original String.
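
As a sketch, the idiom looks like this. DetachDemo and detach are hypothetical names used only for illustration. Whether the copy really gets a smaller array of its own depends on the JDK's String implementation; the description above applies to the Sun JDKs discussed in this thread, and on an implementation where substring already copies, the extra copy is simply redundant.

```java
final class DetachDemo {
    // Returns a String with the same characters as littleString. On the
    // JDKs discussed in this thread, the public String(String) constructor
    // gives the result its own backing array sized to the content.
    static String detach(String littleString) {
        return new String(littleString);
    }
}
```

Typical use would be littleString = DetachDemo.detach( littleString ) right after taking the substring, so that the big String becomes eligible for garbage collection once nothing else refers to it.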

If you are a curious sort, and study the code for String.substring in
src.zip, this sharing logic might not be apparent. The key is a
non-public String constructor that takes its parameters in the reverse
of the usual order: String (int offset, int count, char value[]).
 
Alan Krueger

Thomas said:
Performance is externally visible behaviour. It is quite normal for
client code to take it into account.

It might be externally visible, but it may not be guaranteed by the
creator of the class. Relying on internal implementation details
violates encapsulation and may break if the internal implementation is
changed.
 
