Substring

  • Thread starter Dirk Bruere at NeoPax

Dirk Bruere at NeoPax

Mark Space

Dirk said:
I assume that "" is not the same as a null string?
What is it like in memory?

I think that is a null string. It's not the same as a null reference.

In memory? An object with one null field and two integers. The String
object basically looks like this:

class String {
char[] buff;
int begin;
int count;
}

So set buff to null and begin and count to 0;
 
Dirk Bruere at NeoPax

Mark said:
Dirk said:
I assume that "" is not the same as a null string?
What is it like in memory?

I think that is a null string. It's not the same as a null reference.

In memory? An object with one null field and two integers. The String
object basically looks like this:

class String {
char[] buff;
int begin;
int count;
}

So set buff to null and begin and count to 0;

Does that mean that

String str=null;
String str="";

are different?

--
Dirk

http://www.transcendence.me.uk/ - Transcendence UK
http://www.theconsensus.org/ - A UK political party
http://www.onetribe.me.uk/wordpress/?cat=5 - Our podcasts on weird stuff
 
Tom Anderson

I assume that "" is not the same as a null string? What is it like in
memory?

It's probably best to think of it as being like an array of length zero.
It's there, but it has nothing in it.

The actual layout in memory is more complex, because String objects are
more complex than arrays: a String is actually a character array, plus two
integers, one being the offset of the start of the string in that array,
and one being the count of characters in the string. For example, the
string "hello" could be represented by a string like this:

characters = {'h', 'e', 'l', 'l', 'o'}
offset = 0 //  ^ start
count = 5  //  ^    ^    ^    ^    ^ included

But it could equally be:

characters = {'w', 'h', 'y', ' ', 'h', 'e', 'l', 'l', 'o', ' ', 't', 'h', 'e', 'r', 'e'}
offset = 4 //                      ^ start
count = 5  //                      ^    ^    ^    ^    ^ included

The reason for this is that several String objects can share the same
underlying character array, which can save memory. This happens
when you start with one big string and then extract substrings from it
(and further substrings from the substrings). An example of when this
might happen is parsing - if you read a whole file into one big string,
then broke out individual lines as strings, then broke each line down into
words or whatever, you might have hundreds of string objects, but you
would only need one copy of the character array for all of them. There are
downsides to this design, but the designers of Java evidently felt that
the benefits outweighed the disadvantages.

Anyway, the upshot is that the empty string can have all sorts of possible
layouts in memory, depending on how it was created! The simplest possible
layout is this:

characters = {} // no characters
offset = 0
count = 0

But it could also be:

characters = {'w', 'h', 'y', ' ', 'h', 'e', 'l', 'l', 'o', ' ', 't', 'h', 'e', 'r', 'e'}
offset = 4 //                      ^ start
count = 0 // no characters included

The one common factor is that count will always be zero.
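
To make the layout above concrete, here is a minimal sketch of a class built the same way. SharedString and its field and method names are made up for illustration; this is not the real java.lang.String source, whose internals differ between JDK versions.

```java
// Hypothetical model of the offset/count layout described above.
// It is NOT the real java.lang.String implementation.
final class SharedString {
    private final char[] characters; // backing array, possibly shared with other instances
    private final int offset;        // index of this string's first character in the array
    private final int count;         // number of characters that belong to this string

    private SharedString(char[] characters, int offset, int count) {
        this.characters = characters;
        this.offset = offset;
        this.count = count;
    }

    static SharedString of(String s) {
        return new SharedString(s.toCharArray(), 0, s.length());
    }

    // Shares the backing array with the parent instead of copying it.
    SharedString substring(int begin, int end) {
        if (begin < 0 || end > count || begin > end) {
            throw new IndexOutOfBoundsException("begin=" + begin + ", end=" + end);
        }
        return new SharedString(characters, offset + begin, end - begin);
    }

    int length() { return count; }

    // Size of the array this instance keeps reachable.
    int backingLength() { return characters.length; }

    @Override
    public String toString() { return new String(characters, offset, count); }

    public static void main(String[] args) {
        SharedString whole = SharedString.of("why hello there");
        SharedString hello = whole.substring(4, 9);
        SharedString empty = whole.substring(4, 4);
        System.out.println(hello + " " + hello.length() + " " + hello.backingLength());       // hello 5 15
        System.out.println("[" + empty + "] " + empty.length() + " " + empty.backingLength()); // [] 0 15
    }
}
```

In this model the empty substring still points at the 15-character array; only its count is zero.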

tom
 
Arne Vajhøj

Dirk said:
Mark said:
Dirk said:
I assume that "" is not the same as a null string?
What is it like in memory?

I think that is a null string. It's not the same as a null reference.

In memory? An object with one null field and two integers. The String
object basically looks like this:

class String {
char[] buff;
int begin;
int count;
}

So set buff to null and begin and count to 0;

Does that mean that

String str=null;
String str="";

are different?

With most definitions of different: yes.

Arne
 
Mark Rafn

Dirk Bruere at NeoPax said:
I assume that "" is not the same as a null string?

Correct. "" is a zero-length string, often called an "empty string". null is
a special reference value indicating no string at all.

Some differences are as follows:
String s = "";
String n = null;
s.length() == 0; // true
n.length() == 0; // NullPointerException
"".equals(s); // true
"".equals(n); // false
String y = "test" + s; // y is now "test"
String z = "test" + n; // z is now "testnull" - concatenation converts null to "null".
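
For anyone who wants to run the comparison, a small self-contained sketch follows. The class name NullVersusEmpty and the describe() helper are invented for this example; everything else is standard java.lang.String behaviour.

```java
// Illustrative only: the class and helper names are made up for this example.
class NullVersusEmpty {
    // Distinguishes "no String at all" from "a String with no characters".
    static String describe(String value) {
        if (value == null) {
            return "null reference";
        }
        if (value.isEmpty()) {
            return "empty string";
        }
        return "string of length " + value.length();
    }

    public static void main(String[] args) {
        String s = "";
        String n = null;
        System.out.println(describe(s));   // empty string
        System.out.println(describe(n));   // null reference
        System.out.println("".equals(n));  // false
        System.out.println("test" + s);    // test
        System.out.println("test" + n);    // testnull
        try {
            n.length();                    // dereferencing null
        } catch (NullPointerException e) {
            System.out.println("NPE");     // NPE
        }
    }
}
```

Note that the null check has to come first in describe(); calling isEmpty() on a null reference would itself throw.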

What is it like in memory?

It's dark and warm.
 
Mike Schilling

Dirk said:
I assume that "" is not the same as a null string?

Correct.

What is it like in memory?

That's not a question you get to ask in Java. But the two differ in many
ways, e.g.

String n = null;
String e = "";

e.length() is 0
n.length() throws a NullPointerException
 
Mark Space

Dirk said:
Does that mean that

String str=null;
String str="";

are different?


Very different. Any object can have a null reference:

Object o = null;
JButton b = null;

Only Strings can represent a null string; a null string is a fully
constructed object:

String s = ""; // This is an object of type String

All string methods still work fine on this object:

s.length();
s.substring(0,0);

etc. If you have the former (a null reference) you'll get a
NullPointerException:

String s2 = null;
s2.length(); // NPE!
s2.substring(0,0); // NPE!
 
Mark Space

Tom said:
But it could also be:

characters = {'w', 'h', 'y', ' ', 'h', 'e', 'l', 'l', 'o', ' ', 't', 'h', 'e', 'r', 'e'}
offset = 4 //                      ^ start
count = 0 // no characters included

The one common factor is that count will always be zero.


I just checked the source and this is indeed what it does. Which is
unfortunate because the null string will be holding on to a character
array it doesn't need.

The only way I can see off hand to change this is to use the String(
String ) constructor. That's the only String method that I see that
trims its output.

String test = "This is a test.";
String sub = test.substring( 3, 3 );

Will initialize "sub" with a 15 character array, even though it doesn't
need it.

String sub = new String( test.substring( 3, 3 ) );

Will initialize "sub" with a new zero length char array and offset and
length of 0.
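
A sketch of what can be observed through the public API. Whether substring() shares the parent's array cannot be seen from outside the class and depends on the JDK; as far as I know, JDKs from Java 7 update 6 onward copy the characters instead of sharing. The class name EmptySubstring and the detachedCopy() helper are invented for this example.

```java
// Illustrative only: shows the observable behaviour, not the internal array layout.
class EmptySubstring {
    // Hypothetical helper wrapping the String(String) constructor discussed above.
    static String detachedCopy(String s) {
        return new String(s);
    }

    public static void main(String[] args) {
        String test = "This is a test.";
        String sub = test.substring(3, 3);
        String copy = detachedCopy(sub);

        System.out.println(test.length());    // 15
        System.out.println(sub.isEmpty());    // true
        System.out.println(sub.equals(copy)); // true
        System.out.println(copy.equals(""));  // true
        System.out.println(copy == sub);      // false: the constructor always makes a new object
    }
}
```

Both values compare equal to ""; the only visible difference is object identity.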
 
Dirk Bruere at NeoPax

Lew

Mark said:
I just checked the source and this is indeed what it does. Which is
unfortunate because the null string will be holding on to a character
array it doesn't need.

The nil string ("null string" is just too confusing) would only "hold on" to
the character array because it was used by some other String expression, so it
really doesn't seem all that very unfortunate to me.

The only way I can see off hand to change this is to use the String(
String ) constructor. That's the only String method that I see that
trims its output.

And creates a whole nother object to do so.

String test = "This is a test.";
String sub = test.substring( 3, 3 );

Will initialize "sub" with a 15 character array, even though it doesn't
need it.

But it's the exact same array as the one pointed to by 'test', so it's not
doing any harm, and it's apparently faster in that it saves allocation of a
new array.

Let's see, no harm, some benefit. I don't see a problem.

String sub = new String( test.substring( 3, 3 ) );

Will initialize "sub" with a new zero length char array and offset and
length of 0.

Thus sacrificing the speed improvement of the other way.
 
Lew

Dirk said:
Lew said:
This follows directly from the Javadocs for the method,
<http://java.sun.com/javase/6/docs/api/java/lang/String.html#substring(int, int)>

as a moment's investigation reveals.

I asked the Q [sic] because it was *not* obvious to me, and I read it before
posting here.

They have magical-seeming powers in the Q dimension, but being fictional, they
are unlikely to answer you.

Let me show the derivation from the Javadocs for String#substring( int, int ).
This will help you to make better use of Javadocs going forward, and to
unlock their power.

The Javadocs tell us:
Returns a new string that is a substring of this string.
The substring begins at the specified beginIndex and
extends to the character at index endIndex - 1.
Thus the length of the substring is endIndex-beginIndex.

Let's say we're interested in
"Q dimension".substring( 2, 2 )

It begins at 'beginIndex' 2, which is the "d" of "dimension", and ends just
prior to the same location, at (endIndex - 1) == 2 - 1 == 1, which is just
before the "d". (endIndex-beginIndex) == 0, so the length is 0.

I visualize the indexes as being the infinitesimal gap between successive
characters, so it starts just before the "d" of "dimension" and ends in the
same place, covering zero characters. This matches the length of 0.

So the result is a String beginning just prior to the "d" of dimension but
just after the " " after "Q", containing exactly zero characters and ending
just where it began.

"Q dimension".substring( 2, 3 )

would begin just before index 2, the "d", and end just before index 3, just
before the first "i", thus including the character at (endIndex - 1) == 2,
with length one. That would be the String "d".

"Q dimension".substring( 3, 2 )

would throw 'IndexOutOfBoundsException' because "beginIndex is larger than
endIndex".
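
The three cases above, plus the boundary at the end of the string, can be checked with a short program. This is only a sketch; the class name SubstringBounds and the trySubstring() helper are made up for the example.

```java
// Illustrative only: the helper reports either the quoted result or the exception type.
class SubstringBounds {
    static String trySubstring(String s, int begin, int end) {
        try {
            return "\"" + s.substring(begin, end) + "\"";
        } catch (IndexOutOfBoundsException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        String q = "Q dimension"; // length 11
        System.out.println(trySubstring(q, 2, 2));   // ""
        System.out.println(trySubstring(q, 2, 3));   // "d"
        System.out.println(trySubstring(q, 3, 2));   // StringIndexOutOfBoundsException
        System.out.println(trySubstring(q, 11, 11)); // "" (beginIndex == length() is allowed)
        System.out.println(trySubstring(q, 12, 12)); // StringIndexOutOfBoundsException
    }
}
```

StringIndexOutOfBoundsException is a subclass of the IndexOutOfBoundsException named in the Javadocs.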
 
Mark Space

Lew said:
But it's the exact same array as the one pointed to by 'test', so it's
not doing any harm, and it's apparently faster in that it saves
allocation of a new array.

Let's see, no harm, some benefit. I don't see a problem.


In this brief example, sure. But what if the "test" string is large and
acquired some other way besides a program constant? Let's say read in
as part of parsing a large text file, or downloaded from the network?

I don't think it takes too much imagination to see the behavior of
substring() in this case as a memory leak. It's just like the example
of nulling out a reference to "release memory" so that an object can be
garbage collected as soon as possible. The often mentioned example is a
stack implementation, but file buffers will probably be larger objects.

If a large memory buffer is held in memory because of a much smaller
String reference, couldn't that be seen as undesirable in some, if not
many, circumstances?

Obviously, one doesn't want to create new strings for no good reason.
Computer programmers often have to choose between fast algorithms which
use more memory (as substring() does) and slower algorithms which
conserve memory (as String(String) does). Selecting which one to use
can be a dark art, but one should be aware that they have a choice and
different optimizations are available (and I'm thinking primarily of the
OP here, who seems still a little unclear about basics like references
and how they work).
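
A rough illustration of that trade-off, using a toy model rather than the real String class. Slice, retainedChars() and compact() are hypothetical names; compact() plays the role that String(String) plays in the discussion above, and the numbers only describe this model, not any particular JDK.

```java
import java.util.Arrays;

// Hypothetical model of a substring that shares its parent's buffer.
// It is NOT java.lang.String; it only illustrates the retention trade-off.
final class Slice {
    private final char[] buffer;
    private final int offset;
    private final int count;

    Slice(char[] buffer, int offset, int count) {
        this.buffer = buffer;
        this.offset = offset;
        this.count = count;
    }

    // Number of chars this slice keeps reachable, however few it actually uses.
    int retainedChars() { return buffer.length; }

    // Slower path: copy just the characters in use so the big buffer can be collected.
    Slice compact() {
        return new Slice(Arrays.copyOfRange(buffer, offset, offset + count), 0, count);
    }

    @Override
    public String toString() { return new String(buffer, offset, count); }

    public static void main(String[] args) {
        char[] file = new char[1_000_000]; // stands in for a large file read into memory
        Arrays.fill(file, 'x');
        Slice word = new Slice(file, 10, 4); // fast: no copy, but pins the whole buffer
        Slice trimmed = word.compact();      // extra allocation, but pins only 4 chars
        System.out.println(word + " retains " + word.retainedChars());       // xxxx retains 1000000
        System.out.println(trimmed + " retains " + trimmed.retainedChars()); // xxxx retains 4
    }
}
```

Which path is better depends on how long the small piece outlives the large buffer, which is something to measure rather than guess.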
 
Dirk Bruere at NeoPax

Lew said:
Dirk said:
Lew said:
Dirk Bruere at NeoPax wrote:
In String class
public String substring(int beginIndex,
int endIndex)

What happens if beginIndex==endIndex ?

Eric Sosman wrote:
If beginIndex is valid, you get a String whose length is
zero, which you would write as "" in Java source code. If
beginIndex is invalid you get IndexOutOfBoundsException.

This follows directly from the Javadocs for the method,
<http://java.sun.com/javase/6/docs/api/java/lang/String.html#substring(int, int)>

as a moment's investigation reveals.

I asked the Q [sic] because it was *not* obvious to me, and I read it
before posting here.

They have magical-seeming powers in the Q dimension, but being
fictional, they are unlikely to answer you.

Let me show the derivation from the Javadocs for String#substring( int,
int ). This will help you to make better use of Javadocs going forward,
and to unlock their power.

The Javadocs tell us:
Returns a new string that is a substring of this string. The substring
begins at the specified beginIndex and extends to the character at
index endIndex - 1. Thus the length of the substring is
endIndex-beginIndex.

Let's say we're interested in
"Q dimension".substring( 2, 2 )

It begins at 'beginIndex' 2, which is the "d" of "dimension", and ends
just prior to the same location, at (endIndex - 1) == 2 - 1 == 1, which
is just before the "d". (endIndex-beginIndex) == 0, so the length is 0.

I visualize the indexes as being the infinitesimal gap between
successive characters, so it starts just before the "d" of "dimension"
and ends in the same place, covering zero characters. This matches the
length of 0.

So the result is a String beginning just prior to the "d" of dimension
but just after the " " after "Q", containing exactly zero characters and
ending just where it began.

"Q dimension".substring( 2, 3 )

would begin just before index 2, the "d", and end just before index 3,
just before the first "i", thus including the character at (endIndex -
1) == 2, with length one. That would be the String "d".

"Q dimension".substring( 3, 2 )

would throw 'IndexOutOfBoundsException' because "beginIndex is larger
than endIndex".

Yes.
BTW, Q=question
BTW = By the way

Which led me to the Q... what does "" mean, i.e. is it null or not?
Hence the thread.
Hope that clarifies things.

--
Dirk
 
Dirk Bruere at NeoPax

Peter said:
No disrespect intended to the OP, but a person who is still at this
phase in their learning is a long way off from caring about this level
of performance optimization, even in the situations where it might be
applicable.

Pete

True - I've 30 GFLOPS of performance to run a simple text comms protocol
interface. Performance is not an issue.

--
Dirk
 
Mark Space

Peter said:
There may well be scenarios in which one starts with a large string, and
then retains just a single tiny subset of that string, but I doubt they
are all that common. And even there, in most cases the retention of the
original string isn't going to be a problem.

I'm going to go out on a limb here and say if at least 80% of a large
string isn't retained by smaller strings in this scenario, most people
would call that a memory leak.

I've never actually run into this problem, but then I don't do a lot of
heavy string parsing either. It would be interesting to look at a
survey of production programs with a lot of parsing to see what sort of
memory characteristics they exhibit.

In the scenario where it _is_ a problem, someone will have determined
that through proper measurement, and then can state unequivocally that a
work-around is needed. But, as you yourself pointed out, Java does in
fact offer a reasonable work-around: instantiate a new String instance
passing the substring as the constructor.


I think you're exaggerating a bit here to try to emphasize your point,
but of course it's true that all programs should be modeled in a
profiler before optimizing. One could just as easily profile a program
and determine that an extra String(String) call was causing a problem
and needed to be removed.

All I'm really trying to say here (and my original point) is it pays to
be aware of how libraries are actually implemented, and what sort of
alternatives to the obvious (naive?) method calls exist.
 
Roedy Green

What happens if beginIndex==endIndex ?

To quote the Red Queen, "shall we try the experiment?"

--
Roedy Green Canadian Mind Products
http://mindprod.com

"It wasn’t the Exxon Valdez captain’s driving that caused the Alaskan oil spill. It was yours."
~ Greenpeace advertisement New York Times 1990-02-25
 
