Improving String.equals() implementation

  • Thread starter softwarepearls_com
  • Start date
S

softwarepearls_com

In Sun's implementation, String's equals() override starts as follows:

public boolean equals(Object anObject) {

if (this == anObject) {
return true;
}

... I really wonder whether that first test really helps to improve
String's performance? I would guess that the ratio of non-this Strings
to "this" Strings would, in the majority of applications, be heavily
in favour of non-this Strings. If that guess is correct, then we're
being burdened with a test which slows down a critical routine.

Maybe this is a remnant of the very early days when javac was a key
component of the JDK, and the compiler's reliance on interning Strings
may have suggested the above optimization?
 
L

Lew

In Sun's implementation, String's equals() override starts as follows:

    public boolean equals(Object anObject) {

        if (this == anObject) {
            return true;
        }

.. I really wonder whether that first test really helps to improve
String's performance? I would guess that the ratio of non-this Strings
to "this" Strings would, in the majority of applications, be heavily
in favour of non-this Strings. If that guess is correct, then we're
being burdened with a test which slows down a critical routine.

Maybe this is a remnant of the very early days when javac was a key
component of the JDK, and the compiler's reliance on interning Strings
may have suggested the above optimization?

Or maybe it can really help. A 'String' will contain an arbitrary
number, possibly rather large, of characters that must be compared one-
by-one for value equality. The quick object-equality test turns the
number of worst-case comparisons from "length" to "length + 1", a
small overhead against potentialhy turning "length" comparisons into
only one. Weigh that against how common it is for 'String's to be
interned or for different references to point to the same 'String'.
In the general case, one can understand that 'String's under
comparison are likely to be related somehow in the first place,
possibly increasing the odds of the same reference being compared to
itself.

I can easiliy imagine that this optimization is relevant and effective
in enough cases to warrant it.
 
J

Joshua Cranmer

softwarepearls_com said:
In Sun's implementation, String's equals() override starts as follows:

public boolean equals(Object anObject) {

if (this == anObject) {
return true;
}

.. I really wonder whether that first test really helps to improve
String's performance? I would guess that the ratio of non-this Strings
to "this" Strings would, in the majority of applications, be heavily
in favour of non-this Strings. If that guess is correct, then we're
being burdened with a test which slows down a critical routine.

This test is about as expensive as saying if (a == b). Let's assume that
equals has gotten to the point where it's a critical factor. In that
case, the method would already be JIT compiled. The expense of
pointer-based equality should be probably equivalent to a test + jz
instruction on x86 machines. That's only a few machine cycles.

What's the rest of the method? First, there is an instanceof query,
which is more complex than a simple test extension, followed by a size
equality check, followed by a loop doing comparison on each member of
the string. If the equality test fails, it's essentially the same cost
as comparing two strings with an extra character.

In fact, this sort of equality is often used in equals tests; when
Joshua Bloch gives an example of equals() methods, one of the parts is
the |if (this == o)| test.

If String::equals IS your bottleneck, then you might want to look into
interning your strings. That's why String::intern is there.

In short: it's a cheap test that allows one to bypass a very expensive test.
 
S

softwarepearls_com

     I would guess that you have not made any measurements to support
or refute your guess ...  Hint, hint.

Correct. I used the word "guess" purposefully. Depending on where this
thread leads us, I may invest in doing the experiment and properly
profiling. If Terracotta and IBM can deviate from Sun's String
implementation, there can't be anything stopping me from experimenting
too :)
     There seem to be a couple misunderstandings here.  First, javac's
importance to the JDK was by no means an artifact of "the very early
days" (if you don't believe me, search your disk for javac or javac.exe,
delete it, and see what happens).

Sounds like an interesting experiment to do. But as an Eclipse user, I
haven't had javac waste my time for a number of years (even in the
early days I switched to jikes pretty early). I'd be surprised if
javac was actually needed by any JRE APIs..
 Second, the compiler doesn't intern
the Strings it finds in your code; the JVM interns them when your class
gets initialized.

Sure, but I imagined that javac would also use String.intern to try
and squeeze maximum performance out of its parsing approach. Again,
this is just guessing. I've never felt the need to look at the
compiler's insides.
 
J

John W Kennedy

Lew said:
Or maybe it can really help. A 'String' will contain an arbitrary
number, possibly rather large, of characters that must be compared one-
by-one for value equality. The quick object-equality test turns the
number of worst-case comparisons from "length" to "length + 1", a
small overhead against potentialhy turning "length" comparisons into
only one. Weigh that against how common it is for 'String's to be
interned or for different references to point to the same 'String'.
In the general case, one can understand that 'String's under
comparison are likely to be related somehow in the first place,
possibly increasing the odds of the same reference being compared to
itself.

I can easiliy imagine that this optimization is relevant and effective
in enough cases to warrant it.

And don't forget the null check, class check, and length check that all
have to come before the data check.
 
M

Mark Space

John said:
And don't forget the null check, class check, and length check that all
have to come before the data check.

Why check for null? Throw a NPE....
 
L

Lew

softwarepearls_com said:
Sounds like an interesting experiment to do. But as an Eclipse user, I
haven't had javac waste my time for a number of years (even in the
early days I switched to jikes pretty early).

How does being an Eclipse user allow one not to use javac? I've used
Eclipse, and it never let me off the hook for using javac. How else
would it produce bytecode?
I'd be surprised if javac was actually needed by any JRE APIs..

It never was, not even in 1.1 days. But that isn't what we are
discussing. We are discussing whether the JDK needs javac.

Futhermore, in Java 6 we now have Java classes that implement the
compiler, built in to the API for any application to use. NetBeans,
and possibly Eclipse, already leverages this so that its internal
syntax checking now matches that of javac.
 
S

softwarepearls_com

How does being an Eclipse user allow one not to use javac?  I've used
Eclipse, and it never let me off the hook for using javac.  How else
would it produce bytecode?

Eclipse uses its own compiler technology. It doesn't use javac under
the hood.
 
M

Mark Space

Eric said:
Violates the equals() contract, that's why: thing.equals(null)
should yield false and throw no exception.

Ah, ok. That's what I get for posting before checking the docs....
 
L

Lew

Eclipse uses its own compiler technology. It doesn't use javac under
the hood.

Are you saying that Eclipse does not intern its Strings?

javac, be it from Eclipse or Oracle or Sun or IBM, is still javac.
Something has to compile the code. Your claim was that javac was
somehow at fault for String interning:
Maybe this is a remnant of the very early days when javac was a key
component of the JDK, and the compiler's reliance on interning Strings
may have suggested the above optimization?

In that context, I refer to any brand of compiler as "javac". Even if
you don't, the fact remains that interning of Strings is just as
necessary to Eclipse's compilation as it is to Sun's, and that that
interning occurs in the JRE anyway, not in the compiler.

And I'm still puzzled about Eclipse's compilation. When I set up
Eclipse, I have to tell it where my JDKs are, and I can have it use
any one installed on my system. I routinely have it use Sun's at my
work site. Are you saying that Eclipse still doesn't use the JDK's
compiler in that circumstance?

If so, yet another reason to detest Eclipse.
 
J

John W Kennedy

Eric said:
Violates the equals() contract, that's why: thing.equals(null)
should yield false and throw no exception.

Still, there's no need to check for null if you're making an
instanceof test anyhow: `null instanceof AnyClass' is false.

But that still entails a null test when you get down to the machine
code, unless, perhaps, you implement null internally as a reference to a
singleton object of class (let us call it) $Null -- which introduces
some weird twists and turns of its own ($Null cannot be a subclass of
Object, for one). Does any JVM in fact do that?
 
J

John W Kennedy

Lew said:
Are you saying that Eclipse does not intern its Strings?

javac, be it from Eclipse or Oracle or Sun or IBM, is still javac.
Something has to compile the code. Your claim was that javac was
somehow at fault for String interning:


In that context, I refer to any brand of compiler as "javac". Even if
you don't, the fact remains that interning of Strings is just as
necessary to Eclipse's compilation as it is to Sun's, and that that
interning occurs in the JRE anyway, not in the compiler.

And I'm still puzzled about Eclipse's compilation. When I set up
Eclipse, I have to tell it where my JDKs are, and I can have it use
any one installed on my system. I routinely have it use Sun's at my
work site. Are you saying that Eclipse still doesn't use the JDK's
compiler in that circumstance?

It doesn't use javac because it uses an incremental compiler. Since
modern practice is /not/ to do significant optimization at the
Java-to-bytecode level, and since the Eclipse Java editor requires
keeping the source completely parsed at all times anyway, along with all
the signature information of other classes (and their JavaDocs, for that
matter), yanking in javac to do the same processing over again is a
waste of time. In Eclipse, every time you save your source file, you've
recompiled.

(This is why it took so long for Eclipse to support Java 5.0 -- the
significant changes in the Java language.)
 
J

J. Stoever

Lew said:
And I'm still puzzled about Eclipse's compilation. When I set up
Eclipse, I have to tell it where my JDKs are

You must be using some buggy broken OS then. When normal people set up
Eclipse on a working, professional OS, it works just find without me
having to tell it where anything is.

SCNR
 
R

RedGrittyBrick

Lew said:
See? A reason to hate Eclipse.

Not for me.
Believe it or not, there are occasionally important differences between
different versions of Java, and one needs to be able to simply plug in
the correct one for a given project. For Eclipse not to support a
certain version of Java because it feels it must use its own,

AFAIK Sun's javac does not support incremental compilation. At least not
in the way that Eclipse does.

perhaps not as reliable,

Where did that conjecture come from? Perhaps more reliable!

version of a compiler is at best confusing, and at
worst destructive to a project that needs a particular version.

As you know, you can set the language version in Eclipse.

I don't *wnat* the IDE to subvert the job of the compiler. I want it to
use the compiler that I want it to use, with the Ant build that I specify.

As you know, Eclipse supports all that.
And it remains that using Eclipse's idiosyncratic compiler does not

Unless you consider incremental compilation an idiosyncracy, I don't
know what you mean. Please elucidate.

relieve the JRE of responsiblity for interning Strings, nor in any way
makes intern() the compiler's issue as was averred upthread. Whatever
compiler one uses, a real one or Eclipse's.

In what way is Eclipse's compiler not real? It produces .class files
that execute correctly under Sun's JVM - that seems real enough to me.
 
S

softwarepearls_com

     I would guess that you have not made any measurements to support
or refute your guess ...  Hint, hint.

OK, done the footwork.. created an XString class with the same fields
and equals logic as String. Then ran the profiling benchmark (below)
with the first if in place, and without. Here are the results:

With if check:

XString: 2041.559302 String: 2072.287253
XString: 2090.282933 String: 2076.000323
XString: 2092.470786 String: 2074.713913
XString: 2092.733079 String: 2073.96601
XString: 2092.218692 String: 2074.628331
XString: 2091.529936 String: 2072.115415
XString: 2092.015061 String: 2075.19818
XString: 2092.343548 String: 2073.740385
XString: 2092.733192 String: 2073.502145
XString: 2092.513025 String: 2074.677844

Without if check:

XString: 2052.875498 String: 2084.60745
XString: 2077.818231 String: 2080.210148
XString: 2080.658763 String: 2085.940975
XString: 2077.86366 String: 2086.251417
XString: 2074.737521 String: 2087.026064
XString: 2082.000822 String: 2087.007095
XString: 2080.464209 String: 2089.209287
XString: 2082.935433 String: 2089.499631
XString: 2078.67812 String: 2090.466871
XString: 2079.904874 String: 2091.204318

The timings show that the initial if does waste some time, but it's an
almost trivial amount. The test class includes in the times some
ArrayList.indexOf() overhead.. but even if this was factored out, the
gain would still be trivial.

package org.lv.tests;

import static org.lv.lego.constants.NumericConstants.MILLION;

import org.lv.lego.math.random.RandomKit;

import java.util.ArrayList;
import java.util.List;

public class XString {

private static final int NUM_LINES = (int) (0.05 * MILLION);
private static final int NUM_SEARCHES = 5000;
private static final int NUM_TESTS = 10;

private int count;
private char[] value;
private int offset;

XString(char[] chars) {

value = chars;
count = chars.length;
}

public boolean equals(Object anObject) {

// if (this == anObject) {
// return true;
// }
if (anObject instanceof XString) {
XString anotherString = (XString) anObject;
int n = count;
if (n == anotherString.count) {
char v1[] = value;
char v2[] = anotherString.value;
int i = offset;
int j = anotherString.offset;
while (n-- != 0) {
if (v1[i++] != v2[j++])
return false;
}
return true;
}
}
return false;
}

public static void main(String[] args) {

System.out.println("Generating random Strings..");
List<String> lines = new ArrayList<String>();
List<XString> xlines = new ArrayList<XString>();
String lastLine = null;
XString lastXLine = null;
for (int i = 0; i < NUM_LINES; i++) {

String randomLineString = RandomKit.randomLineOfText();
XString randomLine = new
XString(randomLineString.toCharArray());
xlines.add(randomLine);
lines.add(randomLineString);
lastLine = randomLineString;
lastXLine = randomLine;
}
System.out.println("Done. Searching through them..");

for (int test = 0; test < NUM_TESTS; test++) {

long then = System.nanoTime();
for (int i = 0; i < NUM_SEARCHES; i++) {
int indexOf = xlines.indexOf(lastXLine);
}
long now = System.nanoTime();
long elapsed = now - then;
System.out.print("XString: ");
System.out.print(((double) elapsed) / MILLION);
System.out.print(" ");

then = System.nanoTime();
for (int i = 0; i < NUM_SEARCHES; i++) {
int indexOf = lines.indexOf(lastLine);
}
now = System.nanoTime();
elapsed = now - then;
System.out.print("String: ");
System.out.println(((double) elapsed) / MILLION);
}
}
}
 
T

Tom Anderson

OK, done the footwork.. created an XString class with the same fields
and equals logic as String. Then ran the profiling benchmark (below)
with the first if in place, and without. Here are the results:

With if check:

XString: 2041.559302 String: 2072.287253
XString: 2090.282933 String: 2076.000323
XString: 2092.470786 String: 2074.713913
XString: 2092.733079 String: 2073.96601
XString: 2092.218692 String: 2074.628331
XString: 2091.529936 String: 2072.115415
XString: 2092.015061 String: 2075.19818
XString: 2092.343548 String: 2073.740385
XString: 2092.733192 String: 2073.502145
XString: 2092.513025 String: 2074.677844

Without if check:

XString: 2052.875498 String: 2084.60745
XString: 2077.818231 String: 2080.210148
XString: 2080.658763 String: 2085.940975
XString: 2077.86366 String: 2086.251417
XString: 2074.737521 String: 2087.026064
XString: 2082.000822 String: 2087.007095
XString: 2080.464209 String: 2089.209287
XString: 2082.935433 String: 2089.499631
XString: 2078.67812 String: 2090.466871
XString: 2079.904874 String: 2091.204318

The timings show that the initial if does waste some time, but it's an
almost trivial amount.

Moreover, it shows that this is so *in the worst case*. In your test
setup, the only time two strings will be identical, so exploiting the fast
path via ==, will be on the very last string in the list. Thus, you're
making the check but getting no benefit from it for the other 50 000
strings.

The heuristic behind the inclusion of the == test in String.equals is, i
assume, that comparisons between identical strings are common. What we
really need is a real-world benchmark that does a lot of string
comparison, so we can test if this is really true. Javac? Some random J2EE
app? It would be quite straightforward to substitute a ==-less version of
String using -Xbootclasspath and measure performance.

tom
package org.lv.tests;

import static org.lv.lego.constants.NumericConstants.MILLION;

import org.lv.lego.math.random.RandomKit;

import java.util.ArrayList;
import java.util.List;

public class XString {

private static final int NUM_LINES = (int) (0.05 * MILLION);
private static final int NUM_SEARCHES = 5000;
private static final int NUM_TESTS = 10;

private int count;
private char[] value;
private int offset;

XString(char[] chars) {

value = chars;
count = chars.length;
}

public boolean equals(Object anObject) {

// if (this == anObject) {
// return true;
// }
if (anObject instanceof XString) {
XString anotherString = (XString) anObject;
int n = count;
if (n == anotherString.count) {
char v1[] = value;
char v2[] = anotherString.value;
int i = offset;
int j = anotherString.offset;
while (n-- != 0) {
if (v1[i++] != v2[j++])
return false;
}
return true;
}
}
return false;
}

public static void main(String[] args) {

System.out.println("Generating random Strings..");
List<String> lines = new ArrayList<String>();
List<XString> xlines = new ArrayList<XString>();
String lastLine = null;
XString lastXLine = null;
for (int i = 0; i < NUM_LINES; i++) {

String randomLineString = RandomKit.randomLineOfText();
XString randomLine = new
XString(randomLineString.toCharArray());
xlines.add(randomLine);
lines.add(randomLineString);
lastLine = randomLineString;
lastXLine = randomLine;
}
System.out.println("Done. Searching through them..");

for (int test = 0; test < NUM_TESTS; test++) {

long then = System.nanoTime();
for (int i = 0; i < NUM_SEARCHES; i++) {
int indexOf = xlines.indexOf(lastXLine);
}
long now = System.nanoTime();
long elapsed = now - then;
System.out.print("XString: ");
System.out.print(((double) elapsed) / MILLION);
System.out.print(" ");

then = System.nanoTime();
for (int i = 0; i < NUM_SEARCHES; i++) {
int indexOf = lines.indexOf(lastLine);
}
now = System.nanoTime();
elapsed = now - then;
System.out.print("String: ");
System.out.println(((double) elapsed) / MILLION);
}
}
}
 
T

Tom Anderson

Ah, ok. That's what I get for posting before checking the docs....

But:

public boolean equals(Object obj) {
try {
if (this == obj) return true ; // controversial!
String that = (String)obj ;
if (this.length != that.length) return false ;
for (int i = 0 ; i < this.length ; ++i)
if (this.chars != that.chars) return false ;
return true ;
} catch (ClassCastException e) {
return false ;
} catch (NullPointerException e) {
return false ;
}
}

In the case of comparing to an actual String, this should be quite fast -
i don't think there's any per-execution overhead to setting up a try
block. When comparing to null or a non-String, though, there will be a
throw-catch, which is disproportionately expensive.

tom
 
T

Tom Anderson

But that still entails a null test when you get down to the machine
code, unless, perhaps, you implement null internally as a reference to a
singleton object of class (let us call it) $Null -- which introduces
some weird twists and turns of its own ($Null cannot be a subclass of
Object, for one). Does any JVM in fact do that?

JVMs, not that i'm aware of. Smalltalk and Python VMs, yes.

tom
 
S

softwarepearls_com

Are you saying that Eclipse does not intern its Strings?

javac, be it from Eclipse or Oracle or Sun or IBM, is still javac.

Like I said, Eclipse uses its own compiler, so there is no "javac from
Eclipse". The compiler used by Eclipse is a completely different
compiler.
Something has to compile the code.  Your claim was that javac was
somehow at fault for String interning:

That does not imply "fault".
In that context, I refer to any brand of compiler as "javac".

Inventing your own vocabulary can be problematic when others
understand a more standard meaning.
And I'm still puzzled about Eclipse's compilation.  When I set up
Eclipse, I have to tell it where my JDKs are, and I can have it use
any one installed on my system.  I routinely have it use Sun's at my
work site.  Are you saying that Eclipse still doesn't use the JDK's
compiler in that circumstance?

Correct. Eclipse needs to know which JDK or JRE you wish to use so
that it can use the classes.jar or rt.jar that's central to any JDK/
JRE. But Eclipse will not use javac.exe (on Windows) in any JDK you
may point at it.
If so, yet another reason to detest Eclipse.

While Eclipse may be inferior to IntelliJ, it's hardly something to be
detested. When I compare my Java productivity using Eclipse with my
(Java) productivity roughly 10 years ago using Multi-Edit for Windows,
the two just don't compare. The refactorings alone are worth Eclipse's
learning curve.

Eclipse's compiler is so valuable and powerful that the Eclipse
project even allows it to be used standalone. If my memory serves me
right, even IntelliJ can be configured to use the Eclipse compiler..
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,780
Messages
2,569,607
Members
45,241
Latest member
Lisa1997

Latest Threads

Top