efficient use of hashtable with string like keys

H

Harald Kirsch

My applications does a lot of text crunching, filtering
information out of the text. For reasons of efficiency
all the text handling is done in StringBuffer objects.

When the application finds something interesting, it wants
to store it in a Hashtable with the text to be used as a
key still in the StringBuffer.

In order to test whether the key is already known to the Hashtable
I currently create a String object from the StringBuffer. But
most of the time the key happens to be already known to the
Hashtable and the String created is immediately thrown away again.

How can I use a reusable object, e.g. a StringBuffer, as a key to
the Hashtable such that hashing is performed according .equals()
instead of according to object identity.

I know that it will be a disaster if I change a stored object later,
but maybe someone knows a clean way to get around the superflous
creation of a String object.

Thanks,
Harald.
 
V

VisionSet

Harald Kirsch said:
In order to test whether the key is already known to the Hashtable
I currently create a String object from the StringBuffer. But
most of the time the key happens to be already known to the
Hashtable and the String created is immediately thrown away again.

Well if most occasions the Hashtable already holds it then don't worry, you
never made a new String to check it with. Strings are interned by default,
that is only one copy of an identical string is made, even when you do new
String("xyz"), if xyz already exists in the JVM, your reference just points
to that.
 
M

Marco Schmidt

Harald Kirsch:

[...]
I know that it will be a disaster if I change a stored object later,
but maybe someone knows a clean way to get around the superflous
creation of a String object.

I don't know about a perfect solution, but if you do create the String
and replace it with what intern() returns on that new String, you
avoid having a String with the same content twice:

String newString = ...;
newString = newString.intern();

But I'd do some measurements if you really have a bottleneck. There
may be no need for optimization.

Regards,
Marco
 
J

John C. Bollinger

VisionSet said:
Well if most occasions the Hashtable already holds it then don't worry, you
never made a new String to check it with. Strings are interned by default,
that is only one copy of an identical string is made, even when you do new
String("xyz"), if xyz already exists in the JVM, your reference just points
to that.

I'm sorry, but that's correct only for compile-time computed Strings.
Strings created at runtime may indeed be equal to but distinct from
other Strings in the same VM. It is always possible to get a reference
to a globally shared String representing the same sequence of characters
(via the intern() method) but you still end up creating and then
discarding a distinct String object. See the JLS discussion at

http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html#19369


John Bollinger
(e-mail address removed)
 
J

John C. Bollinger

Harald said:
My applications does a lot of text crunching, filtering
information out of the text. For reasons of efficiency
all the text handling is done in StringBuffer objects.

When the application finds something interesting, it wants
to store it in a Hashtable with the text to be used as a
key still in the StringBuffer.

In order to test whether the key is already known to the Hashtable
I currently create a String object from the StringBuffer. But
most of the time the key happens to be already known to the
Hashtable and the String created is immediately thrown away again.

How can I use a reusable object, e.g. a StringBuffer, as a key to
the Hashtable such that hashing is performed according .equals()
instead of according to object identity.

You have a misconception. Hashtables use Objects' (keys') hashCode
method to perform hashing and their equals() methods to compare keys
that are in the same hash bucket. The Map interface explains the
applicable contract in some detail.

It is possible that you are confused because StringBuffers inherit their
equals() and hashCode() methods from Object, which implements equals()
in terms of object identity.
I know that it will be a disaster if I change a stored object later,
but maybe someone knows a clean way to get around the superflous
creation of a String object.

Changing an object that is in use as a Hashtable key in a way that
affects its hashCode or equals() method results in unspecified behavior
of the Hashtable. You don't want that.

There is no solution that provides the behavior you want within the
constraints you have specified (source key data from a StringBuffer, key
object to implement equals differently than StringBuffer.equals() does,
and new object creation impermissable). The only way to reduce the
number of transient object instantiations is to rework your algorithm so
that you get fewer duplicate hits in the first place.


John Bollinger
(e-mail address removed)
 
A

Adam Maass

Harald Kirsch said:
My applications does a lot of text crunching, filtering
information out of the text. For reasons of efficiency
all the text handling is done in StringBuffer objects.

When the application finds something interesting, it wants
to store it in a Hashtable with the text to be used as a
key still in the StringBuffer.

In order to test whether the key is already known to the Hashtable
I currently create a String object from the StringBuffer. But
most of the time the key happens to be already known to the
Hashtable and the String created is immediately thrown away again.

How can I use a reusable object, e.g. a StringBuffer, as a key to
the Hashtable such that hashing is performed according .equals()
instead of according to object identity.

I know that it will be a disaster if I change a stored object later,
but maybe someone knows a clean way to get around the superflous
creation of a String object.

First off, creating a String object from a StringBuffer is relatively
inexpensive... the String and the StringBuffer initially share internal
storage. Only when the StringBuffer is susequently modified does the
StringBuffer class make a copy of the internal storage.


Alternatively, you need to write your own class that has the mutability of
StringBuffer but the equals and hashCode semantics of String. Not hard,
really.


-- Adam Maass
 
C

Chris Uppal

Adam said:
Alternatively, you need to write your own class that has the
mutability of StringBuffer but the equals and hashCode semantics of
String. Not hard, really.

Or switch the HashTable to a java.util.TreeMap with a custom Comparator.

-- chris
 
H

Harald Kirsch

John C. Bollinger said:
You have a misconception. Hashtables use Objects' (keys') hashCode
method to perform hashing and their equals() methods to compare keys
that are in the same hash bucket. The Map interface explains the
applicable contract in some detail.

It is possible that you are confused because StringBuffers inherit their
equals() and hashCode() methods from Object, which implements equals()
in terms of object identity.

No, I was not confused, but my question was probably not clear
clear enough about this point. That it cannot work with the available
implementation of .hashcode() and .equals() is obvious. I hoped someone
came up with an ingenious idea of how to borrow String.hashcode() and and
String.equals() and apply it to StringBuffer without rewriting everything.
There is no solution that provides the behavior you want within the
constraints you have specified (source key data from a StringBuffer, key
object to implement equals differently than StringBuffer.equals() does,
and new object creation impermissable). The only way to reduce the
number of transient object instantiations is to rework your algorithm so
that you get fewer duplicate hits in the first place.

Unfortunately the keys stem directly from real world data. The task is
simple: store a key only once, even if seen many times in the
input. I don't see how this could be reworked to not produce the
to-be-checked keys from the data.

Thanks anyway,
Harald.
 
H

Harald Kirsch

Adam Maass said:
First off, creating a String object from a StringBuffer is relatively
inexpensive... the String and the StringBuffer initially share internal
storage. Only when the StringBuffer is susequently modified does the
StringBuffer class make a copy of the internal storage.

The whole point of using a StringBuffer is to change it all the time.
In fact in my application it is totally counterproductive that
StringBuffer.toString() tries create a String object which shares the
char[] because an immediate, inevitable, subsequent change of the StringBuffer
then reallocates the StringBuffers char[]. This is bad for two reasons:

1) After the StringBuffer, during several .append, grew large enough to
hold all the typical pieces of text I deal with, the reallocation most
probably makes it too short, triggering even more reallocations in subsequent
..appends.

2) The String object created keeps the char[] which is most probably far
to large.

I guess this unhelpful implementation only serves the compiler well to
implement things like "a"+"b"+345+"some other text" fairly efficient.

Harald.
 
A

Adam Maass

Harald Kirsch said:
Adam Maass said:
First off, creating a String object from a StringBuffer is relatively
inexpensive... the String and the StringBuffer initially share internal
storage. Only when the StringBuffer is susequently modified does the
StringBuffer class make a copy of the internal storage.

The whole point of using a StringBuffer is to change it all the time.
In fact in my application it is totally counterproductive that
StringBuffer.toString() tries create a String object which shares the
char[] because an immediate, inevitable, subsequent change of the StringBuffer
then reallocates the StringBuffers char[]. This is bad for two reasons:

1) After the StringBuffer, during several .append, grew large enough to
hold all the typical pieces of text I deal with, the reallocation most
probably makes it too short, triggering even more reallocations in subsequent
.appends.

2) The String object created keeps the char[] which is most probably far
to large.

I guess this unhelpful implementation only serves the compiler well to
implement things like "a"+"b"+345+"some other text" fairly efficient.

In which case, you need a custom class with the mutability of StringBuffer
but the .equals and .hashCode semantics of String. This is not hard to do
since the source of these standard classes is available.

-- Adam Maass
 
R

Roedy Green

I hoped someone
came up with an ingenious idea of how to borrow String.hashcode() and and
String.equals() and apply it to StringBuffer without rewriting everything.

A few thoughts:

1. You can't change the lookup hashCode of an object in a HashMap on
the fly. You have to remove it, change the fields the hashCode
depends on and re-add it. HashMap presumes the hashCode won't change.

2. You would likely want to cache it since it is expensive to compute.

3. If the StringBuffer is changing all the time, you might want a
faster but lower quality algorithm, or perhaps even one that just
looks at the first few characters to speed computation. You don't
need to recompute it if the first few chars don't change.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,754
Messages
2,569,528
Members
45,000
Latest member
MurrayKeync

Latest Threads

Top