keeping only unique strings

J

Jean Lutrin

Hi all,

I'd like to sort through a huge list of Strings and keep only
unique Strings. There can not be in my result two String a and b for
which : a.equals(b) == true.

So my first idea was to simply put each String in a HashMap, which
would automatically discard duplicates (and this is really easy to
implement).

However, nothing garantees that two Strings being equals have the same
object reference (which is why we have to use .equals() btw !?).

So what is the simplest way I could use to do this ? (I'd like to do this
with a sort of hashtable/map/set whatever to have an O(1) complexity).

Do I have to make my own "string" class with a special hashcode method ?

Any help appreciated (btw, this is not for homework, I'm way too old for that),

Jean
 
G

Gordon Beaton

I'd like to sort through a huge list of Strings and keep only
unique Strings. There can not be in my result two String a and b for
which : a.equals(b) == true.

So my first idea was to simply put each String in a HashMap, which
would automatically discard duplicates (and this is really easy to
implement).

However, nothing garantees that two Strings being equals have the
same object reference (which is why we have to use .equals() btw
!?).

You've answered your own question: use a HashMap. What makes you think
that HashMap cares about identical object references?

/gordon
 
R

Roedy Green

Wouldn't a SortedSet (TreeSet) do the job nicely?

SortedSets are for when you want to MAINTAIN a set in order. A simple
sort is faster if you just want to put them in order once.

In The Replicator (an app that debuted last night) I use merges to
combine sortedArrayLists in various ways much faster than you could
add the elements to an existing sorted set.

The time for the many merging operations is barely even noticeable.
 
J

John C. Bollinger

Jean said:
I'd like to sort through a huge list of Strings and keep only
unique Strings. There can not be in my result two String a and b for
which : a.equals(b) == true.

So my first idea was to simply put each String in a HashMap, which
would automatically discard duplicates (and this is really easy to
implement).

However, nothing garantees that two Strings being equals have the same
object reference (which is why we have to use .equals() btw !?).

Is that a question? You are correct, two Strings that are equal are not
guaranteed to be the same object. You can discard duplicate String
objects soon after creation by careful use of String.intern(), but that
does nothing to prevent having multiple references to the same String.
So what is the simplest way I could use to do this ? (I'd like to do this
with a sort of hashtable/map/set whatever to have an O(1) complexity).

HashSet is probably the most convenient to use. It ought to be
suitable. All of the above use the subject Objects' hashCodes for
comparison, without regard to object identity.
Do I have to make my own "string" class with a special hashcode method ?

String already has a suitable hashCode method. A class whose hashCode
method is inconsistent with its equals method is broken -- if two
objects are equal then they should have the same hashCode (but the
reverse is not necessarilly true). You will find that all the official
Java classes have this property.


John Bollinger
(e-mail address removed)
 
W

William Brogden

Jean Lutrin said:
Hi all,

I'd like to sort through a huge list of Strings and keep only
unique Strings. There can not be in my result two String a and b for
which : a.equals(b) == true.

So my first idea was to simply put each String in a HashMap, which
would automatically discard duplicates (and this is really easy to
implement).

However, nothing garantees that two Strings being equals have the same
object reference (which is why we have to use .equals() btw !?).

So what is the simplest way I could use to do this ? (I'd like to do this
with a sort of hashtable/map/set whatever to have an O(1) complexity).

Sounds like a job for TreeSet to me. Unique and sorted already.
Or HashSet if you don't want to be able to iterate through in sorted order.

Bill
 
D

Dale King

Gordon Beaton said:
You've answered your own question: use a HashMap. What makes you think
that HashMap cares about identical object references?


Or if you don't care about actually mapping them to anything, just use a
HashSet instead. Note that doesn't actually save any memory as HashSet is
implemented with a HashMap, but it might be the more correct abstraction.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,774
Messages
2,569,598
Members
45,149
Latest member
Vinay Kumar Nevatia0
Top