removing duplicates from a string array

F

Fred

Hi all,

I am trying to find the most efficient way
to remove duplicate from a string array.
Obviously I can do it by

- iterating through all array values and putting them into a haset keys
- retriving the keys
- putting them back into a new array

but there must be a more direct way.
Any help ...

Fred
 
S

Sharp

Hi all,
I am trying to find the most efficient way
to remove duplicate from a string array.
Obviously I can do it by

- iterating through all array values and putting them into a haset keys
- retriving the keys
- putting them back into a new array

but there must be a more direct way.
Any help ...

String[] str = new String[size];
str.removeDuplicates();

Sharp
 
R

Rogan Dawes

Sharp said:
Hi all,

I am trying to find the most efficient way
to remove duplicate from a string array.
Obviously I can do it by

- iterating through all array values and putting them into a haset keys
- retriving the keys
- putting them back into a new array

but there must be a more direct way.
Any help ...


String[] str = new String[size];
str.removeDuplicates();

Sharp

Is this perhaps C "Sharp"?

I don't know of any removeDuplicates method in Java's array implementation.

To the OP:

Your suggestion is basically the simplest way to do it.

Otherwise, sort your array, then iterate over it, skipping any entries
that are identical to the previous one.

You may end up iterating over it twice, once to calculate how many items
there are (to size the new array), the second time to copy the entries
over.

Alternatively, add each unique string to a List, then use the (String[])
list.toArray(new String[list.size()]); method to get back to an array of
Strings.

Regards,

Rogan
 
J

John C. Bollinger

Fred said:
I am trying to find the most efficient way
to remove duplicate from a string array.
Obviously I can do it by

- iterating through all array values and putting them into a haset keys
- retriving the keys
- putting them back into a new array

but there must be a more direct way.

Why? To remove duplicates you must compare every element to every
other element. A good hashing implementation makes this fairly easy,
and HashSet has the further advantage of already providing the required
equality comparisons in the case of hash collisions. The approach you
suggest is O(N).

You might consider it somewhat more direct to perform a nested iteration
over the array elements, or a sort-and-iterate, as another respondent
observed. The former is O(N*N), and the latter O(NlogN) (for an
O(NlogN) sort). There is no way to change the size of an existing
array, so you are going to have to create a new one whenever there are
any duplicates.
 
H

HK

Fred said:
Hi all,

I am trying to find the most efficient way
to remove duplicate from a string array.
Obviously I can do it by

- iterating through all array values and putting them into a haset keys
- retriving the keys
- putting them back into a new array

The above is probably not the worst you can do. Depending on
how many collisions you generate in the hash, the complexity
can be nearly linear in the number of elements. For a start
you would take a generously preallocated HashSet.

A n*log(n) approach would involve sorting the array. Then you
need only compare adjecent elements for equality.

Harald.
 
P

Patricia Shanahan

Fred said:
Hi all,

I am trying to find the most efficient way
to remove duplicate from a string array.
Obviously I can do it by

- iterating through all array values and putting them into a haset keys
- retriving the keys
- putting them back into a new array

but there must be a more direct way.
Any help ...

Fred

Note that java.util.Arrays has a method asList, so you don't
need an explicit loop to add the array contents to the set.
Similarly, HashSet has a method toArray.

The HashSet approach will leave the strings in an arbitrary
order. Is that a problem?

Patricia
 
E

Eric Sosman

Fred said:
Hi all,

I am trying to find the most efficient way

Why insist on "the most efficient way?" How big is
the array, and what reason do you have to believe that
the efficiency of this removal operation is important to
the overall program?
to remove duplicate from a string array.
Obviously I can do it by

- iterating through all array values and putting them into a haset keys
- retriving the keys
- putting them back into a new array

but there must be a more direct way.

That's pretty direct. It's a two-liner:

Set set = new HashSet(Arrays.asList(array));
String[] array2 = (String[])(set.toArray(new String[set.size()]));
 
A

Archer

package run;

import java.util.Arrays;

public class RunDuplicate
{
public RunDuplicate()
{
super();
}

public static void main(String[] args)
{
String[] duplicates = new String[] {"mary",
"had",
"a",
"little",
"lamb",
"its",
"fleece",
"was",
"white",
"as",
"snow",
"mary"};

Arrays.sort(duplicates);

int k = 0;

for (int i = 0; i < duplicates.length; i++)
{
if (i > 0 && duplicates.equals(duplicates[i -1]))
continue;

duplicates[k++] = duplicates;
}

String[] unique = new String[k];

System.arraycopy(duplicates, 0, unique, 0, k);

//test that unique contains no duplicate strings
for (int i = 0; i < unique.length; i++)
System.out.println(unique);
}
}

Regards,

Archer.
 
A

Archer

package run;

import java.util.Arrays;

public class RunRemoveDuplicates
{
public RunRemoveDuplicates()
{
super();
}

public static void main(String[] args)
{
String[] duplicates = new String[] {"mary",
"had",
"a",
"a",
"little",
"lamb",
"its",
"fleece",
"was",
"white",
"as",
"snow",
"mary"};

Arrays.sort(duplicates);

int k = 1;

for (int i = 1; i < duplicates.length; i++)
{
if (! duplicates.equals(duplicates[i -1]))
duplicates[k++] = duplicates;
}

String[] unique = new String[k];

System.arraycopy(duplicates, 0, unique, 0, k);

//test that unique contains no duplicate strings
for (int i = 0; i < unique.length; i++)
System.out.println(unique);
}
}

Regards,

Archer.
 
A

Archer

package run;

import java.util.Arrays;

public class RunRemoveDuplicates
{
public RunRemoveDuplicates()
{
super();
}

public static void main(String[] args)
{
String[] duplicates = new String[] {"mary",
"had",
"a",
"little",
"lamb",
"its",
"fleece",
"was",
"white",
"as",
"snow",
"mary"};

Arrays.sort(duplicates);

int k = 1;

for (int i = 1; i < duplicates.length; i++)
{
if (! duplicates.equals(duplicates[i -1]))
duplicates[k++] = duplicates;
}

String[] unique = new String[k];

System.arraycopy(duplicates, 0, unique, 0, k);

//test that unique contains no duplicate strings
for (int i = 0; i < unique.length; i++)
System.out.println(unique);
}
}
 
A

Archer

package run;

import java.util.Arrays;

public class RunRemoveDuplicates
{
public RunRemoveDuplicates()
{
super();
}

public static void main(String[] args)
{
String[] duplicates = new String[] {"mary",
"had",
"a",
"little",
"lamb",
"its",
"fleece",
"was",
"very",
"very",
"white",
"like",
"snow"};

Arrays.sort(duplicates);

int k = 1;

for (int i = 1; i < duplicates.length; i++)
{
if (!duplicates.equals(duplicates[i - 1]))
duplicates[k++] = duplicates;
}

String[] unique = new String[k];

System.arraycopy(duplicates, 0, unique, 0, k);

// test that unique contains no duplicate strings
for (int i = 0; i < unique.length; i++)
System.out.println(unique);
}
}
 
J

John C. Bollinger

Patricia said:
The HashSet approach will leave the strings in an arbitrary
order. Is that a problem?

If it is, then using LinkedHashSet instead will do the job without
changing the retained elements' relative order. There is a cost
involved, so use HashSet when the order doesn't matter.
 
E

Eric Sosman

Archer said:
package run;

import java.util.Arrays;
[...]

Perhaps the most interesting about this multi-multi-posted
replacement for the obvious two-liner is that it's slower than
the two lines it replaces, at least on the system I used. Details:

- Array of 1000 randomly-generated alphabetic strings,
with lengths between 20 and 99 characters.

- The basic test: System.arraycopy() the prototype
array to a "victim" array, then remove the duplicates
with Archer's method or with the two-liner. (The
array is copied each time because Archer's method
modifies the original; the two-liner doesn't, but the
array was copied anyhow for fairness.)

- Run each test 500 times to "warm up" the JVM.

- Run each test another 500 times and measure the total
time. Scale the number of repetitions up to get
roughly forty seconds of elapsed time.

- The payoff: Run each test 7707 times. The two-liner
took just under eleven seconds; Archer's method took
just under twenty-nine.

YAV4KISS! (Yet Another Victory for "Keep It Simple, Stupid!")
 
A

Archer

Eric said:
Archer said:
package run;

import java.util.Arrays;
[...]

Perhaps the most interesting about this multi-multi-posted
replacement for the obvious two-liner is that it's slower than
the two lines it replaces, at least on the system I used. Details:

- Array of 1000 randomly-generated alphabetic strings,
with lengths between 20 and 99 characters.

- The basic test: System.arraycopy() the prototype
array to a "victim" array, then remove the duplicates
with Archer's method or with the two-liner. (The
array is copied each time because Archer's method
modifies the original; the two-liner doesn't, but the
array was copied anyhow for fairness.)

- Run each test 500 times to "warm up" the JVM.

- Run each test another 500 times and measure the total
time. Scale the number of repetitions up to get
roughly forty seconds of elapsed time.

- The payoff: Run each test 7707 times. The two-liner
took just under eleven seconds; Archer's method took
just under twenty-nine.

YAV4KISS! (Yet Another Victory for "Keep It Simple, Stupid!")


Hi Eric,

Thankyou for your post and especially for you test results, most
interesting!! Could you please run the following code to see if it
improves things.


package run;

import java.util.Arrays;

public class RunRemoveDuplicates
{
public RunRemoveDuplicates()
{
super();
}

public static void main(String[] args)
{
String[] duplicates = new String[] { "mary", "had", "a",
"little",
"lamb", "its", "fleece", "was", "very", "very",
"white",
"like", "snow" };

Arrays.sort(duplicates);

int k = 1;
int len = duplicates.length;

for (int i = 1; i < len; i++)
{
if (! duplicates.equals(duplicates[i - 1]))
duplicates[k++] = duplicates;
}

String[] unique = new String[k];

System.arraycopy(duplicates, 0, unique, 0, k);

// test that unique contains no duplicate strings
for (int i = 0; i < unique.length; i++)
System.out.println(unique);
}
}

Regards,

Archer.

PS. Apologies for the duplicated posts. I was attempting to send the
above but the text kept misaligning and I thougth it necessary that it
be easily read. Also, it was Friday evening here and I had had a
glass of Ozwine (or was that two).
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,744
Messages
2,569,483
Members
44,901
Latest member
Noble71S45

Latest Threads

Top