deduping algorithm

Roedy Green

The typical way you dedup an array is to sort and compare pairwise.

However, often you don't want to disturb the original order, just
dedup.

I wonder if there are clever ways to do this.

The obvious solution, sorting the items back into their original
order after deduping, requires creating new objects with an
original-order field, then stripping it off again. Ouch.
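
For reference, a rough sketch of that decorate/sort/undecorate
approach might look like the following. The wrapper class and the
names are invented for illustration, and the elements are assumed to
implement Comparable.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical wrapper, not from any library: pairs an element with
// its original position so the order can be restored after sorting.
class Decorated implements Comparable {
    final int index;        // original position in the list
    final Comparable value; // the element itself

    Decorated(int index, Comparable value) {
        this.index = index;
        this.value = value;
    }

    public int compareTo(Object o) {
        return value.compareTo(((Decorated) o).value);
    }
}

class DecorateDedup {
    // Dedups a list of Comparable elements, keeping the first
    // occurrence of each value and preserving the original order.
    static List dedup(List original) {
        List decorated = new ArrayList(original.size());
        for (int i = 0; i < original.size(); i++) {
            decorated.add(new Decorated(i, (Comparable) original.get(i)));
        }

        // Sort by value. The sort is stable, so within each run of
        // equal values the smallest original index comes first.
        Collections.sort(decorated);

        List survivors = new ArrayList(decorated.size());
        Decorated previous = null;
        for (int i = 0; i < decorated.size(); i++) {
            Decorated d = (Decorated) decorated.get(i);
            if (previous == null || previous.value.compareTo(d.value) != 0) {
                survivors.add(d); // first of a new run of equal values
            }
            previous = d;
        }

        // Sort the survivors back into original order, then strip
        // the wrapper off again.
        Collections.sort(survivors, new Comparator() {
            public int compare(Object a, Object b) {
                return ((Decorated) a).index - ((Decorated) b).index;
            }
        });
        List result = new ArrayList(survivors.size());
        for (int i = 0; i < survivors.size(); i++) {
            result.add(((Decorated) survivors.get(i)).value);
        }
        return result;
    }
}

It works, but the extra wrapper objects and the second sort are
exactly the "ouch" the post mentions.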
 
Roedy Green

However, often you don't want to disturb the original order, just
dedup.

If your objects have an equals and hash method defined in the same way
you define dups, then you could enter them into a HashMap and detect
dups that way.
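
A minimal sketch of that idea (the class and method names here are
invented): the HashSet is the "have I seen this before" test, so
equals() and hashCode() must match the definition of a dup. The
first occurrence of each element is kept and the original order is
preserved.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

class HashDedup {
    static List dedup(List original) {
        Set seen = new HashSet();
        List result = new ArrayList(original.size());
        for (Iterator it = original.iterator(); it.hasNext();) {
            Object element = it.next();
            // add() returns false if the set already contained an
            // equal element, i.e. this one is a duplicate.
            if (seen.add(element)) {
                result.add(element);
            }
        }
        return result;
    }
}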

You could sort a copy of the array and put the dups into a HashMap.
Then you go through the original array looking up elements in the dup
HashMap. You delete all but the first one. This means you need
special hashed objects where you can maintain the "first" boolean.
This uses more RAM than the first approach, is slower, and is more
complicated to code, and it still needs equals and hashCode to match
your dup definition.
 
xarax

Roedy Green said:
If your objects have an equals and hash method defined in the same way
you define dups, then you could enter them into a HashMap and detect
dups that way.

You could sort a copy of the array and put the dups into a HashMap.
Then you go through the original array looking up elements in the dup
HashMap. You delete all but the first one. This means you need
special hashed objects where you can maintain the "first" boolean.
This uses more RAM than the first approach, is slower, and is more
complicated to code, and it still needs equals and hashCode to match
your dup definition.

A simple way is to create an array of Integer and
a Comparator singleton that extracts the intValue()
from each operand and indexes into the corresponding
array of objects to determine the relative sort.

Then walk through the "sorted" Integer[] to delete
the duplicates (by adding the non-dups to a new
array).

public class MyDataComparator
    implements java.util.Comparator
{
    // The original, unsorted array that the Integer operands index
    // into; assumed to be filled in by the surrounding code.
    MyData[] myDataArray;

    public int compare(final Object obj1, final Object obj2)
    {
        // The operands are Integer indexes into myDataArray.
        final Integer ndx1 = (Integer) obj1;
        final Integer ndx2 = (Integer) obj2;

        final int x1 = ndx1.intValue();
        final int x2 = ndx2.intValue();

        final MyData md1 = myDataArray[x1];
        final MyData md2 = myDataArray[x2];

        // MyData is assumed to implement Comparable (see below).
        final int result = md1.compareTo(md2);
        return result;
    }
}

This assumes that your data object, MyData, implements
the Comparable interface. You would use the
Arrays.sort(Object[],java.util.Comparator) method to "sort"
the Integer array. Then walk through the "sorted" array
to get the indices for the sorted elements in the original
array (myDataArray[]).

I would probably use an ArrayList initialized to the size of
the original array to accumulate the non-duplicate elements,
then use its toArray() method to build the new array.

I'll leave the rest as an exercise for the reader.
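
For what it's worth, a rough sketch of that walk, reusing the
myDataArray name from the post above; the comparison goes through
Comparable, in the spirit of the comparator shown earlier, and the
class name is invented.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class IndexDedup {
    static Object[] dedup(final Object[] myDataArray) {
        // Build the Integer index array and "sort" it by comparing
        // the elements the indexes point at.
        Integer[] indexes = new Integer[myDataArray.length];
        for (int i = 0; i < indexes.length; i++) {
            indexes[i] = new Integer(i);
        }
        Arrays.sort(indexes, new Comparator() {
            public int compare(Object a, Object b) {
                Comparable c1 = (Comparable) myDataArray[((Integer) a).intValue()];
                Comparable c2 = (Comparable) myDataArray[((Integer) b).intValue()];
                return c1.compareTo(c2);
            }
        });

        // Walk the "sorted" indexes and mark one survivor per run of
        // equal elements. Arrays.sort is stable, so within a run the
        // indexes are still ascending: the run's first entry is the
        // earliest original position.
        boolean[] keep = new boolean[myDataArray.length];
        int runStart = 0;
        for (int i = 1; i <= indexes.length; i++) {
            boolean endOfRun = (i == indexes.length)
                || ((Comparable) myDataArray[indexes[i].intValue()]).compareTo(
                       myDataArray[indexes[runStart].intValue()]) != 0;
            if (endOfRun) {
                keep[indexes[runStart].intValue()] = true;
                runStart = i;
            }
        }

        // Accumulate the non-duplicates in original order, then build
        // the new array.
        List result = new ArrayList(myDataArray.length);
        for (int i = 0; i < myDataArray.length; i++) {
            if (keep[i]) {
                result.add(myDataArray[i]);
            }
        }
        return result.toArray();
    }
}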

--
----------------------------
Jeffrey D. Smith
Farsight Systems Corporation
24 BURLINGTON DRIVE
LONGMONT, CO 80501-6906
http://www.farsight-systems.com
z/Debug debugs your Systems/C programs running on IBM z/OS for FREE!
 
Roedy Green

A simple way is to create an array of Integer and
a Comparator singleton that extracts the intValue()
from each operand and indexes into the corresponding
array of objects to determine the relative sort.

I'd like to encapsulate this to dedup any ArrayList of objects,
given just a Comparator to define what you mean by a dup.

The problem with your approach is that the client must write a
fairly complicated Comparator. Perhaps there would be a way of using
a simple Comparator to write your Comparator.
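
One possible shape for that, sketched with invented names: a generic
index Comparator that wraps whatever simple element Comparator the
client supplies, so the index juggling only ever gets written once
inside the dedup utility.

import java.util.Comparator;
import java.util.List;

class IndexComparator implements Comparator {
    private final List elements;           // the list being deduped
    private final Comparator elementOrder; // the client's "what is a dup" Comparator

    IndexComparator(List elements, Comparator elementOrder) {
        this.elements = elements;
        this.elementOrder = elementOrder;
    }

    public int compare(Object a, Object b) {
        // The operands are Integer indexes into the element list.
        Object e1 = elements.get(((Integer) a).intValue());
        Object e2 = elements.get(((Integer) b).intValue());
        return elementOrder.compare(e1, e2);
    }
}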
 
xarax

Roedy Green said:
I'd like to encapsulate this to dedup any ArrayList of objects,
given just a Comparator to define what you mean by a dup.

The problem with your approach is that the client must write a
fairly complicated Comparator. Perhaps there would be a way of using
a simple Comparator to write your Comparator.

My comparator was about as simple as it can be. The actual
comparison is passed off to the object's compareTo() method,
which is good OO design. So, it's not really necessary to
downcast from Object to MyData, but rather downcast from
Object to Comparable. That would make my comparator very
reusable. All that is needed is the loop that compares
the "sorted" Integer array to weed out the pair-wise dupes.

I am sure you could whip something together in a few
minutes and stick into a utility class somewhere for
everyone to use.

;)

Cheers
 
Roedy Green

I am sure you could whip something together in a few
minutes and stick into a utility class somewhere for
everyone to use.

Soon as my brain clears I will.
 
Rogan Dawes

Roedy said:
The typical way you dedup an array is to sort and compare pairwise.

However, often you don't want to disturb the original order, just
dedup.

I wonder if there are clever ways to do this.

The obvious solution, sorting the items back into their original
order after deduping, requires creating new objects with an
original-order field, then stripping it off again. Ouch.

Following a suggestion earlier to use a set/map:

void dedup(List list) {
    Set set = new HashSet(list);
    Iterator it = list.iterator();
    while (it.hasNext()) {
        Object element = it.next();
        // If the entry has already been removed from the set, this
        // element is a duplicate, so remove it from the list.
        // The iterator returned by the list must support remove();
        // otherwise, we could implement this using list.remove(index).
        if (!set.remove(element)) it.remove();
    }
}

Benefits:

Short, sweet, understandable ;-)

Objects do not need to implement Comparable, just provide a meaningful
equals() (and a matching hashCode())

Rogan

Disclaimer: This is not a SSCCE ;-) It has not even been syntax checked.
 
Rogan Dawes

Roedy said:
The typical way you dedup an array is to sort and compare pairwise.

However, often you don't want to disturb the original order, just
dedup.

One question that this leaves is WHICH duplicate to retain? Do you keep
the first one you come to? From which direction, 0..n or n..0?

When the array is sorted, it is irrelevant which one you remove, as the
order of the elements remains the same. When the elements are unsorted,
or sorted according to some external order, you effectively change this
order by removing the dupes. For example, deduping [b, a, b] gives
[b, a] if you keep the first occurrence of each element, but [a, b] if
you keep the last.

Rogan
 
Roedy Green

Soon as my brain clears I will.

I did a little work on it last night, and realized that the user
likely wants to be informed of what was removed, perhaps to generate
error messages. Fine, I can create a little ArrayList of removed dups
and return that. However, in the process I also create a new
ArrayList of the non-removed elements. What I want to do when I am
done is replace the array inside the original ArrayList with the
array inside my new ArrayList, so that my new one effectively
replaces the caller's old copy.

I could not see how to do that efficiently. In non-OO days it would
have been a simple matter to just change the reference and the size.

The solution Java seems to push me toward is ugly -- sliding all the
elements down every time I delete a dup.

It amazes me how complicated code can be to do something you can
explain in English in a sentence.

The problem with writing code like this is I want to do it optimally.
I have no idea the context in which the code will ultimately be used,
so I feel obligated to do it somewhere close to perfect.
 
xarax

Roedy Green said:
I did a little work on it last night, and realized that the user
likely wants to be informed of what was removed, perhaps to generate
error messages. Fine, I can create a little ArrayList of removed dups
and return that. However, in the process I also create a new
ArrayList of the non-removed elements. What I want to do when I am
done is replace the array inside the original ArrayList with the
array inside my new ArrayList, so that my new one effectively
replaces the caller's old copy.

I could not see how to do that efficiently. In non-OO days it would
have been a simple matter to just change the reference and the size.

The solution Java seems to push me toward is ugly -- sliding all the
elements down every time I delete a dup.

It amazes me how complicated code can be to do something you can
explain in English in a sentence.

The problem with writing code like this is I want to do it optimally.
I have no idea the context in which the code will ultimately be used,
so I feel obligated to do it somewhere close to perfect.

Seems like you're stuck with clear()'ing the original
ArrayList, then iteratively calling add() to copy the
elements of the de-duped array into the original
ArrayList. The clear() method won't shrink the
internal array, so the add() calls will be fast
(they won't have to incrementally resize the internal
array).
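
A minimal sketch of that, with invented names, which also hands back
the removed dups so the caller can report them; it relies on equals()
and hashCode() matching the dup definition.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

class InPlaceDedup {
    static List dedupInPlace(List list) {
        Set seen = new HashSet();
        List keep = new ArrayList(list.size());
        List removed = new ArrayList();
        for (Iterator it = list.iterator(); it.hasNext();) {
            Object element = it.next();
            if (seen.add(element)) {
                keep.add(element);
            } else {
                removed.add(element);
            }
        }
        // clear() keeps the ArrayList's internal array at its current
        // capacity, so none of these add() calls trigger a resize.
        list.clear();
        list.addAll(keep);
        return removed;
    }
}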

There is also the issue of whether the caller knows/cares
whether his ArrayList will be modified, and whether other
threads know/care about that. The garbage collector will
take care of forgotten references.

I think a simpler design is to just return new ArrayList
objects to the client. Then the client can decide whether
to keep both references, or discard one of them. As for
returning yet another ArrayList of non-removed dupes, I
would probably make an optional input parameter for that.
If the parameter is null, then don't bother. Otherwise,
put the non-removed dupes into the specified ArrayList.

2 cents worth. Your mileage may vary.
 
Eric Sosman

Roedy said:
I did a little work on it last night, and realized that the user
likely wants to be informed of what was removed, perhaps to generate
error messages. Fine, I can create a little ArrayList of removed dups
and return that. However, in the process I also create a new
ArrayList of the non-removed elements. What I want to do when I am
done is replace the array inside the original ArrayList with the
array inside my new ArrayList, so that my new one effectively
replaces the caller's old copy.

I could not see how to do that efficiently. In non-OO days it would
have been a simple matter to just change the reference and the size.

The solution Java seems to push me toward is ugly -- sliding all the
elements down every time I delete a dup.

Move each surviving element once (at most) to its
position in the final list. Rough outline:

int survivors = 0;
for (int i = 0; i < list.size(); ++i) {
    Object obj = list.get(i);
    if (its_a_duplicate(obj)) {
        loser_list.add(obj);
    }
    else {
        if (survivors < i) // optimization
            list.set(survivors, obj);
        ++survivors;
    }
}

// There's no truncate() method, alas. If you're
// extending ArrayList you can use the protected
// removeRange() instead of this loop.
// Top-to-bottom sweep avoids useless "sliding."
for (int i = list.size(); --i >= survivors; )
    list.remove(i);

list.trimToSize(); // optional
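
If you do go the subclass route, a tiny sketch of exposing
removeRange() (the class and method names are invented here):

import java.util.ArrayList;

class TruncatableArrayList extends ArrayList {
    public void truncate(int newSize) {
        // removeRange is protected in AbstractList/ArrayList, so it
        // is only reachable from a subclass like this one.
        removeRange(newSize, size());
    }
}

Provided the list started life as a TruncatableArrayList, the removal
loop above collapses to list.truncate(survivors).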
 
Will Hartung

Roedy Green said:
If your objects have an equals and hash method defined in the same way
you define dups, then you could enter them into a HashMap and detect
dups that way.

You could sort a copy of the array and put the dups into a HashMap.
Then you go through the original array looking up elements in the dup
HashMap. You delete all but the first one. This means you need
special hashed objects where you can maintain the "first" boolean.
This uses more RAM than the first approach, is slower, and is more
complicated to code, and it still needs equals and hashCode to match
your dup definition.

Here ya go, preserves order and everything.

public Collection dedup(Collection c)
{
    LinkedHashSet result = new LinkedHashSet();
    result.addAll(c);
    return result;
}

Regards,

Will Hartung
([email protected])
 
Roedy Green

LinkedHashSet result = new LinkedHashSet();
result.addAll(c);
return result;

But if your definition of dup does not match equals, you are hosed.

For example, in my case I am looking for dups in lists of email
addresses. Caps don't matter.

Does your implementation preserve order?
 
Will Hartung

Roedy Green said:
But if your definition of dup does not match equals, you are hosed.

For example, in my case I am looking for dups in lists of email
addresses. Caps don't matter.

Then use an equals and hashCode that do. It's up to the object to define
those semantics properly, not the HashSet.

public class EmailAddress
{
    private String address;

    public EmailAddress(String address)
    {
        this.address = address;
    }

    public String getAddress()
    {
        return address;
    }

    public boolean equals(Object o)
    {
        if (!(o instanceof EmailAddress)) {  // also covers null
            return false;
        }
        EmailAddress that = (EmailAddress) o;
        return this.getAddress().equalsIgnoreCase(that.getAddress());
    }

    public int hashCode()
    {
        return address.toLowerCase().hashCode();
    }
}
Does your implementation preserve order?

Yes, that's the primary purpose of the LinkedHashSet.

A TreeSet won't work (you might consider it because you can specify a
Comparator), because it wants to actually sort your collection, which is
not what you want. You want to preserve order.
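
A quick usage sketch tying the two together; the sample addresses are
made up, and EmailAddress is the class above.

import java.util.ArrayList;
import java.util.Collection;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;

public class EmailDedupDemo {
    public static void main(String[] args) {
        List addresses = new ArrayList();
        addresses.add(new EmailAddress("alice@example.com"));
        addresses.add(new EmailAddress("bob@example.com"));
        // A duplicate under the case-insensitive definition above.
        addresses.add(new EmailAddress("ALICE@EXAMPLE.COM"));

        // LinkedHashSet drops the duplicate but keeps insertion order.
        Collection deduped = new LinkedHashSet(addresses);
        for (Iterator it = deduped.iterator(); it.hasNext();) {
            System.out.println(((EmailAddress) it.next()).getAddress());
        }
        // Prints alice@example.com, then bob@example.com.
    }
}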

Regards,

Will Hartung
([email protected])
 
Roedy Green

Then use an equals and hashCode that does. It's up to the object to define
those semantics properly, not the HashSets.

To do that you then have to subclass, create a new equals/hashCode,
write an unmaintainable copy constructor to convert the objects, then
dedup, and perhaps even convert them back to their ordinary types.

You might look for various types of dup, in which case you need to do
this several times.

If the definition of equals is correct for deduping in the original
object, you are golden. If not, this gets pretty clumsy.
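
For what it's worth, one way to get that encapsulation without
touching equals()/hashCode() at all, sketched with invented names:
use the client's Comparator only as the membership test for an
"already seen" set, and let the original list drive the output order.
The Comparator does have to impose a consistent ordering, not just an
equal/not-equal answer.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class ComparatorDedup {
    static List dedup(List list, Comparator dupDefinition) {
        // The TreeSet is never returned; it only answers "have I seen
        // an element like this before", using the caller's Comparator.
        Set seen = new TreeSet(dupDefinition);
        List result = new ArrayList(list.size());
        for (Iterator it = list.iterator(); it.hasNext();) {
            Object element = it.next();
            if (seen.add(element)) {   // false means it is a dup
                result.add(element);
            }
        }
        return result;
    }
}

For the e-mail case the caller could simply pass
String.CASE_INSENSITIVE_ORDER and never wrap the strings at all.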
 
