deduping algorithm

Discussion in 'Java' started by Roedy Green, Jul 22, 2004.

  1. Roedy Green

    Roedy Green Guest

    The typical way you dedup an array is to sort and compare pairwise.

    However, often you don't want to disturb the original order, just
    dedup.

    I wonder if there are clever ways to do this.

    The obvious solution, sort the items back in original order after
    dedup requires creating new objects with an original order field, then
    stripping it off again. Ouch.
    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
     
    Roedy Green, Jul 22, 2004
    #1
    1. Advertising

  2. Roedy Green

    Roedy Green Guest

    On Thu, 22 Jul 2004 16:23:08 GMT, Roedy Green
    <> wrote or quoted :

    >However, often you don't want to disturb the original order, just
    >dedup.


    If your objects have an equals and hash method defined in the same way
    you define dups, then you could enter them into a HashMap and detect
    dups that way.

    You could sort a copy of the array and put the dups into a HashMap.
    Then you go through the original array looking up elts it the dup
    HashMap. You delete all but the first one. This means you need
    special hashed objects where you can maintain the first boolean.
    This uses more ram than the first, is slower, and more complicated to
    code and still needs equals and hash to match your dup definition.


    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
     
    Roedy Green, Jul 22, 2004
    #2
    1. Advertising

  3. Roedy Green

    xarax Guest

    "Roedy Green" <> wrote in message
    news:...
    > On Thu, 22 Jul 2004 16:23:08 GMT, Roedy Green
    > <> wrote or quoted :
    >
    > >However, often you don't want to disturb the original order, just
    > >dedup.

    >
    > If your objects have an equals and hash method defined in the same way
    > you define dups, then you could enter them into a HashMap and detect
    > dups that way.
    >
    > You could sort a copy of the array and put the dups into a HashMap.
    > Then you go through the original array looking up elts it the dup
    > HashMap. You delete all but the first one. This means you need
    > special hashed objects where you can maintain the first boolean.
    > This uses more ram than the first, is slower, and more complicated to
    > code and still needs equals and hash to match your dup definition.


    A simple way is to create an array of Integer and
    a Comparator singleton that extracts the intValue()
    from each operand and indexes into the corresponding
    array of objects to determine the relative sort.

    Then walk through the "sorted" Integer[] to delete
    the duplicates (by adding the non-dups to a new
    array).

    public class MyDataComparator
    implements java.util.Comparator
    {
    public int compare(final Object obj1, final Object obj2)
    {
    final Integer ndx1;
    final Integer ndx2;
    final int x1;
    final int x2;
    final MyData md1;
    final MyData md2;
    final int result;

    ndx1 = (Integer) obj1;
    ndx2 = (Integer) obj2;

    x1 = ndx1.intValue();
    x2 = ndx2.intValue();

    md1 = myDataArray[x1];
    md2 = myDataArray[x2];

    result = md1.compareTo(md2);
    return result;
    }
    }

    This assumes that your data object, MyData, implements
    the Comparable interface. You would use the
    Arrays.sort(Object[],java.util.Comparator) method to "sort"
    the Integer array. Then walk through the "sorted" array
    to get the indeces for the sorted elements in the original
    array (myDataArray[]).

    I would probably use an ArrayList initialized to the size of
    the original array to accumulate the non-duplicate elements,
    then use its toArray() method to build the new array.

    I'll leave the rest as an exercise for the reader.

    --
    ----------------------------
    Jeffrey D. Smith
    Farsight Systems Corporation
    24 BURLINGTON DRIVE
    LONGMONT, CO 80501-6906
    http://www.farsight-systems.com
    z/Debug debugs your Systems/C programs running on IBM z/OS for FREE!
     
    xarax, Jul 22, 2004
    #3
  4. Roedy Green

    Roedy Green Guest

    On Thu, 22 Jul 2004 18:50:44 GMT, "xarax" <> wrote or
    quoted :

    >
    >A simple way is to create an array of Integer and
    >a Comparator singleton that extracts the intValue()
    >from each operand and indexes into the corresponding
    >array of objects to determine the relative sort.


    I'd like to encapsulate to dedup any Arraylist of objects, just given
    a Comparator to define what you mean by dup.

    The problem with your approach is the client must write a fairly
    complicated Comparator. Perhaps there would be a way of using a
    simple Comparator to write your Comparator.


    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
     
    Roedy Green, Jul 22, 2004
    #4
  5. Roedy Green

    xarax Guest

    "Roedy Green" <> wrote in message
    news:...
    > On Thu, 22 Jul 2004 18:50:44 GMT, "xarax" <> wrote or
    > quoted :
    >
    > >
    > >A simple way is to create an array of Integer and
    > >a Comparator singleton that extracts the intValue()
    > >from each operand and indexes into the corresponding
    > >array of objects to determine the relative sort.

    >
    > I'd like to encapsulate to dedup any Arraylist of objects, just given
    > a Comparator to define what you mean by dup.
    >
    > The problem with your approach is the client must write a fairly
    > complicated Comparator. Perhaps there would be a way of using a
    > simple Comparator to write your Comparator.


    My comparator was about as simple as it can be. The actual
    comparison is passed off to the object's compareTo() method,
    which is good OO design. So, it's not really necessary to
    downcast from Object to MyData, but rather downcast from
    Object to Comparable. That would make my comparator very
    reusable. All that is needed is the loop that compares
    the "sorted" Integer array to weed out the pair-wise dupes.

    I am sure you could whip something together in a few
    minutes and stick into a utility class somewhere for
    everyone to use.

    ;)

    Cheers
     
    xarax, Jul 22, 2004
    #5
  6. Roedy Green

    Roedy Green Guest

    On Thu, 22 Jul 2004 21:19:04 GMT, "xarax" <> wrote or
    quoted :

    >I am sure you could whip something together in a few
    >minutes and stick into a utility class somewhere for
    >everyone to use.


    Soon as my brain clears I will.

    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
     
    Roedy Green, Jul 22, 2004
    #6
  7. Roedy Green

    Rogan Dawes Guest

    Roedy Green wrote:

    > The typical way you dedup an array is to sort and compare pairwise.
    >
    > However, often you don't want to disturb the original order, just
    > dedup.
    >
    > I wonder if there are clever ways to do this.
    >
    > The obvious solution, sort the items back in original order after
    > dedup requires creating new objects with an original order field, then
    > stripping it off again. Ouch.


    Following a suggestion earlier to use a set/map:

    void dedup(List list) {
    Set set = new HashSet(list);
    Iterator it = list.iterator();
    while (it.hasNext()) {
    Object element = it.next();
    // if we have already removed the entry from the set,
    // this element is a duplicate. Remove it from the list.
    // the iterator returned by the list must support remove()
    // otherwise, we could implement this using list.remove(index)
    if (set.remove(element) == null) it.remove();
    }
    }

    Benefits:

    Short, sweet, understandable ;-)

    Objects do not need to implement Comparable, just provide a meaningful
    equals()

    Rogan

    Disclaimer: This is not a SSCCE ;-) It has not even been syntax checked.
    --
    Rogan Dawes

    *ALL* messages to will be dropped, and added
    to my blacklist. Please respond to "nntp AT dawes DOT za DOT net"
     
    Rogan Dawes, Jul 23, 2004
    #7
  8. Roedy Green

    Rogan Dawes Guest

    Roedy Green wrote:

    > The typical way you dedup an array is to sort and compare pairwise.
    >
    > However, often you don't want to disturb the original order, just
    > dedup.


    One question that this leaves is WHICH duplicate to retain? Do you keep
    the first one you come to? From which direction, 0..n or n..0?

    When the array is sorted, it is irrelevant which one you remove, as the
    order of the elements remains the same. When the elements are unsorted,
    or sorted according to some external order, you effectively change this
    order by removing the dupes.

    Rogan
    --
    Rogan Dawes

    *ALL* messages to will be dropped, and added
    to my blacklist. Please respond to "nntp AT dawes DOT za DOT net"
     
    Rogan Dawes, Jul 23, 2004
    #8
  9. Roedy Green

    Roedy Green Guest

    On Thu, 22 Jul 2004 22:30:24 GMT, Roedy Green
    <> wrote or quoted :

    >>I am sure you could whip something together in a few
    >>minutes and stick into a utility class somewhere for
    >>everyone to use.

    >
    >Soon as my brain clears I will.


    I did a little work on it last night, and realized that the user
    likely wants to be informed of what was removed, perhaps to generate
    error messages. Fine, I can create a little ArrayList of removed dups
    and return that. However, in the process I also create a new
    ArrayList of non-removed dups. What I want to do when I am done is
    replace the array inside the original ArrayList with my the array
    inside by new ArrayList so that my new one will effectively replace
    the caller's old copy.

    I could not see how to do that efficiently. In non oo days it would
    have been a simple matter to just change the reference and the size.

    The solution Java seems to push me toward is ugly -- sliding all the
    elements every time I find a dup with delete.

    I amazes me how complicated code can be to do something you can
    explain in English in a sentence.

    The problem with writing code like this is I want to do it optimally.
    I have no idea the context in which the code will ultimately be used,
    so I feel obligated to do it somewhere close to perfect.


    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
     
    Roedy Green, Jul 23, 2004
    #9
  10. Roedy Green

    xarax Guest

    "Roedy Green" <> wrote in message
    news:...
    > On Thu, 22 Jul 2004 22:30:24 GMT, Roedy Green
    > <> wrote or quoted :
    >
    > >>I am sure you could whip something together in a few
    > >>minutes and stick into a utility class somewhere for
    > >>everyone to use.

    > >
    > >Soon as my brain clears I will.

    >
    > I did a little work on it last night, and realized that the user
    > likely wants to be informed of what was removed, perhaps to generate
    > error messages. Fine, I can create a little ArrayList of removed dups
    > and return that. However, in the process I also create a new
    > ArrayList of non-removed dups. What I want to do when I am done is
    > replace the array inside the original ArrayList with my the array
    > inside by new ArrayList so that my new one will effectively replace
    > the caller's old copy.
    >
    > I could not see how to do that efficiently. In non oo days it would
    > have been a simple matter to just change the reference and the size.
    >
    > The solution Java seems to push me toward is ugly -- sliding all the
    > elements every time I find a dup with delete.
    >
    > I amazes me how complicated code can be to do something you can
    > explain in English in a sentence.
    >
    > The problem with writing code like this is I want to do it optimally.
    > I have no idea the context in which the code will ultimately be used,
    > so I feel obligated to do it somewhere close to perfect.


    Seems like you're stuck with clear() the original
    ArrayList, then iteratively call add() to copy the
    elements of the de-duped array into the original
    ArrayList. The clear() method won't resize the
    internal array, so the add() calls will be fast
    (won't have to incrementally resize the internal
    array).

    There is also the issue of whether caller knows/cares
    whether his specified ArrayList will be modified and
    whether other threads know/care about that. The garbage
    collector will take care of forgotten references.

    I think a simpler design is to just return new ArrayList
    objects to the client. Then the client can decide whether
    to keep both references, or discard one of them. As for
    returning yet another ArrayList of non-removed dupes, I
    would probably make an optional input parameter for that.
    If the parameter is null, then don't bother. Otherwise,
    put the non-removed dupes into the specified ArrayList.

    2 cents worth. Your mileage may vary.
     
    xarax, Jul 23, 2004
    #10
  11. Roedy Green

    Eric Sosman Guest

    Roedy Green wrote:
    > On Thu, 22 Jul 2004 22:30:24 GMT, Roedy Green
    > <> wrote or quoted :
    >
    >
    >>>I am sure you could whip something together in a few
    >>>minutes and stick into a utility class somewhere for
    >>>everyone to use.

    >>
    >>Soon as my brain clears I will.

    >
    >
    > I did a little work on it last night, and realized that the user
    > likely wants to be informed of what was removed, perhaps to generate
    > error messages. Fine, I can create a little ArrayList of removed dups
    > and return that. However, in the process I also create a new
    > ArrayList of non-removed dups. What I want to do when I am done is
    > replace the array inside the original ArrayList with my the array
    > inside by new ArrayList so that my new one will effectively replace
    > the caller's old copy.
    >
    > I could not see how to do that efficiently. In non oo days it would
    > have been a simple matter to just change the reference and the size.
    >
    > The solution Java seems to push me toward is ugly -- sliding all the
    > elements every time I find a dup with delete.


    Move each surviving element once (at most) to its
    position in the final list. Rough outline:

    int survivors = 0;
    for (int i = 0; i < list.size(); ++i) {
    Object obj = list.get(i);
    if (its_a_duplicate(obj)) {
    loser_list.add(obj);
    }
    else {
    if (survivors < i) // optimization
    list.set(survivors, obj);
    ++survivors;
    }
    }

    // There's no truncate() method, alas. If you're
    // extending ArrayList you can use the protected
    // removeRange() instead of this loop.
    // Top-to-bottom sweep avoids useless "sliding."
    for (int i = list.size(); --i >= survivors; )
    list.remove(i);

    list.trimToSize(); // optional

    --
     
    Eric Sosman, Jul 23, 2004
    #11
  12. Roedy Green

    Will Hartung Guest

    "Roedy Green" <> wrote in message
    news:...
    > On Thu, 22 Jul 2004 16:23:08 GMT, Roedy Green
    > <> wrote or quoted :
    >
    > >However, often you don't want to disturb the original order, just
    > >dedup.

    >
    > If your objects have an equals and hash method defined in the same way
    > you define dups, then you could enter them into a HashMap and detect
    > dups that way.
    >
    > You could sort a copy of the array and put the dups into a HashMap.
    > Then you go through the original array looking up elts it the dup
    > HashMap. You delete all but the first one. This means you need
    > special hashed objects where you can maintain the first boolean.
    > This uses more ram than the first, is slower, and more complicated to
    > code and still needs equals and hash to match your dup definition.


    Here ya go, preserves order and everyhing.

    public Collection dedup(Collection c)
    {
    LinkedHashSet result = new LinkedHashSet();
    result.addAll(c);
    return result;
    }

    Regards,

    Will Hartung
    ()
     
    Will Hartung, Jul 23, 2004
    #12
  13. Roedy Green

    Roedy Green Guest

    On Fri, 23 Jul 2004 12:14:26 -0700, "Will Hartung" <>
    wrote or quoted :

    > LinkedHashSet result = new LinkedHashSet();
    > result.addAll(c);
    > return result;


    But if your definition of dup does not match equals, you are hosed.

    For example, in my case I am looking for dups in lists of email
    addresses. Caps don't matter.

    Does your implementation preserve order?

    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
     
    Roedy Green, Jul 23, 2004
    #13
  14. Roedy Green

    Will Hartung Guest

    "Roedy Green" <> wrote in message
    news:...
    > On Fri, 23 Jul 2004 12:14:26 -0700, "Will Hartung" <>
    > wrote or quoted :
    >
    > > LinkedHashSet result = new LinkedHashSet();
    > > result.addAll(c);
    > > return result;

    >
    > But if your definition of dup does not match equals, you are hosed.
    >
    > For example, in my case I am looking for dups in lists of email
    > addresses. Caps don't matter.


    Then use an equals and hashCode that does. It's up to the object to define
    those semantics properly, not the HashSets.

    public class EmailAddress
    {
    private String address;

    public EmailAddress(String address)
    {
    this.address = address;
    }

    public String getAddress()
    {
    return address;
    }

    public boolean equals(Object o)
    {
    if (!o instanceof EmailAddress) {
    return false;
    }
    EmailAddress that = (EmailAddress)o;
    return this.getAddress().equalsIgnoreCase(that.getAddress());
    }

    public int hashCode()
    {
    return address.toLowerCase().hashCode();
    }
    }

    > Does your implementation preserve order?


    Yes, that's the primary purpose of the LinkedHashSet.

    A TreeSet won't work (you might consider it because you can specify a
    Comparator), because it want's to actually sort your collection, which is
    not what you want. You want to preserve order.

    Regards,

    Will Hartung
    ()
     
    Will Hartung, Jul 23, 2004
    #14
  15. Roedy Green

    Roedy Green Guest

    On Fri, 23 Jul 2004 13:11:26 -0700, "Will Hartung" <>
    wrote or quoted :

    >Then use an equals and hashCode that does. It's up to the object to define
    >those semantics properly, not the HashSets.


    To do that you then have to subclass, create new equals/hashcode,
    write an unmaintainable copy constructor to convert the objects, then
    dedup, and perhaps even put them back to their ordinary types.

    You might look for various types of dup, in which case you need to do
    this several times.

    If the definition of equals is correct for deduping in the original
    object, you are golden. If not, this gets pretty clumsy.

    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
     
    Roedy Green, Jul 23, 2004
    #15
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Jason Coyne  Gaijin42

    Word wrap line break code and algorithm for c#

    Jason Coyne Gaijin42, Apr 8, 2004, in forum: ASP .Net
    Replies:
    0
    Views:
    24,391
    Jason Coyne Gaijin42
    Apr 8, 2004
  2. Ahmed Moustafa
    Replies:
    0
    Views:
    817
    Ahmed Moustafa
    Nov 15, 2003
  3. Bapaiah Katepalli
    Replies:
    1
    Views:
    1,533
    Mike Treseler
    Jun 23, 2006
  4. Roedy Green

    Deduping quotations

    Roedy Green, Nov 30, 2009, in forum: Java
    Replies:
    2
    Views:
    377
    Tom Anderson
    Nov 30, 2009
  5. dirknbr

    deduping

    dirknbr, Jun 21, 2010, in forum: Python
    Replies:
    5
    Views:
    301
    Paul Rubin
    Jun 21, 2010
Loading...

Share This Page