Find duplicates

Stern · Oct 20, 2003

Hello,

I have a Vector of objects, each of them having a name. Now I need to
find in this Vector objects with the same names, get the list of them
and remove them from the vector. I don't know which names these are.
Is there any quick solution for the problem?

Thanx,

Stern.

Michael Meyer · Oct 20, 2003

Hi,

I cannot give you a quicksort for your problem but maybe you can use
java.util.Set. A Set does not allow duplicate entries.

Michael

Matt Humphrey · Oct 20, 2003

Stern said:
Hello,

I have a Vector of objects, each of them having a name. Now I need to
find in this Vector objects with the same names, get the list of them
and remove them from the vector. I don't know which names these are.
Is there any quick solution for the problem?

I don't think there's a quick way, unless your pre-index your vector so that
you already have the objects indexed by name (e.g. a Map that maps a name to
a list of those items that share that name.) Then you just drop the lists
that have > 1 item.

To do it on-the-fly, create the following data structures:
1) a hash map "uniqueNames" that will map a string to an item for an item
that has been seen for the first time.
2) a set "multiNames" that will contain the names of items that have been
seen more than once
3) a list "deleteItems" that will contain items to be deleted.

For each item in the vector {
if the item's name is in the multiNames set,
add the item to the deleteItems list.
else if the item's name is a key in the uniqueNames hash map,
add the item to the deleteItems list
remove the item name from the uniqueNames hash table
add the item's name ot the multiNames set
else // the item has never been seen
add the name / item to the uniqueNames hash set
}
Walk the deleteItems list to delete the items.

If you have to do this alot I would really recommend just keeping a
name-based index to the list of shared items as outlined above.

Cheers,
Matt Humphrey (e-mail address removed) http://www.iviz.com/

Chris Smith · Oct 20, 2003

Matt said:
To do it on-the-fly, create the following data structures:
1) a hash map "uniqueNames" that will map a string to an item for an item
that has been seen for the first time.
2) a set "multiNames" that will contain the names of items that have been
seen more than once
3) a list "deleteItems" that will contain items to be deleted.

For each item in the vector {
if the item's name is in the multiNames set,
add the item to the deleteItems list.
else if the item's name is a key in the uniqueNames hash map,
add the item to the deleteItems list
remove the item name from the uniqueNames hash table
add the item's name ot the multiNames set
else // the item has never been seen
add the name / item to the uniqueNames hash set
}
Walk the deleteItems list to delete the items.

To make this a but simpler, recall that Vector implements List, and
provides an Iterator. An Iterator can be used to remove elements of the
Vector without disrupting the traversal. So the process becomes:

Create the following data structures:

1) A set "names" that will contain the names that have been seen

For each item in the vector (using an iterator) {
if the item's name is in the "names" set,
remove the item using the iterator
else
add the item's name to the "names" set
}

--
www.designacourse.com
The Easiest Way to Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation

Chris Smith · Oct 20, 2003

Chris said:
To make this a but simpler, recall that Vector implements List, and
provides an Iterator. An Iterator can be used to remove elements of the
Vector without disrupting the traversal. So the process becomes:

[...]

Actually, I didn't realize what you were doing. You were deleting *all*
copies of duplicates, where I assumed that the task was to delete the
duplicate copies but leave one copy in. I guess the OP needs to decide
which of us has it right, and implement the appropriate task either way.

--
www.designacourse.com
The Easiest Way to Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation

Marco Schmidt · Oct 20, 2003

Stern:

I have a Vector of objects, each of them having a name. Now I need to
find in this Vector objects with the same names, get the list of them
and remove them from the vector. I don't know which names these are.
Is there any quick solution for the problem?

Sort the Vector with Collections.sort, then iterate over the entries
and remove duplicates. It's easy with the sorted Vector because
duplicates will obviously be adjacent. Something like this (v is your
Vector):

Iterator iter = v.iterator();
Object prev = null;
while (iter.hasNext()) {
Object obj = iter.next();
if (prev != null && obj.equals(prev)) {
iter.remove();
} else {
prev = obj;
}
}

Regards,
Marco

Filip Larsen · Oct 21, 2003

Stern wrote

I have a Vector of objects, each of them having a name. Now I need to
find in this Vector objects with the same names, get the list of them
and remove them from the vector. I don't know which names these are.
Is there any quick solution for the problem?

If we assume you want to remove duplicated entries only (that is, if an
object with same name appears twice, only the second object is removed), and
that the equals and hashCode method of the objects are defined mapped to the
name of the object, then you should be able to use

Vector x = ...
Vector xWithoutDuplicates = new Vector(new LinkedHashSet(x));

If you don't want a copy, but rather have the x vector be replaced you could
do

Vector x = ...
Collection uniques = new LinkedHashSet(x);
x.clear();
x.addAll(uniques);

Both of these should execute reasonably fast (comparable to the other
methods proposed in this thread, although I haven't made an actual test
run), but uses slightly more memory than needed due to use of LinkedHashSet.
If the order of the elements in the vector doesn't matter, you can use
HashSet instead and save some space.

Regards,

Roedy Green · Oct 21, 2003

I don't know which names these are.
Is there any quick solution for the problem?

deDupKeepingFirst or deDupKeepingLast

See http://mindprod.com/products.html#SORTED

Stern · Oct 21, 2003

Actually, I didn't realize what you were doing. You were deleting *all*

copies of duplicates, where I assumed that the task was to delete the
duplicate copies but leave one copy in. I guess the OP needs to decide
which of us has it right, and implement the appropriate task either way.

I really want to leave one copy in, so your approach is OK. I'll try
yours and that one of Filip Larsen. Thanx.

Phil... · Oct 22, 2003

How do you determine duplicate? Is it the name
is a string and a duplicate entry has the same string?
Or is it that the name references of the two entries
are equal?

Michael Borgwardt · Oct 22, 2003

Phil... said:
How do you determine duplicate?

By using the equals() method that every object has. Note that depending
on the Set implementation class, a consistent implementation of the
compareTo() or hashCode() methods may also be necessary.

So if your objects implement these methods correctly (by delegating
them to the name String object), it should work as intended.

Roedy Green · Oct 22, 2003

By using the equals() method that every object has.

Deduping in not quite the same as removing equal elements. Sometimes
your definition for deduping would require a different equal method,
..e.g. Dedup files with equal names, equal names ignoring case, or
dedup files with equal names and timestamp,
or equal sizes ...

The way I handle it in deDupKeepingFirst is to use a variable
comparator, the same one I use to sort the elements before checking
adjacents for equality.

see http://mindprod.com/jgloss/products.html#SORTED
for source.

jyotidayal · Dec 5, 2010

I ran into a problem of duplicate files and someone told me about duplicates-finder.com

remove duplicates?	4	Sep 5, 2011
TF-IDF	1	Aug 19, 2021
Find and count strings of text from multiple files	17	Dec 16, 2021
Select Eof extension files based on text list of filenames with if condition	0	May 4, 2022
duplicates	15	May 29, 2007
Select files based on text list of filenames(part of the name:date) with condition	0	May 4, 2022
GET NEIL DEGRASSES TYSON, I ripped a hole with this one...	0	Nov 10, 2022
cannot find symbol?	1	May 5, 2013

Find duplicates

Stern

Michael Meyer

Matt Humphrey

Chris Smith

Chris Smith

Marco Schmidt

Filip Larsen

Roedy Green

Stern

Phil...

Michael Borgwardt

Roedy Green

jyotidayal

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads