Find duplicates

S

Stern

Hello,

I have a Vector of objects, each of them having a name. Now I need to
find in this Vector objects with the same names, get the list of them
and remove them from the vector. I don't know which names these are.
Is there any quick solution for the problem?

Thanx,

Stern.
 
M

Michael Meyer

Hi,

I cannot give you a quicksort for your problem but maybe you can use
java.util.Set. A Set does not allow duplicate entries.

Michael
 
M

Matt Humphrey

Stern said:
Hello,

I have a Vector of objects, each of them having a name. Now I need to
find in this Vector objects with the same names, get the list of them
and remove them from the vector. I don't know which names these are.
Is there any quick solution for the problem?

I don't think there's a quick way, unless your pre-index your vector so that
you already have the objects indexed by name (e.g. a Map that maps a name to
a list of those items that share that name.) Then you just drop the lists
that have > 1 item.

To do it on-the-fly, create the following data structures:
1) a hash map "uniqueNames" that will map a string to an item for an item
that has been seen for the first time.
2) a set "multiNames" that will contain the names of items that have been
seen more than once
3) a list "deleteItems" that will contain items to be deleted.

For each item in the vector {
if the item's name is in the multiNames set,
add the item to the deleteItems list.
else if the item's name is a key in the uniqueNames hash map,
add the item to the deleteItems list
remove the item name from the uniqueNames hash table
add the item's name ot the multiNames set
else // the item has never been seen
add the name / item to the uniqueNames hash set
}
Walk the deleteItems list to delete the items.

If you have to do this alot I would really recommend just keeping a
name-based index to the list of shared items as outlined above.

Cheers,
Matt Humphrey (e-mail address removed) http://www.iviz.com/
 
C

Chris Smith

Matt said:
To do it on-the-fly, create the following data structures:
1) a hash map "uniqueNames" that will map a string to an item for an item
that has been seen for the first time.
2) a set "multiNames" that will contain the names of items that have been
seen more than once
3) a list "deleteItems" that will contain items to be deleted.

For each item in the vector {
if the item's name is in the multiNames set,
add the item to the deleteItems list.
else if the item's name is a key in the uniqueNames hash map,
add the item to the deleteItems list
remove the item name from the uniqueNames hash table
add the item's name ot the multiNames set
else // the item has never been seen
add the name / item to the uniqueNames hash set
}
Walk the deleteItems list to delete the items.

To make this a but simpler, recall that Vector implements List, and
provides an Iterator. An Iterator can be used to remove elements of the
Vector without disrupting the traversal. So the process becomes:

Create the following data structures:

1) A set "names" that will contain the names that have been seen

For each item in the vector (using an iterator) {
if the item's name is in the "names" set,
remove the item using the iterator
else
add the item's name to the "names" set
}

--
www.designacourse.com
The Easiest Way to Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 
C

Chris Smith

Chris said:
To make this a but simpler, recall that Vector implements List, and
provides an Iterator. An Iterator can be used to remove elements of the
Vector without disrupting the traversal. So the process becomes:
[...]

Actually, I didn't realize what you were doing. You were deleting *all*
copies of duplicates, where I assumed that the task was to delete the
duplicate copies but leave one copy in. I guess the OP needs to decide
which of us has it right, and implement the appropriate task either way.

--
www.designacourse.com
The Easiest Way to Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 
M

Marco Schmidt

Stern:
I have a Vector of objects, each of them having a name. Now I need to
find in this Vector objects with the same names, get the list of them
and remove them from the vector. I don't know which names these are.
Is there any quick solution for the problem?

Sort the Vector with Collections.sort, then iterate over the entries
and remove duplicates. It's easy with the sorted Vector because
duplicates will obviously be adjacent. Something like this (v is your
Vector):

Iterator iter = v.iterator();
Object prev = null;
while (iter.hasNext()) {
Object obj = iter.next();
if (prev != null && obj.equals(prev)) {
iter.remove();
} else {
prev = obj;
}
}

Regards,
Marco
 
F

Filip Larsen

Stern wrote
I have a Vector of objects, each of them having a name. Now I need to
find in this Vector objects with the same names, get the list of them
and remove them from the vector. I don't know which names these are.
Is there any quick solution for the problem?

If we assume you want to remove duplicated entries only (that is, if an
object with same name appears twice, only the second object is removed), and
that the equals and hashCode method of the objects are defined mapped to the
name of the object, then you should be able to use

Vector x = ...
Vector xWithoutDuplicates = new Vector(new LinkedHashSet(x));

If you don't want a copy, but rather have the x vector be replaced you could
do

Vector x = ...
Collection uniques = new LinkedHashSet(x);
x.clear();
x.addAll(uniques);

Both of these should execute reasonably fast (comparable to the other
methods proposed in this thread, although I haven't made an actual test
run), but uses slightly more memory than needed due to use of LinkedHashSet.
If the order of the elements in the vector doesn't matter, you can use
HashSet instead and save some space.


Regards,
 
S

Stern

Actually, I didn't realize what you were doing. You were deleting *all*
copies of duplicates, where I assumed that the task was to delete the
duplicate copies but leave one copy in. I guess the OP needs to decide
which of us has it right, and implement the appropriate task either way.

I really want to leave one copy in, so your approach is OK. I'll try
yours and that one of Filip Larsen. Thanx.
 
P

Phil...

How do you determine duplicate? Is it the name
is a string and a duplicate entry has the same string?
Or is it that the name references of the two entries
are equal?
 
M

Michael Borgwardt

Phil... said:
How do you determine duplicate?

By using the equals() method that every object has. Note that depending
on the Set implementation class, a consistent implementation of the
compareTo() or hashCode() methods may also be necessary.

So if your objects implement these methods correctly (by delegating
them to the name String object), it should work as intended.
 
R

Roedy Green

By using the equals() method that every object has.

Deduping in not quite the same as removing equal elements. Sometimes
your definition for deduping would require a different equal method,
..e.g. Dedup files with equal names, equal names ignoring case, or
dedup files with equal names and timestamp,
or equal sizes ...

The way I handle it in deDupKeepingFirst is to use a variable
comparator, the same one I use to sort the elements before checking
adjacents for equality.

see http://mindprod.com/jgloss/products.html#SORTED
for source.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,579
Members
45,053
Latest member
BrodieSola

Latest Threads

Top