Finding duplicates in a file (Newbie question)

Benz

Hi!

Is there a smart way of finding duplicates in a large file?

This is how the file will look:
Col1 Col2 Col3 Col4
1/02/2005 20:06:10.870^F^l0091nd^F^5591^F^793423423^R
1/02/2005 21:06:15.533^F^l0091f3^F^5591^F^793423324^R
1/02/2005 22:12:14.653^F^l0031d6^F^5591^F^793423324^R

The ^F^ is the separator between columns. The file could have up to
140,000 lines, and I need to find out whether there are duplicates in Col4.

I am trying to do this the conventional way, and this is how far I got:
- reading the file using BufferedReader
- tokenizing each line and going to Col4
- taking the value in Col4
I would appreciate any pointers.

- TIA Ben
 
digidigo

You could then iterate over each value and put it into a TreeSet.
Before inserting the next value, test whether the TreeSet already
contains it:

TreeSet set = new TreeSet();

foreach ( foo in COL4 ) {    // pseudocode: loop over the Col4 values
    if ( set.contains(foo) )
        print("Duplicate found: " + foo);
    else
        set.add(foo);        // a Set has add(), not put()
}

Something like that.
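
For the file layout in the original question, the idea could be fleshed out roughly as follows. This is only a sketch under some assumptions: the separator is taken to be the literal three characters ^F^ and the record terminator the literal ^R, and the names Col4Duplicates, col4 and findDuplicates are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class Col4Duplicates {

    // Assumption: fields are separated by the literal text "^F^" and
    // each record ends with the literal text "^R".
    static String col4(String line) {
        String[] fields = line.split("\\^F\\^");
        if (fields.length < 4) {
            return null; // header line or malformed record
        }
        String value = fields[3].trim();
        if (value.endsWith("^R")) {
            value = value.substring(0, value.length() - 2);
        }
        return value;
    }

    // Returns each Col4 value that had already appeared on an earlier line.
    static List<String> findDuplicates(Iterable<String> lines) {
        Set<String> seen = new TreeSet<String>();
        List<String> duplicates = new ArrayList<String>();
        for (String line : lines) {
            String value = col4(line);
            if (value == null) {
                continue;
            }
            if (seen.contains(value)) {
                duplicates.add(value);
            } else {
                seen.add(value);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) throws IOException {
        // 140,000 short lines fit comfortably in memory.
        List<String> lines = new ArrayList<String>();
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        } finally {
            in.close();
        }
        for (String dup : findDuplicates(lines)) {
            System.out.println("Duplicate found: " + dup);
        }
    }
}
```

Run on the three sample lines from the question, this should report 793423324 once, since that value appears on the second and third records.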
 
Gerbrand

digidigo wrote:
You could then iterate over each value and put it into a TreeSet.
Before inserting the next value, test whether the TreeSet already
contains it:

TreeSet set = new TreeSet();

foreach ( foo in COL4 ) {
    if ( set.contains(foo) )
        print("Duplicate found: " + foo);
    else
        set.add(foo);
}

foreach doesn't exist as a keyword in Java, but I think the intent is
pretty clear to the OP. Java 1.5 has an enhanced for loop written with a
colon: for (Object o : COL4)

Instead of
if (set.contains(foo))
print dup found
you can also use
if (!set.add(foo))
System.out.println(..)
It's slightly shorter and faster, since add does the check anyway: it
returns false if the value was already in the set.
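
A minimal sketch of this shortcut, relying on the fact that Set.add (the java.util.Set method that inserts an element) returns false when the element is already present. The class and method names are illustrative only.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

public class AddReturnValue {

    // Returns the first value that occurs a second time, or null if
    // every value is unique.
    static String firstDuplicate(Iterable<String> values) {
        Set<String> seen = new TreeSet<String>();
        for (String value : values) {
            // add() returns false when the value was already in the set,
            // so a single call both tests and inserts.
            if (!seen.add(value)) {
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // prints 793423324
        System.out.println(firstDuplicate(
                Arrays.asList("793423423", "793423324", "793423324")));
    }
}
```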

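Also, for a TreeSet the elements have to be comparable (implement
Comparable, or pass a Comparator to the set); for a HashSet, equals() and
hashCode() have to be defined. If you use Strings all of that is already
the case, otherwise you have to implement them.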
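
To illustrate what a custom key type would need: a HashSet relies on equals() and hashCode(), while a TreeSet relies on compareTo() (or a Comparator). The Account class below is purely hypothetical and not part of the original problem; it is a sketch, not a recommendation to wrap the Col4 values.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

public class SetRequirements {

    // Hypothetical value type wrapping a Col4 number.
    static final class Account implements Comparable<Account> {
        final long number;

        Account(long number) {
            this.number = number;
        }

        // equals() and hashCode() are what a HashSet uses.
        @Override
        public boolean equals(Object o) {
            return o instanceof Account && ((Account) o).number == number;
        }

        @Override
        public int hashCode() {
            return (int) (number ^ (number >>> 32));
        }

        // compareTo() is what a TreeSet uses.
        public int compareTo(Account other) {
            return number < other.number ? -1 : (number == other.number ? 0 : 1);
        }
    }

    // True if the number was already in the set; otherwise records it.
    static boolean isDuplicate(Set<Account> seen, long number) {
        return !seen.add(new Account(number));
    }

    public static void main(String[] args) {
        Set<Account> hashed = new HashSet<Account>();
        Set<Account> sorted = new TreeSet<Account>();
        long[] numbers = {793423423L, 793423324L, 793423324L};
        for (long n : numbers) {
            System.out.println(n + " hash dup=" + isDuplicate(hashed, n)
                    + " tree dup=" + isDuplicate(sorted, n));
        }
    }
}
```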
 
Mark Murphy

If you're not limited to Java, Perl has a sweet little trick for doing it.
You may be able to re-code something like this in Java.

In Perl you have hashes, which are similar to what I understand a Java
Map to be: basically key/value pairs. What you can do in Perl is take the
values you want to count, or find repeats of, and use them as keys. Then
increment the value stored under a key each time you run into it. This
gives you a list of unique strings and how many times each occurs.
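
In Java, the same counting idea could look roughly like the sketch below, with a Map from each value to the number of times it has been seen. The names OccurrenceCounter and countOccurrences are invented for illustration, and this is not a translation of any actual Perl script.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class OccurrenceCounter {

    // Maps each distinct value to the number of times it occurs.
    // LinkedHashMap keeps the keys in first-seen order.
    static Map<String, Integer> countOccurrences(Iterable<String> values) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (String value : values) {
            Integer current = counts.get(value);
            counts.put(value, current == null ? 1 : current + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countOccurrences(
                Arrays.asList("793423423", "793423324", "793423324"));
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > 1) {
                // Any key with a count above 1 is a duplicate.
                System.out.println(entry.getKey() + " occurs "
                        + entry.getValue() + " times");
            }
        }
    }
}
```

Compared with a plain Set, this keeps the count per value, which helps if you need to know how often each duplicate occurs and not just that it exists.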

I whipped something like this up to find word frequencies once. This was
before expanding my mind to Java. People often complain one way or
another, but I think Java and Perl complement each other well; you just
have to be flexible enough to accept that they do things differently.
(What good would choices be if they all did the same thing?)

If you're not limited to a Java solution, let me know and I can send you
the Perl. If you are going to do it in Java, I would be interested in
seeing how you do it.

Mark M
 