Finding duplicates in a file (Newbie question)

Benz · Feb 1, 2005

Hi!

Is there a smart way of finding duplicates in a large file?

This is how the file will look:
Col1 Col2 Col3 Col4
1/02/2005 20:06:10.870^F^l0091nd^F^5591^F^793423423^R
1/02/2005 21:06:15.533^F^l0091f3^F^5591^F^793423324^R
1/02/2005 22:12:14.653^F^l0031d6^F^5591^F^793423324^R

The ^F^ is the field separator. The file could have up to 140,000 lines,
and I need to find out whether there are duplicates in Col4.

I am trying to do this the conventional way, and this is how far I got:
- reading the file using BufferedReader
- tokenizing each line and going to Col4
- taking the value in Col4
I would appreciate any pointers.

- TIA Ben
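
For readers following along, the three steps listed in the question could be sketched roughly as below. This is only an illustration under stated assumptions: the class name Col4Reader and its methods are hypothetical, and it assumes that "^F^" and "^R" appear in the file as literal characters exactly as in the sample (if they are really control characters, the two constants would need to change).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; names and layout are illustrative, not from the thread.
class Col4Reader {
    // Assumption: the separator and record terminator are literal text.
    static final String FIELD_SEP = "^F^";
    static final String RECORD_END = "^R";

    // Returns the fourth field of a line, or null if the line has fewer than four fields.
    static String col4(String line) {
        String s = line.trim();
        if (s.endsWith(RECORD_END)) {
            s = s.substring(0, s.length() - RECORD_END.length());
        }
        // Pattern.quote makes '^' a literal character instead of a regex anchor.
        String[] fields = s.split(Pattern.quote(FIELD_SEP), -1);
        return fields.length >= 4 ? fields[3] : null;
    }

    // Reads every line and collects the Col4 values, skipping lines without four fields.
    static List<String> readCol4(BufferedReader in) throws IOException {
        List<String> values = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            String value = col4(line);
            if (value != null) {
                values.add(value);
            }
        }
        return values;
    }
}
```

A header line such as "Col1 Col2 Col3 Col4" contains no separator, so this sketch skips it.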

digidigo · Feb 2, 2005

You could then iterate over each value and put it into a TreeSet.
Before inserting the next value, test whether the TreeSet already
contains it:

// pseudocode: COL4 stands for the values read from column 4
TreeSet set = new TreeSet();

foreach ( foo in COL4 ) {
    if ( set.contains(foo) )
        print("Duplicate found: " + foo);
    else
        set.add(foo);
}

Something like that.
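
A runnable version of that idea might look like the sketch below. The class name DuplicateFinder and the method findDuplicates are made up for illustration, and the sample values are the Col4 entries from the question; it assumes the column values have already been extracted into a collection of Strings.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical class name; a sketch of the contains-then-add approach.
class DuplicateFinder {
    // Returns every value that occurs more than once, in order of first repeat.
    static Set<String> findDuplicates(Iterable<String> values) {
        Set<String> seen = new TreeSet<>();
        Set<String> duplicates = new LinkedHashSet<>();
        for (String value : values) {
            if (seen.contains(value)) {
                duplicates.add(value);
            } else {
                seen.add(value);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // Col4 values taken from the three sample lines in the question.
        List<String> col4 = Arrays.asList("793423423", "793423324", "793423324");
        for (String dup : findDuplicates(col4)) {
            System.out.println("Duplicate found: " + dup);
        }
    }
}
```

With 140,000 lines, the set holds at most 140,000 short strings, which should fit comfortably in memory on most machines.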

Gerbrand · Feb 2, 2005

digidigo wrote:

You could then iterate over each value and put it into a TreeSet.
Before inserting the next value, test whether the TreeSet already
contains it:

TreeSet set = new TreeSet();

foreach ( foo in COL4 ) {
    if ( set.contains(foo) )
        print("Duplicate found: " + foo);
    else
        set.add(foo);
}

foreach doesn't exist in Java, but I think the idea is pretty clear for
the OP. Java 1.5 has a for-each syntax with a colon; unfortunately I
forgot the exact syntax (something like for (Object o : COL4)).

Instead of
if (set.contains(foo))
    print dup found
you can also use
if (!set.add(foo))
    System.out.println(..)
It's slightly shorter and faster, since add does the check as well and
returns false if the value was already in the set.

Also, for a TreeSet the elements have to be Comparable (or you pass in a
Comparator); for a HashSet, equals() and hashCode() have to be defined.
If you use Strings, all of that is already the case; otherwise you have
to implement them.
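
The single-call variant could be sketched as follows, relying on the documented behaviour of java.util.Set.add, which returns false when the element is already present. The class name AddReturnValue and the method repeats are hypothetical, and a HashSet is used here instead of a TreeSet on the assumption that sorted order is not needed.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical class name; shows duplicate detection with Set.add alone.
class AddReturnValue {
    // Collects every repeated occurrence without a separate contains() call.
    static List<String> repeats(Iterable<String> values) {
        Set<String> seen = new HashSet<>();
        List<String> repeats = new ArrayList<>();
        for (String value : values) {
            // add() returns false if the value was already in the set.
            if (!seen.add(value)) {
                repeats.add(value);
            }
        }
        return repeats;
    }
}
```

Note that a value seen three times is reported twice here, once for each repeat after the first occurrence.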

Mark Murphy · Feb 3, 2005

If you're not limited to Java, Perl has a sweet little trick for doing it.
You may be able to re-code something like this in Java.

In Perl you have hashes, which are similar to what I understand a Java
Map to be: basically key-value pairs. What you can do in Perl is take
the values you want to count or check for repeats and use them as keys.
Then increment the value stored under a key each time you run into it.
This gives you a list of unique strings and how many times each occurs.

I whipped something like this up to find word frequencies once. This was
before expanding my mind to Java. People often complain one way or
another, but I think Java and Perl complement each other well; you just
have to be flexible enough to accept that they do things differently.
(What good would choices be if they all did the same thing?)

If you're not limited to a Java solution, let me know and I can send you
the Perl. If you are going to do it in Java, I would be interested in
seeing how you do it.

Mark M
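
The counting trick described in this post translates fairly directly to a Java Map. The sketch below is one possible rendering, not Mark's Perl; the class name OccurrenceCounter and the method count are hypothetical, and a LinkedHashMap is chosen only so that keys keep their first-seen order.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical class name; counts how often each value occurs.
class OccurrenceCounter {
    static Map<String, Integer> count(Iterable<String> values) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String value : values) {
            // Use the value as the key and increment its stored count.
            Integer previous = counts.get(value);
            counts.put(value, previous == null ? 1 : previous + 1);
        }
        return counts;
    }
}
```

Any key whose count is greater than 1 is a duplicate, so this gives both the duplicates and how many times each one appears.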
