Finding duplicates in a file (Newbie question)

Discussion in 'Java' started by Benz, Feb 1, 2005.

  1. Benz

    Benz Guest

    Hi!

    Is there a smart way of finding duplicates in a large file?

    This is how the file will look:
    Col1 Col2 Col3 Col4
    1/02/2005 20:06:10.870^F^l0091nd^F^5591^F^793423423^R
    1/02/2005 21:06:15.533^F^l0091f3^F^5591^F^793423324^R
    1/02/2005 22:12:14.653^F^l0031d6^F^5591^F^793423324^R

    The ^F^ is the field separator. The file could have up to 140,000 lines
    and I need to find out whether there are duplicates in Col4.

    I am trying to do this the conventional way, and this is how far I got:
    - reading the file using BufferedReader
    - tokenizing the line and going to Col4
    - taking the value in Col4.
    I would appreciate any pointers...
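The tokenizing step listed above can be sketched roughly as follows. This is only an illustration: the class name `Col4Parser` and the method `extractCol4` are made up for this example, and it assumes that `^F^` and `^R` appear in the file as literal characters rather than as control characters.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: pulls the fourth column out of one line of the file. */
final class Col4Parser {
    // Assumption: "^F^" and "^R" are literal text in the file, not control codes.
    private static final Pattern FIELD_SEPARATOR = Pattern.compile(Pattern.quote("^F^"));
    private static final String RECORD_END = "^R";

    /** Returns the Col4 value, or null if the line has fewer than four fields. */
    static String extractCol4(String line) {
        String[] fields = FIELD_SEPARATOR.split(line.trim());
        if (fields.length < 4) {
            return null; // e.g. the header line or a malformed record
        }
        String col4 = fields[3];
        // Drop the trailing record terminator if it is present.
        if (col4.endsWith(RECORD_END)) {
            col4 = col4.substring(0, col4.length() - RECORD_END.length());
        }
        return col4.trim();
    }

    public static void main(String[] args) {
        String sample = "1/02/2005 20:06:10.870^F^l0091nd^F^5591^F^793423423^R";
        System.out.println(extractCol4(sample)); // prints 793423423
    }
}
```

Quoting the separator with `Pattern.quote` matters here because `^` has a special meaning in regular expressions.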

    - TIA Ben
     
    Benz, Feb 1, 2005
    #1

  2. Benz

    digidigo Guest

    You could then iterate over each value and put it into a TreeSet, and
    before inserting the next value, test whether the TreeSet already
    contains it:

    TreeSet set = new TreeSet();

    foreach ( foo in COL4 ) {
        if ( set.contains(foo) )
            print("Duplicate found: " + foo);
        else
            set.add(foo);
    }

    Something like that.
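In compilable Java the same idea might look like the sketch below. It assumes the Col4 values have already been collected into a `List<String>` and that a reasonably current JDK is used (generics and the diamond operator); the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical example class for the contains-then-add approach. */
final class TreeSetDuplicates {
    /** Returns every value that was already seen earlier in the list. */
    static List<String> findDuplicates(List<String> col4Values) {
        Set<String> seen = new TreeSet<>();
        List<String> duplicates = new ArrayList<>();
        for (String value : col4Values) {
            if (seen.contains(value)) {
                duplicates.add(value); // second or later occurrence
            } else {
                seen.add(value);       // first occurrence, remember it
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // Col4 values taken from the three sample lines in the question.
        List<String> col4 = Arrays.asList("793423423", "793423324", "793423324");
        System.out.println(findDuplicates(col4)); // prints [793423324]
    }
}
```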
     
    digidigo, Feb 2, 2005
    #2

  3. Benz

    Gerbrand Guest

    digidigo wrote:
    > You could then iterate over each value and put it into a TreeSet, and
    > before inserting the next value, test whether the TreeSet already
    > contains it:
    >
    > TreeSet set = new TreeSet();
    >
    > foreach ( foo in COL4 ) {
    >     if ( set.contains(foo) )
    >         print("Duplicate found: " + foo);
    >     else
    >         set.add(foo);
    > }
    >


    foreach doesn't exist in Java, but I think it's pretty clear to the OP.
    Java 1.5 has a syntax with colons; unfortunately I forgot the exact
    syntax (something like for (Object o : COL4)).

    Instead of
    if (set.contains(foo))
    print dup found
    you can also use
    if (!set.add(foo))
    System.out.println(..)
    It's slightly shorter and faster, since add does the check as well: it
    returns false when the value is already in the set.
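Putting the pieces together, a rough end-to-end sketch could read the lines with a BufferedReader and rely on the return value of `Set.add`, which is `false` when the element was already present. The class and method names are hypothetical, and the code again assumes that `^F^` and `^R` are literal text in the file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;

/** Hypothetical example: scan a whole input for repeated Col4 values. */
final class Col4DuplicateScan {
    /** Returns the Col4 values that occur more than once in the input. */
    static Set<String> duplicatesInCol4(Reader source) {
        Set<String> seen = new TreeSet<>();
        Set<String> duplicates = new TreeSet<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Assumption: "^F^" is a literal separator, so it is quoted for split().
                String[] fields = line.trim().split(Pattern.quote("^F^"));
                if (fields.length < 4) {
                    continue; // skips the header line and malformed records
                }
                String col4 = fields[3].replace("^R", "").trim();
                // add() returns false if the value was already in the set.
                if (!seen.add(col4)) {
                    duplicates.add(col4);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return duplicates;
    }

    public static void main(String[] args) {
        String data = "Col1 Col2 Col3 Col4\n"
                + "1/02/2005 20:06:10.870^F^l0091nd^F^5591^F^793423423^R\n"
                + "1/02/2005 21:06:15.533^F^l0091f3^F^5591^F^793423324^R\n"
                + "1/02/2005 22:12:14.653^F^l0031d6^F^5591^F^793423324^R\n";
        System.out.println(duplicatesInCol4(new StringReader(data))); // prints [793423324]
    }
}
```

For a real file, a `java.io.FileReader` could be passed in place of the `StringReader`. With 140,000 short values the set should fit in memory comfortably on typical hardware.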

    Also, for the TreeSet the elements have to be Comparable (or you pass a
    Comparator); equals() and hashCode() only matter for a HashSet. If you
    use Strings that's already the case, otherwise you have to implement them.
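One detail worth checking when the elements are not Strings: a TreeSet decides whether two elements are the same through `compareTo` (or a Comparator), while a HashSet uses `equals` and `hashCode`. The small sketch below illustrates the difference with a made-up `Row` class that implements Comparable but deliberately leaves `equals` and `hashCode` as inherited from Object.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical demo of how TreeSet and HashSet decide that two elements are equal. */
final class SetEqualityDemo {
    /** Made-up element type: Comparable, but equals/hashCode are NOT overridden. */
    static final class Row implements Comparable<Row> {
        final String col4;

        Row(String col4) {
            this.col4 = col4;
        }

        @Override
        public int compareTo(Row other) {
            return col4.compareTo(other.col4);
        }
    }

    /** True if the set rejects a second Row carrying the same col4 value. */
    static boolean seesDuplicate(Set<Row> set) {
        set.add(new Row("793423324"));
        return !set.add(new Row("793423324"));
    }

    public static void main(String[] args) {
        // TreeSet uses compareTo, which returns 0 for the two rows.
        System.out.println(seesDuplicate(new TreeSet<>())); // prints true
        // HashSet falls back to Object's identity-based equals/hashCode.
        System.out.println(seesDuplicate(new HashSet<>())); // prints false
    }
}
```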
     
    Gerbrand, Feb 2, 2005
    #3
  4. Benz

    Mark Murphy Guest

    If you're not limited to Java, Perl has a sweet little trick for doing it.
    You may be able to re-code something like this in Java.

    In Perl you have hashes, which are similar to what I understand a Java
    Map to be: basically key/value pairs. What you can do in Perl is take the
    values you want to count or check for repeats and use them as keys. Then
    increment the value stored under a key each time you run into it. This
    gives you a list of unique strings and how many times they occur.
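A Java counterpart of this counting trick might look roughly like the sketch below. It assumes Java 8 or later for `Map.merge`, and the class and method names are invented for the example; any value whose count ends up above 1 is a duplicate.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical example: count how often each value occurs, Perl-hash style. */
final class Col4Counter {
    /** Maps each distinct value to the number of times it appears. */
    static Map<String, Integer> countOccurrences(List<String> values) {
        // LinkedHashMap keeps the keys in first-seen order.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String value : values) {
            // Insert 1 for a new key, otherwise add 1 to the existing count.
            counts.merge(value, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> col4 = Arrays.asList("793423423", "793423324", "793423324");
        System.out.println(countOccurrences(col4)); // prints {793423423=1, 793423324=2}
    }
}
```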

    I whipped something like this up to find word frequencies once. This was
    before expanding my mind to Java. People often complain one way or
    another, but I think Java and Perl complement each other well; you just
    have to be flexible enough to accept that they do things differently.
    (What good would choices be if they all did the same thing?)

    If you're not limited to a Java solution then let me know and I can send
    you the Perl. If you are going to do it in Java I would be interested in
    seeing how you do it.

    Mark M
     
    Mark Murphy, Feb 3, 2005
    #4
