comparing files without using arrays

Discussion in 'Java' started by saiji.raman@gmail.com, Feb 16, 2007.

  1. Guest

    hi all,

    got a problem with comparing two large files(A file with 8000
    words and another file of size 8 MB with some content in it.. the
    frequency of the 8000 words in the other file must be found)

    we tried it using arrays..

    but we get array out of bounds exception even after increasing the
    heap space(-Xmx512m) in run time..

    can someone help us out with feasible methods for solving this
    problem..

    Thanks..
    , Feb 16, 2007
    #1
    1. Advertising

  2. denJs Guest

    You can use two arrays of any size as buffers for both files,
    fill the buffers and compare the content in a loop.
    denJs, Feb 16, 2007
    #2
    1. Advertising

  3. On 15 Feb 2007 22:45:48 -0800, wrote:
    > we get array out of bounds exception even after increasing the heap
    > space(-Xmx512m) in run time..


    If you're getting an array bounds exception then your problems aren't
    due to lack of memory, they're caused by attempting to access beyond
    the end of the array (regardless of its size).

    An array of size N has indices 0 to N-1. Attempting to use an index
    outside this range will cause an array bounds error.

    /gordon

    --
    [ don't email me support questions or followups ]
    g o r d o n + n e w s @ b a l d e r 1 3 . s e
    Gordon Beaton, Feb 16, 2007
    #3
  4. On 15 Feb 2007 22:45:48 -0800, wrote:
    > got a problem with comparing two large files(A file with 8000 words
    > and another file of size 8 MB with some content in it.. the
    > frequency of the 8000 words in the other file must be found)


    I've addressed your array bounds exception in another post, but
    thought I'd suggest a couple of things anyway...

    Only the word list needs to be kept in memory, for example using a
    Hashmap or other structure that lets you easily map words to counters.

    The other file can be read and processed one line at a time. You never
    need to hold more than one line's worth of information. From your
    problem description, I can't see a need to read the entire file into
    any structure if that's what you're doing now.

    Read one line, tokenize it into words, then look up each word to
    update the corresponding counter. Repeat with the next line, etc.

    /gordon

    --
    [ don't email me support questions or followups ]
    g o r d o n + n e w s @ b a l d e r 1 3 . s e
    Gordon Beaton, Feb 16, 2007
    #4
  5. Chris Dollin Guest

    wrote:

    > got a problem with comparing two large files(A file with 8000
    > words and another file of size 8 MB with some content in it.. the
    > frequency of the 8000 words in the other file must be found)
    >
    > we tried it using arrays..
    >
    > but we get array out of bounds exception even after increasing the
    > heap space(-Xmx512m) in run time..


    (a) The array-indexing-exception should tell you /where/ it happened.
    Look at that piece of code.

    (b) The exception happens because you're trying to us an index i that
    isn't 0 <= i < array.length. (Note: /less-than/ array.length.)
    Make sure the index makes sense.

    (c) This has nothing to do with how much heap space you've allocated.
    If there wasn't enough room for the arrays, you would have got an
    OutOfMemory error.

    (d) Your summary of what you want to do suggests /strongly/ that arrays
    are not the data structure you want, since you need to associate
    with each word from the word-file another value viz the count of
    times it occurs in the other-file. This to me suggests a different
    data type.

    --
    Chris "electric hedgehog" Dollin
    "It's just the beginning we've seen" - Colosseum, /Tomorrow's Blues/
    Chris Dollin, Feb 16, 2007
    #5
  6. Eric Sosman Guest

    wrote:
    > hi all,
    >
    > got a problem with comparing two large files(A file with 8000
    > words and another file of size 8 MB with some content in it.. the
    > frequency of the 8000 words in the other file must be found)
    >
    > we tried it using arrays..
    >
    > but we get array out of bounds exception even after increasing the
    > heap space(-Xmx512m) in run time..


    You are chasing the wrong fox. ArrayIndexOutOfBoundsException
    means that somewhere in your program there is an array of N things
    (ints, Strings, whatever) and that you have tried to use it with
    an index that is <0 or >=N. It does not mean that your program is
    running low on memory; that one produces OutOfMemoryError.

    > can someone help us out with feasible methods for solving this
    > problem..


    Read the 8000-word file and enter each of its words as
    the key in a Map<String,Integer> with the Integer value zero.
    Then read the other file, word by word. For each word, do
    a get(word) on the Map. If get(word) returns null, the word
    is not one of the 8000 you care about, so ignore it. On the
    other hand if get(word) returns an Integer, add one to that
    value and update the Map: put(word, biggerInteger). At the
    end, traverse the entire Map to retrieve the results.

    --
    Eric Sosman
    lid
    Eric Sosman, Feb 16, 2007
    #6
  7. Eric Sosman wrote:
    ....
    > Read the 8000-word file and enter each of its words as
    > the key in a Map<String,Integer> with the Integer value zero.
    > Then read the other file, word by word. For each word, do
    > a get(word) on the Map. If get(word) returns null, the word
    > is not one of the 8000 you care about, so ignore it. On the
    > other hand if get(word) returns an Integer, add one to that
    > value and update the Map: put(word, biggerInteger). At the
    > end, traverse the entire Map to retrieve the results.
    >


    As an alternative to using Integer, I sometimes use a small Counter
    class that wraps an int with increment() and get() operations. get() is
    defined to return the number of times the Counter's increment() method
    has been called.

    Patricia
    Patricia Shanahan, Feb 16, 2007
    #7
  8. Mark Rafn Guest

    <> wrote:
    > but we get array out of bounds exception even after increasing the
    >heap space(-Xmx512m) in run time..


    ArrayIndexOutOfBoundsException won't be fixed by increasing heap space. It's
    a bug in your code.
    --
    Mark Rafn <http://www.dagon.net/>
    Mark Rafn, Feb 16, 2007
    #8
  9. Alex Hunsley Guest

    denJs wrote:
    > You can use two arrays of any size as buffers for both files,
    > fill the buffers and compare the content in a loop.


    It's not a direct file comparison the OP is after ....

    >
    >
    >
    >
    Alex Hunsley, Feb 16, 2007
    #9
  10. Alex Hunsley Guest

    wrote:
    > hi all,
    >
    > got a problem with comparing two large files(A file with 8000
    > words and another file of size 8 MB with some content in it.. the
    > frequency of the 8000 words in the other file must be found)
    >
    > we tried it using arrays..
    >
    > but we get array out of bounds exception even after increasing the
    > heap space(-Xmx512m) in run time..
    >
    > can someone help us out with feasible methods for solving this
    > problem..


    As the others have said, you have a bug in your program - you're trying
    to access beyond the end of the array, at a guess.

    As for a very basic strategy for your problem, read the 'word' file into
    memory. Put the words into a Set (e.g. HashSet)[1]. Then parse the 8meg
    file - read 10k chunks at a time, for example (more efficient this way),
    splitting each chunk into words, and check if each word in a chunk is in
    the HashSet. If so, increase its count. (So you may want a HashMap, not
    a HashSet, because then it would be easy to map from each word to its
    frequency count.)

    [1] Using HashSet this way gives you much better efficiency than the
    naive way of storing the words in a simple array, and comparing every
    word in the array each time!

    lex
    Alex Hunsley, Feb 16, 2007
    #10
  11. Guest

    thnx for ur suggestions... even i had the plan of opting for hashset
    before.. Actually after finding the freqency array i got to apply
    SVD(singular value decomposition), which needs a 2D array as an
    input.

    and what is that mapset all about??can i convert that into a 2d array
    later by any means???

    for now i hav used the array list concept..and converted that into 2d
    array and hav applied svd(but i cant do this for large files)...
    , Feb 17, 2007
    #11
  12. Guest

    sorry but i would like to know more about what you are trying to do,
    you mentioned a compersion and a word freq. count
    by compersion do you mean a lexigraphic compersion (char by char)? or
    just compering the freq. of occurence of each word?

    for checking the freq. of each word a HashTable(or any other
    datastructure which provide you with the ability to set your own index
    type an value with a corresponding value object) would be the best for
    you
    why? cause you can set the Index of each "Cell" as the Word itself
    (and by that earning efficency)while the value is an Integer which
    represents the number of occurnces (i recommend making a class that
    contains an int memeber and have a method increment() instead of
    creating an Integer instance for each incremetation. as said here
    before).

    inserting - put(Object key, Object value)
    Hashtable numbers = new Hashtable();
    numbers.put("one", new Integer(1));
    numbers.put("two", new Integer(2));
    numbers.put("three", new Integer(3));

    retriving - get(Objectt key)

    Integer n = (Integer)numbers.get("two");

    for a help with the compersion itself i would require more info about
    what you are trying to do and what are your limitations (unless all
    the compersion is of the freq. of the words in both your files)

    here is a ref to hashtable class doc
    http://java.sun.com/j2se/1.5.0/docs/api/java/util/Hashtable.html
    , Feb 17, 2007
    #12
  13. Lew Guest

    wrote:
    > for checking the freq. of each word a HashTable


    if you are in J2ME, HashMap for JSE.

    - Lew
    Lew, Feb 17, 2007
    #13
  14. Lew Guest

    wrote:
    > retriving - get(Objectt key)
    >
    > Integer n = (Integer)numbers.get("two");


    Good exposition. Suggestion: use generics.

    Map<String, Integer> freqs = new HashMap<String, Integer>();
    // or use Map<String, Wrapper> as Boaz.Jan suggested
    ...
    int n = freqs.get( "two" ); // Wrapper w = freqs.get("two");

    With the Wrapper you can use clever idioms like

    Wrapper w;
    if ( (w = freqs.get( word )) == null )
    {
    freqs.put( word, new Wrapper() );
    }
    else
    {
    w.increment();
    }

    - Lew
    Lew, Feb 17, 2007
    #14
  15. Guest

    On Feb 17, 6:07 pm, Lew <> wrote:
    > wrote:
    > > retriving - get(Objectt key)

    >
    > > Integer n = (Integer)numbers.get("two");

    >
    > Good exposition. Suggestion: use generics.
    >
    > Map<String, Integer> freqs = new HashMap<String, Integer>();
    > // or use Map<String, Wrapper> as Boaz.Jan suggested
    > ...
    > int n = freqs.get( "two" ); // Wrapper w = freqs.get("two");
    >
    > With the Wrapper you can use clever idioms like
    >
    > Wrapper w;
    > if ( (w = freqs.get( word )) == null )
    > {
    > freqs.put( word, new Wrapper() );
    > }
    > else
    > {
    > w.increment();
    > }
    >
    > - Lew


    didnt want to enter the concept of generics on a post clearly abit
    less advanced
    so copied the non generic example from the jse api docs about
    hashtable
    and sorry about the Hashtable - old habbit :) (the Dictionary class is
    now obsolete and therefor so is hashtable... sorry again :)http://
    java.sun.com/j2se/1.5.0/docs/api/java/util/Dictionary.html)
    , Feb 17, 2007
    #15
  16. Lew Guest

    wrote:
    > didnt want to enter the concept of generics on a post clearly abit
    > less advanced
    > so copied the non generic example from the jse api docs about


    One could argue that generics are just as easy and less error-prone compared
    to the type casts needed for raw types.

    - Lew
    Lew, Feb 17, 2007
    #16
  17. Chris Uppal Guest

    Lew wrote:

    > > didnt want to enter the concept of generics on a post clearly abit
    > > less advanced
    > > so copied the non generic example from the jse api docs about

    >
    > One could argue that generics are just as easy and less error-prone
    > compared to the type casts needed for raw types.


    It seems reasonable to assume that someone (the OP) who doesn't yet know about
    HashMap and its friends will not yet know about Java generics either.

    -- chris
    Chris Uppal, Feb 17, 2007
    #17
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. David

    comparing 2 arrays

    David, Oct 11, 2005, in forum: ASP .Net
    Replies:
    2
    Views:
    473
    David
    Oct 11, 2005
  2. kj7ny
    Replies:
    2
    Views:
    299
    Yu-Xi Lim
    Dec 29, 2006
  3. Replies:
    4
    Views:
    463
  4. Philipp
    Replies:
    21
    Views:
    1,123
    Philipp
    Jan 20, 2009
  5. sree
    Replies:
    2
    Views:
    81
Loading...

Share This Page