comparing files without using arrays

S

saiji.raman

hi all,

got a problem with comparing two large files(A file with 8000
words and another file of size 8 MB with some content in it.. the
frequency of the 8000 words in the other file must be found)

we tried it using arrays..

but we get array out of bounds exception even after increasing the
heap space(-Xmx512m) in run time..

can someone help us out with feasible methods for solving this
problem..

Thanks..
 
D

denJs

You can use two arrays of any size as buffers for both files,
fill the buffers and compare the content in a loop.
 
G

Gordon Beaton

we get array out of bounds exception even after increasing the heap
space(-Xmx512m) in run time..

If you're getting an array bounds exception then your problems aren't
due to lack of memory, they're caused by attempting to access beyond
the end of the array (regardless of its size).

An array of size N has indices 0 to N-1. Attempting to use an index
outside this range will cause an array bounds error.

/gordon
 
G

Gordon Beaton

got a problem with comparing two large files(A file with 8000 words
and another file of size 8 MB with some content in it.. the
frequency of the 8000 words in the other file must be found)

I've addressed your array bounds exception in another post, but
thought I'd suggest a couple of things anyway...

Only the word list needs to be kept in memory, for example using a
Hashmap or other structure that lets you easily map words to counters.

The other file can be read and processed one line at a time. You never
need to hold more than one line's worth of information. From your
problem description, I can't see a need to read the entire file into
any structure if that's what you're doing now.

Read one line, tokenize it into words, then look up each word to
update the corresponding counter. Repeat with the next line, etc.

/gordon
 
C

Chris Dollin

got a problem with comparing two large files(A file with 8000
words and another file of size 8 MB with some content in it.. the
frequency of the 8000 words in the other file must be found)

we tried it using arrays..

but we get array out of bounds exception even after increasing the
heap space(-Xmx512m) in run time..

(a) The array-indexing-exception should tell you /where/ it happened.
Look at that piece of code.

(b) The exception happens because you're trying to us an index i that
isn't 0 <= i < array.length. (Note: /less-than/ array.length.)
Make sure the index makes sense.

(c) This has nothing to do with how much heap space you've allocated.
If there wasn't enough room for the arrays, you would have got an
OutOfMemory error.

(d) Your summary of what you want to do suggests /strongly/ that arrays
are not the data structure you want, since you need to associate
with each word from the word-file another value viz the count of
times it occurs in the other-file. This to me suggests a different
data type.
 
E

Eric Sosman

hi all,

got a problem with comparing two large files(A file with 8000
words and another file of size 8 MB with some content in it.. the
frequency of the 8000 words in the other file must be found)

we tried it using arrays..

but we get array out of bounds exception even after increasing the
heap space(-Xmx512m) in run time..

You are chasing the wrong fox. ArrayIndexOutOfBoundsException
means that somewhere in your program there is an array of N things
(ints, Strings, whatever) and that you have tried to use it with
can someone help us out with feasible methods for solving this
problem..

Read the 8000-word file and enter each of its words as
the key in a Map<String,Integer> with the Integer value zero.
Then read the other file, word by word. For each word, do
a get(word) on the Map. If get(word) returns null, the word
is not one of the 8000 you care about, so ignore it. On the
other hand if get(word) returns an Integer, add one to that
value and update the Map: put(word, biggerInteger). At the
end, traverse the entire Map to retrieve the results.
 
P

Patricia Shanahan

Eric Sosman wrote:
....
Read the 8000-word file and enter each of its words as
the key in a Map<String,Integer> with the Integer value zero.
Then read the other file, word by word. For each word, do
a get(word) on the Map. If get(word) returns null, the word
is not one of the 8000 you care about, so ignore it. On the
other hand if get(word) returns an Integer, add one to that
value and update the Map: put(word, biggerInteger). At the
end, traverse the entire Map to retrieve the results.

As an alternative to using Integer, I sometimes use a small Counter
class that wraps an int with increment() and get() operations. get() is
defined to return the number of times the Counter's increment() method
has been called.

Patricia
 
M

Mark Rafn

but we get array out of bounds exception even after increasing the
heap space(-Xmx512m) in run time..

ArrayIndexOutOfBoundsException won't be fixed by increasing heap space. It's
a bug in your code.
 
A

Alex Hunsley

denJs said:
You can use two arrays of any size as buffers for both files,
fill the buffers and compare the content in a loop.

It's not a direct file comparison the OP is after ....
 
A

Alex Hunsley

hi all,

got a problem with comparing two large files(A file with 8000
words and another file of size 8 MB with some content in it.. the
frequency of the 8000 words in the other file must be found)

we tried it using arrays..

but we get array out of bounds exception even after increasing the
heap space(-Xmx512m) in run time..

can someone help us out with feasible methods for solving this
problem..

As the others have said, you have a bug in your program - you're trying
to access beyond the end of the array, at a guess.

As for a very basic strategy for your problem, read the 'word' file into
memory. Put the words into a Set (e.g. HashSet)[1]. Then parse the 8meg
file - read 10k chunks at a time, for example (more efficient this way),
splitting each chunk into words, and check if each word in a chunk is in
the HashSet. If so, increase its count. (So you may want a HashMap, not
a HashSet, because then it would be easy to map from each word to its
frequency count.)

[1] Using HashSet this way gives you much better efficiency than the
naive way of storing the words in a simple array, and comparing every
word in the array each time!

lex
 
S

saiji.raman

thnx for ur suggestions... even i had the plan of opting for hashset
before.. Actually after finding the freqency array i got to apply
SVD(singular value decomposition), which needs a 2D array as an
input.

and what is that mapset all about??can i convert that into a 2d array
later by any means???

for now i hav used the array list concept..and converted that into 2d
array and hav applied svd(but i cant do this for large files)...
 
B

Boaz.Jan

sorry but i would like to know more about what you are trying to do,
you mentioned a compersion and a word freq. count
by compersion do you mean a lexigraphic compersion (char by char)? or
just compering the freq. of occurence of each word?

for checking the freq. of each word a HashTable(or any other
datastructure which provide you with the ability to set your own index
type an value with a corresponding value object) would be the best for
you
why? cause you can set the Index of each "Cell" as the Word itself
(and by that earning efficency)while the value is an Integer which
represents the number of occurnces (i recommend making a class that
contains an int memeber and have a method increment() instead of
creating an Integer instance for each incremetation. as said here
before).

inserting - put(Object key, Object value)
Hashtable numbers = new Hashtable();
numbers.put("one", new Integer(1));
numbers.put("two", new Integer(2));
numbers.put("three", new Integer(3));

retriving - get(Objectt key)

Integer n = (Integer)numbers.get("two");

for a help with the compersion itself i would require more info about
what you are trying to do and what are your limitations (unless all
the compersion is of the freq. of the words in both your files)

here is a ref to hashtable class doc
http://java.sun.com/j2se/1.5.0/docs/api/java/util/Hashtable.html
 
L

Lew

retriving - get(Objectt key)

Integer n = (Integer)numbers.get("two");

Good exposition. Suggestion: use generics.

Map<String, Integer> freqs = new HashMap<String, Integer>();
// or use Map<String, Wrapper> as Boaz.Jan suggested
...
int n = freqs.get( "two" ); // Wrapper w = freqs.get("two");

With the Wrapper you can use clever idioms like

Wrapper w;
if ( (w = freqs.get( word )) == null )
{
freqs.put( word, new Wrapper() );
}
else
{
w.increment();
}

- Lew
 
B

Boaz.Jan

Good exposition. Suggestion: use generics.

Map<String, Integer> freqs = new HashMap<String, Integer>();
// or use Map<String, Wrapper> as Boaz.Jan suggested
...
int n = freqs.get( "two" ); // Wrapper w = freqs.get("two");

With the Wrapper you can use clever idioms like

Wrapper w;
if ( (w = freqs.get( word )) == null )
{
freqs.put( word, new Wrapper() );
}
else
{
w.increment();
}

- Lew

didnt want to enter the concept of generics on a post clearly abit
less advanced
so copied the non generic example from the jse api docs about
hashtable
and sorry about the Hashtable - old habbit :) (the Dictionary class is
now obsolete and therefor so is hashtable... sorry again :)http://
java.sun.com/j2se/1.5.0/docs/api/java/util/Dictionary.html)
 
L

Lew

didnt want to enter the concept of generics on a post clearly abit
less advanced
so copied the non generic example from the jse api docs about

One could argue that generics are just as easy and less error-prone compared
to the type casts needed for raw types.

- Lew
 
C

Chris Uppal

Lew said:
One could argue that generics are just as easy and less error-prone
compared to the type casts needed for raw types.

It seems reasonable to assume that someone (the OP) who doesn't yet know about
HashMap and its friends will not yet know about Java generics either.

-- chris
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,576
Members
45,054
Latest member
LucyCarper

Latest Threads

Top