kubek said:
Happy Easter, by the way
And to you too ;-)
No, I didn't drop this thread, problem is still not solved and I'm
still looking for the easiest and quite optimal solution
Just a word of warning. There are several possible good reasons for wanting to
do this yourself, rather than using some spelling checking package that you can
find on the Web -- you may want to learn, for instance, or it may be that the
pre-existing packages don't work the way you want them to. But if your reasons
are that you haven't looked to see what people have already made available, or
that you have looked but it seems too much effort to learn how to use them,
then those are Very Bad Reasons Indeed -- this is not going to be very easy no
matter how much we try to keep it simple. You just have too much data to deal
with for simple solutions to work.
Another warning: you are reading (I guess) in comp.lang.java.help, but I'm
writing in comp.lang.java.programmer -- where more experienced programmers hang
out. So if anything I say doesn't make sense, it's because I'm not very good
at simple explanations, so you'll have to ask for clarification.
A third warning: this is going to be a long post...
Could You elaborate this solution with byte[] array and offset array
with a little more details? Because I didn't get the idea...
OK. Have you got the idea that your main problem is that you can't afford to
create 2.6 million String objects /at the same time/ ? If not then we'll have
to take a step back, but assuming you're happy with that, the problem becomes:
how do we represent the same /data/ as would be contained in all those objects,
but in a smaller space ? One way is to pack all the information into a single
array.
Instead of having lots of String objects like
[cat]
[dog]
[squid]
[toad]
which are all separate objects, we could just have one array with the same
characters in it.
[cat_dog_squid_toad_]
(for now just assume that the array contains characters). I'm using an
underscore (_) to mark the end of each word. In reality there are better
choices of marker, such as using a zero (not the character '0', but the
numerical value zero). That solves our problem with having too many String
objects, but it has a different problem. Namely that we don't know what the
words are in it -- which is rather awkward if we are going to use it for
spell-checking!
We (as people) can see what the words are, but the computer can't. So we need
a way to tell it. So we create a second array, this one contains the index of
the start of each word in the big array.
[0 4 8 14]
Now, if we want to know what the third word is, then we find the third entry in
the array of starting positions (which is 8), and find the corresponding data
in the big array (which is an 's') and scan along till we find the _ marking
the end of that word. So we have found the characters:
s q u i d
from which we could create a String if we wanted to.
BTW, you may have noticed that the end-markers are unnecessary here, since we
already know (from our array of starting positions) where the next word starts.
So we could actually represent the data as:
[catdogsquidtoad]
[0 3 6 11]
That would be more compact, but it makes something we'll come to later a little
more awkward, so we won't do that here (though it's still a good idea).
I've been glossing over whether the big array contains chars or bytes. In fact
(in this case) it would be better for it to contain bytes, since that will only
take half the space. How to do that ? Well this is where we need to know how
the data is stored in the file of words. (I did ask before but you didn't
answer). I think it's likely (but not certain) that it's stored with one byte
per character (rather than as some form of Unicode encoding). I hope so,
because that makes things simpler, and that's what I'm going to assume until
you tell me differently.
If the words are stored in that way, then we don't need to do any conversion --
we want to represent each character as 1 byte, and that's how they are in the
file.
On that assumption, you can just open a FileInputStream on it, and use that to
fill in a byte array. Something roughly like:
File file = new File("big-word-list.txt");
InputStream in = new FileInputStream(file);
long size = file.length();
int todo = (int)size;
byte[] bytes = new byte[todo];
int done = 0;
while (todo > 0)
{
int got = in.read(bytes, done, todo);
if (got < 0)
// Eek!
todo -= got;
done += got;
}
file.close();
(With error checking added, sanity checks on the file size and so on).
Normally I would say that measuring the file size and then filling in the array
in separate steps is a bad idea, but in this case (when we are dealing with so
much data), I think it's a worthwhile shortcut to avoid creating /two/ 30 MByte
arrays temporarily.
Once it's read in, the bytes with values the same as the numeric values of
chars '\n' and '\r' are the line terminators. And you can use that to create
your array of word positions.
You could wrap it all up in an object which (to the outside world) looked like
a java.util.List<String> if you want. Even if you don't mess with the
Collections interfaces, it still a good idea to make it into an object. E.g,
at a very minimum:
class WordList
{
WordList(String filename) { .... }
int size() { ... }
String get(int n) { ... }
boolean contains(String word) { ... }
}
To convert a word in the byte array, look into the String constructor:
public String(byte[] bytes,
int offset,
int length,
String charsetName)
which you would use to create a temporary String to be the value returned by
WordList.get(). The charset name depends on how the wordlist is stored.
There's a fair chance that it would be "ISO-8859-2" or perhaps "windows 1250"
(which I /think/ are the code pages (or charsets) used for Polish)
Similarly you would use the String method:
public byte[] getBytes(String charsetName)
to find the bytes corresponding to the argument string in WordList.contains().
With what we've got so far, the only way to check would be to scan each word in
the list, which is obviously not feasible, but it's a good first step while you
get the other stuff sorted out.
The next step is to take that framework and make contains() run fast, but I
think we'd better leave that for later...
-- chris