Loading big files into memory

K

kubek

Hi:)
No, I didn't drop this thread, problem is still not solved and I'm
still looking for the easiest and quite optimal solution :(
Could You elaborate this solution with byte[] array and offset array
with a little more details? Because I didn't get the idea...
How should the data be loaded into this byte[] array and what for do I
need the offset array??
Also when loading it into memory, should I read a text file (reading
strings) or convert somehow my file into binary file and then read it
byte by byte??
Thank You all for Your answers and tips. I hope I will find the
solution that is easy to implement ASAP.
Happy Easter, by the way ;)

Kubek
 
C

Chris Uppal

kubek said:
Happy Easter, by the way ;)

And to you too ;-)

No, I didn't drop this thread, problem is still not solved and I'm
still looking for the easiest and quite optimal solution :(

Just a word of warning. There are several possible good reasons for wanting to
do this yourself, rather than using some spelling checking package that you can
find on the Web -- you may want to learn, for instance, or it may be that the
pre-existing packages don't work the way you want them to. But if your reasons
are that you haven't looked to see what people have already made available, or
that you have looked but it seems too much effort to learn how to use them,
then those are Very Bad Reasons Indeed -- this is not going to be very easy no
matter how much we try to keep it simple. You just have too much data to deal
with for simple solutions to work.

Another warning: you are reading (I guess) in comp.lang.java.help, but I'm
writing in comp.lang.java.programmer -- where more experienced programmers hang
out. So if anything I say doesn't make sense, it's because I'm not very good
at simple explanations, so you'll have to ask for clarification.

A third warning: this is going to be a long post...

Could You elaborate this solution with byte[] array and offset array
with a little more details? Because I didn't get the idea...

OK. Have you got the idea that your main problem is that you can't afford to
create 2.6 million String objects /at the same time/ ? If not then we'll have
to take a step back, but assuming you're happy with that, the problem becomes:
how do we represent the same /data/ as would be contained in all those objects,
but in a smaller space ? One way is to pack all the information into a single
array.

Instead of having lots of String objects like

[cat]
[dog]
[squid]
[toad]

which are all separate objects, we could just have one array with the same
characters in it.

[cat_dog_squid_toad_]

(for now just assume that the array contains characters). I'm using an
underscore (_) to mark the end of each word. In reality there are better
choices of marker, such as using a zero (not the character '0', but the
numerical value zero). That solves our problem with having too many String
objects, but it has a different problem. Namely that we don't know what the
words are in it -- which is rather awkward if we are going to use it for
spell-checking!

We (as people) can see what the words are, but the computer can't. So we need
a way to tell it. So we create a second array, this one contains the index of
the start of each word in the big array.

[0 4 8 14]

Now, if we want to know what the third word is, then we find the third entry in
the array of starting positions (which is 8), and find the corresponding data
in the big array (which is an 's') and scan along till we find the _ marking
the end of that word. So we have found the characters:
s q u i d
from which we could create a String if we wanted to.

BTW, you may have noticed that the end-markers are unnecessary here, since we
already know (from our array of starting positions) where the next word starts.
So we could actually represent the data as:

[catdogsquidtoad]
[0 3 6 11]

That would be more compact, but it makes something we'll come to later a little
more awkward, so we won't do that here (though it's still a good idea).

I've been glossing over whether the big array contains chars or bytes. In fact
(in this case) it would be better for it to contain bytes, since that will only
take half the space. How to do that ? Well this is where we need to know how
the data is stored in the file of words. (I did ask before but you didn't
answer). I think it's likely (but not certain) that it's stored with one byte
per character (rather than as some form of Unicode encoding). I hope so,
because that makes things simpler, and that's what I'm going to assume until
you tell me differently.

If the words are stored in that way, then we don't need to do any conversion --
we want to represent each character as 1 byte, and that's how they are in the
file.

On that assumption, you can just open a FileInputStream on it, and use that to
fill in a byte array. Something roughly like:
File file = new File("big-word-list.txt");
InputStream in = new FileInputStream(file);
long size = file.length();
int todo = (int)size;
byte[] bytes = new byte[todo];
int done = 0;
while (todo > 0)
{
int got = in.read(bytes, done, todo);
if (got < 0)
// Eek!
todo -= got;
done += got;
}
file.close();

(With error checking added, sanity checks on the file size and so on).
Normally I would say that measuring the file size and then filling in the array
in separate steps is a bad idea, but in this case (when we are dealing with so
much data), I think it's a worthwhile shortcut to avoid creating /two/ 30 MByte
arrays temporarily.

Once it's read in, the bytes with values the same as the numeric values of
chars '\n' and '\r' are the line terminators. And you can use that to create
your array of word positions.

You could wrap it all up in an object which (to the outside world) looked like
a java.util.List<String> if you want. Even if you don't mess with the
Collections interfaces, it still a good idea to make it into an object. E.g,
at a very minimum:

class WordList
{
WordList(String filename) { .... }
int size() { ... }
String get(int n) { ... }
boolean contains(String word) { ... }
}

To convert a word in the byte array, look into the String constructor:
public String(byte[] bytes,
int offset,
int length,
String charsetName)
which you would use to create a temporary String to be the value returned by
WordList.get(). The charset name depends on how the wordlist is stored.
There's a fair chance that it would be "ISO-8859-2" or perhaps "windows 1250"
(which I /think/ are the code pages (or charsets) used for Polish)

Similarly you would use the String method:
public byte[] getBytes(String charsetName)
to find the bytes corresponding to the argument string in WordList.contains().
With what we've got so far, the only way to check would be to scan each word in
the list, which is obviously not feasible, but it's a good first step while you
get the other stuff sorted out.

The next step is to take that framework and make contains() run fast, but I
think we'd better leave that for later...

-- chris
 
K

kubek

Thank You very much ;)
I have obtained solution You suggested - I keep all words in a byte[]
array and position of each word in separate int[] array...
But now looking for all words, that contain specified word takes
"hours"...
If I want to have all words, let's say, that contain substring "og",
then I should obtain a list like:
- dog;
- frog;
- frogman;
etc...
How would You do this? I have done in as follows:

String test = "og";
int count = 0;
for (int i = 0; i < (counter - 1); i++) {
String s = new String(words, offsets, offsets[i+1] - offsets);
if (s.indexOf(test) != -1) {
count++;
}
}

but as I said, it takes hours...
Thanks
 
C

Chris Uppal

kubek said:
I have obtained solution You suggested - I keep all words in a byte[]
array and position of each word in separate int[] array...
But now looking for all words, that contain specified word takes
"hours"...

That's to be expected. It's a matter of taking one step at a time. You've got
the data /in/ memory now, so the next step is how to look things up quickly.

Unfortunately, here we hit a snag -- you've changed the rules of the game ;-)
If I want to have all words, let's say, that contain substring "og",
then I should obtain a list like:
- dog;
- frog;
- frogman;

Looking for strings containing a substring is a very different problem from
looking to see whether a given collection of strings contains /exactly/ a
specific string. So before getting into that (and I warn you that fast text
searching is a big field, and one that I don't know a lot about), I want to
check that it really is what you want to do. I thought you wanted to use this
data for spell-checking, and I don't immediately see how the above list is
relevant to that.


Anyway, while we're here....
String test = "og";
int count = 0;
for (int i = 0; i < (counter - 1); i++) {
String s = new String(words, offsets, offsets[i+1] - offsets);
if (s.indexOf(test) != -1) {
count++;
}
}


You can probably speed this up quite a lot (minutes instead of hours) by not
creating a new String object each time. Convert the input string into a byte
array and do your own comparison with the bytes in the main array. That is
nowhere near giving you an /efficient/ algorithm, but it's a step towards being
confortable handling the binary representation of your data.

BTW, as one example of how different the substring problem is, you could just
scan the big byte array looking for occurences of the bytes corresponding to
"og", and ignore the index array entirely. Whenever you find an occurence you
scan backwar to the start of the word (looking for the word separator), and
forwards to the end of the word. Still very inefficient, though ;-)

-- chris
 
P

Patricia Shanahan

Chris said:
kubek wrote:

I have obtained solution You suggested - I keep all words in a byte[]
array and position of each word in separate int[] array...
But now looking for all words, that contain specified word takes
"hours"...
....
Looking for strings containing a substring is a very different problem from
looking to see whether a given collection of strings contains /exactly/ a
specific string. So before getting into that (and I warn you that fast text
searching is a big field, and one that I don't know a lot about), I want to
check that it really is what you want to do. I thought you wanted to use this
data for spell-checking, and I don't immediately see how the above list is
relevant to that.
....

I think the OP started out, in the earlier thread, with an assumption
that there is such a thing as a good way, or even the best way, to store
a large file in memory. I'm not sure that assumption has entirely gone away.

The reality is that a data structure is never inherently good. Data
structures, for non-trivial amounts of data, have to be designed taking
into account the mix of operations, and issues such as the relative cost
of time and memory.

Patricia
 
M

Martin Gregorie

Chris said:
kubek said:
I have obtained solution You suggested - I keep all words in a byte[]
array and position of each word in separate int[] array...
But now looking for all words, that contain specified word takes
"hours"...
If you want to look up whole words faster you need to sort the array and
then use a binary chop search. Note that you can sort by leaving the
words where they are and rearranging the contents of your integer
pointer array. If you do many lookups once the array is loaded and
sorted, than a simple replacement sort is quite probably good enough.

As Chris said, forget about Strings and code all your comparisons in
terms of
byte string comparisons.
 
C

Chris Uppal

Martin said:
If you do many lookups once the array is loaded and
sorted, than a simple replacement sort is quite probably good enough.

Better still to pre-sort the external array.

Actually I would use a hashtable layout myself rather than a binary chop. Not
only would I expect it to be faster, but it'd be easier to change the
implementation so that the entire array does not have to be held in memory all
at once (albeit at the cost of ~1 disk seek+read per lookup, which might or
might not be acceptable).

-- chris
 
C

Chris Uppal

Patricia said:
I think the OP started out, in the earlier thread, with an assumption
that there is such a thing as a good way, or even the best way, to store
a large file in memory. I'm not sure that assumption has entirely gone
away.

You may be right, and maybe Kubek's finding that he's bitten off more than he
wants to chew. If so, there's no shame in that. We'll see ;-)

-- chris
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,582
Members
45,065
Latest member
OrderGreenAcreCBD

Latest Threads

Top