Loading big files into memory

kubek

Hi.
I am dealing with a problem: in my software I need to check whether a
string or set of strings entered by the user is spelled correctly. I have
a complete dictionary (Polish, a text file of over 30 MB) in which I can
check whether an entered string is correct (i.e. whether it exists in the
file). The problem is that I can't load such a big file into memory (over
2.6 million lines). I have also raised the maximum heap size (java
-Xms16M -Xmx256M myAppl) and it worked, but... I checked in Task Manager
and observed that my application takes over 180 MB of memory - which in
my opinion is quite a lot :)
Do you have any ideas how to change my text file (dictionary) - for
example into a binary file (if that makes sense), how to compress it, or
how to represent all 2.6 million strings in some other way, so that they
can all be loaded into memory in reasonable time while allocating a
reasonable amount of memory?
If a binary file is useful - how do I "convert" my text file into one?
What is the best way to store it in memory - an ArrayList, another
Collection, or something else?
Please help me, because it's urgent.
I need some ideas, keywords, or links where I can look for a solution -
any other help will be welcome too :)
Thanks in advance
Chris
 
Oliver Wong

kubek said:
I am dealing with a problem: in my software I need to check whether a
string or set of strings entered by the user is spelled correctly. I have
a complete dictionary (Polish, a text file of over 30 MB) in which I can
check whether an entered string is correct (i.e. whether it exists in the
file). The problem is that I can't load such a big file into memory (over
2.6 million lines). [...]

Sort your dictionary file. Then, instead of loading the entire dictionary
into memory, do a binary search through the file on disk to see whether the
word you're checking exists in it. You never need to load the whole file
into memory at once. A clever OS should detect your read pattern and cache
parts of the file when possible.
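
To make that concrete, a minimal sketch of that kind of on-disk lookup might
look like the following (everything here is an assumption: the dictionary is
one word per line, sorted in raw byte order, stored in a single-byte encoding
such as ISO-8859-2, and the file name is invented):

import java.io.IOException;
import java.io.RandomAccessFile;

public class SortedFileLookup {

    /** True if 'word' appears as a whole line in the sorted file. The file must be
     *  sorted in raw byte order, and 'word' must use the same single-byte values
     *  that RandomAccessFile.readLine() produces (one byte per char). */
    public static boolean contains(RandomAccessFile file, String word) throws IOException {
        long front = 0;                 // no matching line starts before 'front'
        long back = file.length();
        while (front < back) {
            long mid = front + (back - front) / 2;
            file.seek(mid);
            file.readLine();                            // skip the line we landed in
            if (file.getFilePointer() >= back) break;   // no complete line starts after 'mid'
            String candidate = file.readLine();
            if (candidate == null) break;
            int cmp = candidate.compareTo(word);
            if (cmp == 0) return true;
            if (cmp < 0) front = file.getFilePointer(); // the word, if present, lies further on
            else back = mid;                            // the word, if present, lies at or before 'mid'
        }
        // Only a line or two is left between 'front' (always a line start) and the answer.
        file.seek(front);
        String line;
        while ((line = file.readLine()) != null) {
            int cmp = line.compareTo(word);
            if (cmp == 0) return true;
            if (cmp > 0) return false;                  // sorted file: we have gone past it
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        try (RandomAccessFile dict = new RandomAccessFile("dictionary_sorted.txt", "r")) {
            // Map the query to the file's byte values so it compares like readLine() output.
            String query = new String("korek".getBytes("ISO-8859-2"), "ISO-8859-1");
            System.out.println(contains(dict, query));
        }
    }
}

Each lookup does O(log n) seeks plus a short linear scan, so the dictionary
never has to be held in memory.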

- Oliver
 
kubek

Ok, let's say that is one solution.
But still, accessing the file each time doesn't seem optimal.
What about, for example, the case of a Scrabble game, where not only must
the file be checked for a given string, but you also have to return all
strings that can be created from a word already placed on the board plus
some letters from the rack?
Can you see my point?
Chris
 
Monique Y. Mudama

["Followup-To:" header set to comp.lang.java.help.] On 2006-04-05,
kubek penned:
Ok, let's say that is one solution. But still, accessing the file
each time doesn't seem optimal. What about, for example, the case
of a Scrabble game, where not only must the file be checked for a
given string, but you also have to return all strings that can be
created from a word already placed on the board plus some letters
from the rack? Can you see my point? Chris

Sure, but that's not the question you originally asked.

For every solution, there will be a situation in which that approach
is suboptimal.
 
Eric Sosman

kubek wrote on 04/05/06 17:53:
Ok, let's say that is one solution.
But still, accessing the file each time doesn't seem optimal.
What about, for example, the case of a Scrabble game, where not only must
the file be checked for a given string, but you also have to return all
strings that can be created from a word already placed on the board plus
some letters from the rack?

Funny you should mention Scrabble ... Look for

"The world's fastest Scrabble program," A.W. Appel
and G.J. Jacobson, Communications of the ACM Volume
31, Issue 5 (May 1988)

... which you can find on-line if hard copy isn't easy to
come by. Their program used a data structure they called
a Directed Acyclic Word Graph, or DAWG, that allowed them
to store a large dictionary in a small space and search it
rapidly.

How "small" and how "rapidly?" I don't recall the values
from their article, but I decided to implement their method
for my own amusement and was able to produce a program that
played Scrabble better than I do. Its dictionary held about
twenty-five or thirty thousand words, amounting to perhaps a
quarter megabyte in raw form.

You may well be raising your eyebrows at numbers like
"thirty thousand" and "a quarter meg," and thinking that this
DAWG thing isn't impressive at all. But you haven't heard
the punch line yet: The machine I used had 64kB of memory
(with about 10kB taken by the O/S), an eight-bit CPU operating
at 4MHz, and two floppy drives accommodating 390kB each -- and
that's all. The DAWG was good enough to let me fit my quarter
megabyte into less than one fifth its raw size and still search
it quickly, so I think it qualifies as top DAWG.
 
Roedy Green

Do you have any ideas how to change my text file (dictionary) - for
example into a binary file (if that makes sense), how to compress it, or
how to represent all 2.6 million strings in some other way, so that they
can all be loaded into memory [...]

There is another way to do spell checking that was popular back in the
DOS days before we had enough RAM to wash our feet in.

You sort your dictionary alphabetically, and also sort your document -
reduced to a list of word/offset pairs - alphabetically by word. Then
merge them with a single sequential pass through both. That produces a
file of errors, each of which is just the offset of a misspelled word in
the document. Then sort that by offset and merge it back with the
original document.

You may need an external sort. See
http://mindprod.com/jgloss/sort.html


Other than the sort, you need only a few K of RAM for the two buffers.
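
A sketch of that merge pass (the file names and the word<TAB>offset format
are assumptions made for illustration):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public class MergeSpellCheck {
    public static void main(String[] args) throws IOException {
        // Both inputs are assumed to be sorted by word already (external sort if needed):
        //   dictionary_sorted.txt      - one word per line
        //   document_words_sorted.txt  - "word<TAB>offset" per line, sorted by word
        try (BufferedReader dict = new BufferedReader(new FileReader("dictionary_sorted.txt"));
             BufferedReader doc = new BufferedReader(new FileReader("document_words_sorted.txt"));
             PrintWriter errors = new PrintWriter(new FileWriter("errors_by_offset.txt"))) {

            String dictWord = dict.readLine();
            String docLine;
            while ((docLine = doc.readLine()) != null) {
                String[] parts = docLine.split("\t", 2);       // [word, offset]
                String word = parts[0];
                // Advance the dictionary until it catches up with the document word.
                while (dictWord != null && dictWord.compareTo(word) < 0)
                    dictWord = dict.readLine();
                if (dictWord == null || !dictWord.equals(word))
                    errors.println(parts[1]);                  // unknown word: record its offset
            }
        }
        // Sorting errors_by_offset.txt numerically and merging it back with the
        // original document then flags the misspellings in place.
    }
}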
 
Chris Uppal

kubek said:
I am dealing with a problem: in my software I need to check whether a
string or set of strings entered by the user is spelled correctly. I have
a complete dictionary (Polish, a text file of over 30 MB) in which I can
check whether an entered string is correct (i.e. whether it exists in the
file). The problem is that I can't load such a big file into memory (over
2.6 million lines).

One simple first step would be to represent your text data as binary. The
chances are that you don't need full Unicode support, but can represent all
your character data with 1 byte per character, using an appropriate code
page (Windows-1250 or ISO 8859-2, perhaps).
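
For instance, re-encoding the word list might look something like this (the
file names and the source encoding are assumptions; adjust them to whatever
the dictionary actually uses):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

public class ReencodeDictionary {
    public static void main(String[] args) throws IOException {
        Charset from = Charset.forName("UTF-8");       // assumed source encoding
        Charset to = Charset.forName("ISO-8859-2");    // one byte per Polish character
        try (BufferedReader in = new BufferedReader(
                 new InputStreamReader(new FileInputStream("dictionary.txt"), from));
             Writer out = new BufferedWriter(
                 new OutputStreamWriter(new FileOutputStream("dictionary_8859_2.txt"), to))) {
            String line;
            while ((line = in.readLine()) != null) {
                out.write(line);
                out.write('\n');                        // single-byte line ending
            }
        }
    }
}

That alone halves the character data compared with Java's two-byte chars.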

Beyond that, you should consider the algorithm you are using. You don't state
what you are currently doing, so it's not possible to suggest improvements, but
there are many clever algorithms available. You might want to look at Bentley
and Sedgewick's paper "Fast algorithms for sorting and searching strings" (which
you'll find online at http://www.cs.princeton.edu/~rs/strings/) for one example.
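
One of the structures in that paper is the ternary search tree; a minimal
membership-only sketch might look like this:

/** Minimal ternary search tree: exact-membership only. */
public class TernarySearchTree {

    private static final class Node {
        final char c;
        Node left, mid, right;
        boolean wordEnd;
        Node(char c) { this.c = c; }
    }

    private Node root;

    public void insert(String word) {
        if (word.isEmpty()) return;
        root = insert(root, word, 0);
    }

    private Node insert(Node node, String word, int i) {
        char c = word.charAt(i);
        if (node == null) node = new Node(c);
        if (c < node.c)                 node.left = insert(node.left, word, i);
        else if (c > node.c)            node.right = insert(node.right, word, i);
        else if (i < word.length() - 1) node.mid = insert(node.mid, word, i + 1);
        else                            node.wordEnd = true;
        return node;
    }

    public boolean contains(String word) {
        if (word.isEmpty()) return false;
        Node node = root;
        int i = 0;
        while (node != null) {
            char c = word.charAt(i);
            if (c < node.c)                  node = node.left;
            else if (c > node.c)             node = node.right;
            else if (i < word.length() - 1) { node = node.mid; i++; }
            else                             return node.wordEnd;
        }
        return false;
    }
}

As a bonus, a ternary search tree can also enumerate all words sharing a
prefix, which is handy for the Scrabble-style queries mentioned earlier in
the thread.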

-- chris
 
Ian Wilson

kubek said:
I have a complete dictionary (Polish, a text file of over 30 MB) in which
I can check whether an entered string is correct (i.e. whether it exists
in the file). The problem is that I can't load such a big file into
memory (over 2.6 million lines). [...]

30MB seems quite large for a word-list:

$ ls -l /usr/share/dict/linux.words
-rw-r--r-- 1 root root 409305 Jun 24 2002
/usr/share/dict/linux.words

$ wc -l /usr/share/dict/linux.words
45427 /usr/share/dict/linux.words

So 45427 English words fit in under half a megabyte.

Does your dictionary include definitions? If so, maybe you can strip
those out, as they will not be needed for spell-checking.
 
kubek

No, there are no definitions in the file.
The difference is that my dictionary contains over 2.5 million words,
so it is much bigger than 45427 words. The problem with the Polish
language is that there are lots of exceptions and inflections - that is
why there are so many entries in the file.
Maybe changing it into a binary file instead of a character file would
be a good idea, but I have no clue how to achieve this ;(
Can anyone help me with that?
Chris
 
Oliver Wong

kubek said:
No, there are no definitions in the file.
The difference is that my dictionary contains over 2.5 million words,
so it is much bigger than 45427 words. The problem with the Polish
language is that there are lots of exceptions and inflections - that is
why there are so many entries in the file.
Maybe changing it into a binary file instead of a character file would
be a good idea, but I have no clue how to achieve this ;(
Can anyone help me with that?

Didn't you see Eric Sosman's post? What's wrong with the DAWG he
described in his post?

- Oliver
 
alexandre_paterson

kubek said:
No, there are no definitions in the file.
The difference is that my dictionary contains over 2.5 million words,
so it is much bigger than 45427 words. The problem with the Polish
language is that there are lots of exceptions and inflections - that is
why there are so many entries in the file.
Maybe changing it into a binary file instead of a character file would
be a good idea...

Indeed...

I wrote a custom spellchecker in Java years ago. Using a very basic
Huffman encoding / binary tree I got down to 22 bits per word,
including very fast lookup tables (accessible by binary search). With
such an encoding you could fit your whole dictionary in memory.
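
The exact format he used isn't given, but as a rough illustration of the
Huffman step alone, a sketch that builds per-character codes from the letter
frequencies of a word list could look like this:

import java.util.*;

/** Rough illustration of the Huffman step only: build per-character codes from
 *  letter frequencies in a word list. (The lookup tables and the packed bit
 *  representation from the post are not shown here.) */
public class CharHuffman {

    private static final class HNode implements Comparable<HNode> {
        final long freq;
        final char symbol;        // meaningful only for leaves
        final HNode left, right;
        HNode(char symbol, long freq) { this(symbol, freq, null, null); }
        HNode(char symbol, long freq, HNode left, HNode right) {
            this.symbol = symbol; this.freq = freq; this.left = left; this.right = right;
        }
        boolean isLeaf() { return left == null; }
        public int compareTo(HNode o) { return Long.compare(freq, o.freq); }
    }

    /** Returns a map from character to its bit string (e.g. 'a' -> "010"). */
    public static Map<Character, String> buildCodes(Iterable<String> words) {
        Map<Character, Long> freq = new HashMap<>();
        for (String w : words)
            for (int i = 0; i < w.length(); i++)
                freq.merge(w.charAt(i), 1L, Long::sum);

        PriorityQueue<HNode> pq = new PriorityQueue<>();
        for (Map.Entry<Character, Long> e : freq.entrySet())
            pq.add(new HNode(e.getKey(), e.getValue()));

        while (pq.size() > 1) {                       // merge the two rarest subtrees
            HNode a = pq.poll(), b = pq.poll();
            pq.add(new HNode('\0', a.freq + b.freq, a, b));
        }

        Map<Character, String> codes = new HashMap<>();
        assignCodes(pq.poll(), "", codes);
        return codes;
    }

    private static void assignCodes(HNode node, String prefix, Map<Character, String> codes) {
        if (node == null) return;
        if (node.isLeaf()) {
            codes.put(node.symbol, prefix.isEmpty() ? "0" : prefix);  // single-symbol edge case
            return;
        }
        assignCodes(node.left, prefix + '0', codes);
        assignCodes(node.right, prefix + '1', codes);
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("kot", "kotek", "kora", "korek");  // toy word list
        System.out.println(buildCodes(sample));
    }
}

Packing the concatenated codes into a bit array, plus the lookup tables he
mentions, is the part that brings the per-word cost down to the figures quoted.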

Using such an encoding would probably get you below 7 MB of
memory for your 2.5 million words (I'd guess there are many more
similarities among your 2.5 million words than in the
French/English dictionaries I used, so you'd probably need even
a little less than that).

That was for the efficient data structure (dwarfing by orders of
magnitude anything based on String objects). Note that this is not
anywhere near a record... Much more sophisticated
algorithms can represent dictionary words using less than
10 bits per word.

For the spellchecking part I used several techniques (including, but
not limited to, double metaphone, Levenshtein edit distance, etc.).

I got a huge French dictionary fitting entirely in around 2 MB... way
smaller and faster than anything Jazzy (an open-source spellchecker for
Java) could achieve back in the day (I haven't checked Jazzy since then
to see whether it is still so memory-hungry and slow).

Can anyone help me with that?

Eric Sosman gave you a reference to a paper describing an algorithm
that is very efficient, both memory-wise and performance-wise, for a
Scrabble-type application.

If you simply want a spellchecker (or a Scrabble game based
on a simple spellchecker, which must be doable even if not
as efficient as the nice technique proposed by Eric), a basic
Huffman-encoded binary tree + lookup tables will do too.

In any case you'll have to do some coding and some "low-level"
bit manipulation.

For such a purpose you ain't gonna achieve reasonable memory
usage / loading time / etc. using String objects / regular charset
encodings / etc.

So my "help" would be to recommend using basic algorithms and
data structures to solve your problem.

Dictionary encoding / representation is a common problem and
Google will help you find lots of information on the subject.

If you're really into developing a Scrabble game Eric already
gave you a link and I'm sure Google could also help locate
additional information.

Alex


P.S.: if you control the systems where your program is
going to be deployed (e.g. an online Scrabble game where the
logic runs server-side on some Un*x system), you could also
"cheat" by simply wrapping calls to ispell/aspell for the
(fast, memory-efficient) spellchecking functionality of
your app.
 
Chris Uppal

Oliver said:
Didn't you see Eric Sosman's post? What's wrong with the DAWG he
described in his post?

Maybe DAWG is too difficult to program[*] and/or the OP wants to see how far he
can get with simpler techniques first. A sensible approach, I'd say, even if
he does end up being forced into DAWGing[**] it in the end.

-- chris

[*] the paper doesn't give an implementation of the hard bit.
[**] or using some other "advanced" algorithm.
 
Chris Uppal

kubek said:
Maybe changing it into binary file instead of character file would be a
good idea, but I have no clue how to achieve this ;(

What format is the existing dictionary file stored in ?

I.e. when you read it in as Java Strings (I assume that's what you are
currently doing), what character encoding do you specify in your
InputStreamReader ?

-- chris
 
Chris Uppal

Eric said:
"The world's fastest Scrabble program," A.W. Appel
and G.J. Jacobson, Communications of the ACM Volume
31, Issue 5 (May 1988)

Thanks for the pointer, it's interesting stuff (even though I care little for
Scrabble ;-)

How "small" and how "rapidly?" I don't recall the values
from their article, but I decided to implement their method
for my own amusement

How did you do the FSA minimisation, the standard textbook algorithm ? The
paper itself just says that "it is easy to find the minimum-size finite state
recognizer quickly", but the ref they cite is to a tree balancing algorithm
which didn't seem (to me) connected.

and was able to produce a program that
played Scrabble better than I do. Its dictionary held about
twenty-five or thirty thousand words, amounting to perhaps a
quarter megabyte in raw form.

Since the OP's dealing with about 2.6 million words, it seems that the FSA
would have a fairly large number of states. If each state is represented as,
say, an object plus a couple of arrays then that's a fair bit of overhead -- I
make it about 32 bytes /overhead/ per state. I think a non-object
representation might be worth considering.

The same would apply to any other "clever" algorithm, it's not DAWG-specific.
Indeed it might hit other data structures a lot harder -- the nice thing about
DAWG is that the number of states is significantly less than the number of
words.

-- chris
 
Eric Sosman

Chris Uppal wrote on 04/07/06 05:34:
Thanks for the pointer, it's interesting stuff (even though I care little for
Scrabble ;-)


How did you do the FSA minimisation, the standard textbook algorithm ? The
paper itself just says that "it is easy to find the minimum-size finite state
recognizer quickly", but the ref they cite is to a tree balancing algorithm
which didn't seem (to me) connected.

That became an interesting project in its own right,
what with only ~40KB of RAM available (after subtracting
space for CP/M, the program code, and so on) and less than
a megabyte of external storage (some of which held the
original raw word list). I didn't have access to the
cited reference, so I just dreamed up my own method (I don't
know what "the standard textbook algorithm" is, but this
one seems fairly obvious):

1. I sorted all the words, using an external sort utility
I'd written for an earlier project (reverse-engineering
"Adventure," IIRC).

2. Taking each word in turn, in alphabetical order, I
added its letters to one branch of an ordinary trie. Because
the words were already alphabetized, I only needed to keep
the "front" of the trie in memory -- for example, once I'd
encountered "academy" I knew I'd have no further need for
the subtrie at "academi...".

3. When I found that a subtrie wouldn't be added to any more,
I checked a hash table to see whether an identical subtrie had
already been generated. If so, I discarded the new one in
favor of the old, and adjusted the link in its parent. If the
subtrie was unique, I wrote it to the DAWG on disk and added
it to the hash table. (Managing the hash table was a challenge
in the limited memory space. I no longer recall the details --
this was almost twenty years ago -- but I think the in-memory
table stored only the pointers to the on-disk subtries, so I
wound up going to disk whenever the hash table indicated a
potential duplication. With a little more memory I might have
stored a digest value to save a lot of I/O.)
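
In Java, a rough sketch of that incremental construction might look like the
following (nothing here is the original code; class and method names are
invented, and words must be fed in already-sorted order, exactly as in step 1):

import java.util.*;

public final class DawgBuilder {

    static final class Node {
        final TreeMap<Character, Node> edges = new TreeMap<>();
        boolean terminal;

        // Structural equality over (terminal flag, edge labels, child identity);
        // children are already canonical by the time a node reaches the register.
        @Override public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Node)) return false;
            Node n = (Node) o;
            if (terminal != n.terminal || edges.size() != n.edges.size()) return false;
            Iterator<Map.Entry<Character, Node>> a = edges.entrySet().iterator();
            Iterator<Map.Entry<Character, Node>> b = n.edges.entrySet().iterator();
            while (a.hasNext()) {
                Map.Entry<Character, Node> x = a.next(), y = b.next();
                if (!x.getKey().equals(y.getKey()) || x.getValue() != y.getValue()) return false;
            }
            return true;
        }

        @Override public int hashCode() {
            int h = terminal ? 1 : 0;
            for (Map.Entry<Character, Node> e : edges.entrySet())
                h = 31 * h + e.getKey() + System.identityHashCode(e.getValue());
            return h;
        }
    }

    private final Map<Node, Node> register = new HashMap<>();   // step 3's hash table
    private final Node root = new Node();
    private String previous = "";

    /** Add one word; words must arrive in sorted order (step 1). */
    public void add(String word) {
        if (word.compareTo(previous) < 0)
            throw new IllegalArgumentException("words must be added in sorted order");
        int p = commonPrefixLength(previous, word);
        Node divergence = walk(word.substring(0, p));
        if (!divergence.edges.isEmpty())
            replaceOrRegister(divergence);          // that subtrie will never grow again
        Node node = divergence;
        for (int i = p; i < word.length(); i++) {   // step 2: append the new suffix
            Node next = new Node();
            node.edges.put(word.charAt(i), next);
            node = next;
        }
        node.terminal = true;
        previous = word;
    }

    /** Call once after the last word. */
    public void finish() {
        if (!root.edges.isEmpty()) replaceOrRegister(root);
    }

    /** Exact-match lookup; works on the finished DAWG (or the partial trie). */
    public boolean contains(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.edges.get(word.charAt(i));
            if (node == null) return false;
        }
        return node.terminal;
    }

    // Step 3: share a finished subtrie with an identical one if the register has it.
    private void replaceOrRegister(Node state) {
        Map.Entry<Character, Node> lastEdge = state.edges.lastEntry();
        Node child = lastEdge.getValue();
        if (!child.edges.isEmpty()) replaceOrRegister(child);
        Node existing = register.get(child);
        if (existing != null) state.edges.put(lastEdge.getKey(), existing);
        else register.put(child, child);
    }

    private Node walk(String prefix) {
        Node node = root;
        for (int i = 0; i < prefix.length(); i++) node = node.edges.get(prefix.charAt(i));
        return node;
    }

    private static int commonPrefixLength(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }
}

The register plays the role of the hash table in step 3: a finished subtrie is
either replaced by an identical, already-stored one or becomes the stored copy.
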
Since the OP's dealing with about 2.6 million words, it seems that the FSA
would have a fairly large number of states. If each state is represented as,
say, an object plus a couple of arrays then that's a fair bit of overhead -- I
make it about 32 bytes /overhead/ per state. I think a non-object
representation might be worth considering.

Agreed: The whole business belongs in a big byte[] array.
One nice feature of the DAWG is that its compression improves as
it sees more and more words, because it's more and more likely
to find repeated suffixes -- which amount to existing subtries
it can simply link to rather than replicate. Having processed
"ammoniac" it can reuse "-iac" for "cardiac" and "demoniac" and
"elegiac" and ... In fact, when it gets to "megalomaniac" it
finds that it can reuse the entire word "maniac" as a suffix.
The same would apply to any other "clever" algorithm, it's not DAWG-specific.
Indeed it might hit other data structures a lot harder -- the nice thing about
DAWG is that the number of states is significantly less than the number of
words.

The O.P. is interested in Polish, which he describes as a
highly inflected language. I know only the few words of Polish
that have slipped into English plus a few proper names, so this
is speculation, but if Polish is inflected in ways that are
similar to Latin, Italian, or German, there will be a *lot* of
reusable suffixes and sets of suffixes. With so much suffix-
sharing I'd expect the DAWG to work even better than it does
for English.
 
Chris Uppal

[I meant to come back to this earlier, but I wanted to do some experiments
first, and that took longer than I anticipated...]

Eric Sosman wrote:

[me:]
How did you do the FSA minimisation, the standard textbook algorithm ?
[...]

That became an interesting project in its own right,
what with only ~40KB of RAM available (after subtracting
space for CP/M, the program code, and so on) and less than
a megabyte of external storage (some of which held the
original raw word list). I didn't have access to the
cited reference, so I just dreamed up my own method (I don't
know what "the standard textbook algorithm" is, but this
one seems fairly obvious):

Hmm, I'm not sure that "fairly obvious" is doing it justice. Pretty clever,
I'd say ;-)

BTW, by "the standard textbook algorithm" I meant the DFA minimisation
algorithm given in books about DFAs, NFAs, etc. (e.g. the Dragon Book; I'm
assuming you know all about that). It has the disadvantage (compared to your
method) that it requires the entire DFA to be defined before minimising -- it
doesn't seem to have a natural incremental variant.

Since the OP's dealing with about 2.6 million words, it seems that the
FSA would have a fairly large number of states. If each state is
represented as, say, an object plus a couple of arrays then that's a
fair bit of overhead -- I make it about 32 bytes /overhead/ per state.
I think a non-object representation might be worth considering.

Agreed: The whole business belongs in a big byte[] array.

I got interested enough to add the DFA minimisation algorithm to a little
DFA/NFA package that I had hanging around, and tried using it naively. That
package works with "normal" objects -- each state is a full object -- so that
might have been a tad ambitious, but it's always interesting to explore the
limits ;-)

I tried pushing a list of about 100K words (mean of 8.6 ASCII characters per
word) through it. My process started thrashing when it had grown to >
400 MBytes, and slowed to a stop :-( So that won't work. I hadn't even started
the minimisation algorithm; just building the initial DFA/trie took too much
space. So I revisited my DFA package and changed some features that "don't
scale" (such as using a 256-way table in every state object). A few more
tweaks later and I could build the initial 244055-state[*] DFA/trie. Ran the
minimisation algorithm. That ran out of memory too. Re-implemented it to
use a lot less space. Tried again. It ran! It reduced the state count to 42213.
The only problem was -- it only /just/ ran. The process still consumed (at peak)
several hundred MBytes more than its final state.

So, the only way to make it work in reasonable space (assuming a natural
object-based representation of states) is to do it incrementally. One simple
way is to build the DFA/trie and minimise it whenever, say, 20K more states
have been added since the last minimise. Of course that slows it down a bit...

The moral is: "don't represent DAWG states as objects"...

([*] I'm not sure why my lexicon of 100K words produces a significantly bigger
initial trie than the 117150 that the paper derived from a similar-sized
list -- the wordlist does include plurals and the like (dog, dogs, doggie...).
Presumably the larger initial size also explains why my final DAWG is also
about twice as big as theirs.)

Curiously, the "dot" program (part of Graphvis) was unable to create a
PostScript diagram of my 42213-state DFA. I can't think why...

-- chris
 
Chris Uppal

kubek said:
I have a complete dictionary (Polish, a text file of over 30 MB) in
which I can check whether an entered string is correct (i.e. whether it
exists in the file). The problem is that I can't load such a big file
into memory (over 2.6 million lines).

Kubek seems to have dropped out of this thread, but I still think it's
interesting to revisit these numbers and do a few sums.

We have about 2.6 million words. Assuming that Kubek's word list is one word
per line, with 8 bits per character, and uses Windows-style line endings (two
bytes), that means that the average word is about 11.6 characters long. How
much space does the same data take up when represented as Strings in a Java
program?

Assuming a recent 32-bit Sun JVM. Each String consists of an 8-byte header,
plus 4 4-byte instance fields. One of those fields is a reference to a char[]
array where the character data is held. Char[] arrays have a 12-byte header,
plus two bytes for each character (and are probably rounded up to a multiple of
4 or 8 bytes in internal memory representation, but we'll ignore that). So
each String takes up[*]:
8 + (4*4) + 12 + (2*11.6)
bytes. That's 59.2 bytes on average. There are 2.6 million words, so the 2.6
million Strings will total around 154 million bytes.

([*] Assuming that the Strings don't all share one big char[] array between
them; if they do then the total drops to about 123 MBytes.)

That's compared with just reading the data into a single 30M byte[] array. Add
a table of offsets (2.6 million entries in an int[] array) giving the start of
each word (and the end of the last), and the total goes up to about 40 MBytes.
Still manageable, even without using compression. And fast lookup can be added
without using extra memory (by arranging the words in hashtable order or by
sorting it and using a binary chop).
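
A sketch of that byte[]-plus-offsets layout, using the binary-chop option (the
file name, the single-byte encoding, and byte-order sorting are all assumptions):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class PackedWordList {

    private final byte[] data;     // the whole sorted word list, one word per line
    private final int[] starts;    // byte offset where each word begins

    public PackedWordList(String path) throws IOException {
        data = Files.readAllBytes(Paths.get(path));
        int count = 0;
        for (int i = 0; i < data.length; i++)
            if (isWordStart(i)) count++;
        starts = new int[count];
        int w = 0;
        for (int i = 0; i < data.length; i++)
            if (isWordStart(i)) starts[w++] = i;
    }

    private boolean isWordStart(int i) {
        return (i == 0 || data[i - 1] == '\n') && data[i] != '\n' && data[i] != '\r';
    }

    /** Binary chop over the word starts, comparing raw (unsigned) bytes. */
    public boolean contains(byte[] word) {
        int lo = 0, hi = starts.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = compareAt(mid, word);
            if (cmp == 0) return true;
            if (cmp < 0) lo = mid + 1;
            else hi = mid - 1;
        }
        return false;
    }

    // Compare the word stored at index 'mid' with 'word', byte by byte.
    private int compareAt(int mid, byte[] word) {
        int p = starts[mid], i = 0;
        while (p < data.length && data[p] != '\n' && data[p] != '\r' && i < word.length) {
            int d = (data[p] & 0xFF) - (word[i] & 0xFF);
            if (d != 0) return d;
            p++;
            i++;
        }
        boolean storedEnded = p >= data.length || data[p] == '\n' || data[p] == '\r';
        if (storedEnded && i == word.length) return 0;   // same length, same bytes
        return storedEnded ? -1 : 1;                     // the shorter word sorts first
    }

    public static void main(String[] args) throws IOException {
        PackedWordList dict = new PackedWordList("dictionary_sorted.txt");  // invented file name
        System.out.println(dict.contains("korek".getBytes("ISO-8859-2")));
    }
}

For 2.6 million words this is roughly the 30 MB of raw bytes plus about 10 MB
of offsets, in line with the 40 MB estimate above.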

Objects get expensive if you have lots of 'em...

-- chris
 
Eric Sosman

Chris Uppal wrote on 04/11/06 04:54:
[...]
BTW, by "the standard textbook algorithm" I meant the DFA minimisation
algorithm given in books about DFAs, NFAs, etc. (e.g. the Dragon Book; I'm
assuming you know all about that). It has the disadvantage (compared to your
method) that it requires the entire DFA to be defined before minimising -- it
doesn't seem to have a natural incremental variant.

The small size of my system ruled out approaches that
operated on the graph in its entirety; it simply wasn't an
available option.

Doing things incrementally, though effective on my small
machine, probably didn't produce an optimal DAWG because it
couldn't do "global" rearrangements. When you get right
down to it, all my method really did was coalesce repeated
sets of suffixes. As a practical matter, though, even that
fairly limited compression was enough to get the word list
into a small amount of memory -- and to offer me the lovely
prospect of losing Scrabble games to a stupid computer ...

Thanks for the report on your experiments. They show,
I think, that the price one pays for the benefits of O-O can
sometimes be too steep -- not always, but sometimes. A good
engineer must straddle an ever-moving fence.
 
