Patricia trie vs binary search.

G

Gene Wirchenko

On Tue, 29 May 2012 09:55:02 -0700, Daniel Pitts

[snip]
Are you arguing that a modern system can't handle that number of words?

No. I am simply stating that the real size of the problem is much bigger.
A modern desktop has more than enough memory to easily handle a quarter *billion* words, which is 100 times greater than your guessed upper limit.

And that's *without* compression.

Sure. If that is all that it does. My main (and older) desktop
box has 1.5 GB. I have trouble with not enough memory at times.
Adding another app might break its back.

Sincerely,

Gene Wirchenko
 
D

Daniel Pitts

On Tue, 29 May 2012 09:55:02 -0700, Daniel Pitts

[snip]
Are you arguing that a modern system can't handle that number of words?

No. I am simply stating that the real size of the problem is much bigger.
A modern desktop has more than enough memory to easily handle a quarter *billion* words, which is 100 times greater than your guessed upper limit.

And that's *without* compression.

Sure. If that is all that it does. My main (and older) desktop
box has 1.5 GB. I have trouble with not enough memory at times.
Adding another app might break its back.

Sincerely,

Gene Wirchenko

Moving the goalpost again, are we?
 
J

Jeff Higgins

On 5/29/12 10:37 AM, Jeff Higgins wrote:
On 05/29/2012 12:16 PM, Daniel Pitts wrote:

Most of these entries aren't even in Thunderbird's spell-check dictionary.

How many seven-letter words can I construct from the twenty-six letters A through Z? How many of these words are defined English words? What are/are not the common affixes in this set of English words?

How about you download the Moby word list yourself and do these
analyses?
I was asking you.
It is a very simple grep for the first question.
egrep "\b[a-zA-Z]{7}\b" *.txt
Nope.
I'll be glad to for a consultant fee. $150 per hour or any partial hour.

bad breath
 
J

Jeff Higgins

On 05/29/2012 01:49 PM, Daniel Pitts wrote:
On 5/29/12 10:37 AM, Jeff Higgins wrote:
On 05/29/2012 12:16 PM, Daniel Pitts wrote:

Most of these entries aren't even in Thunderbird's spell-check dictionary.

How many seven-letter words can I construct from the twenty-six letters A through Z? How many of these words are defined English words? What are/are not the common affixes in this set of English words?

How about you download the Moby word list yourself and do these
analyses?
I was asking you.
It is a very simple grep for the first question.
egrep "\b[a-zA-Z]{7}\b" *.txt
Nope.
I'll be glad to for a consultant fee. $150 per hour or any partial hour.

bad breath
Note to self: Apply sentence compression to posts from cljp.
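(As an aside, the seven-letter count that the egrep above produces can also be sketched in Java; a rough, illustrative version follows, with a placeholder file name. The first question is simple arithmetic: 26^7 = 8,031,810,176 possible seven-letter strings.)

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class SevenLetterWords {
    public static void main(String[] args) throws IOException {
        // Placeholder file name; point this at a one-word-per-line list (e.g. a Moby file).
        String file = args.length > 0 ? args[0] : "wordlist.txt";

        long sevenLetterEntries;
        try (Stream<String> lines = Files.lines(Paths.get(file))) {
            // Whole-line match: exactly seven ASCII letters, roughly what the egrep counts.
            sevenLetterEntries = lines.map(String::trim)
                                      .filter(w -> w.matches("[a-zA-Z]{7}"))
                                      .count();
        }

        // All seven-letter strings over A..Z, letters reusable: 26^7.
        long possible = 1;
        for (int i = 0; i < 7; i++) {
            possible *= 26;
        }

        System.out.println("Seven-letter entries in " + file + ": " + sevenLetterEntries);
        System.out.println("Possible seven-letter strings over A-Z: " + possible); // 8,031,810,176
    }
}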
 
L

Lew

Gene said:
Daniel Pitts wrote:

[snip]
Are you arguing that a modern system can't handle that number of words?

No. I am simply stating that the real size of the problem is much bigger.

With no numbers that differ from Daniel's to back up your claim.

You called my numbers "made up", but it turned out they were
*larger* than the real numbers.

You cite "a quarter of a million" words. Daniel counted roughly
*150%* of that in his word base.

My numbers were generous. Yours are not even significantly different,
and in fact are smaller than his. The numbers do show there is not
much problem, yet somehow you claim with no logic or reasoning or
different data that they do show a problem.

Clearly you are mistaken.

Daniel showed evidence from experimentation. His numbers jibe with
yours. Without compression, his data occupy roughly 5 MiB of memory.

Show the problem or withdraw the claim.
Sure. If that is all that it does. My main (and older) desktop
box has 1.5 GB. I have trouble with not enough memory at times.
Adding another app might break its back.

Again, how much damage will < 5 MiB of data do to that system?

How about 50 MiB? That's *ten times* the number of words you might need to handle.
Without any compression.

You haven't shown a problem. You just chant, "The sky is falling!"
 
G

Gene Wirchenko

On Tue, 29 May 2012 09:55:02 -0700, Daniel Pitts

[snip]
Are you arguing that a modern system can't handle that number of words?

No. I am simply stating that the real size of the problem is much bigger.
A modern desktop has more than enough memory to easily handle a quarter *billion* words, which is 100 times greater than your guessed upper limit.

And that's *without* compression.

Sure. If that is all that it does. My main (and older) desktop
box has 1.5 GB. I have trouble with not enough memory at times.
Adding another app might break its back.
Moving the goalpost again, are we?

No. Someone claimed that the amount of memory needed was low and
that modern systems have many times that amount. Sure, they do, but
they also typically run more than one app. Assuming that your app can
use all of the memory is not very friendly.

Sincerely,

Gene Wirchenko
 
G

Gene Wirchenko

Gene said:
Daniel Pitts wrote:

[snip]
Are you arguing that a modern system can't handle that number of words?

No. I am simply stating that the real size of the problem is much bigger.

With no numbers that differ from Daniel's to back up your claim.

You called my numbers "made up", but it turned out they were
*larger* than the real numbers.

You cite "a quarter of a million" words. Daniel counted roughly
*150%* of that in his word base.

Ah, selective reading.

For root forms, it was 1/4 million. With affixes -- and remember
that my first question was about them -- the figure was 3/4 million.
This is double what Daniel counted, and 3/4 million does not include
technical words, etc. Take a look at the *full* paragraph that I
quoted, not just the lowest number.
My numbers were generous. Yours are not even significantly different,
and in fact are smaller than his. The numbers do show there is not
much problem, yet somehow you claim with no logic or reasoning or
different data that they do show a problem.

Take another look at that paragraph I quoted. Really.
Clearly you are mistaken.

Daniel showed evidence from experimentation. His numbers jibe with
yours. Without compression, his data occupy roughly 5 MiB of memory.

Show the problem or withdraw the claim.

Read my statement of the problem.
Again, how much damage will < 5 MiB of data do to that system?

How about 50 MiB? That's *ten times* the number of words you might need to handle.
Without any compression.

Fine. My system is currently using 1547 MB of memory. It only has 1536 MB. With some intensive uses, I have seen the commit go as high as about 2200 MB. My system crawls then. Using 50 MB more in those circumstances would not be good.
You haven't shown a problem. You just chant, "The sky is falling!"

I believe that I have. Handwaving the possibility away does not
make it go away.

Sincerely,

Gene Wirchenko
 
D

Daniel Pitts

Gene Wirchenko wrote:
Daniel Pitts wrote:

[snip]

Are you arguing that a modern system can't handle that number of
words?

No. I am simply stating that the real size of the problem is much bigger.

With no numbers that differ from Daniel's to back up your claim.

You called my numbers "made up", but it turned out they were
*larger* than the real numbers.

You cite "a quarter of a million" words. Daniel counted roughly
*150%* of that in his word base.

Ah, selective reading.

For root forms, it was 1/4 million. With affixes -- and remember
that my first question was about them -- the figure was 3/4 million.
This is double what Daniel counted, and 3/4 million does not include
technical words, etc. Take a look at the *full* paragraph that I
quoted, not just the lowest number.
My numbers were generous. Yours are not even significantly different,
and in fact are smaller than his. The numbers do show there is not
much problem, yet somehow you claim with no logic or reasoning or
different data that they do show a problem.

Take another look at that paragraph I quoted. Really.
Clearly you are mistaken.

Daniel showed evidence from experimentation. His numbers jibe with
yours. Without compression, his data occupy roughly 5 MiB of memory.

Show the problem or withdraw the claim.

Read my statement of the problem.
A modern desktop has more than enough memory to easily handle a quarter *billion* words, which is 100 times greater than your guessed upper limit.

And that's *without* compression.

Sure. If that is all that it does. My main (and older) desktop
box has 1.5 GB. I have trouble with not enough memory at times.
Adding another app might break its back.

Again, how much damage will < 5 MiB of data do to that system?

How about 50 MiB? That's *ten times* the number of words you might
need to handle.
Without any compression.

Fine. My system is currently using 1547 MB of memory. It only has 1536 MB. With some intensive uses, I have seen the commit go as high as about 2200 MB. My system crawls then. Using 50 MB more in those circumstances would not be good.
You haven't shown a problem. You just chant, "The sky is falling!"

I believe that I have. Handwaving the possibility away does not
make it go away.

Sincerely,

Gene Wirchenko


Test program:

package net.virtualinfinity.moby;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Arrays;

public class WordTree {
    WordNode start = new WordNode(-1);

    public void addWord(String word) {
        start.addWord(word);
    }

    public static void main(String[] args) throws IOException {
        WordTree tree = new WordTree();
        int wordsLoaded = 0;
        for (String filename : args) {
            System.out.println("Loading " + filename);
            BufferedReader reader = new BufferedReader(new FileReader(filename), 1024 * 256);
            String line;
            while ((line = reader.readLine()) != null) {
                ++wordsLoaded;
                if ((wordsLoaded & 127) == 0) {
                    printStatus(wordsLoaded);
                }
                tree.addWord(line.trim());
            }
            System.gc();
            printStatus(wordsLoaded);
            System.out.println();
        }
    }

    private static void printStatus(int wordsLoaded) {
        System.out.print("\rWords loaded: " + wordsLoaded
                + ", Total Memory used: " + Runtime.getRuntime().totalMemory() + ". ");
    }
}


class WordNode {
    private boolean terminal;
    private final int value;
    private WordNode[] next;

    public WordNode(int ch) {
        value = ch;
    }

    public void addWord(String word) {
        if ("".equals(word)) {
            terminal = true;
            return;
        }
        final int ch = word.codePointAt(0);
        getNext(ch).addWord(word.substring(Character.charCount(ch)));
    }

    private WordNode getNext(int ch) {
        if (next == null) {
            next = new WordNode[1];
            next[0] = new WordNode(ch);
        }
        for (WordNode node : next) {
            if (node.value == ch) {
                return node;
            }
        }
        next = Arrays.copyOf(next, next.length + 1);
        final WordNode newNode = new WordNode(ch);
        next[next.length - 1] = newNode;
        return newNode;
    }
}

------- Output -----

Loading compound-words.txt
Words loaded: 256772, Total Memory used: 120909824.
Loading often-mispelled.txt
Words loaded: 257138, Total Memory used: 121106432.
Loading english-most-frequent.txt
Words loaded: 258141, Total Memory used: 121106432.
Loading male-names.txt
Words loaded: 262038, Total Memory used: 121106432.
Loading female-names.txt
Words loaded: 266984, Total Memory used: 121106432.
Loading common-names.txt
Words loaded: 288970, Total Memory used: 121106432.
Loading common-dictionary.txt
Words loaded: 363520, Total Memory used: 127373312.
Loading official-scrabble-1st-edition.txt
Words loaded: 477329, Total Memory used: 129957888.
Loading official-scrabble-2nd-edition-delta.txt
Words loaded: 481489, Total Memory used: 129957888.
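(The before/after check mentioned next was presumably something along these lines; a minimal, self-contained sketch, not the code actually used, where used heap is estimated as totalMemory() minus freeMemory() after a System.gc() hint:)

public class HeapDelta {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();

        System.gc();
        long before = rt.totalMemory() - rt.freeMemory();

        // Stand-in for "load the word lists": keep a reference so it isn't collected.
        java.util.List<String> words = new java.util.ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            words.add("word" + i);
        }

        System.gc();
        long after = rt.totalMemory() - rt.freeMemory();

        System.out.println("Approximate bytes used by " + words.size()
                + " strings: " + (after - before));
    }
}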

BTW, if I check the memory usage before loading words and after, the
difference is ~ 42MB

So, loading 481k words takes up about 42MB. This is in Java, which has a fairly high overhead per string, and the implementation of my data structure is fairly naive as well.

Extrapolating that data to an extreme 2 million words, that would be
less than 200MB in memory.

My gut feeling beats your gut feeling, and my science proves it true. If
you are going to reply with a counter argument, please provide a
reproducible experiment to prove your argument. Otherwise, this
conversation is over.
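(Since the thread title mentions a Patricia trie: for reference, a minimal sketch of a Patricia-style (radix) trie that collapses single-child chains into labeled edges. This is purely illustrative, not the structure measured above; the HashMap child table is simply the easiest thing that works.)

import java.util.HashMap;
import java.util.Map;

public class RadixTrie {
    private final Node root = new Node("");

    private static final class Node {
        String label;                          // edge label leading into this node
        boolean terminal;                      // true if a word ends here
        final Map<Character, Node> children = new HashMap<>();

        Node(String label) { this.label = label; }
    }

    public void add(String word) {
        Node node = root;
        String rest = word;
        while (true) {
            if (rest.isEmpty()) {
                node.terminal = true;
                return;
            }
            Node child = node.children.get(rest.charAt(0));
            if (child == null) {
                Node leaf = new Node(rest);    // no matching edge: hang the remainder here
                leaf.terminal = true;
                node.children.put(rest.charAt(0), leaf);
                return;
            }
            int common = commonPrefixLength(child.label, rest);
            if (common == child.label.length()) {
                node = child;                  // edge fully matched, descend
                rest = rest.substring(common);
                continue;
            }
            // Labels diverge part-way along the edge: split it.
            Node split = new Node(child.label.substring(0, common));
            node.children.put(rest.charAt(0), split);
            child.label = child.label.substring(common);
            split.children.put(child.label.charAt(0), child);
            if (common == rest.length()) {
                split.terminal = true;
            } else {
                Node leaf = new Node(rest.substring(common));
                leaf.terminal = true;
                split.children.put(leaf.label.charAt(0), leaf);
            }
            return;
        }
    }

    public boolean contains(String word) {
        Node node = root;
        String rest = word;
        while (!rest.isEmpty()) {
            Node child = node.children.get(rest.charAt(0));
            if (child == null || !rest.startsWith(child.label)) {
                return false;
            }
            node = child;
            rest = rest.substring(child.label.length());
        }
        return node.terminal;
    }

    private static int commonPrefixLength(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        RadixTrie trie = new RadixTrie();
        trie.add("roman");
        trie.add("romance");
        trie.add("rubens");
        System.out.println(trie.contains("romance")); // true
        System.out.println(trie.contains("roma"));    // false: prefix only, not a stored word
    }
}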
 
G

Gene Wirchenko

On Tue, 29 May 2012 15:39:16 -0700, Daniel Pitts

[snip]
BTW, if I check the memory usage before loading words and after, the
difference is ~ 42MB

So, loading 481k words takes up about 42MB. This is in Java, which has a fairly high overhead per string, and the implementation of my data structure is fairly naive as well.

So about 100 bytes per word.
Extrapolating that data to an extreme 2 million words, that would be
less than 200MB in memory.

Or about 100 MB for my SWAG of one million words.
My gut feeling beats your gut feeling, and my science proves it true. If

It sure does. Lew's figures (up to 10 MB) were considerably
lower, and that is what I was objecting to. They seemed too low.
you are going to reply with a counter argument, please provide a
reproducible experiment to prove your argument. Otherwise, this
conversation is over.

Actually, I started with a question, namely "Including all affixes?",
and I have been trying to get a reasonable answer to it. I knew that
your word counts were low compared to others that I have seen and
wanted to know a realistic memory use figure for the *whole* English
language. You have provided a sufficiently good approximation. Thank
you.

Sincerely,

Gene Wirchenko
 
L

Lew

Gene said:
Daniel Pitts wrote:

[snip]
BTW, if I check the memory usage before loading words and after, the
difference is ~ 42MB

So, loading 481k words takes up about 42MB. This is in Java, which has a fairly high overhead per string, and the implementation of my data structure is fairly naive as well.

We've been talking apples and oranges. You refer here to the memory
footprint if the entire word list were in memory as naive Strings. I've been
talking about the footprint in storage (e.g., the SD card of a smartphone).

So, Gene, you have been misinterpreting my point.

There is little reason to hold the dictionary in memory at all times if you are in a memory-constrained environment. If one did need to hold it in memory, and one were worried about the memory footprint, one would not keep it in a memory-intensive, naive structure in RAM all at once.

Compression exceeding 90% is typical with text. Assuming 75% compression,
a very reasonable and safe estimate, the data would occupy about 10 MB, just
as I said. That's if you keep it in memory in compressed form.

A disk-based solution would likely occupy only a few KB at a time.
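(For what it's worth, a minimal sketch of how one could measure that compressed footprint with stock java.util.zip; the word-list file name here is only a placeholder:)

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.GZIPOutputStream;

public class CompressionEstimate {
    public static void main(String[] args) throws IOException {
        // Placeholder file name; pass the real word list on the command line.
        String file = args.length > 0 ? args[0] : "wordlist.txt";
        byte[] raw = Files.readAllBytes(Paths.get(file));

        // GZIP the whole list in memory and compare sizes.
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(compressed)) {
            gzip.write(raw);
        }

        int packed = compressed.size();
        System.out.printf("raw: %,d bytes, gzip: %,d bytes (%.1f%% reduction)%n",
                raw.length, packed, 100.0 * (raw.length - packed) / raw.length);
    }
}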

Nice try to distort the issue into a scary monster, though.
 
