Patricia trie vs binary search.

Discussion in 'Java' started by markspace, May 25, 2012.

  1. markspace

    markspace Guest

    For some reason I was thinking about sub-string searches today. I
    looked up Patricia tries (a kind of radix tree) to see if they would
    help. While interesting, the radix tree seems to have a lot of overhead
    for large numbers of entries.

    The radix tree uses a bucket at each level to hold all children (and
    there could be quite a lot of children). Each child, if present,
    requires a pointer (an object in Java) to hold it. For the example
    given, this could be as much as one object per character in each
    string, plus the bucket to hold it and its siblings. If the number of
    strings is very large, this could really result in an explosion of
    memory usage.

    <http://en.wikipedia.org/wiki/Radix_tree>
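    To make the overhead concrete, here is a minimal sketch of an
    uncompressed child-per-character node (purely illustrative; a Patricia
    trie collapses single-child chains, but each level still needs its own
    child bucket):

    import java.util.HashMap;
    import java.util.Map;

    // Purely illustrative, uncompressed trie node: in the worst case one
    // Node object per stored character, plus a child map ("bucket") at
    // every level.
    public class NaiveTrieNode
    {
        final Map<Character, NaiveTrieNode> children = new HashMap<>();
        boolean endOfWord;

        void insert( String s )
        {
            NaiveTrieNode node = this;
            for( char c : s.toCharArray() )
                node = node.children.computeIfAbsent( c, k -> new NaiveTrieNode() );
            node.endOfWord = true;
        }

        int nodeCount()
        {
            int count = 1;   // this node
            for( NaiveTrieNode child : children.values() )
                count += child.nodeCount();
            return count;
        }

        public static void main( String[] args )
        {
            NaiveTrieNode root = new NaiveTrieNode();
            for( String w : new String[]{ "romane", "romanus", "romulus",
                    "rubens", "ruber", "rubicon", "rubicundus" } )
                root.insert( w );
            // Seven short words already cost a couple dozen Node objects
            // plus their HashMaps.
            System.out.println( root.nodeCount() + " nodes" );
        }
    }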

    So is there a better way for large data sets?

    For the example given, it appears to be a kind of incremental search.
    That is, first we find the "r" which is common to all the words in the
    example. Then, if we're looking for, say, "romane", we'd find that the
    "om" is common to all words that match. Then we find that "an" matches
    both "romane" and "romanus". Finally we find the "e", which tells us
    "romane" exists in the tree.

    So I implemented a simple "incremental binary search". Given a string,
    the search tells you if there are any elements that begin with the
    parameter string. It also remembers the previous match, so that the
    next search picks up from that location: "ro" could be matched next,
    then "rom", and so on. If the search ever returns a non-match
    condition, you know you've hit the maximum extent. There are no
    further matches, regardless of how many additional characters you
    append.

    This seems useful for "short-circuiting" long string matches, allowing
    you to pick out dictionary matches without incurring average n^2 time.

    My question is: does this really buy me anything? Or did I just
    duplicate some other well-known algorithm that works just as well or
    better?

    /*
     * Copyright 2012 Brenden Towey. All rights reserved.
     */
    package com.techdarwinia.trie;

    import java.io.FileReader;
    import java.io.Reader;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Scanner;

    /**
     *
     * @author Brenden Towey
     */
    public class IncrementalStringSearch
    {

        private int low;
        private int high;
        private String[] array;

        public IncrementalStringSearch( String[] array )
        {
            this.array = array;
            high = array.length - 1;
        }

        /**
         * Returns true if any element of the sorted array starts with
         * findString. The search range [low, high] is narrowed on each
         * call, so successive calls with longer prefixes pick up where
         * the last call left off. Once this returns false, no longer
         * prefix can match either.
         */
        public boolean find( String findString )
        {
            if( low > high ) return false;
            for( ;; ) {
                int index = low + ( high - low ) / 2; // overflow-safe midpoint
                // Compare only a prefix of the candidate, so "password"
                // counts as a match for "pass".
                String foundSub = array[index];
                if( foundSub.length() > findString.length() )
                    foundSub = foundSub.substring( 0, findString.length() );
                if( foundSub.equals( findString ) )
                    return true;
                if( foundSub.compareTo( findString ) > 0 ) {
                    high = index - 1;
                    if( high < low ) return false;
                } else {
                    low = index + 1;
                    if( low > high ) return false;
                }
            }
        }

        public static void main( String[] args ) throws Exception
        {
            ArrayList<String> passwords = new ArrayList<>();

            // Load the bad-password dictionary and sort it for binary search.
            Reader ins = new FileReader( "test/com/techdarwinia/trie/bpw.txt" );
            Scanner scanner = new Scanner( ins );
            scanner.useDelimiter( ",?\\s+" );
            while( scanner.hasNext() ) {
                passwords.add( scanner.next() );
            }
            ins.close();
            Collections.sort( passwords );

            // The trailing space forces a final mismatch so matches that
            // end at the end of the string still get reported.
            String test = "passxword ";
            for( int i = 0; i < test.length(); i++ ) {
                // Fresh searcher for each starting position; find() keeps
                // narrowing its range as the prefix grows.
                IncrementalStringSearch inc = new IncrementalStringSearch(
                        passwords.toArray( new String[0] ) );
                for( int j = i + 1; j <= test.length(); j++ ) {
                    String string = test.substring( i, j );
                    if( !inc.find( string ) ) {
                        // The prefix one character shorter was the last to
                        // match; report it only if it is itself in the list.
                        if( j > i + 1 && passwords.contains( test.substring( i, j - 1 ) ) ) {
                            System.out.println( "Found: " + test.substring( i, j - 1 ) );
                        }
                        break;
                    }
                }
            }
        }
    }
     
    markspace, May 25, 2012
    #1

  2. markspace <-@.> wrote:
    > For some reason I was thinking about sub-string searches today. I
    > looked up Patricia tries (a kind of radix tree) to see if they would
    > help. While interesting, the radix tree seems to have a lot of overhead
    > for large numbers of entries.


    I am not so sure which substring problem you are interested in,
    but you might look at the Aho-Corasick algorithm.
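    A minimal sketch of the Aho-Corasick approach, assuming a HashMap-based
    trie and made-up class names: build a trie of the dictionary words,
    wire up failure links with a breadth-first pass, then scan the text
    once, reporting every dictionary word that ends at each position.

    import java.util.ArrayDeque;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Queue;

    public class AhoCorasickSketch
    {
        private static final class Node
        {
            final Map<Character, Node> children = new HashMap<>();
            Node fail;     // deepest proper suffix that is also a trie path
            String word;   // non-null if a dictionary word ends here
        }

        private final Node root = new Node();

        public void add( String word )
        {
            Node node = root;
            for( char c : word.toCharArray() )
                node = node.children.computeIfAbsent( c, k -> new Node() );
            node.word = word;
        }

        // Breadth-first pass that wires up the failure links.
        public void build()
        {
            Queue<Node> queue = new ArrayDeque<>();
            for( Node child : root.children.values() ) {
                child.fail = root;
                queue.add( child );
            }
            while( !queue.isEmpty() ) {
                Node current = queue.remove();
                for( Map.Entry<Character, Node> e : current.children.entrySet() ) {
                    char c = e.getKey();
                    Node child = e.getValue();
                    Node f = current.fail;
                    while( f != null && !f.children.containsKey( c ) )
                        f = f.fail;
                    child.fail = ( f == null ) ? root : f.children.get( c );
                    queue.add( child );
                }
            }
        }

        public void search( String text )
        {
            Node node = root;
            for( int i = 0; i < text.length(); i++ ) {
                char c = text.charAt( i );
                while( node != root && !node.children.containsKey( c ) )
                    node = node.fail;
                node = node.children.getOrDefault( c, root );
                // Walk the failure chain so shorter words ending here are
                // reported too (e.g. "ass" inside "pass").
                for( Node hit = node; hit != root; hit = hit.fail ) {
                    if( hit.word != null )
                        System.out.println( "Found: " + hit.word
                                + " at " + ( i - hit.word.length() + 1 ) );
                }
            }
        }

        public static void main( String[] args )
        {
            AhoCorasickSketch ac = new AhoCorasickSketch();
            for( String w : new String[]{ "pass", "ass", "word", "or" } )
                ac.add( w );
            ac.build();
            ac.search( "passxword" );
        }
    }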

    If you want to allow for insertions and/or deletions, look
    at dynamic programming.
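    For the insertions/deletions case, a minimal dynamic-programming
    sketch (Levenshtein edit distance; the names are illustrative):

    // Dynamic programming sketch: minimum number of insertions, deletions,
    // and substitutions needed to turn a into b.
    public class EditDistanceSketch
    {
        static int distance( String a, String b )
        {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for( int i = 0; i <= a.length(); i++ ) d[i][0] = i; // delete all of a
            for( int j = 0; j <= b.length(); j++ ) d[0][j] = j; // insert all of b
            for( int i = 1; i <= a.length(); i++ ) {
                for( int j = 1; j <= b.length(); j++ ) {
                    int cost = ( a.charAt( i - 1 ) == b.charAt( j - 1 ) ) ? 0 : 1;
                    d[i][j] = Math.min( Math.min(
                            d[i - 1][j] + 1,              // deletion
                            d[i][j - 1] + 1 ),            // insertion
                            d[i - 1][j - 1] + cost );     // match or substitution
                }
            }
            return d[a.length()][b.length()];
        }

        public static void main( String[] args )
        {
            System.out.println( distance( "romane", "romanus" ) ); // prints 2
        }
    }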

    -- glen
     
    glen herrmannsfeldt, May 25, 2012
    #2

  3. markspace

    markspace Guest

    On 5/24/2012 4:39 PM, glen herrmannsfeldt wrote:
    > markspace<-@.> wrote:
    >> For some reason I was thinking about sub-string searches today. I
    >> looked up Patricia tries (a kind of radix tree) to see if they would
    >> help. While interesting, the radix tree seems to have a lot of overhead
    >> for large numbers of entries.

    >
    > I am not so sure which substring problem you are interested in,
    > but you might look at the Aho-Corasick algorithm.
    >
    > If you want to allow for insertions and/or deletions, look
    > at dynamic programming.



    Thanks for those pointers (pun not intended). The "sub-string search"
    I was referring to was picking words out of a longer, undelimited
    string.

    The example I posted looked for bad passwords within a larger password.

    "passxword" contains at least two bad passwords, pass and word,
    according to the dictionary file I download from here (that's the
    bpw.txt file in the sample code):

    <http://www.godlikeproductions.com/badpasswords.php>

    The code I posted finds pass, ass, word, and or in the string
    "passxword", which are all bad passwords according to that link above.
    (I deliberately avoid reporting single letters as bad.)

    Practically speaking, hashing appears to be just as fast for a lookup,
    but hashing doesn't tell you when to stop, so it's always N^2 for this
    sort of problem. My algorithm should average less, hopefully much less
    if you have very long input strings.
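    For comparison, here is a minimal sketch of the hashing approach being
    described: every substring has to be tested, because a miss on, say,
    "passx" says nothing about whether some longer extension could still
    match, hence the N^2 scan (the class and the hard-coded word set are
    illustrative):

    import java.util.HashSet;
    import java.util.Set;

    public class HashScanSketch
    {
        // Tests every substring of length >= 2 against the bad-word set.
        static void findBadWords( String input, Set<String> badWords )
        {
            for( int i = 0; i < input.length(); i++ ) {
                for( int j = i + 2; j <= input.length(); j++ ) {
                    String candidate = input.substring( i, j );
                    if( badWords.contains( candidate ) )
                        System.out.println( "Found: " + candidate );
                }
            }
        }

        public static void main( String[] args )
        {
            Set<String> bad = new HashSet<>();
            bad.add( "pass" ); bad.add( "ass" ); bad.add( "word" ); bad.add( "or" );
            findBadWords( "passxword", bad ); // prints pass, ass, word, or
        }
    }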

    I really don't know what set me off looking at this; it's just a little
    random experiment I suppose.
     
    markspace, May 25, 2012
    #3
  4. markspace

    Lew Guest

    Password quality (Was: Patricia trie vs binary search.)

    markspace wrote:
    > Thanks for those pointers (pun not intended). The "sub-string search"
    > was referring to was picking words out of a longer, undelimited string.
    >
    > The example I posted looked for bad passwords within a larger password.
    >
    > "passxword" contains at least two bad passwords, pass and word,
    > according to the dictionary file I download from here (that's the
    > bpw.txt file in the sample code):
    >
    > <http://www.godlikeproductions.com/badpasswords.php>
    >
    > The code I posted find pass, ass, word, and or in the string
    > "passxword", which are all bad passwords according to that link above.
    > (I deliberately avoid reporting single letters as bad).


    I wonder about eliminating two-letter combinations. How much entropy does that add to (or subtract from) passwords?

    It's practicable and arguably more reliable to use passphrases comprising
    all natural words whose entropy exceeds that of a fairly long Mxyzptlk®
    key. (Note: "Mxyzptlk" may well pass all your password checks, yet is highly
    guessable. Equally flawed are other stringlets that pass naive checks,
    like "XB-17", "UB40" and others.) See
    http://world.std.com/~reinhold/diceware.html
    for how to create high-entropy, highly memorable passphrases.

    > Practically speaking, hashing appears to be just as fast for a lookup,
    > but hashing doesn't tell you when to stop, so it's always N^2 for this
    > sort of problem. My algorithm should average less, hopefully much less
    > if you have very long input strings.
    >
    > I really don't know what set me off looking at this; it's just a little
    > random experiment I suppose.


    Your main question of space- and time-efficient substring matching is
    a fairly well-studied problem. I don't off the top of my head have better
    answers, but your approach to experiment is viable.

    You could instrument (profile) your code over different algorithms and
    input dataset profiles (that is, relative proportion of long, medium-length
    and short strings, relative proportion of "bad" substrings in the mix, and
    other factors that affect the "shape" of the data your algorithms process).
    Algorithm analysis should give you the big O, but not the constant factors
    or where they break for lower /n/.

    A lot of algorithm analysis critically depends on what constitutes "average"
    datasets.

    --
    Lew
     
    Lew, May 25, 2012
    #4
  5. markspace

    markspace Guest

    Re: Password quality (Was: Patricia trie vs binary search.)

    On 5/25/2012 9:41 AM, Lew wrote:
    >
    > I wonder about eliminating two-letter combinations. How much entropy
    > does that add (or subtract) from passwords?



    I was thinking the same thing. Also, searching for the longest
    possible word, and not backtracking, might be sufficient. Once you
    find "word", why go back and scan for two- or three-letter
    combinations?
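    A rough sketch of that longest-match-only idea, assuming it sits
    alongside the IncrementalStringSearch class from the first post and
    that the word list is sorted:

    // Extend the prefix as far as the incremental search allows, report
    // only the longest dictionary hit, then skip past it instead of
    // backtracking into shorter words.
    static void longestMatches( String input, java.util.List<String> sortedWords )
    {
        String[] array = sortedWords.toArray( new String[0] );
        int i = 0;
        while( i < input.length() ) {
            IncrementalStringSearch inc = new IncrementalStringSearch( array );
            String longest = null;
            for( int j = i + 1;
                    j <= input.length() && inc.find( input.substring( i, j ) );
                    j++ ) {
                String prefix = input.substring( i, j );
                if( prefix.length() > 1 && sortedWords.contains( prefix ) )
                    longest = prefix;   // keep only the longest hit so far
            }
            if( longest != null ) {
                System.out.println( "Found: " + longest );
                i += longest.length(); // greedy: don't rescan inside the match
            } else {
                i++;
            }
        }
    }

    On "passxword" with the same list this reports only "pass" and "word",
    skipping "ass" and "or".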


    >
    > It's practicable and arguably more reliable to use passphrases
    > comprising all natural words whose entropy exceeds that of a fairly
    > long Mxyzptlk® key. (Note: "Mxyzptlk" may well pass all your password
    > checks, yet is highly guessable. Equally flawed are other stringlets



    The idea was that if you increase the required length of a passphrase,
    users may defeat your requirement by just repeating a shorter bad
    password. "birdbirdbirdbird" is a pretty guessable 16 letter password,
    and "bird" appears in the bad password list. However,
    "birdaliceferretsalut" is a pretty decent password, even though each of
    its four component words appears in the bad password list. So if you
    can spot the individual component words and make sure they don't repeat,
    you've improved entropy a bit.


    > that pass naive checks, like "XB-17", "UB40" and others.) See
    > http://world.std.com/~reinhold/diceware.html for how to create
    > high-entropy, highly memorable passphrases.



    Yes, there should be other checks. Overall length, and let's say at
    least 5 to 7 different characters. So even though "UB40UB40UB40UB40" is
    16 characters and no sub-words appear on the bad password list, it only
    uses 4 different characters, which we might not want to allow.
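    The distinct-character check itself is tiny; a sketch using Java 8
    streams (the class and method names are made up):

    public class DistinctCharCheck
    {
        // Rejects passwords that use fewer than some minimum number of
        // different characters.
        static boolean enoughDistinctChars( String password, int minimum )
        {
            return password.chars().distinct().count() >= minimum;
        }

        public static void main( String[] args )
        {
            System.out.println( enoughDistinctChars( "UB40UB40UB40UB40", 5 ) );     // false: 4 distinct
            System.out.println( enoughDistinctChars( "birdaliceferretsalut", 5 ) ); // true: 12 distinct
        }
    }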


    > Your main question of space- and time-efficient substring matching
    > is a fairly well-studied problem. I don't off the top of my head have
    > better answers, but your approach to experiment is viable.



    Right, although comparing algorithms can be hard too. I would have to
    implement each algorithm such that it was optimal, and I don't always
    have the skill or time to do that. "Many eyes" on a problem is often
    the more efficient solution. (This is also known as "research" and "not
    re-inventing the wheel".) Though certainly it wouldn't hurt to make the
    attempt.

    Glen's answer above was very helpful. Wikipedia has cross-referenced
    their string-matching algorithms so that finding one leads to all the
    others.

    <http://en.wikipedia.org/wiki/Category:String_matching_algorithms>

    This gives me something to chew on, at least.
     
    markspace, May 25, 2012
    #5
  6. markspace

    Daniel Pitts Guest

    On 5/24/12 4:07 PM, markspace wrote:
    > For some reason I was thinking about sub-string searches today. I looked
    > up Patricia tries (a kind of radix tree) to see if they would help.
    > While interesting, the radix tree seems to have a lot of overhead for
    > large numbers of entries.
    >
    > The radix tree uses a bucket at each level to hold all children (and
    > there could be quite a lot of children). Each child if present requires
    > a pointer (an object in Java) to hold it. For the example given, this
    > could be as much as one object per character in each string, plus the
    > bucket to hold it and its siblings. If the number strings is very large,
    > this could really result in an explosion of memory usage.


    I tend to use a Deterministic Finite State Automaton for this. You can
    load the entire English dictionary fairly easily with that scheme. Yes,
    you use a bit of memory, but unless you're doing this on an embedded
    device, it's probably not enough memory to be concerned about.

    Most of what I know about "searching" and "parsing", I've learned from
    "Parsing Techniques - A Practical Guide"
    <http://dickgrune.com/Books/PTAPG_1st_Edition/>. Free PDF or PS
    downloads on that page.

    Well worth a read. I'm sure parsing theory has been much extended since
    this book was written; however, it is definitely a good introduction to
    the concepts in the space.

    HTH,
    Daniel.
     
    Daniel Pitts, May 27, 2012
    #6
  7. markspace

    markspace Guest

    That looks very interesting. Thanks! I'll have to check it out.
     
    markspace, May 27, 2012
    #7
  8. On Sat, 26 May 2012 17:30:17 -0700, Daniel Pitts
    <> wrote:

    [snip]

    >I tend to use a Deterministic Finite State Automata for this. You can
    >load the entire English dictionary fairly easily with that scheme. Yes,

    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    >you use a bit of memory, but unless you're doing this on an embedded
    >device, its probably not enough memory to be concerned about.


    Including all affixes?

    [snip]

    Sincerely,

    Gene Wirchenko
     
    Gene Wirchenko, May 28, 2012
    #8
  9. markspace

    Daniel Pitts Guest

    On 5/27/12 6:44 PM, Gene Wirchenko wrote:
    > On Sat, 26 May 2012 17:30:17 -0700, Daniel Pitts
    > <> wrote:
    >
    > [snip]
    >
    >> I tend to use a Deterministic Finite State Automata for this. You can
    >> load the entire English dictionary fairly easily with that scheme. Yes,

    > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    >> you use a bit of memory, but unless you're doing this on an embedded
    >> device, its probably not enough memory to be concerned about.

    >
    > Including all affixes?

    I suppose it depends on the particular dictionary, but we're only
    talking a few hundred thousand entries, at least with the Moby
    word-lists as a base:

    cat compound-words.txt often-mispelled.txt english-most-frequent.txt
    male-names.txt female-names.txt common-names.txt common-dictionary.txt
    official-scrabble-* | sort | uniq | wc -lc
    388997 4801599

    That's just over 4 MB of strings, if stored as naively as possible.
    Certainly no problem for a modern-day computer. It can actually be
    compressed quite a bit when converted to a DFSA, assuming a reasonably
    optimized implementation.
    >
    > [snip]
    >
    > Sincerely,
    >
    > Gene Wirchenko
     
    Daniel Pitts, May 28, 2012
    #9
  10. markspace

    markspace Guest

    On 5/27/2012 10:00 PM, Daniel Pitts wrote:
    > the Moby
    > word-lists



    Moby word lists are neat, thanks for pointing that out.
     
    markspace, May 28, 2012
    #10
  11. On Sun, 27 May 2012 22:00:14 -0700, Daniel Pitts
    <> wrote:

    >On 5/27/12 6:44 PM, Gene Wirchenko wrote:
    >> On Sat, 26 May 2012 17:30:17 -0700, Daniel Pitts
    >> <> wrote:
    >>
    >> [snip]
    >>
    >>> I tend to use a Deterministic Finite State Automata for this. You can
    >>> load the entire English dictionary fairly easily with that scheme. Yes,

    >> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    >>> you use a bit of memory, but unless you're doing this on an embedded
    >>> device, its probably not enough memory to be concerned about.

    >>
    >> Including all affixes?

    >I suppose it depends on the particular dictionary, but we're only
    >talking a few hundred thousand entries, at least with the Moby
    >word-lists as a base:


    Considering how many affixes can be applied to some words, I find
    that very questionable:
    self:
    *ish
    *ishly
    un*ish
    un*ishly
    *less
    *lessly
    un*less
    un*lessly
    position:
    *s
    *ed
    *al
    *ally
    re*
    re*s
    re*ed
    *less
    mis*
    *er
    *ers
    friend:
    *s
    *ly
    *liness
    be*
    be*ing
    be*ed
    be*er
    be*ers
    These are not particularly extreme examples.

    [snip]

    Sincerely,

    Gene Wirchenko
     
    Gene Wirchenko, May 28, 2012
    #11
  12. markspace

    markspace Guest

    On 5/28/2012 8:20 AM, markspace wrote:
    > On 5/27/2012 10:00 PM, Daniel Pitts wrote:
    >> the Moby
    >> word-lists

    >
    >
    > Moby word lists are neat, thanks for pointing that out.



    While I'm thinking about it, here's a link I found from BYU with links
    to other corpus and word lists:

    http://corpus.byu.edu/
     
    markspace, May 28, 2012
    #12
  13. markspace

    Lew Guest

    On 05/28/2012 09:20 AM, Gene Wirchenko wrote:
    > On Sun, 27 May 2012 22:00:14 -0700, Daniel Pitts
    > <> wrote:
    >
    >> On 5/27/12 6:44 PM, Gene Wirchenko wrote:
    >>> On Sat, 26 May 2012 17:30:17 -0700, Daniel Pitts
    >>> <> wrote:
    >>>
    >>> [snip]
    >>>
    >>>> I tend to use a Deterministic Finite State Automata for this. You can
    >>>> load the entire English dictionary fairly easily with that scheme. Yes,
    >>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    >>>> you use a bit of memory, but unless you're doing this on an embedded
    >>>> device, its probably not enough memory to be concerned about.
    >>>
    >>> Including all affixes?

    >> I suppose it depends on the particular dictionary, but we're only
    >> talking a few hundred thousand entries, at least with the Moby
    >> word-lists as a base:

    >
    > Considering how many affixes can be applied to some words, I find
    > that very questionable:
    > self:
    > *ish
    > *ishly
    > un*ish
    > un*ishly
    > *less
    > *lessly
    > un*less
    > un*lessly
    > position:
    > *s
    > *ed
    > *al
    > *ally
    > re*
    > re*s
    > re*ed
    > *less
    > mis*
    > *er
    > *ers
    > friend
    > *s
    > *ly
    > *liness
    > be*
    > be*ing
    > be*ed
    > be*er
    > be*ers
    > These are not particularly extreme examples.


    It's not a question of how extreme the examples are but how many there are.

    Not all words can be legitimately affixed. Many can be affixed by algorithm,
    or by bitmaps as to which affixes apply, so you only store the root, the
    bitmap and perhaps one more form.
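    A minimal sketch of that root-plus-bitmap idea, with a made-up affix
    inventory:

    import java.util.EnumSet;

    // Store each root once, with a small set of flags recording which
    // affixes legitimately apply, instead of storing every inflected form
    // as its own string. An EnumSet of up to 64 affixes is backed by a
    // single long.
    public class AffixedWordSketch
    {
        enum Affix { S, ED, ING, LY, NESS, LESS, RE, UN }

        final String root;
        final EnumSet<Affix> allowed;

        AffixedWordSketch( String root, EnumSet<Affix> allowed )
        {
            this.root = root;
            this.allowed = allowed;
        }

        boolean accepts( Affix affix )
        {
            return allowed.contains( affix );
        }

        public static void main( String[] args )
        {
            AffixedWordSketch friend = new AffixedWordSketch(
                    "friend", EnumSet.of( Affix.S, Affix.LY, Affix.LESS ) );
            System.out.println( friend.accepts( Affix.S ) );   // true
            System.out.println( friend.accepts( Affix.ING ) ); // false
        }
    }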

    I don't know how much memory expansion you think your factors will cause, as
    you only hand wave and say there will be some and act like it's a problem, but
    let's say it doubles the size of the dictionary. By Daniel's experiment
    upthread, that would bring it to around 8 MiB, let's round and say 10MiB.
    Being text and all, that should compress to about 3 MiB or less.

    Now I am interested to hear what sort of trouble you assert that 3 MiB or so
    of storage will cause.

    --
    Lew
    Honi soit qui mal y pense.
    http://upload.wikimedia.org/wikipedia/commons/c/cf/Friz.jpg
     
    Lew, May 29, 2012
    #13
  14. markspace

    Jeff Higgins Guest

    On 05/24/2012 07:07 PM, markspace wrote:
    > For some reason I was thinking about sub-string searches today. I
    > looked up Patricia tries (a kind of radix tree) to see if they would
    > help. While interesting, the radix tree seems to have a lot of overhead
    > for large numbers of entries.
    >
    > The radix tree uses a bucket at each level to hold all children (and
    > there could be quite a lot of children). Each child if present requires
    > a pointer (an object in Java) to hold it. For the example given, this
    > could be as much as one object per character in each string, plus the
    > bucket to hold it and its siblings. If the number strings is very large,
    > this could really result in an explosion of memory usage.
    >
    > <http://en.wikipedia.org/wiki/Radix_tree>
    >
    > So is there a better way for large data sets?
    >

    Different. Better?
    <http://www.pathcom.com/~vadco/cwg.html>
     
    Jeff Higgins, May 29, 2012
    #14
  15. On Mon, 28 May 2012 21:54:39 -0700, Lew <> wrote:

    >On 05/28/2012 09:20 AM, Gene Wirchenko wrote:
    >> On Sun, 27 May 2012 22:00:14 -0700, Daniel Pitts
    >> <> wrote:
    >>
    >>> On 5/27/12 6:44 PM, Gene Wirchenko wrote:


    [snip]

    >> These are not particularly extreme examples.

    >
    >It's not a question of how extreme the examples are but how many there are.


    I mentioned extremity just to point out that such base words are
    not that unusual. There are a lot of them.

    >Not all words can be legitimately affixed. Many can be affixed by algorithm,
    >or by bitmaps as to which affixes apply, so you only store the root, the
    >bitmap and perhaps one more form.


    There are multiple affixes. I suppose they could be combined
    into aggregate affixes. e.g. -less + -ness -> -lessness.

    >I don't know how much memory expansion you think your factors will cause, as
    >you only hand wave and say there will be some and act like it's a problem, but
    >let's say it doubles the size of the dictionary. By Daniel's experiment


    Let's not make up data. I would like to know how much of the
    English language actually is in his dataset.

    >upthread, that would bring it to around 8 MiB, let's round and say 10MiB.
    >Being text and all, that should compress to about 3 MiB or less.
    >
    >Now I am interested to hear what sort of trouble you assert that 3 MiB or so
    >of storage will cause.


    Assuming your facts and then calling me on what you made up?

    http://oxforddictionaries.com/words/how-many-words-are-there-in-the-english-language
    is short but interesting reading. Its final paragraph: "This suggests
    that there are, at the very least, a quarter of a million distinct
    English words, excluding inflections, and words from technical and
    regional vocabulary not covered by the OED, or words not yet added to
    the published dictionary, of which perhaps 20 per cent are no longer
    in current use. If distinct senses were counted, the total would
    probably approach three quarters of a million."

    Now, add on what they exclude. Is one million words out of line?
    It is one thing to have some of a language encoded. It is quite
    another to try to handle everything. Exceptional cases tend to be
    more difficult to handle.

    Sincerely,

    Gene Wirchenko
     
    Gene Wirchenko, May 29, 2012
    #15
  16. markspace

    Daniel Pitts Guest

    On 5/28/12 9:54 PM, Lew wrote:
    > On 05/28/2012 09:20 AM, Gene Wirchenko wrote:
    >> On Sun, 27 May 2012 22:00:14 -0700, Daniel Pitts
    >> <> wrote:
    >>
    >>> On 5/27/12 6:44 PM, Gene Wirchenko wrote:
    >>>> On Sat, 26 May 2012 17:30:17 -0700, Daniel Pitts
    >>>> <> wrote:
    >>>>
    >>>> [snip]
    >>>>
    >>>>> I tend to use a Deterministic Finite State Automata for this. You can
    >>>>> load the entire English dictionary fairly easily with that scheme.
    >>>>> Yes,
    >>>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    >>>>> you use a bit of memory, but unless you're doing this on an embedded
    >>>>> device, its probably not enough memory to be concerned about.
    >>>>
    >>>> Including all affixes?
    >>> I suppose it depends on the particular dictionary, but we're only
    >>> talking a few hundred thousand entries, at least with the Moby
    >>> word-lists as a base:

    >>
    >> Considering how many affixes can be applied to some words, I find
    >> that very questionable:
    >> self:
    >> *ish
    >> *ishly
    >> un*ish
    >> un*ishly
    >> *less
    >> *lessly
    >> un*less
    >> un*lessly
    >> position:
    >> *s
    >> *ed
    >> *al
    >> *ally
    >> re*
    >> re*s
    >> re*ed
    >> *less
    >> mis*
    >> *er
    >> *ers
    >> friend
    >> *s
    >> *ly
    >> *liness
    >> be*
    >> be*ing
    >> be*ed
    >> be*er
    >> be*ers
    >> These are not particularly extreme examples.

    >
    > It's not a question of how extreme the examples are but how many there are.
    >
    > Not all words can be legitimately affixed. Many can be affixed by
    > algorithm, or by bitmaps as to which affixes apply, so you only store
    > the root, the bitmap and perhaps one more form.
    >
    > I don't know how much memory expansion you think your factors will
    > cause, as you only hand wave and say there will be some and act like
    > it's a problem, but let's say it doubles the size of the dictionary. By
    > Daniel's experiment upthread, that would bring it to around 8 MiB, let's
    > round and say 10MiB. Being text and all, that should compress to about 3
    > MiB or less.
    >
    > Now I am interested to hear what sort of trouble you assert that 3 MiB
    > or so of storage will cause.
    >

    Actually, the word lists I used to come up with my figures include the
    inflections for the words that take them.

    For example "ishly" and "ness" suffixes:

    > grep "ishly\b" * | wc -l

    323

    Which includes:
    wordishly
    womanishly
    wolfishly
    wishly
    winterishly
    wildishly
    whorishly
    whoreishly

    Most of these entries aren't even in Thunderbird's spell-check dictionary.

    > grep "ness\b" * | wc -l

    11762

    Includes entries such as:
    woodlessness
    wonderfulness
    unmysteriousness
    unexperiencedness
    undeceptiveness
    nonvolatileness


    Again, most of these aren't in the spell-check dictionary.

    My numbers are accurate, but thanks for questioning them, so I could
    further explore and see for myself.
     
    Daniel Pitts, May 29, 2012
    #16
  17. markspace

    Daniel Pitts Guest

    On 5/29/12 9:14 AM, Gene Wirchenko wrote:
    > On Mon, 28 May 2012 21:54:39 -0700, Lew<> wrote:
    >
    >> On 05/28/2012 09:20 AM, Gene Wirchenko wrote:
    >>> On Sun, 27 May 2012 22:00:14 -0700, Daniel Pitts
    >>> <> wrote:
    >>>
    >>>> On 5/27/12 6:44 PM, Gene Wirchenko wrote:

    >
    > [snip]
    >
    >>> These are not particularly extreme examples.

    >>
    >> It's not a question of how extreme the examples are but how many there are.

    >
    > I mentioned extremity just to point out that such base words are
    > not that unusual. There are a lot of them.
    >
    >> Not all words can be legitimately affixed. Many can be affixed by algorithm,
    >> or by bitmaps as to which affixes apply, so you only store the root, the
    >> bitmap and perhaps one more form.

    >
    > There are multiple affixes. I suppose they could be combined
    > into aggregate affixes. e.g. -less + -ness -> -lessness.
    >
    >> I don't know how much memory expansion you think your factors will cause, as
    >> you only hand wave and say there will be some and act like it's a problem, but
    >> let's say it doubles the size of the dictionary. By Daniel's experiment

    >
    > Let's not make up data. I would like to know how much of the
    > English language actually is in his dataset.

    Agreed, let's not make up data. Using the Moby word-list as a guide, it
    includes a significant number of those particular suffixes and prefixes
    you've mentioned. Granted, it's not a complete word-list, but then again
    nothing is. It is "complete enough" for most purposes.

    >
    >> upthread, that would bring it to around 8 MiB, let's round and say 10MiB.
    >> Being text and all, that should compress to about 3 MiB or less.
    >>
    >> Now I am interested to hear what sort of trouble you assert that 3 MiB or so
    >> of storage will cause.

    >
    > Assuming your facts and then calling me on what you made up?
    >
    > http://oxforddictionaries.com/words/how-many-words-are-there-in-the-english-language
    > is short but interesting reading. Its final paragraph: "This suggests
    > that there are, at the very least, a quarter of a million distinct
    > English words, excluding inflections, and words from technical and
    > regional vocabulary not covered by the OED, or words not yet added to
    > the published dictionary, of which perhaps 20 per cent are no longer
    > in current use. If distinct senses were counted, the total would
    > probably approach three quarters of a million."
    >
    > Now, add on what they exclude. Is one million words out of line?
    > Itis one thing to have some of a language encoded. It is quite
    > another to try to handle everything. Exceptional cases tend to be
    > more difficult to handle.


    Are you arguing that a modern system can't handle that number of words?
    A modern desktop has more than enough memory to easily handle a quarter
    *billion* words, which is 100 times greater than your guessed upper limit.

    And that's *without* compression.
     
    Daniel Pitts, May 29, 2012
    #17
  18. markspace

    Jeff Higgins Guest

    On 05/29/2012 12:16 PM, Daniel Pitts wrote:
    >
    > Most of these entries aren't even in Thunderbirds spell-check dictionary.
    >

    How many seven letter words can I construct from the twenty-six letters
    A through Z? How many of these words are defined English words? What
    are/are not the common affixes in this set of English words?
     
    Jeff Higgins, May 29, 2012
    #18
  19. markspace

    Daniel Pitts Guest

    On 5/29/12 10:37 AM, Jeff Higgins wrote:
    > On 05/29/2012 12:16 PM, Daniel Pitts wrote:
    >>
    >> Most of these entries aren't even in Thunderbirds spell-check dictionary.
    >>

    > How many seven letter words can I construct from the twenty-six letters
    > A through Z? How many of these words are defined English words? What
    > are/are not the common affixes in this set of English words?


    How about you download the Moby word list yourself and do these
    analyses? It is a very simple grep for the first question.
    egrep "\b[a-zA-Z]{7}\b" *.txt
     
    Daniel Pitts, May 29, 2012
    #19
  20. markspace

    Jeff Higgins Guest

    On 05/29/2012 01:49 PM, Daniel Pitts wrote:
    > On 5/29/12 10:37 AM, Jeff Higgins wrote:
    >> On 05/29/2012 12:16 PM, Daniel Pitts wrote:
    >>>
    >>> Most of these entries aren't even in Thunderbirds spell-check
    >>> dictionary.
    >>>

    >> How many seven letter words can I construct from the twenty-six letters
    >> A through Z? How many of these words are defined English words? What
    >> are/are not the common affixes in this set of English words?

    >
    > How about you download the Moby word list yourself and do these
    > analyses?

    I was asking you.
    > It is a very simple grep for the first question.
    > egrep "\b[a-zA-Z]{7}\b" *.txt

    Nope.
     
    Jeff Higgins, May 29, 2012
    #20
