Implementation ideas? Flexible string matching

Discussion in 'Ruby' started by Kirk Haines, Aug 12, 2004.

  1. Kirk Haines

    Kirk Haines Guest

    Here is the puzzle. You have a string. You have a list of strings and/or
    regular expressions. You want to find the match, if any, between your
    string and an element in the list.

    If the list is also just strings, you can create a hash with the strings as
    keys then do a hash lookup to do the match. That's fast, even if the list
    is very long.

    However, what is most of the list is string literals, but you want to be
    able to use regexps or some sort of wildcard matching for some of the
    matches, too. Does anyone have any bright ideas about how to do this as
    quickly as possible?

    The only idea that I have come up with is to put the literal matches in a
    hash, and then have the regular expressions in an array. If there isn't a
    literal match, then one has to accept the time consuming process of
    iterating through each regexp and checking it. Can anyone think of any
    other approaches that might be faster?


    Thanks,

    Kirk Haines
     
    Kirk Haines, Aug 12, 2004
    #1
    1. Advertising

  2. Kirk Haines <> wrote:

    > The only idea that I have come up with is to put the literal matches in a
    > hash, and then have the regular expressions in an array. If there isn't a
    > literal match, then one has to accept the time consuming process of
    > iterating through each regexp and checking it. Can anyone think of any
    > other approaches that might be faster?


    If you only need to know that there is a match, rather than what
    matched, how about combining all the regexps into one big regexp using
    ((re1)|(re2)|...)?

    martin
     
    Martin DeMello, Aug 12, 2004
    #2
    1. Advertising

  3. Kirk Haines

    Kirk Haines Guest

    On Fri, 13 Aug 2004 01:41:10 +0900, Martin DeMello wrote
    > Kirk Haines <> wrote:
    >
    > > The only idea that I have come up with is to put the literal matches in

    a
    > > hash, and then have the regular expressions in an array. If there isn't

    a
    > > literal match, then one has to accept the time consuming process of
    > > iterating through each regexp and checking it. Can anyone think of any
    > > other approaches that might be faster?

    >
    > If you only need to know that there is a match, rather than what
    > matched, how about combining all the regexps into one big regexp using
    > ((re1)|(re2)|...)?


    Hmmm. Interesting idea. I'll have to give it a try and see if there is a
    performance difference when doing that.


    Thanks,

    Kirk Haines
     
    Kirk Haines, Aug 12, 2004
    #3
  4. "Kirk Haines" <> schrieb im Newsbeitrag
    news:...
    > On Fri, 13 Aug 2004 01:41:10 +0900, Martin DeMello wrote
    > > Kirk Haines <> wrote:
    > >
    > > > The only idea that I have come up with is to put the literal matches

    in
    > a
    > > > hash, and then have the regular expressions in an array. If there

    isn't
    > a
    > > > literal match, then one has to accept the time consuming process of
    > > > iterating through each regexp and checking it. Can anyone think of

    any
    > > > other approaches that might be faster?

    > >
    > > If you only need to know that there is a match, rather than what
    > > matched, how about combining all the regexps into one big regexp using
    > > ((re1)|(re2)|...)?

    >
    > Hmmm. Interesting idea. I'll have to give it a try and see if there is a
    > performance difference when doing that.


    I was going to suggest the same: hash and a combined regexp. You could even
    create a single regexp that coevers all - if that doesn't blow up the regexp
    engine. I once made a script that takes a list of strings and creates a
    single regexp from them that is supposedly matched fast (i.e. doesn't need
    backtracking). If you're interested I can check at work tomorrow - I'm
    quite sure I still have it somewhere (please email as reminder).

    Kind regards

    robert
     
    Robert Klemme, Aug 12, 2004
    #4
  5. Kirk Haines

    Martin Kay Guest

    --Apple-Mail-5-218701513
    Content-Transfer-Encoding: 7bit
    Content-Type: text/plain;
    charset=US-ASCII;
    format=flowed

    In general, this is am intractable problem. After all, a single
    regular expression can match any substring and, in the worst case, they
    could all do this. The best solution I know to this problem needs a
    more powerful facility for handling regular expressions than is
    normally provided by a programming language.

    In what follows, I will talk about the finite-state automata (fsa)
    equivalent to regular expressions because that is what we need. So ...
    1. Make a fsa E = e1 | e2 ... en for the union of the set you are
    interested in.
    2. From this create F = Sigma* E by creating a loop for every character
    in the alphabet at the start state of E. Make F deterministic.
    3. Create R as the reverse of E, that is, its recognizes strings that
    match the regular expressions of interest, but read from right to
    left. We need to annotate the final states of this automaton with a
    list of the original expressions that are recognized when the
    automaton enters that state. R also needs to be deterministic.
    4. If we run F against the string, it will be in a final state just
    in case the character it has just examined ends a substring that
    matches (at least) one of the original expressions. Since F is
    deterministic, it is therefore possible to find all these places in
    linear time and, in fact, after examining each character exactly
    once. The trouble is that you cannot tell which expressions match
    at such a location, because, if you are k characters into the string,
    there could be k different places at which a match began.
    5. Starting at each location identified in step 4, run R backwards
    through the string. Every time it enters a final state, it
    identifies the starting point of a substring that ends
    where this backwards match started and, thanks to its special
    annotations, tells you which re(s) you have matched.
    This is, of course, intractable in perverse cases because step 5 can
    still start and finish everywhere. But it is close to linear in all
    but very worst cases. Notice, for what it is worth that, for each
    string position at which you run step 5, at least one match is
    guaranteed. The problem is that you cannot stop when you find the
    first one because there could be more.

    --Martin

    On Aug 12, 2004, at 6:11 PM, Kirk Haines wrote:

    > Here is the puzzle. You have a string. You have a list of strings
    > and/or
    > regular expressions. You want to find the match, if any, between your
    > string and an element in the list.
    >
    > If the list is also just strings, you can create a hash with the
    > strings as
    > keys then do a hash lookup to do the match. That's fast, even if the
    > list
    > is very long.
    >
    > However, what is most of the list is string literals, but you want to
    > be
    > able to use regexps or some sort of wildcard matching for some of the
    > matches, too. Does anyone have any bright ideas about how to do this
    > as
    > quickly as possible?
    >
    > The only idea that I have come up with is to put the literal matches
    > in a
    > hash, and then have the regular expressions in an array. If there
    > isn't a
    > literal match, then one has to accept the time consuming process of
    > iterating through each regexp and checking it. Can anyone think of any
    > other approaches that might be faster?
    >
    >
    > Thanks,
    >
    > Kirk Haines
    >
    >


    --Apple-Mail-5-218701513--
     
    Martin Kay, Aug 12, 2004
    #5
  6. Robert Klemme wrote:

    > I once made a script that takes a list of strings and creates a
    > single regexp from them that is supposedly matched fast (i.e. doesn't need
    > backtracking). If you're interested I can check at work tomorrow - I'm
    > quite sure I still have it somewhere (please email as reminder).


    Is this a trie optimization? I've written something similar for
    Regexp::English but the optimization itself is quite slow and takes up
    lots of resources. (Because it needs to build the trie.)

    It can do this:

    irb(main):006:0> re
    => /foobar|foobark|foobard|foobarqux|food|fool|foolish/
    irb(main):007:0> re.optimize
    => /foo(?:l(?:ish)?|bar(?:qux|[kd])?|d)/

    It was able to create an optimized Regexp that matches the first few
    hundred English words from a words file -- I've attached it. :)

    > Kind regards
    > robert


    More regards,
    Florian Gross


    /a(?:ki(?:mbo|n)|l(?:veol(?:ar|i|us)|k(?:al(?:eek:ids?|i(?:ne|s)?)|yl)|ways|l(?:ay(?:ed|s|ing)?|y(?:ing)?|o(?:w(?:a(?:bl(?:[ye])|nces?)|ed|s|ing)?|ys?|cat(?:able|ors?|e(?:[ds])?|i(?:ng|ons?))|phon(?:es?|ic)|t(?:ments?|s|t(?:e(?:[dr])|ing))?)|e(?:viat(?:e(?:[ds])?|i(?:ng|on))|les?|mande|y(?:ways?|s)?|rg(?:y|i(?:c|es))|g(?:ations?|or(?:y|i(?:c(?:al(?:ly)?)?|es))|e(?:d(?:ly)?|s)?|rettos?|i(?:ances?|ng)))|i(?:ances?|e(?:[ds])|gators?|terati(?:ve|ons?))|u(?:d(?:e(?:[ds])?|ing)|r(?:e(?:ment)?|ing)|si(?:ve(?:ness)?|ons?)))?|a(?:baster|crity|rm(?:ed|s|i(?:ng(?:ly)?|st))?|s)|m(?:a(?:nacs?)?|o(?:n(?:ds?|er)|st)|s(?:man)?|ighty)|b(?:a(?:core|tross)?|eit|um(?:s|in)?)|nico|c(?:eek:(?:ves?|hol(?:s|i(?:cs?|sm))?)|hemy)|o(?:n(?:e(?:ness)?|g(?:side)?)|of(?:ness)?|es?|ft|ha|ud)|der(?:m(?:an|en))?|p(?:ha(?:bet(?:s|i(?:c(?:al(?:ly)?|s)?|z(?:e(?:[ds])?|ing)))?|numeric)?|ine)|e(?:e|rt(?:ly|ness|e(?:d(?:ly)?|rs?)|s|ing)?)?|f(?:alfa|resco)|ready|g(?:a(?:e(?:cide)?)?|orithm(?:s|ic(?:ally)?)?|ebra(?:s|ic(?:ally)?)?|inate)|so|t(?:ars?|ogether|er(?:a(?:ble|tions?)|nat(?:eek:rs?|e(?:[ds]|ly)?|i(?:ve(?:ly|s)?|ng|ons?))|cations?|e(?:d|rs?)|s|ing)?|ruis(?:m|t(?:ic(?:ally)?)?)|hough|itudes?)|i(?:ve|ke|as(?:e(?:[ds])|ing)?|m(?:eek:ny|ents?)|bis?|en(?:at(?:e(?:[ds])?|i(?:ng|on))|s)?|g(?:n(?:ments?|ed|s|ing)?|ht))|u(?:m(?:n(?:ae?|i|us)|inum)?|ndum))|m(?:a(?:lgam(?:at(?:e(?:[ds])?|i(?:ng|on))|s)?|nuensis|z(?:e(?:ment|d(?:ly)?|rs?|s)?|ing(?:ly)?)|retto|ss(?:e(?:[ds])|ing)?|t(?:eek:ry|eur(?:s|is(?:m|h(?:ness)?))?)|in)|m(?:eek:(?:ni(?:ac?|um))?|unition)|b(?:l(?:e(?:[drs])?|ing)|assadors?|er|rosial|i(?:valen(?:ce|t(?:ly)?)|ance|dextrous(?:ly)?|ent|gu(?:eek:us(?:ly)?|it(?:y|ies))|tio(?:ns?|us(?:ly)?))|u(?:la(?:nces?|tory)|s(?:cade|h(?:e(?:[ds]))?)))|yl|nesty|o(?:k|ng(?:st)?|eba(?:[es])?|r(?:al(?:ity)?|ous|phous(?:ly)?|tiz(?:e(?:[ds])?|ing)|ist)|u(?:nt(?:e(?:d|rs?)|s|ing)?|r))|p(?:l(?:y|e|i(?:f(?:y(?:ing)?|i(?:cation|e(?:d|rs?|s)))|tudes?))|oules?|er(?:age|es?|sands?)|h(?:etamines?|i(?:b(?:eek:logy|i(?:ans?|ous(?:ly)?))|theaters?))|utat(?:e(?:[ds])?|ing))|e(?:liorat(?:ed?|i(?:ng|on))|n(?:able|orrhea|d(?:ments?|ed|s|ing)?|it(?:y|ies))?|ricium)|i(?:able|no|cabl(?:[ye])|d(?:e|st)?|go|ss|ty)|u(?:lets?|s(?:e(?:ments?|d(?:ly)?|rs?|s)?|ing(?:ly)?)))?|b(?:l(?:a(?:ze|t(?:e(?:[ds])?|i(?:ve|ng|on)))|y|e(?:r|st)?)|a(?:ndon(?:ment|ed|s|ing)?|ck|ft|s(?:e(?:ments?|d|s)?|h(?:e(?:[ds])|ing)?|ing)|t(?:e(?:ments?|d|r|s)?|ing))|b(?:eek:ts?|e(?:ys?)?|reviat(?:e(?:[ds])?|i(?:ng|ons?)))|ys(?:mal(?:ly)?|s(?:es)?)|normal(?:ly|it(?:y|ies))?|o(?:ve(?:mentioned|board|ground)?|li(?:sh(?:ments?|e(?:d|rs?|s)|ing)?|tion(?:ists?)?)|ard|mina(?:ble|te)|des?|r(?:t(?:ed|s|i(?:ve(?:ly)?|ng|ons?))?|igin(?:al|es?))|u(?:nd(?:ed|s|ing)?|t))|d(?:eek:m(?:ens?|inal)|uct(?:eek:rs?|ed|s|ions?)?)|e(?:yance|d|rra(?:nt|tions?)|t(?:s|t(?:e(?:[dr])|ing))?)|r(?:a(?:d(?:e(?:[ds])?|ing)|si(?:ve|ons?))|o(?:ad|gat(?:e(?:[ds])?|ing))|ea(?:ctions?|st)|idg(?:ment|e(?:[ds])?|ing)|upt(?:ly|ness)?)|s(?:c(?:eek:nd(?:ed|s|ing)?|ess(?:e(?:[ds]))?|issas?)|o(?:l(?:v(?:e(?:[ds])?|ing)|ut(?:e(?:ly|ness|s)?|ion))|r(?:b(?:e(?:n(?:cy|t)|d|r)|s|ing)?|pti(?:ve|ons?)))|en(?:ces?|t(?:ly|minded|e(?:d|e(?:s|ism)?)|s|i(?:a|ng))?)|t(?:ain(?:e(?:[dr])|s|ing)?|entions?|r(?:act(?:ly|ness|ors?|ed|s|i(?:ng|on(?:s|is(?:[mt]))?))?|use(?:ness)?)|inence)|inthe|urd(?:ly|it(?:y|ies))?)|hor(?:r(?:e(?:[dr]|nt)|ing)|s)?|i(?:lit(?:y|ies)|d(?:e(?:[ds])?|ing))|u(?:ndan(?:ce|t(?:ly)?)|s(?:e(?:[ds])?|i(?:ve|ng))|t(?:ment|s|t(?:e(?:d|rs?)|ing))?)|j(?:ect(?:ly|ness|ions?)?|ur(?:e(?:[ds])?|ing)))|n(?:kles?|a(?:l(?:y(?:z(?:able|e(?:d|rs?|s)?|ing)|s(?:es|ts?|is)|tic(?:al(?:ly)?|it(?:y|ies))?)|og(?:y|ous(?:ly)?|i(?:cal|es)|ues?)?)?|c(?:eek:ndas?|hronis(?:ms?|tically))|p(?:lasmosis|hor(?:a|ic(?:ally)?))|erobic|rch(?:y|i(?:c(?:al)?|s(?:m|ts?)))|grams?|stomo(?:s(?:es|is)|tic)|t(?:eek:m(?:y|ic(?:al(?:ly)?)?)|hema))|n(?:als?|o(?:y(?:ances?|e(?:d|rs?)|s|ing(?:ly)?)?|tat(?:e(?:[ds])?|i(?:ng|ons?))|unc(?:e(?:ments?|d|rs?|s)?|ing))|ex(?:ation|e(?:[ds])|ing)?|i(?:versar(?:y|ies)|hilat(?:e(?:[ds])?|i(?:ng|on)))|ual(?:ly|s)?)|c(?:est(?:eek:rs?|r(?:al|y))|ho(?:v(?:y|ies)|r(?:ages?|ed|s|i(?:ng|t(?:e|ism)))?)|i(?:llary|ent(?:ly|s)?))|d(?:ers|ing)?|e(?:w|m(?:eek:(?:met(?:ers?|ry)|ne)|i(?:[ac]))|c(?:dot(?:al|es?)|hoic)|sthe(?:sia|ti(?:c(?:ally|s)?|z(?:e(?:[ds])?|ing))))|g(?:l(?:e(?:d|rs?)?|ing)|e(?:l(?:s|ic)?|r(?:ed|s|ing)?)|r(?:y|i(?:ly|e(?:r|st)))|st(?:rom)?|iography|u(?:lar(?:ly)?|ish(?:ed)?))|hydrous(?:ly)?|i(?:line|m(?:a(?:ls?|t(?:eek:rs?|e(?:ly|ness|d(?:ly)?|s)?|i(?:ng|ons?)))|osity|i(?:zed|sm))|on(?:s|ic)?|s(?:eek:trop(?:y|ic)|e(?:ikonic)?)))?|c(?:knowledg(?:ments?|e(?:able|ments?|d|rs?|s)?|ing)|a(?:cia|dem(?:y|i(?:a|c(?:ally|s)?|es)))|me|yclic(?:ally)?|ne|c(?:l(?:a(?:mation|im(?:ed|s|ing)?)|imat(?:e(?:[ds])?|i(?:ng|z(?:ation|ed))))|o(?:lades?|m(?:modat(?:e(?:[ds])?|i(?:ng|ons?))|p(?:li(?:ces?|sh(?:ments?|e(?:d|rs?|s)|ing)?)|an(?:y(?:ing)?|i(?:ments?|e(?:[ds])|sts?))))|rd(?:ance|e(?:d|rs?)|s|i(?:ng(?:ly)?|ons?))?|st(?:ed|s|ing)?|unt(?:a(?:b(?:l(?:[ye])|ility)|n(?:cy|ts?))|ed|s|ing)?)|e(?:ler(?:at(?:eek:rs?|e(?:[ds])?|i(?:ng|ons?))|ometers?)|nt(?:ed|s|ing|ua(?:l|t(?:e(?:[ds])?|i(?:ng|on))))?|de(?:[ds])?|pt(?:a(?:b(?:l(?:[ye])|ility)|nces?)|ors?|e(?:d|rs?)|s|ing)?|ss(?:eek:r(?:[ys]|ies)|e(?:[ds])|i(?:b(?:l(?:[ye])|ility)|ng|ons?))?)|r(?:e(?:dit(?:ations?|ed)?|tions?)|u(?:e(?:[ds])?|ing))|ident(?:ly|al(?:ly)?|s)?|u(?:lturat(?:e(?:[ds])?|i(?:ng|on))|mulat(?:eek:rs?|e(?:[ds])?|i(?:ng|ons?))|r(?:a(?:c(?:y|ies)|te(?:ly|ness)?)|sed)|s(?:a(?:l|ti(?:ve|ons?))|e(?:[drs])?|tom(?:ed|s|ing)?|ing(?:ly)?)))|o(?:lytes?|rns?|ustic(?:al(?:ly)?|s|ian)?)|e(?:s|t(?:ate|ylene|one))?|qu(?:aint(?:ances?|ed|s|ing)?|i(?:esc(?:e(?:n(?:ce|t)|d|s)?|ing)|r(?:able|e(?:[ds])?|ing)|siti(?:ve(?:ness)?|ons?)|t(?:s|t(?:al|e(?:[dr])|ing))?))|r(?:ylic|o(?:bat(?:s|ics?)?|nyms?|polis|ss)|e(?:age|s)?|i(?:mon(?:y|ious)|d))|h(?:e(?:[ds])?|i(?:ng|ev(?:able|e(?:ments?|d|rs?|s)?|ing)))|t(?:eek:rs?|ed|ress(?:es)?|i(?:v(?:at(?:eek:rs?|e(?:[ds])?|i(?:ng|ons?))|e(?:ly)?|i(?:s(?:m|ts?)|t(?:y|ies)))|n(?:eek:meters?|g|ium)|ons?)|ua(?:l(?:ly|s|i(?:zation|t(?:y|ies)))?|rial(?:ly)?|t(?:eek:rs?|e(?:[ds])?|ing)))?|id(?:ly|s|i(?:c|t(?:y|ies))|ulous)?|u(?:men|te(?:ly|ness)?|ity))|d(?:v(?:an(?:c(?:e(?:ments?|d|s)?|ing)|tage(?:eek:us(?:ly)?|d|s)?)|oca(?:cy|t(?:e(?:[ds])?|ing))|e(?:nt(?:i(?:sts?|tious)|ur(?:eek:us|e(?:d|rs?|s)?|ing))?|r(?:b(?:s|ial)?|s(?:ar(?:y|ies)|e(?:ly)?|it(?:y|ies))|t(?:is(?:e(?:ments?|d|rs?|s)?|ing))?))|i(?:ce|s(?:ab(?:l(?:[ye])|ility)|or(?:[ys])?|e(?:ments?|d(?:ly)?|es?|rs?|s)?|ing)))|a(?:mant(?:ly)?|pt(?:a(?:b(?:le|ility)|tions?)|ors?|e(?:d|rs?)|s|i(?:ve(?:ly)?|ng))?|g(?:es?|ios?))|m(?:eek:ni(?:sh(?:ments?|e(?:[ds])|ing)?|tions?)|i(?:x(?:e(?:[ds])|ture)?|nist(?:er(?:ed|s|ings?)?|ra(?:ble|t(?:eek:rs?|e|i(?:ve(?:ly)?|ons?))))|r(?:a(?:l(?:s|ty)?|bl(?:[ye])|tions?)|e(?:d|rs?|s)?|ing(?:ly)?)|ssi(?:b(?:le|ility)|ons?)|t(?:s|t(?:ance|e(?:d(?:ly)?|rs?)|ing))?))|o(?:lescen(?:ce|ts?)|be|pt(?:e(?:d|rs?)|s|i(?:ve|ng|ons?))?|r(?:a(?:ble|tion)|n(?:ments?|ed|s)?|e(?:[ds])?))?|d(?:e(?:nd(?:a|um)?|d|rs?)|ress(?:ab(?:le|ility)|e(?:d|es?|rs?|s)|ing)?|s|i(?:ng|ct(?:ed|s|i(?:ng|ons?))?|ti(?:v(?:es?|ity)|on(?:al(?:ly)?|s)?))|uc(?:e(?:[ds])?|t(?:eek:r|ed|s|i(?:ng|on))?|i(?:ble|ng)))?|e(?:pt|qua(?:c(?:y|ies)|te(?:ly)?))|r(?:eek:it(?:ness)?|enal(?:ine)?|ift)|s(?:eek:r(?:b(?:ed|s|ing)?|ption))?|he(?:r(?:e(?:n(?:ce|ts?)|d|rs?|s)?|ing)|si(?:ves?|ons?))|i(?:abatic(?:ally)?|eu)|u(?:l(?:at(?:e|i(?:ng|on))|t(?:er(?:at(?:e(?:[ds])?|ing)|y|ous(?:ly)?|ers?)|s|hood)?)|mbrat(?:e(?:[ds])?|i(?:ng|on)))|j(?:acen(?:cy|t)|o(?:in(?:ed|s|ing)?|urn(?:ment|ed|s|ing)?)|ectives?|u(?:ncts?|d(?:g(?:e(?:[ds])?|ing)|icat(?:e(?:[ds])?|i(?:ng|ons?)))|r(?:e(?:[ds])?|ing)|st(?:abl(?:[ye])|ments?|ors?|e(?:d|rs?)|s|ing)?|tants?)))?|e(?:r(?:at(?:eek:rs?|e(?:[ds])?|i(?:ng|on))|o(?:acoustic|bics?|nautic(?:al|s)?|dynamics?|s(?:eek:l(?:s|ize)?|pace))|ials?)|gis|sthetic(?:ally|s)?)|f(?:l(?:ame|oat)|ar|o(?:eek:t|re(?:mentioned|said|thought)?|ul)|f(?:l(?:ict(?:ed|s|i(?:ve|ng|ons?))?|uen(?:ce|t))|a(?:ble|irs?)|ord(?:able|ed|s|ing)?|e(?:ct(?:ations?|ed|s|i(?:ve|ng(?:ly)?|on(?:ate(?:ly)?|s)?))?|rent)|r(?:eek:nt(?:ed|s|ing)?|i(?:cates?|ght))|i(?:liat(?:e(?:[ds])?|i(?:ng|ons?))|anced|x(?:e(?:[ds])|ing)?|nit(?:y|ies)|davits?|rm(?:ati(?:ve(?:ly)?|ons?)|ed|s|ing)?))|r(?:aid|esh)|t(?:er(?:wards?|life|m(?:ath|ost)|noons?|effect|glow|shocks?|thoughts?|image)?)?|i(?:cionado|eld|re))|g(?:l(?:eek:w|eam)|a(?:pe|r|tes?|in(?:st)?)|nostics?|o(?:n(?:y|i(?:z(?:e(?:[ds])?|ing(?:ly)?)|es))|g)?|e(?:less|n(?:c(?:y|ies)|das?|ts?)|d|rs?|s)?|r(?:arian|ee(?:abl(?:[ye])|ments?|d|rs?|s|ing)?|icultur(?:al(?:ly)?|e))|g(?:l(?:eek:merat(?:e(?:[ds])?|ion)|utin(?:at(?:e(?:[ds])?|i(?:ng|on))|ins?))|r(?:a(?:vat(?:e(?:[ds])?|ion)|ndize)|e(?:gat(?:e(?:[ds]|ly)?|i(?:ng|ons?))|ss(?:eek:rs?|i(?:ve(?:ly|ness)?|ons?)))|iev(?:e(?:[ds])?|ing)))|hast|i(?:l(?:e(?:ly)?|ity)|ng|tat(?:eek:rs?|e(?:[ds])?|i(?:ng|ons?)))|ue)|h(?:ead)?|i(?:l(?:ments?|erons?|ing)?|m(?:less(?:ly)?|e(?:d|rs?)|s|ing)?|d(?:ed?|s|ing)?|r(?:ways?|l(?:eek:cks?|ess|i(?:ne(?:[rs])?|fts?))|m(?:a(?:n|ils?)|en)|b(?:ags?|orne)|y|craft|drops?|p(?:lanes?|orts?)|e(?:d|rs?)|f(?:low|are|oils?|rames?|ields?)|s(?:p(?:ace|eed)|hips?|trips?)?|tight|i(?:ly|ngs?))?|sle)|jar)|A(?:k(?:ers|ron)|LGOL|l(?:v(?:a(?:rez)?|in)|l(?:a(?:[nh])|yn|e(?:n(?:dale|town)?|g(?:ra|hen(?:y|ies)))|state|is(?:eek:n)?)|a(?:m(?:eek:s?|eda)|bam(?:a(?:ns)?|ian)|n|ddin|r|s(?:kan?|tair))|maden|b(?:an(?:y|ia(?:ns?)?)|er(?:t(?:[ao])?|ich)|r(?:echt|ight)|uquerque)|yssa|c(?:mena|o(?:a|tt)|estis|ibiades)|d(?:e(?:baran|n)|rich)|p(?:ert|s|h(?:eek:nse|eratz))|e(?:x(?:and(?:er|r(?:a|e|i(?:a|ne)))|ei|is)?|ck?|ut(?:ian)?)|f(?:a|onso|redo?)|g(?:eek:(?:l|nqui(?:an|n))|e(?:nib|r(?:ian?)?)|iers)|s(?:atians?|op)|hambra|t(?:air|o(?:[ns])|haea)|i(?:c(?:e|ia)|s(?:eek:n|tair))?)?|ar(?:eek:n|hus)|m(?:a(?:nda|zons?|deus|rillo)|m(?:an|erman)|y|o(?:ntillado|co|s)|dahl|pex|e(?:lia|r(?:ada|ica(?:n(?:a|s|i(?:z(?:ations?|e(?:rs?|s)?)|sm))?|s)?)|s)|sterdam|h(?:aric|erst)|trak|iga)|b(?:aba|b(?:[ay]|ott)|yssinia(?:ns?)?|ner|os?|e(?:l(?:son|ian)?|r(?:nathy|deen))?|ra(?:m(?:s(?:eek:n)?)?|ham)|i(?:lene|djan|gail)|u)|n(?:kara|a(?:lects|b(?:aptists?|el)|creon|stasia|heim|tol(?:e|ian?))|n(?:a(?:list(?:ic)?|polis)?|e(?:tte)?|ie)?|d(?:alusia(?:ns?)?|y|o(?:ver|rra)|e(?:an|rs(?:eek:n|en)|s)|r(?:eek:m(?:ache|eda)|e(?:ws?|a|i)?))|g(?:l(?:eek:(?:ph(?:eek:bia|ilia))?|es|i(?:a|can(?:s|i(?:zes?|sm))?))|o(?:la|ra)|el(?:a|o|e(?:nos?|s)|i(?:n(?:[ae])|ca))|ie|us)|heuser|ita)|c(?:k(?:ley|erman)|a(?:dia|pulco)|cra|h(?:aeans?|illes)|t(?:a(?:eon)?|on|s))|d(?:kins|ler(?:ian)?|a(?:m(?:s(?:eek:n)?)?|ir)?|o(?:lph(?:us)?|nis)|d(?:ressograph|is(?:eek:n)?)|e(?:l(?:aide|e|ia)|n)|ri(?:a(?:n|tic)|enne)|irondacks?)|e(?:ne(?:as|id)|olus|robacter|gean|s(?:chylus|op))|f(?:ri(?:ka(?:ans|ners?)|ca(?:n(?:s|iz(?:ations?|e(?:[ds])?|ing))?)?)|ghan(?:s|istan)?)|g(?:way|a(?:memnon|tha)|ne(?:[ws])|ee|ricola|gies?)|hm(?:adabad|edabad)|i(?:ken|leen|nus?|d(?:a|es)|r(?:bus|e(?:dale|s))|tken)|jax)/
     
    Florian Gross, Aug 12, 2004
    #6
  7. "Florian Gross" <> schrieb im Newsbeitrag
    news:...
    > Robert Klemme wrote:
    >
    > > I once made a script that takes a list of strings and creates a
    > > single regexp from them that is supposedly matched fast (i.e. doesn't

    need
    > > backtracking). If you're interested I can check at work tomorrow -

    I'm
    > > quite sure I still have it somewhere (please email as reminder).

    >
    > Is this a trie optimization?


    Yep.

    > I've written something similar for
    > Regexp::English but the optimization itself is quite slow and takes up
    > lots of resources. (Because it needs to build the trie.)


    Well, it was ok in my case because the word list didn't change and I put
    the generated regexp into code. Although I'm not sure that it was really
    that slow. Lemmesee... IMHO Theoretically it should be around O(n*m)
    with n the number of words and m the average word length.

    > It can do this:
    >
    > irb(main):006:0> re
    > => /foobar|foobark|foobard|foobarqux|food|fool|foolish/
    > irb(main):007:0> re.optimize
    > => /foo(?:l(?:ish)?|bar(?:qux|[kd])?|d)/


    Exactly. Only that my solution took a set of words as input.

    > It was able to create an optimized Regexp that matches the first few
    > hundred English words from a words file -- I've attached it. :)


    Beautiful!

    Kind regards

    robert
     
    Robert Klemme, Aug 13, 2004
    #7
  8. Kirk Haines

    Kirk Haines Guest

    On Fri, 13 Aug 2004 17:36:11 +0900, Robert Klemme wrote

    > Yep.
    >
    > > I've written something similar for
    > > Regexp::English but the optimization itself is quite slow and takes up
    > > lots of resources. (Because it needs to build the trie.)

    >
    > Well, it was ok in my case because the word list didn't change and I
    > put the generated regexp into code. Although I'm not sure that it
    > was really that slow. Lemmesee... IMHO Theoretically it should be
    > around O(n*m) with n the number of words and m the average word length.


    I would love to see your script, Robert. I was flirting in my head with the
    thought that if one took all of the regexes and broken them into shared
    pieces and made a tree out of them, one could walk down the tree of pieces
    until one found the complete regex that made the match. Neat to see my
    brain wasn't totally off base with the idea. So I want to take the regexp
    your script builds and benchmark it against the current state of affairs
    with simple hash based fixed string matching as well as the basic match the
    fixed strings then iterate through the regexes approach and see how overall
    performance shakes out.


    Thanks!

    Kirk Haines
     
    Kirk Haines, Aug 13, 2004
    #8
  9. "Kirk Haines" <> schrieb im Newsbeitrag
    news:...
    > On Fri, 13 Aug 2004 17:36:11 +0900, Robert Klemme wrote
    >
    > > Yep.
    > >
    > > > I've written something similar for
    > > > Regexp::English but the optimization itself is quite slow and takes

    up
    > > > lots of resources. (Because it needs to build the trie.)

    > >
    > > Well, it was ok in my case because the word list didn't change and I
    > > put the generated regexp into code. Although I'm not sure that it
    > > was really that slow. Lemmesee... IMHO Theoretically it should be
    > > around O(n*m) with n the number of words and m the average word

    length.
    >
    > I would love to see your script, Robert. I was flirting in my head with

    the
    > thought that if one took all of the regexes and broken them into shared
    > pieces and made a tree out of them, one could walk down the tree of

    pieces
    > until one found the complete regex that made the match. Neat to see my
    > brain wasn't totally off base with the idea. So I want to take the

    regexp
    > your script builds and benchmark it against the current state of affairs
    > with simple hash based fixed string matching as well as the basic match

    the
    > fixed strings then iterate through the regexes approach and see how

    overall
    > performance shakes out.


    Ha,

    found it! Nothing gets lost on a good sorted and well sized hard disk.
    :))
    (see attachment, one is the script and it needs the tree implementation in
    the other file)

    Have fun!

    Kind regards

    robert
     
    Robert Klemme, Aug 13, 2004
    #9
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. =?ISO-8859-1?Q?Martin_J=F8rgensen?=
    Replies:
    5
    Views:
    1,316
    =?ISO-8859-1?Q?Martin_J=F8rgensen?=
    May 6, 2006
  2. Karthik Gurusamy

    Seeking ideas for a cron implementation

    Karthik Gurusamy, Aug 22, 2008, in forum: Python
    Replies:
    2
    Views:
    373
    Karthik Gurusamy
    Sep 7, 2008
  3. Bubba
    Replies:
    24
    Views:
    675
    Joe keane
    Jan 18, 2012
  4. Replies:
    94
    Views:
    1,308
    Steven D'Aprano
    Sep 4, 2012
  5. Replies:
    17
    Views:
    196
    Serhiy Storchaka
    Sep 11, 2013
Loading...

Share This Page