greedy match wanted

Discussion in 'Python' started by alexk, Mar 3, 2005.

  1. alexk

    alexk Guest

    Hi,

    I would like to request your help.

    My problem is as follows. I want to match urls, and therefore I have a
    group
    of long valid domain names in my regex:

    ..... (?:com|org|net|biz|info|ac|cc|gs|ms|
    sh|st|tc|tf|tj|to|vg|ad|ae|af|ag|
    com\.ag|ai|off\.ai|al|an|ao|aq|
    com\.ar|net\.ar|org\.ar|as|at|co\.at| ... ) ...

    However, for a url like kuku.com.to it matches the kuku.com part,
    while I want it to match the whole kuku.com.to. Notice that both "com"
    and "com.to" are present in the group above.

    1. How do I give precedence for "com.to" over "com" in the above group
    ?
    Maybe I can somehow sort it by lexicographic order and then by length,
    or divide it to a set of sub-groups by length ?

    Thanks for any help,
    Alex.
     
    alexk, Mar 3, 2005
    #1
    1. Advertising

  2. alexk

    Kent Johnson Guest

    alexk wrote:
    > My problem is as follows. I want to match urls, and therefore I have a
    > group
    > of long valid domain names in my regex:
    >
    > .... (?:com|org|net|biz|info|ac|cc|gs|ms|
    > sh|st|tc|tf|tj|to|vg|ad|ae|af|ag|
    > com\.ag|ai|off\.ai|al|an|ao|aq|
    > com\.ar|net\.ar|org\.ar|as|at|co\.at| ... ) ...
    >
    > However, for a url like kuku.com.to it matches the kuku.com part,
    > while I want it to match the whole kuku.com.to. Notice that both "com"
    > and "com.to" are present in the group above.
    >
    > 1. How do I give precedence for "com.to" over "com" in the above group
    > ?
    > Maybe I can somehow sort it by lexicographic order and then by length,
    > or divide it to a set of sub-groups by length ?


    According to the docs for re:
    "As the target string is scanned, REs separated by "|" are tried from left to right. When one
    pattern completely matches, that branch is accepted. This means that once A matches, B will not be
    tested further, even if it would produce a longer overall match. In other words, the "|" operator is
    never greedy."

    So putting "com.to" before "com" does what you want.

    >>> import re
    >>> re.search(r'com|com\.to', 'kuku.com.to').group()

    'com'
    >>> re.search(r'com\.to|com', 'kuku.com.to').group()

    'com.to'

    Kent
     
    Kent Johnson, Mar 3, 2005
    #2
    1. Advertising

  3. alexk

    Paul McGuire Guest

    This is very similar to some common problems in developing pyparsing
    grammars. One of my early attempts at parsing CORBA IDL listed the
    valid parameter passing mechanisms as ('in' | 'out' | 'inout'). As you
    can imagine, I never successfully matched an 'inout', since the 'in'
    match came first. (Pyparsing offers an alternative greedy alternation
    operator '^', which tests all alternatives and then selects the longest
    match - however, '^' can have performance issues.) Of course, the
    simplest recourse is just to change the definition to ('inout' | 'in' |
    'out'), but I wanted to provide a more general mechanism (since I was
    getting e-mails from pyparsing users also having this same problem).
    The simplest approach is to just sort the list by order, longest to
    shortest, but this removes any option the developer may want to try to
    front-load the list with options that are expected to occur more
    frequently (for example, if defining a list of potential relational
    operators, forcing ('=' | '<' | '>' | '<=' | '>=' | '!=') to become
    ('<=' | '>='| '!=' | '=' | '<' | '>') may cause unnecessary tests of
    infrequently used operators).

    I ended up adding the oneOf helper to pyparsing, which takes a single
    string of space-delimited literals, and then returns a list of literals
    separated by '|' operators, with literals minimally reordered, that is,
    only reorderded if a longer literal masks some shorter literal. Also,
    accidental duplicates are stripped. Here is the implementation of oneOf
    ('|' operators generate a MatchFirst object):

    ..def oneOf( strs, caseless=False ):
    .. """Helper to quickly define a set of alternative Literals, and
    makes sure to do
    .. longest-first testing when there is a conflict, regardless of
    the input order,
    .. but returns a MatchFirst for best performance.
    .. """
    .. if caseless:
    .. isequal = ( lambda a,b: a.upper() == b.upper() )
    .. parseElementClass = CaselessLiteral
    .. else:
    .. isequal = ( lambda a,b: a == b )
    .. parseElementClass = Literal
    ..
    .. symbols = strs.split()
    .. i = 0
    .. while i < len(symbols)-1:
    .. cur = symbols
    .. for j,other in enumerate(symbols[i+1:]):
    .. if ( isequal(other, cur) ):
    .. del symbols[i+j+1]
    .. break
    .. elif ( isequal(other[:len(cur)],cur) ):
    .. del symbols[i+j+1]
    .. symbols.insert(i,other)
    .. cur = other
    .. break
    .. else:
    .. i += 1
    ..
    .. return MatchFirst( [ parseElementClass(sym) for sym in symbols ] )
    ..

    You could pre-process your regex with something similar, so you can
    avoid having to do this (error-prone) checking yourself.

    -- Paul
     
    Paul McGuire, Mar 3, 2005
    #3
  4. alexk

    alexk Guest

    Thanks, I'll try your solution.
    Alex.
     
    alexk, Mar 4, 2005
    #4
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Harvey
    Replies:
    0
    Views:
    792
    Harvey
    Jul 16, 2004
  2. Harvey
    Replies:
    1
    Views:
    900
    Daniel
    Jul 16, 2004
  3. EXI-Andrews, Jack

    re.match -- not greedy?

    EXI-Andrews, Jack, Nov 19, 2006, in forum: Python
    Replies:
    6
    Views:
    495
  4. Dan Kelly

    Greedy and non greedy quantifiers

    Dan Kelly, Jan 17, 2008, in forum: Ruby
    Replies:
    4
    Views:
    170
    Robert Klemme
    Jan 19, 2008
  5. Matt Garrish

    greedy v. non-greedy matching

    Matt Garrish, Feb 16, 2004, in forum: Perl Misc
    Replies:
    4
    Views:
    175
    Matt Garrish
    Feb 16, 2004
Loading...

Share This Page