greedy match wanted

alexk · Mar 3, 2005

Hi,

I would like to request your help.

My problem is as follows. I want to match urls, and therefore I have a
group
of long valid domain names in my regex:

..... (?:com|org|net|biz|info|ac|cc|gs|ms|
sh|st|tc|tf|tj|to|vg|ad|ae|af|ag|
com\.ag|ai|off\.ai|al|an|ao|aq|
com\.ar|net\.ar|org\.ar|as|at|co\.at| ... ) ...

However, for a url like kuku.com.to it matches the kuku.com part,
while I want it to match the whole kuku.com.to. Notice that both "com"
and "com.to" are present in the group above.

1. How do I give precedence for "com.to" over "com" in the above group
?
Maybe I can somehow sort it by lexicographic order and then by length,
or divide it to a set of sub-groups by length ?

Thanks for any help,
Alex.

Kent Johnson · Mar 3, 2005

alexk said:
My problem is as follows. I want to match urls, and therefore I have a
group
of long valid domain names in my regex:

.... (?:com|org|net|biz|info|ac|cc|gs|ms|
sh|st|tc|tf|tj|to|vg|ad|ae|af|ag|
com\.ag|ai|off\.ai|al|an|ao|aq|
com\.ar|net\.ar|org\.ar|as|at|co\.at| ... ) ...

However, for a url like kuku.com.to it matches the kuku.com part,
while I want it to match the whole kuku.com.to. Notice that both "com"
and "com.to" are present in the group above.

1. How do I give precedence for "com.to" over "com" in the above group
?
Maybe I can somehow sort it by lexicographic order and then by length,
or divide it to a set of sub-groups by length ?

According to the docs for re:
"As the target string is scanned, REs separated by "|" are tried from left to right. When one
pattern completely matches, that branch is accepted. This means that once A matches, B will not be
tested further, even if it would produce a longer overall match. In other words, the "|" operator is
never greedy."

So putting "com.to" before "com" does what you want.
'com.to'

Kent

Paul McGuire · Mar 3, 2005

This is very similar to some common problems in developing pyparsing
grammars. One of my early attempts at parsing CORBA IDL listed the
valid parameter passing mechanisms as ('in' | 'out' | 'inout'). As you
can imagine, I never successfully matched an 'inout', since the 'in'
match came first. (Pyparsing offers an alternative greedy alternation
operator '^', which tests all alternatives and then selects the longest
match - however, '^' can have performance issues.) Of course, the
simplest recourse is just to change the definition to ('inout' | 'in' |
'out'), but I wanted to provide a more general mechanism (since I was
getting e-mails from pyparsing users also having this same problem).
The simplest approach is to just sort the list by order, longest to
shortest, but this removes any option the developer may want to try to
front-load the list with options that are expected to occur more
frequently (for example, if defining a list of potential relational
operators, forcing ('=' | '<' | '>' | '<=' | '>=' | '!=') to become
('<=' | '>='| '!=' | '=' | '<' | '>') may cause unnecessary tests of
infrequently used operators).

I ended up adding the oneOf helper to pyparsing, which takes a single
string of space-delimited literals, and then returns a list of literals
separated by '|' operators, with literals minimally reordered, that is,
only reorderded if a longer literal masks some shorter literal. Also,
accidental duplicates are stripped. Here is the implementation of oneOf
('|' operators generate a MatchFirst object):

..def oneOf( strs, caseless=False ):
.. """Helper to quickly define a set of alternative Literals, and
makes sure to do
.. longest-first testing when there is a conflict, regardless of
the input order,
.. but returns a MatchFirst for best performance.
.. """
.. if caseless:
.. isequal = ( lambda a,b: a.upper() == b.upper() )
.. parseElementClass = CaselessLiteral
.. else:
.. isequal = ( lambda a,b: a == b )
.. parseElementClass = Literal
..
.. symbols = strs.split()
.. i = 0
.. while i < len(symbols)-1:
.. cur = symbols
.. for j,other in enumerate(symbols[i+1:]):
.. if ( isequal(other, cur) ):
.. del symbols[i+j+1]
.. break
.. elif ( isequal(other[:len(cur)],cur) ):
.. del symbols[i+j+1]
.. symbols.insert(i,other)
.. cur = other
.. break
.. else:
.. i += 1
..
.. return MatchFirst( [ parseElementClass(sym) for sym in symbols ] )
..

You could pre-process your regex with something similar, so you can
avoid having to do this (error-prone) checking yourself.

-- Paul

alexk · Mar 4, 2005

Thanks, I'll try your solution.
Alex.

action_page.php form	2	Oct 25, 2020
About a regular expression	5	Nov 26, 2007
Validate URL script help	3	May 31, 2004
Documentation? file/open with tea	1	Sep 19, 2004
Possible bug with Range#include?	6	Jun 9, 2007
Eclipse bug: Doesn't check for uninitialized variables when there are "too many" variable declaratio	11	Apr 18, 2006
Coding Style	60	May 12, 2004
Code Performance Issue	3	Feb 13, 2004

greedy match wanted

alexk

Kent Johnson

Paul McGuire

alexk

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads