greedy match wanted

A

alexk

Hi,

I would like to request your help.

My problem is as follows. I want to match urls, and therefore I have a
group
of long valid domain names in my regex:

..... (?:com|org|net|biz|info|ac|cc|gs|ms|
sh|st|tc|tf|tj|to|vg|ad|ae|af|ag|
com\.ag|ai|off\.ai|al|an|ao|aq|
com\.ar|net\.ar|org\.ar|as|at|co\.at| ... ) ...

However, for a url like kuku.com.to it matches the kuku.com part,
while I want it to match the whole kuku.com.to. Notice that both "com"
and "com.to" are present in the group above.

1. How do I give precedence for "com.to" over "com" in the above group
?
Maybe I can somehow sort it by lexicographic order and then by length,
or divide it to a set of sub-groups by length ?

Thanks for any help,
Alex.
 
K

Kent Johnson

alexk said:
My problem is as follows. I want to match urls, and therefore I have a
group
of long valid domain names in my regex:

.... (?:com|org|net|biz|info|ac|cc|gs|ms|
sh|st|tc|tf|tj|to|vg|ad|ae|af|ag|
com\.ag|ai|off\.ai|al|an|ao|aq|
com\.ar|net\.ar|org\.ar|as|at|co\.at| ... ) ...

However, for a url like kuku.com.to it matches the kuku.com part,
while I want it to match the whole kuku.com.to. Notice that both "com"
and "com.to" are present in the group above.

1. How do I give precedence for "com.to" over "com" in the above group
?
Maybe I can somehow sort it by lexicographic order and then by length,
or divide it to a set of sub-groups by length ?

According to the docs for re:
"As the target string is scanned, REs separated by "|" are tried from left to right. When one
pattern completely matches, that branch is accepted. This means that once A matches, B will not be
tested further, even if it would produce a longer overall match. In other words, the "|" operator is
never greedy."

So putting "com.to" before "com" does what you want.
'com.to'

Kent
 
P

Paul McGuire

This is very similar to some common problems in developing pyparsing
grammars. One of my early attempts at parsing CORBA IDL listed the
valid parameter passing mechanisms as ('in' | 'out' | 'inout'). As you
can imagine, I never successfully matched an 'inout', since the 'in'
match came first. (Pyparsing offers an alternative greedy alternation
operator '^', which tests all alternatives and then selects the longest
match - however, '^' can have performance issues.) Of course, the
simplest recourse is just to change the definition to ('inout' | 'in' |
'out'), but I wanted to provide a more general mechanism (since I was
getting e-mails from pyparsing users also having this same problem).
The simplest approach is to just sort the list by order, longest to
shortest, but this removes any option the developer may want to try to
front-load the list with options that are expected to occur more
frequently (for example, if defining a list of potential relational
operators, forcing ('=' | '<' | '>' | '<=' | '>=' | '!=') to become
('<=' | '>='| '!=' | '=' | '<' | '>') may cause unnecessary tests of
infrequently used operators).

I ended up adding the oneOf helper to pyparsing, which takes a single
string of space-delimited literals, and then returns a list of literals
separated by '|' operators, with literals minimally reordered, that is,
only reorderded if a longer literal masks some shorter literal. Also,
accidental duplicates are stripped. Here is the implementation of oneOf
('|' operators generate a MatchFirst object):

..def oneOf( strs, caseless=False ):
.. """Helper to quickly define a set of alternative Literals, and
makes sure to do
.. longest-first testing when there is a conflict, regardless of
the input order,
.. but returns a MatchFirst for best performance.
.. """
.. if caseless:
.. isequal = ( lambda a,b: a.upper() == b.upper() )
.. parseElementClass = CaselessLiteral
.. else:
.. isequal = ( lambda a,b: a == b )
.. parseElementClass = Literal
..
.. symbols = strs.split()
.. i = 0
.. while i < len(symbols)-1:
.. cur = symbols
.. for j,other in enumerate(symbols[i+1:]):
.. if ( isequal(other, cur) ):
.. del symbols[i+j+1]
.. break
.. elif ( isequal(other[:len(cur)],cur) ):
.. del symbols[i+j+1]
.. symbols.insert(i,other)
.. cur = other
.. break
.. else:
.. i += 1
..
.. return MatchFirst( [ parseElementClass(sym) for sym in symbols ] )
..

You could pre-process your regex with something similar, so you can
avoid having to do this (error-prone) checking yourself.

-- Paul
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,580
Members
45,054
Latest member
TrimKetoBoost

Latest Threads

Top