regex (?!..) problem

  • Thread starter Wolfgang Rohdewald
  • Start date
W

Wolfgang Rohdewald

Hi,

I want to match a string only if a word (C1 in this example) appears
at most once in it. This is what I tried:
('C1',)

but this should not have matched. Why is the .*? behaving greedy
if followed by (?!.*C1)? I would have expected that re first
evaluates (.*?C1) before proceeding at all.

I also tried:
C1C2C3C4').groups()
('C1b1b1b1 b3b3b3b3 C1',)

with the same problem.

How could this be done?
 
C

Carl Banks

Hi,

I want to match a string only if a word (C1 in this example) appears
at most once in it. This is what I tried:


('C1b1b1b1 b3b3b3b3 C1', '')>>> re.match(r'(.*?C1)','C1b1b1b1 b3b3b3b3 C1C2C3').groups()

('C1',)

but this should not have matched. Why is the .*? behaving greedy
if followed by (?!.*C1)?

It's not.
I would have expected that re first
evaluates (.*?C1) before proceeding at all.

It does.

What you're not realizing is that if a regexp search comes to a dead
end, it won't simply return "no match". Instead it'll throw away part
of the match, and backtrack to a previously-matched variable-length
subexpression, such as ".*?", and try again with a different length.

That's what happened above. At first the group "(.*?C1)" non-greedily
matched the substring "C1", but it couldn't find a match under those
circumstances, so it backtracked to the ".*?". and looked a longer
match, which it found.

Here's something to keep in mind: except for a few corner cases,
greedy versus non-greedy will not affect the substring matched, it'll
only affect the groups.

I also tried:


C1C2C3C4').groups()
('C1b1b1b1 b3b3b3b3 C1',)

with the same problem.

How could this be done?

Can't be done with regexps.

How you would do this kind of depends on your overall goals, but your
first look should be toward the string methods. If you share details
with us we can help you choose a better strategy.


Carl Banks
 
W

Wolfgang Rohdewald

What you're not realizing is that if a regexp search comes to a
dead end, it won't simply return "no match". Instead it'll throw
away part of the match, and backtrack to a previously-matched
variable-length subexpression, such as ".*?", and try again with a
different length.

well, that explains it. This is contrary to what the documentation
says, though. Should I fill a bug report?
http://docs.python.org/library/re.html

Now back to my original problem: Would you have any idea how
to solve it?

count() is no solution in my case, I need re.search to either
return None or a match.
 
S

Stefan Behnel

Wolfgang said:
I want to match a string only if a word (C1 in this example) appears
at most once in it.

def match(s):
if s.count("C1") > 1:
return None
return s

If this doesn't fit your requirements, you may want to provide some more
details.

Stefan
 
W

Wolfgang Rohdewald

def match(s):
if s.count("C1") > 1:
return None
return s

If this doesn't fit your requirements, you may want to provide some
more details.

Well - the details are simple and already given: I need re.search
to either return None or a match. But I will try to state it
differently:

I have a string representing the results for a player of a board
game (Mah Jongg - not the solitaire but the real one, played by
4 players), and I have a list of scoring rules. Those rules
can be modified by the user, he can also add new rules. Mah Jongg
is played with very different rulesets worldwide.

The rules are written as regular expressions. Since what they
do varies greatly I do not want do treat some of them in a special
way. That would theoretically be possible but it would really
complificate things.

For each rule I simply need to check whether it applies or not.
I do that by calling re.search(rule, gamestring) and by checking
the result against None.

Here you can look at all rules I currently have.
http://websvn.kde.org/trunk/playground/games/kmj/src/predefined.py?view=markup
The rule I want to rewrite is called "Robbing the Kong". Of
course it is more complicated than my example with C1.

Here you can find the documentation for the gamestring:
http://websvn.kde.org/trunk/playground/games/doc/kmj/index.docbook?revision=1030476&view=markup
(get HTML files with "meinproc index.docbook")
 
C

Carl Banks

well, that explains it. This is contrary to what the documentation
says, though. Should I fill a bug report?http://docs.python.org/library/re.html

If you're referring to the section where it explains greedy
qualifiers, it is not wrong per se. re.match does exactly what the
documentation says: it matches as few characters as possible to the
non-greedy pattern.

However, since it's easy to misconstrue that if you don't know about
regexp backtracking, perhaps a little mention of backtracking is is
warranted. IMO it's not a documentation bug, so if you want to file a
bug report I'd recommend filing as a wishlist item.

I will mention that my followup contained an error (which you didn't
quote). I said greedy versus non-greedy doesn't affect the substring
matched. That was wrong, it does affect the substring matched; what
it doesn't affect is whether there is a match found.

Now back to my original problem: Would you have any idea how
to solve it?

count() is no solution in my case, I need re.search to either
return None or a match.

Why do you have to use a regexp at all?

In Python we recommend using string operations and methods whenever
reasonable, and avoiding regexps unless you specifically need their
extra power. String operations can easily do the examples you posted,
so I see no reason to use regexps.

Depending on what you want to do with the result, one of the following
functions should be close to what you need. (I am using "word" to
refer to the string being matched against, "token" to be the thing you
don't want to appear more than once.)


def token_appears_once(word,token):
return word.count(token) == 1

def parts(word,token):
head,sep,tail = word.partition("C1")
if sep == "" or "C1" in tail:
return None
return (head,sep,tail)


If you really need a match object, you should do a search, and then
call the .count method on the matched substring to see if there is
more than one occurrence, like this:

def match_only_if_token_appears_once(pattern,wotd,token):
m = re.search(pattern,word)
if m.group(0).count("C1") != 1:
m = None
return m


Carl Banks
 
W

Wolfgang Rohdewald

Why do you have to use a regexp at all?

not one but many with arbitrary content.

please read my answer to Stefan. Take a look
at the regexes I am using:
http://websvn.kde.org/trunk/playground/games/kmj/src/predefined.py?view=markup

moreover they are saved in a data base. I would
not want to do that with python code.

Actually I already did that - I defined Python classes
which made it possible to write a rule as (example)

self.mjRules.append(Rule('Fourfold Plenty', 'PKong()*4 + Pair()',
limits=1))

see
http://websvn.kde.org/trunk/playground/games/kmj/src/rulesets.py?view=markup&pathrev=998607

this was then executed with eval. But eval seems unsafe
and some rules were simply not writable that way without
specialized hard wired python code. That is not what I
envision. Also regex turned out to have about the same
speed IIRC
 
H

Hans Mulder

Stefan said:
def match(s):
if s.count("C1") > 1:
return None
return s

If this doesn't fit your requirements, you may want to provide some more
details.

That will return a false value if s is the empty string.

How about:

def match(s):
return s.count("C1") <= 1

-- HansM
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,776
Messages
2,569,603
Members
45,197
Latest member
Sean29G025

Latest Threads

Top