Regular Expression: Matching substring

K

Kevin CH

Hi,

I'm currently running into a confusion on regex and hopefully you guys
can clear it up for me.

Suppose I have a regular expression (0|(1(01*0)*1))* and two test
strings: 110_1011101_ and _101101_1. (The underscores are not part of
the string. They are added to show that both string has a substring
that matches the pattern.) Applying a match() function on the first
string returns true while false for the second. The difference is the
first one has unmatched chunk in the beginning while the second at the
end. How's the regex rule work here?

Thanks.
 
L

Leon

Hi Kevin,

You may notice that, for matching the regex (0|(1(01*0)*1))*, the left
most
three characters of a string must not be ``101" while not followed by
an `0'.
After reading the first `1', automata expects `1' or ``00" or ``010"
or ``11",
right?:)


Kevin CH 寫é“:
 
K

Kevin CH

Leon said:
Hi Kevin,

You may notice that, for matching the regex (0|(1(01*0)*1))*, the left
most
three characters of a string must not be ``101" while not followed by
an `0'.
After reading the first `1', automata expects `1' or ``00" or ``010"
or ``11",
right?:)
Why it must expect "010"? Why not say "0110", since 1* can represent 0
or more repetitions.
 
L

Leon

You are right. In fact the procedure is as follows:
The substr ``101101" is no problem, if stop here, match will
successful.
But the tailing `1' occurs, so we may imagine the working automata move
to a state, which according to the regexp's outer most `)', and ready
to repeat
the whole regexp again. In this case, the answer is ``yes" only when
there exists
at least two ``1", but only one here.

BTW, the first string is matched exactly, according to your notion, it
should be written as: _11_0_1011101
 
K

Kevin CH

Oh yea, I see what's my confusion now. In the first string, I didn't
consider 11 and 0 matches the pattern without the repetition. Actually
what I did is I entered the pattern and the test strings into kudos (a
python regexp debugger) and got the match groups, which didn't have 11
and 0 as matches, although they do match the pattern "(0|1(01*0)*1)".

Thank you for the help.
 
J

John Machin

Hi,

I'm currently running into a confusion on regex and hopefully you guys
can clear it up for me.

Suppose I have a regular expression (0|(1(01*0)*1))* and two test
strings: 110_1011101_ and _101101_1. (The underscores are not part of
the string. They are added to show that both string has a substring
that matches the pattern.) Applying a match() function on the first
string returns true while false for the second.

Perhaps you are using grep, or you have stumbled on the old deprecated
"regex" module and are using that instead of the "re" module. Perhaps
not as you are using only 2 plain vanilla RE operations which should
work the same way everywhere. Perhaps you are having trouble with
search() versus match() -- if so, read the section on this topic in the
re docs. It's rather hard to tell what you are doing without seeing the
code you are using.
The difference is the
first one has unmatched chunk in the beginning

With re's match(), the whole string matches.
while the second at the
end.

With re's match(), the part you marked with underscores (at the
*beginning*) matches.

How's the regex rule work here?

Let's abbreviate your pattern as (0|X)*
This means 0 or more occurrences of strings that match either 0 or X.

Case 1 gives us 11 matching X [it's a 1 followed by zero occurrences of
(01*0) followed by a 1], then a 0, then 1011101 matching X [it's a 1
foll. by 1 occ. of (01110) followed by a 1].

Case 2 gives us 101101 matching X [it's a 1 foll. by 1 occ of (0110)
foll by a 1] -- then there's a 1 that doesn't match anything.

Here's some code and its output:

C:\junk>type kevinch.py
import re

rx = re.compile(r"(0|(1(01*0)*1))*")

def doit(n, s):
print "Case", n
m = rx.match(s)
if m:
print "0123456789"
print s
for k in range(4):
print "span(%d) -> %r" % (k, m.span(k))
else:
print "... no match"

s1 = "110_1011101_".replace('_', '')
s2 = "_101101_1".replace('_', '')
doit(1, s1)
doit(2, s2)

C:\junk>kevinch.py
Case 1
0123456789
1101011101
span(0) -> (0, 10)
span(1) -> (3, 10)
span(2) -> (3, 10)
span(3) -> (4, 9)
Case 2
0123456789
1011011
span(0) -> (0, 6)
span(1) -> (0, 6)
span(2) -> (0, 6)
span(3) -> (1, 5)

HTH,
John
 
K

Kevin CH

Thank you for your reply.
Perhaps you are using grep, or you have stumbled on the old deprecated
"regex" module and are using that instead of the "re" module. Perhaps
not as you are using only 2 plain vanilla RE operations which should
work the same way everywhere. Perhaps you are having trouble with
search() versus match() -- if so, read the section on this topic in the
re docs. It's rather hard to tell what you are doing without seeing the
code you are using.

Sorry I should have said it up front. I'm using Kudos (which I'm sure
uses re module) to test these strings on the pattern, and had the match
results as I stated. (search() of course gives me true since the
pattern appears in the substrings of both strings.)
 
F

Fredrik Lundh

:

Thank you for your reply.


Sorry I should have said it up front. I'm using Kudos (which I'm sure
uses re module) to test these strings on the pattern, and had the match
results as I stated. (search() of course gives me true since the
pattern appears in the substrings of both strings.)

Python's "match" function doesn't return "true" or "false"; it returns a match
object if the string matches the pattern, and None if not. since your pattern
can match the empty string, it'll match any target string (all strings start with
an empty string), and will never return false.

looks like the "debugger" does a great job of hiding how things really work...

</F>
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,743
Messages
2,569,478
Members
44,899
Latest member
RodneyMcAu

Latest Threads

Top