Regular Expression: Matching substring

Discussion in 'Python' started by Kevin CH, Apr 13, 2006.

  1. Kevin CH

    Kevin CH Guest

    Hi,

    I'm currently running into a confusion on regex and hopefully you guys
    can clear it up for me.

    Suppose I have a regular expression (0|(1(01*0)*1))* and two test
    strings: 110_1011101_ and _101101_1. (The underscores are not part of
    the string. They are added to show that both string has a substring
    that matches the pattern.) Applying a match() function on the first
    string returns true while false for the second. The difference is the
    first one has unmatched chunk in the beginning while the second at the
    end. How's the regex rule work here?

    Thanks.
    Kevin CH, Apr 13, 2006
    #1
    1. Advertising

  2. Kevin CH

    Leon Guest

    Re: Regular Expression: Matching substring

    Hi Kevin,

    You may notice that, for matching the regex (0|(1(01*0)*1))*, the left
    most
    three characters of a string must not be ``101" while not followed by
    an `0'.
    After reading the first `1', automata expects `1' or ``00" or ``010"
    or ``11",
    right?:)


    Kevin CH 寫é“:

    > Hi,
    >
    > I'm currently running into a confusion on regex and hopefully you guys
    > can clear it up for me.
    >
    > Suppose I have a regular expression (0|(1(01*0)*1))* and two test
    > strings: 110_1011101_ and _101101_1. (The underscores are not part of
    > the string. They are added to show that both string has a substring
    > that matches the pattern.) Applying a match() function on the first
    > string returns true while false for the second. The difference is the
    > first one has unmatched chunk in the beginning while the second at the
    > end. How's the regex rule work here?
    >
    > Thanks.
    Leon, Apr 13, 2006
    #2
    1. Advertising

  3. Kevin CH

    Kevin CH Guest

    Re: Regular Expression: Matching substring

    Leon wrote:
    > Hi Kevin,
    >
    > You may notice that, for matching the regex (0|(1(01*0)*1))*, the left
    > most
    > three characters of a string must not be ``101" while not followed by
    > an `0'.
    > After reading the first `1', automata expects `1' or ``00" or ``010"
    > or ``11",
    > right?:)

    Why it must expect "010"? Why not say "0110", since 1* can represent 0
    or more repetitions.


    >
    >
    > Kevin CH 寫é“:
    >
    > > Hi,
    > >
    > > I'm currently running into a confusion on regex and hopefully you guys
    > > can clear it up for me.
    > >
    > > Suppose I have a regular expression (0|(1(01*0)*1))* and two test
    > > strings: 110_1011101_ and _101101_1. (The underscores are not part of
    > > the string. They are added to show that both string has a substring
    > > that matches the pattern.) Applying a match() function on the first
    > > string returns true while false for the second. The difference is the
    > > first one has unmatched chunk in the beginning while the second at the
    > > end. How's the regex rule work here?
    > >
    > > Thanks.
    Kevin CH, Apr 13, 2006
    #3
  4. Kevin CH

    Leon Guest

    Re: Regular Expression: Matching substring

    You are right. In fact the procedure is as follows:
    The substr ``101101" is no problem, if stop here, match will
    successful.
    But the tailing `1' occurs, so we may imagine the working automata move
    to a state, which according to the regexp's outer most `)', and ready
    to repeat
    the whole regexp again. In this case, the answer is ``yes" only when
    there exists
    at least two ``1", but only one here.

    BTW, the first string is matched exactly, according to your notion, it
    should be written as: _11_0_1011101
    Leon, Apr 13, 2006
    #4
  5. Kevin CH

    Kevin CH Guest

    Re: Regular Expression: Matching substring

    Oh yea, I see what's my confusion now. In the first string, I didn't
    consider 11 and 0 matches the pattern without the repetition. Actually
    what I did is I entered the pattern and the test strings into kudos (a
    python regexp debugger) and got the match groups, which didn't have 11
    and 0 as matches, although they do match the pattern "(0|1(01*0)*1)".

    Thank you for the help.


    Leon wrote:
    > You are right. In fact the procedure is as follows:
    > The substr ``101101" is no problem, if stop here, match will
    > successful.
    > But the tailing `1' occurs, so we may imagine the working automata move
    > to a state, which according to the regexp's outer most `)', and ready
    > to repeat
    > the whole regexp again. In this case, the answer is ``yes" only when
    > there exists
    > at least two ``1", but only one here.
    >
    > BTW, the first string is matched exactly, according to your notion, it
    > should be written as: _11_0_1011101
    Kevin CH, Apr 13, 2006
    #5
  6. Kevin CH

    John Machin Guest

    On 13/04/2006 12:33 PM, Kevin CH wrote:
    > Hi,
    >
    > I'm currently running into a confusion on regex and hopefully you guys
    > can clear it up for me.
    >
    > Suppose I have a regular expression (0|(1(01*0)*1))* and two test
    > strings: 110_1011101_ and _101101_1. (The underscores are not part of
    > the string. They are added to show that both string has a substring
    > that matches the pattern.) Applying a match() function on the first
    > string returns true while false for the second.


    Perhaps you are using grep, or you have stumbled on the old deprecated
    "regex" module and are using that instead of the "re" module. Perhaps
    not as you are using only 2 plain vanilla RE operations which should
    work the same way everywhere. Perhaps you are having trouble with
    search() versus match() -- if so, read the section on this topic in the
    re docs. It's rather hard to tell what you are doing without seeing the
    code you are using.

    > The difference is the
    > first one has unmatched chunk in the beginning


    With re's match(), the whole string matches.

    > while the second at the
    > end.


    With re's match(), the part you marked with underscores (at the
    *beginning*) matches.


    > How's the regex rule work here?


    Let's abbreviate your pattern as (0|X)*
    This means 0 or more occurrences of strings that match either 0 or X.

    Case 1 gives us 11 matching X [it's a 1 followed by zero occurrences of
    (01*0) followed by a 1], then a 0, then 1011101 matching X [it's a 1
    foll. by 1 occ. of (01110) followed by a 1].

    Case 2 gives us 101101 matching X [it's a 1 foll. by 1 occ of (0110)
    foll by a 1] -- then there's a 1 that doesn't match anything.

    Here's some code and its output:

    C:\junk>type kevinch.py
    import re

    rx = re.compile(r"(0|(1(01*0)*1))*")

    def doit(n, s):
    print "Case", n
    m = rx.match(s)
    if m:
    print "0123456789"
    print s
    for k in range(4):
    print "span(%d) -> %r" % (k, m.span(k))
    else:
    print "... no match"

    s1 = "110_1011101_".replace('_', '')
    s2 = "_101101_1".replace('_', '')
    doit(1, s1)
    doit(2, s2)

    C:\junk>kevinch.py
    Case 1
    0123456789
    1101011101
    span(0) -> (0, 10)
    span(1) -> (3, 10)
    span(2) -> (3, 10)
    span(3) -> (4, 9)
    Case 2
    0123456789
    1011011
    span(0) -> (0, 6)
    span(1) -> (0, 6)
    span(2) -> (0, 6)
    span(3) -> (1, 5)

    HTH,
    John
    John Machin, Apr 13, 2006
    #6
  7. Kevin CH

    Kevin CH Guest

    Re: Regular Expression: Matching substring

    Thank you for your reply.

    > Perhaps you are using grep, or you have stumbled on the old deprecated
    > "regex" module and are using that instead of the "re" module. Perhaps
    > not as you are using only 2 plain vanilla RE operations which should
    > work the same way everywhere. Perhaps you are having trouble with
    > search() versus match() -- if so, read the section on this topic in the
    > re docs. It's rather hard to tell what you are doing without seeing the
    > code you are using.


    Sorry I should have said it up front. I'm using Kudos (which I'm sure
    uses re module) to test these strings on the pattern, and had the match
    results as I stated. (search() of course gives me true since the
    pattern appears in the substrings of both strings.)
    Kevin CH, Apr 13, 2006
    #7
  8. Re: Regular Expression: Matching substring

    "Kevin CH" wrote:

    news:...
    > Thank you for your reply.
    >
    > > Perhaps you are using grep, or you have stumbled on the old deprecated
    > > "regex" module and are using that instead of the "re" module. Perhaps
    > > not as you are using only 2 plain vanilla RE operations which should
    > > work the same way everywhere. Perhaps you are having trouble with
    > > search() versus match() -- if so, read the section on this topic in the
    > > re docs. It's rather hard to tell what you are doing without seeing the
    > > code you are using.

    >
    > Sorry I should have said it up front. I'm using Kudos (which I'm sure
    > uses re module) to test these strings on the pattern, and had the match
    > results as I stated. (search() of course gives me true since the
    > pattern appears in the substrings of both strings.)


    Python's "match" function doesn't return "true" or "false"; it returns a match
    object if the string matches the pattern, and None if not. since your pattern
    can match the empty string, it'll match any target string (all strings start with
    an empty string), and will never return false.

    looks like the "debugger" does a great job of hiding how things really work...

    </F>
    Fredrik Lundh, Apr 13, 2006
    #8
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Replies:
    8
    Views:
    13,822
    John C. Bollinger
    Nov 10, 2005
  2. =?iso-8859-1?B?bW9vcJk=?=

    Matching abitrary expression in a regular expression

    =?iso-8859-1?B?bW9vcJk=?=, Dec 1, 2005, in forum: Java
    Replies:
    8
    Views:
    831
    Alan Moore
    Dec 2, 2005
  3. Replies:
    4
    Views:
    390
  4. Replies:
    6
    Views:
    760
    Bengt Richter
    Nov 8, 2005
  5. Replies:
    3
    Views:
    189
    Sherm Pendley
    Aug 3, 2005
Loading...

Share This Page