Regex - where do I make a mistake?

Discussion in 'Python' started by Johny, Feb 16, 2007.

  1. Johny

    Johny Guest

    I have
    string="""<span class="test456">55</span>.
    <td><span class="test123">128</span>
    <span class="test789">170</span>
    """

    where I need to replace
    <span class="test456">55</span>.
    <span class="test789">170</span>

    with a single space.
    So I tried

    #############
    import re
    string="""<td><span class="test456">55</span>.<span
    class="test123">128</span><span class="test789">170</span>
    """
    Newstring=re.sub(r'<span class="test(?!123)">.*</span>'," ",string)
    ###########

    But it does NOT work.
    Can anyone explain why?
    Thank you
    L.
    Johny, Feb 16, 2007
    #1

  2. Peter Otten

    Peter Otten Guest

    Johny wrote:

    > Newstring=re.sub(r'<span class="test(?!123)">.*</span>'," ",string)
    >
    > But it does NOT work.
    > Can anyone explain why?


    "(?!123)" is a negative "lookahead assertion", i. e. it ensures that "test"
    is not followed by "123", but /doesn't/ consume any characters. For your
    regex to match, "test" must be /immediately/ followed by a '"'.
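
A minimal sketch of that point, using only the `class="test(?!123)">` fragment of the pattern from the original post. The names `pattern`, `no_match` and `bare_match` are made up for this illustration.

```python
import re

# Fragment of the pattern from the original post.
pattern = re.compile(r'class="test(?!123)">')

# The lookahead succeeds here ("456" is not "123") but consumes nothing,
# so the next pattern character '"' is compared against '4' and fails.
no_match = pattern.search('<span class="test456">')

# The only kind of input this fragment accepts: nothing between "test" and '"'.
bare_match = pattern.search('<span class="test">')

print(no_match)            # None
print(bare_match.group())  # class="test">
```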

    Regular expressions are too low-level to use on HTML directly. Go with
    BeautifulSoup instead of trying to fix the above.
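
To give a rough idea of what a parser-based approach looks like without installing anything, here is a sketch that uses the standard library's `html.parser` as a stand-in for BeautifulSoup. It is not BeautifulSoup's API. The class `SpanBlanker`, the helper `blank_spans` and the `keep_class` parameter are hypothetical names, and the sketch assumes the unwanted spans are exactly those whose class starts with "test" but is not "test123".

```python
from html import escape
from html.parser import HTMLParser


class SpanBlanker(HTMLParser):
    """Re-emit the input, replacing unwanted <span> elements with a space.

    Sketch only: handles start tags, end tags and text; comments, doctypes
    and self-closing syntax are not preserved.
    """

    def __init__(self, keep_class="test123"):
        super().__init__()
        self.keep_class = keep_class
        self.out = []
        self.skip_depth = 0  # > 0 while inside a span that is being dropped

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag == "span":
                self.skip_depth += 1  # track spans nested in a dropped span
            return
        css_class = dict(attrs).get("class") or ""
        if (tag == "span" and css_class.startswith("test")
                and css_class != self.keep_class):
            self.skip_depth = 1
            self.out.append(" ")  # the whole element becomes one space
        else:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag == "span":
                self.skip_depth -= 1
            return
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def blank_spans(html_text):
    parser = SpanBlanker()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


print(repr(blank_spans('<td><span class="test456">55</span>.'
                       '<span class="test123">128</span>'
                       '<span class="test789">170</span>')))
# '<td> .<span class="test123">128</span> '
```

The decision about which spans to drop is made on parsed attributes rather than on raw text, so attribute order or extra whitespace inside the tag should not matter.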

    Peter
    Peter Otten, Feb 16, 2007
    #2

  3. Johny

    Johny Guest

    On Feb 16, 2:14 pm, Peter Otten <> wrote:
    > "(?!123)" is a negative "lookahead assertion", i. e. it ensures that "test"
    > is not followed by "123", but /doesn't/ consume any characters. For your
    > regex to match, "test" must be /immediately/ followed by a '"'.
    >
    > Regular expressions are too low-level to use on HTML directly. Go with
    > BeautifulSoup instead of trying to fix the above.


    Yes, I know "(?!123)" is a negative "lookahead assertion",
    but do not know exactly why it does not work. I thought that

    (?!...)
    Matches if ... doesn't match next. For example, Isaac (?!Asimov) will
    match 'Isaac ' only if it's not followed by 'Asimov'.
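
The documented behaviour quoted above can be checked directly. This is a small sketch; the variable names `m`, `rejected` and `full` are made up for the illustration. Note what the match contains: only 'Isaac ', never the surname, because the lookahead itself consumes nothing.

```python
import re

# The lookahead allows the match, but the match text stops before "Newton".
m = re.search(r'Isaac (?!Asimov)', 'Isaac Newton')
print(repr(m.group()), m.end())   # 'Isaac ' 6

# Followed by "Asimov": the assertion fails, so there is no match at all.
rejected = re.search(r'Isaac (?!Asimov)', 'Isaac Asimov')
print(rejected)                   # None

# To include the surname, something after the lookahead has to consume it.
full = re.search(r'Isaac (?!Asimov)\w+', 'Isaac Newton')
print(full.group())               # Isaac Newton
```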
    Johny, Feb 16, 2007
    #3
  4. Peter Otten

    Peter Otten Guest

    Johny wrote:

    > Yes, I know "(?!123)" is a negative "lookahead assertion",
    > but do not know exactly why it does not work. I thought that
    >
    > (?!...)
    > Matches if ... doesn't match next. For example, Isaac (?!Asimov) will
    > match 'Isaac ' only if it's not followed by 'Asimov'.


    The problem is that your regex does not end with the lookahead assertion and
    there is nothing to consume the '456' or '789'. To illustrate:

    >>> import re
    >>> for example in ["before123after", "before234after", "beforeafter"]:
    ...     re.findall("before(?!123)after", example)
    ...
    []
    []
    ['beforeafter']
    >>> for example in ["before123after", "before234after", "beforeafter"]:
    ...     re.findall(r"before(?!123)\d\d\dafter", example)
    ...
    []
    ['before234after']
    []
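
Carrying the same idea over to the HTML from the original post might look like the sketch below. It departs from the posted pattern in several ways, all of them assumptions: `\d+` consumes the digits after "test", the lookahead is written as `(?!123")` so that only the exact class "test123" is excluded, `\s+` allows the line break inside the `<span` tag, and `.*?` is non-greedy so each match stops at the first `</span>`. The helper name `blank_other_spans` is hypothetical.

```python
import re

# Input as in the original post, including the line break inside "<span".
html = ('<td><span class="test456">55</span>.<span\n'
        'class="test123">128</span><span class="test789">170</span>\n')

# test + digits, except exactly "test123"; \d+ consumes what (?!...) only checks.
SPAN_RE = re.compile(r'<span\s+class="test(?!123")\d+">.*?</span>', re.DOTALL)


def blank_other_spans(text):
    """Replace every matching span (except class test123) with one space."""
    return SPAN_RE.sub(" ", text)


result = blank_other_spans(html)
print(repr(result))
# '<td> .<span\nclass="test123">128</span> \n'
```

This only holds for markup shaped exactly like the sample; nested spans or additional attributes would need a different approach.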

    Peter
    Peter Otten, Feb 16, 2007
    #4
  5. Carsten Haese

    On Fri, 2007-02-16 at 05:34 -0800, Johny wrote:
    > Yes, I know "(?!123)" is a negative "lookahead assertion",
    > but do not know exactly why it does not work.


    It *does* work, it just doesn't do what you think it does.

    The lookahead assertion is a zero-width match that doesn't match any
    actual characters from the subject. It matches an imaginary vertical
    line between two consecutive characters of the subject.

    Nothing in your pattern matches the string of digits that follows
    "test", hence the subject fails to match the pattern.

    Also, please note Peter's advice that regular expressions are almost
    always the wrong tool for working with HTML. They may work in very limited
    cases, and maybe you have such a limited case, but you'd better make
    sure that you'll never ever have to handle anything beyond this limited
    case.

    -Carsten
    Carsten Haese, Feb 16, 2007
    #5