Regex - where do I make a mistake?

Johny

I have
string="""<span class="test456">55</span>.
<td><span class="test123">128</span>
<span class="test789">170</span>
"""

where I need to replace
<span class="test456">55</span>.
<span class="test789">170</span>

with a space.
So I tried

#############
import re
string="""<td><span class="test456">55</span>.<span
class="test123">128</span><span class="test789">170</span>
"""
Newstring=re.sub(r'<span class="test(?!123)">.*</span>'," ",string)
###########

But it does NOT work.
Can anyone explain why?
Thank you
L.
 
Peter Otten

Johny said:
I have
string="""<span class="test456">55</span>.
<td><span class="test123">128</span>
<span class="test789">170</span>
"""

where I need to replace
<span class="test456">55</span>.
<span class="test789">170</span>

with a space.
So I tried

#############
import re
string="""<td><span class="test456">55</span>.<span
class="test123">128</span><span class="test789">170</span>
"""
Newstring=re.sub(r'<span class="test(?!123)">.*</span>'," ",string)
###########

But it does NOT work.
Can anyone explain why?

"(?!123)" is a negative "lookahead assertion", i.e. it ensures that "test"
is not followed by "123", but /doesn't/ consume any characters. For your
regex to match, "test" must be /immediately/ followed by a '"'.
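To make that concrete, here is a minimal sketch of one way the pattern could be repaired, assuming the class names always end in digits and the spans are never nested. The names `html`, `original` and `fixed` are made up for this illustration; the markup is the sample from the question joined onto one line.

```python
import re

# Sample markup from the question, joined onto one line.
html = (
    '<td><span class="test456">55</span>.'
    '<span class="test123">128</span>'
    '<span class="test789">170</span>'
)

# Original pattern: the lookahead succeeds after "test", but the next
# pattern character is '"' while the next input character is a digit.
original = r'<span class="test(?!123)">.*</span>'

# Possible fix (an assumption about the intent): \d+ consumes the digits,
# and the non-greedy .*? stops at the first closing tag.
fixed = r'<span class="test(?!123)\d+">.*?</span>'

unchanged = re.sub(original, " ", html)
replaced = re.sub(fixed, " ", html)
print(replaced)
```

With this input the original pattern leaves the string untouched, while the second one replaces the test456 and test789 spans and keeps test123. Note that `(?!123)` also rejects a class such as test1234; writing `(?!123")` would exclude only the exact value.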

Regular expressions are too low-level to use on HTML directly. Go with
BeautifulSoup instead of trying to fix the above.

Peter
 
Johny

"(?!123)" is a negative "lookahead assertion", i.e. it ensures that "test"
is not followed by "123", but /doesn't/ consume any characters. For your
regex to match, "test" must be /immediately/ followed by a '"'.

Regular expressions are too low-level to use on HTML directly. Go with
BeautifulSoup instead of trying to fix the above.

Peter

Yes, I know "(?!123)" is a negative "lookahead assertion",
but I do not know exactly why it does not work. I thought that

(?!...)
Matches if ... doesn't match next. For example, Isaac (?!Asimov) will
match 'Isaac ' only if it's not followed by 'Asimov'.
 
Peter Otten

Johny said:
Yes, I know "(?!123)" is a negative "lookahead assertion",
but I do not know exactly why it does not work. I thought that

(?!...)
Matches if ... doesn't match next. For example, Isaac (?!Asimov) will
match 'Isaac ' only if it's not followed by 'Asimov'.

The problem is that your regex does not end with the lookahead assertion and
there is nothing to consume the '456' or '789'. To illustrate:
>>> import re
>>> for example in ["before123after", "before234after", "beforeafter"]:
...     re.findall("before(?!123)after", example)
...
[]
[]
['beforeafter']
>>> for example in ["before123after", "before234after", "beforeafter"]:
...     re.findall(r"before(?!123)\d\d\dafter", example)
...
[]
['before234after']
[]

Peter
 
Carsten Haese

Yes, I know "(?!123)" is a negative "lookahead assertion",
but I do not know exactly why it does not work.

It *does* work, it just doesn't do what you think it does.

The lookahead assertion is a zero-width match that doesn't match any
actual characters from the subject. It matches an imaginary vertical
line between two consecutive characters of the subject.

Nothing in your pattern matches the string of digits that follows
"test", hence the subject fails to match the pattern.
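A small sketch of the zero-width behaviour, using a shortened subject string (`test456"`) that is assumed here just for illustration. It shows where the match ends after the assertion and which character the next pattern element would have to match.

```python
import re

subject = 'test456"'

# Match only "test" plus the lookahead against one of the class values.
m = re.search(r'test(?!123)', subject)

matched_text = m.group()          # the assertion adds nothing to the match
end_position = m.end()            # the next pattern element must match here
next_char = subject[end_position]

# A lookahead on its own matches the empty string at a position.
empty_span = re.search(r'(?!123)', '456').span()

print(matched_text, end_position, next_char, empty_span)
```

The match is just 'test' and ends at index 4, where the subject has the digit '4'. A pattern that continues with '"' at that point therefore cannot match.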

Also, please note Peter's advice that Regular Expressions are almost
always the wrong tool for working with HTML. It may work in very limited
cases, and maybe you have such a limited case, but you'd better make
sure that you'll never ever have to handle anything beyond this limited
case.
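For comparison, a rough sketch of a parser-based approach. It uses the standard library's `html.parser` rather than the BeautifulSoup that Peter recommended, purely to keep the example free of third-party dependencies. The class `SpanFilter` and the function `replace_spans` are hypothetical names, and the sketch does not try to preserve entities, comments or self-closing tags.

```python
from html.parser import HTMLParser


class SpanFilter(HTMLParser):
    """Copy markup through, replacing unwanted <span> elements with a space."""

    def __init__(self, keep_class):
        super().__init__()
        self.keep_class = keep_class
        self.out = []
        self.skip_depth = 0  # > 0 while inside a span that is being dropped

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag == "span":
                self.skip_depth += 1  # track nested spans inside a dropped one
            return
        if tag == "span" and dict(attrs).get("class") != self.keep_class:
            self.skip_depth = 1
            self.out.append(" ")  # the whole element becomes one space
            return
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag == "span":
                self.skip_depth -= 1
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)


def replace_spans(html, keep_class="test123"):
    parser = SpanFilter(keep_class)
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.out)


html = (
    '<td><span class="test456">55</span>.'
    '<span class="test123">128</span>'
    '<span class="test789">170</span>'
)
result = replace_spans(html)
print(result)
```

Unlike the regex, this does not depend on the exact attribute layout or on the spans being unnested, which is roughly the point of the advice above.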

-Carsten
 
