Regexp

gervaz

Hi all, I need to find all the addresses in an HTML source page. I'm
using:
'href="(?P<url>http://mysite.com/[^"]+)">(<b>)?(?P<name>[^</a>]+)(</b>)?</a>'
but the [^</a>]+ pattern matches any run of characters other than <,
/, a or >, whereas I only want to exclude the string "</a>". How can I
specify: 'do not match the string "blabla"'?

Thanks
 
MRAB

If the name is followed by "<" then just match the name with [^<]+:

href="(?P<url>http://mysite\.com/[^"]+)">(<b>)?(?P<name>[^<]+)(</b>)?</a>

I've also changed mysite.com to mysite\.com because an unescaped . will
match any character, whereas what you probably want to match is a
literal ".".
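For readers who want to try this, here is a minimal sketch of the pattern described above using Python 3's re module. The sample HTML string and the variable names are made up for the illustration; only the pattern itself comes from the post.

```python
import re

# The pattern from the reply above: the dot is escaped, and the name is
# matched with [^<]+ (any run of characters other than "<").
LINK = re.compile(
    r'href="(?P<url>http://mysite\.com/[^"]+)">(<b>)?(?P<name>[^<]+)(</b>)?</a>'
)

# Hypothetical sample input, invented for this illustration.
html = (
    '<a href="http://mysite.com/a.html">First</a> '
    '<a href="http://mysite.com/b.html"><b>Second</b></a> '
    '<a href="http://mysiteXcom/c.html">Third</a>'
)

links = [(m.group("url"), m.group("name")) for m in LINK.finditer(html)]
print(links)

# With an unescaped dot, "mysiteXcom" would be accepted as well.
unescaped_hit = re.search(r'http://mysite.com/', 'http://mysiteXcom/c.html')
print(unescaped_hit is not None)
```

The third link is skipped only because of the escaped dot. Note that [^<]+ still stops at the first "<", so a name containing nested tags such as <i> would not match.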
 
Diez B. Roggisch

You should consider using BeautifulSoup or lxml's error-tolerant parser to
work with HTML documents.

Sooner or later your regex-based processing is bound to fail, as documents
get more complicated. Better to use the right tool for the job.

The code should look like this (untested):

from BeautifulSoup import BeautifulSoup
html = """<html><a href="http://mysite.com/foobar/baz">link</a></html>"""

res = []
soup = BeautifulSoup(html)
for tag in soup.findAll("a"):
    if tag["href"].startswith("http://mysite.com"):
        res.append(tag["href"])


Not so hard, and *much* more robust.
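If BeautifulSoup is not installed, the same idea can be sketched with the standard library's html.parser module in Python 3. This is a swapped-in technique, not the code from the post; the class name LinkCollector and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that start with a given prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.startswith(self.prefix):
            self.links.append(href)


# Hypothetical sample input; the second anchor has no href at all.
html = """<html><a href="http://mysite.com/foobar/baz">link</a>
<a name="top">no href</a><a href="http://other.example/">x</a></html>"""

collector = LinkCollector("http://mysite.com/")
collector.feed(html)
print(collector.links)
```

Using dict(attrs).get("href") means anchors without an href attribute are skipped quietly rather than raising an error.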

Diez
 
Peter Otten

Have you considered BeautifulSoup?

from BeautifulSoup import BeautifulSoup
from urlparse import urlparse

for a in BeautifulSoup(page)("a"):
    try:
        href = a["href"]
    except KeyError:
        pass
    else:
        url = urlparse(href)
        if url.hostname == "mysite.com":
            print href
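A short sketch of why comparing the parsed hostname can be safer than a plain prefix test. It assumes Python 3, where urlparse lives in urllib.parse; the helper name is_on_host and the candidate URLs are invented for the illustration.

```python
from urllib.parse import urlparse


def is_on_host(href, host="mysite.com"):
    """Return True only if the URL's hostname is exactly `host`."""
    return urlparse(href).hostname == host


# Hypothetical candidates: a real match, a look-alike host, a relative link.
candidates = [
    "http://mysite.com/page.html",
    "http://mysite.com.evil.example/page.html",
    "/relative/page.html",
]

results = [
    (href, href.startswith("http://mysite.com"), is_on_host(href))
    for href in candidates
]
for href, by_prefix, by_host in results:
    print(href, by_prefix, by_host)
```

The look-alike host passes the prefix test but fails the hostname comparison, and the relative link has no hostname at all.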

Peter
 
Ant

A 0-width positive lookahead is probably what you want here:

>>> import re
>>> s = """
... hdhd <a href="http://mysite.com/blah.html">Test <i>String</i> OK</a>
...
... """
>>> p = r'href="(http://mysite.com/[^"]+)">(.*)(?=</a>)'
>>> m = re.search(p, s)
>>> m.group(1)
'http://mysite.com/blah.html'
>>> m.group(2)
'Test <i>String</i> OK'

The (?=...) bit is the lookahead, and won't consume any of the string
you are searching. I've binned the named groups for clarity.

The beautiful soup answers are a better bet though - they've already
done the hard work, and after all, you are trying to roll your own
partial HTML parser here, which will struggle with badly formed html...
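One caveat worth sketching: with a greedy .* the match runs to the last "</a>" on the line, so two links on one line are merged. A negative lookahead placed before each character, (?:(?!</a>).)+, is one way to say "anything up to, but not including, the string </a>", which is close to what the original question asked for. The sample string and variable names below are hypothetical, and the example assumes Python 3.

```python
import re

# Hypothetical input with two links on the same line.
s = ('<a href="http://mysite.com/a.html">Test <i>String</i> OK</a> and '
     '<a href="http://mysite.com/b.html">Second</a>')

# Greedy version, as in the post (with the dot escaped).
greedy = re.compile(r'href="(http://mysite\.com/[^"]+)">(.*)(?=</a>)')

# Each character is consumed only if "</a>" does not start at that position.
tempered = re.compile(
    r'href="(http://mysite\.com/[^"]+)">((?:(?!</a>).)+)(?=</a>)'
)

greedy_names = [m.group(2) for m in greedy.finditer(s)]
tempered_names = [m.group(2) for m in tempered.finditer(s)]

print(greedy_names)
print(tempered_names)
```

The greedy pattern yields a single name that swallows the second link, while the second pattern yields one name per link and still allows nested tags such as <i>.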
 
gervaz

OK, thank you all. I'll take a look at Beautiful Soup, although the
lookahead solution is a better fit for the little I have to do.
 
Diez B. Roggisch

Little things tend to get out of hand quickly... This is the reason why so
many gave you the hint.

Diez
 
