Regexp

gervaz · Jan 19, 2009

Hi all, I need to find all the addresses in an HTML source page. I'm
using:
'href="(?P<url>http://mysite.com/[^"]+)">(<b>)?(?P<name>[^</a>]+)(</b>)?</a>'
but the [^</a>]+ pattern matches any run of characters other than <,
/, a or >, whereas I only want to stop at the string "</a>". How can I
specify 'do not match the string "blabla"'?

Thanks

MRAB · Jan 19, 2009

If the name is followed by "<" then just match the name with [^<]+:

href="(?P<url>http://mysite\.com/[^"]+)">(<b>)?(?P<name>[^<]+)(</b>)?</a>

I've also changed mysite.com to mysite\.com because . will match any
character, but what you probably want to match is ".".
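
A minimal sketch of the two points above, assuming Python 3 and a made-up sample string (the optional groups are left out to keep it short). It compares the escaped pattern with the unescaped one on a host name that only resembles mysite.com:

```python
import re

# Sample HTML invented for illustration only.
html = (
    '<a href="http://mysite.com/one.html">First</a>\n'
    '<a href="http://mysite.com/two.html">Second link</a>\n'
    '<a href="http://mysiteXcom/three.html">Third</a>\n'
)

# [^<]+ takes the link text up to the next "<"; \. matches a literal dot only.
strict = re.compile(r'href="(?P<url>http://mysite\.com/[^"]+)">(?P<name>[^<]+)</a>')
# Same pattern with an unescaped dot, which matches any character.
loose = re.compile(r'href="(?P<url>http://mysite.com/[^"]+)">(?P<name>[^<]+)</a>')

strict_links = [(m.group("url"), m.group("name")) for m in strict.finditer(html)]
loose_urls = [m.group("url") for m in loose.finditer(html)]

print(strict_links)  # two links, both on mysite.com
print(loose_urls)    # three URLs, including http://mysiteXcom/three.html
```

Note that [^<]+ only works while the link text itself contains no nested tags.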

Diez B. Roggisch · Jan 19, 2009

You should consider using BeautifulSoup or lxml's error-tolerant parser to
work with HTML documents.

Sooner or later your regex-based processing is bound to fail, as documents
get more complicated. Better to use the right tool for the job.

The code should look like this (untested):

from BeautifulSoup import BeautifulSoup

html = """<html><a href="http://mysite.com/foobar/baz">link</a></html>"""

res = []
soup = BeautifulSoup(html)
for tag in soup.findAll("a"):
    # .get() avoids a KeyError for <a> tags that have no href attribute
    if tag.get("href", "").startswith("http://mysite.com"):
        res.append(tag["href"])

Not so hard, and *much* more robust.
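
If installing a third-party package is not an option, the same idea can be sketched with the standard library alone. This is a swapped-in technique, not the BeautifulSoup code above: it assumes Python 3 and uses html.parser.HTMLParser, which is less forgiving of broken markup than BeautifulSoup. The names LinkCollector and collect_links are made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that start with a given prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling this.
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        href = dict(attrs).get("href") or ""
        if href.startswith(self.prefix):
            self.links.append(href)


def collect_links(html, prefix="http://mysite.com/"):
    parser = LinkCollector(prefix)
    parser.feed(html)
    parser.close()
    return parser.links


# Sample markup invented for illustration only.
sample = """<html>
<a href="http://mysite.com/foobar/baz">link</a>
<a name="anchor-without-href">no href</a>
<A HREF='http://mysite.com/other'>single quotes, upper case</A>
<a href="http://elsewhere.example/">external</a>
</html>"""

print(collect_links(sample))
# ['http://mysite.com/foobar/baz', 'http://mysite.com/other']
```

Anchors without an href, single-quoted attributes and upper-case tags are all handled by the parser, which is exactly the kind of variation that trips up a hand-written regex.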

Diez

Peter Otten · Jan 19, 2009

Have you considered BeautifulSoup?

from BeautifulSoup import BeautifulSoup
from urlparse import urlparse

for a in BeautifulSoup(page)("a"):
    try:
        href = a["href"]
    except KeyError:
        pass
    else:
        url = urlparse(href)
        if url.hostname == "mysite.com":
            print href
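
A small sketch of why comparing the parsed hostname can be stricter than a prefix test. It assumes Python 3, where urlparse lives in urllib.parse; the candidate URLs and the helper name is_on_host are made up for illustration.

```python
from urllib.parse import urlparse

# Invented hrefs for illustration only.
candidates = [
    "http://mysite.com/foobar/baz",
    "http://mysite.com.evil.example/phish",
    "http://MYSITE.com:8080/page",
    "/relative/path",
]


def is_on_host(href, host="mysite.com"):
    # hostname is lower-cased and excludes the port; it is None for relative URLs.
    return urlparse(href).hostname == host


prefix_hits = [h for h in candidates if h.startswith("http://mysite.com")]
host_hits = [h for h in candidates if is_on_host(h)]

print(prefix_hits)
# ['http://mysite.com/foobar/baz', 'http://mysite.com.evil.example/phish']
print(host_hits)
# ['http://mysite.com/foobar/baz', 'http://MYSITE.com:8080/page']
```

The prefix test accepts a look-alike host and misses the upper-case one, while the hostname comparison gets both right.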

Peter

Ant · Jan 19, 2009

A 0-width positive lookahead is probably what you want here:

>>> import re
>>> s = """
... hdhd <a href="http://mysite.com/blah.html">Test String OK</a>
...
... """
>>> p = r'href="(http://mysite.com/[^"]+)">(.*)(?=</a>)'
>>> m = re.search(p, s)
>>> m.group(1)
'http://mysite.com/blah.html'
>>> m.group(2)
'Test String OK'

The (?=...) bit is the lookahead, and won't consume any of the string
you are searching. I've binned the named groups for clarity.
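
One caveat worth sketching, assuming Python 3 and a made-up line holding two links: a greedy .* in front of the lookahead runs to the last </a> on the line. A lazy .*? stops at the first one, and a negative lookahead inside the repetition expresses "any character, unless </a> starts here", which is close to what the original question asked for.

```python
import re

# Two links on one line; sample invented for illustration only.
s = ('<a href="http://mysite.com/a.html">One</a> '
     '<a href="http://mysite.com/b.html">Two</a>')

# Greedy .* backtracks from the end of the line, so it stops before the LAST "</a>".
greedy = re.search(r'href="(http://mysite\.com/[^"]+)">(.*)(?=</a>)', s)

# Lazy .*? stops at the FIRST position where the lookahead succeeds.
lazy = re.findall(r'href="(http://mysite\.com/[^"]+)">(.*?)(?=</a>)', s)

# Negative lookahead: consume one character at a time while "</a>" does not start here.
tempered = re.findall(r'href="(http://mysite\.com/[^"]+)">((?:(?!</a>).)+)</a>', s)

print(greedy.group(2))
# One</a> <a href="http://mysite.com/b.html">Two
print(lazy)
# [('http://mysite.com/a.html', 'One'), ('http://mysite.com/b.html', 'Two')]
print(tempered == lazy)
# True

# The lookahead consumed nothing: the text right after the match is still "</a>".
print(s[greedy.end():])
# </a>
```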

The beautiful soup answers are a better bet though - they've already
done the hard work, and after all, you are trying to roll your own
partial HTML parser here, which will struggle with badly formed html...

gervaz · Jan 19, 2009

Ok, thank you all, I'll take a look at beautiful soup, albeit the
lookahead solution fits better for the little I have to do.

Diez B. Roggisch · Jan 19, 2009

Little things tend to get out of hand quickly... This is the reason why so
many gave you the hint.

Diez
