High-performance hyperlink extraction


Adam Monsen

The following script is a high-performance link (<a
href="...">...</a>) extractor. I'm posting to this list in hopes that
anyone interested will offer constructive
criticism/suggestions/comments/etc. Mainly I'm curious what comments
folks have on my regular expressions. Hopefully someone finds this
kind of thing as interesting as I do! :)

My design goals were as follows:
* extract links from text (most likely valid HTML)
* work faster than BeautifulSoup, sgmllib, or other markup parsing
libraries
* return accurate results

The basic idea is to:
1. find anchor ('a') tags within some HTML text that contain 'href'
attributes (I assume these are hyperlinks)
2. extract all attributes from each 'a' tag found as name, value pairs


import re
import urllib

whiteout = re.compile(r'\s+')

# grabs hyperlinks from text
href_re = re.compile(r'''
    <a(?P<attrs>[^>]*        # start of tag
    href=(?P<delim>["'])     # delimiter
    (?P<link>[^"']*)         # link
    (?P=delim)               # delimiter
    [^>]*)>                  # rest of start tag
    (?P<content>.*?)         # link content
    </a>                     # end tag
''', re.VERBOSE | re.IGNORECASE)

# grabs attribute name, value pairs
attrs_re = re.compile(r'''
    (?P<name>\w+)=           # attribute name
    (?P<delim>["'])          # delimiter
    (?P<value>[^"']*)        # attribute value
    (?P=delim)               # delimiter
''', re.VERBOSE)

def getLinks(html_data):
    newdata = whiteout.sub(' ', html_data)
    matches = href_re.finditer(newdata)
    ancs = []
    for match in matches:
        d = match.groupdict()
        a = {}
        a['href'] = d.get('link', None)
        a['content'] = d.get('content', None)
        attr_matches = attrs_re.finditer(d.get('attrs', None))
        for match in attr_matches:
            da = match.groupdict()
            name = da.get('name', None)
            a[name] = da.get('value', None)
        ancs.append(a)
    return ancs

if __name__ == '__main__':
    opener = urllib.FancyURLopener()
    url = 'http://adammonsen.com/tut/libgladeTest.html'
    html_data = opener.open(url).read()
    for a in getLinks(html_data): print a
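For a quick sanity check, the same two expressions can be exercised on a literal snippet (Python 3 here for convenience; the sample HTML is my own, hypothetical):

```python
import re

# Same patterns as the script above, collapsed onto one line each.
href_re = re.compile(
    r'''<a(?P<attrs>[^>]*href=(?P<delim>["'])(?P<link>[^"']*)(?P=delim)[^>]*)>(?P<content>.*?)</a>''',
    re.IGNORECASE)
attrs_re = re.compile(r'''(?P<name>\w+)=(?P<delim>["'])(?P<value>[^"']*)(?P=delim)''')

html = '<p><a class="ext" href="http://example.com/">Example</a></p>'
for m in href_re.finditer(html):
    # Same dict-building logic as getLinks above.
    a = {'href': m.group('link'), 'content': m.group('content')}
    for am in attrs_re.finditer(m.group('attrs')):
        a[am.group('name')] = am.group('value')
    print(a)
# -> {'href': 'http://example.com/', 'content': 'Example', 'class': 'ext'}
```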
 

felipevaldez

Pretty nice. However, you won't capture the increasingly common
JavaScript-driven redirections, like



<b onclick='location.href="http://www.nowhere.com"'>click me</b>

nor

<form action="http://www.yahoo.com">
<input type=submit value="clickme">
</form>

nor

<form action="http://www.yahoo.com" name=x>
<input type=button value="clickme" onclick=document.x.submit()>
</form>

..

I'm guessing it also won't correctly handle things like:

<a href='javascript:alert("...")'>click</a>


But you probably already knew all this stuff, didn't you?


Well, anyway, my 2 cents: instead of parsing the HTML, you could just
scan the text for anything URL-shaped, like

http://XXXX.XXXXXX.XXX/XXX?xXXx=xXx#x

or something like that.
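A minimal sketch of that approach, run against the markup from the examples above. The pattern here is my own naive assumption (scheme, host, then a path of URL-ish characters), nowhere near a full RFC 3986 parser:

```python
import re

# Naive URL matcher: scheme, dotted host, optional path. Illustrative
# only -- it will both over- and under-match on real-world text.
url_re = re.compile(r'''https?://[\w.-]+(?:/[^\s"'<>]*)?''')

sample = '''<b onclick='location.href="http://www.nowhere.com"'>click me</b>
<form action="http://www.yahoo.com"><input type=submit value="clickme"></form>'''

print(url_re.findall(sample))
# -> ['http://www.nowhere.com', 'http://www.yahoo.com']
```

This picks up the JavaScript redirection and the form action that the anchor-tag regex misses, at the cost of knowing nothing about which tag each URL came from.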


//f3l
 

Bryan Olson

Adam said:
> The following script is a high-performance link (<a
> href="...">...</a>) extractor. [...]
> * extract links from text (most likey valid HTML) [...]
> import re
> import urllib
>
> whiteout = re.compile(r'\s+')
>
> # grabs hyperlinks from text
> href_re = re.compile(r'''
>     <a(?P<attrs>[^>]*        # start of tag
>     href=(?P<delim>["'])     # delimiter
>     (?P<link>[^"']*)         # link
>     (?P=delim)               # delimiter
>     [^>]*)>                  # rest of start tag
>     (?P<content>.*?)         # link content
>     </a>                     # end tag
> ''', re.VERBOSE | re.IGNORECASE)

A few notes:

The single- or double-quote delimiters are optional in some cases
(and frequently omitted even when required by the current
standard).

Where blank spaces may appear in HTML entities is not so clear.
To follow the standard, one would have to acquire the SGML
standard, which costs money. Popular browsers allow end tags
such as "</a >", which the RE above will reject.
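To illustrate the unquoted-value point, the attribute pattern can be loosened with an alternation: quoted branch first, then a bare-word branch that runs until whitespace or '>'. This is a sketch of mine (the group names qvalue/uvalue are my additions), not a fix from the thread:

```python
import re

# Variant of attrs_re that also accepts unquoted values (href=foo.html).
# The quoted branch comes first so delimited values win when present.
attrs_re = re.compile(r'''
    (?P<name>\w+)\s*=\s*                                # name, optional spaces
    (?:
        (?P<delim>["'])(?P<qvalue>[^"']*)(?P=delim)     # quoted value
      | (?P<uvalue>[^\s>"']+)                           # unquoted value
    )
''', re.VERBOSE)

tag = '''<a href=foo.html class="nav" title='Home'>'''
for m in attrs_re.finditer(tag):
    print(m.group('name'), m.group('qvalue') or m.group('uvalue'))
```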


I'm not good at reading REs, but it looks like the first line
will greedily match the entire start tag and then backtrack to
find the href attribute. There appear to be many other good
opportunities for a cleverly constructed input to force big-time
backtracking; for example, a '>' will end the start tag, but
within the delimiters it's just another character. Can anyone
show a worst-case run-time?
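To make the backtracking question concrete, here is a small timing sketch (my construction, not from the thread): a start tag that never closes forces the engine to try, and abandon, every href= candidate, which works out to roughly quadratic time, so doubling n should roughly quadruple the measurement:

```python
import re
import time

# The same pattern as in the original post.
href_re = re.compile(r'''
    <a(?P<attrs>[^>]*        # start of tag
    href=(?P<delim>["'])     # delimiter
    (?P<link>[^"']*)         # link
    (?P=delim)               # delimiter
    [^>]*)>                  # rest of start tag
    (?P<content>.*?)         # link content
    </a>                     # end tag
''', re.VERBOSE | re.IGNORECASE)

def time_search(n):
    # An <a tag with n href= attributes and no closing '>': each
    # candidate href= costs O(n) scanning before it fails, O(n^2) total.
    bad = "<a " + "href='x' " * n
    t0 = time.perf_counter()
    assert href_re.search(bad) is None   # never matches, but works hard
    return time.perf_counter() - t0

for n in (500, 1000, 2000):
    print(n, time_search(n))
```

This only demonstrates a quadratic case; whether a worse (exponential) input exists for this particular pattern is left open, as in the thread.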

Writing a Python RE to match all and only legal anchor tags may
not be possible. Writing a regular expression to do so is
definitely not possible.
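That point suggests the non-regex route: the standard library's event-driven parser (html.parser in Python 3, descended from the sgmllib approach Adam benchmarked against) tolerates unquoted attributes and sloppy end tags without any hand-written RE. A sketch, where LinkCollector and the sample markup are my own:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the attribute dict of every <a> tag that has an href."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a' and 'href' in attrs:
            self.links.append(attrs)

parser = LinkCollector()
# Unquoted attributes and a '</a >'-style end tag, both of which the
# regular expression rejects, parse fine here.
parser.feed('<a href=foo.html class=nav>Foo</a > <a name="x">no href</a>')
print(parser.links)   # -> [{'href': 'foo.html', 'class': 'nav'}]
```

It is slower than a single compiled RE, which was Adam's stated reason for avoiding it, but it is far harder to fool.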

[...]
> def getLinks(html_data):
>     newdata = whiteout.sub(' ', html_data)
>     matches = href_re.finditer(newdata)
>     ancs = []
>     for match in matches:
>         d = match.groupdict()
>         a = {}
>         a['href'] = d.get('link', None)

The statement above doesn't seem necessary: the 'href' just gets
overwritten below, as just another attribute.
>         a['content'] = d.get('content', None)
>         attr_matches = attrs_re.finditer(d.get('attrs', None))
>         for match in attr_matches:
>             da = match.groupdict()
>             name = da.get('name', None)
>             a[name] = da.get('value', None)
>         ancs.append(a)
>     return ancs
 
