webspider, regexp not working, why?

notnorwegian · May 23, 2008

url = re.compile(r"^((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+@)?([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((/?\w+/)+|/?)(\w+\.[\w]{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?")

Why isn't this regexp catching something like:

<link rel="alternate" type="application/rss+xml" title="Python Screencasts" href="http://www.showmedo.com/latestVideoFeed/rss2.0?tag=python" />

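import re
import urllib

site = urllib.urlopen("http://www.python.org")
# Scan the page line by line and print the first match found on each line.
for row in site:
    obj = url.search(row)
    if obj is not None:
        print "url: ", obj.group()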

I know it works because it can catch www.hello.com in a text file, and I can catch e-mail addresses on websites with another regexp.

search and match yield the same results.

But when you put something like href= in front of it, it doesn't work.

I see now that it has to match at the beginning of the row or something, because:
hi www.google.com
doesn't match, but
www.google.com hi
matches.

I thought a regexp would search a row/file and report an occurrence when it finds one, so a regexp of "lo" would match in lopez.
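
A small sketch of the behaviour described above, written for Python 3 (the thread's own code is Python 2). The sample strings are the ones from the post; the short `www` patterns are simplified stand-ins, not the full regexp. `re.search` does scan the whole string, but a leading `^` pins the match to the start:

```python
import re

# Unanchored: search() scans the whole string for the first occurrence.
print(bool(re.search(r"lo", "lopez")))        # True
print(bool(re.search(r"lo", "hello world")))  # True

# Anchored with ^: the pattern may only match at the start of the string,
# so search() behaves like match() here.
anchored = re.compile(r"^www\.\w+\.com")
print(bool(anchored.search("www.google.com hi")))  # True
print(bool(anchored.search("hi www.google.com")))  # False

# Dropping the ^ lets search() find the URL anywhere in the row.
unanchored = re.compile(r"www\.\w+\.com")
print(unanchored.search("hi www.google.com").group())  # www.google.com
```

This is also why search and match give the same results with the original pattern: once the regexp starts with `^`, both can only succeed at the beginning of the row.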

Reedick, Andrew · May 23, 2008

-----Original Message-----
From: (e-mail address removed) On Behalf Of (e-mail address removed)
Sent: Friday, May 23, 2008 12:43 PM
To: (e-mail address removed)
Subject: webspider, regexp not working, why?

url = re.compile(r"^((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+@)?([a-zA-Z]

search and match yield the same results.

But when you put something like href= in front of it, it doesn't work.

a) '^' anchors the match to the beginning of the line. So if 'href=' comes
before the URL on the line, the URL is no longer at the beginning and the
pattern cannot match.

b) Regexes are hard enough to read as is. (http|ftp|https) is more
readable than (ht|f)tp(s?).

c) If you're going to parse html/xml then bite the bullet and learn one
of the libraries specifically designed to parse html/xml. Many other
regex gurus have learned this lesson. Myself included. =)
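
Points (a) and (b) above can be sketched together as follows. This is a hypothetical, simplified pattern in Python 3, not the original poster's full regexp: it drops the leading `^`, spells the scheme out as plain alternatives, and uses `re.VERBOSE` so each part can carry a comment. The sample row is the `<link>` tag from the question.

```python
import re

# Hypothetical simplified pattern (an assumption for illustration only):
# no leading ^, and the scheme is written as readable alternatives.
url = re.compile(r"""
    (https?|ftp)://        # scheme
    [\w\-]+(\.[\w\-]+)+    # host name, e.g. www.showmedo.com
    (:\d{1,5})?            # optional port
    [^\s"'<>]*             # path and query string, up to a quote or space
""", re.VERBOSE)

row = ('<link rel="alternate" type="application/rss+xml" '
       'title="Python Screencasts" '
       'href="http://www.showmedo.com/latestVideoFeed/rss2.0?tag=python" />')

match = url.search(row)
if match is not None:
    print("url:", match.group())
```

Because nothing anchors the pattern, `search` finds the URL even though `href="` precedes it. Point (c) still applies: a pattern like this will miss relative links and can be fooled by markup it was not written for.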


alex23 · May 24, 2008

c) If you're going to parse html/xml then bite the bullet and learn one
of the libraries specifically designed to parse html/xml. Many other
regex gurus have learned this lesson. Myself included. =)

Agreed. The BeautifulSoup approach is particularly nice (although not
part of stdlib):

import urllib
from BeautifulSoup import BeautifulSoup
html = urllib.urlopen('http://www.python.org/').read()
soup = BeautifulSoup(html)
links = [link['href'] for link in soup('link')]
links[0]

u'http://www.python.org/channews.rdf'

- alex23
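
For readers who want to stay within the standard library, the same idea can be sketched with `html.parser` (the Python 3 name of the module; Python 2 shipped it as `HTMLParser`). The `LinkCollector` class and the `sample` string are hypothetical names used only for this illustration, and the sketch parses the tag from the question rather than fetching a live page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <link> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        # Self-closing tags such as <link ... /> are routed here by default.
        if tag == "link":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


sample = ('<link rel="alternate" type="application/rss+xml" '
          'title="Python Screencasts" '
          'href="http://www.showmedo.com/latestVideoFeed/rss2.0?tag=python" />')

collector = LinkCollector()
collector.feed(sample)
collector.close()
print(collector.hrefs)
```

BeautifulSoup is generally more forgiving of broken markup, so this is an alternative under the assumption that the pages are reasonably well formed.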
