need some debug-infos on a simple regex

Martin Kaspar · Nov 13, 2010

Hello dear list!

I'm very new to programming and am teaching myself. I'm having a
problem with a little project.

I'm trying to perform a fetch process, but every time I try it I run
into errors.
I have read the Python documentation for more than ten hours now, and I
have several books here,
but they do not help at the moment. The following code runs like a charm:

import urllib
import urlparse
import re

url = "http://search.cpan.org/author/?W"
html = urllib.urlopen(url).read()
for lk, capname, name in re.findall('<a href="(/~.*?/)">(.*?)</a> (.*?)', html):
    alk = urlparse.urljoin(url, lk)

    data = { 'url':alk, 'name':name, 'cname':capname }

    phtml = urllib.urlopen(alk).read()
    memail = re.search('<a href="mailto:(.*?)">', phtml)
    if memail:
        data['email'] = memail.group(1)

    print data

Note that the code above runs very well. Now I want to apply it to a
new target, because I can learn a lot from this. Let us say this Swiss
site, educa.ch.

The aim: I want to adapt it to a new target to learn more about
regexes and to do some homework (I work as a teacher and am
collecting some data about colleagues). How should I fetch the pages?
That is the problem. I want to learn while applying the code.
What is necessary to apply the example to the target?

the target: http://www.educa.ch/dyn/79362.asp?action=search

But the code (see below) does not run. I tried several things to
debug it. Can you help me?
BTW, should I fetch the pages and load them into an array, or should I
loop over the

http://www.educa.ch/dyn/79376.asp?id=2635
http://www.educa.ch/dyn/79376.asp?id=3493
and so on?

Here is the code that does not work:

import urllib
import urlparse
import re

url = "http://www.educa.ch/dyn/"
html = urllib.urlopen("http://www.educa.ch/dyn/79362.asp?action=search").read()
for capname, lk in re.findall('<a name="\d+"></a> <img [^>]+>([^<]+).*?<a href="#\d+" onclick="javascript: window.open\(\'(\d+.asp?id=\d+)\'', html):
    alk = urlparse.urljoin(url, lk)

    data = { 'url':alk, 'cname':capname }

    phtml = urllib.urlopen(alk).read()
    memail = re.search('<a href="mailto.*?)">', phtml)
    if memail:
        data['email'] = memail.group(1)

    print data

I look forward to getting some starting points.

thx matze

Martin Gregorie · Nov 13, 2010

This doesn't directly help with your problem, but the tool at this URL:
http://www.solmetra.com/scripts/regex/

may be useful when you're experimenting with regexes or testing them.
Perl regexes are similar enough to Python regexes for this tool to be
useful here.

Without examples of the text that the regex is intended to match, it's
difficult to say more.
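
A pattern can also be tried offline against a small saved sample before any page is fetched. The following is only a sketch in Python 3 syntax: the SAMPLE fragment is invented to fit the shape the posted pattern expects and is not copied from educa.ch, and the names SAMPLE, PATTERN and BASE are hypothetical. The pattern is the posted one with the '.' and '?' characters escaped.

```python
import re
from urllib.parse import urljoin

# Hypothetical HTML fragment, invented for illustration only.
SAMPLE = """<a name="1"></a> <img src="dot.gif">Example School</td>
<td><a href="#1" onclick="javascript: window.open('79376.asp?id=2635', 'detail')">Details</a>"""

# Raw, triple-quoted literals so that neither backslashes nor quotes need
# extra escaping at the Python level. re.DOTALL lets ".*?" cross line breaks.
PATTERN = re.compile(
    r"""<a name="\d+"></a> <img [^>]+>([^<]+).*?"""
    r"""<a href="#\d+" onclick="javascript: window\.open\('(\d+\.asp\?id=\d+)'""",
    re.DOTALL,
)

BASE = "http://www.educa.ch/dyn/"

# Build one record per match, turning the relative link into a full URL.
results = [
    {"cname": name.strip(), "url": urljoin(BASE, link)}
    for name, link in PATTERN.findall(SAMPLE)
]
print(results)
```

If the list comes back empty, the sample and the pattern disagree somewhere, and shortening the pattern piece by piece usually shows where.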

MRAB · Nov 13, 2010

Don't just say "does not run" or "does not work". That's not very
helpful. It's like saying "My car doesn't work. How should I fix it?".

When writing regexes it's recommended that you use raw string literals.
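
A small illustration of why raw strings matter, written as a sketch in Python 3 syntax (the variable names are made up for this example). In a normal string literal Python itself interprets some backslash sequences before the regex engine ever sees them. '\d' happens to survive because Python does not recognise it as an escape, but '\b' does not.

```python
import re

# In a normal literal "\b" is a single backspace character (chr(8)),
# so the regex engine never receives a word-boundary escape.
plain = "\bid\b"
# In a raw literal the backslashes are kept and reach the regex engine.
raw = r"\bid\b"

text = "79376.asp?id=2635"

plain_match = re.search(plain, text)  # searches for backspace, "id", backspace
raw_match = re.search(raw, text)      # searches for the whole word "id"

print(len(plain), len(raw))
print(plain_match)
print(raw_match.group(0))
```

The two literals differ in length (4 versus 6 characters), and only the raw one finds "id" in the text.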

Your first regex contains 'asp?', which is saying that 'p' is optional.
I think you meant 'asp\?'. Also, '.' will match any character except
'\n'. If you want to match an actual '.' then use '\.'.
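
The effect of the missing escapes can be checked on one of the links from the question. This is a minimal sketch in Python 3 syntax; the variable names and the second test string are invented for the example.

```python
import re

link = "79376.asp?id=2635"

# Unescaped: "p?" means "an optional p", so the literal "?" in the link
# is never consumed and the pattern cannot match.
unescaped = re.search(r"\d+.asp?id=\d+", link)

# Escaped: "\." and "\?" match the literal dot and question mark.
escaped = re.search(r"\d+\.asp\?id=\d+", link)

# An unescaped "." accepts any character, so this made-up string with an
# "X" in place of the dot is matched as well.
loose = re.search(r"\d+.asp\?id=\d+", "79376Xasp?id=1")

print(unescaped)
print(escaped.group(0))
print(loose.group(0))
```

The unescaped pattern finds nothing in the real link, which would explain an empty result from re.findall in the posted script.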

Your second regex contains a closing parenthesis ')' but no opening
parenthesis '('.
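
The unbalanced parenthesis is reported as soon as the pattern is compiled. The sketch below (Python 3 syntax) shows that, followed by one possible repair. The repaired pattern 'mailto:(.*?)' is an assumption about what was intended, and the sample HTML line and address are invented.

```python
import re

# The pattern as posted: a ")" without a matching "(".
broken = r'<a href="mailto.*?)">'
try:
    re.compile(broken)
    error_message = None
except re.error as exc:
    error_message = str(exc)

print(error_message)

# Assumed intent: capture everything after "mailto:" up to the closing quote.
fixed = re.compile(r'<a href="mailto:(.*?)">')

# Invented sample line, for illustration only.
sample = '<p><a href="mailto:someone@example.org">Mail</a></p>'
match = fixed.search(sample)
email = match.group(1) if match else None
print(email)
```

With the group in place, memail.group(1) in the posted script has something to return.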

Steve Holden · Nov 13, 2010

Or you could look at the Kodos tool, which is written in Python and will
tell you exactly what a Python pattern will and will not match.

http://kodos.sourceforge.net/

regards
Steve
