need some debug-infos on a simple regex

Martin Kaspar

Hello dear list!

I'm very new to programming and am teaching myself. I'm having a
problem with a little project.

I'm trying to perform a fetch process, but every time I try it I run
into errors.
I have read the Python documentation for more than ten hours now, and I
have several books here, but they do not help at the moment. This code
runs like a charm:


import urllib
import urlparse
import re

url = "http://search.cpan.org/author/?W"
html = urllib.urlopen(url).read()
for lk, capname, name in re.findall('<a href="(/~.*?/)"><b>(.*?)</b></a><br/><small>(.*?)</small>', html):
    alk = urlparse.urljoin(url, lk)

    data = { 'url':alk, 'name':name, 'cname':capname }

    phtml = urllib.urlopen(alk).read()
    memail = re.search('<a href="mailto:(.*?)">', phtml)
    if memail:
        data['email'] = memail.group(1)

    print data

Note that the code above runs very well. Now I want to apply it to a
new target; I can learn a lot from this. Let us say this Swiss site,
educa.ch.

The aim: I want to adapt it to a new target to learn more about
regexes and to do some homework (I work as a teacher and am
collecting some data about colleagues). How should we fetch the pages?
That is the problem. I want to learn while applying the code. What is
necessary to apply the example to the target?

The target: http://www.educa.ch/dyn/79362.asp?action=search

But the code (see below) does not run. I tried several things to
debug it. Can you help me?
BTW, should I fetch the pages and load them into an array, or should I
loop over the

http://www.educa.ch/dyn/79376.asp?id=2635
http://www.educa.ch/dyn/79376.asp?id=3493
and so on...

Here is the code that does not work:

import urllib
import urlparse
import re

url = "http://www.educa.ch/dyn/"
html = urllib.urlopen("http://www.educa.ch/dyn/79362.asp?action=search").read()
for capname, lk in re.findall('<a name="\d+"></a><br><img [^>]+>([^<]+).*?<a href="#\d+" onclick="javascript: window.open\(\'(\d+.asp?id=\d+)\'', html):
    alk = urlparse.urljoin(url, lk)

    data = { 'url':alk, 'cname':capname }

    phtml = urllib.urlopen(alk).read()
    memail = re.search('<a href="mailto.*?)">', phtml)
    if memail:
        data['email'] = memail.group(1)

    print data

I look forward to getting some starting points...

thx matze
 
Martin Gregorie

This doesn't directly help with your problem, but the tool at this URL:
http://www.solmetra.com/scripts/regex/

may be useful when you're experimenting with regexes or testing them.
Perl regexes are similar enough to Python regexes for this tool to be
useful here.

Without examples of the text that the regex is intended to match, it's
difficult to say more.
 
MRAB

Don't just say "does not run" or "does not work". That's not very
helpful. It's like saying "My car doesn't work. How should I fix it?".
:)

When writing regexes it's recommended that you use raw string literals.
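
To illustrate why raw strings matter, here is a minimal sketch. The sample text and variable names are made up for the demonstration and are not from the original post. In an ordinary string literal Python itself interprets some backslash sequences before the re module ever sees them:

```python
import re

# In an ordinary literal, '\b' is turned into a backspace character by Python.
# In a raw literal it stays as backslash + 'b', which re reads as "word boundary".
plain = '\bcat\b'
raw = r'\bcat\b'

text = 'the cat sat'   # made-up sample text

plain_match = re.search(plain, text)   # looks for backspace characters
raw_match = re.search(raw, text)       # looks for the word 'cat'

print(len(plain), len(raw))
print(plain_match)
print(raw_match.group(0))
```

Sequences such as '\d' happen to survive in an ordinary literal because Python does not recognise them as escapes, but relying on that is fragile, and newer Python versions warn about it.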

Your first regex contains 'asp?', which says that the 'p' is optional.
I think you meant 'asp\?'. Also, '.' will match any character except
'\n'. If you want to match an actual '.', use '\.'.
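
A small sketch of the difference, assuming the page contains a fragment shaped like the one the regex is aiming at. The sample strings below are invented from the URLs in the post, not copied from the real page:

```python
import re

# Hypothetical fragment imitating the onclick attribute; the real page may differ.
sample = "window.open('79376.asp?id=2635'"

# Unescaped: 'p?' means "an optional p", so nothing consumes the literal '?'.
loose = re.findall(r"(\d+.asp?id=\d+)", sample)

# Escaped: '\.' and '\?' match the literal characters.
strict = re.findall(r"(\d+\.asp\?id=\d+)", sample)

# An unescaped '.' also accepts characters you did not intend.
dot_any = re.search(r"\d+.asp", "79376xasp")
dot_literal = re.search(r"\d+\.asp", "79376xasp")

print(loose)
print(strict)
print(dot_any is not None, dot_literal is not None)
```

With this sample, the unescaped pattern finds nothing, while the escaped one returns the relative link.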

Your second regex contains a closing parenthesis ')' but no opening
parenthesis '('.
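
You can see this without fetching anything: compiling the pattern on its own raises an error. The sketch below compares it with the pattern from your working CPAN script; the sample anchor tag and address are made up:

```python
import re

broken = r'<a href="mailto.*?)">'    # as posted: ')' without a matching '('
fixed = r'<a href="mailto:(.*?)">'   # the pattern from the working script

try:
    re.compile(broken)
    error = None
except re.error as exc:
    # re refuses to compile a pattern with an unbalanced parenthesis
    error = exc

print(error)

# Made-up sample tag, only to show what group(1) captures.
sample = '<a href="mailto:someone@example.org">mail</a>'
memail = re.search(fixed, sample)
print(memail.group(1))
```

Compiling each pattern separately like this is a quick way to tell a regex syntax error apart from a pattern that is valid but simply matches nothing.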
 
Steve Holden

This doesn't directly help with your problem, but the tool at this URL:
http://www.solmetra.com/scripts/regex/

may be useful when you're experimenting with regexes or testing them.
Perl regexes are similar enough to Python regexes for this tool to be
useful here.

Without examples of the text that the regex is intended to match, it's
difficult to say more.
Or you could look at the Kodos tool, which is written in Python and will
tell you exactly what a Python pattern will and will not match.

http://kodos.sourceforge.net/

regards
Steve
 
