better regular expression?

Vivek

Hi,

I am trying to construct a regular expression using the re module that
matches for
1. my hostname
2. absolute from the root URLs including just "/"
3. relative URLs.

Basically I want the pattern to not match URLs that are not on my
host.

The following statement satisfies numbers 1 and 2, but not 3:

line = re.sub(r'(href=")(http?://'+hostname+'[/]?|/)([^"]*?)(")', r'\1\2\3'+sInfo+r'\4', line)

An improvement that also partially satisfies number 3 is

line = re.sub(r'(href=")(http?://'+hostname+'[/]?|/|[^h][^t][^t][^p][^:][^/][^/])([^"]*?)(")', r'\1\2\3'+sInfo+r'\4', line)

This is not complete because if the relative URL is shorter than seven
characters, then it will not match.

Any suggestions?

Thanx.
 
Andy Gross

Check out the 'urlparse' module, in the standard library, unless for
some reason you *have* to use regular expressions.

/arg
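
A minimal sketch of what that could look like, assuming the goal is only to decide whether a URL is "local" in the three senses listed in the question. The function name is_local and the host example.com are made up for illustration. In Python 3 the same functions live in urllib.parse, so the import below tries that first.

```python
try:
    from urllib.parse import urlsplit  # Python 3 location
except ImportError:
    from urlparse import urlsplit      # Python 2 'urlparse' module

def is_local(url, hostname):
    """Hypothetical helper: True for URLs on `hostname`,
    root-absolute URLs such as "/", and relative URLs."""
    parts = urlsplit(url)
    if parts.netloc:
        # The URL names a host: accept it only if it is ours and the
        # scheme is http, https, or absent (as in "//host/path").
        return (parts.scheme in ("", "http", "https")
                and parts.hostname == hostname.lower())
    # No host part: "/", "/a/b" and "page.html" have no scheme and are
    # local; "mailto:..." or "javascript:..." have a scheme and are not.
    return parts.scheme == ""

for url in ("http://example.com/a", "/", "a.html", "http://other.org/"):
    print(url, is_local(url, "example.com"))
```

Because the check works on parsed components, the length of a relative URL does not matter.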
 
Roy Smith

Vivek said:
Hi,

I am trying to construct a regular expression using the re module that
matches for
1. my hostname
2. absolute from the root URLs including just "/"
3. relative URLs.

Is your goal to learn more about regexes, or to parse URLs? If the
latter, my suggestion would be to look at the urlparse module; the hard
work has already been done for you.
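
If the goal is the regex itself, one possible approach is a negative lookahead in place of the seven [^x] character classes, so a relative URL of any length can match. This is only a sketch under some assumptions: add_info and s_info are illustrative names (s_info plays the role of sInfo), the scheme is written as https? rather than the original http?, the hostname is passed through re.escape, and URLs starting with "//" are treated as pointing at another host.

```python
import re

def add_info(line, hostname, s_info):
    """Hypothetical helper: append s_info to href values that are on
    `hostname`, root-absolute, or relative; leave other hrefs alone."""
    pattern = re.compile(
        r'(href=")'
        # 1. our own host; the lookahead stops "example.com.evil.org"
        r'(https?://' + re.escape(hostname) + r'(?=[/"])/?'
        # 2. root-absolute path, but not a "//otherhost/..." URL
        r'|/(?!/)'
        # 3. relative URL: must not start with "scheme:" or "/"
        r'|(?![a-zA-Z][a-zA-Z0-9+.\-]*:|/))'
        r'([^"]*)(")'
    )
    # A function is used as the replacement so that backslashes in
    # s_info are not interpreted as group references.
    return pattern.sub(
        lambda m: m.group(1) + m.group(2) + m.group(3) + s_info + m.group(4),
        line,
    )

print(add_info('<a href="x">', "example.com", "?sid=1"))
print(add_info('<a href="http://other.org/x">', "example.com", "?sid=1"))
```

The third alternative matches the empty string whenever the text after href=" does not begin with a scheme name and a colon, which is why short relative URLs are no longer a problem. A hostname followed by a port number is not handled here.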
 
