Regex for URL extracting

Nikita the Spider

Johny said:
Does anyone know about a good regular expression for URL extracting?

Extracting URLs from what?

If it is HTML, then I'd look at some existing HTML parsing modules like
Beautiful Soup and Barnes' HTMLData.
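
For the HTML case, here is a minimal sketch of the parser-based approach. It swaps in the standard library's html.parser in place of Beautiful Soup or HTMLData, purely so that nothing has to be installed. The LinkCollector name and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href/src attribute values from any tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.urls.append(value)


sample = '<p><a href="http://example.com/a">a</a> <img src="/img/b.png"></p>'
collector = LinkCollector()
collector.feed(sample)
print(collector.urls)
```

This only sees URLs that appear in href or src attributes. URLs written as plain text inside the page would still need a regex pass.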
 
Chris Mellon

Google turns this up:

http://geekswithblogs.net/casualjim/archive/2005/12/01/61722.aspx

But I've seen other re's for this problem that are hundreds of
characters long.

-- Paul

These are the regexps that gnome-terminal uses for its URL
auto-recognition, and I have shamelessly stolen them for use in one of
my own apps:

import re

# Patterns, in order: URL with a path, bare host with optional port,
# local file path, e-mail address. "nttp" is reproduced as posted.
urlfinders = [
    re.compile("([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}|(((news|telnet|nttp|file|http|ftp|https)://)|(www|ftp)[-A-Za-z0-9]*\\.)[-A-Za-z0-9\\.]+)(:[0-9]*)?/[-A-Za-z0-9_\\$\\.\\+\\!\\*\\(\\),;:@&=\\?/~\\#\\%]*[^]'\\.}>\\),\\\"]"),
    re.compile("([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}|(((news|telnet|nttp|file|http|ftp|https)://)|(www|ftp)[-A-Za-z0-9]*\\.)[-A-Za-z0-9\\.]+)(:[0-9]*)?"),
    re.compile("(~/|/|\\./)([-A-Za-z0-9_\\$\\.\\+\\!\\*\\(\\),;:@&=\\?/~\\#\\%]|\\\\)+"),
    # \\b stands in for the GNU-style \\< word-start anchor, which Python's re lacks.
    re.compile("\\b((mailto:)|)[-A-Za-z0-9\\.]+@[-A-Za-z0-9\\.]+"),
]
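
A list of compiled patterns like this is typically applied by running each one over the text and collecting the matches. The sketch below shows that loop. To stay short and self-contained it uses two deliberately simplified stand-in patterns, not the gnome-terminal ones, and the names simple_finders and find_urls are hypothetical.

```python
import re

# Deliberately simplified stand-in patterns (assumptions for this sketch only).
simple_finders = [
    # scheme://host[:port][/path]
    re.compile(r"(?:https?|ftp)://[-A-Za-z0-9.]+(?::[0-9]+)?(?:/[-A-Za-z0-9_$.+!*(),;:@&=?/~#%]*)?"),
    # optional mailto: prefix, then user@host
    re.compile(r"(?:mailto:)?[-A-Za-z0-9.]+@[-A-Za-z0-9.]+"),
]


def find_urls(text, finders=simple_finders):
    """Return (start, matched_text) for every match of every pattern, sorted by position."""
    hits = []
    for finder in finders:
        for match in finder.finditer(text):
            hits.append((match.start(), match.group(0)))
    return sorted(hits)


text = "See http://example.com/docs?id=1 or mail bob@example.com for details"
for start, url in find_urls(text):
    print(start, url)
```

Passing finders=urlfinders would run the same loop over the longer patterns. Because those patterns overlap (the first two can match the same host), the results may need de-duplicating.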
 
