More regex help

Support Desk · Sep 24, 2008

I am working on a Python web crawler that will extract all links from an
HTML page and add them to a queue. The problem I am having is building
absolute links from relative links, as there are so many different types of
relative links. If I just append the relative links to the current URL, some
websites will send it into a never-ending loop.

What I am looking for is a regexp that will extract the root URL from any
URL string I pass to it, such as:

'http://example.com/stuff/stuff/morestuff/index.html'

Regexp = 'http://example.com'

'http://anotherexample.com/stuff/index.php'

Regexp = 'http://anotherexample.com/'

'http://example.com/stuff/stuff/'

Regexp = 'http://example.com'
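
For reference, here is a minimal sketch of what such a root-URL extractor could look like in Python 3. The pattern `ROOT_RE` and the function names `root_url_regex` and `root_url_split` are illustrative assumptions, not something posted in this thread, and the result is returned without a trailing slash. The second function does the same job with the standard library's `urllib.parse.urlsplit` instead of a regexp (in Python 2, which was current when this thread was written, the equivalent module was named `urlparse`).

```python
import re
from urllib.parse import urlsplit

# Hypothetical pattern: a scheme, "://", then everything up to the
# first "/", "?" or "#" (that is, the host and optional port).
ROOT_RE = re.compile(r'^[A-Za-z][A-Za-z0-9+.-]*://[^/?#]+')


def root_url_regex(url):
    """Return the scheme-and-host prefix of url, or None if it has none."""
    match = ROOT_RE.match(url)
    return match.group(0) if match else None


def root_url_split(url):
    """Same idea, using the standard library's URL parser instead of a regexp."""
    parts = urlsplit(url)
    if not parts.scheme or not parts.netloc:
        return None
    return parts.scheme + '://' + parts.netloc
```

With the three URLs from the post, both functions should give 'http://example.com', 'http://anotherexample.com' and 'http://example.com'. A purely relative link such as '/stuff/index.php' has no root to extract, so both return None.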

Support Desk · Sep 24, 2008

Kirk,

That's exactly what I needed. Thx!

-----Original Message-----
From: Kirk Strauser [mailto:[email protected]]
Sent: Wednesday, September 24, 2008 11:42 AM
To: (e-mail address removed)
Subject: Re: More regex help

Support Desk said:
I am working on a Python web crawler that will extract all links from an
HTML page and add them to a queue. The problem I am having is building
absolute links from relative links, as there are so many different types of
relative links. If I just append the relative links to the current URL, some
websites will send it into a never-ending loop.

[...] '/foo')
'http://www.example.com/foo'
[...] 'http://slashdot.org/foo')
'http://slashdot.org/foo'
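
The start of the quoted example did not survive, so the exact calls are unknown. The surviving output is consistent with what the standard library's `urljoin` returns when a relative or absolute link is resolved against a page URL, so the sketch below assumes that approach. It is written for Python 3 (`urllib.parse`; the Python 2 module was `urlparse`). The helper `enqueue_links`, the `seen` set and the base URL in the usage note are hypothetical names chosen for illustration. The `seen` set is one possible way to avoid the never-ending loop described in the question, since each absolute URL is queued at most once.

```python
from collections import deque
from urllib.parse import urljoin, urldefrag


def enqueue_links(base_url, hrefs, seen, queue):
    """Resolve each href against base_url and queue the ones not seen before.

    base_url is the URL of the page the links were found on; hrefs are the
    raw href values, which may be relative or absolute.
    """
    for href in hrefs:
        # urljoin handles '/foo', 'bar.html', '../up.html' and full URLs alike.
        absolute = urljoin(base_url, href)
        # Drop any '#fragment' so the same page is not queued twice.
        absolute = urldefrag(absolute).url
        if absolute not in seen:
            seen.add(absolute)
            queue.append(absolute)
```

For example, with a base of 'http://www.example.com/stuff/index.html', the links '/foo', 'bar.html', '../up.html' and 'http://slashdot.org/foo' resolve to 'http://www.example.com/foo', 'http://www.example.com/stuff/bar.html', 'http://www.example.com/up.html' and 'http://slashdot.org/foo'. A later '/foo#top' is skipped because it points at a page that is already queued.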
