LWP: Any Easy Way to Use Relative Links?

Hal Vaughan · Mar 22, 2005

I'm exploring LWP and trying to write a program that will pull down some web
pages. When I read one page, I use regular expressions to find the links
for other pages I want to download. Sometimes the links are relative
(like /cgi/link.pl or subdir/newfile.html) instead of including a domain
name. I don't see anything in the doc files about any consistency from one
connection to another.

Is there any module out there for keeping track of domains and handling
relative URLs?

I thought about writing a program to look for them, but it seems rather hard
to distinguish if a string is a domain name (I'd look for periods, but
can't be sure it'll include a .com, .gov, or anything else unless I check
all TLDs), and some URLs might not have a slash (if it's a domain name
only, or just a file in the same directory), so I can't think of a way to
be sure a string includes a domain and full path or is a relative URL
(other than trying to load it, and checking the error message).

I would think there's a module or something to help handle this either by
tracking links used OR by easily determining if a link is absolute or
relative.

Thanks!

Hal

Gunnar Hjalmarsson · Mar 22, 2005

Hal said:
I'm exploring LWP and trying to write a program that will pull down some web
pages. When I read one page, I use regular expressions to find the links
for other pages I want to download. Sometimes the links are relative
(like /cgi/link.pl or subdir/newfile.html) instead of including a domain
name. I don't see anything in the doc files about any consistency from one
connection to another.

Is there any module out there for keeping track of domains and handling
relative URLs?

Maybe you are looking for URI::WithBase.

Hal Vaughan · Mar 22, 2005

Gunnar said:
Maybe you are looking for URI::WithBase.

Pretty close. I didn't know about that, and your comment led me to it, and
from the docs on CPAN, that led me to URI. After experimenting with
URI::WithBase, I realized I can't always tell if a link is relative or not,
and URI::WithBase seems to expect you to know. URI includes uri->scheme,
which will return http for an http connection, and nothing if it's
relative, which is a major help, and lets me detect if a URL is relative or
not.
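
A minimal sketch of that scheme check, assuming the URI module is installed. The sample links are hypothetical, and the %kind hash is only an illustration:

```perl
use strict;
use warnings;
use URI;

# Hypothetical sample links (not taken from any real page).
my @links = (
    'http://www.example.com/index.html',
    '/cgi/link.pl',
    'subdir/newfile.html',
    'www.example.com/page.html',   # no scheme, so it counts as relative
);

my %kind;
for my $link (@links) {
    # scheme() returns undef when the string carries no scheme.
    my $scheme = URI->new($link)->scheme;
    $kind{$link} = defined $scheme ? 'absolute' : 'relative';
    print "$link: $kind{$link}\n";
}
```

Note that a bare host name without a scheme is reported as relative, which matches how a browser would resolve it, so there is no need to check a list of TLDs.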

Hal

Jay Tilton · Mar 22, 2005

: I'm exploring LWP and trying to write a program that will pull down some web
: pages. When I read one page, I use regular expressions to find the links
: for other pages I want to download.

Regex-parsing HTML? Yuck.

: Sometimes the links are relative
: (like /cgi/link.pl or subdir/newfile.html) instead of including a domain
: name. I don't see anything in the doc files about any consistency from one
: connection to another.
:
: Is there any module out there for keeping track of domains and handling
: relative URLs?

HTML::LinkExtor is your one-stop answer. It can snatch links from HTML
better than your regex can, and it can return all links in a
fully-qualified form if given a base URL.
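
As a rough sketch of that approach, assuming HTML::LinkExtor is installed: passing a base URL as the second constructor argument makes the module return absolute URLs. The base URL and HTML fragment here are made up for illustration:

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical base URL and page content.
my $base = 'http://www.example.com/dir/page.html';
my $html = '<a href="/cgi/link.pl">One</a> <a href="subdir/newfile.html">Two</a>';

# No callback (undef), so links are collected and read back with links().
my $parser = HTML::LinkExtor->new(undef, $base);
$parser->parse($html);
$parser->eof;

my @urls;
for my $link ($parser->links) {
    # Each entry is [ $tag, attr_name => url, ... ].
    my ($tag, %attr) = @$link;
    push @urls, "$attr{href}" if $tag eq 'a' && defined $attr{href};
}
print "$_\n" for @urls;
```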

John Bokma · Mar 22, 2005

Hal said:
Pretty close. I didn't know about that, and your comment led me to
it, and from the docs on CPAN, that led me to URI. After
experimenting with URI::WithBase, I realized I can't always tell if a
link is relative or not, and URI::WithBase seems to expect you to
know. URI includes uri->scheme, which will return http for an http
connection, and nothing if it's relative, which is a major help, and
lets me detect if a URL is relative or not.

$uri = URI->new_abs( $str, $base_uri )

This constructs a new absolute URI object. The $str argument can denote a
relative or absolute URI. If relative, then it will be absolutized using
$base_uri as base. The $base_uri must be an absolute URI.

No need for fancy scheme detection.

And the base_uri you know, since you just fetched it :-D.

perl -e "use URI; print URI->new_abs('../baz', 'http://castleamber.com/foo/bar/')"
http://castleamber.com/foo/baz

perl -e "use URI; print URI->new_abs('http://johnbokma.com/perl/', 'http://castleamber.com/foo/bar/')"
http://johnbokma.com/perl/

Hal Vaughan · Mar 22, 2005

John said:
$uri = URI->new_abs( $str, $base_uri )

This constructs a new absolute URI object. The $str argument can denote a
relative or absolute URI. If relative, then it will be absolutized using
$base_uri as base. The $base_uri must be an absolute URI.

No need for fancy scheme detection.

Great! It works even better. I must not have tested it properly, since I
missed it first time around.

Thanks!

Hal

Hal Vaughan · Mar 22, 2005

Jay said:
: I'm exploring LWP and trying to write a program that will pull down some web
: pages. When I read one page, I use regular expressions to find the links
: for other pages I want to download.

Regex-parsing HTML? Yuck.

: Sometimes the links are relative
: (like /cgi/link.pl or subdir/newfile.html) instead of including a domain
: name. I don't see anything in the doc files about any consistency from
: one connection to another.
:
: Is there any module out there for keeping track of domains and handling
: relative URLs?

HTML::LinkExtor is your one-stop answer. It can snatch links from HTML
better than your regex can, and it can return all links in a
fully-qualified form if given a base URL.

I had never seen that before. I'll look into it. For this project, though,
I scan each page for specific links with a specific phrase as the displayed
text part of the link. Once I get that link, I pull out the url. From
what I see in HTML::LinkExtor, I'd still have to do it close to what I do.

As of now, I do this:

$page =~ s/\n//g; # remove all newlines
(@links) = $page =~ /(<a href.*?<\/a>)/gi; # get all links

Then I page through each link and see if it includes the text part I want.
With HTML::LinkExtor, I'd still have to loop through all the links.
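
The loop described above might look roughly like this when combined with URI->new_abs. This is only a sketch: the base URL, the page content, and the phrase "Download" are hypothetical, and the href pattern assumes quoted attribute values:

```perl
use strict;
use warnings;
use URI;

# Hypothetical base URL (the page just fetched) and page content.
my $base = 'http://www.example.com/dir/page.html';
my $page = qq{<a href="/cgi/link.pl">Download\nreport</a>\n}
         . qq{<a href="other.html">Home</a>\n};

$page =~ s/\n//g;                                  # remove all newlines
my @links = $page =~ /(<a href.*?<\/a>)/gi;        # get all links

my @urls;
for my $link (@links) {
    next unless $link =~ />[^<]*Download/;         # phrase in the link text
    my ($href) = $link =~ /href\s*=\s*["']([^"']+)["']/i or next;
    # Works for relative and absolute hrefs alike.
    push @urls, URI->new_abs($href, $base)->as_string;
}
print "$_\n" for @urls;
```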

It will be useful on the next project I'm doing, so thanks!

Hal

John Bokma · Mar 22, 2005

Hal said:
I had never seen that before. I'll look into it. For this project,
though, I scan each page for specific links with a specific phrase as
the displayed text part of the link.

Haven't used HTML::LinkExtor, but I have used HTML::TreeBuilder a lot,
see also HTML::Element for some specific documentation.

I guess look_down( _tag => 'a', ... ) will make several things a bit
easier, especially if you want only specific links.
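
Something along these lines, as a sketch assuming HTML::TreeBuilder and URI are installed. The base URL, the HTML, and the phrase being matched are all hypothetical:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use URI;

# Hypothetical base URL and page content.
my $base = 'http://www.example.com/dir/page.html';
my $html = '<p><a href="/cgi/link.pl">Download report</a> '
         . '<a href="other.html">Home</a></p>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# look_down() takes attribute => value pairs and code refs as criteria.
my @wanted = $tree->look_down(
    _tag => 'a',
    sub { defined $_[0]->attr('href') },
    sub { $_[0]->as_text =~ /Download/ },
);

my @urls = map { URI->new_abs($_->attr('href'), $base)->as_string } @wanted;
print "$_\n" for @urls;

$tree->delete;   # free the parse tree
```

Matching on as_text rather than on the raw markup means nested tags inside the link text do not get in the way.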

Bart Lateur · Mar 23, 2005

Jay said:
HTML::LinkExtor is your one-stop answer. It can snatch links from HTML
better than your regex can, and it can return all links in a
fully-qualified form if given a base URL.

See also HTML::SimpleLinkExtor for a similar module with a (maybe)
simpler API.
