LWP: Any Easy Way to Use Relative Links?

Hal Vaughan

I'm exploring LWP and trying to write a program that will pull down some web
pages. When I read one page, I use regular expressions to find the links
for other pages I want to download. Sometimes the links are relative
(like /cgi/link.pl or subdir/newfile.html) instead of including a domain
name. I don't see anything in the doc files about any consistency from one
connection to another.

Is there any module out there for keeping track of domains and handling
relative URLs?

I thought about writing a program to look for them, but it seems rather hard
to distinguish if a string is a domain name (I'd look for periods, but
can't be sure it'll include a .com, .gov, or anything else unless I check
all TLDs), and some URLs might not have a slash (if it's a domain name
only, or just a file in the same directory), so I can't think of a way to
be sure a string includes a domain and full path or is a relative URL
(other than trying to load it, and checking the error message).

I would think there's a module or something to help handle this either by
tracking links used OR by easily determining if a link is absolute or
relative.

Thanks!

Hal
 
Gunnar Hjalmarsson

Hal said:
I'm exploring LWP and trying to write a program that will pull down some web
pages. When I read one page, I use regular expressions to find the links
for other pages I want to download. Sometimes the links are relative
(like /cgi/link.pl or subdir/newfile.html) instead of including a domain
name. I don't see anything in the doc files about any consistency from one
connection to another.

Is there any module out there for keeping track of domains and handling
relative URLs?

Maybe you are looking for URI::WithBase.
 
Hal Vaughan

Gunnar said:
Maybe you are looking for URI::WithBase.

Pretty close. I didn't know about that, and your comment led me to it, and
from the docs on CPAN, that led me to URI. After experimenting with
URI::WithBase, I realized I can't always tell if a link is relative or not,
and URI::WithBase seems to expect you to know. URI includes $uri->scheme,
which returns http for an http URL and nothing (undef) if the URL is
relative, which is a major help, and lets me detect if a URL is relative or
not.
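
A minimal sketch of the scheme check described above, assuming only the URI module from CPAN. The helper name is_relative and the example.com URL are made up for illustration. Note that a bare "www.example.com/page" has no scheme either, so it is reported as relative.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: true when the link has no scheme (http, ftp, ...),
# which is how a relative reference shows up in a URI object.
sub is_relative {
    my ($link) = @_;
    my $scheme = URI->new($link)->scheme;    # undef for relative links
    return defined $scheme ? 0 : 1;
}

for my $link ('/cgi/link.pl', 'subdir/newfile.html', 'http://www.example.com/a.html') {
    printf "%-32s %s\n", $link, is_relative($link) ? 'relative' : 'absolute';
}
```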

Hal
 
Jay Tilton

: I'm exploring LWP and trying to write a program that will pull down some web
: pages. When I read one page, I use regular expressions to find the links
: for other pages I want to download.

Regex-parsing HTML? Yuck.

: Sometimes the links are relative
: (like /cgi/link.pl or subdir/newfile.html) instead of including a domain
: name. I don't see anything in the doc files about any consistency from one
: connection to another.
:
: Is there any module out there for keeping track of domains and handling
: relative URLs?

HTML::LinkExtor is your one-stop answer. It can snatch links from HTML
better than your regex can, and it can return all links in a
fully-qualified form if given a base URL.
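
A rough sketch of what that could look like, assuming HTML::LinkExtor is installed. The page content and the example.com base URL are invented for illustration. In a real program the HTML would come from the LWP response.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical page content and base URL, for illustration only.
my $base = 'http://www.example.com/docs/index.html';
my $html = <<'HTML';
<a href="/cgi/link.pl">Script</a>
<a href="subdir/newfile.html">New file</a>
<img src="logo.png">
HTML

# No callback (undef): links are collected and returned by links().
# Passing a base URL makes the parser absolutize every link it finds.
my $parser = HTML::LinkExtor->new(undef, $base);
$parser->parse($html);
$parser->eof;

my @hrefs;
for my $link ($parser->links) {
    my ($tag, %attr) = @$link;       # e.g. ('a', href => URI object)
    next unless $tag eq 'a' && $attr{href};
    push @hrefs, "$attr{href}";      # stringify the URI object
}
print "$_\n" for @hrefs;
```

The extractor reports every link-bearing tag (img, link, and so on), so the loop keeps only the href of a elements.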
 
John Bokma

Hal said:
Pretty close. I didn't know about that, and your comment led me to
it, and from the docs on CPAN, that led me to URI. After
experimenting with URI::WithBase, I realized I can't always tell if a
link is relative or not, and URI::WithBase seems to expect you to
know. URI includes $uri->scheme, which returns http for an http
URL and nothing (undef) if the URL is relative, which is a major help,
and lets me detect if a URL is relative or not.

$uri = URI->new_abs( $str, $base_uri )

This constructs a new absolute URI object. The $str argument can denote a
relative or absolute URI. If relative, then it will be absolutized using
$base_uri as base. The $base_uri must be an absolute URI.

No need for fancy scheme detection.

And the base_uri you know, since you just fetched it :-D.

perl -e "use URI; print URI->new_abs('../baz', 'http://castleamber.com/foo/bar/')"
http://castleamber.com/foo/baz

perl -e "use URI; print URI->new_abs('http://johnbokma.com/perl/', 'http://castleamber.com/foo/bar/')"
http://johnbokma.com/perl/
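
The same idea applied to a whole list of extracted links might look like the sketch below. The helper name absolutize and the example URLs are assumptions for illustration. With LWP, the base would normally come from the response you just fetched (HTTP::Response provides a base method for this).

```perl
use strict;
use warnings;
use URI;

# Hypothetical base: the URL of the page the links were extracted from.
my $base = 'http://www.example.com/docs/index.html';

# Hypothetical helper: resolve every link against the base.
# Links that are already absolute pass through unchanged.
sub absolutize {
    my ($base_uri, @links) = @_;
    return map { URI->new_abs($_, $base_uri)->as_string } @links;
}

my @absolute = absolutize(
    $base,
    '/cgi/link.pl',
    'subdir/newfile.html',
    'http://other.example.org/x.html',
);
print "$_\n" for @absolute;
```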
 
Hal Vaughan

John said:
$uri = URI->new_abs( $str, $base_uri )

This constructs a new absolute URI object. The $str argument can denote a
relative or absolute URI. If relative, then it will be absolutized using
$base_uri as base. The $base_uri must be an absolute URI.

No need for fancy scheme detection.

Great! It works even better. I must not have tested it properly, since I
missed it first time around.

Thanks!

Hal
 
Hal Vaughan

Jay said:
: I'm exploring LWP and trying to write a program that will pull down some web
: pages. When I read one page, I use regular expressions to find the
: links for other pages I want to download.

Regex-parsing HTML? Yuck.

: Sometimes the links are relative
: (like /cgi/link.pl or subdir/newfile.html) instead of including a domain
: name. I don't see anything in the doc files about any consistency from
: one connection to another.
:
: Is there any module out there for keeping track of domains and handling
: relative URLs?

HTML::LinkExtor is your one-stop answer. It can snatch links from HTML
better than your regex can, and it can return all links in a
fully-qualified form if given a base URL.

I had never seen that before. I'll look into it. For this project, though,
I scan each page for specific links with a specific phrase as the displayed
text part of the link. Once I get that link, I pull out the url. From
what I see in HTML::LinkExtor, I'd still have to do it close to what I do.

As of now, I do this:

$page =~ s/\n//g;    # remove all newlines so a link can't span lines
my @links = $page =~ /(<a href.*?<\/a>)/gi;    # capture every <a href...>...</a> element

Then I page through each link and see if it includes the text part I want.
With HTML::LinkExtor, I'd still have to loop through all the links.

It will be useful on the next project I'm doing, so thanks!

Hal
 
John Bokma

Hal said:
I had never seen that before. I'll look into it. For this project,
though, I scan each page for specific links with a specific phrase as
the displayed text part of the link.

Haven't used HTML::LinkExtor, but I have used HTML::TreeBuilder a lot,
see also HTML::Element for some specific documentation.

I guess look_down( _tag => 'a', ... ) will make several things a bit
easier, especially if you want only specific links.
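
As a hedged illustration of that suggestion, the sketch below picks out only the links whose displayed text contains a given phrase, then absolutizes them with URI->new_abs. It assumes HTML::TreeBuilder and URI are installed. The page content, the phrase, and the example.com base URL are invented.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use URI;

# Hypothetical inputs, for illustration only.
my $base   = 'http://www.example.com/docs/index.html';
my $phrase = 'Next page';
my $html   = <<'HTML';
<html><body>
<a href="page2.html">Next page</a>
<a href="/about.html">About</a>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Find <a> elements that have an href and whose text contains the phrase.
my @wanted = $tree->look_down(
    _tag => 'a',
    sub { defined $_[0]->attr('href') && $_[0]->as_text =~ /\Q$phrase\E/ },
);

# Resolve each href against the base before the tree is freed.
my @urls = map { URI->new_abs($_->attr('href'), $base)->as_string } @wanted;
$tree->delete;

print "$_\n" for @urls;
```

Filtering inside look_down avoids a separate loop over every link on the page.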
 
Bart Lateur

Jay said:
HTML::LinkExtor is your one-stop answer. It can snatch links from HTML
better than your regex can, and it can return all links in a
fully-qualified form if given a base URL.

See also HTML::SimpleLinkExtor for a similar module with a (maybe)
simpler API.
 
