LWP: Any Easy Way to Use Relative Links?

Discussion in 'Perl Misc' started by Hal Vaughan, Mar 22, 2005.

  1. Hal Vaughan

    Hal Vaughan Guest

    I'm exploring LWP and trying to write a program that will pull down some web
    pages. When I read one page, I use regular expressions to find the links
    for other pages I want to download. Sometimes the links are relative
    (like /cgi/link.pl or subdir/newfile.html) instead of including a domain
    name. I don't see anything in the doc files about any consistency from one
    connection to another.

    Is there any module out there for keeping track of domains and handling
    relative URLs?

    I thought about writing code to detect them myself, but it seems hard to
    tell whether a string is a domain name. I'd look for periods, but I can't
    be sure it'll include a .com, .gov, or anything else unless I check all
    TLDs. Some URLs might not have a slash either (a bare domain name, or just
    a file in the same directory). So I can't think of a way to be sure whether
    a string includes a domain and full path or is a relative URL, other than
    trying to load it and checking the error message.

    I would think there's a module or something to help handle this either by
    tracking links used OR by easily determining if a link is absolute or
    relative.

    Thanks!

    Hal
     
    Hal Vaughan, Mar 22, 2005
    #1

  2. Gunnar Hjalmarsson

    Hal Vaughan wrote:
    > I'm exploring LWP and trying to write a program that will pull down some web
    > pages. When I read one page, I use regular expressions to find the links
    > for other pages I want to download. Sometimes the links are relative
    > (like /cgi/link.pl or subdir/newfile.html) instead of including a domain
    > name. I don't see anything in the doc files about any consistency from one
    > connection to another.
    >
    > Is there any module out there for keeping track of domains and handling
    > relative URLs?


    Maybe you are looking for URI::WithBase.

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Mar 22, 2005
    #2

  3. Hal Vaughan

    Hal Vaughan Guest

    Gunnar Hjalmarsson wrote:

    > Hal Vaughan wrote:
    >> I'm exploring LWP and trying to write a program that will pull down some
    >> web
    >> pages. When I read one page, I use regular expressions to find the links
    >> for other pages I want to download. Sometimes the links are relative
    >> (like /cgi/link.pl or subdir/newfile.html) instead of including a domain
    >> name. I don't see anything in the doc files about any consistency from
    >> one connection to another.
    >>
    >> Is there any module out there for keeping track of domains and handling
    >> relative URLs?

    >
    > Maybe you are looking for URI::WithBase.
    >


    Pretty close. I didn't know about that, and your comment led me to it, and
    from the docs on CPAN, that led me to URI. After experimenting with
    URI::WithBase, I realized I can't always tell if a link is relative or not,
    and URI::WithBase seems to expect you to know. URI includes $uri->scheme,
    which will return http for an http connection, and nothing if it's
    relative, which is a major help, and lets me detect if a URL is relative or
    not.
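
    A minimal sketch of that check, assuming the URI module from CPAN is
    installed. The helper name is_relative and the example.com links are made
    up for illustration. Note that a bare "www.example.com/page" has no scheme
    either, so this test treats it as relative:

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper (not part of URI): true when the link has no scheme,
# which means it is a relative reference.
sub is_relative {
    my ($link) = @_;
    return defined( URI->new($link)->scheme ) ? 0 : 1;
}

for my $link ('http://www.example.com/a.html', '/cgi/link.pl', 'subdir/newfile.html') {
    printf "%-32s %s\n", $link, is_relative($link) ? 'relative' : 'absolute';
}
```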

    Hal
     
    Hal Vaughan, Mar 22, 2005
    #3
  4. Jay Tilton

    Jay Tilton Guest

    Hal Vaughan wrote:

    : I'm exploring LWP and trying to write a program that will pull down some web
    : pages. When I read one page, I use regular expressions to find the links
    : for other pages I want to download.

    Regex-parsing HTML? Yuck.

    : Sometimes the links are relative
    : (like /cgi/link.pl or subdir/newfile.html) instead of including a domain
    : name. I don't see anything in the doc files about any consistency from one
    : connection to another.
    :
    : Is there any module out there for keeping track of domains and handling
    : relative URLs?

    HTML::LinkExtor is your one-stop answer. It can snatch links from HTML
    better than your regex can, and it can return all links in a
    fully-qualified form if given a base URL.
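
    A short sketch of that usage, assuming HTML::LinkExtor (part of the
    HTML-Parser distribution) is installed. The sample HTML and the
    example.com base URL are placeholders. With no callback, the parser
    collects the links and hands them back from links():

```perl
use strict;
use warnings;
use HTML::LinkExtor;

my $base = 'http://www.example.com/dir/page.html';   # placeholder page URL
my $html = <<'HTML';
<a href="/cgi/link.pl">Script</a>
<a href="subdir/newfile.html">New file</a>
<img src="logo.png">
HTML

# No callback (undef), so links are stored; the base URL makes them absolute.
my $extor = HTML::LinkExtor->new(undef, $base);
$extor->parse($html);
$extor->eof;

my @hrefs;
for my $link ($extor->links) {
    # Each entry is [ tag, attribute => url, ... ].
    my ($tag, %attr) = @$link;
    next unless $tag eq 'a' && defined $attr{href};
    push @hrefs, "$attr{href}";   # stringify the URI object
}

print "$_\n" for @hrefs;
```

    The <img> link is also reported by links(); the loop above just skips
    anything that is not an <a href>.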
     
    Jay Tilton, Mar 22, 2005
    #4
  5. John Bokma

    John Bokma Guest

    Hal Vaughan wrote:

    > Pretty close. I didn't know about that, and your comment led me to
    > it, and from the docs on CPAN, that lead me to URI. After
    > experimenting with URI::WithBase, I realized I can't always tell if a
    > link is relative or not, and URI::WithBase seems to expect you to
    > know. URI includes uri->scheme, which will return http for an http
    > connection, and nothing if it's relative, which is a major help, and
    > lets me detect if a URL is relative or not.


    $uri = URI->new_abs( $str, $base_uri )

    This constructs a new absolute URI object. The $str argument can denote a
    relative or absolute URI. If relative, then it will be absolutized using
    $base_uri as base. The $base_uri must be an absolute URI.

    No need for fancy scheme detection.

    And the base_uri you know, since you just fetched it :-D.

    >perl -e "use URI; print URI->new_abs('../baz', 'http://castleamber.com/foo/bar/')"
    http://castleamber.com/foo/baz

    >perl -e "use URI; print URI->new_abs('http://johnbokma.com/perl/', 'http://castleamber.com/foo/bar/')"
    http://johnbokma.com/perl/
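
    The same call works in a script. A minimal sketch, assuming the URL of
    the fetched page is known (with LWP, $response->base should give it).
    The example.com URLs and variable names are placeholders:

```perl
use strict;
use warnings;
use URI;

# Placeholder: the URL of the page the links were found on.
my $base = 'http://www.example.com/dir/page.html';

# Placeholder links: two relative, one already absolute.
my @found = ('/cgi/link.pl', 'subdir/newfile.html', 'http://other.example.org/x.html');

# new_abs() resolves relative links against $base and leaves absolute ones alone.
my @absolute = map { URI->new_abs($_, $base)->as_string } @found;

print "$_\n" for @absolute;
```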

    --
    John
    Small Perl scripts: http://johnbokma.com/perl/
    Perl programmer available: http://castleamber.com/
    Happy Customers: http://castleamber.com/testimonials.html
     
    John Bokma, Mar 22, 2005
    #5
  6. Hal Vaughan

    Hal Vaughan Guest

    John Bokma wrote:

    > Hal Vaughan wrote:
    >
    >> Pretty close. I didn't know about that, and your comment led me to
    >> it, and from the docs on CPAN, that lead me to URI. After
    >> experimenting with URI::WithBase, I realized I can't always tell if a
    >> link is relative or not, and URI::WithBase seems to expect you to
    >> know. URI includes uri->scheme, which will return http for an http
    >> connection, and nothing if it's relative, which is a major help, and
    >> lets me detect if a URL is relative or not.

    >
    > $uri = URI->new_abs( $str, $base_uri )
    >
    > This constructs a new absolute URI object. The $str argument can denote a
    > relative or absolute URI. If relative, then it will be absolutized using
    > $base_uri as base. The $base_uri must be an absolute URI.
    >
    > No need for fancy scheme detection.


    Great! It works even better. I must not have tested it properly, since I
    missed it first time around.

    Thanks!

    Hal

    > And the base_uri you know, since you just fetched it :-D.
    >
    >>perl -e "use URI; print URI->new_abs('../baz', 'http://castleamber.com/foo/bar/')"
    > http://castleamber.com/foo/baz
    >
    >>perl -e "use URI; print URI->new_abs('http://johnbokma.com/perl/', 'http://castleamber.com/foo/bar/')"
    > http://johnbokma.com/perl/
    >
    >
     
    Hal Vaughan, Mar 22, 2005
    #6
  7. Hal Vaughan

    Hal Vaughan Guest

    Jay Tilton wrote:

    > Hal Vaughan wrote:
    >
    > : I'm exploring LWP and trying to write a program that will pull down some
    > : web
    > : pages. When I read one page, I use regular expressions to find the
    > : links for other pages I want to download.
    >
    > Regex-parsing HTML? Yuck.
    >
    > : Sometimes the links are relative
    > : (like /cgi/link.pl or subdir/newfile.html) instead of including a domain
    > : name. I don't see anything in the doc files about any consistency from
    > : one connection to another.
    > :
    > : Is there any module out there for keeping track of domains and handling
    > : relative URLs?
    >
    > HTML::LinkExtor is your one-stop answer. It can snatch links from HTML
    > better than your regex can, and it can return all links in a
    > fully-qualified form if given a base URL.


    I had never seen that before. I'll look into it. For this project, though,
    I scan each page for links that have a specific phrase as the displayed
    text. Once I find such a link, I pull out the URL. From what I see in
    HTML::LinkExtor, I'd still have to do something close to what I do now.

    As of now, I do this:

    $page =~ s/\n//g;                        # strip all newlines so a link can't span lines
    @links = $page =~ /(<a href.*?<\/a>)/gi; # collect every <a href ...>...</a> element

    Then I loop through each link and see if it includes the text part I want.
    With HTML::LinkExtor, I'd still have to loop through all the links.
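
    A rough sketch of that loop, capturing the href and the link text in one
    pass and resolving the result with URI->new_abs. The sample page, $phrase,
    and $base are placeholders, and a regex like this will miss unusual markup
    (unquoted attributes, nested tags), so treat it as illustration only:

```perl
use strict;
use warnings;
use URI;

my $base   = 'http://www.example.com/dir/page.html';   # placeholder page URL
my $phrase = 'Download';                                # placeholder link text
my $page   = <<'HTML';
<p><a href="/cgi/link.pl">Download report</a>
<a href="subdir/newfile.html">About us</a></p>
HTML

my @urls;
# $2 is the quoted href value, $3 is the text between <a ...> and </a>.
# The /s flag lets "." match newlines, so the page need not be flattened first.
while ($page =~ m{<a\s+[^>]*?href\s*=\s*(["'])(.*?)\1[^>]*>(.*?)</a>}gis) {
    my ($href, $text) = ($2, $3);
    next unless $text =~ /\Q$phrase\E/i;
    push @urls, URI->new_abs($href, $base)->as_string;
}

print "$_\n" for @urls;
```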

    It will be useful on the next project I'm doing, so thanks!

    Hal
     
    Hal Vaughan, Mar 22, 2005
    #7
  8. John Bokma

    John Bokma Guest

    Hal Vaughan wrote:

    > Jay Tilton wrote:
    >
    >> Hal Vaughan wrote:
    >>
    >> : I'm exploring LWP and trying to write a program that will pull down
    >> : some web
    >> : pages. When I read one page, I use regular expressions to find the
    >> : links for other pages I want to download.
    >>
    >> Regex-parsing HTML? Yuck.
    >>
    >> : Sometimes the links are relative
    >> : (like /cgi/link.pl or subdir/newfile.html) instead of including a
    >> : domain name. I don't see anything in the doc files about any
    >> : consistency from one connection to another.
    >> :
    >> : Is there any module out there for keeping track of domains and
    >> : handling relative URLs?
    >>
    >> HTML::LinkExtor is your one-stop answer. It can snatch links from
    >> HTML better than your regex can, and it can return all links in a
    >> fully-qualified form if given a base URL.

    >
    > I had never seen that before. I'll look into it. For this project,
    > though, I scan each page for specific links with a specific phrase as
    > the displayed text part of the link.


    Haven't used HTML::LinkExtor, but I have used HTML::TreeBuilder a lot,
    see also HTML::Element for some specific documentation.

    I guess look_down( _tag => 'a', ... ) will make several things a bit
    easier, especially if you want only specific links.
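
    Something along these lines, as a sketch only, assuming HTML::TreeBuilder
    and URI are installed. The sample HTML, the phrase "Download", and the
    example.com base URL are placeholders:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use URI;

my $base = 'http://www.example.com/dir/page.html';   # placeholder page URL
my $html = <<'HTML';
<p><a href="/cgi/link.pl">Download report</a>
<a href="subdir/newfile.html">About us</a>
<a href="http://other.example.org/file.zip">Download archive</a></p>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# look_down() accepts attribute criteria plus code refs; the code ref here
# keeps only <a> elements that have an href and whose text contains the phrase.
my @wanted = $tree->look_down(
    _tag => 'a',
    sub { defined $_[0]->attr('href') && $_[0]->as_text =~ /Download/i },
);

my @urls = map { URI->new_abs($_->attr('href'), $base)->as_string } @wanted;
$tree->delete;   # free the parse tree

print "$_\n" for @urls;
```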

    --
    John
    Small Perl scripts: http://johnbokma.com/perl/
    Perl programmer available: http://castleamber.com/
    Happy Customers: http://castleamber.com/testimonials.html
     
    John Bokma, Mar 22, 2005
    #8
  9. Bart Lateur

    Bart Lateur Guest

    Jay Tilton wrote:

    >HTML::LinkExtor is your one-stop answer. It can snatch links from HTML
    >better than your regex can, and it can return all links in a
    >fully-qualified form if given a base URL.


    See also HTML::SimpleLinkExtor for a similar module with a (maybe)
    simpler API.

    --
    Bart.
     
    Bart Lateur, Mar 23, 2005
    #9