Need Perl module to get <TITLE> tag of a web page

Discussion in 'Perl Misc' started by J.D. Baldwin, Mar 17, 2008.

  1. J.D. Baldwin

    J.D. Baldwin Guest

    I've spent an hour searching on CPAN and there are simply too many
    web-related modules and no good way (that I can think of) to search
    for this in terms that aren't so broad that pretty much all of them
    are returned as hits. So here I am asking.

    Simple problem. Given a URL http://www.example.com/some/file/here.html,
    retrieve and extract the title of the web page -- i.e., the content
    of the <title> tag. Is there an equally simple solution? Thanks in
    advance for any advice.
    --
    _+_ From the catapult of |If anyone disagrees with any statement I make, I
    _|70|___:)=}- J.D. Baldwin |am quite prepared not only to retract it, but also
    \ / |to deny under oath that I ever made it. -T. Lehrer
    ***~~~~-----------------------------------------------------------------------
    J.D. Baldwin, Mar 17, 2008
    #1

  2. On Mar 17, 12:32 pm, J.D. Baldwin wrote:
    > Simple problem.  Given a URL http://www.example.com/some/file/here.html,
    > retrieve and extract the title of the web page -- i.e., the content
    > of the <title> tag.  Is there an equally simple solution?  Thanks in
    > advance for any advice.


    #!/usr/bin/perl

    use strict;
    use LWP::Simple;

    my $url = $ARGV[0] || die "Specify URL on the cmd line";
    my $html = get ($url);
    $html =~ m{<TITLE>(.*?)</TITLE>}gism;

    print "$1\n";

    Koszalek
    Koszalek Opalek, Mar 17, 2008
    #2

  3. Koszalek Opalek wrote:
    > On Mar 17, 12:32 pm, J.D. Baldwin wrote:
    >> Simple problem. Given a URL http://www.example.com/some/file/here.html,
    >> retrieve and extract the title of the web page -- i.e., the content
    >> of the <title> tag. Is there an equally simple solution? Thanks in
    >> advance for any advice.

    >
    > #!/usr/bin/perl
    >
    > use strict;
    > use LWP::Simple;
    >
    > my $url = $ARGV[0] || die "Specify URL on the cmd line";
    > my $html = get ($url);
    > $html =~ m{<TITLE>(.*?)</TITLE>}gism;
    >
    > print "$1\n";


    Why the /g and /m modifiers?
    What if the <title> element contains attributes?

    Improved (I hope) code:

    $html =~ m{<TITLE.*?>(.*?)</TITLE>}is;
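
    For reference, the regex approach discussed above can be wrapped up as a
    small script. This is only a sketch: the extract_title helper name is made
    up for illustration, the pattern uses \b[^>]* instead of .*? so that a tag
    such as <titlebar> is not mistaken for <title>, and any regex over HTML
    can still be fooled by comments or script blocks.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the text of the first <title> element,
# or undef when the input has none.
sub extract_title {
    my ($html) = @_;
    return unless defined $html;

    # /i: case-insensitive tag name; /s: let . match newlines in the title.
    # \b[^>]* permits attributes but rejects longer tag names.
    if ($html =~ m{<title\b[^>]*>(.*?)</title\s*>}is) {
        my $title = $1;
        $title =~ s/\s+/ /g;          # collapse newlines and runs of spaces
        $title =~ s/^\s+|\s+$//g;     # trim both ends
        return $title;
    }
    return;
}

# Only touch the network when a URL is given on the command line.
if (@ARGV) {
    require LWP::Simple;
    my $html = LWP::Simple::get($ARGV[0]);
    defined $html or die "Could not fetch $ARGV[0]\n";
    my $title = extract_title($html);
    print defined $title ? "$title\n" : "(no title found)\n";
}
```

    Unlike the original, this checks that get() returned something and that
    the match succeeded before printing, so a failed fetch does not print a
    stale or undefined $1.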

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, Mar 17, 2008
    #3
  4. J.D. Baldwin

    J.D. Baldwin Guest

    In the previous article, Keith Keller
    <-francisco.ca.us> wrote, quoting me:
    > > Simple problem. Given a URL http://www.example.com/some/file/here.html,
    > > retrieve and extract the title of the web page -- i.e., the content
    > > of the <title> tag. Is there an equally simple solution? Thanks in
    > > advance for any advice.

    >
    > Use LWP to retrieve the page, HTML::TreeBuilder to build a syntax
    > tree, and HTML::Element's find_by_tag_name method to find the
    > element with the title tag. It sounds like more work than it is.


    Oh, my, "TreeBuilder" is *exactly* what I needed. Thank you!

    And thanks also to Koszalek Opalek for his answer elsethread.
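
    A minimal sketch of the LWP plus HTML::TreeBuilder approach suggested
    above, for anyone finding this thread later. The title_from_html name is
    an illustrative assumption, not something from Keith's post, and the
    script only fetches a page when a URL is passed on the command line.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: parse an HTML string and return the <title> text,
# or undef when no title element exists.
sub title_from_html {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # find_by_tag_name (from HTML::Element) returns the first match
    # in scalar context.
    my $node  = $tree->find_by_tag_name('title');
    my $title = $node ? $node->as_trimmed_text : undef;

    $tree->delete;    # release the tree explicitly (matters on older HTML::Tree)
    return $title;
}

if (@ARGV) {
    require LWP::Simple;
    my $html = LWP::Simple::get($ARGV[0]);
    defined $html or die "Could not fetch $ARGV[0]\n";
    my $title = title_from_html($html);
    print defined $title ? "$title\n" : "(no title found)\n";
}
```

    Building a full tree is more work than a regex, but it copes with
    attributes, odd casing, and entities in the title without extra code.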
    J.D. Baldwin, Mar 17, 2008
    #4
  5. malec

    malec Guest

    http://www.perlnow.com/cgi-bin/l.CGI?file=getitle.cgi

    On Mar 17, 4:32 pm, J.D. Baldwin wrote:
    > I've spent an hour searching on CPAN and there are simply too many
    > web-related modules and no good way (that I can think of) to search
    > for this in terms that aren't so broad that pretty much all of them
    > are returned as hits. So here I am asking.
    >
    > Simple problem. Given a URL http://www.example.com/some/file/here.html,
    > retrieve and extract the title of the web page -- i.e., the content
    > of the <title> tag. Is there an equally simple solution? Thanks in
    > advance for any advice.
    malec, Mar 17, 2008
    #5
  6. J.D. Baldwin

    J.D. Baldwin Guest

    LWP::Simple yields "protocol" error (was Re: Need Perl module to get <TITLE> tag of a web page)

    The other replies to my post suggested LWP with other tools. Now I
    cannot get LWP to work with a valid proxy setting. My script will
    work with the http_proxy variable unset ... but I'm still curious why
    this should be so (perl 5.8.8, LWP::Simple 1.4.1):

    $ http_proxy='' perl -MLWP::Simple -e 'getprint "http://www.sn.no"' | head
    <!--Cookien sier:1000004--><!--Cookien sier:1000004--><!--Cookien sier:1000004--><!--Cookien sier:1000004--><!--Cookien sier:1000004--><!--Cookien sier:1000004--><!--Cookien sier:1000004--><!--Cookien sier:1000004--><!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html>
    <head>
    <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
    <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

    [...]

    $ http_proxy='valid_host' perl -MLWP::Simple -e 'getprint "http://www.sn.no"' | head
    501 Protocol scheme '' is not supported <URL:http://www.sn.no>
    $

    Googling is no help, examining the RC is no help. Any suggestions?
    J.D. Baldwin, Mar 18, 2008
    #6
  7. Ben Morrow

    Ben Morrow Guest

    Re: LWP::Simple yields "protocol" error (was Re: Need Perl module to get <TITLE> tag of a web page)

    Quoth J.D. Baldwin:
    >
    > The other replies to my post suggested LWP with other tools. Now I
    > cannot get LWP to work with a valid proxy setting. My script will
    > work with the http_proxy variable unset ... but I'm still curious why
    > this should be so (perl 5.8.8, LWP::Simple 1.4.1):
    >
    > $ http_proxy='' perl -MLWP::Simple -e 'getprint "http://www.sn.no"' | head
    > <!--Cookien sier:1000004--><!--Cookien sier:1000004--><!--Cookien
    > sier:1000004--><!--Cookien sier:1000004--><!--Cookien
    > sier:1000004--><!--Cookien sier:1000004--><!--Cookien
    > sier:1000004--><!--Cookien sier:1000004--><!DOCTYPE html PUBLIC
    > "-//W3C//DTD HTML 4.01 Transitional//EN"><html>
    > <head>
    > <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
    > <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
    >
    > [...]
    >
    > $ http_proxy='valid_host' perl -MLWP::Simple -e 'getprint
    > "http://www.sn.no"' | head
    > 501 Protocol scheme '' is not supported <URL:http://www.sn.no>
    > $
    >
    > Googling is no help, examining the RC is no help. Any suggestions?


    http_proxy needs to be a full URL, without path, such as
    'http://proxy-host:3128'. This is for consistency with ftp_proxy, which
    allows either ftp:// or http:// proxies; and presumably so that one
    could use an https:// proxy for HTTP requests.
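
    To illustrate the point about needing a full URL, here is a rough sketch
    that takes whatever is in http_proxy and adds a scheme when it is missing
    before handing it to LWP::UserAgent. The normalize_proxy helper is
    hypothetical, it assumes a bare host means a plain HTTP proxy, and the
    10-second timeout is an arbitrary choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: turn "host" or "host:port" into "http://host:port".
# Values that already carry a scheme are returned unchanged.
sub normalize_proxy {
    my ($proxy) = @_;
    return unless defined $proxy && length $proxy;
    return $proxy =~ m{^[a-z][a-z0-9+.-]*://}i ? $proxy : "http://$proxy";
}

my $ua = LWP::UserAgent->new(timeout => 10);

# Set the proxy explicitly rather than relying on env_proxy, so a bare
# hostname in the environment no longer yields an empty scheme.
if (defined(my $proxy = normalize_proxy($ENV{http_proxy}))) {
    $ua->proxy('http', $proxy);
}

if (@ARGV) {
    my $res = $ua->get($ARGV[0]);
    die $res->status_line, "\n" unless $res->is_success;
    print $res->decoded_content;
}
```

    Setting http_proxy to a full URL in the shell, as described above, is
    still the simpler fix; this only helps when the environment cannot be
    changed.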

    Ben
    Ben Morrow, Mar 18, 2008
    #7
  8. J.D. Baldwin

    J.D. Baldwin Guest

    Re: LWP::Simple yields "protocol" error (was Re: Need Perl module to get <TITLE> tag of a web page)

    In the previous article, Ben Morrow wrote:
    > http_proxy needs to be a full URL, without path, such as
    > 'http://proxy-host:3128'. This is for consistency with ftp_proxy,
    > which allows either ftp:// or http:// proxies; and presumably so
    > that one could use an https:// proxy for HTTP requests.


    Ah, you know, I've run into that with other utilities, but I tested
    against wget, which handles a plain hostname just fine. Thanks for
    the reminder.

    No Perl content remaining here, nothing to see ... move along ...
    J.D. Baldwin, Mar 18, 2008
    #8
  9. Re: LWP::Simple yields "protocol" error (was Re: Need Perl module to get <TITLE> tag of a web page)

    J.D. Baldwin wrote:
    >
    > The other replies to my post suggested LWP with other tools. Now I
    > cannot get LWP to work with a valid proxy setting. My script will
    > work with the http_proxy variable unset ... but I'm still curious why
    > this should be so (perl 5.8.8, LWP::Simple 1.4.1):
    >
    > $ http_proxy='' perl -MLWP::Simple -e 'getprint "http://www.sn.no"' | head
    > <!--Cookien sier:1000004--><!--Cookien sier:1000004--><!--Cookien sier:1000004--><!--Cookien sier:1000004--><!--Cookien sier:1000004--><!--Cookien sier:1000004--><!--Cookien sier:1000004--><!--Cookien sier:1000004--><!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html>
    > <head>
    > <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
    > <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
    >
    > [...]
    >
    > $ http_proxy='valid_host' perl -MLWP::Simple -e 'getprint "http://www.sn.no"' | head
    > 501 Protocol scheme '' is not supported <URL:http://www.sn.no>

    If you had set http_proxy, it would have printed something inside the
    single quotes rather than ''.
    Also make sure you specify http_proxy as starting with "http://", not
    just the proxy name.
    Brian Helterline, Mar 18, 2008
    #9
  10. brian d foy

    brian d foy Guest

    In article <frmrj2$ceu$>, J.D. Baldwin wrote:

    > In the previous article, Keith Keller
    > <-francisco.ca.us> wrote, quoting me:
    > > > Simple problem. Given a URL http://www.example.com/some/file/here.html,
    > > > retrieve and extract the title of the web page -- i.e., the content
    > > > of the <title> tag. Is there an equally simple solution? Thanks in
    > > > advance for any advice.



    > Oh, my, "TreeBuilder" is *exactly* what I needed. Thank you!


    If you just want to get the title, HTML::HeadParser is what you need.
    It already does all of the hard work for you.
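
    A short sketch of what that might look like, assuming the page has
    already been fetched into a string. The head_title name is made up for
    this example; HTML::HeadParser exposes what it finds in <head> as
    header fields, with the title under 'Title'.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::HeadParser;

# Hypothetical helper: return the document title as seen by
# HTML::HeadParser, or undef when the head has no <title>.
sub head_title {
    my ($html) = @_;
    my $parser = HTML::HeadParser->new;
    $parser->parse($html);    # the parser stops once the <head> section ends
    $parser->eof;
    return scalar $parser->header('Title');
}

if (@ARGV) {
    require LWP::Simple;
    my $html = LWP::Simple::get($ARGV[0]);
    defined $html or die "Could not fetch $ARGV[0]\n";
    my $title = head_title($html);
    print defined $title ? "$title\n" : "(no title found)\n";
}
```

    Because only the head section is examined, this avoids building a tree
    for the whole document when the title is all that is wanted.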
    brian d foy, Mar 18, 2008
    #10
  11. J.D. Baldwin

    J.D. Baldwin Guest

    In the previous article, brian d foy wrote:
    > > > > Simple problem. Given a URL http://www.example.com/some/file/here.html,
    > > > > retrieve and extract the title of the web page -- i.e., the content
    > > > > of the <title> tag. Is there an equally simple solution? Thanks in
    > > > > advance for any advice.

    >
    >
    > > Oh, my, "TreeBuilder" is *exactly* what I needed. Thank you!

    >
    > If you just want to get the title, HTML::HeadParser is what you need.
    > It already does all of the hard work for you.


    Glancing at it, it looks simple and powerful. I've already
    implemented it the other way, but I'll file HeadParser away in my bag
    of tricks, so thanks.
    J.D. Baldwin, Mar 19, 2008
    #11
