Need Perl module to get <TITLE> tag of a web page

J.D. Baldwin

I've spent an hour searching on CPAN and there are simply too many
web-related modules and no good way (that I can think of) to search
for this in terms that aren't so broad that pretty much all of them
are returned as hits. So here I am asking.

Simple problem. Given a URL http://www.example.com/some/file/here.html,
retrieve and extract the title of the web page -- i.e., the content
of the <title> tag. Is there an equally simple solution? Thanks in
advance for any advice.
 
Koszalek Opalek

J.D. Baldwin said:
Simple problem. Given a URL http://www.example.com/some/file/here.html,
retrieve and extract the title of the web page -- i.e., the content
of the <title> tag. Is there an equally simple solution? Thanks in
advance for any advice.

#!/usr/bin/perl

use strict;
use LWP::Simple;

my $url = $ARGV[0] || die "Specify URL on the cmd line";
my $html = get ($url);
$html =~ m{<TITLE>(.*?)</TITLE>}gism;

print "$1\n";

Koszalek
 
Gunnar Hjalmarsson

Koszalek said:
#!/usr/bin/perl

use strict;
use LWP::Simple;

my $url = $ARGV[0] || die "Specify URL on the cmd line";
my $html = get ($url);
$html =~ m{<TITLE>(.*?)</TITLE>}gism;

print "$1\n";

Why the /g and /m modifiers?
What if the <title> element contains attributes?

Improved (I hope) code:

$html =~ m{<TITLE.*?>(.*?)</TITLE>}is;
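
For reference, here is a minimal sketch of a complete script along the lines of the two posts above. It assumes LWP::Simple is installed. The helper name title_from_html is made up for illustration, and the pattern uses [^>]* rather than .*? for the attributes, which is a variation and not what either poster wrote. It also checks that the fetch and the match succeeded before printing, because a failed match leaves $1 holding whatever an earlier match put there.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the text of the first <title> element in
# $html, or undef if there is no HTML or no title.
sub title_from_html {
    my ($html) = @_;
    return unless defined $html;

    # /i: case-insensitive tag name; /s: let . match newlines.
    # [^>]* tolerates attributes on the opening tag.
    return unless $html =~ m{<title\b[^>]*>(.*?)</title\s*>}is;

    my $title = $1;
    $title =~ s/^\s+|\s+$//g;    # trim leading and trailing whitespace
    $title =~ s/\s+/ /g;         # collapse internal runs of whitespace
    return $title;
}

# Fetch only when a URL is given on the command line.
if (my $url = shift @ARGV) {
    require LWP::Simple;         # loaded lazily; get() returns undef on failure
    my $title = title_from_html(LWP::Simple::get($url));
    defined $title or die "No <title> found at $url\n";
    print "$title\n";
}
```

A regex like this is fine for quick jobs but can be fooled by comments or scripts that happen to contain the string <title>, which is the argument for the parser-based modules suggested elsewhere in the thread.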
 
J.D. Baldwin

In the previous article, Keith Keller said:
Use LWP to retrieve the page, HTML::TreeBuilder to build a syntax
tree, and HTML::Element's find_by_tag_name method to find the
element with the title tag. It sounds like more work than it is.

Oh, my, "TreeBuilder" is *exactly* what I needed. Thank you!

And thanks also to Koszalek Opalek for his answer elsethread.
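
A rough sketch of the approach Keith Keller describes, assuming HTML::TreeBuilder (from the HTML-Tree distribution) and LWP::Simple are installed. The sub name title_via_tree is made up for illustration and is not from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: parse $html into a tree and return the text of
# the first <title> element, or undef if there is none.
sub title_via_tree {
    my ($html) = @_;
    my $tree  = HTML::TreeBuilder->new_from_content($html);
    my $elem  = $tree->find_by_tag_name('title');   # first match in scalar context
    my $title = $elem ? $elem->as_text : undef;
    $tree->delete;    # release the tree explicitly (matters on older HTML::Tree)
    return $title;
}

# Fetch only when a URL is given on the command line.
if (my $url = shift @ARGV) {
    require LWP::Simple;
    my $html = LWP::Simple::get($url);
    defined $html or die "Could not fetch $url\n";
    my $title = title_via_tree($html);
    print defined $title ? "$title\n" : "(no title)\n";
}
```

Building the whole tree is more work than a title lookup strictly needs, but the same tree can then be queried for anything else on the page.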
 
J.D. Baldwin

The other replies to my post suggested LWP with other tools. Now I
cannot get LWP to work with a valid proxy setting. My script will
work with the http_proxy variable unset ... but I'm still curious why
this should be so (perl 5.8.8, LWP::Simple 1.4.1):

$ http_proxy='' perl -MLWP::Simple -e 'getprint "http://www.sn.no"' | head
<!--Cookien sier:1000004--><!--Cookien sier:1000004--><!--Cookien sier:1000004--><!--Cookien sier:1000004--><!--Cookien sier:1000004--><!--Cookien sier:1000004--><!--Cookien sier:1000004--><!--Cookien sier:1000004--><!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

[...]

$ http_proxy='valid_host' perl -MLWP::Simple -e 'getprint "http://www.sn.no"' | head
501 Protocol scheme '' is not supported <URL:http://www.sn.no>
$

Googling is no help, examining the RC is no help. Any suggestions?
 
Ben Morrow

Quoth (e-mail address removed):
The other replies to my post suggested LWP with other tools. Now I
cannot get LWP to work with a valid proxy setting. My script will
work with the http_proxy variable unset ... but I'm still curious why
this should be so (perl 5.8.8, LWP::Simple 1.4.1):

$ http_proxy='' perl -MLWP::Simple -e 'getprint "http://www.sn.no"' | head
<!--Cookien sier:1000004--><!--Cookien sier:1000004--><!--Cookien sier:1000004--><!--Cookien sier:1000004--><!--Cookien sier:1000004--><!--Cookien sier:1000004--><!--Cookien sier:1000004--><!--Cookien sier:1000004--><!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

[...]

$ http_proxy='valid_host' perl -MLWP::Simple -e 'getprint "http://www.sn.no"' | head
501 Protocol scheme '' is not supported <URL:http://www.sn.no>
$

Googling is no help, examining the RC is no help. Any suggestions?

http_proxy needs to be a full URL, without path, such as
'http://proxy-host:3128'. This is for consistency with ftp_proxy, which
allows either ftp:// or http:// proxies; and presumably so that one
could use an https:// proxy for HTTP requests.

Ben
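
To illustrate the point about full URLs, here is a small sketch that tidies up the proxy value before handing it to LWP. The helper normalize_proxy is hypothetical, and defaulting a bare host to http:// is an assumption about what the user meant, not something LWP does itself. The empty '' in the 501 message appears to be the scheme LWP extracted from the proxy value, which is why a bare host name triggers it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn a bare "host" or "host:port" into a full
# proxy URL; leave values that already carry a scheme untouched.
sub normalize_proxy {
    my ($value) = @_;
    return unless defined $value && length $value;
    return $value if $value =~ m{^[a-z][a-z0-9+.-]*://}i;   # already has a scheme
    return "http://$value";                                  # assumed default
}

# Fetch only when a URL is given on the command line.
if (my $url = shift @ARGV) {
    require LWP::UserAgent;
    my $ua = LWP::UserAgent->new;
    if (defined(my $proxy = normalize_proxy($ENV{http_proxy}))) {
        $ua->proxy(http => $proxy);    # route http requests through the proxy
    }
    my $res = $ua->get($url);
    die $res->status_line, "\n" unless $res->is_success;
    print $res->status_line, "\n";
}
```

Setting the proxy on an LWP::UserAgent object sidesteps the environment variable entirely; with LWP::Simple the simplest fix is still to export http_proxy as a full URL.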
 
J.D. Baldwin

In the previous article said:
http_proxy needs to be a full URL, without path, such as
'http://proxy-host:3128'. This is for consistency with ftp_proxy,
which allows either ftp:// or http:// proxies; and presumably so
that one could use an https:// proxy for HTTP requests.

Ah, you know, I've run into that with other utilities, but I tested
against wget, which handles a plain hostname just fine. Thanks for
the reminder.

No Perl content remaining here, nothing to see ... move along ...
 
Brian Helterline

J.D. Baldwin said:
The other replies to my post suggested LWP with other tools. Now I
cannot get LWP to work with a valid proxy setting. My script will
work with the http_proxy variable unset ... but I'm still curious why
this should be so (perl 5.8.8, LWP::Simple 1.4.1):

$ http_proxy='' perl -MLWP::Simple -e 'getprint "http://www.sn.no"' | head
<!--Cookien sier:1000004--><!--Cookien sier:1000004--><!--Cookien sier:1000004--><!--Cookien sier:1000004--><!--Cookien sier:1000004--><!--Cookien sier:1000004--><!--Cookien sier:1000004--><!--Cookien sier:1000004--><!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

[...]

$ http_proxy='valid_host' perl -MLWP::Simple -e 'getprint "http://www.sn.no"' | head
501 Protocol scheme '' is not supported <URL:http://www.sn.no>

If you had set http_proxy, it would have printed something inside the
single quotes rather than ''.
Also make sure you specify http_proxy as starting with "http://", not
just the proxy name.
 
brian d foy

J.D. Baldwin said:
Oh, my, "TreeBuilder" is *exactly* what I needed. Thank you!

If you just want to get the title, HTML::HeadParser is what you need.
It already does all of the hard work for you.
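
A short sketch of what that might look like, assuming HTML::HeadParser (part of the HTML-Parser distribution) is installed. The sub name title_via_headparser is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::HeadParser;

# Hypothetical helper: parse only the <head> section of $html and
# return the value of its Title header.
sub title_via_headparser {
    my ($html) = @_;
    my $parser = HTML::HeadParser->new;
    $parser->parse($html);              # stops doing work once the head is done
    return $parser->header('Title');
}

# Fetch only when a URL is given on the command line.
if (my $url = shift @ARGV) {
    require LWP::Simple;
    my $html = LWP::Simple::get($url);
    defined $html or die "Could not fetch $url\n";
    my $title = title_via_headparser($html);
    print defined $title ? "$title\n" : "(no title)\n";
}
```

Because the parser only looks at the head section, it avoids building a tree for the whole page.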
 
J.D. Baldwin

In the previous article said:
If you just want to get the title, HTML::HeadParser is what you need.
It already does all of the hard work for you.

Glancing at it, it looks simple and powerful. I've already
implemented it the other way, but I'll file HeadParser away in my bag
of tricks, so thanks.
 
