Need Perl module to get <TITLE> tag of a web page

J.D. Baldwin · Mar 17, 2008

I've spent an hour searching on CPAN and there are simply too many
web-related modules and no good way (that I can think of) to search
for this in terms that aren't so broad that pretty much all of them
are returned as hits. So here I am asking.

Simple problem. Given a URL http://www.example.com/some/file/here.html,
retrieve and extract the title of the web page -- i.e., the content
of the <title> tag. Is there an equally simple solution? Thanks in
advance for any advice.

Koszalek Opalek · Mar 17, 2008

J.D. Baldwin said:
Simple problem. Given a URL http://www.example.com/some/file/here.html,
retrieve and extract the title of the web page -- i.e., the content
of the <title> tag. Is there an equally simple solution? Thanks in
advance for any advice.

#!/usr/bin/perl

use strict;
use LWP::Simple;

my $url = $ARGV[0] || die "Specify URL on the cmd line";
my $html = get ($url);
$html =~ m{<TITLE>(.*?)</TITLE>}gism;

print "$1\n";

Koszalek

Gunnar Hjalmarsson · Mar 17, 2008

Koszalek said:
Simple problem. Given a URL http://www.example.com/some/file/here.html,
retrieve and extract the title of the web page -- i.e., the content
of the <title> tag. Is there an equally simple solution? Thanks in
advance for any advice.

#!/usr/bin/perl

use strict;
use LWP::Simple;

my $url = $ARGV[0] || die "Specify URL on the cmd line";
my $html = get ($url);
$html =~ m{<TITLE>(.*?)</TITLE>}gism;

print "$1\n";

Why the /g and /m modifiers?
What if the <title> element contains attributes?

Improved (I hope) code:

$html =~ m{<TITLE.*?>(.*?)</TITLE>}is;
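
A slightly more defensive take on the same regex idea, as a minimal sketch only. The helper name title_from_html is made up for illustration and is not from any module. It assumes the page has already been fetched into a string (for example with LWP::Simple's get, which returns undef on failure), and it will still be fooled by things like a commented-out <title>, which is why the parser-based suggestions elsewhere in this thread are safer.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): extract the <title> text
# from an HTML string. Returns undef if there is no HTML or no title.
sub title_from_html {
    my ($html) = @_;
    return undef unless defined $html;    # e.g. the fetch failed

    # [^>]* allows attributes but cannot run past the end of the tag;
    # /i ignores case, /s lets . match newlines inside the title.
    if ( $html =~ m{<title\b[^>]*>(.*?)</title\s*>}is ) {
        my $title = $1;
        $title =~ s/\s+/ /g;     # collapse runs of whitespace
        $title =~ s/^ | $//g;    # trim the leading/trailing space
        return $title;
    }
    return undef;
}

my $sample = qq{<html><head><TITLE lang="en">\n Hello\n World </TITLE></head></html>};
print title_from_html($sample), "\n";    # prints "Hello World"
```

Checking the return value before printing avoids the original script's problem of printing a stale or undefined $1 when the match fails.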

J.D. Baldwin · Mar 17, 2008

In the previous article, Keith Keller said:

Use LWP to retrieve the page, HTML::TreeBuilder to build a syntax
tree, and HTML::Element's find_by_tag_name method to find the
element with the title tag. It sounds like more work than it is.

Oh, my, "TreeBuilder" is *exactly* what I needed. Thank you!

And thanks also to Koszalek Opalek for his answer elsethread.
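
For anyone reading along, the TreeBuilder route described above might look roughly like this. It is a sketch under assumptions: the helper name title_via_tree is hypothetical, and the HTML is passed in as a string (in a real script it would come from LWP, e.g. LWP::Simple's get).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: parse an HTML string into a tree and return the
# text of the first <title> element, or undef if there is none.
sub title_via_tree {
    my ($html) = @_;
    my $tree  = HTML::TreeBuilder->new_from_content($html);
    my $elem  = $tree->find_by_tag_name('title');
    my $title = $elem ? $elem->as_trimmed_text : undef;
    $tree->delete;    # release the tree explicitly when done
    return $title;
}

my $sample = '<html><head><title>Example page</title></head><body><p>Hi</p></body></html>';
my $title  = title_via_tree($sample);
print defined $title ? "$title\n" : "(no title found)\n";
```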

malec · Mar 17, 2008

http://www.perlnow.com/cgi-bin/l.CGI?file=getitle.cgi

J.D. Baldwin · Mar 18, 2008

The other replies to my post suggested LWP with other tools. Now I
cannot get LWP to work with a valid proxy setting. My script will
work with the http_proxy variable unset ... but I'm still curious why
this should be so (perl 5.8.8, LWP::Simple 1.4.1):

$ http_proxy='' perl -MLWP::Simple -e 'getprint "http://www.sn.no"' | head
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

[...]

$ http_proxy='valid_host' perl -MLWP::Simple -e 'getprint "http://www.sn.no"' | head
501 Protocol scheme '' is not supported <URL:http://www.sn.no>
$

Googling is no help, examining the RC is no help. Any suggestions?

Ben Morrow · Mar 18, 2008

Quoth J.D. Baldwin:

The other replies to my post suggested LWP with other tools. Now I
cannot get LWP to work with a valid proxy setting. My script will
work with the http_proxy variable unset ... but I'm still curious why
this should be so (perl 5.8.8, LWP::Simple 1.4.1):

$ http_proxy='' perl -MLWP::Simple -e 'getprint "http://www.sn.no"' | head
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

[...]

$ http_proxy='valid_host' perl -MLWP::Simple -e 'getprint "http://www.sn.no"' | head
501 Protocol scheme '' is not supported <URL:http://www.sn.no>
$

Googling is no help, examining the RC is no help. Any suggestions?

http_proxy needs to be a full URL, without path, such as
'http://proxy-host:3128'. This is for consistency with ftp_proxy, which
allows either ftp:// or http:// proxies; and presumably so that one
could use an https:// proxy for HTTP requests.

Ben
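
A rough sketch of setting the proxy from inside the script instead of relying on the environment. The check in looks_like_proxy_url is a hypothetical helper written for this example, not part of LWP, and 'proxy-host:3128' is just the placeholder host from the post above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: true if the value has the shape
# scheme://host[:port] with no path, as described above.
sub looks_like_proxy_url {
    my ($value) = @_;
    return 0 unless defined $value;
    return $value =~ m{^[a-z][a-z0-9+.-]*://[^/]+/?$}i ? 1 : 0;
}

my $proxy = 'http://proxy-host:3128';    # placeholder, substitute your own
die "Proxy '$proxy' needs a scheme such as http://\n"
    unless looks_like_proxy_url($proxy);

my $ua = LWP::UserAgent->new;
$ua->proxy( 'http', $proxy );            # send http:// requests via the proxy
print "http proxy: ", $ua->proxy('http'), "\n";
```

To keep using the environment variable instead, calling $ua->env_proxy on the user agent picks up http_proxy and friends, provided the value is a full URL as explained above.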

J.D. Baldwin · Mar 18, 2008

In the previous article, Ben Morrow said:
http_proxy needs to be a full URL, without path, such as
'http://proxy-host:3128'. This is for consistency with ftp_proxy,
which allows either ftp:// or http:// proxies; and presumably so
that one could use an https:// proxy for HTTP requests.

Ah, you know, I've run into that with other utilities, but I tested
against wget, which handles a plain hostname just fine. Thanks for
the reminder.

No Perl content remaining here, nothing to see ... move along ...

Brian Helterline · Mar 18, 2008

J.D. Baldwin said:
The other replies to my post suggested LWP with other tools. Now I
cannot get LWP to work with a valid proxy setting. My script will
work with the http_proxy variable unset ... but I'm still curious why
this should be so (perl 5.8.8, LWP::Simple 1.4.1):

$ http_proxy='' perl -MLWP::Simple -e 'getprint "http://www.sn.no"' | head
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

[...]

$ http_proxy='valid_host' perl -MLWP::Simple -e 'getprint "http://www.sn.no"' | head
501 Protocol scheme '' is not supported <URL:http://www.sn.no>

If LWP had found a scheme in http_proxy, the error would have printed it
inside the single quotes rather than ''.
Also make sure the value of http_proxy starts with "http://" and is not
just the proxy host name.

brian d foy · Mar 18, 2008

J.D. Baldwin said:
Oh, my, "TreeBuilder" is *exactly* what I needed. Thank you!

If you just want to get the title, HTML::HeadParser is what you need.
It already does all of the hard work for you.
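
A minimal sketch of what that could look like, assuming the page is already in a string; the helper name title_via_headparser is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::HeadParser;

# Hypothetical helper: feed an HTML string to HTML::HeadParser and
# return whatever it recorded as the Title header (undef if none).
sub title_via_headparser {
    my ($html) = @_;
    my $parser = HTML::HeadParser->new;
    $parser->parse($html);    # only the <head> section is examined
    $parser->eof;             # signal end of input
    return $parser->header('Title');
}

my $sample = '<html><head><title>Example page</title></head><body><p>Hi</p></body></html>';
my $title  = title_via_headparser($sample);
print defined $title ? "$title\n" : "(no title found)\n";
```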

J.D. Baldwin · Mar 19, 2008

In the previous article, brian d foy said:
If you just want to get the title, HTML::HeadParser is what you need.
It already does all of the hard work for you.

Glancing at it, it looks simple and powerful. I've already
implemented it the other way, but I'll file HeadParser away in my bag
of tricks, so thanks.
