HTML::TokeParser for a web page?

P.R.Brady

TokeParser looks a really useful tool for parsing HTML but will it only
take input from a file? Is it possible to get it to munge a web page
directly or even a scalar holding the page content (eg previously
grabbed with get)?

This works:

use strict;
use warnings;
use HTML::TokeParser;

my $file = 'c:/Perl/html/index.html';
my $p = HTML::TokeParser->new($file)
    or die "Can't open $file: $!";
while (my $token = $p->get_token) {
    print $token->[0], "\n";
    # etc
}

but not:
$file='file:///c:/Perl/html/index.html';
or
$file='http://www.bangor.ac.uk/';

I'm running version v5.6.1 under Windoze.

Regards
Phil
 
Paul Lalli

P.R.Brady said:
TokeParser looks a really useful tool for parsing HTML but will it only
take input from a file? Is it possible to get it to munge a web page
directly or even a scalar holding the page content (eg previously
grabbed with get)?

From the documentation (perldoc HTML::TokeParser):


$p = HTML::TokeParser->new( \$document );
If the argument is a reference to a plain scalar, then this scalar is
taken to be the literal document to parse. The value of this scalar
should not be changed before all tokens have been extracted.


So in a word, yes.

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use HTML::TokeParser;

my $doc = get("http://www.yahoo.com");
die "Couldn't fetch the page\n" unless defined $doc;
my $parser = HTML::TokeParser->new(\$doc);

if ($parser->get_tag("title")) {
    my $title = $parser->get_trimmed_text;
    print "Title: $title\n";
}
__END__
Title: Yahoo!

Paul Lalli
 
Brian Gough

P.R.Brady said:
TokeParser looks a really useful tool for parsing HTML but will it only
take input from a file? Is it possible to get it to munge a web page
directly or even a scalar holding the page content (eg previously
grabbed with get)?

According to the documentation (perldoc HTML::TokeParser) it
accepts a filename, a file handle, or a reference to a string
containing the document.
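
As a rough sketch of those three constructor forms side by side (the sample HTML, the temporary file and the helper first_title are made up for illustration and are not from the module's documentation):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use HTML::TokeParser;

# Hypothetical sample document, used only for illustration.
my $html = '<html><head><title>Demo page</title></head><body><p>Hi</p></body></html>';

# Hypothetical helper: text of the first <title> element, or undef if none.
sub first_title {
    my ($parser) = @_;
    return unless $parser->get_tag('title');
    return $parser->get_trimmed_text;
}

# 1. Reference to a scalar holding the document.
my $from_scalar = first_title(HTML::TokeParser->new(\$html));

# 2. File name: write the sample to a temporary file first.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} $html;
close $tmp_fh or die "Can't close $tmp_name: $!";
my $file_parser = HTML::TokeParser->new($tmp_name)
    or die "Can't open $tmp_name: $!";
my $from_file = first_title($file_parser);

# 3. Already-open file handle.
open my $in, '<', $tmp_name or die "Can't open $tmp_name: $!";
my $from_handle = first_title(HTML::TokeParser->new($in));
close $in;

print "$_\n" for $from_scalar, $from_file, $from_handle;
```

Each of the three calls should report the same title, since only the way the document reaches the parser differs.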
 
P.R.Brady

Great! Thanks Paul.
Phil
 
Michele Dondi

P.R.Brady said:
TokeParser looks a really useful tool for parsing HTML but will it only
take input from a file? Is it possible to get it to munge a web page
directly or even a scalar holding the page content (eg previously

You've already been told that this is in fact possible, so what I'm
about to say is completely OT and possibly misleading, in that you may
be tempted to use this technique where it isn't necessary. So you
stand warned! Anyway, here it comes: if it *were* not possible, you
could always open() an in-memory file, as in:


#!/usr/bin/perl

use strict;
use warnings;

open my $fh, '<', \<<"EOT" or die "Can't open in-memory file: $!";
foo
bar
baz
EOT

print while <$fh>;

__END__
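
To tie this back to the original question, here is a minimal sketch of handing such an in-memory handle to HTML::TokeParser. The page content is a made-up stand-in for whatever get() would return. Note that open() on a scalar reference needs Perl 5.8 or later, so this would not run on the 5.6.1 mentioned at the start of the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical page content; in practice this might come from LWP::Simple's get().
my $page = '<html><head><title>In-memory demo</title></head></html>';

# Open a read-only handle on the scalar (in-memory file, Perl 5.8+).
open my $fh, '<', \$page or die "Can't open in-memory file: $!";

# Pass the handle to the constructor as if it were a real file handle.
my $parser = HTML::TokeParser->new($fh);

my $title;
$title = $parser->get_trimmed_text if $parser->get_tag('title');
print "Title: $title\n" if defined $title;
```

Since the constructor already accepts \$page directly, this detour is only worth considering for modules that insist on a file handle.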


Michele
 
